The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!)

Ever wonder about that mysterious Content-Type tag? You know, the one you're supposed to put in HTML and you never quite know what it should say?

Did you ever get an email from your friends in Bulgaria with the subject line "???? ?????? ??? ????"?

I've been dismayed to discover just how many software developers aren't really completely up to speed on the mysterious world of character sets, encodings, Unicode, all that stuff. A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they "couldn't do anything about it." Like many programmers, he just wished it would all blow over somehow.

But it won't. When I discovered that the popular web development tool PHP has almost complete ignorance of character encoding issues, blithely using 8 bits for characters, making it darn near impossible to develop good international web applications, I thought, enough is enough.

So I have an announcement to make: if you are a programmer working in 2003 and you don't know the basics of characters, character sets, encodings, and Unicode, and I catch you, I'm going to punish you by making you peel onions for six months in a submarine. I swear I will.

And one more thing:

IT'S NOT THAT HARD.

In this article I'll fill you in on exactly what every working programmer should know. All that stuff about "plain text = ascii = characters are 8 bits" is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs. Please do not write another line of code until you finish reading this article.

Before I get started, I should warn you that if you are one of those rare people who knows about internationalization, you are going to find my entire discussion a little bit oversimplified. I'm really just trying to set a minimum bar here so that everyone can understand what's going on and can write code that has a hope of working with text in any language other than the subset of English that doesn't include words with accents. And I should warn you that character handling is only a tiny portion of what it takes to create software that works internationally, but I can only write about one thing at a time, so today it's character sets.

A Historical Perspective

The easiest way to understand this stuff is to go chronologically.

You probably think I'm going to talk about very old character sets like EBCDIC here. Well, I won't. EBCDIC is not relevant to your life. We don't have to go that far back in time.

Back in the semi-olden days, when Unix was being invented and K&R were writing The C Programming Language, everything was very simple. EBCDIC was on its way out. The only characters that mattered were good old unaccented English letters, and we had a code for them called ASCII which was able to represent every character using a number between 32 and 127. Space was 32, the letter "A" was 65, etc. This could conveniently be stored in 7 bits. Most computers in those days were using 8-bit bytes, so not only could you store every possible ASCII character, but you had a whole bit to spare, which, if you were wicked, you could use for your own devious purposes: the dim bulbs at WordStar actually turned on the high bit to indicate the last letter in a word, condemning WordStar to English text only. Codes below 32 were called unprintable and were used for cussing. Just kidding. They were used for control characters, like 7 which made your computer beep and 12 which caused the current page of paper to go flying out of the printer and a new one to be fed in.
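
If you want to see those numbers for yourself, here's a quick sketch in Python:

# ASCII gives every printable character a code between 32 and 127,
# and reserves the codes below 32 for control characters.
print(ord(' '))          # 32 -- space
print(ord('A'))          # 65 -- the letter A
print(repr(chr(7)))      # '\x07' -- BEL, the code that makes the computer beep
print(repr(chr(12)))     # '\x0c' -- form feed, the code that ejects the page
print(max(ord(c) for c in 'Hello') < 128)   # True: plain English fits in 7 bits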

And all was good, assuming you were an English speaker.

Because bytes have room for up to 8 bits, lots of people got to thinking, "gosh, we can use the codes 128-255 for our own purposes." The trouble was, lots of people had this idea at the same time, and they had their own ideas of what should go where in the space from 128 to 255. The IBM-PC had something that came to be known as the OEM character set which provided some accented characters for European languages and a bunch of line drawing characters… horizontal bars, vertical bars, horizontal bars with little dingle-dangles dangling off the right side, etc., and you could use these line drawing characters to make spiffy boxes and lines on the screen, which you can still see running on the 8088 computer at your dry cleaners'. In fact as soon as people started buying PCs outside of America all kinds of different OEM character sets were dreamed up, which all used the top 128 characters for their own purposes. For example on some PCs the character code 130 would display as é, but on computers sold in Israel it was the Hebrew letter Gimel (ג), so when Americans would send their résumés to Israel they would arrive as rגsumגs. In many cases, such as Russian, there were lots of different ideas of what to do with the upper-128 characters, so you couldn't even reliably interchange Russian documents.
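
Here's a small Python sketch of that exact mix-up, using the cp437 (original IBM-PC OEM) and cp862 (Hebrew DOS) codecs that ship with Python:

# One byte, two meanings, depending on which OEM character set you assume.
byte = bytes([130])                    # 0x82
print(byte.decode('cp437'))            # 'é' -- on an American PC
print(byte.decode('cp862'))            # 'ג' -- Gimel, on a PC sold in Israel
print('résumé'.encode('cp437').decode('cp862'))   # 'rגsumג' -- the mangled résumé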

Eventually this OEM free-for-all got codified in the ANSI standard. In the ANSI standard, everybody agreed on what to do below 128, which was pretty much the same as ASCII, but there were lots of different ways to handle the characters from 128 and on up, depending on where you lived. These different systems were called code pages. So for example in Israel DOS used a code page called 862, while Greek users used 737. They were the same below 128 but different from 128 up, where all the funny letters resided. The national versions of MS-DOS had dozens of these code pages, handling everything from English to Icelandic, and they even had a few "multilingual" code pages that could do Esperanto and Galician on the same computer! Wow! But getting, say, Hebrew and Greek on the same computer was a complete impossibility unless you wrote your own custom program that displayed everything using bitmapped graphics, because Hebrew and Greek required different code pages with different interpretations of the high numbers.

Meanwhile, in Asia, even more crazy things were going on to take into account the fact that Asian alphabets have thousands of letters, which were never going to fit into 8 bits. This was usually solved by the messy system called DBCS, the "double byte character set," in which some letters were stored in one byte and others took two. It was easy to move forward in a string, but dang near impossible to move backwards. Programmers were encouraged not to use s++ and s-- to move backwards and forwards, but instead to call functions such as Windows' AnsiNext and AnsiPrev which knew how to deal with the whole mess.
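
To see why walking through such a string is tricky, here's a sketch using Shift-JIS (just one example of a double byte character set) in Python:

# In Shift-JIS, ASCII letters take one byte and kanji take two, so the
# number of bytes no longer matches the number of characters.
s = 'A日本'                        # one ASCII letter plus two kanji
encoded = s.encode('shift_jis')
print(len(s))          # 3 characters
print(len(encoded))    # 5 bytes: 1 + 2 + 2
# Stepping backwards one byte at a time can land you in the middle of a character.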

But still, most people just pretended that a byte was a character and a character was 8 bits, and as long as you never moved a string from one computer to another, or spoke more than one language, it would sort of always work. But of course, as soon as the Internet happened, it became quite commonplace to move strings from one computer to another, and the whole mess came tumbling down. Luckily, Unicode had been invented.

Unicode

Unicode was a brave effort to create a single character set that included every reasonable writing system on the planet and some make-believe ones like Klingon, too. Some people are under the misconception that Unicode is simply a 16-bit code where each character takes 16 bits and therefore there are 65,536 possible characters. This is not, actually, correct. It is the single most common myth about Unicode, so if you thought that, don't feel bad.

In fact, Unicode has a different way of thinking about characters, and you have to understand the Unicode way of thinking of things or nothing will make sense.

Until now, we've assumed that a letter maps to some bits which you can store on disk or in memory:

A -> 0100 0001

In Unicode, a letter maps to something called a code point which is still just a theoretical concept. How that code point is represented in memory or on disk is a whole nuther story.

In Unicode, the letter A is a platonic ideal. It's just floating in heaven:

A

This platonic A is different than B, and different from a, but the same as A and A and A. The idea that A in a Times New Roman font is the same character as the A in a Helvetica font, but different from "a" in lower case, does not seem very controversial, but in some languages just figuring out what a letter is can cause controversy. Is the German letter ß a real letter or just a fancy way of writing ss? If a letter's shape changes at the end of the word, is that a different letter? Hebrew says yes, Arabic says no. Anyway, the smart people at the Unicode consortium have been figuring this out for the last decade or so, accompanied by a great deal of highly political debate, and you don't have to worry about it. They've figured it all out already.

Every platonic letter in every alphabet is assigned a magic number by the Unicode consortium which is written like this: U+0639. This magic number is called a code point. The U+ means "Unicode" and the numbers are hexadecimal. U+0639 is the Arabic letter Ain. The English letter A would be U+0041. You can find them all using the charmap utility on Windows 2000/XP or by visiting the Unicode web site.
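
If you have Python handy instead of charmap, you can poke at code points directly; this is just an illustrative sketch:

import unicodedata

print(hex(ord('A')))           # 0x41  -> U+0041
print(hex(ord('ع')))           # 0x639 -> U+0639
print(unicodedata.name('ع'))   # ARABIC LETTER AIN
print(chr(0x0639))             # 'ع' -- and back the other way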

There is no real limit on the number of letters that Unicode can define, and in fact they have gone beyond 65,536, so not every unicode letter can really be squeezed into two bytes, but that was a myth anyway.

OK, so say we have a string:

Hello

which, in Unicode, corresponds to these five code points:

U+0048 U+0065 U+006C U+006C U+006F.

Just a bunch of code points. Numbers, really. We haven't yet said anything about how to store this in memory or represent it in an email message.
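
You can watch Python compute exactly those numbers, as a one-line sketch:

print([f'U+{ord(c):04X}' for c in 'Hello'])
# ['U+0048', 'U+0065', 'U+006C', 'U+006C', 'U+006F']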

Encodings

That's where encodings come in.

The earliest idea for Unicode encoding, which led to the myth about the two bytes, was, hey, let's just store those numbers in two bytes each. So Hello becomes

00 48 00 65 00 6C 00 6C 00 6F

Right? Not so fast! Couldn't it also be:

48 00 65 00 6C 00 6C 00 6F 00 ?

Well, technically, yes, I do believe it could, and, in fact, early implementors wanted to be able to store their Unicode code points in high-endian or low-endian mode, whichever their particular CPU was fastest at, and lo, it was evening and it was morning and there were already two ways to store Unicode. So the people were forced to come up with the bizarre convention of storing a FE FF at the beginning of every Unicode string; this is called a Unicode Byte Order Mark, and if you are swapping your high and low bytes it will look like a FF FE and the person reading your string will know that they have to swap every other byte. Phew. Not every Unicode string in the wild has a byte order mark at the beginning.
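
Here's the whole endianness story in a few lines of Python, as a sketch:

# The same five code points, two bytes each, in both byte orders.
print('Hello'.encode('utf-16-be').hex(' '))   # 00 48 00 65 00 6c 00 6c 00 6f
print('Hello'.encode('utf-16-le').hex(' '))   # 48 00 65 00 6c 00 6c 00 6f 00
# The generic utf-16 codec prepends a byte order mark so the reader can tell.
print('Hello'.encode('utf-16').hex(' '))      # ff fe 48 00 65 00 ... (on a little-endian machine)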

For a while it seemed like that might be good enough, but programmers were complaining. "Look at all those zeros!" they said, since they were Americans and they were looking at English text which rarely used code points above U+00FF. Also they were liberal hippies in California who wanted to conserve (sneer). If they were Texans they wouldn't have minded guzzling twice the number of bytes. But those Californian wimps couldn't bear the idea of doubling the amount of storage it took for strings, and anyway, there were already all these doggone documents out there using various ANSI and DBCS character sets and who's going to convert them all? Moi? For this reason alone most people decided to ignore Unicode for several years and in the meantime things got worse.

Thus was invented the brilliant concept of UTF-8. UTF-8 was another system for storing your string of Unicode code points, those magic U+ numbers, in memory using 8-bit bytes. In UTF-8, every code point from 0-127 is stored in a single byte. Only code points 128 and above are stored using 2, 3, in fact, up to 6 bytes.

How UTF-8 works

This has the neat side effect that English text looks exactly the same in UTF-8 as it did in ASCII, so Americans don't even notice anything wrong. Only the rest of the world has to jump through hoops. Specifically, Hello, which was U+0048 U+0065 U+006C U+006C U+006F, will be stored as 48 65 6C 6C 6F, which, behold! is the same as it was stored in ASCII, and ANSI, and every OEM character set on the planet. Now, if you are so bold as to use accented letters or Greek letters or Klingon letters, you'll have to use several bytes to store a single code point, but the Americans will never notice. (UTF-8 also has the nice property that ignorant old string-processing code that wants to use a single 0 byte as the null-terminator will not truncate strings.)
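
You can check both claims in a couple of lines of Python; just a sketch:

print('Hello'.encode('utf-8').hex(' '))   # 48 65 6c 6c 6f -- identical to ASCII
print('é'.encode('utf-8').hex(' '))       # c3 a9    -- two bytes
print('ع'.encode('utf-8').hex(' '))       # d8 b9    -- two bytes
print('日'.encode('utf-8').hex(' '))      # e6 97 a5 -- three bytes
# No byte in these encodings is 00, so null-terminated string code survives.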

So far I've told you three ways of encoding Unicode. The traditional store-it-in-two-bytes methods are called UCS-2 (because it has two bytes) or UTF-16 (because it has 16 bits), and you still have to figure out if it's high-endian UCS-2 or low-endian UCS-2. And there's the popular new UTF-8 standard which has the nice property of also working respectably if you have the happy coincidence of English text and braindead programs that are completely unaware that there is anything other than ASCII.

There are actually a bunch of other ways of encoding Unicode. There's something called UTF-7, which is a lot like UTF-8 but guarantees that the high bit will always be zero, so that if you have to pass Unicode through some kind of draconian police-state email system that thinks 7 bits are quite enough, thank you, it can still squeeze through unscathed. There's UCS-4, which stores each code point in 4 bytes, which has the nice property that every single code point can be stored in the same number of bytes, but, golly, even the Texans wouldn't be so bold as to waste that much memory.
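
Both are easy to demonstrate; a quick Python sketch (Python spells UCS-4 as "utf-32"):

print('€'.encode('utf-7'))                          # b'+IKw-' -- every byte stays below 128
print(all(b < 128 for b in '€'.encode('utf-7')))    # True
print('Hello'.encode('utf-32-be').hex(' '))
# 00 00 00 48 00 00 00 65 00 00 00 6c 00 00 00 6c 00 00 00 6f -- four bytes per code point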

And in fact now that you're thinking of things in terms of platonic ideal letters which are represented by Unicode code points, those unicode code points can be encoded in any old-school encoding scheme, too! For example, you could encode the Unicode string for Hello (U+0048 U+0065 U+006C U+006C U+006F) in ASCII, or the old OEM Greek encoding, or the Hebrew ANSI encoding, or any of several hundred encodings that have been invented so far, with one catch: some of the letters might not show up! If there's no equivalent for the Unicode code point you're trying to represent in the encoding you're trying to represent it in, you usually get a little question mark: ? or, if you're really good, a box. Which did you get? -> �

There are hundreds of traditional encodings which can only store some code points correctly and change all the other code points into question marks. Some popular encodings of English text are Windows-1252 (the Windows 9x standard for Western European languages) and ISO-8859-1, aka Latin-1 (also useful for any Western European language). But try to store Russian or Hebrew letters in these encodings and you get a bunch of question marks. UTF 7, 8, 16, and 32 all have the nice property of being able to store any code point correctly.
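
A sketch of that question-mark degradation in Python:

# Legacy encodings only know about their own little repertoire of code points.
print('Hello'.encode('iso-8859-1'))                       # b'Hello' -- fine
print('Привет'.encode('iso-8859-1', errors='replace'))    # b'??????' -- Russian doesn't fit
print('שלום'.encode('windows-1252', errors='replace'))    # b'????' -- neither does Hebrew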

The Single Most Important Fact About Encodings

If you completely forget everything I just explained, please remember one extremely important fact. It does not make sense to have a string without knowing what encoding it uses. You can no longer stick your head in the sand and pretend that "plain" text is ASCII.

There Ain't No Such Thing As Plain Text.

If you have a string, in memory, in a file, or in an email message, you have to know what encoding it is in or you cannot interpret it or display it to users correctly.

Almost every stupid "my website looks like gibberish" or "she can't read my emails when I use accents" problem comes down to one naive programmer who didn't understand the simple fact that if you don't tell me whether a particular string is encoded using UTF-8 or ASCII or ISO 8859-1 (Latin 1) or Windows 1252 (Western European), you simply cannot display it correctly or even figure out where it ends. There are over a hundred encodings, and above code point 127, all bets are off.
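
One way to convince yourself of this, sketched in Python:

# The same bytes, read two different ways. Nothing in the bytes themselves
# says which interpretation was intended.
data = 'Привет'.encode('utf-8')     # Russian "hello", encoded as UTF-8
print(data.decode('utf-8'))         # Привет -- correct
print(data.decode('cp1251'))        # gibberish: each UTF-8 byte read as a Cyrillic character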

How do we preserve this information about what encoding a string uses? Well, there are standard ways to do this. For an email message, you are expected to have a string in the header of the form

Content-Type: text/plain; charset="UTF-8"

For a web page, the original idea was that the web server would return a similar Content-Type http header along with the web page itself: not in the HTML itself, but as one of the response headers that are sent before the HTML page.

This causes problems. Suppose you have a big web server with lots of sites and hundreds of pages contributed by lots of people in lots of different languages, all using whatever encoding their copy of Microsoft FrontPage saw fit to generate. The web server itself wouldn't really know what encoding each file was written in, so it couldn't send the Content-Type header.

It would be convenient if you could put the Content-Type of the HTML file right in the HTML file itself, using some kind of special tag. Of course this drove purists crazy… how can you read the HTML file until you know what encoding it's in?! Luckily, almost every encoding in common use does the same thing with characters between 32 and 127, so you can always get this far on the HTML page without starting to use funny letters:

<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

But that meta tag really has to be the very first thing in the <head> section, because as soon as the web browser sees this tag it's going to stop parsing the page and start over after reinterpreting the whole page using the encoding you specified.

What do web browsers do if they don't find any Content-Type, either in the http headers or the meta tag? Internet Explorer actually does something quite interesting: it tries to guess, based on the frequency in which various bytes appear in typical text in typical encodings of various languages, what language and encoding was used. Because the various old 8-bit code pages tended to put their national letters in different ranges between 128 and 255, and because every human language has a different characteristic histogram of letter usage, this actually has a chance of working. It's truly weird, but it does seem to work often enough that naïve web-page writers who never knew they needed a Content-Type header look at their page in a web browser and it looks ok, until one day, they write something that doesn't exactly conform to the letter-frequency distribution of their native language, and Internet Explorer decides it's Korean and displays it thusly, proving, I think, the point that Postel's Law about being "conservative in what you emit and liberal in what you accept" is quite frankly not a good engineering principle. Anyway, what does the poor reader of this website, which was written in Bulgarian but appears to be Korean (and not even coherent Korean), do? He uses the View | Encoding menu and tries a bunch of different encodings (there are at least a dozen for Eastern European languages) until the picture comes in clearer. If he knew to do that, which most people don't.
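
If you want to play with the same guessing trick, third-party libraries such as chardet do roughly what Internet Explorer does, guessing from byte statistics. This sketch assumes chardet is installed, and the exact guess and confidence will vary:

import chardet   # pip install chardet

data = 'Здравей, свят!'.encode('cp1251')   # Bulgarian text in a Cyrillic code page
print(chardet.detect(data))
# something like {'encoding': 'windows-1251', 'confidence': 0.9, 'language': 'Bulgarian'}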

For the latest version of CityDesk, the web site management software published by my company, we decided to do everything internally in UCS-2 (two byte) Unicode, which is what Visual Basic, COM, and Windows NT/2000/XP use as their native string type. In C++ code we just declare strings as wchar_t ("wide char") instead of char and use the wcs functions instead of the str functions (for example wcscat and wcslen instead of strcat and strlen). To create a literal UCS-2 string in C code you just put an L before it like so: L"Hello".

When CityDesk publishes the web page, it converts it to UTF-8 encoding, which has been well supported by web browsers for many years. That's the way all 29 language versions of Joel on Software are encoded, and I have not yet heard from a single person who has had any problem viewing them.

This article is getting rather long, and I can't possibly cover everything there is to know about character encodings and Unicode, but I hope that if you've read this far, you know enough to go back to programming, using antibiotics instead of leeches and spells, a task to which I will leave you now.


Source: https://www.joelonsoftware.com/2003/10/08/the-absolute-minimum-every-software-developer-absolutely-positively-must-know-about-unicode-and-character-sets-no-excuses/
