
Unicode

For many years programmers and strings existed in a state of balance – or as much balance as was possible under the constant threat of undefined behaviour from sociopathic string functions. It was not to last, however, for their lives of relative stability were forever changed by the advent of Other Languages in string buffers. In this part of our continuing study of strings we delve into the intriguing world of internationalisation, multi-byte character sets and Unicode.

To understand the ascent of Unicode we must go back to its distant ancestor, ASCII. ASCII defined a 7 bit code for encoding characters: each character had a number from 0 to 127 and a corresponding glyph. The capital letter “A” had the ASCII number 65 (or 0x41 if you’re a l33t hex haxor), “B” was 66 and so on. Characters in the lower range of the ASCII table, 0 to 31, were reserved for special control characters like 0, our friendly Null Character. The range 32 through 64 held the space, punctuation and the digits.
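
If you want to see this for yourself, a trivial snippet will print the number hiding behind a letter (a sketch that assumes an ASCII based execution character set – practically universal, though not strictly guaranteed by the C++ standard):

#include <cstdio>

int main() {
    // On an ASCII based system the character 'A' and the number 65
    // are one and the same.
    char letter = 'A';
    std::printf("%c = %d (0x%X)\n",
                letter, letter, static_cast<unsigned>(letter));  // A = 65 (0x41)
}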

It was a simple system that worked excellently for English and other languages that get by with the unadorned Latin alphabet, but if you wanted to write in a language that had accents or characters that didn’t exist in ASCII you were out of luck. This outraged Frenchmen, whose Parisian Cafés were just not the same without that acute e (é). Fortunately another bloody revolution was averted just in time when people realised they could pack a few more characters into ASCII.

Most computers of the time used 8 bits as the length of their byte – a byte doesn’t really have to be 8 bits, it’s simply the number of bits needed to encode a single character on a given architecture. This meant that when a single character of ASCII text was encoded, one bit – the most significant bit – remained empty. The boffins at IBM realised this when they had to release the IBM PC into different markets, so they extended the ASCII character set with another 128 special characters by using that spare bit to store one more bit of information, thereby doubling the possible characters on their systems (2^7 = 128, 2^8 = 256). Everything from 0 to 127 was standard ASCII and everything from 128 to 255 was extended ASCII, used to store special characters like that acute e for café. But wait, you say, what if IBM sold computers to the Greeks as well (for cash they borrowed, of course)? What would they be doing with cafés? They would want α and Ω and everything in between instead. So IBM devised a cunning plan. They shipped each computer with a set of possible character sets, known as code pages. The lower half (0 to 127) always contained standard ASCII while the upper half contained characters specific to that region, so the French had their cafés, the Greeks had their Ωs and the world at large returned to peace.

There was a problem though; you could not use two regional languages at the same time on the same machine. In the early days you actually had to replace the ROM that contained the character map to change the text encoding from one region to another. This was fine as long as there was no real interchange of documents between regions, but the Internet changed all that. So a German describing his bank balance as “Über” in a text file sent to a Greek over the Internet would have that same text appear as “³ber” to the Greek, because Ü on the Western European code page (Code Page 850) is character number 154, and character 154 on the Greek code page (Code Page 869) is “³”.

Byte value      154   98    101   114
Code Page 850   Ü     b     e     r
Code Page 869   ³     b     e     r
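
If you’re on Windows you can watch the mix-up happen. The sketch below (Windows only, and assuming your console font can display the results) feeds the same four bytes through two different code pages using the Win32 MultiByteToWideChar function:

#include <windows.h>
#include <cstdio>

int main() {
    // Byte 154 (0x9A) followed by plain ASCII "ber".
    const char bytes[] = "\x9A" "ber";
    wchar_t text[8] = {};

    // Interpreted with the Western European code page...
    MultiByteToWideChar(850, 0, bytes, -1, text, 8);
    std::wprintf(L"CP850: %ls\n", text);

    // ...and with the Greek code page - same bytes, different first character.
    MultiByteToWideChar(869, 0, bytes, -1, text, 8);
    std::wprintf(L"CP869: %ls\n", text);
}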


Yet these languages did not pose too much of a challenge to the programmer, because all of them kept the basic ASCII set of characters (0 to 127) and code page switching was left to the operating system in most cases. If you used just the lower 128 English characters then you didn’t really have to think about the upper set, and if you used the upper set of characters then you left switching the code pages to the OS. Another boon was the fact that the string functions used with lower ASCII strings worked just as well with the higher ones, because each character was still exactly one byte long. This meant that in practice you did not have to adopt a new set of functions for basic string operations like measuring the length of a string – strlen could still do that for you. Things hadn’t got too far out of hand.

Then along came the Asian languages, which had more characters than could be held inside an 8 bit byte, and there was much weeping and gnashing of teeth. The solution was to encode each character with one or two bytes depending on the character in question – a scheme called multi-byte character encoding. This solution led to more weeping and more gnashing of teeth, for now the basic assumption that had worked for strings till this point had been lost. A string was no longer a list of discrete characters; it was a stream of data where each byte may or may not represent a whole character on its own. You couldn’t just do a “someString[2]” on a multi-byte string to get the 3rd character in its buffer – you had to call special functions. You could not just iterate over a string’s characters with i++, and you could no longer use the old string functions like strlen to get the length of a string – you had to use new ones like _mbstrlen on Windows. It was, as they say, a Royal Mess.
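
Here’s the mess in miniature – a sketch that uses UTF-8 as the multi-byte encoding (simply because it’s the one you’re most likely to have installed) and assumes a UTF-8 locale is available under the name below, which varies between systems:

#include <clocale>
#include <cstdio>
#include <cstdlib>
#include <cstring>

int main() {
    // Assumes a UTF-8 locale and that this source file is saved as UTF-8.
    std::setlocale(LC_ALL, "en_US.UTF-8");

    const char* text = "Über";   // 4 characters, 5 bytes once encoded

    // strlen counts bytes, not characters.
    std::printf("strlen:   %zu\n", std::strlen(text));              // 5

    // Converting to wide characters counts actual characters.
    wchar_t wide[16];
    std::printf("mbstowcs: %zu\n", std::mbstowcs(wide, text, 16));  // 4
}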

Perhaps I exaggerate… it was still possible to write multilingual code, but you had to take care never to use functions that only worked in a single byte character setting. Fortunately a Solution was on the Horizon.


The One Ring Code

Uni Code to define them all, Many formats to write them
Uni Code to bring them all and in the darkness bind them

Unicode defines a unique number for every character that can be written. It doesn’t define how that number will be encoded, just that a given character has a number. This number is called the code point of that character. The code point for the capital letter A is unique to it – it’s not equal to lowercase “a” and not equal to any other character – and it has the value U+0041 (which is hex for the decimal number 65). This value is independent of the font that A is displayed in and also independent of the encoding that will be used to store the number. The latter point is important in any discussion of Unicode: the character set is independent of any particular encoding scheme. This means that Unicode has room for an enormous number of characters – over a million possible code points – and another character can be added to the standard without worrying about what its encoding will look like.
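
In C++ (C++11 onwards) a code point fits comfortably into a char32_t, so a quick sketch of the idea might look like this:

#include <cstdio>

int main() {
    // A char32_t holds a Unicode code point directly.
    char32_t capital_a = U'A';       // U+0041
    char32_t small_a   = U'a';       // U+0061 - a different code point entirely
    char32_t u_umlaut  = U'\u00DC';  // U+00DC, the Ü from before
    std::printf("U+%04X U+%04X U+%04X\n",
                static_cast<unsigned>(capital_a),
                static_cast<unsigned>(small_a),
                static_cast<unsigned>(u_umlaut));   // U+0041 U+0061 U+00DC
}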

What, then, will the encoding look like? Every time you hear about something being in UTF-8 or UTF-16, that is the encoding a given string of Unicode text is in. UTF stands for Unicode Transformation Format, which basically means it describes what the stream of Unicode characters will look like once it’s in a byte array. UTF-8 encodes each code point in a variable number of bytes depending on how large the code point number is; the 8 at the end means that a single sub unit of the format is 8 bits wide (1 byte). A single Unicode character may take up to 4 bytes (4 UTF-8 sub units – the original design allowed up to 6, but Unicode restricts code points to a range that needs at most 4). UTF-16 is also a variable length encoding scheme but it uses 16 bit numbers as sub units instead of 8 bit ones. This means that larger code points can come out shorter in UTF-16 than in UTF-8, but small ones just waste space because many of the bits carry no information. For example, take the text “Über αΩ” – partly German and partly Greek – something the code page based methods mentioned earlier could not encode together.

Text         Ü       b       e       r       [space]  α       Ω
Code point   U+00DC  U+0062  U+0065  U+0072  U+0020   U+03B1  U+03A9
UTF-8        C3 9C   62      65      72      20       CE B1   CE A9
UTF-16       00DC    0062    0065    0072    0020     03B1    03A9
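
You can reproduce that table with a few lines of C++. A sketch – it assumes the source file is saved as UTF-8, and that you’re compiling as C++11 through C++17, where u8 literals are plain char arrays (under C++20 they become char8_t, though the byte values are identical):

#include <cstdio>

int main() {
    // UTF-8: one or two bytes per character for this text.
    const auto& utf8 = u8"Über αΩ";
    for (int i = 0; utf8[i] != 0; ++i)
        std::printf("%02X ", static_cast<unsigned>(static_cast<unsigned char>(utf8[i])));
    std::printf("\n");   // C3 9C 62 65 72 20 CE B1 CE A9

    // UTF-16: one 16 bit unit per character for this text.
    const char16_t* utf16 = u"Über αΩ";
    for (int i = 0; utf16[i] != 0; ++i)
        std::printf("%04X ", static_cast<unsigned>(utf16[i]));
    std::printf("\n");   // 00DC 0062 0065 0072 0020 03B1 03A9
}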


There are a couple of points of interest here. First, notice that the UTF-8 and UTF-16 encodings of the same set of Unicode characters are different. This means that when writing code it’s not possible to compare two buffers of bytes unless they are in the same Unicode encoding. Fortunately, most of the time you can stick to one encoding inside a single program, but it’s important to ask yourself what the characters are encoded in whenever you get text from the outside. Next, notice that the UTF-16 encoding wastes a bit of space because the high order byte (more about that later) is 00 for most of these characters. This is the downside of using a wider Unicode format to encode mostly Latin text. Sic Vita Est. Finally, notice that the characters “b”, “e” and “r” have the same byte values in UTF-8 as their ASCII counterparts (hex 62, 65 and 72). This is due to an advantageous property of UTF-8: it encodes the first 128 ASCII characters exactly as ASCII does. This means that UTF-8 text can be fed into a program that believes it’s ASCII, as long as only the lower 128 characters are used. Oh Happy Day! Now let us return to the high order byte mentioned before – but first we need to talk about Endians.
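
Before we do, here’s a tiny demonstration of that ASCII compatibility (memcmp doesn’t care whether the u8 literal is a plain char array or C++20’s char8_t):

#include <cstdio>
#include <cstring>

int main() {
    const char  ascii[] = "ber";    // plain ASCII: hex 62 65 72
    const auto& utf8    = u8"ber";  // UTF-8 encodes these bytes identically
    if (std::memcmp(ascii, utf8, sizeof ascii) == 0)
        std::printf("same bytes\n");
}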

Those Endians

There are 2 types of Endians. Red and Brown. I jest I jest, or maybe not.

The endianness of a CPU architecture, for all intents and purposes related to our topic, is the ordering of the bytes of a single number. A number between 0 and 255 fits into a single 8 bit byte, so the endianness of the architecture doesn’t matter. For larger numbers we need more bytes of storage: with 2 bytes you can squeeze in every number from 0 to 65535. The problem, however, is how you order those 2 bytes. The way we humans (some humans at least) do it is to put the largest (most significant) digit leftmost and the smallest (least significant) rightmost, so the number 1000 is written with the highest power of ten to the left. The hexadecimal representation of 1000, 3E8, requires 2 bytes of storage because each hexadecimal digit requires 4 bits. There are two ways to order those 2 bytes, and that order defines the endianness of the architecture.

03 E8 – big endian
E8 03 – little endian

Big endian architectures store numbers just like the human representation: the lowest address (or leftmost byte, as we like to think of it) contains the highest order part of the number, while later addresses (the further you go to the right) contain successively lower order parts. Little endian reverses this scheme, so that the lowest address (the leftmost byte) contains the least significant byte while the highest address (the rightmost byte) contains the most significant byte. The x86 architecture, which includes most flavours of processors from Intel and AMD, is little endian. But let’s get back to Unicode.
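
A quick way to see which kind of machine you’re sitting at is to pour a 16 bit number into raw bytes and look at them – a sketch, nothing more:

#include <cstdint>
#include <cstdio>
#include <cstring>

int main() {
    std::uint16_t value = 1000;                // 0x03E8
    unsigned char bytes[sizeof value];
    std::memcpy(bytes, &value, sizeof value);  // copy out the in-memory byte order
    std::printf("%02X %02X\n", bytes[0], bytes[1]);
    // Prints "03 E8" on a big endian machine and "E8 03" on a little
    // endian one such as x86.
}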

Multibyte Unicode Encodings and Endianness

The same UTF-16 string can be represented as two distinct byte sequences depending on the byte order.

Text            Ü     b     e     r     [space]  α     Ω
Big endian      00DC  0062  0065  0072  0020     03B1  03A9
Little endian   DC00  6200  6500  7200  2000     B103  A903
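
Converting between the two is just a matter of swapping each pair of bytes. A minimal sketch (the function name is mine, not a standard API):

#include <cstddef>
#include <cstdint>

// Swap the two bytes of every 16 bit code unit in place, turning
// UTF-16LE into UTF-16BE or vice versa. Assumes the buffer holds whole
// code units; surrogate pairs need no special handling because each
// half is swapped independently.
void swap_utf16_byte_order(char16_t* units, std::size_t count) {
    for (std::size_t i = 0; i < count; ++i) {
        std::uint16_t u = units[i];
        units[i] = static_cast<char16_t>((u >> 8) | (u << 8));
    }
}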


Hence it becomes important to distinguish the two when moving text between different computer systems. This is done with something called a Byte Order Mark (BOM) – 2 special bytes at the beginning of the string (or text file, if you’re looking at the contents of a file instead of a buffer) that state its endianness. If the text is big endian the byte order mark FE FF will be present; if the text is little endian its reverse, FF FE, is present.

The program reading the stream has to take the byte order mark into account in order to interpret the string correctly. There is a hitch though: the byte order mark is a convention and not a rule. This means that in some text it may not be present, but we can still infer the endianness of the character buffer using a few tricks – tricks that are out of the scope of this discussion in any case. The byte order mark has no real significance in a UTF-8 stream except to state that it is a UTF-8 stream and not an ASCII one. In UTF-8 text the byte order mark takes the form “EF BB BF” and may or may not be present in the stream. Some text editors let you choose whether to add the BOM, but that’s up to the specific editor. In general though, UTF-8 text can be sniffed out just like its UTF-16 counterpart, and the BOM’s presence or absence isn’t really much of a problem for the enlightened programmer.
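
Sniffing the byte order mark at the start of a buffer is straightforward. A rough sketch – the enum and function names are made up for illustration:

#include <cstddef>

enum class Encoding { Unknown, Utf8, Utf16BigEndian, Utf16LittleEndian };

// Look at the first few bytes of a buffer and report the BOM, if any.
// No BOM doesn't mean no Unicode - it just means the encoding has to be
// inferred some other way (or taken from a higher level protocol).
Encoding detect_bom(const unsigned char* data, std::size_t size) {
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF)
        return Encoding::Utf8;
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF)
        return Encoding::Utf16BigEndian;
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE)
        return Encoding::Utf16LittleEndian;
    return Encoding::Unknown;
}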

On the other hand, many higher level protocols like HTML define the character encoding themselves, so programs that read documents written in such protocols can simply use the one declared there. HTML does it like so:

<meta http-equiv="Content-Type"
      content="text/html; charset=utf-8">

There you have it: Unicode in a nutshell, more or less – perhaps less than more, but enough to know that when you see a bit of text the first thing you have to ask yourself is “What encoding is it in?” Because without the encoding all text is just a stream of bytes; it makes as much sense as the random bytes generated by a typing monkey. I go have banana now.

This is a post in the C++ Strings series.