Unicode on Linux

At first glance the place of Unicode in Linux looks simple. Deceptively so. C and C++ on Linux support both char and wchar_t. Let’s take a look at the char version first:

#include <cstring>
#include <iostream>
 
int main(int argc, char* argv[])
{
    const char text[] = "Άλφα";
    std::cout << text << std::endl;
    std::cout << "sizeof(char): " << sizeof(char) << std::endl;
    std::cout << "sizeof(text): " << sizeof(text) << std::endl;
}

This results in the output:

Άλφα
sizeof(char): 1
sizeof(text): 9

Notice that the console prints the characters correctly and that the size of a char is 1 byte. The text buffer, on the other hand, is 9 bytes, even though it manifestly contains only 4 characters (+1 for the null terminator). The reason is that each of these Greek letters is encoded as two bytes (two UTF-8 code units), so 4 × 2 + 1 = 9. This is important. Any char buffer on Linux can carry Unicode text, because GCC by default encodes string literals as UTF-8. On the downside, some of the old string handling functions that are not UTF-8 aware now return bovine excrement if you expect character counts from them: strlen, for example, reports 8 for this string because it counts bytes, not characters. Such functions require replacement, but we will return to that.

For now here’s the same code with wchar_t used instead:

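#include <clocale>   // std::setlocale
#include <iostream>
 
int main(int argc, char* argv[])
{
    // Adopt the user's locale from the environment so wcout can convert wide characters.
    std::setlocale(LC_ALL, "");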
    const wchar_t text[] = L"Άλφα";
    std::wcout << text << std::endl;
    std::wcout << L"sizeof(wchar_t): " << sizeof(wchar_t) << std::endl;
    std::wcout << L"sizeof(text): " << sizeof(text) << std::endl;
}

With the corresponding output:

Άλφα
sizeof(wchar_t): 4
sizeof(text): 20

While a char takes up a single byte, a wchar_t takes up 4 bytes on Linux, in contrast to the 2 bytes it takes up on Windows NT. That is why the five-element buffer above (four letters plus the null terminator) occupies 20 bytes. This goes to show that the width of wchar_t is a property of the compiler and platform rather than of any one Unicode encoding. The distinction is important, because the C and C++ standards don’t describe wchar_t beyond it being a type for “wide” characters, and each implementation of the compiler can choose how wide “wide” is.

Notice that all the text literals have been prefixed with “L”, which tells the compiler to treat them as wide-character (wchar_t) literals. Also notice the call to setlocale, which sets the program’s locale from the environment; wcout needs this to convert wide characters into the byte encoding the terminal expects. It should be noted that you shouldn’t mix std::cout and std::wcout in the same program: both end up writing to the C stream stdout, which becomes either byte-oriented or wide-oriented at its first output and stays that way. The finer points are best left to cloistered C++ monks with enough time to meditate on such things.

And that’s it! Simple, yes?

Well… no, not really. Things get a little hairy when you bring in the plethora of windowing toolkits available. GTK+, for example, uses UTF-8 encoding internally, while Qt prefers UTF-16 (because it works easily with Windows and Mac OS X, whose native APIs use UTF-16). On top of that, if you want to write a cross-platform application that uses Unicode to show multiple languages at once, then you either have to stick with a cross-platform toolkit and hope it handles the pain, or do some clever trickery of your own.

This is a post in the C++ Strings series.