
The Private Lives of Strings

Ah, Strings. Surprisingly simple for most C# and Java programmers, yet so complex for those of us left behind in the world of C and C++, Strings are perhaps one of the strangest beasts in the programming landscape. They began so simple, perhaps deceptively so, like a chameleon on a plain coloured leaf waiting to deceive the observer with its next background. Yet I have tracked this beast down through the ages, yea, verily I have identified its habits and its lair, and here I present to you in true Attenboroughish fashion the Private Life of Strings. To understand the habitat of strings one must go back to their distant ancestors: the typewriters.

These mechanical beasts ruled the days before their electronic kin were brought into being. In the beginning they were purely mechanical in form, employing the energies of the user to move the type bars that impress the type upon paper. Sometime afterwards electric typewriters and teletypewriters (which write out messages for transmission to distant terminals) were made. The key difference of this subspecies was that the bars of the typewriter were no longer driven by the fingers of the user.

Nay, these were driven by an arcane power called Electricity, wherein wires took the input from the user (duly converted into “electrical signals”) and moved these impulses into the arms that impress the type. This was a boon to those dainty fingered typists and caused Mr. Sherlock Holmes much confusion, for their velvety fingers were no longer gnarled by the use of those terrible Victorian machines, [1] but I digress.

The thing that happens whenever you decouple one part from another is that an intermediate form becomes a necessity. Sets of character encodings were devised for the mechanical parts of the typewriter to interact with the electrical parts, and these encodings eventually begat the unified ASCII standard of character encoding. ASCII defined a 7-bit code for denoting all the characters that can be typed. This meant that it was possible to address 2^7, or 128, characters. In the early days of electrical typewriters this was sufficient; after all, English can be written with 26 upper and 26 lower case letters and some punctuation.
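The arithmetic is easy to check in code. What follows is only a small sketch; the helper names code_count and is_seven_bit_ascii are made up for illustration, and the first one assumes that unsigned is at least 32 bits wide.

```cpp
#include <cassert>
#include <string>

// Number of distinct codes an n-bit character set can address: 2^n.
// Assumption: unsigned is at least 32 bits wide.
unsigned code_count(unsigned bits) {
    assert(bits < 32);  // keep the shift well defined
    return 1u << bits;
}

// True when every byte of `text` fits in 7-bit ASCII (values 0 to 127).
bool is_seven_bit_ascii(const std::string& text) {
    for (unsigned char c : text) {
        if (c > 127) return false;
    }
    return true;
}
```

Calling code_count(7) gives the 128 codes of ASCII, and is_seven_bit_ascii("Discworld") holds because plain English text never needs the eighth bit.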

When computers with video displays became popular they adopted ASCII as both their internal storage format and as the character set that was displayed on screen, but it didn’t stop there. The emptiness of video displays meant that sometimes it was useful to have lines, borders and other characters drawn on screen to make the information look nicer. This caused the ASCII character set to grow organically, adding 1 more bit to itself and 128 more characters (all “special” characters for drawing stuff with). Extended ASCII now needed 8 bits to stow each character, which is 1 byte on most architectures. Now if each character is 8 bits, or 1 byte, long, then for the computer programmer a “string” of such characters is n bytes long, with each byte in the string being a character encoded in ASCII.

Enter The C String

The C string is possibly the grandfather of all string implementations. A C string is an array of characters encoded in ASCII, terminated by a null character. Here’s a definition:

 char discworld [16] = "Discworld";

The line above defines an array of type char and length 16; however, it can hold 15 characters at most, as the last element is (meant to be) reserved for the null terminator ('\0'). In a C string the length of the char array that holds the characters must be at least one character larger than the string contained therein. It could be much larger if required. Let’s take two examples:

 
char discworld [16] = "Discworld";
char vetinari [] = "Vetinari"; 

cout << "[" << discworld << "], " << "String Length: " 
     << strlen(discworld) << ", Array Size:" 
     << sizeof(discworld) << endl; 
     
cout << "[" << vetinari << "], " << "String Length: " 
     << strlen(vetinari) << ", Array Size:" 
     << sizeof(vetinari) << endl; 
 

Here’s the result:
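For the two arrays above, the snippet should print [Discworld], String Length: 9, Array Size:16 followed by [Vetinari], String Length: 8, Array Size:9. As a rough, self-contained sketch of the same comparison, the hypothetical helper describe below (not part of any library) builds that line for any char array:

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <sstream>
#include <string>

// Builds the same line the snippet above prints for one char array.
// N is deduced from the array type, so for a char array it equals sizeof(text).
template <std::size_t N>
std::string describe(const char (&text)[N]) {
    // The array must hold a terminator somewhere, or strlen would run off the end.
    assert(std::memchr(text, '\0', N) != nullptr);
    std::ostringstream out;
    out << "[" << text << "], String Length: " << std::strlen(text)
        << ", Array Size:" << N;
    return out.str();
}
```

Passing discworld or vetinari to describe and streaming the result to cout reproduces the two lines of output.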

Notice the difference in the length of the string returned by the strlen and the size of the array. The string is 9 characters long but in fact takes 10 characters of storage with the null terminator. In fact we can take it for granted that a C string ends at the first null terminator character, for example if we do the following:

 // Edit the string's 4th character
 vetinari[3] = '\0';

We get the following output:

Note that C string indexes start at 0, so vetinari[3] is the 4th character in the string. The string has been cut off after its 3rd character. Here’s what the memory looks like:

Notice the underlined hex character and the corresponding space in the text of what was until now the name of the Tyrant of Ankh-Morpork? That’s the helpful null terminator, and the string now looks like a string of length 3 to strlen, but the array is still 9 chars long.
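The same experiment can be written as a small sketch. The helper names truncate_at and hidden_bytes are invented for this illustration; the point is only that the terminator changes what strlen sees, not what the array holds.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Cuts a C string short by writing a null terminator at index `at`.
template <std::size_t N>
void truncate_at(char (&text)[N], std::size_t at) {
    assert(at < N);  // never write outside the array
    text[at] = '\0';
}

// Bytes of the array that strlen no longer sees: everything after the
// first terminator.
template <std::size_t N>
std::size_t hidden_bytes(const char (&text)[N]) {
    return N - std::strlen(text) - 1;
}
```

After truncate_at(vetinari, 3) the string reads "Vet", yet the letters n, a, r, i and the original terminator are all still sitting in the array.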

Detour – Null through the Ages

Dionysius Exiguus, or Dennis the Short, was a 6th-century monk who devised a method of calculating Easter more accurately than what was known to the Catholic Church at that time. He is perhaps best known as the inventor of the AD (Anno Domini) method of dating, now called the Common Era. The calculation of Easter requires the calculation of the difference between days as denoted by the lunar cycle and days as denoted by the solar one. This difference is known as the “age of the moon”. The age of the moon resets every 19 years so that on a given day each 19 years the age of the moon is zero, or nothing, which in Latin is “Nulla”.

Dionysius was perhaps the first western writer to coin the term for the idea of zero, which is what null for the computer programmer really is: a mystical name for the numerical value zero. Dionysius calculated the dates for Easter manually, a process known as Computus to this day, based on the same Latin stem from which Computer is derived. We stand on the shoulders of giants, do we not? Well, even short ones in Dennis’s case.

The Null Terminator

The null terminator might well have been one of the Great Mistakes of computing. In the early days of computing, when memory was tight, it was preferable to the other method of defining the length of a string, which was to prepend the length to it: the so-called “Pascal string”. That method had the additional complexity of needing to define just how many bytes were used to denote the length, and ultimately the C string way of doing things piggybacked on the popularity of C as a language into the mainstream. Let’s look at an example:

 
 // Assign a new string 
 // vetinari = "Patrician"; // No can do... 
 
 // Copy the string
 strcpy(vetinari, 
        "'Taxation, gentlemen, is very much like dairy "
        "farming. The task is to extract the maximum "
        "amount of milk with the minimum of moo.'"); 
        
 cout << "[" << discworld << "], "
         << "String Length: " << strlen(discworld) 
         << ", Array Size:" << sizeof(discworld)
         << endl;
         
 cout << "[" << vetinari << "], " 
      << "String Length: " << strlen(vetinari) 
      << ", Array Size:" << sizeof(vetinari) << endl;
         

To start with you can’t assign into a C string outside its definition, so

 
 vetinari = "Patrician"; // No can do...
  

will generate a compile error. The way to do this is to use the strcpy function. Here’s the output:

Ok, all is not well here. The variable discworld has some of the contents of the variable vetinari, yet we only changed the variable vetinari. You’re probably starting to guess the problem. Let’s take a look at the memory just before the call to strcpy.

Everything looks in order. Notice that the stack grows from a higher memory address to a lower one: discworld is defined first, so it sits at a higher address than vetinari, which is defined second. And then just after the call to strcpy we have:

We see the problem more clearly now. The call to strcpy

 strcpy(vetinari, "'Taxation, gentlemen...'");

has written the entire string to the variable vetinari, dutifully including a null terminator, without checking the bounds of the array, which can only hold 9 characters (8 without the null terminator). This means that the variable vetinari is now corrupted, and one or more variables that have the misfortune of being above that variable in the memory space may have had their contents changed, as is the case with the variable discworld.

Visual Studio is kind enough to tell us so:

Worse yet, we now have the dreaded Vampire of Buffer Overflows raising its head.

Buffer overflows are A Bad Thing because they can be exploited by hackers to do all sorts of naughty things. Not to worry though, there’s a secure version of strcpy you can use to avoid the above scenario.
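To put a number on the damage without actually triggering it, here is a minimal sketch that only does the arithmetic strcpy skips. The names bytes_needed and overflow_bytes are hypothetical helpers, not standard functions.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Bytes strcpy would write for `source`: its characters plus the terminator.
std::size_t bytes_needed(const char* source) {
    assert(source != nullptr);
    return std::strlen(source) + 1;
}

// Bytes that would land past the end of a destination of `dest_size` bytes.
std::size_t overflow_bytes(std::size_t dest_size, const char* source) {
    std::size_t needed = bytes_needed(source);
    return needed > dest_size ? needed - dest_size : 0;
}
```

Even the modest "Patrician" needs 10 bytes, so copying it into the 9-byte vetinari array would already spill one byte into whatever lives next door.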

The Secure Function

The function strncpy takes 3 arguments instead of strcpy’s 2 and is commonly used as a “secure” alternative to strcpy. The third argument is the key difference here, and it defines the maximum number of characters to copy from the source into the destination. If the end of the source string occurs before the number of characters defined in the third parameter, the destination is padded with zeros until the total number of characters defined has been written.
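The padding behaviour is easy to observe when the source is shorter than the count. In this sketch the helper zeros_after_strncpy is a made-up name; it fills the destination with '#' first so that every zero found afterwards must have been written by strncpy.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies `source` into `dest` with strncpy and reports how many bytes of
// `dest` ended up as zero. The array is filled with '#' beforehand so that
// any zero found afterwards was written by strncpy itself.
template <std::size_t N>
std::size_t zeros_after_strncpy(char (&dest)[N], const char* source) {
    assert(source != nullptr);
    std::memset(dest, '#', N);
    std::strncpy(dest, source, N);
    std::size_t zeros = 0;
    for (std::size_t i = 0; i < N; ++i) {
        if (dest[i] == '\0') ++zeros;
    }
    return zeros;
}
```

Copying the 3-character "Vet" into an 8-byte array leaves 5 zero bytes behind it: one that acts as the terminator and four more of padding.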

Now methinks if we send in the size of the destination array in the 3rd parameter to strncpy it should ensure that the Overflow Vampire is forever put to rest, right? Let’s try.

 strncpy(vetinari, 
         "'Taxation, gentlemen, is very "
         "much like dairy farming. The "
         "task is to extract the maximum "
         "amount of milk with the minimum "
         "of moo.'", 
         sizeof(vetinari));

Gives us the output:

What! What kind of fiendish scheme is this? Clearly something is wrong; the string should be at most length 8 because the array is defined implicitly for us with a size of 9 elements. Let’s take a look at the memory again.

There should have been a null terminator after the letter 'o' in Taxation, but its place is taken up by the letter 'n'. Remember this string is defined as having a total size of 9 elements, 8 of which can be text ({', T, a, x, a, t, i, o}) and the last of which must be a null terminator character. So the secure strncpy function has insecurities of its own, and this one in particular: if the source string is longer than the destination array, then even if you pass the correct size for the destination, it does not add a null terminator at the end of the copy but simply copies that many characters from the source to the destination. It only thinks about null terminators, if it thinks about them at all, when the source string ends before the count is used up. This is also A Bad Thing, as other insecure string functions that rely on the null terminator would trip over such a string, causing the Buffer Overflow Vampire to burst out of its coffin and grin at us once again. There is a simple stake-through-the-heart way of dealing with this, but it’s rather tedious to have to put the same fiend down twice.

vetinari[sizeof(vetinari) - 1] = '\0'; // And stay dead!

Here we explicitly enforce the string bounds by manually placing a null character in the element with index sizeof(vetinari) - 1 (the last element because, remember, we start counting elements from 0). This gives us the expected outcome.
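Put together, the two-step ritual might look something like the sketch below. Both is_terminated and copy_and_terminate are hypothetical helper names, and the array size is deduced by the template rather than passed by hand.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// True when the N-byte array holds a null terminator somewhere inside it.
template <std::size_t N>
bool is_terminated(const char (&text)[N]) {
    return std::memchr(text, '\0', N) != nullptr;
}

// strncpy followed by the explicit terminator from the text above.
template <std::size_t N>
void copy_and_terminate(char (&dest)[N], const char* source) {
    assert(source != nullptr);
    std::strncpy(dest, source, N);
    dest[N - 1] = '\0';  // overwrite the last element, whatever strncpy left there
}
```

After a bare strncpy of the long quote into vetinari, is_terminated reports false; after copy_and_terminate the array holds 'Taxatio plus a terminator, which is 8 characters and exactly what fits.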

The fact of the matter, however, is that strncpy was never meant to be a secure replacement for strcpy. It was a case of historical bad naming: it was meant to copy C-style strings into fixed-length records in the Old Days. The naming got it involved with strcpy as a more secure version of the function, which it can be if it is used with external checks, but its internal behavior is inconsistent with the notion of it being more secure. If anything, it will plug one hole and expose another (if used incorrectly).

Will the real secure function please stand up?

Now things begin to move in divergent paths. The secure functions that were by design intended as replacements for strcpy tend to change from platform to platform. Windows defines the strcpy_s function as a safe replacement for strcpy, whereas some flavours of Unix define the strlcpy function as a replacement. Both functions have essentially the same purpose and nearly identical arguments but do not have cross-platform support: strcpy_s is not available on Unix, and strlcpy is not available on Windows or on some flavours of Unix. The result is that if you want to write portable cross-platform code you are stuck either with strncpy and its internal behavior or with rolling your own string copy function. The second option might not be too bad, considering that you could add the strlcpy function to your own code so that it can be used across platforms. The first option should be cross-platform as long as you make sure to write the code properly, but doing that every time is difficult; people make mistakes.
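As an illustration of the roll-your-own option, here is a minimal sketch of a bounded copy loosely modelled on strlcpy. The name bounded_copy is hypothetical and this is not the BSD implementation, only an approximation of its contract: copy what fits, always terminate, and return the length of the source so the caller can spot truncation.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Copies as much of `source` as fits into `dest`, always leaving `dest`
// null terminated when dest_size > 0. Returns strlen(source); a result
// >= dest_size means the copy was truncated.
std::size_t bounded_copy(char* dest, const char* source, std::size_t dest_size) {
    assert(source != nullptr);
    assert(dest != nullptr || dest_size == 0);
    std::size_t source_len = std::strlen(source);
    if (dest_size > 0) {
        // Leave room for the terminator.
        std::size_t n = source_len < dest_size - 1 ? source_len : dest_size - 1;
        std::memcpy(dest, source, n);
        dest[n] = '\0';
    }
    return source_len;
}
```

The caller still has to pass the right size, typically sizeof on the destination array, and should compare the return value against that size to find out whether anything was cut off.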

The point here of course is that writing secure cross platform code in C/C++ that makes use of C Strings is a difficult thing to do – and we haven’t even come upon multilingual support. Even with the secure functions mentioned above an incorrect parameter such as the length of characters to copy can still result in insecure code, so it’s very important to understand what the code is doing.

Well that wraps up our first foray into the private lives of strings and what a life fraught with danger it is.

[1]

My friend took the lady’s ungloved hand, and examined it with as close an attention and as little sentiment as a scientist would show to a specimen.

“You will excuse me, I am sure. It is my business,” said he, as he dropped it. “I nearly fell into the error of supposing that you were typewriting. Of course, it is obvious that it is music. You observe the spatulate finger-ends, Watson, which is common to both professions? There is a spirituality about the face, however”–she gently turned it towards the light–“which the typewriter does not generate. This lady is a musician.”

Sherlock Holmes - The Adventure of the Solitary Cyclist - A.C Doyle

This is a post in the C++ Strings series.