2.2. String abstractions in C

In the C language, the abstract idea of a string is implemented with an array of characters.

// the char array must be null terminated
char a[] = {'h', 'e', 'l', 'l', 'o', '\0'};  // null == '\0'

char b[] = {'h', 'e', 'l', 'l', 'o', 0};     // null == 0 also

// a quoted literal is just a special case of a char array
const char* c = "hello";   // string literals are read-only, so the pointer must be const

Arrays of char that are null terminated are commonly called byte strings or C strings. Given the byte string:

const char* howdy = "hi there!";

In memory, the characters that howdy points to are stored as:

digraph char_array {
  fontname = "Bitstream Vera Sans"
  label="Character array in memory"
  node [
     fontname = "Courier"
     fontsize = 14
     shape = "record"
     style=filled
     fillcolor=lightblue
  ]
  arr [
     label = "{'h'|'i'|' '|'t'|'h'|'e'|'r'|'e'|'!'|'\\0'}";
  ]
  idx [
     color = white;
     fillcolor=white;
     label = "{howdy[0]|howdy[1]|howdy[2]|howdy[3]|howdy[4]|howdy[5]|howdy[6]|howdy[7]|howdy[8]|howdy[9]}";
  ]


}

The last character in the array, '\0', is the null character, which marks the end of the string. The null character is a char with the value 0.
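As a minimal sketch of how the terminator is used, the function below walks an array until it reaches '\0'. The name count_chars is hypothetical and chosen only for illustration; it does the same work as std::strlen.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: counts the characters that come before the
// null terminator, which is the same work std::strlen does.
std::size_t count_chars(const char* text) {
    assert(text != nullptr);  // a null pointer is not a valid C string
    std::size_t n = 0;
    while (text[n] != '\0') {  // stop at the null character
        ++n;
    }
    return n;
}
```

For the howdy string above, count_chars(howdy) would return 9, and howdy[9] holds the null character.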

Note

Care must be taken to ensure that the array is large enough to hold all of the characters AND the null terminator. Forgetting to account for the null terminator, or making an "off-by-one" error, is one of the most common mistakes when working with C strings.

A character array may allocate more memory than the characters currently stored in it need. An array declaration like this:

char hi[10] = "Hello";

results in an in-memory representation like this:

digraph c {
  rankdir=LR
  fontname = "Bitstream Vera Sans"
  label="Character array with reserve memory"
  node [
     fontname = "Courier"
     fontsize = 14
     shape = "record"
     style=filled
     fillcolor=lightblue
  ]
  arr [
     label = "{H|e|l|l|o|\\0| | | | }"
  ]

}

The array elements after the null terminator are currently unused, but they are available. So, an array of size 10 that holds "Hello" has space for 4 more characters, for a total of 9 characters plus the null terminator.
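The difference between the size of the array and the length of the string can be checked in code. This is a sketch only; remaining_chars is a hypothetical helper name, not a library function.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: how many more characters fit in a buffer of
// `capacity` bytes that currently holds the C string `text`,
// while still leaving one byte for the null terminator.
std::size_t remaining_chars(const char* text, std::size_t capacity) {
    std::size_t used = std::strlen(text) + 1;  // characters plus '\0'
    assert(used <= capacity);
    return capacity - used;
}
```

With char hi[10] = "Hello"; sizeof(hi) is 10, std::strlen(hi) is 5, and remaining_chars(hi, sizeof(hi)) is 4.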

C strings have the advantage of being extremely lightweight and simple. Their main disadvantage is that they are too simple for many applications: the programmer must keep track of sizes and null terminators by hand. This makes them a pain to work with, which is why the Standard Template Library (STL) contains the string class.

2.3. Working with C strings

The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in the standard library. Various operations, such as copying, concatenation, tokenization, and searching are supported.

The complete list of byte string functions is available from cppreference.com.

It’s important to know how to work with byte strings in C++ because the C string functions that C++ inherits from C continue to provide a few capabilities not implemented elsewhere in the STL.

In addition, when the type you have is a byte string, it’s just easier and more efficient to manipulate the byte string directly, rather than create a temporary std::string merely to perform an operation and then convert back. In general, you want to try to avoid these kinds of unnecessary type conversions.

2.3.1. Change character case

Many languages provide utilities to change character case as part of the string class.

Not C++.

C++ uses the legacy null-terminated byte strings library to provide these features.

Changing character case is a common task, and unless you choose to write your own versions, the functions from the STL are the ones you should use.

Many character conversion functions are defined in the cctype header. These functions, which C++ inherited from C, are often perfectly acceptable. However, there are some notable exceptions, such as the toupper function.

The version of std::toupper declared in the locale header takes a single character, which can be of any character type, together with a std::locale. The character is not modified, and the function returns a character of the same type as the one provided.

Because of this, the C++ version that uses std::locale is preferred.
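A minimal sketch of the locale version follows. The helper name make_upper is hypothetical, and the default-constructed std::locale (a copy of the current global locale) is an assumption made for the example.

```cpp
#include <cassert>
#include <locale>

// Hypothetical helper: upper-cases a C string in place using the
// std::toupper overload from <locale>.
void make_upper(char* text, const std::locale& loc = std::locale()) {
    assert(text != nullptr);
    for (; *text != '\0'; ++text) {
        *text = std::toupper(*text, loc);  // returns char, not int
    }
}
```

Because this overload returns the same character type it was given, no cast is needed when storing the result back into the array.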

The C version of std::toupper takes a single char, which is not modified, and returns an int. The return value can be used as the uppercase version of the input character.

Note

Use the right toupper!

The C version of toupper returns int values, not character values. This can cause unexpected behavior or conversions.

For these reasons, the std::locale version of toupper is preferred.

See the description of the std::locale version above for details.

toupper is defined in header cctype.

This function uses the default C locale to replace the lowercase letters abcdefghijklmnopqrstuvwxyz with respective uppercase letters ABCDEFGHIJKLMNOPQRSTUVWXYZ. Non-ASCII characters are not handled.

Recall that a char implicitly converts to int.
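If the cctype version is used, the int result has to be converted back to char. The sketch below shows one way to do that; upper_char and upper_in_place are hypothetical names. The cast to unsigned char is there because passing a negative char value to this function is undefined behavior.

```cpp
#include <cassert>
#include <cctype>

// Hypothetical helper: upper-cases one char with the <cctype> function.
char upper_char(char ch) {
    // Convert to unsigned char first so the argument is never negative.
    int result = std::toupper(static_cast<unsigned char>(ch));
    return static_cast<char>(result);  // convert the int result back to char
}

// Hypothetical helper: applies upper_char to every character of a C string.
void upper_in_place(char* text) {
    assert(text != nullptr);
    for (; *text != '\0'; ++text) {
        *text = upper_char(*text);
    }
}
```

Characters that are not lowercase letters, such as digits and spaces, are returned unchanged.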

2.3.2. Copying and comparing C strings

Unlike most of the types we work with in C++, byte strings are simple arrays. Arrays cannot be assigned to each other using operator=. The values in an array must be copied one element at a time, for example using a loop.

Similarly, if you compare two arrays for equivalence, operator== will only return true if both arrays share the same memory address.

This is not what we usually want.

Like copying, in order to check a pair of arrays for equivalence a loop is used to compare each element one at a time until a difference is found.
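The two loops described above can be sketched as follows. The names copy_chars and same_chars are hypothetical; in practice the cstring functions would normally be used instead.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical helper: copies src into dest one element at a time,
// including the null terminator. dest must be large enough.
void copy_chars(char* dest, const char* src) {
    assert(dest != nullptr && src != nullptr);
    std::size_t i = 0;
    while (src[i] != '\0') {
        dest[i] = src[i];
        ++i;
    }
    dest[i] = '\0';  // terminate the copy
}

// Hypothetical helper: true when both strings hold the same characters.
bool same_chars(const char* a, const char* b) {
    assert(a != nullptr && b != nullptr);
    std::size_t i = 0;
    // Advance while the characters match and the end is not reached.
    while (a[i] != '\0' && a[i] == b[i]) {
        ++i;
    }
    return a[i] == b[i];  // equal only if both ended at the same place
}
```

Two separate arrays holding "hello" have different addresses, so comparing the addresses reports a difference, while same_chars reports that the contents match.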

The copy and compare functions are defined in the cstring header.

The strcpy function takes two byte strings as parameters and copies the source character array, including the null terminator, to the destination character array.

Note the order of the arguments: the destination comes first, then the source.

A common source of error is to swap them.
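Because strcpy does not check the size of the destination, one possible approach is to check it before copying. This is only a sketch; copy_if_fits is a hypothetical helper, not part of cstring.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies src into dest only when it fits,
// and reports whether the copy happened.
bool copy_if_fits(char* dest, std::size_t dest_size, const char* src) {
    assert(dest != nullptr && src != nullptr);
    if (std::strlen(src) + 1 > dest_size) {
        return false;  // not enough room for the characters plus '\0'
    }
    std::strcpy(dest, src);  // destination first, source second
    return true;
}
```

With a destination declared as char buffer[6], copying "hello" succeeds, while "hello!" is rejected because it needs 7 bytes.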

The strncpy function also copies byte strings, but it copies at most a provided count of characters. If the source is longer than the count, the destination is not null terminated.
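A common way to use strncpy safely is to copy one character fewer than the destination size and then write the terminator by hand. The helper name copy_truncated below is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: copies as much of src as fits in dest and
// always leaves dest null terminated.
void copy_truncated(char* dest, std::size_t dest_size, const char* src) {
    assert(dest != nullptr && src != nullptr && dest_size > 0);
    std::strncpy(dest, src, dest_size - 1);
    dest[dest_size - 1] = '\0';  // strncpy does not add this when src is too long
}
```

Copying "hello" into a char small[4] this way leaves "hel" in the destination.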

The strcmp function takes two byte strings as parameters and returns 0 if the two strings are equal, that is, if every character matches up to and including the null terminator.

If the first operand is greater than the second, then a positive value is returned. If the first operand is less than the second, then a negative value is returned.

Note

These functions are not locale-aware.

If you need to make locale aware comparisons, then use strcoll.
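Only the sign of the strcmp result is meaningful; the exact positive or negative value is not specified. The sketch below reduces the result to -1, 0, or 1. The name compare_sign is hypothetical.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: reduces the strcmp result to -1, 0, or 1.
int compare_sign(const char* lhs, const char* rhs) {
    assert(lhs != nullptr && rhs != nullptr);
    int result = std::strcmp(lhs, rhs);
    return (result > 0) - (result < 0);
}
```

For example, compare_sign("apple", "banana") is -1, and compare_sign("apple", "apples") is also -1 because the shorter string ends first.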

The strncmp function takes two byte strings as parameters and returns 0 if the compared characters are equal.

However, this function compares at most a specified number of characters.
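One use of a limited comparison is checking whether a string begins with a given prefix. This is a sketch under that assumption; starts_with is a hypothetical helper name.

```cpp
#include <cassert>
#include <cstring>

// Hypothetical helper: true when text begins with prefix.
bool starts_with(const char* text, const char* prefix) {
    assert(text != nullptr && prefix != nullptr);
    // Compare only as many characters as the prefix contains.
    return std::strncmp(text, prefix, std::strlen(prefix)) == 0;
}
```

Comparing "hello" and "help" with a count of 3 reports equality, while a count of 4 reports a difference.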

Self Check

Fix the errors in the printf line below:


More to Explore
