2.2. String abstractions in C
In the C language, the abstract idea of a string is implemented with an array of characters.
// the char array must be null terminated
char a[] = {'h', 'e', 'l', 'l', 'o', '\0'}; // null == '\0'
char b[] = {'h', 'e', 'l', 'l', 'o', 0}; // null == 0 also
// a quoted literal is just a special case of a char array
const char* c = "hello"; // string literals are read-only, so point to const char
Arrays of char that are null terminated are commonly called byte strings or C strings.
Given the byte string:
const char* howdy = "hi there!";
In memory, the characters that howdy points to are stored as:
'h' 'i' ' ' 't' 'h' 'e' 'r' 'e' '!' '\0'
The last character in the array, '\0', is the null character, and is used to indicate the end of the string.
The null character is a char equal to 0.
Note
Care must be taken to ensure that the array is large enough to hold all of the characters AND the null terminator. Forgetting to account for the null terminator, an ‘off-by-one’ error, is one of the most common mistakes when working with C strings.
A character array may allocate more memory than the characters currently stored in it require. An array declaration like this:
char hi[10] = "Hello";
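results in an in-memory representation like this:
'H' 'e' 'l' 'l' 'o' '\0' '\0' '\0' '\0' '\0'
The array elements after the first null are not part of the string, but they are available for use. An array of size 10 therefore has space for 4 more characters, 9 in total, plus the null terminator.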
C strings have the advantage of being extremely lightweight and simple. Their main disadvantage is that they are too simple for many applications: the programmer must manage sizes and terminators by hand, which makes them a pain to work with. This is why the C++ standard library provides the std::string class.
2.3. Working with C strings
The C programming language has a set of functions implementing operations on strings (character strings and byte strings) in the standard library. Various operations, such as copying, concatenation, tokenization, and searching are supported.
The complete list of byte string functions is available from cppreference.com.
It’s important to know how to work with byte strings in C++ because the C string functions that C++ inherits from C continue to provide a few capabilities not implemented elsewhere in the standard library.
In addition, when the type you have is a byte string, it is often easier and more efficient to manipulate the byte string directly than to create a temporary std::string merely to perform an operation and then convert back.
In general, try to avoid these kinds of unnecessary type conversions.
2.3.1. Change character case
Many languages provide utilities to change character case as part of the string class.
Not C++.
C++ relies on the legacy null-terminated byte strings library to provide these features.
Changing character case is a common task, and unless you choose to write your own versions, these standard library functions are the ones you should use.
Many character classification and conversion functions are defined in the cctype header.
These functions, which C++ inherited from C, are often perfectly acceptable.
However, there are some notable exceptions, such as the toupper function.
The version of std::toupper declared in the locale header is a function template. It takes a single character, which can be any character type and is not modified, together with a std::locale, and returns a character of the same type as the one provided.
Because of this, the C++ version that uses std::locale is preferred.
The version of std::toupper declared in the cctype header takes a single character, passed as an int, which is not modified, and returns an int.
The return value can be used as the uppercase version of the input character.
Note
Use the right toupper!
The C version of toupper returns int values, not character values. This can cause unexpected behavior or conversions.
For these reasons, the std::locale version of toupper is preferred.
The C version of toupper is defined in the cctype header.
In the default "C" locale, this function replaces the lowercase letters abcdefghijklmnopqrstuvwxyz with the respective uppercase letters ABCDEFGHIJKLMNOPQRSTUVWXYZ.
Non-ASCII characters are not handled.
Recall that char implicitly converts to int.
2.3.2. Copying and comparing C strings
Unlike most of the types we work with in C++, byte strings are simple arrays.
Arrays cannot be assigned to each other using operator=.
The values in an array must be copied one element at a time, for example using a loop.
Similarly, if you compare two arrays for equivalence, operator== compares addresses rather than contents, so it only returns true if both operands refer to the same memory address.
This is not what we usually want.
As with copying, checking a pair of arrays for equivalence requires a loop that compares the elements one at a time until a difference is found.
The copy and compare functions are defined in the cstring header.
The strcpy function takes two byte strings as parameters and copies the source character array, including the null terminator, to the destination character array.
Note the order of the arguments: the destination comes first and the source second.
A common source of error is to swap them.
The strncpy function also copies byte strings, but copies at most a provided count of characters.
If the source is at least count characters long, the destination is not null terminated.
The strcmp function takes two byte strings as parameters and returns 0 if every element in both arrays is equal.
Otherwise, the result is decided by the first pair of characters that differ. If the first operand is greater than the second, then a positive value is returned. If the first operand is less than the second, then a negative value is returned.
Note
These functions are not locale-aware.
If you need to make locale-aware comparisons, then use strcoll.
The strncmp function takes two byte strings as parameters and returns 0 if the compared elements in both arrays are equal.
However, this function compares at most a specified number of characters.
Self Check
sc-1-9: Given the following:
char text[32];
strcpy(text, "hello");
int len = strlen(text);
What is the value of len?
Fix the errors in the printf
line below:
sc-1-11: Which #include is required to use functions such as std::atoi and std::atof?
sc-1-12: Which #include is required to use functions such as std::stoi and std::stol?
More to Explore
cppreference.com byte strings
Bjarne Stroustrup’s C++11 FAQ: Raw String literals
Locales: