C++ string handling

{{Short description|Program handling of character strings}}

{{Use dmy dates|date=January 2022}}

{{C++ Standard Library}}

The C++ programming language has support for string handling, mostly implemented in its standard library. The language standard specifies several string types, some inherited from C, some designed to make use of the language's features, such as classes and RAII. The most-used of these is {{mono|std::string}}.

Since the initial versions of C++ had only the "low-level" C string handling functionality and conventions, multiple incompatible string handling classes have been designed over the years and are still used instead of {{mono|std::string}}, and C++ programmers may need to handle multiple conventions in a single application.

History

The {{mono|std::string}} type is the main string datatype in standard C++ since 1998, but it was not always part of C++. From C, C++ inherited the convention of using null-terminated strings that are handled by a pointer to their first element, and a library of functions that manipulate such strings. In modern standard C++, a string literal such as {{mono|"hello"}} still denotes a NUL-terminated array of characters.{{cite book |title=Secure Coding in C and C++ |first=Robert C. |last=Seacord |publisher=Addison-Wesley |year=2013 |isbn=9780132981972 |url=https://books.google.com/books?id=Z9aNTafcb3IC&pg=PT82}}
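The inherited convention can be illustrated with a minimal sketch. The function names below are hypothetical and chosen only for this example; the library call used is {{mono|std::strlen}} from {{mono|<cstring>}}.

```cpp
#include <cstddef>
#include <cstring>

// Size in bytes of the array denoted by the literal "hello":
// five characters plus the terminating NUL.
std::size_t literal_array_size() {
    return sizeof("hello");
}

// Length as seen by the C library, which scans from the pointer to the
// first element until it reaches the terminating NUL.
std::size_t literal_length() {
    const char* s = "hello";  // the array decays to a pointer to its first element
    return std::strlen(s);
}
```

Under these definitions {{mono|literal_array_size()}} yields 6 and {{mono|literal_length()}} yields 5; the difference is the terminator, which the length does not count.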

Using C++ classes to implement a string type offers several benefits: automated memory management, a reduced risk of out-of-bounds accesses,{{cite book |title=Practical C++ Programming |first=Steve |last=Oualline |publisher=O'Reilly |year=2003}} and more intuitive syntax for string comparison and concatenation. Therefore, there was a strong temptation to create such a class. Over the years, C++ application, library and framework developers produced their own, incompatible string representations, such as the one in AT&T's Standard Components library (the first such implementation, 1983){{r|history}} or the {{mono|CString}} type in Microsoft's MFC.{{r|pro}} While {{mono|std::string}} standardized strings, legacy applications still commonly contain such custom string types and libraries may expect C-style strings, making it "virtually impossible" to avoid using multiple string types in C++ programs{{r|secure}} and requiring programmers to decide on the desired string representation ahead of starting a project.{{cite book |title=Professional C++ |first1=Nicholas A. |last1=Solter |first2=Scott J. |last2=Kleper |publisher=John Wiley & Sons |year=2005 |page=23 |isbn=9780764589492 |url=https://books.google.com/books?id=YkwA3DeET8UC&pg=PA23}}

In a 1991 retrospective on the history of C++, its inventor Bjarne Stroustrup called the lack of a standard string type (and some other standard types) in C++ 1.0 the worst mistake he made in its development; "the absence of those led to everybody re-inventing the wheel and to an unnecessary diversity in the most fundamental classes".{{cite conference |first=Bjarne |last=Stroustrup |authorlink=Bjarne Stroustrup |title=A History of C++: 1979–1991 |conference=Proc. ACM History of Programming Languages Conf. |year=1993 |url=http://www.stroustrup.com/hopl2.pdf}}

=Implementation issues=

The various vendors' string types have different implementation strategies and performance characteristics. In particular, some string types use a copy-on-write strategy, where an operation such as

string a = "hello!";

string b = a; // Copy constructor

does not actually copy the content of {{mono|a}} to {{mono|b}}; instead, both strings share their contents and a reference count on the content is incremented. The actual copying is postponed until a mutating operation, such as appending a character to either string, makes the strings' contents differ. Copy-on-write can make major performance changes to code using strings (making some operations much faster and some much slower). Though {{mono|std::string}} no longer uses it, many (perhaps most) alternative string libraries still implement copy-on-write strings.
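The mechanism can be sketched as follows. {{mono|CowString}} is a hypothetical class written for this illustration and does not reproduce any vendor's implementation; it assumes that {{mono|std::shared_ptr}} may stand in for a hand-written reference count, and it supports only a single mutating operation.

```cpp
#include <memory>
#include <string>
#include <utility>

// Hypothetical copy-on-write string, for illustration only.
class CowString {
public:
    explicit CowString(std::string text)
        : data_(std::make_shared<std::string>(std::move(text))) {}

    // The implicitly generated copy constructor copies only the shared_ptr,
    // so both objects share one buffer and the reference count is incremented.

    const std::string& str() const { return *data_; }

    // True if both objects currently refer to the same buffer.
    bool shares_buffer_with(const CowString& other) const {
        return data_ == other.data_;
    }

    // Mutating operation: the buffer is copied first if it is shared.
    void push_back(char c) {
        detach();
        data_->push_back(c);
    }

private:
    // Gives this object a private copy of the contents if anyone else shares them.
    void detach() {
        if (data_.use_count() > 1) {
            data_ = std::make_shared<std::string>(*data_);
        }
    }

    std::shared_ptr<std::string> data_;
};
```

After {{mono|CowString b = a;}} the two objects share a buffer; the first call to {{mono|b.push_back('?')}} performs the postponed copy and leaves {{mono|a}} unchanged.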

Some string implementations store 16-bit or 32-bit code points instead of bytes; this was intended to facilitate processing of Unicode text.{{r|qt}} However, it means that conversion to these types from {{mono|std::string}} or from arrays of bytes is dependent on the "locale" and can throw exceptions.{{Cite web|title=wstring_convert Class|url=https://docs.microsoft.com/en-us/cpp/standard-library/wstring-convert-class|access-date=2021-12-26|website=docs.microsoft.com|date=2021-08-03}} Any processing advantages of 16-bit code units vanished when the variable-width UTF-16 encoding was introduced (though there are still advantages when communicating with a 16-bit API such as Windows). Qt's {{mono|QString}} is an example.{{cite book |title=C++ GUI Programming with Qt4 |first1=Jasmin |last1=Blanchette |first2=Mark |last2=Summerfield |publisher=Pearson Education |year=2008 |isbn=9780132703000 |url=https://books.google.com/books?id=ia-smJ2_ClsC&pg=PT377}}

Third-party string implementations also differed considerably in the syntax to extract or compare substrings, or to perform searches in the text.

Standard string types

The {{mono|std::string}} class is the standard representation for a text string since C++98. The class provides some typical string operations like comparison, concatenation, find and replace, and a function for obtaining substrings. An {{mono|std::string}} can be constructed from a C-style string, and a C-style string can also be obtained from one.{{citation |first=Scott |last=Meyers |authorlink=Scott Meyers |year=2012 |title=Effective STL |publisher=Addison-Wesley |pages=64–65 |isbn=9780132979184 |url=https://books.google.com/books?id=U7lTySXdFk0C&pg=PT734}}
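A brief sketch of these operations follows. The helper function names are hypothetical; the member functions they call ({{mono|find}}, {{mono|replace}}, {{mono|c_str}}) and the {{mono|+}} operator are part of the standard class.

```cpp
#include <cstddef>
#include <cstring>
#include <string>

// Concatenation with operator+.
std::string greet(const std::string& name) {
    return "Hello, " + name + "!";
}

// Find and replace: replaces the first occurrence of `from` in `text` with `to`.
std::string replace_first(std::string text, const std::string& from,
                          const std::string& to) {
    const std::string::size_type pos = text.find(from);
    if (pos != std::string::npos) {  // npos signals "not found"
        text.replace(pos, from.size(), to);
    }
    return text;
}

// Round trip: construct a std::string from a C-style string, then obtain a
// C-style string back with c_str() and measure it with the C library.
std::size_t round_trip_length(const char* c_style) {
    const std::string s(c_style);
    return std::strlen(s.c_str());
}
```

For instance, {{mono|greet("world")}} produces {{mono|"Hello, world!"}}, {{mono|replace_first("a cat sat", "cat", "dog")}} produces {{mono|"a dog sat"}}, and {{mono|std::string("substring").substr(3, 6)}} produces {{mono|"string"}}.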

The individual units making up the string are of type {{mono|char}}, at least (and almost always) 8 bits each. In modern usage these are often not "characters", but parts of a multibyte character encoding such as UTF-8.
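The distinction between code units and characters can be made concrete with a small sketch. The function name is hypothetical, and the input is assumed to be valid UTF-8; no validation is performed.

```cpp
#include <cstddef>
#include <string>

// Counts code points in a string assumed to hold valid UTF-8, by skipping
// continuation bytes (those of the form 10xxxxxx).
std::size_t count_utf8_code_points(const std::string& utf8) {
    std::size_t count = 0;
    for (unsigned char byte : utf8) {
        if ((byte & 0xC0) != 0x80) {
            ++count;  // this byte starts a new code point
        }
    }
    return count;
}
```

For the string {{mono|"caf\xC3\xA9"}} (the word "café" encoded as UTF-8), {{mono|size()}} reports 5 code units while the function above counts 4 code points.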

The copy-on-write strategy was deliberately allowed by the initial C++ Standard for {{mono|std::string}} because it was deemed a useful optimization, and used by nearly all implementations.{{r|meyers}} However, there were mistakes: in particular, {{mono|operator[]}} returned a non-const reference in order to make it easy to port C in-place string manipulations (such code often assumed one byte per character, so this may not have been a good idea). This allowed the following code, which shows that {{mono|operator[]}} must make a copy even though it is almost always used only to examine the string and not to modify it:{{cite web |first1=Alisdair |last1=Meredith |first2=Hans |last2=Boehm |first3=Lawrence |last3=Crowl |first4=Peter |last4=Dimov |year=2008 |title=Concurrency Modifications to Basic String |url=http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2008/n2534.html |publisher=ISO/IEC JTC 1/SC 22/WG 21 |access-date=19 November 2015}}{{cite web | url=https://gcc.gnu.org/bugzilla/show_bug.cgi?id=21334 | title=21334 – Lack of Posix compliant thread safety in STD::basic_string}}

std::string original("aaaaaaa");

std::string string_copy = original; // make a copy

char* pointer = &string_copy[3]; // some tried to make operator[] return a "trick" class but this makes it complex

arbitrary_code_here(); // no optimizations can fix this

*pointer = 'b'; // if operator[] did not copy, this would change original unexpectedly

This caused implementations, first MSVC and later GCC, to move away from copy-on-write.{{cite web |url=https://selfboot.cn/en/2024/01/17/c++_string_cow/ |title=Unexpected C++ String Modification Caused by COW (Copy-On-Write) |newspaper=selfboot.cn |date=2024-01-17 |author=selfboot |access-date=May 13, 2025}} It was also discovered that the overhead in multi-threaded applications due to the locking needed to examine or change the reference count was greater than the overhead of copying small strings on modern processors{{cite journal |first=Herb |last=Sutter |authorlink=Herb Sutter |title=Optimizations That Aren't (In a Multithreaded World) |journal=C/C++ Users Journal |volume=17 |issue=6 |year=1999 |url=http://www.gotw.ca/publications/optimizations.htm}} (especially for strings smaller than the size of a pointer). The optimization was finally disallowed in C++11,{{r|boehm}} with the result that even passing a {{mono|std::string}} as an argument to a function, for example {{mono|void function_name(std::string s);}}, must be expected to perform a full copy of the string into newly allocated memory. The common idiom to avoid such copying is to pass as a const reference.{{cite web |url=https://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines#Rf-in |title=C++ Core Guidelines |work=Cpp Core Guidelines |date=May 8, 2025 |editor1=Stroustrup, Bjarne |editor2=Sutter, Herb |access-date=May 13, 2025}}

void function_name(const std::string& str_) {

std::cout << str_;

}

The C++17 standard added a new {{mono|string_view}} class{{Cite web|url=http://en.cppreference.com/w/cpp/string/basic_string_view|title=std::basic_string_view – cppreference.com|website=en.cppreference.com|access-date=2016-06-23}} that is only a pointer and length to read-only data, which makes passing arguments far faster than either of the above examples:

void print(std::string_view s) { std::cout << s; }

...

std::string x = ...;

print(x); // does not copy x.data()

print("this is a literal string"); // also does not copy the characters!

...
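Because a {{mono|string_view}} does not own its characters, operations that return a view can also avoid copying. The following sketch assumes a C++17 compiler; the function name {{mono|first_word}} is hypothetical.

```cpp
#include <string>
#include <string_view>

// Returns the part of `text` before the first space, or all of `text` if it
// contains no space. No characters are copied: the result refers to the same
// storage as the argument.
std::string_view first_word(std::string_view text) {
    // find() returns npos when there is no space, and substr(0, npos)
    // then yields the whole view.
    return text.substr(0, text.find(' '));
}
```

Given {{mono|std::string x = "hello world";}}, the view returned by {{mono|first_word(x)}} compares equal to {{mono|"hello"}} and its {{mono|data()}} pointer equals {{mono|x.data()}}. The view is only valid while {{mono|x}} is alive and unmodified.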

=Example usage=

#include <iomanip>
#include <iostream>
#include <string>

int main() {

std::string foo = "fighters";

std::string bar = "stool";

if (foo != bar) std::cout << "The strings are different!\n";

std::cout << "foo = " << std::quoted(foo)

<< " while bar = " << std::quoted(bar);

}

=Related classes=

{{mono|std::string}} is a typedef for a particular instantiation of the {{mono|std::basic_string}} template class.{{cite web

| title=C++ reference for basic_string

| url=http://cppreference.com/wiki/string/basic_string/

| publisher=Cppreference.com

| access-date=11 January 2011

}} Its definition is found in the {{mono|<string>}} header:

using string = std::basic_string<char>;

Thus {{mono|string}} provides {{mono|basic_string}} functionality for strings having elements of type {{mono|char}}. There is a similar class {{mono|std::wstring}}, which consists of {{mono|wchar_t}}, and is most often used to store UTF-16 text on Windows and UTF-32 on most Unix-like platforms. The C++ standard, however, does not impose any interpretation as Unicode code points or code units on these types and does not even guarantee that a {{mono|wchar_t}} holds more bits than a {{mono|char}}.{{cite book |title=Unicode Demystified: A Practical Programmer's Guide to the Encoding Standard |first=Richard |last=Gillam |url=https://books.google.com/books?id=wn5sXG8bEAcC&pg=PA714 |page=714 |publisher=Addison-Wesley Professional |year=2003|isbn=9780201700527 }} To resolve some of the incompatibilities resulting from {{mono|wchar_t}}'s properties, C++11 added two new classes: {{mono|std::u16string}} and {{mono|std::u32string}} (made up of the new types {{mono|char16_t}} and {{mono|char32_t}}), which are the given number of bits per code unit on all platforms.{{cite web

| title=C++11 Paper N3336

| url=http://www.open-std.org/jtc1/sc22/wg21/docs/papers/2012/n3336.html

| date=13 January 2012

| website=Open Standards

| publisher=Programming Language C++, Library Working Group

| access-date=2 November 2013

}}

C++11 also added new string literals of 16-bit and 32-bit "characters" and syntax for putting Unicode code points into null-terminated (C-style) strings.{{cite book |title=The C++ Programming Language |first=Bjarne |last=Stroustrup |publisher=Addison Wesley |url=https://books.google.com/books?id=kCF4BgAAQBAJ&pg=PA179 |year=2013 |page=179 |access-date=24 November 2015 |archive-url=https://web.archive.org/web/20151125003045/https://books.google.no/books?id=kCF4BgAAQBAJ&pg=PA179 |archive-date=25 November 2015 |url-status=dead}}
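A small sketch of the C++11 literal prefixes and the universal-character-name syntax follows. The function names are hypothetical; the code point U+1F600 is chosen only because it lies outside the Basic Multilingual Plane.

```cpp
#include <cstddef>
#include <string>

// The u prefix produces a UTF-16 literal of char16_t code units.
// U+1F600 needs a surrogate pair in UTF-16.
std::size_t utf16_code_units() {
    const std::u16string s = u"\U0001F600";
    return s.size();
}

// The U prefix produces a UTF-32 literal of char32_t code units.
// Every code point fits in a single unit.
std::size_t utf32_code_units() {
    const std::u32string s = U"\U0001F600";
    return s.size();
}
```

The same code point therefore occupies two code units in the {{mono|std::u16string}} and one in the {{mono|std::u32string}}.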

A {{mono|basic_string}} is guaranteed to be specializable for any type with a {{mono|char_traits}} struct to accompany it. As of C++11, only {{mono|char}}, {{mono|wchar_t}}, {{mono|char16_t}} and {{mono|char32_t}} specializations are required to be implemented.{{cite web |url=http://www.cplusplus.com/reference/string/char_traits/ |title=char_traits – C++ Reference |access-date=2015-08-01}}
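As an illustration of how the traits parameter customizes {{mono|basic_string}}, the sketch below defines a case-insensitive string. The names {{mono|ci_char_traits}} and {{mono|ci_string}} are hypothetical and not part of the standard library, and case folding through {{mono|std::tolower}} is an assumption that is only adequate for single-byte text in the current C locale.

```cpp
#include <cctype>
#include <cstddef>
#include <string>

// Hypothetical traits class: inherits everything from std::char_traits<char>
// and replaces only the comparison-related functions.
struct ci_char_traits : std::char_traits<char> {
    static char lower(char c) {
        return static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
    }
    static bool eq(char a, char b) { return lower(a) == lower(b); }
    static bool lt(char a, char b) { return lower(a) < lower(b); }

    // Three-way comparison of the first n characters, ignoring case.
    static int compare(const char* a, const char* b, std::size_t n) {
        for (std::size_t i = 0; i < n; ++i) {
            if (lt(a[i], b[i])) return -1;
            if (lt(b[i], a[i])) return 1;
        }
        return 0;
    }

    // Finds the first character among the first n that equals c, ignoring case.
    static const char* find(const char* s, std::size_t n, char c) {
        for (std::size_t i = 0; i < n; ++i) {
            if (eq(s[i], c)) return s + i;
        }
        return nullptr;
    }
};

// A string type whose comparisons and searches ignore case.
using ci_string = std::basic_string<char, ci_char_traits>;
```

With these definitions, {{mono|ci_string("Hello") == ci_string("hELLO")}} holds. A {{mono|ci_string}} is a distinct type from {{mono|std::string}}, so the two cannot be assigned or compared to each other directly.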

A {{mono|basic_string}} is also a Standard Library container, and thus the Standard Library algorithms can be applied to the code units in strings.
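For example, algorithms from {{mono|<algorithm>}} accept string iterators directly. The wrapper function names in this sketch are hypothetical; the algorithms themselves are standard.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>

// Reverses the code units of a string with std::reverse.
std::string reversed(std::string s) {
    std::reverse(s.begin(), s.end());
    return s;
}

// Counts occurrences of one code unit with std::count.
std::size_t count_unit(const std::string& s, char c) {
    return static_cast<std::size_t>(std::count(s.begin(), s.end(), c));
}

// Sorts the code units of a string with std::sort.
std::string sorted_units(std::string s) {
    std::sort(s.begin(), s.end());
    return s;
}
```

Here {{mono|reversed("stool")}} gives {{mono|"loots"}} and {{mono|sorted_units("hello")}} gives {{mono|"ehllo"}}. Since the algorithms operate on code units, applying them to multibyte text such as UTF-8 can split characters.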

=Critiques=

The design of {{mono|std::string}} has been held up as an example of monolithic design by Herb Sutter, who reckons that of the 103 member functions on the class in C++98, 71 could have been decoupled without loss of implementation efficiency.{{cite web |first=Herb |last=Sutter |url=http://www.gotw.ca/gotw/084.htm |title=Monoliths "Unstrung" |website=gotw.ca |access-date=23 November 2015}}

References

{{reflist}}

{{C++ programming language}}

{{Strings}}

Category:C++

Category:C++ Standard Library


Category:Articles with example C++ code