# String literals

[Pull request](https://github.com/carbon-language/carbon-lang/pull/199)

## Table of contents

- [Problem](#problem)
- [Background](#background)
    - [Existing practice](#existing-practice)
- [Proposal](#proposal)
- [Details](#details)
    - [Non-raw string literals](#non-raw-string-literals)
        - [Escape sequences](#escape-sequences)
    - [Raw string literals](#raw-string-literals)
    - [Encoding](#encoding)
- [Alternatives considered](#alternatives-considered)
    - [Block string literals](#block-string-literals)
        - [Leading whitespace removal](#leading-whitespace-removal)
        - [Terminating newline](#terminating-newline)
    - [Escape sequences](#escape-sequences-1)
    - [Raw string literals](#raw-string-literals-1)
        - [Trailing whitespace](#trailing-whitespace)
        - [Line separators](#line-separators)
    - [Internal whitespace](#internal-whitespace)
- [Rationale](#rationale)

## Problem

This proposal specifies lexical rules for constant strings in Carbon.

## Background

We wish to provide a syntax for writing literals containing human-readable text. Note that "human-readable text" here should be understood broadly: such text may be subject to further processing, and may in some cases be intended to be interpreted by a computer rather than by a human (such as a regular expression, program source code, or a C++ mangled name), but broadly represents a sequence of characters rather than arbitrary binary data.

Such text is typically represented in an _encoding_, which is a bidirectional mapping between a sequence of characters in text and a sequence of bounded integer values known as _code units_, suitable for storage, transmission, and processing. For example, the Russian word углерод (carbon) is encoded in the UTF-8 encoding as D1₁₆ 83₁₆ D0₁₆ B3₁₆ D0₁₆ BB₁₆ D0₁₆ B5₁₆ D1₁₆ 80₁₆ D0₁₆ BE₁₆ D0₁₆ B4₁₆.

### Existing practice

See [Comparison of programming languages (strings) on Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28strings%29).

Simple string literals are specified in most programming languages as text delimited by double-quote characters, `"like this"`. Such string literals are usually restricted to begin and end on the same source line. Three additional features are commonly seen:

- Escape sequences, which permit string literals to include characters that are difficult to type, are ambiguous for the reader, or that would be problematic in some way (such as whitespace characters, characters that are invisible, and characters that change how other characters are rendered), and also to include arbitrary code units. One common convention is to use `\` to introduce an escape sequence, where, for example:

    - `\n` represents a newline character,
    - `\u1234` represents the Unicode character U+1234,
    - `\xAB` represents the code unit AB₁₆,
    - `\"` represents a single `"` character and does not terminate the string literal,
    - `\\` represents a single `\` character,

    and so on.

    The set of single-letter escape sequences has a lot of commonality between languages, with some variation between older and newer languages:

    - C++ and Python allow `\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v` for bell, backspace, form feed, new line, carriage return, tab, and vertical tab, respectively.
    - JavaScript drops support for `\a`.
    - Java additionally drops support for `\v`.
    - Rust and Swift additionally drop support for `\b` and `\f`, leaving only `\n`, `\r`, and `\t`.

    The rules for numeric escape sequences differ between kinds of escape sequence and between languages. The rules in C++, JavaScript, Python, Rust, and Swift are as follows:

    - `\123` is interpreted as an octal code unit value, and up to three octal digits are consumed. In JavaScript and C++, values greater than 377₈ (255₁₀) are invalid (assuming an 8-bit character type). In Python, values greater than 377₈ are interpreted modulo 256. Rust and Swift do not allow octal escapes in general, but do allow `\0` as a special case.
    - `\xAB` is interpreted as a hexadecimal code unit value. In C++, any nonzero number of hexadecimal digits can follow as part of the escape sequence. In Python, JavaScript, and Rust, exactly two digits are required. In Rust, the value must be less than or equal to 7F₁₆ except for `b`-prefixed strings (byte array literals). Swift does not support this form of escape sequence.
    - `\uABCD` is interpreted as a hexadecimal code point value. In C++, Python, and JavaScript, exactly four hexadecimal digits must follow, but JavaScript allows any nonzero number of digits to be specified using `\u{ABCDE}` notation. Rust and Swift support only the `\u{ABCDE}` notation.
    - `\U0010FFFD` is interpreted as a hexadecimal code point value in C++ and Python, but not in JavaScript, Rust, or Swift, and requires exactly eight hexadecimal digits.

- Raw string literals, in which escape sequences are not recognized. These are often used in situations where escape sequences are undesirable, but in which the escape character or regular string terminator is used frequently. Such literals are useful when embedding one machine-readable language in another, when those languages share some escaping conventions. Such functionality may also provide a way to customize the string delimiters.

    - In Python, raw string literals have an `r` prefix: `r"li\ngo"` is a six character string whose third character is `\`.
    - In C++, raw string literals have an `R` prefix, along with a custom delimiter (which may be empty): `R"DELIM(li\ngo)DELIM"` is a six character string (plus a nul terminator).
    - In Rust, raw string literals have an `r` prefix and any matching number of `#` characters enclose the string contents: the third character of `r"lo\ng"` is `\`, and the second character of `r#" " "#` is `"`.
    - In Swift, raw string literals are a generalization of regular string literals: literals are introduced by any number, _N_, of `#` characters followed by a `"`, terminated by `"` followed by _N_ `#` characters, and escape sequences are introduced by a `\` followed by _N_ `#` characters: `#" " # \n \#n \#\# "#` has the same contents as the C++ string literal `" \" # \\n \n \\# "`.
    - In Java, raw string literals are delimited by a sequence of one or more backticks instead of double quotes: the fourth character of \`\`foo\`\bar\`\` is a backtick and the fifth is a backslash.

- Multiline string literals provide a mechanism for a string to easily span more than one line of source text.

    - In C++, Rust, and Java, raw string literals are used to represent multiline string literals.
    - In Python, different delimiters (`"""` or `'''` instead of `"` or `'`) are used to represent multiline string literals, and plain `"` and `""` can thereby appear in the string contents, but these literals otherwise behave the same as regular string literals.
    - In Swift, different delimiters are used (`"""` instead of `"`), but unlike in Python, the string content cannot be on the same line as the delimiters, and the resulting mandatory leading and trailing newlines are not included in the string. Internal newlines can be removed by preceding them with backslashes.
    - In JavaScript, backtick-delimited strings can contain newlines; this syntax also allows string interpolation as described below.

In addition, some languages, primarily scripting languages, also provide a mechanism for string interpolation, wherein a string value is formed by including the formatted values of some variables in a given format string. For example, `"Hello, $planet."` might produce a string value including the formatted value of the variable named `planet`. Such interpolation facilities are outside the scope of this proposal.

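As a concrete illustration of the Python conventions listed above, the following short script checks a few of the stated properties. It only demonstrates existing Python behavior; the variable names are arbitrary and carry no meaning for this proposal.

```python
# Raw string: the backslash is kept as an ordinary character.
raw = r"li\ngo"

# Regular string: \n is replaced by a single newline character.
cooked = "li\ngo"

# Triple-quoted string: may span lines and contain plain " and "".
block = """She said "hi" and "" was empty.
Second line.
"""

print(len(raw), raw[2] == "\\")        # six characters, third is a backslash
print(len(cooked), cooked[2] == "\n")  # five characters, third is a newline
print(block.count("\n"))               # one newline per source line
```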
## Proposal

- Single-line string literals are delimited by `"`s: `"hello"`
- Multi-line string literals are introduced by a `"""` followed by a newline and terminated by a line beginning with a `"""`. The indentation of the terminating line is removed from all preceding lines:

    ```carbon
    var String: henry_vi = """
      The winds grow high; so do your stomachs, lords.
      How irksome is this music to my heart!
      When such strings jar, what hope of harmony?
      I pray, my lords, let me compound this strife.
          -- History of Henry VI, Part II, Act II, Scene 1, W. Shakespeare
      """;
    ```

    Only the final line of this string literal begins with whitespace. The opening newline is not part of the string's contents, but the trailing newline is; the first character of this example string is `T` and the last character is a newline.

- The opening `"""` of a multi-line string literal can be followed by a _file type indicator_, to assist tooling in understanding the intent of the string. This indicator has no effect on the meaning of the program.

    ```carbon
    var String: cpp_snippet = """c++
      #include <iostream>
      int main() {
        std::cout << "hello world" << std::endl;
      }
      """;
    ```

- Escape sequences are introduced with a `\` character; the most common C and C++ escape sequences are supported: `"hello\nworld"`. Octal escapes (`\177`), `\a`, `\b`, `\f`, and `\v` are removed. `\uNNNN` and `\U00NNNNNN` are replaced by `\u{NNNNNN}`. An escape sequence `\<newline>` is permitted in multi-line string literals, and results in no string contents.
- Raw string literals are supported, for both the single and multi-line case, following the Swift convention: they are introduced by prefixing the opening delimiter with one or more `#`s, and suffixing the closing delimiter with a matching number of `#`s: `#"foo\s*bar"#` or `#"foo"bar"#`. Escape sequences can be introduced in a raw string literal by inserting a matching number of `#`s after the `\` character: `#"foo\#nbar"#` contains a newline character.
- Unlike in C and C++, adjacent string literals are not implicitly concatenated.

## Details

### Non-raw string literals

A _simple string literal_ is formed of a sequence of

- characters other than backslashes, double quotation marks, and vertical whitespace
- [escape sequences](#escape-sequences)

enclosed in double quotation marks (`"`). Each escape sequence is replaced with the corresponding character sequence or code unit sequence.

```carbon
var String: lucius = "The strings, my lord, are false.";
```

A _block string literal_ starts with three double quotation marks, followed by an optional file type indicator, followed by a newline, and ends at the next instance of three double quotation marks whose first `"` is not part of a `\"` escape sequence. The closing `"""` shall be the first non-whitespace characters on that line. The lines between the opening line and the closing line (exclusive) are _content lines_. The content lines shall not contain `\` characters that do not form part of an escape sequence.

The _indentation_ of a block string literal is the sequence of horizontal whitespace preceding the closing `"""`. Each non-empty content line shall begin with the indentation of the string literal. The content of the literal is formed as follows:

- The indentation of the closing line is removed from each non-empty content line.
- All trailing whitespace on each line, including the line terminator, is replaced with a single line feed (U+000A) character.
- The resulting lines are concatenated.
- Each [escape sequence](#escape-sequences) is replaced with the corresponding character sequence or code unit sequence.

A content line is considered empty if it contains only whitespace characters.

```carbon
var String: w = """
  This is a string literal. Its first character is 'T' and its last character is
  a newline character. It contains another newline between 'is' and 'a'.
  """;

// This string literal is invalid because the """ after 'closing' terminates
// the literal, but is not at the start of the line.
var String: invalid = """
  error: closing """ is not on its own line.
  """;
```

A _file type indicator_ is any sequence of non-whitespace characters other than `"` or `#`. The file type indicator has no semantic meaning to the Carbon compiler, but some file type indicators are understood by the language tooling (for example, syntax highlighter, code formatter) as indicating the structure of the string literal's content.

```carbon
// This is a block string literal. Its first two characters are spaces, and its
// last character is a line feed. It has a file type of 'c++'.
var String: starts_with_whitespace = """c++
    int x = 1; // This line starts with two spaces.
    int y = 2; // This line starts with two spaces.
  """;
```

The file type indicator might contain semantic information beyond the file type itself, such as instructions to the code formatter to disable formatting for the code block.

**Open question:** This proposal does not suggest any concrete set of recognized file type indicators. It would be useful to informally specify a set of well-known indicators, so that tools have a common understanding of what those indicators mean, perhaps in a best practices guide.

#### Escape sequences

Within a string literal, the following escape sequences are recognized:

| Escape        | Meaning                                                  |
| ------------- | -------------------------------------------------------- |
| `\t`          | U+0009 CHARACTER TABULATION                              |
| `\n`          | U+000A LINE FEED                                         |
| `\r`          | U+000D CARRIAGE RETURN                                   |
| `\"`          | U+0022 QUOTATION MARK (`"`)                              |
| `\'`          | U+0027 APOSTROPHE (`'`)                                  |
| `\\`          | U+005C REVERSE SOLIDUS (`\`)                             |
| `\0`          | Code unit with value 0                                   |
| `\xHH`        | Code unit with value HH₁₆                                |
| `\u{HHHH...}` | Unicode code point U+HHHH...                             |
| `\<newline>`  | No string literal content produced (block literals only) |

This includes all C++ escape sequences except:

- `\?`, which was historically used to escape trigraphs in string literals, and no longer serves any purpose.
- `\ooo` octal escapes, which are removed because Carbon does not support octal literals; `\0` is retained as a special case, which is expected to be important for C interoperability.
- `\uABCD`, which is replaced by `\u{ABCD}`.
- `\U0010FFFF`, which is replaced by `\u{10FFFF}`.
- `\a` (bell), `\b` (backspace), `\v` (vertical tab), and `\f` (form feed). `\a` and `\b` are obsolescent, and `\f` and `\v` are largely obsolete. These characters can be expressed with `\x07`, `\x08`, `\x0B`, and `\x0C` respectively if needed.

Note that this is the same set of escape sequences supported by [Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html#ID295) and [Rust](https://doc.rust-lang.org/reference/tokens.html), except that, unlike in Swift, support for `\xHH` is provided.

While this proposal takes a firm stance on not permitting octal escape sequences, the decision to not allow `\1`..`\7`, and more generally to not treat `\DDDD` as a decimal escape sequence, is _experimental_.

In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9` or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code points can be expressed by number using `\u{10FFFF}` notation, which accepts any number of hexadecimal characters. Any numeric code point in the ranges 0₁₆-D7FF₁₆ or E000₁₆-10FFFF₁₆ can be expressed this way.

_Open question:_ Some programming languages (notably Python) support a `\N{unicode character name}` syntax. We could add such an escape sequence, but this proposal does not include one. Future proposals considering adding such support should pay attention to work done by C++'s Unicode study group in this area.

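The escape table above can be read as a small decoding procedure. The following Python sketch is one possible reading, not a reference implementation: the function name `decode_escapes` and its error handling are assumptions made for illustration, it handles only the body of a non-raw single-line literal (so the block-literal-only escaped newline is out of scope), and it produces UTF-8 bytes. It also rejects `\0` followed directly by a decimal digit, which this proposal disallows.

```python
# Hypothetical helper names; only the escape rules come from the proposal.
SIMPLE_ESCAPES = {
    "t": b"\t", "n": b"\n", "r": b"\r",
    '"': b'"', "'": b"'", "\\": b"\\",
}
HEX_DIGITS = "0123456789ABCDEF"  # upper-case only, per the table's note


def decode_escapes(body: str) -> bytes:
    """Expand escape sequences in the body of a simple string literal."""
    out = bytearray()
    i = 0
    while i < len(body):
        ch = body[i]
        if ch != "\\":
            out += ch.encode("utf-8")  # ordinary characters are kept verbatim
            i += 1
            continue
        kind = body[i + 1 : i + 2]     # character after the backslash, or ""
        if kind in SIMPLE_ESCAPES:
            out += SIMPLE_ESCAPES[kind]
            i += 2
        elif kind == "0":
            nxt = body[i + 2 : i + 3]
            if nxt and nxt in "0123456789":
                raise ValueError(r"\0 must not be followed by a decimal digit")
            out.append(0)
            i += 2
        elif kind == "x":
            digits = body[i + 2 : i + 4]
            if len(digits) != 2 or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(r"\x expects exactly two hexadecimal digits")
            out.append(int(digits, 16))  # a raw code unit, not a code point
            i += 4
        elif kind == "u" and body[i + 2 : i + 3] == "{":
            end = body.find("}", i + 3)
            digits = body[i + 3 : end] if end != -1 else ""
            if not digits or any(d not in HEX_DIGITS for d in digits):
                raise ValueError(r"malformed \u{...} escape")
            cp = int(digits, 16)
            if cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
                raise ValueError("code point outside the permitted ranges")
            out += chr(cp).encode("utf-8")
            i = end + 1
        else:
            raise ValueError("unknown escape sequence")
    return bytes(out)


print(decode_escapes(r"hello\nworld"))
print(decode_escapes(r"foo\x00123"))
```

Returning bytes rather than text reflects that `\xHH` inserts a code unit, so the result need not be valid UTF-8.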
The escape sequence `\0` shall not be followed by a decimal digit. In cases where a null byte should be followed by a decimal digit, `\x00` can be used instead: `"foo\x00123"`. The intent is to preserve the possibility of permitting decimal escape sequences in the future.

A backslash followed by a line feed character is an escape sequence that produces no string contents. This escape sequence is _experimental_, and can only appear in multi-line string literals. This escape sequence is processed after trailing whitespace is replaced by a line feed character, so a `\` followed by horizontal whitespace followed by a line terminator removes the whitespace up to and including the line terminator. Unlike in Rust, but like in Swift, leading whitespace on the line after an escaped newline is not removed, other than whitespace that matches the indentation of the terminating `"""`.

A character sequence starting with a backslash that doesn't match any known escape sequence is invalid. Whitespace characters other than space and, for block string literals, new line optionally preceded by carriage return are disallowed. All other characters (including non-printable characters) are preserved verbatim. Because all Carbon source files are required to be valid sequences of Unicode characters, code unit sequences that are not valid UTF-8 can only be produced by `\x` escape sequences.

The choice to disallow raw tab characters in string literals is _experimental_.

```carbon
var String: fret = "I would 'twere something that would fret the string,\n" +
                   "The master-cord on's \u{2764}\u{FE0F}!";

// This string contains two characters (prior to encoding in UTF-8):
// U+1F3F9 (BOW AND ARROW) followed by U+0032 (DIGIT TWO)
var String: password = "\u{1F3F9}2";

// This string contains no newline characters.
var String: type_mismatch = """
  Shall I compare thee to a summer's day? Thou art \
  more lovely and more temperate.\
  """;

var String: trailing_whitespace = """
  This line ends in a space followed by a newline. \n\
      This line starts with four spaces.
  """;
```

### Raw string literals

In order to allow strings whose contents include backslashes and double quotes, the delimiters of string literals can be customized by prefixing the opening delimiter with _N_ `#` characters. A closing delimiter for such a string is only recognized if it is followed by _N_ `#` characters, and similarly, escape sequences in such string literals are recognized only if the `\` is also followed by _N_ `#` characters. A `\`, `"`, or `"""` not followed by _N_ `#` characters has no special meaning.

| Opening delimiter | Escape sequence introducer    | Closing delimiter |
| ----------------- | ----------------------------- | ----------------- |
| `"` / `"""`       | `\` (for example, `\n`)       | `"` / `"""`       |
| `#"` / `#"""`     | `\#` (for example, `\#n`)     | `"#` / `"""#`     |
| `##"` / `##"""`   | `\##` (for example, `\##n`)   | `"##` / `"""##`   |
| `###"` / `###"""` | `\###` (for example, `\###n`) | `"###` / `"""###` |
| ...               | ...                           | ...               |

For example:

```carbon
var String: x = #"""
  This is the content of the string. The 'T' is the first character
  of the string.
  """ <-- This is not the end of the string.
  """#;
  // But the preceding line does end the string.
// OK, final character is \
var String: y = #"Hello\"#;
var String: z = ##"Raw strings #"nesting"#"##;
var String: w = #"Tab is expressed as \t. Example: '\#t'"#;
```

Note that both a single-line raw string literal and a multi-line raw string literal can begin with `#"""`. These cases can be distinguished by the presence or absence of additional `"`s later in the same line:

- In a single-line raw string literal, there must be a `"` and one or more `#`s later in the same line terminating the string.
- In a multi-line raw string literal, the rest of the line is a file type indicator, which can contain neither `"` nor `#`.

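The two bullets above amount to a lookahead over the rest of the opening line. The Python sketch below shows one way a lexer might make that decision; the function name `classify_literal` and its return values are hypothetical, and it deliberately ignores escape sequences and does not validate the file type indicator beyond the two forbidden characters.

```python
def classify_literal(line: str) -> str:
    """Classify a literal from the text between its start and the line end.

    Returns "block", "single-line", or "invalid". This is a simplified
    sketch: escape sequences inside the literal are not considered.
    """
    n = len(line) - len(line.lstrip("#"))  # number of leading '#' characters
    rest = line[n:]
    if not rest.startswith('"'):
        return "invalid"
    if rest.startswith('"""'):
        indicator = rest[3:]
        # A block literal's opening line holds only a file type indicator,
        # which can contain neither '"' nor '#'.
        if '"' not in indicator and "#" not in indicator:
            return "block"
    # Otherwise a closing '"' followed by N '#'s must appear on this line.
    closer = '"' + "#" * n
    if closer in rest[1:]:
        return "single-line"
    return "invalid"


print(classify_literal('#"""This'))   # block literal, file type 'This'
print(classify_literal('#"""#;'))     # single-line literal containing one "
print(classify_literal('"""c++'))     # non-raw block literal
```

Checking the block form first matters for the non-raw case, where a bare `"""` at the end of a line would otherwise look like an empty string followed by a stray quote.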
```carbon
// This string is a single-line raw string literal.
// The contents of this string start and end with exactly two "s.
var String: ambig1 = #"""This is a raw string literal starting with """#;

// This string is a block raw string literal with file-type 'This',
// whose contents start with "is a ".
var String: ambig2 = #"""This
  is a block string literal with file type 'This', first character 'i',
  and last character 'X': X\#
  """#;

// This is a single-line raw string literal, equivalent to "\"".
var String: ambig3 = #"""#;
```

### Encoding

A string literal results in a sequence of 8-bit bytes. Like Carbon source files, string literals are encoded in UTF-8. This proposal includes no mechanism to request that any other encoding is used. The expectation is that if another encoding is needed, the string literal can be transcoded from UTF-8 during compilation. There is no guarantee that the string is valid UTF-8, however, because arbitrary byte sequences can be inserted by way of `\xHH` escape sequences.

This decision is _experimental_, and should be revisited if we find sufficient motivation for directly expressing string literals in other encodings. Similarly, as library support for a string type evolves, we should consider including string literal syntax (perhaps as the default) that guarantees the string content is a valid UTF-8 encoding, so that valid UTF-8 can be distinguished from an arbitrary string in the type system. In such string literals, we should consider rejecting `\xHH` escapes in which HH is greater than 7F₁₆, as in Rust.

## Alternatives considered

### Block string literals

We could avoid including a block string literal in general, and instead construct multi-line strings by string concatenation, with either C-style juxtaposition or with an explicit concatenation operator. But doing so would be more verbose and would make the expression of the source code be further from the programmer's intent.

We could use raw string literals to provide block string literal syntax, as C++ does. However, this couples two orthogonal choices: whether escape sequences should be recognized and whether the string is intended to span multiple lines. In C++ code, the inability to use escape sequences in multi-line string literals is sometimes awkward. For example:

```c++
std::string make_rule =
    "%s: %s\n\t$(CC) -c -o $@ $< $(CFLAGS)\n\n"
    "main:\n\t$(CC) %s -o %s\n";
```

can be written under this proposal as

```carbon
var String: make_rule = """make
  %s: %s
  \t$(CC) -c -o $@ $< $(CFLAGS)

  main:
  \t$(CC) %s -o %s
  """;
```

improving readability while still making the semantically-meaningful presence of tabs visible even in editors / code browsers that do not distinguish tabs from spaces.

#### Leading whitespace removal

Block string literals could use explicit characters in the body to indicate the amount of leading whitespace to be removed:

```carbon
var String: x = """
  |  starts with two spaces.
  """;
```

This would allow the correct indentation to be determined as soon as the first line after the opening `"""` is seen. However, this adds lexical complexity, and harms the ability to copy-paste string contents into other contexts.

#### Terminating newline

We could choose to exclude the trailing newline (like in Swift). Informal surveys suggest that expectations for whether to include or exclude the trailing newline vary.

The intended use case for block string literals is to represent multi-line strings. When forming a single larger string from concatenation of multi-line string literals, including the trailing newline but not the leading newline -- or, more generally, including a newline at the end of each line in the string -- provides the best alignment between the source-level syntax and the result.
For example:

```carbon
fn Run() {
  print("""c++
    class X {
     public:
      X() {}
    """);
  for (var String: decl in GetMemberDecls()) {
    print(decl);
  }
  print("""c++
     private:
    """);
  print(GetFields());
  print("""c++
    };
    """);
}
```

As is, the output printed by this example can be visualized by ignoring all lines other than those between the `"""`s. If we excluded the trailing newline, additional blank lines would be required at the end of each string literal, harming the readability of the example.

### Escape sequences

We could support octal escape sequences, as many C family languages do. However, they are considered antiquated in C++ code, and supporting them would be inconsistent with our decision to not support octal numeric literals. A quick informal poll suggests that many C++ programmers do not realize that `\123` is an octal escape sequence, not a decimal one.

We could support `\123` as a decimal escape sequence. However, doing so may lead to surprise when migrating C++ code to Carbon. This possibility should be revisited once Carbon matures and we have a better idea of how the migration process is expected to proceed.

We could allow arbitrary-length `\x` escape sequences, as C++ does, and include some explicit mechanism to terminate such a sequence. For example, we could treat `\<whitespace>` for an arbitrary whitespace character in the same way we treat `\<newline>`, and use `"\xAB\ C"` to terminate an escape sequence prematurely. However, this is unnecessarily inventive, and the Python approach of requiring exactly two hexadecimal characters after `\x` is adequate, assuming we do not intend to support string literal element types other than 8-bit bytes. If we do find we want to support wider element types in future (for example, if we want to add a UTF-16 or UTF-32 string literal), `\x{ABCD}` can be used.

We could permit `\<newline>` even in non-block string literals to allow them to be line-wrapped, but there seems to be little benefit to doing so, as a block string literal can always be used instead.

We could adopt Python's `\N{unicode character name}` syntax. But there is no pressing need to add such a syntax imminently, and concerns have been raised both over the exact ways in which characters are named and over compatibility with upcoming C++ language extensions in this area, so this syntax is not being proposed at this time.

We could allow an `\e` escape sequence for the U+001B ESCAPE character. This escape sequence is a common extension in C and C++ compilers, and appears to primarily be used to hardcode ANSI terminal escape sequences, such as `"\e[32mgreen text\e[0m"`. This proposal doesn't explicitly reject this idea. However, if we consider adopting such an extension in the future, we should consider whether a library facility for simple terminal operations would be a more valuable addition than this escape sequence.

We could retain the `\uNNNN` escape sequence to give a terser notation for the common case of a Unicode code point that is most naturally written as four hexadecimal digits -- that is, all code points in the Basic Multilingual Plane. This is the approach taken by JavaScript. However, following Swift and Rust in permitting only `\u{NNNN}` is simpler and avoids redundancy. We expect explicit `\u` escapes to be rare: we expect regular, printable Unicode characters that are normalized in NFC to be written directly in the source file, rather than spelled with `\u` escapes. Explicit `\u` escapes would be useful where the code point value is important -- for example, in test data -- or where special characters such as directionality markers or non-normalized characters are desired, but for such uses, the longer `\u{NNNN}` form seems adequate.

### Raw string literals

The approach to raw string literals in this proposal is based on Swift's raw strings facility.

We could use a mechanism other than a sequence of `#`s to support nesting raw string literals.
For example, we could adopt something like C++'s semi-arbitrary delimiters `R"foo(string contents)foo"`. However, this level of customizability seems unwarranted: raw string literals are unlikely to nest more than one or two levels deep, so using `#"..."#`, `##"..."##`, `###"..."###` for successive nesting levels seems unproblematic, and removes the need for the programmer to make an arbitrary choice.

We could use a delimiter other than `#` to demarcate raw string literals. At this stage in Carbon's development, we don't know exactly which characters will be useful in operators, but it seems reasonable to assume a mostly C++-like operator set, which gives us a variety of characters that cannot appear immediately before a string literal: at least `@` `#` `$` `)` `]` `}` `\` `.` all appear likely to be available. Of these, `\` is likely to be problematic due to its use in escape sequences, closing brackets followed by string literals might one day be useful in some grammar constructs, and `.` seems a little too close to resembling a designator. That leaves `@`, `#`, and `$`, and multiple existing languages have used `#` for this purpose.

We could disallow use of _N_ `#`s as a delimiter if a lower value of _N_ would work. However, this would make the language brittle under maintenance: removing the last nested string literal from a quoted block of code would require changing the delimiters.

We could use Rust-style raw strings, which add a leading `r`, permit zero `#`s to be used in raw strings, and do not provide a facility for escape sequences in raw strings. There are several reasons to prefer Swift-style raw strings:

- Swift raw strings are not a distinct language feature; rather, they are a generalization of non-raw strings -- non-raw strings are simply the case where the number of `#` characters in the delimiters and escape sequence introducer is zero.
- Permitting escape sequences even in raw strings means that there is no loss of functionality when using a raw string, and code changes under maintenance that would require use of a facility only available by way of an escape sequence (such as inclusion of trailing whitespace in a block string literal) do not force a reversion to a non-raw string literal.

There are also several reasons to prefer the Rust-style approach:

- In the most common case, a Rust raw string will be one character shorter.
- Less lexical space is used: Swift-style raw strings remove the possibility of using `#` as a prefix operator, whereas the leading `r` in Rust-style raw strings does not.
- Certain character sequences are hard to express in Swift-style raw strings. Specifically, string literals such as `"\\################"` cannot readily be expressed as a raw string. Such string literals are extremely rare, but not nonexistent, in one large sample C++ corpus.

If the final issue is concerning, we have a path to address it, by specifying that `\` followed by N+1 or more `#`s is left alone, just like `\` followed by N-1 or fewer `#`s is left alone. Formally, this can be accomplished by defining `\#` as an escape sequence that expands to itself, that is, to a backslash followed by N+1 `#` characters.

#### Trailing whitespace

We could preserve trailing whitespace in at least raw block string literals, and perhaps in all block string literals. However, this would mean that visually identical programs could have different meanings, and even that transformations performed automatically on save by some editors (removing trailing whitespace) could change the meaning of a program. It might also mean that raw string literals are no longer a generalization of non-raw string literals.

Under this proposal, trailing whitespace can be included in a block string literal by following it with `\n\`:

```
var String: authors = """markdown
  *Authors*: \n\
  Me \n\
  Someone Else
  """;
```

In a single-`#` raw string literal, the same can be accomplished with the more-verbose terminator `\#n\#`, and so on.

#### Line separators

Raw block string literals could preserve the form of vertical whitespace used to terminate each line. This would allow uncommon forms of vertical whitespace (for example, vertical tab and form feed) to be included in raw string literals, but would create a risk that the meaning of a program would be different when the source code is checked out on an operating system that uses line feed as a line terminator versus when the source code is checked out on an operating system that uses a different line terminator (such as carriage return followed by line feed). This would also mean that raw string literals are no longer a generalization of non-raw string literals.

### Internal whitespace

We could allow raw tab characters in string literals. However, raw tab characters harm the readability of the program, and we would like to encourage the use of `\t` escapes instead in situations where they are available, even if this means that the more verbose form `\#t` needs to be used in raw string literals.

## Rationale

This proposal supports the goal of making Carbon code [easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write), by ensuring that essentially every kind of string content can be represented in a Carbon string literal, in a way that is natural, toolable, and easy to read:

- Multi-line strings are supported by multi-line string literals, and the rules for stripping leading indentation enhance readability by allowing those literals to avoid visually disrupting the indentation structure of the code.
- Strings that make extensive use of `\` and `"` are supported by raw string literals.
- Treating raw versus ordinary and single-line versus multi-line as orthogonal allows Carbon to support all 4 combinations while keeping the language simple.
- The handling of `\#` within raw string literals makes it possible to use escape sequences within raw string literals when necessary, for example to embed arbitrary byte values or Unicode data. This ensures that the programmer is never prevented from using a raw string literal, or forced to assemble a single logical string by concatenating ordinary and raw literals (with the negligible and fixable exception of strings like `"\\################"`, as noted in the proposal).
- "File type indicators" make it easier for tooling to understand the contents of literals, in order to provide features such as syntax highlighting, automated formatting, and potentially even certain kinds of static analysis, for code that's embedded in string literals.
- Support for non-Unicode strings by way of `\x` ensures "support for software outside the primary use case".
- Avoids unnecessary invention, following Rust and particularly Swift.