# String literals
[Pull request](https://github.com/carbon-language/carbon-lang/pull/199)
## Table of contents
- [Problem](#problem)
- [Background](#background)
- [Existing practice](#existing-practice)
- [Proposal](#proposal)
- [Details](#details)
- [Non-raw string literals](#non-raw-string-literals)
- [Escape sequences](#escape-sequences)
- [Raw string literals](#raw-string-literals)
- [Encoding](#encoding)
- [Alternatives considered](#alternatives-considered)
- [Block string literals](#block-string-literals)
- [Leading whitespace removal](#leading-whitespace-removal)
- [Terminating newline](#terminating-newline)
- [Escape sequences](#escape-sequences-1)
- [Raw string literals](#raw-string-literals-1)
- [Trailing whitespace](#trailing-whitespace)
- [Line separators](#line-separators)
- [Internal whitespace](#internal-whitespace)
- [Rationale](#rationale)
## Problem
This proposal specifies lexical rules for constant strings in Carbon.
## Background
We wish to provide a syntax for writing literals containing human-readable text.
Note that "human-readable text" here should be understood broadly: such text may
be subject to further processing, and may in some cases be intended to be
interpreted by a computer rather than by a human (such as a regular expression,
program source code, or a C++ mangled name), but broadly represents a sequence
of characters rather than arbitrary binary data.
Such text is typically represented in an _encoding_, which is a bidirectional
mapping between a sequence of characters in text and a sequence of bounded
integer values known as _code units_, suitable for storage, transmission, and
processing. For example, the Russian word углерод (carbon) is encoded in the
UTF-8 encoding as D1168316 D016B316
D016BB16 D016B516
D1168016 D016BE16
D016B416.
### Existing practice
See
[Comparison of programming languages (strings) on Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28strings%29).
Simple string literals are specified in most programming languages as text
delimited by double-quote characters, `"like this"`. Such string literals
usually are restricted to begin and end on the same source line. Three
additional features are commonly seen:
- Escape sequences, which permit string literals to include characters that
are difficult to type, are ambiguous for the reader, or that would be
problematic in some way (such as whitespace characters, characters that are
invisible, and characters that change how other characters are rendered),
and also to include arbitrary code units. One common convention is to use
`\` to introduce an escape sequence, where, for example:
- `\n` represents a newline character,
- `\u1234` represents the Unicode character U+1234,
- `\xAB` represents the code unit AB16,
- `\"` represents a single `"` character and does not terminate the string
literal,
- `\\` represents a single `\` character,
and so on.
The set of single-letter escape sequences has a lot of commonality between
languages, with some variation between older and newer languages:
- C++ and Python allow `\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v` for bell,
backspace, form feed, new line, carriage return, tab, and vertical tab,
respectively.
- JavaScript drops support for `\a`.
- Java additionally drops support for `\v`.
- Rust and Swift additionally drop support for `\b` and `\f`, leaving only
`\n`, `\r`, and `\t`.
The rules for numeric escape sequences differ between kinds of escape
sequence and between languages. The rules in C++, JavaScript, Python, Rust,
and Swift are as follows:
- `\123` is interpreted as a octal code unit value, and up to three octal
digits are consumed. In JavaScript and C++, values greater than
3778 (25510) are invalid (assuming an 8-bit
character type). In Python, values greater than 3778 are
interpreted modulo 256. Rust and Swift do not allow octal escapes in
general, but do allow `\0` as a special case.
- `\xAB` is interpreted as a hexadecimal code unit value. In C++, any
nonzero number of hexadecimal digits can follow as part of the escape
sequence. In Python, JavaScript, and Rust, exactly two digits are
required. In Rust, the value must be less than or equal to
7F16 except for `b`-prefixed strings (byte array literals).
Swift does not support this form of escape sequence.
- `\uABCD` is interpreted as a hexadecimal code point value. In C++,
Python, and JavaScript, exactly four hexadecimal digits can follow, but
JavaScript allows any nonzero number of digits to be specified using
`\u{ABCDE}` notation. Rust and Swift support only the `\u{ABCDE}`
notation.
- `\U0010FFFD` is interpreted as a hexadecimal code point value in C++ and
Python, but not in JavaScript, Rust, or Swift, and permits exactly eight
hexadecimal digits.
- Raw string literals, in which escape sequences are not recognized. These are
often used in situations where escape sequences are undesirable, but in
which the escape character or regular string terminator is used frequently.
Such literals are useful when embedding one machine-readable language in
another, when those languages share some escaping conventions. Such
functionality may also provide a way to customize the string delimiters.
- In Python, raw string literals have an `r` prefix: `r"li\ngo"` is a six
character string whose third character is `\`.
- In C++, raw string literals have an `r` prefix, along with a custom
delimiter (which may be empty): `r"DELIM(li\ngo)DELIM"` is a six
character string (plus a nul terminator).
- In Rust, raw string literals have an `r` prefix and any matching number
of `#` characters enclose the string contents: the third character of
`r"lo\ng"` is `\`, and the second character of `r#" " "#` is `"`.
- In Swift, raw string literals are a generalization of regular string
literals: literals are introduced by any number, _N_, of `#` characters
followed by a `"`, terminated by `"` followed by _N_ `#` characters, and
escape sequences are introduced by a `\` followed by _N_ `#` characters:
`#" " # \n \#n \#\# "#` has the same contents as the C++ string literal
`" \" # \\n \n \\# "`.
- In Java, raw string literals are delimited by a sequence of one or more
backticks instead of double quotes: the fourth character of
\`\`foo\`\bar\`\` is a backtick and the fifth is a
backslash.
- Multiline string literals provide a mechanism for a string to easily span
more than one line of source text.
- In C++, Rust, and Java, raw string literals are used to represent
multiline string literals.
- In Python, different delimiters (`"""` or, in Python, `'''` instead of
`"` or `'`) are used to represent multiline string literals, and plain
`"` and `""` can thereby appear in the string contents, but these
literals otherwise behave the same as regular string literals.
- In Swift, different delimiters are used (`"""` instead of `"`), but
unlike in Python, the string content cannot be on the same line as the
delimiters, and the resulting mandatory leading and trailing newlines
are not included in the string. Internal newlines can be removed by
preceding them with backslashes.
- In JavaScript, backtick-delimited strings can contain newlines; this
syntax also allows string interpolation as described below.
In addition, some languages, primarily scripting languages, also provide a
mechanism for string interpolation, wherein a string value is formed by
including the formatted values of some variables in a given format string. For
example, `"Hello, $planet."` might produce a string value including the
formatted value of the variable named `planet`. Such interpolation facilities
are outside the scope of this proposal.
## Proposal
- Single-line string literals are delimited by `"`s: `"hello"`
- Multi-line string literals are introduced by a `"""` followed by a newline
and terminated by a line beginning with a `"""`. The indentation of the
terminating line is removed from all preceding lines:
```carbon
var String: henry_vi = """
The winds grow high; so do your stomachs, lords.
How irksome is this music to my heart!
When such strings jar, what hope of harmony?
I pray, my lords, let me compound this strife.
-- History of Henry VI, Part II, Act II, Scene 1, W. Shakespeare
""";
```
Only the final line of this string literal begins with whitespace. The
opening newline is not part of the string's contents, but the trailing
newline is; the first character of this example string is `T` and the last
character is a newline.
- The opening `"""` of a multi-line string literal can be followed by a _file
type indicator_, to assist tooling in understanding the intent of the
string. This indicator has no effect on the meaning of the program.
```carbon
var String: cpp_snippet = """c++
#include
int main() {
std::cout << "hello world" << std::endl;
}
""";
```
- Escape sequences are introduced with a `\` character; the most common C and
C++ escape sequences are supported: `"hello\nworld"`. Octal escapes
(`\177`), `\a`, `\b`, `\f` and `\v` are removed. `\uNNNN` and `\U00NNNNNN`
are replaced by `\u{NNNNNN}`. An escape sequence `\` is permitted
in multi-line string literals, and results in no string contents.
- Raw string literals are supported, for both the single and multi-line case,
following the Swift convention: they are introduced by prefixing the opening
delimiter with one or more `#`s, and suffixing the closing delimiter with a
matching number of `#`s: `#"foo\s*bar"#` or `#"foo"bar"#`. Escape sequences
can be introduced in a raw string literal by inserting a matching number of
`#`s after the `\` character: `#"foo\#nbar"#` contains a newline character.
- Unlike in C and C++, adjacent string literals are not implicitly
concatenated.
## Details
### Non-raw string literals
A _simple string literal_ is formed of a sequence of
- characters other than backslashes, double quotation marks, and vertical
whitespace
- [escape sequences](#escape-sequences)
enclosed in double quotation marks (`"`). Each escape sequence is replaced with
the corresponding character sequence or code unit sequence.
```carbon
var String: lucius = "The strings, my lord, are false.";
```
A _block string literal_ starts with three double quotation marks, followed by
an optional file type indicator, followed by a newline, and ends at the next
instance of three double quotation marks whose first `"` is not part of a `\"`
escape sequence. The closing `"""` shall be the first non-whitespace characters
on that line. The lines between the opening line and the closing line
(exclusive) are _content lines_. The content lines shall not contain `\`
characters that do not form part of an escape sequence.
The _indentation_ of a block string literal is the sequence of horizontal
whitespace preceding the closing `"""`. Each non-empty content line shall begin
with the indentation of the string literal. The content of the literal is formed
as follows:
- The indentation of the closing line is removed from each non-empty content
line.
- All trailing whitespace on each line, including the line terminator, is
replaced with a single line feed (U+000A) character.
- The resulting lines are concatenated.
- Each [escape sequence](#escape-sequences) is replaced with the corresponding
character sequence or code unit sequence.
A content line is considered empty if it contains only whitespace characters.
```carbon
var String: w = """
This is a string literal. Its first character is 'T' and its last character is
a newline character. It contains another newline between 'is' and 'a'.
""";
// This string literal is invalid because the """ after 'closing' terminates
// the literal, but is not at the start of the line.
var String: invalid = """
error: closing """ is not on its own line.
""";
```
A _file type indicator_ is any sequence of non-whitespace characters other than
`"` or `#`. The file type indicator has no semantic meaning to the Carbon
compiler, but some file type indicators are understood by the language tooling
(for example, syntax highlighter, code formatter) as indicating the structure of
the string literal's content.
```carbon
// This is a block string literal. Its first two characters are spaces, and its
// last character is a line feed. It has a file type of 'c++'.
var String: starts_with_whitespace = """c++
int x = 1; // This line starts with two spaces.
int y = 2; // This line starts with two spaces.
""";
```
The file type indicator might contain semantic information beyond the file type
itself, such as instructions to the code formatter to disable formatting for the
code block.
**Open question:** This proposal does not suggest any concrete set of recognized
file type indicators. It would be useful to informally specify a set of
well-known indicators, so that tools have a common understanding of what those
indicators mean, perhaps in a best practices guide.
#### Escape sequences
Within a string literal, the following escape sequences are recognized:
| Escape | Meaning |
| ------------- | -------------------------------------------------------- |
| `\t` | U+0009 CHARACTER TABULATION |
| `\n` | U+000A LINE FEED |
| `\r` | U+000D CARRIAGE RETURN |
| `\"` | U+0022 QUOTATION MARK (`"`) |
| `\'` | U+0027 APOSTROPHE (`'`) |
| `\\` | U+005C REVERSE SOLIDUS (`\`) |
| `\0` | Code unit with value 0 |
| `\xHH` | Code unit with value HH16 |
| `\u{HHHH...}` | Unicode code point U+HHHH... |
| `\` | No string literal content produced (block literals only) |
This includes all C++ escape sequences except:
- `\?`, which was historically used to escape trigraphs in string literals,
and no longer serves any purpose.
- `\ooo` octal escapes, which are removed because Carbon does not support
octal literals; `\0` is retained as a special case, which is expected to be
important for C interoperability.
- `\uABCD`, which is replaced by `\u{ABCD}`.
- `\U0010FFFF`, which is replaced by `\u{10FFFF}`.
- `\a` (bell), `\b` (backspace), `\v` (vertical tab), and `\f` (form feed).
`\a` and `\b` are obsolescent, and `\f` and `\v` are largely obsolete. These
characters can be expressed with `\x07`, `\x08`, `\x0B`, and `\x0C`
respectively if needed.
Note that this is the same set of escape sequences supported by
[Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html#ID295)
and [Rust](https://doc.rust-lang.org/reference/tokens.html), except that, unlike
in Swift, support for `\xHH` is provided.
While this proposal takes a firm stance on not permitting octal escape
sequences, the decision to not allow `\1`..`\7`, and more generally to not treat
`\DDDD` as a decimal escape sequence, is _experimental_.
In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
points can be expressed by number using `\u{10FFFF}` notation, which accepts any
number of hexadecimal characters. Any numeric code point in the ranges
016-D7FF16 or E00016-10FFFF16 can be
expressed this way.
_Open question:_ Some programming languages (notably Python) support a
`\N{unicode character name}` syntax. We could add such an escape sequence, but
this proposal does not include one. Future proposals considering adding such
support should pay attention to work done by C++'s Unicode study group in this
area.
The escape sequence `\0` shall not be followed by a decimal digit. In cases
where a null byte should be followed by a decimal digit, `\x00` can be used
instead: `"foo\x00123"`. The intent is to preserve the possibility of permitting
decimal escape sequences in the future.
A backslash followed by a line feed character is an escape sequence that
produces no string contents. This escape sequence is _experimental_, and can
only appear in multi-line string literals. This escape sequence is processed
after trailing whitespace is replaced by a line feed character, so a `\`
followed by horizontal whitespace followed by a line terminator removes the
whitespace up to and including the line terminator. Unlike in Rust, but like in
Swift, leading whitespace on the line after an escaped newline is not removed,
other than whitespace that matches the indentation of the terminating `"""`.
A character sequence starting with a backslash that doesn't match any known
escape sequence is invalid. Whitespace characters other than space and, for
block string literals, new line optionally preceded by carriage return are
disallowed. All other characters (including non-printable characters) are
preserved verbatim. Because all Carbon source files are required to be valid
sequences of Unicode characters, code unit sequences that are not valid UTF-8
can only be produced by `\x` escape sequences.
The choice to disallow raw tab characters in string literals is _experimental_.
```carbon
var String: fret = "I would 'twere something that would fret the string,\n" +
"The master-cord on's \u{2764}\u{FE0F}!";
// This string contains two characters (prior to encoding in UTF-8):
// U+1F3F9 (BOW AND ARROW) followed by U+0032 (DIGIT TWO)
var String: password = "\u{1F3F9}2";
// This string contains no newline characters.
var String: type_mismatch = """
Shall I compare thee to a summer's day? Thou art \
more lovely and more temperate.\
""";
var String: trailing_whitespace = """
This line ends in a space followed by a newline. \n\
This line starts with four spaces.
""";
```
### Raw string literals
In order to allow strings whose contents include backslashes and double quotes,
the delimiters of string literals can be customized by prefixing the opening
delimiter with _N_ `#` characters. A closing delimiter for such a string is only
recognized if it is followed by _N_ `#` characters, and similarly, escape
sequences in such string literals are recognized only if the `\` is also
followed by _N_ `#` characters. A `\`, `"`, or `"""` not followed by _N_ `#`
characters has no special meaning.
| Opening delimiter | Escape sequence introducer | Closing delimiter |
| ----------------- | ----------------------------- | ----------------- |
| `"` / `"""` | `\` (for example, `\n`) | `"` / `"""` |
| `#"` / `#"""` | `\#` (for example, `\#n`) | `"#` / `"""#` |
| `##"` / `##"""` | `\##` (for example, `\##n`) | `"##` / `"""##` |
| `###"` / `###"""` | `\###` (for example, `\###n`) | `"###` / `"""###` |
| ... | ... | ... |
For example:
```carbon
var String: x = #"""
This is the content of the string. The 'T' is the first character
of the string.
""" <-- This is not the end of the string.
"""#;
// But the preceding line does end the string.
// OK, final character is \
var String: y = #"Hello\"#;
var String: z = ##"Raw strings #"nesting"#"##;
var String: w = #"Tab is expressed as \t. Example: '\#t'"#;
```
Note that both a single-line raw string literal and a multi-line raw string
literal can begin with `#"""`. These cases can be distinguished by the presence
or absence of additional `"`s later in the same line:
- In a single-line raw string literal, there must be a `"` and one or more
`#`s later in the same line terminating the string.
- In a multi-line raw string literal, the rest of the line is a file type
indicator, which can contain neither `"` nor `#`.
```carbon
// This string is a single-line raw string literal.
// The contents of this string start and end with exactly two "s.
var String: ambig1 = #"""This is a raw string literal starting with """#;
// This string is a block raw string literal with file-type 'This',
// whose contents start with "is a ".
var String: ambig2 = #"""This
is a block string literal with file type 'This', first character 'i',
and last character 'X': X\#
"""#;
// This is a single-line raw string literal, equivalent to "\"".
var String: ambig3 = #"""#;
```
### Encoding
A string literal results in a sequence of 8-bit bytes. Like Carbon source files,
string literals are encoded in UTF-8. This proposal includes no mechanism to
request that any other encoding is used. The expectation is that if another
encoding is needed, the string literal can be transcoded from UTF-8 during
compilation. There is no guarantee that the string is valid UTF-8, however,
because arbitrary byte sequences can be inserted by way of `\xHH` escape
sequences.
This decision is _experimental_, and should be revisited if we find sufficient
motivation for directly expressing string literals in other encodings.
Similarly, as library support for a string type evolves, we should consider
including string literal syntax (perhaps as the default) that guarantees the
string content is a valid UTF-8 encoding, so that valid UTF-8 can be
distinguished from an arbitrary string in the type system. In such string
literals, we should consider rejecting `\xHH` escapes in which HH is greater
than 7F16, as in Rust.
## Alternatives considered
### Block string literals
We could avoid including a block string literal in general, and instead
construct multi-line strings by string concatenation, with either C-style
juxtaposition or with an explicit concatenation operator. But doing so would be
more verbose and would make the expression of the source code be further from
the programmer's intent.
We could use raw string literals to provide block string literal syntax, as C++
does. However, this couples two orthogonal choices: whether escape sequences
should be recognized and whether the string is intended to span multiple lines.
In C++ code, the inability to use escape sequences in multi-line string literals
sometimes awkward. For example:
```c++
std::string make_rule = "%s: %s\n\t$(CC) -c -o $@ $< $(CFLAGS)\n\n"
"main:\n\t$(CC) %s -o %s\n";
```
can be written under this proposal as
```carbon
var String: make_rule = """make
%s: %s
\t$(CC) -c -o $@ $< $(CFLAGS)
main:
\t$(CC) %s -o %s
""";
```
improving readability while still making the semantically-meaningful presence of
tabs visible even in editors / code browsers that do not distinguish tabs from
spaces.
#### Leading whitespace removal
Block string literals could use explicit characters in the body to indicate the
amount of leading whitespace to be removed:
```carbon
var String: x = """
| starts with two spaces.
""";
```
This would allow the correct indentation to be determined as soon as the first
line after the opening `"""` is seen. However, this adds lexical complexity, and
harms the ability to copy-paste string contents into other contexts.
#### Terminating newline
We could choose to exclude the trailing newline (like in Swift). Informal
surveys suggest that expectations for whether to include or exclude the trailing
newline vary.
The intended use case for block string literals is to represent multi-line
strings. When forming a single larger string from concatenation of multi-line
string literals, including the trailing newline but not the leading newline --
or, more generally, including a newline at the end of each line in the string --
provides the best alignment between the source-level syntax and the result. For
example:
```carbon
fn Run() {
print("""c++
class X {
public:
X() {}
""");
for (var String: decl in GetMemberDecls()) {
print decl;
}
print("""c++
private:
""");
print(GetFields());
print("""c++
};
""");
}
```
As is, the output printed by this example can be visualized by ignoring all
lines other than those between the `"""`s. If we excluded the trailing newline,
additional blank lines would be required at the end of each string literal,
harming the readability of the example.
### Escape sequences
We could support octal escape sequences, as many C family languages do. However,
they are considered antiquated in C++ code, and supporting them would be
inconsistent with our decision to not support octal numeric literals. A quick
informal poll suggests that many C++ programmers do not realize that `\123` is
an octal escape sequence, not a decimal one.
We could support `\123` as a decimal escape sequence. However, doing so may lead
to surprise when migrating C++ code to Carbon. This possibility should be
revisited once Carbon matures and we have a better idea of how the migration
process is expected to proceed.
We could allow arbitrary-length `\x` escape sequences, as C++ does, and include
some explicit mechanism to terminate such a sequence. For example, we could
treat `\` for an arbitrary whitespace character in the same way we
treat `\`, and use `"\xAB\ C"` to terminate an escape sequence
prematurely. However, this is unnecessarily inventive, and the Python approach
of requiring exactly two hexadecimal characters after `\x` is adequate, assuming
we do not intend to support string literal element types other than 8-bit bytes.
If we do find we want to support wider element types in future (for example, if
we want to add a UTF-16 or UTF-32 string literal), `\x{ABCD}` can be used.
We could permit `\` even in non-block string literals to allow them to
be line-wrapped, but there seems to be little benefit to doing so, as a block
string literal can always be used instead.
We could adopt Python's `\N{unicode character name}` syntax. But there is no
pressing need to add such a syntax imminently, and concerns have been raised
both over the exact ways in which characters are named and over compatibility
with upcoming C++ language extensions in this area, so this syntax is not being
proposed at this time.
We could allow an `\e` escape sequence for the U+001C ESCAPE character. This
character is a common extension in C and C++ compilers, and appears to primarily
be used to hardcode ANSI terminal escape sequences, such as
`"\e[32mgreen text\e[0m"`. This proposal doesn't explicitly reject this idea.
However, if we consider adopting such an extension in the future, we should
consider whether a library facility for simple terminal operations would be a
more valuable addition than this escape sequence.
We could retain the `\uNNNN` escape sequence to give a terser notation for the
common case of a Unicode code point that is most naturally written as four
hexadecimal digits -- that is, all code points in the Basic Multilingual Plane.
This is the approach taken by JavaScript. However, following Swift and Rust in
permitting only `\u{NNNN}` is simpler and avoids redundancy. We expect explicit
`\u` escapes to be rare: we expect regular, printable Unicode characters that
are normalized in NFC to be written directly in the source file, rather than
spelled with `\u` escapes. Explicit `\u` escapes would be useful where the code
point value is important -- for example, in test data -- or where special
characters such as directionality markers or non-normalized characters are
desired, but for such uses, the longer `\u{NNNN}` form seems adequate.
### Raw string literals
The approach to raw string literals in this proposal is based on Swift's raw
strings facility.
We could use a different mechanism other than a sequence of `#`s to support
nesting raw string literals. For example, we could adopt something like C++'s
semi-arbitrary delimiters `R"foo(string contents)foo"`. However, this level of
customizability seems unwarranted: raw string literals are unlikely to nest more
than one or two levels deep, so using `#"..."#`, `##"..."##`, `###"..."###` for
successive nesting levels seems unproblematic, and removes the need for the
programmer to make an arbitrary choice.
We could use a delimiter other than `#` to demarcate raw string literals. At
this stage in Carbon's development, we don't know exactly which characters will
be useful in operators, but it seems reasonable to assume a mostly C++-like
operator set, which gives us a variety of characters that cannot appear
immediately before a string literal: at least `@` `#` `$` `)` `]` `}` `\` `.`
all appear likely to be available. Of these, `\` is likely to be problematic due
to its use in escape sequences, closing brackets followed by string literals
might one day be useful in some grammar constructs, and `.` seems a little too
close to resembling designator. That leaves `@`, `#`, and `$`, and multiple
existing languages have used `#` for this purpose.
We could disallow use of _N_ `#`s as a delimiter if a lower value of _N_ would
work. However, this would make the language brittle under maintenance: removing
the last nested string literal from a quoted block of code would require
changing the delimiters.
We could use Rust-style raw strings, which add a leading `r`, permit zero `#`s
to be used in raw strings, and do not provide a facility for escape sequences in
raw strings. There are several reasons to prefer Swift-style raw strings:
- Swift raw strings are not a distinct language feature; rather, they are a
generalization of non-raw strings -- non-raw strings are simply the case
where the number of `#` characters in the delimiters and escape sequence
intorducer is zero.
- Permitting escape sequences even in raw strings means that there is no loss
of functionality when using a raw string, and code changes under maintenance
that would require use of a facility only available by way of an escape
sequence (such as inclusion of trailing whitespace in a block string
literal) do not force a reversion to a non-raw string literal.
There are also several reasons to prefer the Rust-style approach:
- In the most common case, a Rust raw string will be one character shorter.
- Less lexical space is used: Swift-style raw strings remove the possibility
of using `#` as a prefix operator, whereas the leading `r` in Rust-style raw
strings does not.
- Certain character sequences are hard to express in Swift-style raw strings.
Specifically, string literals such as `"\\################"` cannot readily
be expressed as a raw string. Such string literals are extremely rare, but
not nonexistent, in one large sample C++ corpus.
If the final issue is concerning, we have a path to address it, by specifying
that `\` followed by N+1 or more `#`s is left alone, just like `\`
followed by N-1 or fewer `#`s is left alone. Formally, this can be
accomplished by defining `\#` as an escape sequence that expands to itself, that
is, to a backslash followed by N+1 `#` characters.
#### Trailing whitespace
We could preserve trailing whitespace in at least raw block string literals, and
perhaps in all block string literals. However, this would mean that visually
identical programs could have different meanings, and even that transformations
performed automatically on save by some editors (removing trailing whitespace)
could change the meaning of a program. It might also mean that raw string
literals are no longer a generalization of non-raw string literals.
Under this proposal, trailing whitespace can be included in a block string
literal by following it with `\n\`:
```
var String: authors = """markdown
*Authors*: \n\
Me \n\
Someone Else
""";
```
In a single-`#` raw string literal, the same can be accomplished with the
more-verbose terminator `\#n\#`, and so on.
#### Line separators
Raw block string literals could preserve the form of vertical whitespace used to
terminate each line. This would allow uncommon forms of vertical whitespace (for
example, vertical tab and form feed) to be included in raw string literals, but
would create a risk that the meaning of a program would be different when the
source code is checked out on an operating system that uses line feed as a line
terminator versus when the source code is checked out on an operating system
that uses a different line terminator (such as carriage return followed by line
feed). This would also mean that raw string literals are no longer a
generalization of non-raw string literals.
### Internal whitespace
We could allow raw tab characters in string literals. However, raw tab
characters harm the readability of the program, and we would like to encourage
the use of `\t` escapes instead in situations where they are available, even if
this means that the more verbose form `\#t` needs to be used in raw string
literals.
## Rationale
This proposal supports the goal of making Carbon code
[easy to read, understand, and write](/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write),
by ensuring that essentially every kind of string content can be represented in
a Carbon string literal, in a way that is natural, toolable, and easy to read:
- Multi-line strings are supported by multi-line string literals, and the
rules for stripping leading indentation enhance readability by allowing
those literals to avoid visually disrupting the indentation structure of the
code.
- Strings that make extensive use of `\` and `"` are supported by raw string
literals.
- Treating raw versus ordinary and single-line versus multi-line as orthogonal
allows Carbon to support all 4 combinations while keeping the language
simple.
- The handling of `\#` within raw string literals makes it possible to use
escape sequences within raw string literals when necessary, for example to
embed arbitrary byte values or Unicode data. This ensures that the
programmer is never prevented from using a raw string literal, or forced to
assemble a single logical string by concatenating ordinary and raw literals
(with the negligible and fixable exception of strings like
`"\\################"`, as noted in the proposal).
- "File type indicators" make it easier for tooling to understand the contents
of literals, in order to provide features such as syntax highlighting,
automated formatting, and potentially even certain kinds of static analysis,
for code that's embedded in string literals.
- Support for non-Unicode strings by way of `\x` ensures "support for software
outside the primary use case".
- Avoids unnecessary invention, following Rust and particularly Swift.