瀏覽代碼

String literals proposal (#199)

Proposal as accepted by core team on 2021-02-23.
Richard Smith 5 年之前
父節點
當前提交
ec6fde4d61
共有 2 個文件被更改,包括 724 次插入0 次删除
  1. 1 0
      proposals/README.md
  2. 723 0
      proposals/p0199.md

+ 1 - 0
proposals/README.md

@@ -57,5 +57,6 @@ request:
     -   [0175 - Decision](p0175_decision.md)
 -   [0179 - Create a toolchain team.](p0179.md)
     -   [0179 - Decision](p0179_decision.md)
+-   [0199 - String literals](p0199.md)
 
 <!-- endproposals -->

+ 723 - 0
proposals/p0199.md

@@ -0,0 +1,723 @@
+# String literals
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/199)
+
+<!-- toc -->
+
+## Table of contents
+
+-   [Problem](#problem)
+-   [Background](#background)
+    -   [Existing practice](#existing-practice)
+-   [Proposal](#proposal)
+-   [Details](#details)
+    -   [Non-raw string literals](#non-raw-string-literals)
+        -   [Escape sequences](#escape-sequences)
+    -   [Raw string literals](#raw-string-literals)
+    -   [Encoding](#encoding)
+-   [Alternatives considered](#alternatives-considered)
+    -   [Block string literals](#block-string-literals)
+        -   [Leading whitespace removal](#leading-whitespace-removal)
+        -   [Terminating newline](#terminating-newline)
+    -   [Escape sequences](#escape-sequences-1)
+    -   [Raw string literals](#raw-string-literals-1)
+        -   [Trailing whitespace](#trailing-whitespace)
+        -   [Line separators](#line-separators)
+    -   [Internal whitespace](#internal-whitespace)
+
+<!-- tocstop -->
+
+## Problem
+
+This proposal specifies lexical rules for constant strings in Carbon.
+
+## Background
+
+We wish to provide a syntax for writing literals containing human-readable text.
+
+Note that "human-readable text" here should be understood broadly: such text may
+be subject to further processing, and may in some cases be intended to be
+interpreted by a computer rather than by a human (such as a regular expression,
+program source code, or a C++ mangled name), but broadly represents a sequence
+of characters rather than arbitrary binary data.
+
+Such text is typically represented in an _encoding_, which is a bidirectional
+mapping between a sequence of characters in text and a sequence of bounded
+integer values known as _code units_, suitable for storage, transmission, and
+processing. For example, the Russian word углерод (carbon) is encoded in the
+UTF-8 encoding as D1<sub>16</sub>83<sub>16 </sub>D0<sub>16</sub>B3<sub>16
+</sub>D0<sub>16</sub>BB<sub>16</sub> D0<sub>16</sub>B5<sub>16</sub>
+D1<sub>16</sub>80<sub>16</sub> D0<sub>16</sub>BE<sub>16</sub>
+D0<sub>16</sub>B4<sub>16</sub>.
+
+### Existing practice
+
+See
+[Comparison of programming languages (strings) on Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28strings%29).
+
+Simple string literals are specified in most programming languages as text
+delimited by double-quote characters, `"like this"`. Such string literals
+usually are restricted to begin and end on the same source line. Three
+additional features are commonly seen:
+
+-   Escape sequences, which permit string literals to include characters that
+    are difficult to type, are ambiguous for the reader, or that would be
+    problematic in some way (such as whitespace characters, characters that are
+    invisible, and characters that change how other characters are rendered),
+    and also to include arbitrary code units. One common convention is to use
+    `\` to introduce an escape sequence, where, for example:
+
+    -   `\n` represents a newline character,
+    -   `\u1234` represents the Unicode character U+1234,
+    -   `\xAB` represents the code unit AB<sub>16</sub>,
+    -   `\"` represents a single `"` character and does not terminate the string
+        literal,
+    -   `\\` represents a single `\` character,
+
+    and so on.
+
+    The set of single-letter escape sequences has a lot of commonality between
+    languages, with some variation between older and newer languages:
+
+    -   C++ and Python allow `\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v` for bell,
+        backspace, form feed, new line, carriage return, tab, and vertical tab,
+        respectively.
+    -   JavaScript drops support for `\a`.
+    -   Java additionally drops support for `\v`.
+    -   Rust and Swift additionally drop support for `\b` and `\f`, leaving only
+        `\n`, `\r`, and `\t`.
+
+    The rules for numeric escape sequences differ between kinds of escape
+    sequence and between languages. The rules in C++, JavaScript, Python, Rust,
+    and Swift are as follows:
+
+    -   `\123` is interpreted as a octal code unit value, and up to three octal
+        digits are consumed. In JavaScript and C++, values greater than
+        377<sub>8</sub> (255<sub>10</sub>) are invalid (assuming an 8-bit
+        character type). In Python, values greater than 377<sub>8</sub> are
+        interpreted modulo 256. Rust and Swift do not allow octal escapes in
+        general, but do allow `\0` as a special case.
+    -   `\xAB` is interpreted as a hexadecimal code unit value. In C++, any
+        nonzero number of hexadecimal digits can follow as part of the escape
+        sequence. In Python, JavaScript, and Rust, exactly two digits are
+        required. In Rust, the value must be less than or equal to
+        7F<sub>16</sub> except for `b`-prefixed strings (byte array literals).
+        Swift does not support this form of escape sequence.
+    -   `\uABCD` is interpreted as a hexadecimal code point value. In C++,
+        Python, and JavaScript, exactly four hexadecimal digits can follow, but
+        JavaScript allows any nonzero number of digits to be specified using
+        `\u{ABCDE}` notation. Rust and Swift support only the `\u{ABCDE}`
+        notation.
+    -   `\U0010FFFD` is interpreted as a hexadecimal code point value in C++ and
+        Python, but not in JavaScript, Rust, or Swift, and permits exactly eight
+        hexadecimal digits.
+
+-   Raw string literals, in which escape sequences are not recognized. These are
+    often used in situations where escape sequences are undesirable, but in
+    which the escape character or regular string terminator is used frequently.
+    Such literals are useful when embedding one machine-readable language in
+    another, when those languages share some escaping conventions. Such
+    functionality may also provide a way to customize the string delimiters.
+
+    -   In Python, raw string literals have an `r` prefix: `r"li\ngo"` is a six
+        character string whose third character is `\`.
+    -   In C++, raw string literals have an `r` prefix, along with a custom
+        delimiter (which may be empty): `r"DELIM(li\ngo)DELIM"` is a six
+        character string (plus a nul terminator).
+    -   In Rust, raw string literals have an `r` prefix and any matching number
+        of `#` characters enclose the string contents: the third character of
+        `r"lo\ng"` is `\`, and the second character of `r#" " "#` is `"`.
+    -   In Swift, raw string literals are a generalization of regular string
+        literals: literals are introduced by any number, _N_, of `#` characters
+        followed by a `"`, terminated by `"` followed by _N_ `#` characters, and
+        escape sequences are introduced by a `\` followed by _N_ `#` characters:
+        `#" " # \n \#n \#\# "#` has the same contents as the C++ string literal
+        `" \" # \\n \n \\# "`.
+    -   In Java, raw string literals are delimited by a sequence of one or more
+        backticks instead of double quotes: the fourth character of
+        <code>\`\`foo\`\bar\`\`</code> is a backtick and the fifth is a
+        backslash.
+
+-   Multiline string literals provide a mechanism for a string to easily span
+    more than one line of source text.
+
+    -   In C++, Rust, and Java, raw string literals are used to represent
+        multiline string literals.
+    -   In Python, different delimiters (`"""` or, in Python, `'''` instead of
+        `"` or `'`) are used to represent multiline string literals, and plain
+        `"` and `""` can thereby appear in the string contents, but these
+        literals otherwise behave the same as regular string literals.
+    -   In Swift, different delimiters are used (`"""` instead of `"`), but
+        unlike in Python, the string content cannot be on the same line as the
+        delimiters, and the resulting mandatory leading and trailing newlines
+        are not included in the string. Internal newlines can be removed by
+        preceding them with backslashes.
+    -   In JavaScript, backtick-delimited strings can contain newlines; this
+        syntax also allows string interpolation as described below.
+
+In addition, some languages, primarily scripting languages, also provide a
+mechanism for string interpolation, wherein a string value is formed by
+including the formatted values of some variables in a given format string. For
+example, `"Hello, $planet."` might produce a string value including the
+formatted value of the variable named `planet`. Such interpolation facilities
+are outside the scope of this proposal.
+
+## Proposal
+
+-   Single-line string literals are delimited by `"`s: `"hello"`
+
+-   Multi-line string literals are introduced by a `"""` followed by a newline
+    and terminated by a line beginning with a `"""`. The indentation of the
+    terminating line is removed from all preceding lines:
+
+    ```carbon
+    var String: henry_vi = """
+      The winds grow high; so do your stomachs, lords.
+      How irksome is this music to my heart!
+      When such strings jar, what hope of harmony?
+      I pray, my lords, let me compound this strife.
+          -- History of Henry VI, Part II, Act II, Scene 1, W. Shakespeare
+      """;
+    ```
+
+    Only the final line of this string literal begins with whitespace. The
+    opening newline is not part of the string's contents, but the trailing
+    newline is; the first character of this example string is `T` and the last
+    character is a newline.
+
+-   The opening `"""` of a multi-line string literal can be followed by a _file
+    type indicator_, to assist tooling in understanding the intent of the
+    string. This indicator has no effect on the meaning of the program.
+
+    ```carbon
+    var String: cpp_snippet = """c++
+      #include <iostream>
+
+      int main() {
+        std::cout << "hello world" << std::endl;
+      }
+      """;
+    ```
+
+-   Escape sequences are introduced with a `\` character; the most common C and
+    C++ escape sequences are supported: `"hello\nworld"`. Octal escapes
+    (`\177`), `\a`, `\b`, `\f` and `\v` are removed. `\uNNNN` and `\U00NNNNNN`
+    are replaced by `\u{NNNNNN}`. An escape sequence `\<newline>` is permitted
+    in multi-line string literals, and results in no string contents.
+
+-   Raw string literals are supported, for both the single and multi-line case,
+    following the Swift convention: they are introduced by prefixing the opening
+    delimiter with one or more `#`s, and suffixing the closing delimiter with a
+    matching number of `#`s: `#"foo\s*bar"#` or `#"foo"bar"#`. Escape sequences
+    can be introduced in a raw string literal by inserting a matching number of
+    `#`s after the `\` character: `#"foo\#nbar"#` contains a newline character.
+
+-   Unlike in C and C++, adjacent string literals are not implicitly
+    concatenated.
+
+## Details
+
+### Non-raw string literals
+
+A _simple string literal_ is formed of a sequence of
+
+-   characters other than backslashes, double quotation marks, and vertical
+    whitespace
+-   [escape sequences](#escape-sequences)
+
+enclosed in double quotation marks (`"`). Each escape sequence is replaced with
+the corresponding character sequence or code unit sequence.
+
+```carbon
+var String: lucius = "The strings, my lord, are false.";
+```
+
+A _block string literal_ starts with three double quotation marks, followed by
+an optional file type indicator, followed by a newline, and ends at the next
+instance of three double quotation marks whose first `"` is not part of a `\"`
+escape sequence. The closing `"""` shall be the first non-whitespace characters
+on that line. The lines between the opening line and the closing line
+(exclusive) are _content lines_. The content lines shall not contain `\`
+characters that do not form part of an escape sequence.
+
+The _indentation_ of a block string literal is the sequence of horizontal
+whitespace preceding the closing `"""`. Each non-empty content line shall begin
+with the indentation of the string literal. The content of the literal is formed
+as follows:
+
+-   The indentation of the closing line is removed from each non-empty content
+    line.
+-   All trailing whitespace on each line, including the line terminator, is
+    replaced with a single line feed (U+000A) character.
+-   The resulting lines are concatenated.
+-   Each [escape sequence](#escape-sequences) is replaced with the corresponding
+    character sequence or code unit sequence.
+
+A content line is considered empty if it contains only whitespace characters.
+
+```carbon
+var String: w = """
+  This is a string literal. Its first character is 'T' and its last character is
+  a newline character. It contains another newline between 'is' and 'a'.
+  """;
+
+// This string literal is invalid because the """ after 'closing' terminates
+// the literal, but is not at the start of the line.
+var String: invalid = """
+  error: closing """ is not on its own line.
+  """;
+```
+
+A _file type indicator_ is any sequence of non-whitespace characters other than
+`"` or `#`. The file type indicator has no semantic meaning to the Carbon
+compiler, but some file type indicators are understood by the language tooling
+(for example, syntax highlighter, code formatter) as indicating the structure of
+the string literal's content.
+
+```carbon
+// This is a block string literal. Its first two characters are spaces, and its
+// last character is a line feed. It has a file type of 'c++'.
+var String: starts_with_whitespace = """c++
+    int x = 1; // This line starts with two spaces.
+    int y = 2; // This line starts with two spaces.
+  """;
+```
+
+The file type indicator might contain semantic information beyond the file type
+itself, such as instructions to the code formatter to disable formatting for the
+code block.
+
+**Open question:** This proposal does not suggest any concrete set of recognized
+file type indicators. It would be useful to informally specify a set of
+well-known indicators, so that tools have a common understanding of what those
+indicators mean, perhaps in a best practices guide.
+
+#### Escape sequences
+
+Within a string literal, the following escape sequences are recognized:
+
+| Escape        | Meaning                                                  |
+| ------------- | -------------------------------------------------------- |
+| `\t`          | U+0009 CHARACTER TABULATION                              |
+| `\n`          | U+000A LINE FEED                                         |
+| `\r`          | U+000D CARRIAGE RETURN                                   |
+| `\"`          | U+0022 QUOTATION MARK (`"`)                              |
+| `\'`          | U+0027 APOSTROPHE (`'`)                                  |
+| `\\`          | U+005C REVERSE SOLIDUS (`\`)                             |
+| `\0`          | Code unit with value 0                                   |
+| `\xHH`        | Code unit with value HH<sub>16</sub>                     |
+| `\u{HHHH...}` | Unicode code point U+HHHH...                             |
+| `\<newline>`  | No string literal content produced (block literals only) |
+
+This includes all C++ escape sequences except:
+
+-   `\?`, which was historically used to escape trigraphs in string literals,
+    and no longer serves any purpose.
+-   `\ooo` octal escapes, which are removed because Carbon does not support
+    octal literals; `\0` is retained as a special case, which is expected to be
+    important for C interoperability.
+-   `\uABCD`, which is replaced by `\u{ABCD}`.
+-   `\U0010FFFF`, which is replaced by `\u{10FFFF}`.
+-   `\a` (bell), `\b` (backspace), `\v` (vertical tab), and `\f` (form feed).
+    `\a` and `\b` are obsolescent, and `\f` and `\v` are largely obsolete. These
+    characters can be expressed with `\x07`, `\x08`, `\x0B`, and `\x0C`
+    respectively if needed.
+
+Note that this is the same set of escape sequences supported by
+[Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html#ID295)
+and [Rust](https://doc.rust-lang.org/reference/tokens.html), except that, unlike
+in Swift, support for `\xHH` is provided.
+
+While this proposal takes a firm stance on not permitting octal escape
+sequences, the decision to not allow `\1`..`\7`, and more generally to not treat
+`\DDDD` as a decimal escape sequence, is _experimental_.
+
+In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
+or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
+exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
+points can be expressed by number using `\u{10FFFF}` notation, which accepts any
+number of hexadecimal characters. Any numeric code point in the ranges
+0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can be
+expressed this way.
+
+_Open question:_ Some programming languages (notably Python) support a
+`\N{unicode character name}` syntax. We could add such an escape sequence, but
+this proposal does not include one. Future proposals considering adding such
+support should pay attention to work done by C++'s Unicode study group in this
+area.
+
+The escape sequence `\0` shall not be followed by a decimal digit. In cases
+where a null byte should be followed by a decimal digit, `\x00` can be used
+instead: `"foo\x00123"`. The intent is to preserve the possibility of permitting
+decimal escape sequences in the future.
+
+A backslash followed by a line feed character is an escape sequence that
+produces no string contents. This escape sequence is _experimental_, and can
+only appear in multi-line string literals. This escape sequence is processed
+after trailing whitespace is replaced by a line feed character, so a `\`
+followed by horizontal whitespace followed by a line terminator removes the
+whitespace up to and including the line terminator. Unlike in Rust, but like in
+Swift, leading whitespace on the line after an escaped newline is not removed,
+other than whitespace that matches the indentation of the terminating `"""`.
+
+A character sequence starting with a backslash that doesn't match any known
+escape sequence is invalid. Whitespace characters other than space and, for
+block string literals, new line optionally preceded by carriage return are
+disallowed. All other characters (including non-printable characters) are
+preserved verbatim. Because all Carbon source files are required to be valid
+sequences of Unicode characters, code unit sequences that are not valid UTF-8
+can only be produced by `\x` escape sequences.
+
+The choice to disallow raw tab characters in string literals is _experimental_.
+
+```carbon
+var String: fret = "I would 'twere something that would fret the string,\n" +
+                   "The master-cord on's \u{2764}\u{FE0F}!";
+
+// This string contains two characters (prior to encoding in UTF-8):
+// U+1F3F9 (BOW AND ARROW) followed by U+0032 (DIGIT TWO)
+var String: password = "\u{1F3F9}2";
+
+// This string contains no newline characters.
+var String: type_mismatch = """
+  Shall I compare thee to a summer's day? Thou art \
+  more lovely and more temperate.\
+  """;
+
+var String: trailing_whitespace = """
+  This line ends in a space followed by a newline. \n\
+      This line starts with four spaces.
+  """;
+```
+
+### Raw string literals
+
+In order to allow strings whose contents include backslashes and double quotes,
+the delimiters of string literals can be customized by prefixing the opening
+delimiter with _N_ `#` characters. A closing delimiter for such a string is only
+recognized if it is followed by _N_ `#` characters, and similarly, escape
+sequences in such string literals are recognized only if the `\` is also
+followed by _N_ `#` characters. A `\`, `"`, or `"""` not followed by _N_ `#`
+characters has no special meaning.
+
+| Opening delimiter | Escape sequence introducer    | Closing delimiter |
+| ----------------- | ----------------------------- | ----------------- |
+| `"` / `"""`       | `\` (for example, `\n`)       | `"` / `"""`       |
+| `#"` / `#"""`     | `\#` (for example, `\#n`)     | `"#` / `"""#`     |
+| `##"` / `##"""`   | `\##` (for example, `\##n`)   | `"##` / `"""##`   |
+| `###"` / `###"""` | `\###` (for example, `\###n`) | `"###` / `"""###` |
+| ...               | ...                           | ...               |
+
+For example:
+
+```carbon
+var String: x = #"""
+  This is the content of the string. The 'T' is the first character
+  of the string.
+  """ <-- This is not the end of the string.
+  """#;
+  // But the preceding line does end the string.
+// OK, final character is \
+var String: y = #"Hello\"#;
+var String: z = ##"Raw strings #"nesting"#"##;
+var String: w = #"Tab is expressed as \t. Example: '\#t'"#;
+```
+
+Note that both a single-line raw string literal and a multi-line raw string
+literal can begin with `#"""`. These cases can be distinguished by the presence
+or absence of additional `"`s later in the same line:
+
+-   In a single-line raw string literal, there must be a `"` and one or more
+    `#`s later in the same line terminating the string.
+-   In a multi-line raw string literal, the rest of the line is a file type
+    indicator, which can contain neither `"` nor `#`.
+
+```carbon
+// This string is a single-line raw string literal.
+// The contents of this string start and end with exactly two "s.
+var String: ambig1 = #"""This is a raw string literal starting with """#;
+
+// This string is a block raw string literal with file-type 'This',
+// whose contents start with "is a ".
+var String: ambig2 = #"""This
+  is a block string literal with file type 'This', first character 'i',
+  and last character 'X': X\#
+  """#;
+
+// This is a single-line raw string literal, equivalent to "\"".
+var String: ambig3 = #"""#;
+```
+
+### Encoding
+
+A string literal results in a sequence of 8-bit bytes. Like Carbon source files,
+string literals are encoded in UTF-8. This proposal includes no mechanism to
+request that any other encoding is used. The expectation is that if another
+encoding is needed, the string literal can be transcoded from UTF-8 during
+compilation. There is no guarantee that the string is valid UTF-8, however,
+because arbitrary byte sequences can be inserted by way of `\xHH` escape
+sequences.
+
+This decision is _experimental_, and should be revisited if we find sufficient
+motivation for directly expressing string literals in other encodings.
+Similarly, as library support for a string type evolves, we should consider
+including string literal syntax (perhaps as the default) that guarantees the
+string content is a valid UTF-8 encoding, so that valid UTF-8 can be
+distinguished from an arbitrary string in the type system. In such string
+literals, we should consider rejecting `\xHH` escapes in which HH is greater
+than 7F<sub>16</sub>, as in Rust.
+
+## Alternatives considered
+
+### Block string literals
+
+We could avoid including a block string literal in general, and instead
+construct multi-line strings by string concatenation, with either C-style
+juxtaposition or with an explicit concatenation operator. But doing so would be
+more verbose and would make the expression of the source code be further from
+the programmer's intent.
+
+We could use raw string literals to provide block string literal syntax, as C++
+does. However, this couples two orthogonal choices: whether escape sequences
+should be recognized and whether the string is intended to span multiple lines.
+In C++ code, the inability to use escape sequences in multi-line string literals
+sometimes awkward. For example:
+
+```c++
+std::string make_rule = "%s: %s\n\t$(CC) -c -o $@ $< $(CFLAGS)\n\n"
+                        "main:\n\t$(CC) %s -o %s\n";
+```
+
+can be written under this proposal as
+
+```carbon
+var String: make_rule = """make
+  %s: %s
+  \t$(CC) -c -o $@ $< $(CFLAGS)
+
+  main:
+  \t$(CC) %s -o %s
+  """;
+```
+
+improving readability while still making the semantically-meaningful presence of
+tabs visible even in editors / code browsers that do not distinguish tabs from
+spaces.
+
+#### Leading whitespace removal
+
+Block string literals could use explicit characters in the body to indicate the
+amount of leading whitespace to be removed:
+
+```carbon
+var String: x = """
+  |  starts with two spaces.
+  """;
+```
+
+This would allow the correct indentation to be determined as soon as the first
+line after the opening `"""` is seen. However, this adds lexical complexity, and
+harms the ability to copy-paste string contents into other contexts.
+
+#### Terminating newline
+
+We could choose to exclude the trailing newline (like in Swift). Informal
+surveys suggest that expectations for whether to include or exclude the trailing
+newline vary.
+
+The intended use case for block string literals is to represent multi-line
+strings. When forming a single larger string from concatenation of multi-line
+string literals, including the trailing newline but not the leading newline --
+or, more generally, including a newline at the end of each line in the string --
+provides the best alignment between the source-level syntax and the result. For
+example:
+
+```carbon
+fn Run() {
+  print("""c++
+    class X {
+    public:
+      X() {}
+
+    """);
+  for (var String: decl in GetMemberDecls()) {
+    print decl;
+  }
+  print("""c++
+
+    private:
+    """);
+  print(GetFields());
+  print("""c++
+    };
+    """);
+}
+```
+
+As is, the output printed by this example can be visualized by ignoring all
+lines other than those between the `"""`s. If we excluded the trailing newline,
+additional blank lines would be required at the end of each string literal,
+harming the readability of the example.
+
+### Escape sequences
+
+We could support octal escape sequences, as many C family languages do. However,
+they are considered antiquated in C++ code, and supporting them would be
+inconsistent with our decision to not support octal numeric literals. A quick
+informal poll suggests that many C++ programmers do not realize that `\123` is
+an octal escape sequence, not a decimal one.
+
+We could support `\123` as a decimal escape sequence. However, doing so may lead
+to surprise when migrating C++ code to Carbon. This possibility should be
+revisited once Carbon matures and we have a better idea of how the migration
+process is expected to proceed.
+
+We could allow arbitrary-length `\x` escape sequences, as C++ does, and include
+some explicit mechanism to terminate such a sequence. For example, we could
+treat `\<whitespace>` for an arbitrary whitespace character in the same way we
+treat `\<newline>`, and use `"\xAB\ C"` to terminate an escape sequence
+prematurely. However, this is unnecessarily inventive, and the Python approach
+of requiring exactly two hexadecimal characters after `\x` is adequate, assuming
+we do not intend to support string literal element types other than 8-bit bytes.
+If we do find we want to support wider element types in future (for example, if
+we want to add a UTF-16 or UTF-32 string literal), `\x{ABCD}` can be used.
+
+We could permit `\<newline>` even in non-block string literals to allow them to
+be line-wrapped, but there seems to be little benefit to doing so, as a block
+string literal can always be used instead.
+
+We could adopt Python's `\N{unicode character name}` syntax. But there is no
+pressing need to add such a syntax imminently, and concerns have been raised
+both over the exact ways in which characters are named and over compatibility
+with upcoming C++ language extensions in this area, so this syntax is not being
+proposed at this time.
+
+We could allow an `\e` escape sequence for the U+001C ESCAPE character. This
+character is a common extension in C and C++ compilers, and appears to primarily
+be used to hardcode ANSI terminal escape sequences, such as
+`"\e[32mgreen text\e[0m"`. This proposal doesn't explicitly reject this idea.
+However, if we consider adopting such an extension in the future, we should
+consider whether a library facility for simple terminal operations would be a
+more valuable addition than this escape sequence.
+
+We could retain the `\uNNNN` escape sequence to give a terser notation for the
+common case of a Unicode code point that is most naturally written as four
+hexadecimal digits -- that is, all code points in the Basic Multilingual Plane.
+This is the approach taken by JavaScript. However, following Swift and Rust in
+permitting only `\u{NNNN}` is simpler and avoids redundancy. We expect explicit
+`\u` escapes to be rare: we expect regular, printable Unicode characters that
+are normalized in NFC to be written directly in the source file, rather than
+spelled with `\u` escapes. Explicit `\u` escapes would be useful where the code
+point value is important -- for example, in test data -- or where special
+characters such as directionality markers or non-normalized characters are
+desired, but for such uses, the longer `\u{NNNN}` form seems adequate.
+
+### Raw string literals
+
+The approach to raw string literals in this proposal is based on Swift's raw
+strings facility.
+
+We could use a different mechanism other than a sequence of `#`s to support
+nesting raw string literals. For example, we could adopt something like C++'s
+semi-arbitrary delimiters `R"foo(string contents)foo"`. However, this level of
+customizability seems unwarranted: raw string literals are unlikely to nest more
+than one or two levels deep, so using `#"..."#`, `##"..."##`, `###"..."###` for
+successive nesting levels seems unproblematic, and removes the need for the
+programmer to make an arbitrary choice.
+
+We could use a delimiter other than `#` to demarcate raw string literals. At
+this stage in Carbon's development, we don't know exactly which characters will
+be useful in operators, but it seems reasonable to assume a mostly C++-like
+operator set, which gives us a variety of characters that cannot appear
+immediately before a string literal: at least `@` `#` `$` `)` `]` `}` `\` `.`
+all appear likely to be available. Of these, `\` is likely to be problematic due
+to its use in escape sequences, closing brackets followed by string literals
+might one day be useful in some grammar constructs, and `.` seems a little too
+close to resembling designator. That leaves `@`, `#`, and `$`, and multiple
+existing languages have used `#` for this purpose.
+
+We could disallow use of _N_ `#`s as a delimiter if a lower value of _N_ would
+work. However, this would make the language brittle under maintenance: removing
+the last nested string literal from a quoted block of code would require
+changing the delimiters.
+
+We could use Rust-style raw strings, which add a leading `r`, permit zero `#`s
+to be used in raw strings, and do not provide a facility for escape sequences in
+raw strings. There are several reasons to prefer Swift-style raw strings:
+
+-   Swift raw strings are not a distinct language feature; rather, they are a
+    generalization of non-raw strings -- non-raw strings are simply the case
+    where the number of `#` characters in the delimiters and escape sequence
+    intorducer is zero.
+-   Permitting escape sequences even in raw strings means that there is no loss
+    of functionality when using a raw string, and code changes under maintenance
+    that would require use of a facility only available by way of an escape
+    sequence (such as inclusion of trailing whitespace in a block string
+    literal) do not force a reversion to a non-raw string literal.
+
+There are also several reasons to prefer the Rust-style approach:
+
+-   In the most common case, a Rust raw string will be one character shorter.
+-   Less lexical space is used: Swift-style raw strings remove the possibility
+    of using `#` as a prefix operator, whereas the leading `r` in Rust-style raw
+    strings does not.
+-   Certain character sequences are hard to express in Swift-style raw strings.
+    Specifically, string literals such as `"\\################"` cannot readily
+    be expressed as a raw string. Such string literals are extremely rare, but
+    not nonexistent, in one large sample C++ corpus.
+
+If the final issue is concerning, we have a path to address it, by specifying
+that `\` followed by <i>N</i>+1 or more `#`s is left alone, just like `\`
+followed by <i>N</i>-1 or fewer `#`s is left alone. Formally, this can be
+accomplished by defining `\#` as an escape sequence that expands to itself, that
+is, to a backslash followed by <i>N</i>+1 `#` characters.
+
+#### Trailing whitespace
+
+We could preserve trailing whitespace in at least raw block string literals, and
+perhaps in all block string literals. However, this would mean that visually
+identical programs could have different meanings, and even that transformations
+performed automatically on save by some editors (removing trailing whitespace)
+could change the meaning of a program. It might also mean that raw string
+literals are no longer a generalization of non-raw string literals.
+
+Under this proposal, trailing whitespace can be included in a block string
+literal by following it with `\n\`:
+
+```
+var String: authors = """markdown
+  *Authors*:  \n\
+  Me <me@example.com>  \n\
+  Someone Else <them@example.com>
+
+""";
+```
+
+In a single-`#` raw string literal, the same can be accomplished with the
+more-verbose terminator `\#n\#`, and so on.
+
+#### Line separators
+
+Raw block string literals could preserve the form of vertical whitespace used to
+terminate each line. This would allow uncommon forms of vertical whitespace (for
+example, vertical tab and form feed) to be included in raw string literals, but
+would create a risk that the meaning of a program would be different when the
+source code is checked out on an operating system that uses line feed as a line
+terminator versus when the source code is checked out on an operating system
+that uses a different line terminator (such as carriage return followed by line
+feed). This would also mean that raw string literals are no longer a
+generalization of non-raw string literals.
+
+### Internal whitespace
+
+We could allow raw tab characters in string literals. However, raw tab
+characters harm the readability of the program, and we would like to encourage
+the use of `\t` escapes instead in situations where they are available, even if
+this means that the more verbose form `\#t` needs to be used in raw string
+literals.