5 years ago · ec6fde4d61
--- a/proposals/README.md
+++ b/proposals/README.md
@@ -57,5 +57,6 @@ request:
 
				     -   [0175 - Decision](p0175_decision.md)
			
 
				 -   [0179 - Create a toolchain team.](p0179.md)
			
 
				     -   [0179 - Decision](p0179_decision.md)
			
 
				+-   [0199 - String literals](p0199.md)
			
 
				 
			
 
				 <!-- endproposals -->
			
--- a/proposals/p0199.md
+++ b/proposals/p0199.md
@@ -0,0 +1,723 @@
 
				+# String literals
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+[Pull request](https://github.com/carbon-language/carbon-lang/pull/199)
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+-   [Problem](#problem)
			
 
				+-   [Background](#background)
			
 
				+    -   [Existing practice](#existing-practice)
			
 
				+-   [Proposal](#proposal)
			
 
				+-   [Details](#details)
			
 
				+    -   [Non-raw string literals](#non-raw-string-literals)
			
 
				+        -   [Escape sequences](#escape-sequences)
			
 
				+    -   [Raw string literals](#raw-string-literals)
			
 
				+    -   [Encoding](#encoding)
			
 
				+-   [Alternatives considered](#alternatives-considered)
			
 
				+    -   [Block string literals](#block-string-literals)
			
 
				+        -   [Leading whitespace removal](#leading-whitespace-removal)
			
 
				+        -   [Terminating newline](#terminating-newline)
			
 
				+    -   [Escape sequences](#escape-sequences-1)
			
 
				+    -   [Raw string literals](#raw-string-literals-1)
			
 
				+        -   [Trailing whitespace](#trailing-whitespace)
			
 
				+        -   [Line separators](#line-separators)
			
 
				+    -   [Internal whitespace](#internal-whitespace)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Problem
			
 
				+
			
 
				+This proposal specifies lexical rules for constant strings in Carbon.
			
 
				+
			
 
				+## Background
			
 
				+
			
 
				+We wish to provide a syntax for writing literals containing human-readable text.
			
 
				+
			
 
				+Note that "human-readable text" here should be understood broadly: such text may
			
 
				+be subject to further processing, and may in some cases be intended to be
			
 
				+interpreted by a computer rather than by a human (such as a regular expression,
			
 
				+program source code, or a C++ mangled name), but broadly represents a sequence
			
 
				+of characters rather than arbitrary binary data.
			
 
				+
			
 
				+Such text is typically represented in an _encoding_, which is a bidirectional
			
 
				+mapping between a sequence of characters in text and a sequence of bounded
			
 
				+integer values known as _code units_, suitable for storage, transmission, and
			
 
				+processing. For example, the Russian word углерод (carbon) is encoded in the
			
 
				+UTF-8 encoding as D1<sub>16</sub>83<sub>16</sub> D0<sub>16</sub>B3<sub>16</sub>
			
 
				+D0<sub>16</sub>BB<sub>16</sub> D0<sub>16</sub>B5<sub>16</sub>
			
 
				+D1<sub>16</sub>80<sub>16</sub> D0<sub>16</sub>BE<sub>16</sub>
			
 
				+D0<sub>16</sub>B4<sub>16</sub>.
			
 
				+
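The byte values quoted above can be checked with a short Python sketch. Python is used here only as a convenient calculator; the variable names are illustrative and are not part of the proposal.

```python
# Encode the example word and show each UTF-8 code unit in hexadecimal.
word = "углерод"
code_units = word.encode("utf-8")
hex_units = " ".join(f"{unit:02X}" for unit in code_units)
print(hex_units)  # prints: D1 83 D0 B3 D0 BB D0 B5 D1 80 D0 BE D0 B4

# The mapping is bidirectional: decoding recovers the original characters.
round_trip = code_units.decode("utf-8")
```

Each of the seven Cyrillic letters occupies two code units, which gives the fourteen values listed in the paragraph.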
			
 
				+### Existing practice
			
 
				+
			
 
				+See
			
 
				+[Comparison of programming languages (strings) on Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28strings%29).
			
 
				+
			
 
				+Simple string literals are specified in most programming languages as text
			
 
				+delimited by double-quote characters, `"like this"`. Such string literals
			
 
				+usually are restricted to begin and end on the same source line. Three
			
 
				+additional features are commonly seen:
			
 
				+
			
 
				+-   Escape sequences, which permit string literals to include characters that
			
 
				+    are difficult to type, are ambiguous for the reader, or that would be
			
 
				+    problematic in some way (such as whitespace characters, characters that are
			
 
				+    invisible, and characters that change how other characters are rendered),
			
 
				+    and also to include arbitrary code units. One common convention is to use
			
 
				+    `\` to introduce an escape sequence, where, for example:
			
 
				+
			
 
				+    -   `\n` represents a newline character,
			
 
				+    -   `\u1234` represents the Unicode character U+1234,
			
 
				+    -   `\xAB` represents the code unit AB<sub>16</sub>,
			
 
				+    -   `\"` represents a single `"` character and does not terminate the string
			
 
				+        literal,
			
 
				+    -   `\\` represents a single `\` character,
			
 
				+
			
 
				+    and so on.
			
 
				+
			
 
				+    The set of single-letter escape sequences has a lot of commonality between
			
 
				+    languages, with some variation between older and newer languages:
			
 
				+
			
 
				+    -   C++ and Python allow `\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v` for bell,
			
 
				+        backspace, form feed, new line, carriage return, tab, and vertical tab,
			
 
				+        respectively.
			
 
				+    -   JavaScript drops support for `\a`.
			
 
				+    -   Java additionally drops support for `\v`.
			
 
				+    -   Rust and Swift additionally drop support for `\b` and `\f`, leaving only
			
 
				+        `\n`, `\r`, and `\t`.
			
 
				+
			
 
				+    The rules for numeric escape sequences differ between kinds of escape
			
 
				+    sequence and between languages. The rules in C++, JavaScript, Python, Rust,
			
 
				+    and Swift are as follows:
			
 
				+
			
 
				+    -   `\123` is interpreted as an octal code unit value, and up to three octal
			
 
				+        digits are consumed. In JavaScript and C++, values greater than
			
 
				+        377<sub>8</sub> (255<sub>10</sub>) are invalid (assuming an 8-bit
			
 
				+        character type). In Python, values greater than 377<sub>8</sub> are
			
 
				+        interpreted modulo 256. Rust and Swift do not allow octal escapes in
			
 
				+        general, but do allow `\0` as a special case.
			
 
				+    -   `\xAB` is interpreted as a hexadecimal code unit value. In C++, any
			
 
				+        nonzero number of hexadecimal digits can follow as part of the escape
			
 
				+        sequence. In Python, JavaScript, and Rust, exactly two digits are
			
 
				+        required. In Rust, the value must be less than or equal to
			
 
				+        7F<sub>16</sub> except for `b`-prefixed strings (byte array literals).
			
 
				+        Swift does not support this form of escape sequence.
			
 
				+    -   `\uABCD` is interpreted as a hexadecimal code point value. In C++,
			
 
				+        Python, and JavaScript, exactly four hexadecimal digits can follow, but
			
 
				+        JavaScript allows any nonzero number of digits to be specified using
			
 
				+        `\u{ABCDE}` notation. Rust and Swift support only the `\u{ABCDE}`
			
 
				+        notation.
			
 
				+    -   `\U0010FFFD` is interpreted as a hexadecimal code point value in C++ and
			
 
				+        Python, but not in JavaScript, Rust, or Swift, and permits exactly eight
			
 
				+        hexadecimal digits.
			
 
				+
			
 
				+-   Raw string literals, in which escape sequences are not recognized. These are
			
 
				+    often used where interpreting escape sequences is undesirable because the
			
 
				+    escape character or regular string terminator appears frequently.
			
 
				+    Such literals are useful when embedding one machine-readable language in
			
 
				+    another, when those languages share some escaping conventions. Such
			
 
				+    functionality may also provide a way to customize the string delimiters.
			
 
				+
			
 
				+    -   In Python, raw string literals have an `r` prefix: `r"li\ngo"` is a six
			
 
				+        character string whose third character is `\`.
			
 
				+    -   In C++, raw string literals have an `R` prefix, along with a custom
			
 
				+        delimiter (which may be empty): `R"DELIM(li\ngo)DELIM"` is a six
			
 
				+        character string (plus a nul terminator).
			
 
				+    -   In Rust, raw string literals have an `r` prefix and any matching number
			
 
				+        of `#` characters enclose the string contents: the third character of
			
 
				+        `r"lo\ng"` is `\`, and the second character of `r#" " "#` is `"`.
			
 
				+    -   In Swift, raw string literals are a generalization of regular string
			
 
				+        literals: literals are introduced by any number, _N_, of `#` characters
			
 
				+        followed by a `"`, terminated by `"` followed by _N_ `#` characters, and
			
 
				+        escape sequences are introduced by a `\` followed by _N_ `#` characters:
			
 
				+        `#" " # \n \#n \#\# "#` has the same contents as the C++ string literal
			
 
				+        `" \" # \\n \n \\# "`.
			
 
				+    -   In Java, raw string literals are delimited by a sequence of one or more
			
 
				+        backticks instead of double quotes: the fourth character of
			
 
				+        <code>\`\`foo\`\bar\`\`</code> is a backtick and the fifth is a
			
 
				+        backslash.
			
 
				+
			
 
				+-   Multiline string literals provide a mechanism for a string to easily span
			
 
				+    more than one line of source text.
			
 
				+
			
 
				+    -   In C++, Rust, and Java, raw string literals are used to represent
			
 
				+        multiline string literals.
			
 
				+    -   In Python, different delimiters (`"""` or `'''` instead of
			
 
				+        `"` or `'`) are used to represent multiline string literals, and plain
			
 
				+        `"` and `""` can thereby appear in the string contents, but these
			
 
				+        literals otherwise behave the same as regular string literals.
			
 
				+    -   In Swift, different delimiters are used (`"""` instead of `"`), but
			
 
				+        unlike in Python, the string content cannot be on the same line as the
			
 
				+        delimiters, and the resulting mandatory leading and trailing newlines
			
 
				+        are not included in the string. Internal newlines can be removed by
			
 
				+        preceding them with backslashes.
			
 
				+    -   In JavaScript, backtick-delimited strings can contain newlines; this
			
 
				+        syntax also allows string interpolation as described below.
			
 
				+
			
 
				+In addition, some languages, primarily scripting languages, also provide a
			
 
				+mechanism for string interpolation, wherein a string value is formed by
			
 
				+including the formatted values of some variables in a given format string. For
			
 
				+example, `"Hello, $planet."` might produce a string value including the
			
 
				+formatted value of the variable named `planet`. Such interpolation facilities
			
 
				+are outside the scope of this proposal.
			
 
				+
			
 
				+## Proposal
			
 
				+
			
 
				+-   Single-line string literals are delimited by `"`s: `"hello"`
			
 
				+
			
 
				+-   Multi-line string literals are introduced by a `"""` followed by a newline
			
 
				+    and terminated by a line beginning with a `"""`. The indentation of the
			
 
				+    terminating line is removed from all preceding lines:
			
 
				+
			
 
				+    ```carbon
			
 
				+    var String: henry_vi = """
			
 
				+      The winds grow high; so do your stomachs, lords.
			
 
				+      How irksome is this music to my heart!
			
 
				+      When such strings jar, what hope of harmony?
			
 
				+      I pray, my lords, let me compound this strife.
			
 
				+          -- History of Henry VI, Part II, Act II, Scene 1, W. Shakespeare
			
 
				+      """;
			
 
				+    ```
			
 
				+
			
 
				+    Only the final line of this string literal begins with whitespace. The
			
 
				+    opening newline is not part of the string's contents, but the trailing
			
 
				+    newline is; the first character of this example string is `T` and the last
			
 
				+    character is a newline.
			
 
				+
			
 
				+-   The opening `"""` of a multi-line string literal can be followed by a _file
			
 
				+    type indicator_, to assist tooling in understanding the intent of the
			
 
				+    string. This indicator has no effect on the meaning of the program.
			
 
				+
			
 
				+    ```carbon
			
 
				+    var String: cpp_snippet = """c++
			
 
				+      #include <iostream>
			
 
				+
			
 
				+      int main() {
			
 
				+        std::cout << "hello world" << std::endl;
			
 
				+      }
			
 
				+      """;
			
 
				+    ```
			
 
				+
			
 
				+-   Escape sequences are introduced with a `\` character; the most common C and
			
 
				+    C++ escape sequences are supported: `"hello\nworld"`. Octal escapes
			
 
				+    (`\177`), `\a`, `\b`, `\f` and `\v` are removed. `\uNNNN` and `\U00NNNNNN`
			
 
				+    are replaced by `\u{NNNNNN}`. An escape sequence `\<newline>` is permitted
			
 
				+    in multi-line string literals, and results in no string contents.
			
 
				+
			
 
				+-   Raw string literals are supported, for both the single and multi-line case,
			
 
				+    following the Swift convention: they are introduced by prefixing the opening
			
 
				+    delimiter with one or more `#`s, and suffixing the closing delimiter with a
			
 
				+    matching number of `#`s: `#"foo\s*bar"#` or `#"foo"bar"#`. Escape sequences
			
 
				+    can be introduced in a raw string literal by inserting a matching number of
			
 
				+    `#`s after the `\` character: `#"foo\#nbar"#` contains a newline character.
			
 
				+
			
 
				+-   Unlike in C and C++, adjacent string literals are not implicitly
			
 
				+    concatenated.
			
 
				+
			
 
				+## Details
			
 
				+
			
 
				+### Non-raw string literals
			
 
				+
			
 
				+A _simple string literal_ is formed of a sequence of
			
 
				+
			
 
				+-   characters other than backslashes, double quotation marks, and vertical
			
 
				+    whitespace
			
 
				+-   [escape sequences](#escape-sequences)
			
 
				+
			
 
				+enclosed in double quotation marks (`"`). Each escape sequence is replaced with
			
 
				+the corresponding character sequence or code unit sequence.
			
 
				+
			
 
				+```carbon
			
 
				+var String: lucius = "The strings, my lord, are false.";
			
 
				+```
			
 
				+
			
 
				+A _block string literal_ starts with three double quotation marks, followed by
			
 
				+an optional file type indicator, followed by a newline, and ends at the next
			
 
				+instance of three double quotation marks whose first `"` is not part of a `\"`
			
 
				+escape sequence. The closing `"""` shall be the first non-whitespace characters
			
 
				+on that line. The lines between the opening line and the closing line
			
 
				+(exclusive) are _content lines_. The content lines shall not contain `\`
			
 
				+characters that do not form part of an escape sequence.
			
 
				+
			
 
				+The _indentation_ of a block string literal is the sequence of horizontal
			
 
				+whitespace preceding the closing `"""`. Each non-empty content line shall begin
			
 
				+with the indentation of the string literal. The content of the literal is formed
			
 
				+as follows:
			
 
				+
			
 
				+-   The indentation of the closing line is removed from each non-empty content
			
 
				+    line.
			
 
				+-   All trailing whitespace on each line, including the line terminator, is
			
 
				+    replaced with a single line feed (U+000A) character.
			
 
				+-   The resulting lines are concatenated.
			
 
				+-   Each [escape sequence](#escape-sequences) is replaced with the corresponding
			
 
				+    character sequence or code unit sequence.
			
 
				+
			
 
				+A content line is considered empty if it contains only whitespace characters.
			
 
				+
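The content-formation steps above, other than escape processing, can be sketched in Python. The function name `block_string_content` and its two parameters are hypothetical, chosen for illustration; this is a reading of the rules as written, not code from any Carbon implementation.

```python
def block_string_content(content_lines, closing_line):
    """Form the content of a block string literal, before escape processing.

    content_lines: the source lines between the opening and closing lines.
    closing_line: the source line that holds the closing delimiter.
    """
    # The indentation is the horizontal whitespace before the closing `"""`.
    stripped = closing_line.lstrip(" \t")
    indentation = closing_line[: len(closing_line) - len(stripped)]

    result = []
    for line in content_lines:
        if line.strip() == "":
            # An empty content line contributes only a line feed.
            result.append("\n")
            continue
        if not line.startswith(indentation):
            raise ValueError("content line does not begin with the indentation")
        # Remove the indentation, then replace trailing whitespace
        # (including any line terminator) with a single line feed.
        result.append(line[len(indentation):].rstrip() + "\n")
    # The resulting lines are concatenated.
    return "".join(result)


example = block_string_content(
    ["    int x = 1;  \r\n", "\n", "    int y = 2;\n"],
    '  """;',
)
```

Here `example` is `"  int x = 1;\n\n  int y = 2;\n"`: two spaces of each content line survive because the closing line is indented by only two spaces.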
			
 
				+```carbon
			
 
				+var String: w = """
			
 
				+  This is a string literal. Its first character is 'T' and its last character is
			
 
				+  a newline character. It contains another newline between 'is' and 'a'.
			
 
				+  """;
			
 
				+
			
 
				+// This string literal is invalid because the """ after 'closing' terminates
			
 
				+// the literal, but is not at the start of the line.
			
 
				+var String: invalid = """
			
 
				+  error: closing """ is not on its own line.
			
 
				+  """;
			
 
				+```
			
 
				+
			
 
				+A _file type indicator_ is any sequence of non-whitespace characters other than
			
 
				+`"` or `#`. The file type indicator has no semantic meaning to the Carbon
			
 
				+compiler, but some file type indicators are understood by the language tooling
			
 
				+(for example, syntax highlighter, code formatter) as indicating the structure of
			
 
				+the string literal's content.
			
 
				+
			
 
				+```carbon
			
 
				+// This is a block string literal. Its first two characters are spaces, and its
			
 
				+// last character is a line feed. It has a file type of 'c++'.
			
 
				+var String: starts_with_whitespace = """c++
			
 
				+    int x = 1; // This line starts with two spaces.
			
 
				+    int y = 2; // This line starts with two spaces.
			
 
				+  """;
			
 
				+```
			
 
				+
			
 
				+The file type indicator might contain semantic information beyond the file type
			
 
				+itself, such as instructions to the code formatter to disable formatting for the
			
 
				+code block.
			
 
				+
			
 
				+**Open question:** This proposal does not suggest any concrete set of recognized
			
 
				+file type indicators. It would be useful to informally specify a set of
			
 
				+well-known indicators, so that tools have a common understanding of what those
			
 
				+indicators mean, perhaps in a best practices guide.
			
 
				+
			
 
				+#### Escape sequences
			
 
				+
			
 
				+Within a string literal, the following escape sequences are recognized:
			
 
				+
			
 
				+| Escape        | Meaning                                                  |
			
 
				+| ------------- | -------------------------------------------------------- |
			
 
				+| `\t`          | U+0009 CHARACTER TABULATION                              |
			
 
				+| `\n`          | U+000A LINE FEED                                         |
			
 
				+| `\r`          | U+000D CARRIAGE RETURN                                   |
			
 
				+| `\"`          | U+0022 QUOTATION MARK (`"`)                              |
			
 
				+| `\'`          | U+0027 APOSTROPHE (`'`)                                  |
			
 
				+| `\\`          | U+005C REVERSE SOLIDUS (`\`)                             |
			
 
				+| `\0`          | Code unit with value 0                                   |
			
 
				+| `\xHH`        | Code unit with value HH<sub>16</sub>                     |
			
 
				+| `\u{HHHH...}` | Unicode code point U+HHHH...                             |
			
 
				+| `\<newline>`  | No string literal content produced (block literals only) |
			
 
				+
			
 
				+This includes all C++ escape sequences except:
			
 
				+
			
 
				+-   `\?`, which was historically used to escape trigraphs in string literals,
			
 
				+    and no longer serves any purpose.
			
 
				+-   `\ooo` octal escapes, which are removed because Carbon does not support
			
 
				+    octal literals; `\0` is retained as a special case, which is expected to be
			
 
				+    important for C interoperability.
			
 
				+-   `\uABCD`, which is replaced by `\u{ABCD}`.
			
 
				+-   `\U0010FFFF`, which is replaced by `\u{10FFFF}`.
			
 
				+-   `\a` (bell), `\b` (backspace), `\v` (vertical tab), and `\f` (form feed).
			
 
				+    `\a` and `\b` are obsolescent, and `\f` and `\v` are largely obsolete. These
			
 
				+    characters can be expressed with `\x07`, `\x08`, `\x0B`, and `\x0C`
			
 
				+    respectively if needed.
			
 
				+
			
 
				+Note that this is the same set of escape sequences supported by
			
 
				+[Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html#ID295)
			
 
				+and [Rust](https://doc.rust-lang.org/reference/tokens.html), except that, unlike
			
 
				+in Swift, support for `\xHH` is provided.
			
 
				+
			
 
				+While this proposal takes a firm stance on not permitting octal escape
			
 
				+sequences, the decision to not allow `\1`..`\7`, and more generally to not treat
			
 
				+`\DDDD` as a decimal escape sequence, is _experimental_.
			
 
				+
			
 
				+In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
			
 
				+or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
			
 
				+exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
			
 
				+points can be expressed by number using `\u{10FFFF}` notation, which accepts any
			
 
				+number of hexadecimal characters. Any numeric code point in the ranges
			
 
				+0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can be
			
 
				+expressed this way.
			
 
				+
			
 
				+_Open question:_ Some programming languages (notably Python) support a
			
 
				+`\N{unicode character name}` syntax. We could add such an escape sequence, but
			
 
				+this proposal does not include one. Future proposals considering adding such
			
 
				+support should pay attention to work done by C++'s Unicode study group in this
			
 
				+area.
			
 
				+
			
 
				+The escape sequence `\0` shall not be followed by a decimal digit. In cases
			
 
				+where a null byte should be followed by a decimal digit, `\x00` can be used
			
 
				+instead: `"foo\x00123"`. The intent is to preserve the possibility of permitting
			
 
				+decimal escape sequences in the future.
			
 
				+
			
 
				+A backslash followed by a line feed character is an escape sequence that
			
 
				+produces no string contents. This escape sequence is _experimental_, and can
			
 
				+only appear in multi-line string literals. This escape sequence is processed
			
 
				+after trailing whitespace is replaced by a line feed character, so a `\`
			
 
				+followed by horizontal whitespace followed by a line terminator removes the
			
 
				+whitespace up to and including the line terminator. Unlike in Rust, but like in
			
 
				+Swift, leading whitespace on the line after an escaped newline is not removed,
			
 
				+other than whitespace that matches the indentation of the terminating `"""`.
			
 
				+
			
 
				+A character sequence starting with a backslash that doesn't match any known
			
 
				+escape sequence is invalid. Whitespace characters other than space and, for
			
 
				+block string literals, new line optionally preceded by carriage return are
			
 
				+disallowed. All other characters (including non-printable characters) are
			
 
				+preserved verbatim. Because all Carbon source files are required to be valid
			
 
				+sequences of Unicode characters, code unit sequences that are not valid UTF-8
			
 
				+can only be produced by `\x` escape sequences.
			
 
				+
			
 
				+The choice to disallow raw tab characters in string literals is _experimental_.
			
 
				+
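As a rough illustration of the escape rules for single-line, non-raw literals, the following Python sketch decodes a literal body into bytes. The names `ESCAPE`, `SIMPLE`, and `decode_escapes` are hypothetical, and the sketch leaves out `\<newline>` and the whitespace restrictions; it is an assumption-laden reading of the table above rather than a reference implementation.

```python
import re

# One alternative per row of the escape table (block-only `\<newline>` omitted).
ESCAPE = re.compile(
    r"""\\(?:
        (?P<simple>[tnr"'\\])
      | (?P<nul>0(?![0-9]))
      | x(?P<hex>[0-9A-F]{2})
      | u\{(?P<code>[0-9A-F]+)\}
    )""",
    re.VERBOSE,
)
SIMPLE = {"t": "\t", "n": "\n", "r": "\r", '"': '"', "'": "'", "\\": "\\"}


def decode_escapes(body):
    """Decode the text between the quotes into a sequence of 8-bit bytes."""
    out = bytearray()
    pos = 0
    while pos < len(body):
        if body[pos] != "\\":
            out += body[pos].encode("utf-8")
            pos += 1
            continue
        match = ESCAPE.match(body, pos)
        if match is None:
            raise ValueError(f"invalid escape sequence at offset {pos}")
        if match["simple"] is not None:
            out += SIMPLE[match["simple"]].encode("utf-8")
        elif match["nul"] is not None:
            out.append(0)
        elif match["hex"] is not None:
            # `\xHH` produces a single code unit, which need not be valid UTF-8.
            out.append(int(match["hex"], 16))
        else:
            code_point = int(match["code"], 16)
            if code_point > 0x10FFFF or 0xD800 <= code_point <= 0xDFFF:
                raise ValueError("code point out of range")
            out += chr(code_point).encode("utf-8")
        pos = match.end()
    return bytes(out)


print(decode_escapes(r"foo\x00123"))  # prints: b'foo\x00123'
```

Under these assumptions `\01`, `\a`, and `\xab` are all rejected: the first because `\0` is followed by a decimal digit, the second because it is not in the table, and the third because hexadecimal digits are case-sensitive.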
			
 
				+```carbon
			
 
				+var String: fret = "I would 'twere something that would fret the string,\n" +
			
 
				+                   "The master-cord on's \u{2764}\u{FE0F}!";
			
 
				+
			
 
				+// This string contains two characters (prior to encoding in UTF-8):
			
 
				+// U+1F3F9 (BOW AND ARROW) followed by U+0032 (DIGIT TWO)
			
 
				+var String: password = "\u{1F3F9}2";
			
 
				+
			
 
				+// This string contains no newline characters.
			
 
				+var String: type_mismatch = """
			
 
				+  Shall I compare thee to a summer's day? Thou art \
			
 
				+  more lovely and more temperate.\
			
 
				+  """;
			
 
				+
			
 
				+var String: trailing_whitespace = """
			
 
				+  This line ends in a space followed by a newline. \n\
			
 
				+      This line starts with four spaces.
			
 
				+  """;
			
 
				+```
			
 
				+
			
 
				+### Raw string literals
			
 
				+
			
 
				+In order to allow strings whose contents include backslashes and double quotes,
			
 
				+the delimiters of string literals can be customized by prefixing the opening
			
 
				+delimiter with _N_ `#` characters. A closing delimiter for such a string is only
			
 
				+recognized if it is followed by _N_ `#` characters, and similarly, escape
			
 
				+sequences in such string literals are recognized only if the `\` is also
			
 
				+followed by _N_ `#` characters. A `\`, `"`, or `"""` not followed by _N_ `#`
			
 
				+characters has no special meaning.
			
 
				+
			
 
				+| Opening delimiter | Escape sequence introducer    | Closing delimiter |
			
 
				+| ----------------- | ----------------------------- | ----------------- |
			
 
				+| `"` / `"""`       | `\` (for example, `\n`)       | `"` / `"""`       |
			
 
				+| `#"` / `#"""`     | `\#` (for example, `\#n`)     | `"#` / `"""#`     |
			
 
				+| `##"` / `##"""`   | `\##` (for example, `\##n`)   | `"##` / `"""##`   |
			
 
				+| `###"` / `###"""` | `\###` (for example, `\###n`) | `"###` / `"""###` |
			
 
				+| ...               | ...                           | ...               |
			
 
				+
			
 
				+For example:
			
 
				+
			
 
				+```carbon
			
 
				+var String: x = #"""
			
 
				+  This is the content of the string. The 'T' is the first character
			
 
				+  of the string.
			
 
				+  """ <-- This is not the end of the string.
			
 
				+  """#;
			
 
				+  // But the preceding line does end the string.
			
 
				+// OK, final character is \
			
 
				+var String: y = #"Hello\"#;
			
 
				+var String: z = ##"Raw strings #"nesting"#"##;
			
 
				+var String: w = #"Tab is expressed as \t. Example: '\#t'"#;
			
 
				+```
			
 
				+
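The delimiter-matching rule can be sketched in Python for the single-line case. The function `split_single_line_literal` is a hypothetical helper that takes a complete literal token and returns the number of `#`s together with the undecoded body; block literals and escape decoding are deliberately left out.

```python
def split_single_line_literal(token):
    """Return (N, body) for a single-line literal such as '#"a"b"#'."""
    hashes = len(token) - len(token.lstrip("#"))
    opening = "#" * hashes + '"'
    closing = '"' + "#" * hashes
    escape = "\\" + "#" * hashes
    if not token.startswith(opening):
        raise ValueError("missing opening delimiter")

    body_start = len(opening)
    pos = body_start
    while pos < len(token):
        if token.startswith(escape, pos):
            # Skip the introducer and the escaped character, so that an
            # escaped quote is not mistaken for the closing delimiter.
            pos += len(escape) + 1
        elif token.startswith(closing, pos):
            if pos + len(closing) != len(token):
                raise ValueError("characters after the closing delimiter")
            return hashes, token[body_start:pos]
        else:
            # A `\` or `"` without N following `#`s has no special meaning.
            pos += 1
    raise ValueError("unterminated string literal")


print(split_single_line_literal('##"Raw strings #"nesting"#"##'))
# prints: (2, 'Raw strings #"nesting"#')
```

With this sketch, `#"Hello\"#` yields a body ending in a backslash, and `#"""#` yields a body consisting of a single `"`.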
			
 
				+Note that both a single-line raw string literal and a multi-line raw string
			
 
				+literal can begin with `#"""`. These cases can be distinguished by the presence
			
 
				+or absence of additional `"`s later in the same line:
			
 
				+
			
 
				+-   In a single-line raw string literal, there must be a `"` and one or more
			
 
				+    `#`s later in the same line terminating the string.
			
 
				+-   In a multi-line raw string literal, the rest of the line is a file type
			
 
				+    indicator, which can contain neither `"` nor `#`.
			
 
				+
			
 
				+```carbon
			
 
				+// This string is a single-line raw string literal.
			
 
				+// The contents of this string start and end with exactly two "s.
			
 
				+var String: ambig1 = #"""This is a raw string literal starting with """#;
			
 
				+
			
 
				+// This string is a block raw string literal with file-type 'This',
			
 
				+// whose contents start with "is a ".
			
 
				+var String: ambig2 = #"""This
			
 
				+  is a block string literal with file type 'This', first character 'i',
			
 
				+  and last character 'X': X\#
			
 
				+  """#;
			
 
				+
			
 
				+// This is a single-line raw string literal, equivalent to "\"".
			
 
				+var String: ambig3 = #"""#;
			
 
				+```
			
 
				+
			
 
				+### Encoding
			
 
				+
			
 
				+A string literal results in a sequence of 8-bit bytes. Like Carbon source files,
			
 
				+string literals are encoded in UTF-8. This proposal includes no mechanism to
			
 
				+request that any other encoding be used. The expectation is that if another
			
 
				+encoding is needed, the string literal can be transcoded from UTF-8 during
			
 
				+compilation. There is no guarantee that the string is valid UTF-8, however,
			
 
				+because arbitrary byte sequences can be inserted by way of `\xHH` escape
			
 
				+sequences.
			
 
				+
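The distinction between "encoded in UTF-8" and "guaranteed valid UTF-8" can be illustrated with a small Python sketch. The byte strings stand in for what the proposed literals `"café"` and `"caf\xE9"` would plausibly produce; the helper `is_valid_utf8` is hypothetical.

```python
def is_valid_utf8(data):
    """Report whether a byte sequence decodes as UTF-8."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


# A literal written with ordinary characters is valid UTF-8.
valid = "caf\u00e9".encode("utf-8")  # bytes 63 61 66 C3 A9

# A `\xHH` escape inserts one raw byte; E9 alone is a truncated sequence.
invalid = b"caf\xe9"

# If another encoding is needed, valid content can be transcoded from UTF-8.
utf16 = valid.decode("utf-8").encode("utf-16-le")
```

In this sketch `is_valid_utf8(valid)` holds while `is_valid_utf8(invalid)` does not, which is the situation a future UTF-8-guaranteeing literal syntax would need to reject.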
			
 
				+This decision is _experimental_, and should be revisited if we find sufficient
			
 
				+motivation for directly expressing string literals in other encodings.
			
 
				+Similarly, as library support for a string type evolves, we should consider
			
 
				+including string literal syntax (perhaps as the default) that guarantees the
			
 
				+string content is a valid UTF-8 encoding, so that valid UTF-8 can be
			
 
				+distinguished from an arbitrary string in the type system. In such string
			
 
				+literals, we should consider rejecting `\xHH` escapes in which HH is greater
			
 
				+than 7F<sub>16</sub>, as in Rust.
			
 
				+
			
 
				+## Alternatives considered
			
 
				+
			
 
				+### Block string literals
			
 
				+
			
 
				+We could avoid including a block string literal in general, and instead
			
 
				+construct multi-line strings by string concatenation, with either C-style
			
 
				+juxtaposition or with an explicit concatenation operator. But doing so would be
			
 
				+more verbose and would move the source code further from the programmer's
			
 
				+intent.
			
 
				+
			
 
				+We could use raw string literals to provide block string literal syntax, as C++
			
 
				+does. However, this couples two orthogonal choices: whether escape sequences
			
 
				+should be recognized and whether the string is intended to span multiple lines.
			
 
				+In C++ code, the inability to use escape sequences in multi-line string literals
			
 
				+is sometimes awkward. For example:
			
 
				+
			
 
				+```c++
			
 
				+std::string make_rule = "%s: %s\n\t$(CC) -c -o $@ $< $(CFLAGS)\n\n"
			
 
				+                        "main:\n\t$(CC) %s -o %s\n";
			
 
				+```
			
 
				+
			
 
				+can be written under this proposal as
			
 
				+
			
 
				+```carbon
			
 
				+var String: make_rule = """make
			
 
				+  %s: %s
			
 
				+  \t$(CC) -c -o $@ $< $(CFLAGS)
			
 
				+
			
 
				+  main:
			
 
				+  \t$(CC) %s -o %s
			
 
				+  """;
			
 
				+```
			
 
				+
			
 
				+improving readability while still making the semantically-meaningful presence of
			
 
				+tabs visible even in editors / code browsers that do not distinguish tabs from
			
 
				+spaces.
			
 
				+
			
 
				+#### Leading whitespace removal
			
 
				+
			
 
				+Block string literals could use explicit characters in the body to indicate the
			
 
				+amount of leading whitespace to be removed:
			
 
				+
			
 
				+```carbon
			
 
				+var String: x = """
			
 
				+  |  starts with two spaces.
			
 
				+  """;
			
 
				+```
			
 
				+
			
 
				+This would allow the correct indentation to be determined as soon as the first
			
 
				+line after the opening `"""` is seen. However, this adds lexical complexity, and
			
 
				+harms the ability to copy-paste string contents into other contexts.
			
 
				+
			
 
				+#### Terminating newline
			
 
				+
			
 
				+We could choose to exclude the trailing newline (like in Swift). Informal
			
 
				+surveys suggest that expectations for whether to include or exclude the trailing
			
 
				+newline vary.
			
 
				+
			
 
				+The intended use case for block string literals is to represent multi-line
			
 
				+strings. When forming a single larger string from concatenation of multi-line
			
 
				+string literals, including the trailing newline but not the leading newline --
			
 
				+or, more generally, including a newline at the end of each line in the string --
			
 
				+provides the best alignment between the source-level syntax and the result. For
			
 
				+example:
			
 
				+
			
 
				+```carbon
			
 
				+fn Run() {
			
 
				+  print("""c++
			
 
				+    class X {
			
 
				+    public:
			
 
				+      X() {}
			
 
				+
			
 
				+    """);
			
 
				+  for (var String: decl in GetMemberDecls()) {
			
 
				+    print(decl);
			
 
				+  }
			
 
				+  print("""c++
			
 
				+
			
 
				+    private:
			
 
				+    """);
			
 
				+  print(GetFields());
			
 
				+  print("""c++
			
 
				+    };
			
 
				+    """);
			
 
				+}
			
 
				+```
			
 
				+
			
 
				+As is, the output printed by this example can be visualized by ignoring all
			
 
				+lines other than those between the `"""`s. If we excluded the trailing newline,
			
 
				+additional blank lines would be required at the end of each string literal,
			
 
				+harming the readability of the example.
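
The trade-off above can be sketched in Python rather than Carbon. The helper `block_value` is hypothetical and models only the newline rule under discussion, not indentation removal or escape processing:

```python
def block_value(lines, trailing_newline=True):
    """Join the body lines of a block string literal (illustrative model only).

    With trailing_newline=True every line, including the last, ends in a
    newline. With trailing_newline=False the final newline is dropped, which
    models the alternative of excluding the trailing newline.
    """
    if not lines:
        return ""
    text = "\n".join(lines)
    return text + "\n" if trailing_newline else text


# Including the trailing newline: adjacent literals concatenate line by line.
with_newline = block_value(["class X {", "public:"]) + block_value(["};"])

# Excluding it: the last line of one literal runs into the first line of the next.
without_newline = (
    block_value(["class X {", "public:"], trailing_newline=False)
    + block_value(["};"], trailing_newline=False)
)
```

In this model, the second form only produces separate lines if each literal's body ends with an extra empty line, which is the readability cost described above.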
			
 
				+
			
 
				+### Escape sequences
			
 
				+
			
 
				+We could support octal escape sequences, as many C family languages do. However,
			
 
				+they are considered antiquated in C++ code, and supporting them would be
			
 
				+inconsistent with our decision to not support octal numeric literals. A quick
			
 
				+informal poll suggests that many C++ programmers do not realize that `\123` is
			
 
				+an octal escape sequence, not a decimal one.
			
 
				+
			
 
				+We could support `\123` as a decimal escape sequence. However, doing so may lead
			
 
				+to surprise when migrating C++ code to Carbon. This possibility should be
			
 
				+revisited once Carbon matures and we have a better idea of how the migration
			
 
				+process is expected to proceed.
			
 
				+
			
 
				+We could allow arbitrary-length `\x` escape sequences, as C++ does, and include
			
 
				+some explicit mechanism to terminate such a sequence. For example, we could
			
 
				+treat `\<whitespace>` for an arbitrary whitespace character in the same way we
			
 
				+treat `\<newline>`, and use `"\xAB\ C"` to terminate an escape sequence
			
 
				+prematurely. However, this is unnecessarily inventive, and the Python approach
			
 
				+of requiring exactly two hexadecimal characters after `\x` is adequate, assuming
			
 
				+we do not intend to support string literal element types other than 8-bit bytes.
			
 
				+If we do find we want to support wider element types in future (for example, if
			
 
				+we want to add a UTF-16 or UTF-32 string literal), `\x{ABCD}` can be used.
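
The difference between the two `\x` rules can be sketched in Python rather than Carbon. Both function names are hypothetical, and accepting lowercase hex digits is an assumption of this sketch, not something the proposal states here:

```python
import string

HEX_DIGITS = set(string.hexdigits)


def expand_hex_fixed(text):
    """Expand `\\x` escapes that take exactly two hex digits; returns bytes."""
    out = bytearray()
    i = 0
    while i < len(text):
        if text.startswith("\\x", i):
            digits = text[i + 2 : i + 4]
            if len(digits) != 2 or not set(digits) <= HEX_DIGITS:
                raise ValueError(f"expected two hex digits at offset {i}")
            out.append(int(digits, 16))
            i += 4
        else:
            out += text[i].encode("utf-8")
            i += 1
    return bytes(out)


def expand_hex_greedy(text):
    """Expand `\\x` escapes that consume every following hex digit.

    Returns a list of code unit values, because a greedy escape can exceed
    the range of an 8-bit byte.
    """
    units = []
    i = 0
    while i < len(text):
        if text.startswith("\\x", i):
            j = i + 2
            while j < len(text) and text[j] in HEX_DIGITS:
                j += 1
            if j == i + 2:
                raise ValueError(f"expected hex digits at offset {i}")
            units.append(int(text[i + 2 : j], 16))
            i = j
        else:
            units.extend(text[i].encode("utf-8"))
            i += 1
    return units
```

For the input `\xABC`, the fixed-width rule yields the byte 0xAB followed by the letter `C`, while the greedy rule reads a single value 0xABC. That ambiguity is why the arbitrary-length form would need an explicit terminator.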
			
 
				+
			
 
				+We could permit `\<newline>` even in non-block string literals to allow them to
			
 
				+be line-wrapped, but there seems to be little benefit to doing so, as a block
			
 
				+string literal can always be used instead.
			
 
				+
			
 
				+We could adopt Python's `\N{unicode character name}` syntax. But there is no
			
 
				+pressing need for such a syntax, and concerns have been raised
			
 
				+both over the exact ways in which characters are named and over compatibility
			
 
				+with upcoming C++ language extensions in this area, so this syntax is not being
			
 
				+proposed at this time.
			
 
				+
			
 
				+We could allow an `\e` escape sequence for the U+001B ESCAPE character. This
			
 
				+escape sequence is a common extension in C and C++ compilers, and appears primarily to
			
 
				+be used to hardcode ANSI terminal escape sequences, such as
			
 
				+`"\e[32mgreen text\e[0m"`. This proposal doesn't explicitly reject this idea.
			
 
				+However, if we consider adopting such an extension in the future, we should
			
 
				+consider whether a library facility for simple terminal operations would be a
			
 
				+more valuable addition than this escape sequence.
			
 
				+
			
 
				+We could retain the `\uNNNN` escape sequence to give a terser notation for the
			
 
				+common case of a Unicode code point that is most naturally written as four
			
 
				+hexadecimal digits -- that is, all code points in the Basic Multilingual Plane.
			
 
				+This is the approach taken by JavaScript. However, following Swift and Rust in
			
 
				+permitting only `\u{NNNN}` is simpler and avoids redundancy. We expect explicit
			
 
				+`\u` escapes to be rare: we expect regular, printable Unicode characters that
			
 
				+are normalized in NFC to be written directly in the source file, rather than
			
 
				+spelled with `\u` escapes. Explicit `\u` escapes would be useful where the code
			
 
				+point value is important -- for example, in test data -- or where special
			
 
				+characters such as directionality markers or non-normalized characters are
			
 
				+desired, but for such uses, the longer `\u{NNNN}` form seems adequate.
			
 
				+
			
 
				+### Raw string literals
			
 
				+
			
 
				+The approach to raw string literals in this proposal is based on Swift's raw
			
 
				+strings facility.
			
 
				+
			
 
				+We could use a mechanism other than a sequence of `#`s to support
			
 
				+nesting raw string literals. For example, we could adopt something like C++'s
			
 
				+semi-arbitrary delimiters `R"foo(string contents)foo"`. However, this level of
			
 
				+customizability seems unwarranted: raw string literals are unlikely to nest more
			
 
				+than one or two levels deep, so using `#"..."#`, `##"..."##`, `###"..."###` for
			
 
				+successive nesting levels seems unproblematic, and removes the need for the
			
 
				+programmer to make an arbitrary choice.
			
 
				+
			
 
				+We could use a delimiter other than `#` to demarcate raw string literals. At
			
 
				+this stage in Carbon's development, we don't know exactly which characters will
			
 
				+be useful in operators, but it seems reasonable to assume a mostly C++-like
			
 
				+operator set, which gives us a variety of characters that cannot appear
			
 
				+immediately before a string literal: at least `@` `#` `$` `)` `]` `}` `\` `.`
			
 
				+all appear likely to be available. Of these, `\` is likely to be problematic due
			
 
				+to its use in escape sequences, closing brackets followed by string literals
			
 
				+might one day be useful in some grammar constructs, and `.` seems a little too
			
 
				+close to resembling a designator. That leaves `@`, `#`, and `$`, and multiple
			
 
				+existing languages have used `#` for this purpose.
			
 
				+
			
 
				+We could disallow use of _N_ `#`s as a delimiter if a lower value of _N_ would
			
 
				+work. However, this would make the language brittle under maintenance: removing
			
 
				+the last nested string literal from a quoted block of code would require
			
 
				+changing the delimiters.
			
 
				+
			
 
				+We could use Rust-style raw strings, which add a leading `r`, permit zero `#`s
			
 
				+to be used in raw strings, and do not provide a facility for escape sequences in
			
 
				+raw strings. There are several reasons to prefer Swift-style raw strings:
			
 
				+
			
 
				+-   Swift raw strings are not a distinct language feature; rather, they are a
			
 
				+    generalization of non-raw strings -- non-raw strings are simply the case
			
 
				+    where the number of `#` characters in the delimiters and escape sequence
			
 
				+    introducer is zero.
			
 
				+-   Permitting escape sequences even in raw strings means that there is no loss
			
 
				+    of functionality when using a raw string, and code changes under maintenance
			
 
				+    that would require use of a facility only available by way of an escape
			
 
				+    sequence (such as inclusion of trailing whitespace in a block string
			
 
				+    literal) do not force a reversion to a non-raw string literal.
			
 
				+
			
 
				+There are also several reasons to prefer the Rust-style approach:
			
 
				+
			
 
				+-   In the most common case, a Rust raw string will be one character shorter.
			
 
				+-   Less lexical space is used: Swift-style raw strings remove the possibility
			
 
				+    of using `#` as a prefix operator, whereas the leading `r` in Rust-style raw
			
 
				+    strings does not.
			
 
				+-   Certain character sequences are hard to express in Swift-style raw strings.
			
 
				+    Specifically, string literals such as `"\\################"` cannot readily
			
 
				+    be expressed as a raw string. Such string literals are extremely rare, but
			
 
				+    not nonexistent, in one large sample C++ corpus.
			
 
				+
			
 
				+If the final issue is concerning, we have a path to address it, by specifying
			
 
				+that `\` followed by <i>N</i>+1 or more `#`s is left alone, just like `\`
			
 
				+followed by <i>N</i>-1 or fewer `#`s is left alone. Formally, this can be
			
 
				+accomplished by defining `\#` as an escape sequence that expands to itself, that
			
 
				+is, to a backslash followed by <i>N</i>+1 `#` characters.
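
The escape-introducer rule discussed above can be sketched in Python rather than Carbon. The function `expand_raw` and its small escape table are hypothetical. The sketch models the rule as proposed, where `\` followed by fewer than _N_ `#`s is left alone, and does not implement the possible `\#` extension:

```python
# Escapes recognized after the introducer. A literal newline after the
# introducer is treated as a line continuation and expands to nothing.
SIMPLE_ESCAPES = {"n": "\n", "t": "\t", '"': '"', "\\": "\\", "\n": ""}


def expand_raw(body, hashes):
    """Expand escapes in the body of a string literal delimited by `hashes` #s.

    The escape introducer is a backslash followed by exactly `hashes` `#`
    characters; hashes=0 models an ordinary non-raw string literal.
    """
    introducer = "\\" + "#" * hashes
    out = []
    i = 0
    while i < len(body):
        if body.startswith(introducer, i):
            j = i + len(introducer)
            code = body[j : j + 1]
            if code not in SIMPLE_ESCAPES:
                raise ValueError(f"invalid escape at offset {i}")
            out.append(SIMPLE_ESCAPES[code])
            i = j + 1
        else:
            # A backslash with fewer than `hashes` #s is ordinary content.
            out.append(body[i])
            i += 1
    return "".join(out)
```

With `hashes=1`, the body `a\tb` is left unchanged and `a\#tb` contains a tab. A backslash followed by three `#`s is rejected, because `\#` is read as the introducer and the next `#` is not a valid escape. This is the case the final bullet above describes as hard to express.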
			
 
				+
			
 
				+#### Trailing whitespace
			
 
				+
			
 
				+We could preserve trailing whitespace in at least raw block string literals, and
			
 
				+perhaps in all block string literals. However, this would mean that visually
			
 
				+identical programs could have different meanings, and even that transformations
			
 
				+performed automatically on save by some editors (removing trailing whitespace)
			
 
				+could change the meaning of a program. It might also mean that raw string
			
 
				+literals are no longer a generalization of non-raw string literals.
			
 
				+
			
 
				+Under this proposal, trailing whitespace can be included in a block string
			
 
				+literal by following it with `\n\`:
			
 
				+
			
 
				+```carbon
			
 
				+var String: authors = """markdown
			
 
				+  *Authors*:  \n\
			
 
				+  Me <me@example.com>  \n\
			
 
				+  Someone Else <them@example.com>
			
 
				+
			
 
				+""";
			
 
				+```
			
 
				+
			
 
				+In a single-`#` raw string literal, the same can be accomplished with the
			
 
				+more-verbose terminator `\#n\#`, and so on.
			
 
				+
			
 
				+#### Line separators
			
 
				+
			
 
				+Raw block string literals could preserve the form of vertical whitespace used to
			
 
				+terminate each line. This would allow uncommon forms of vertical whitespace (for
			
 
				+example, vertical tab and form feed) to be included in raw string literals, but
			
 
				+would create a risk that the meaning of a program would be different when the
			
 
				+source code is checked out on an operating system that uses line feed as a line
			
 
				+terminator versus when the source code is checked out on an operating system
			
 
				+that uses a different line terminator (such as carriage return followed by line
			
 
				+feed). This would also mean that raw string literals are no longer a
			
 
				+generalization of non-raw string literals.
			
 
				+
			
 
				+### Internal whitespace
			
 
				+
			
 
				+We could allow raw tab characters in string literals. However, raw tab
			
 
				+characters harm the readability of the program, and we would like to encourage
			
 
				+the use of `\t` escapes instead in situations where they are available, even if
			
 
				+this means that the more verbose form `\#t` needs to be used in raw string
			
 
				+literals.