|
|
@@ -0,0 +1,723 @@
|
|
|
+# String literals
|
|
|
+
|
|
|
+<!--
|
|
|
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
|
|
|
+Exceptions. See /LICENSE for license information.
|
|
|
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
+-->
|
|
|
+
|
|
|
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/199)
|
|
|
+
|
|
|
+<!-- toc -->
|
|
|
+
|
|
|
+## Table of contents
|
|
|
+
|
|
|
+- [Problem](#problem)
|
|
|
+- [Background](#background)
|
|
|
+ - [Existing practice](#existing-practice)
|
|
|
+- [Proposal](#proposal)
|
|
|
+- [Details](#details)
|
|
|
+ - [Non-raw string literals](#non-raw-string-literals)
|
|
|
+ - [Escape sequences](#escape-sequences)
|
|
|
+ - [Raw string literals](#raw-string-literals)
|
|
|
+ - [Encoding](#encoding)
|
|
|
+- [Alternatives considered](#alternatives-considered)
|
|
|
+ - [Block string literals](#block-string-literals)
|
|
|
+ - [Leading whitespace removal](#leading-whitespace-removal)
|
|
|
+ - [Terminating newline](#terminating-newline)
|
|
|
+ - [Escape sequences](#escape-sequences-1)
|
|
|
+ - [Raw string literals](#raw-string-literals-1)
|
|
|
+ - [Trailing whitespace](#trailing-whitespace)
|
|
|
+ - [Line separators](#line-separators)
|
|
|
+ - [Internal whitespace](#internal-whitespace)
|
|
|
+
|
|
|
+<!-- tocstop -->
|
|
|
+
|
|
|
+## Problem
|
|
|
+
|
|
|
+This proposal specifies lexical rules for constant strings in Carbon.
|
|
|
+
|
|
|
+## Background
|
|
|
+
|
|
|
+We wish to provide a syntax for writing literals containing human-readable text.
|
|
|
+
|
|
|
+Note that "human-readable text" here should be understood broadly: such text may
|
|
|
+be subject to further processing, and may in some cases be intended to be
|
|
|
+interpreted by a computer rather than by a human (such as a regular expression,
|
|
|
+program source code, or a C++ mangled name), but broadly represents a sequence
|
|
|
+of characters rather than arbitrary binary data.
|
|
|
+
|
|
|
+Such text is typically represented in an _encoding_, which is a bidirectional
|
|
|
+mapping between a sequence of characters in text and a sequence of bounded
|
|
|
+integer values known as _code units_, suitable for storage, transmission, and
|
|
|
+processing. For example, the Russian word углерод (carbon) is encoded in the
|
|
|
+UTF-8 encoding as D1<sub>16</sub>83<sub>16 </sub>D0<sub>16</sub>B3<sub>16
|
|
|
+</sub>D0<sub>16</sub>BB<sub>16</sub> D0<sub>16</sub>B5<sub>16</sub>
|
|
|
+D1<sub>16</sub>80<sub>16</sub> D0<sub>16</sub>BE<sub>16</sub>
|
|
|
+D0<sub>16</sub>B4<sub>16</sub>.
|
|
|
+
|
|
|
+### Existing practice
|
|
|
+
|
|
|
+See
|
|
|
+[Comparison of programming languages (strings) on Wikipedia](https://en.wikipedia.org/wiki/Comparison_of_programming_languages_%28strings%29).
|
|
|
+
|
|
|
+Simple string literals are specified in most programming languages as text
|
|
|
+delimited by double-quote characters, `"like this"`. Such string literals
|
|
|
+usually are restricted to begin and end on the same source line. Three
|
|
|
+additional features are commonly seen:
|
|
|
+
|
|
|
+- Escape sequences, which permit string literals to include characters that
|
|
|
+ are difficult to type, are ambiguous for the reader, or that would be
|
|
|
+ problematic in some way (such as whitespace characters, characters that are
|
|
|
+ invisible, and characters that change how other characters are rendered),
|
|
|
+ and also to include arbitrary code units. One common convention is to use
|
|
|
+ `\` to introduce an escape sequence, where, for example:
|
|
|
+
|
|
|
+ - `\n` represents a newline character,
|
|
|
+ - `\u1234` represents the Unicode character U+1234,
|
|
|
+ - `\xAB` represents the code unit AB<sub>16</sub>,
|
|
|
+ - `\"` represents a single `"` character and does not terminate the string
|
|
|
+ literal,
|
|
|
+ - `\\` represents a single `\` character,
|
|
|
+
|
|
|
+ and so on.
|
|
|
+
|
|
|
+ The set of single-letter escape sequences has a lot of commonality between
|
|
|
+ languages, with some variation between older and newer languages:
|
|
|
+
|
|
|
+ - C++ and Python allow `\a`, `\b`, `\f`, `\n`, `\r`, `\t`, `\v` for bell,
|
|
|
+ backspace, form feed, new line, carriage return, tab, and vertical tab,
|
|
|
+ respectively.
|
|
|
+ - JavaScript drops support for `\a`.
|
|
|
+ - Java additionally drops support for `\v`.
|
|
|
+ - Rust and Swift additionally drop support for `\b` and `\f`, leaving only
|
|
|
+ `\n`, `\r`, and `\t`.
|
|
|
+
|
|
|
+ The rules for numeric escape sequences differ between kinds of escape
|
|
|
+ sequence and between languages. The rules in C++, JavaScript, Python, Rust,
|
|
|
+ and Swift are as follows:
|
|
|
+
|
|
|
+ - `\123` is interpreted as a octal code unit value, and up to three octal
|
|
|
+ digits are consumed. In JavaScript and C++, values greater than
|
|
|
+ 377<sub>8</sub> (255<sub>10</sub>) are invalid (assuming an 8-bit
|
|
|
+ character type). In Python, values greater than 377<sub>8</sub> are
|
|
|
+ interpreted modulo 256. Rust and Swift do not allow octal escapes in
|
|
|
+ general, but do allow `\0` as a special case.
|
|
|
+ - `\xAB` is interpreted as a hexadecimal code unit value. In C++, any
|
|
|
+ nonzero number of hexadecimal digits can follow as part of the escape
|
|
|
+ sequence. In Python, JavaScript, and Rust, exactly two digits are
|
|
|
+ required. In Rust, the value must be less than or equal to
|
|
|
+ 7F<sub>16</sub> except for `b`-prefixed strings (byte array literals).
|
|
|
+ Swift does not support this form of escape sequence.
|
|
|
+ - `\uABCD` is interpreted as a hexadecimal code point value. In C++,
|
|
|
+ Python, and JavaScript, exactly four hexadecimal digits can follow, but
|
|
|
+ JavaScript allows any nonzero number of digits to be specified using
|
|
|
+ `\u{ABCDE}` notation. Rust and Swift support only the `\u{ABCDE}`
|
|
|
+ notation.
|
|
|
+ - `\U0010FFFD` is interpreted as a hexadecimal code point value in C++ and
|
|
|
+ Python, but not in JavaScript, Rust, or Swift, and permits exactly eight
|
|
|
+ hexadecimal digits.
|
|
|
+
|
|
|
+- Raw string literals, in which escape sequences are not recognized. These are
|
|
|
+ often used in situations where escape sequences are undesirable, but in
|
|
|
+ which the escape character or regular string terminator is used frequently.
|
|
|
+ Such literals are useful when embedding one machine-readable language in
|
|
|
+ another, when those languages share some escaping conventions. Such
|
|
|
+ functionality may also provide a way to customize the string delimiters.
|
|
|
+
|
|
|
+ - In Python, raw string literals have an `r` prefix: `r"li\ngo"` is a six
|
|
|
+ character string whose third character is `\`.
|
|
|
+ - In C++, raw string literals have an `r` prefix, along with a custom
|
|
|
+ delimiter (which may be empty): `r"DELIM(li\ngo)DELIM"` is a six
|
|
|
+ character string (plus a nul terminator).
|
|
|
+ - In Rust, raw string literals have an `r` prefix and any matching number
|
|
|
+ of `#` characters enclose the string contents: the third character of
|
|
|
+ `r"lo\ng"` is `\`, and the second character of `r#" " "#` is `"`.
|
|
|
+ - In Swift, raw string literals are a generalization of regular string
|
|
|
+ literals: literals are introduced by any number, _N_, of `#` characters
|
|
|
+ followed by a `"`, terminated by `"` followed by _N_ `#` characters, and
|
|
|
+ escape sequences are introduced by a `\` followed by _N_ `#` characters:
|
|
|
+ `#" " # \n \#n \#\# "#` has the same contents as the C++ string literal
|
|
|
+ `" \" # \\n \n \\# "`.
|
|
|
+ - In Java, raw string literals are delimited by a sequence of one or more
|
|
|
+ backticks instead of double quotes: the fourth character of
|
|
|
+ <code>\`\`foo\`\bar\`\`</code> is a backtick and the fifth is a
|
|
|
+ backslash.
|
|
|
+
|
|
|
+- Multiline string literals provide a mechanism for a string to easily span
|
|
|
+ more than one line of source text.
|
|
|
+
|
|
|
+ - In C++, Rust, and Java, raw string literals are used to represent
|
|
|
+ multiline string literals.
|
|
|
+ - In Python, different delimiters (`"""` or, in Python, `'''` instead of
|
|
|
+ `"` or `'`) are used to represent multiline string literals, and plain
|
|
|
+ `"` and `""` can thereby appear in the string contents, but these
|
|
|
+ literals otherwise behave the same as regular string literals.
|
|
|
+ - In Swift, different delimiters are used (`"""` instead of `"`), but
|
|
|
+ unlike in Python, the string content cannot be on the same line as the
|
|
|
+ delimiters, and the resulting mandatory leading and trailing newlines
|
|
|
+ are not included in the string. Internal newlines can be removed by
|
|
|
+ preceding them with backslashes.
|
|
|
+ - In JavaScript, backtick-delimited strings can contain newlines; this
|
|
|
+ syntax also allows string interpolation as described below.
|
|
|
+
|
|
|
+In addition, some languages, primarily scripting languages, also provide a
|
|
|
+mechanism for string interpolation, wherein a string value is formed by
|
|
|
+including the formatted values of some variables in a given format string. For
|
|
|
+example, `"Hello, $planet."` might produce a string value including the
|
|
|
+formatted value of the variable named `planet`. Such interpolation facilities
|
|
|
+are outside the scope of this proposal.
|
|
|
+
|
|
|
+## Proposal
|
|
|
+
|
|
|
+- Single-line string literals are delimited by `"`s: `"hello"`
|
|
|
+
|
|
|
+- Multi-line string literals are introduced by a `"""` followed by a newline
|
|
|
+ and terminated by a line beginning with a `"""`. The indentation of the
|
|
|
+ terminating line is removed from all preceding lines:
|
|
|
+
|
|
|
+ ```carbon
|
|
|
+ var String: henry_vi = """
|
|
|
+ The winds grow high; so do your stomachs, lords.
|
|
|
+ How irksome is this music to my heart!
|
|
|
+ When such strings jar, what hope of harmony?
|
|
|
+ I pray, my lords, let me compound this strife.
|
|
|
+ -- History of Henry VI, Part II, Act II, Scene 1, W. Shakespeare
|
|
|
+ """;
|
|
|
+ ```
|
|
|
+
|
|
|
+ Only the final line of this string literal begins with whitespace. The
|
|
|
+ opening newline is not part of the string's contents, but the trailing
|
|
|
+ newline is; the first character of this example string is `T` and the last
|
|
|
+ character is a newline.
|
|
|
+
|
|
|
+- The opening `"""` of a multi-line string literal can be followed by a _file
|
|
|
+ type indicator_, to assist tooling in understanding the intent of the
|
|
|
+ string. This indicator has no effect on the meaning of the program.
|
|
|
+
|
|
|
+ ```carbon
|
|
|
+ var String: cpp_snippet = """c++
|
|
|
+ #include <iostream>
|
|
|
+
|
|
|
+ int main() {
|
|
|
+ std::cout << "hello world" << std::endl;
|
|
|
+ }
|
|
|
+ """;
|
|
|
+ ```
|
|
|
+
|
|
|
+- Escape sequences are introduced with a `\` character; the most common C and
|
|
|
+ C++ escape sequences are supported: `"hello\nworld"`. Octal escapes
|
|
|
+ (`\177`), `\a`, `\b`, `\f` and `\v` are removed. `\uNNNN` and `\U00NNNNNN`
|
|
|
+ are replaced by `\u{NNNNNN}`. An escape sequence `\<newline>` is permitted
|
|
|
+ in multi-line string literals, and results in no string contents.
|
|
|
+
|
|
|
+- Raw string literals are supported, for both the single and multi-line case,
|
|
|
+ following the Swift convention: they are introduced by prefixing the opening
|
|
|
+ delimiter with one or more `#`s, and suffixing the closing delimiter with a
|
|
|
+ matching number of `#`s: `#"foo\s*bar"#` or `#"foo"bar"#`. Escape sequences
|
|
|
+ can be introduced in a raw string literal by inserting a matching number of
|
|
|
+ `#`s after the `\` character: `#"foo\#nbar"#` contains a newline character.
|
|
|
+
|
|
|
+- Unlike in C and C++, adjacent string literals are not implicitly
|
|
|
+ concatenated.
|
|
|
+
|
|
|
+## Details
|
|
|
+
|
|
|
+### Non-raw string literals
|
|
|
+
|
|
|
+A _simple string literal_ is formed of a sequence of
|
|
|
+
|
|
|
+- characters other than backslashes, double quotation marks, and vertical
|
|
|
+ whitespace
|
|
|
+- [escape sequences](#escape-sequences)
|
|
|
+
|
|
|
+enclosed in double quotation marks (`"`). Each escape sequence is replaced with
|
|
|
+the corresponding character sequence or code unit sequence.
|
|
|
+
|
|
|
+```carbon
|
|
|
+var String: lucius = "The strings, my lord, are false.";
|
|
|
+```
|
|
|
+
|
|
|
+A _block string literal_ starts with three double quotation marks, followed by
|
|
|
+an optional file type indicator, followed by a newline, and ends at the next
|
|
|
+instance of three double quotation marks whose first `"` is not part of a `\"`
|
|
|
+escape sequence. The closing `"""` shall be the first non-whitespace characters
|
|
|
+on that line. The lines between the opening line and the closing line
|
|
|
+(exclusive) are _content lines_. The content lines shall not contain `\`
|
|
|
+characters that do not form part of an escape sequence.
|
|
|
+
|
|
|
+The _indentation_ of a block string literal is the sequence of horizontal
|
|
|
+whitespace preceding the closing `"""`. Each non-empty content line shall begin
|
|
|
+with the indentation of the string literal. The content of the literal is formed
|
|
|
+as follows:
|
|
|
+
|
|
|
+- The indentation of the closing line is removed from each non-empty content
|
|
|
+ line.
|
|
|
+- All trailing whitespace on each line, including the line terminator, is
|
|
|
+ replaced with a single line feed (U+000A) character.
|
|
|
+- The resulting lines are concatenated.
|
|
|
+- Each [escape sequence](#escape-sequences) is replaced with the corresponding
|
|
|
+ character sequence or code unit sequence.
|
|
|
+
|
|
|
+A content line is considered empty if it contains only whitespace characters.
|
|
|
+
|
|
|
+```carbon
|
|
|
+var String: w = """
|
|
|
+ This is a string literal. Its first character is 'T' and its last character is
|
|
|
+ a newline character. It contains another newline between 'is' and 'a'.
|
|
|
+ """;
|
|
|
+
|
|
|
+// This string literal is invalid because the """ after 'closing' terminates
|
|
|
+// the literal, but is not at the start of the line.
|
|
|
+var String: invalid = """
|
|
|
+ error: closing """ is not on its own line.
|
|
|
+ """;
|
|
|
+```
|
|
|
+
|
|
|
+A _file type indicator_ is any sequence of non-whitespace characters other than
|
|
|
+`"` or `#`. The file type indicator has no semantic meaning to the Carbon
|
|
|
+compiler, but some file type indicators are understood by the language tooling
|
|
|
+(for example, syntax highlighter, code formatter) as indicating the structure of
|
|
|
+the string literal's content.
|
|
|
+
|
|
|
+```carbon
|
|
|
+// This is a block string literal. Its first two characters are spaces, and its
|
|
|
+// last character is a line feed. It has a file type of 'c++'.
|
|
|
+var String: starts_with_whitespace = """c++
|
|
|
+ int x = 1; // This line starts with two spaces.
|
|
|
+ int y = 2; // This line starts with two spaces.
|
|
|
+ """;
|
|
|
+```
|
|
|
+
|
|
|
+The file type indicator might contain semantic information beyond the file type
|
|
|
+itself, such as instructions to the code formatter to disable formatting for the
|
|
|
+code block.
|
|
|
+
|
|
|
+**Open question:** This proposal does not suggest any concrete set of recognized
|
|
|
+file type indicators. It would be useful to informally specify a set of
|
|
|
+well-known indicators, so that tools have a common understanding of what those
|
|
|
+indicators mean, perhaps in a best practices guide.
|
|
|
+
|
|
|
+#### Escape sequences
|
|
|
+
|
|
|
+Within a string literal, the following escape sequences are recognized:
|
|
|
+
|
|
|
+| Escape | Meaning |
|
|
|
+| ------------- | -------------------------------------------------------- |
|
|
|
+| `\t` | U+0009 CHARACTER TABULATION |
|
|
|
+| `\n` | U+000A LINE FEED |
|
|
|
+| `\r` | U+000D CARRIAGE RETURN |
|
|
|
+| `\"` | U+0022 QUOTATION MARK (`"`) |
|
|
|
+| `\'` | U+0027 APOSTROPHE (`'`) |
|
|
|
+| `\\` | U+005C REVERSE SOLIDUS (`\`) |
|
|
|
+| `\0` | Code unit with value 0 |
|
|
|
+| `\xHH` | Code unit with value HH<sub>16</sub> |
|
|
|
+| `\u{HHHH...}` | Unicode code point U+HHHH... |
|
|
|
+| `\<newline>` | No string literal content produced (block literals only) |
|
|
|
+
|
|
|
+This includes all C++ escape sequences except:
|
|
|
+
|
|
|
+- `\?`, which was historically used to escape trigraphs in string literals,
|
|
|
+ and no longer serves any purpose.
|
|
|
+- `\ooo` octal escapes, which are removed because Carbon does not support
|
|
|
+ octal literals; `\0` is retained as a special case, which is expected to be
|
|
|
+ important for C interoperability.
|
|
|
+- `\uABCD`, which is replaced by `\u{ABCD}`.
|
|
|
+- `\U0010FFFF`, which is replaced by `\u{10FFFF}`.
|
|
|
+- `\a` (bell), `\b` (backspace), `\v` (vertical tab), and `\f` (form feed).
|
|
|
+ `\a` and `\b` are obsolescent, and `\f` and `\v` are largely obsolete. These
|
|
|
+ characters can be expressed with `\x07`, `\x08`, `\x0B`, and `\x0C`
|
|
|
+ respectively if needed.
|
|
|
+
|
|
|
+Note that this is the same set of escape sequences supported by
|
|
|
+[Swift](https://docs.swift.org/swift-book/LanguageGuide/StringsAndCharacters.html#ID295)
|
|
|
+and [Rust](https://doc.rust-lang.org/reference/tokens.html), except that, unlike
|
|
|
+in Swift, support for `\xHH` is provided.
|
|
|
+
|
|
|
+While this proposal takes a firm stance on not permitting octal escape
|
|
|
+sequences, the decision to not allow `\1`..`\7`, and more generally to not treat
|
|
|
+`\DDDD` as a decimal escape sequence, is _experimental_.
|
|
|
+
|
|
|
+In the above table, `H` represents an arbitrary hexadecimal character, `0`-`9`
|
|
|
+or `A`-`F` (case-sensitive). Unlike in C++, but like in Python, `\x` expects
|
|
|
+exactly two hexadecimal digits. As in JavaScript, Rust, and Swift, Unicode code
|
|
|
+points can be expressed by number using `\u{10FFFF}` notation, which accepts any
|
|
|
+number of hexadecimal characters. Any numeric code point in the ranges
|
|
|
+0<sub>16</sub>-D7FF<sub>16</sub> or E000<sub>16</sub>-10FFFF<sub>16</sub> can be
|
|
|
+expressed this way.
|
|
|
+
|
|
|
+_Open question:_ Some programming languages (notably Python) support a
|
|
|
+`\N{unicode character name}` syntax. We could add such an escape sequence, but
|
|
|
+this proposal does not include one. Future proposals considering adding such
|
|
|
+support should pay attention to work done by C++'s Unicode study group in this
|
|
|
+area.
|
|
|
+
|
|
|
+The escape sequence `\0` shall not be followed by a decimal digit. In cases
|
|
|
+where a null byte should be followed by a decimal digit, `\x00` can be used
|
|
|
+instead: `"foo\x00123"`. The intent is to preserve the possibility of permitting
|
|
|
+decimal escape sequences in the future.
|
|
|
+
|
|
|
+A backslash followed by a line feed character is an escape sequence that
|
|
|
+produces no string contents. This escape sequence is _experimental_, and can
|
|
|
+only appear in multi-line string literals. This escape sequence is processed
|
|
|
+after trailing whitespace is replaced by a line feed character, so a `\`
|
|
|
+followed by horizontal whitespace followed by a line terminator removes the
|
|
|
+whitespace up to and including the line terminator. Unlike in Rust, but like in
|
|
|
+Swift, leading whitespace on the line after an escaped newline is not removed,
|
|
|
+other than whitespace that matches the indentation of the terminating `"""`.
|
|
|
+
|
|
|
+A character sequence starting with a backslash that doesn't match any known
|
|
|
+escape sequence is invalid. Whitespace characters other than space and, for
|
|
|
+block string literals, new line optionally preceded by carriage return are
|
|
|
+disallowed. All other characters (including non-printable characters) are
|
|
|
+preserved verbatim. Because all Carbon source files are required to be valid
|
|
|
+sequences of Unicode characters, code unit sequences that are not valid UTF-8
|
|
|
+can only be produced by `\x` escape sequences.
|
|
|
+
|
|
|
+The choice to disallow raw tab characters in string literals is _experimental_.
|
|
|
+
|
|
|
+```carbon
|
|
|
+var String: fret = "I would 'twere something that would fret the string,\n" +
|
|
|
+ "The master-cord on's \u{2764}\u{FE0F}!";
|
|
|
+
|
|
|
+// This string contains two characters (prior to encoding in UTF-8):
|
|
|
+// U+1F3F9 (BOW AND ARROW) followed by U+0032 (DIGIT TWO)
|
|
|
+var String: password = "\u{1F3F9}2";
|
|
|
+
|
|
|
+// This string contains no newline characters.
|
|
|
+var String: type_mismatch = """
|
|
|
+ Shall I compare thee to a summer's day? Thou art \
|
|
|
+ more lovely and more temperate.\
|
|
|
+ """;
|
|
|
+
|
|
|
+var String: trailing_whitespace = """
|
|
|
+ This line ends in a space followed by a newline. \n\
|
|
|
+ This line starts with four spaces.
|
|
|
+ """;
|
|
|
+```
|
|
|
+
|
|
|
+### Raw string literals
|
|
|
+
|
|
|
+In order to allow strings whose contents include backslashes and double quotes,
|
|
|
+the delimiters of string literals can be customized by prefixing the opening
|
|
|
+delimiter with _N_ `#` characters. A closing delimiter for such a string is only
|
|
|
+recognized if it is followed by _N_ `#` characters, and similarly, escape
|
|
|
+sequences in such string literals are recognized only if the `\` is also
|
|
|
+followed by _N_ `#` characters. A `\`, `"`, or `"""` not followed by _N_ `#`
|
|
|
+characters has no special meaning.
|
|
|
+
|
|
|
+| Opening delimiter | Escape sequence introducer | Closing delimiter |
|
|
|
+| ----------------- | ----------------------------- | ----------------- |
|
|
|
+| `"` / `"""` | `\` (for example, `\n`) | `"` / `"""` |
|
|
|
+| `#"` / `#"""` | `\#` (for example, `\#n`) | `"#` / `"""#` |
|
|
|
+| `##"` / `##"""` | `\##` (for example, `\##n`) | `"##` / `"""##` |
|
|
|
+| `###"` / `###"""` | `\###` (for example, `\###n`) | `"###` / `"""###` |
|
|
|
+| ... | ... | ... |
|
|
|
+
|
|
|
+For example:
|
|
|
+
|
|
|
+```carbon
|
|
|
+var String: x = #"""
|
|
|
+ This is the content of the string. The 'T' is the first character
|
|
|
+ of the string.
|
|
|
+ """ <-- This is not the end of the string.
|
|
|
+ """#;
|
|
|
+ // But the preceding line does end the string.
|
|
|
+// OK, final character is \
|
|
|
+var String: y = #"Hello\"#;
|
|
|
+var String: z = ##"Raw strings #"nesting"#"##;
|
|
|
+var String: w = #"Tab is expressed as \t. Example: '\#t'"#;
|
|
|
+```
|
|
|
+
|
|
|
+Note that both a single-line raw string literal and a multi-line raw string
|
|
|
+literal can begin with `#"""`. These cases can be distinguished by the presence
|
|
|
+or absence of additional `"`s later in the same line:
|
|
|
+
|
|
|
+- In a single-line raw string literal, there must be a `"` and one or more
|
|
|
+ `#`s later in the same line terminating the string.
|
|
|
+- In a multi-line raw string literal, the rest of the line is a file type
|
|
|
+ indicator, which can contain neither `"` nor `#`.
|
|
|
+
|
|
|
+```carbon
|
|
|
+// This string is a single-line raw string literal.
|
|
|
+// The contents of this string start and end with exactly two "s.
|
|
|
+var String: ambig1 = #"""This is a raw string literal starting with """#;
|
|
|
+
|
|
|
+// This string is a block raw string literal with file-type 'This',
|
|
|
+// whose contents start with "is a ".
|
|
|
+var String: ambig2 = #"""This
|
|
|
+ is a block string literal with file type 'This', first character 'i',
|
|
|
+ and last character 'X': X\#
|
|
|
+ """#;
|
|
|
+
|
|
|
+// This is a single-line raw string literal, equivalent to "\"".
|
|
|
+var String: ambig3 = #"""#;
|
|
|
+```
|
|
|
+
|
|
|
+### Encoding
|
|
|
+
|
|
|
+A string literal results in a sequence of 8-bit bytes. Like Carbon source files,
|
|
|
+string literals are encoded in UTF-8. This proposal includes no mechanism to
|
|
|
+request that any other encoding is used. The expectation is that if another
|
|
|
+encoding is needed, the string literal can be transcoded from UTF-8 during
|
|
|
+compilation. There is no guarantee that the string is valid UTF-8, however,
|
|
|
+because arbitrary byte sequences can be inserted by way of `\xHH` escape
|
|
|
+sequences.
|
|
|
+
|
|
|
+This decision is _experimental_, and should be revisited if we find sufficient
|
|
|
+motivation for directly expressing string literals in other encodings.
|
|
|
+Similarly, as library support for a string type evolves, we should consider
|
|
|
+including string literal syntax (perhaps as the default) that guarantees the
|
|
|
+string content is a valid UTF-8 encoding, so that valid UTF-8 can be
|
|
|
+distinguished from an arbitrary string in the type system. In such string
|
|
|
+literals, we should consider rejecting `\xHH` escapes in which HH is greater
|
|
|
+than 7F<sub>16</sub>, as in Rust.
|
|
|
+
|
|
|
+## Alternatives considered
|
|
|
+
|
|
|
+### Block string literals
|
|
|
+
|
|
|
+We could avoid including a block string literal in general, and instead
|
|
|
+construct multi-line strings by string concatenation, with either C-style
|
|
|
+juxtaposition or with an explicit concatenation operator. But doing so would be
|
|
|
+more verbose and would make the expression of the source code be further from
|
|
|
+the programmer's intent.
|
|
|
+
|
|
|
+We could use raw string literals to provide block string literal syntax, as C++
|
|
|
+does. However, this couples two orthogonal choices: whether escape sequences
|
|
|
+should be recognized and whether the string is intended to span multiple lines.
|
|
|
+In C++ code, the inability to use escape sequences in multi-line string literals
|
|
|
+sometimes awkward. For example:
|
|
|
+
|
|
|
+```c++
|
|
|
+std::string make_rule = "%s: %s\n\t$(CC) -c -o $@ $< $(CFLAGS)\n\n"
|
|
|
+ "main:\n\t$(CC) %s -o %s\n";
|
|
|
+```
|
|
|
+
|
|
|
+can be written under this proposal as
|
|
|
+
|
|
|
+```carbon
|
|
|
+var String: make_rule = """make
|
|
|
+ %s: %s
|
|
|
+ \t$(CC) -c -o $@ $< $(CFLAGS)
|
|
|
+
|
|
|
+ main:
|
|
|
+ \t$(CC) %s -o %s
|
|
|
+ """;
|
|
|
+```
|
|
|
+
|
|
|
+improving readability while still making the semantically-meaningful presence of
|
|
|
+tabs visible even in editors / code browsers that do not distinguish tabs from
|
|
|
+spaces.
|
|
|
+
|
|
|
+#### Leading whitespace removal
|
|
|
+
|
|
|
+Block string literals could use explicit characters in the body to indicate the
|
|
|
+amount of leading whitespace to be removed:
|
|
|
+
|
|
|
+```carbon
|
|
|
+var String: x = """
|
|
|
+ | starts with two spaces.
|
|
|
+ """;
|
|
|
+```
|
|
|
+
|
|
|
+This would allow the correct indentation to be determined as soon as the first
|
|
|
+line after the opening `"""` is seen. However, this adds lexical complexity, and
|
|
|
+harms the ability to copy-paste string contents into other contexts.
|
|
|
+
|
|
|
+#### Terminating newline
|
|
|
+
|
|
|
+We could choose to exclude the trailing newline (like in Swift). Informal
|
|
|
+surveys suggest that expectations for whether to include or exclude the trailing
|
|
|
+newline vary.
|
|
|
+
|
|
|
+The intended use case for block string literals is to represent multi-line
|
|
|
+strings. When forming a single larger string from concatenation of multi-line
|
|
|
+string literals, including the trailing newline but not the leading newline --
|
|
|
+or, more generally, including a newline at the end of each line in the string --
|
|
|
+provides the best alignment between the source-level syntax and the result. For
|
|
|
+example:
|
|
|
+
|
|
|
+```carbon
|
|
|
+fn Run() {
|
|
|
+ print("""c++
|
|
|
+ class X {
|
|
|
+ public:
|
|
|
+ X() {}
|
|
|
+
|
|
|
+ """);
|
|
|
+ for (var String: decl in GetMemberDecls()) {
|
|
|
+ print decl;
|
|
|
+ }
|
|
|
+ print("""c++
|
|
|
+
|
|
|
+ private:
|
|
|
+ """);
|
|
|
+ print(GetFields());
|
|
|
+ print("""c++
|
|
|
+ };
|
|
|
+ """);
|
|
|
+}
|
|
|
+```
|
|
|
+
|
|
|
+As is, the output printed by this example can be visualized by ignoring all
|
|
|
+lines other than those between the `"""`s. If we excluded the trailing newline,
|
|
|
+additional blank lines would be required at the end of each string literal,
|
|
|
+harming the readability of the example.
|
|
|
+
|
|
|
+### Escape sequences
|
|
|
+
|
|
|
+We could support octal escape sequences, as many C family languages do. However,
|
|
|
+they are considered antiquated in C++ code, and supporting them would be
|
|
|
+inconsistent with our decision to not support octal numeric literals. A quick
|
|
|
+informal poll suggests that many C++ programmers do not realize that `\123` is
|
|
|
+an octal escape sequence, not a decimal one.
|
|
|
+
|
|
|
+We could support `\123` as a decimal escape sequence. However, doing so may lead
|
|
|
+to surprise when migrating C++ code to Carbon. This possibility should be
|
|
|
+revisited once Carbon matures and we have a better idea of how the migration
|
|
|
+process is expected to proceed.
|
|
|
+
|
|
|
+We could allow arbitrary-length `\x` escape sequences, as C++ does, and include
|
|
|
+some explicit mechanism to terminate such a sequence. For example, we could
|
|
|
+treat `\<whitespace>` for an arbitrary whitespace character in the same way we
|
|
|
+treat `\<newline>`, and use `"\xAB\ C"` to terminate an escape sequence
|
|
|
+prematurely. However, this is unnecessarily inventive, and the Python approach
|
|
|
+of requiring exactly two hexadecimal characters after `\x` is adequate, assuming
|
|
|
+we do not intend to support string literal element types other than 8-bit bytes.
|
|
|
+If we do find we want to support wider element types in future (for example, if
|
|
|
+we want to add a UTF-16 or UTF-32 string literal), `\x{ABCD}` can be used.
|
|
|
+
|
|
|
+We could permit `\<newline>` even in non-block string literals to allow them to
|
|
|
+be line-wrapped, but there seems to be little benefit to doing so, as a block
|
|
|
+string literal can always be used instead.
|
|
|
+
|
|
|
+We could adopt Python's `\N{unicode character name}` syntax. But there is no
|
|
|
+pressing need to add such a syntax imminently, and concerns have been raised
|
|
|
+both over the exact ways in which characters are named and over compatibility
|
|
|
+with upcoming C++ language extensions in this area, so this syntax is not being
|
|
|
+proposed at this time.
|
|
|
+
|
|
|
+We could allow an `\e` escape sequence for the U+001C ESCAPE character. This
|
|
|
+character is a common extension in C and C++ compilers, and appears to primarily
|
|
|
+be used to hardcode ANSI terminal escape sequences, such as
|
|
|
+`"\e[32mgreen text\e[0m"`. This proposal doesn't explicitly reject this idea.
|
|
|
+However, if we consider adopting such an extension in the future, we should
|
|
|
+consider whether a library facility for simple terminal operations would be a
|
|
|
+more valuable addition than this escape sequence.
|
|
|
+
|
|
|
+We could retain the `\uNNNN` escape sequence to give a terser notation for the
|
|
|
+common case of a Unicode code point that is most naturally written as four
|
|
|
+hexadecimal digits -- that is, all code points in the Basic Multilingual Plane.
|
|
|
+This is the approach taken by JavaScript. However, following Swift and Rust in
|
|
|
+permitting only `\u{NNNN}` is simpler and avoids redundancy. We expect explicit
|
|
|
+`\u` escapes to be rare: we expect regular, printable Unicode characters that
|
|
|
+are normalized in NFC to be written directly in the source file, rather than
|
|
|
+spelled with `\u` escapes. Explicit `\u` escapes would be useful where the code
|
|
|
+point value is important -- for example, in test data -- or where special
|
|
|
+characters such as directionality markers or non-normalized characters are
|
|
|
+desired, but for such uses, the longer `\u{NNNN}` form seems adequate.
|
|
|
+
|
|
|
+### Raw string literals
|
|
|
+
|
|
|
+The approach to raw string literals in this proposal is based on Swift's raw
|
|
|
+strings facility.
|
|
|
+
|
|
|
+We could use a different mechanism other than a sequence of `#`s to support
|
|
|
+nesting raw string literals. For example, we could adopt something like C++'s
|
|
|
+semi-arbitrary delimiters `R"foo(string contents)foo"`. However, this level of
|
|
|
+customizability seems unwarranted: raw string literals are unlikely to nest more
|
|
|
+than one or two levels deep, so using `#"..."#`, `##"..."##`, `###"..."###` for
|
|
|
+successive nesting levels seems unproblematic, and removes the need for the
|
|
|
+programmer to make an arbitrary choice.
|
|
|
+
|
|
|
+We could use a delimiter other than `#` to demarcate raw string literals. At
|
|
|
+this stage in Carbon's development, we don't know exactly which characters will
|
|
|
+be useful in operators, but it seems reasonable to assume a mostly C++-like
|
|
|
+operator set, which gives us a variety of characters that cannot appear
|
|
|
+immediately before a string literal: at least `@` `#` `$` `)` `]` `}` `\` `.`
|
|
|
+all appear likely to be available. Of these, `\` is likely to be problematic due
|
|
|
+to its use in escape sequences, closing brackets followed by string literals
|
|
|
+might one day be useful in some grammar constructs, and `.` seems a little too
|
|
|
+close to resembling designator. That leaves `@`, `#`, and `$`, and multiple
|
|
|
+existing languages have used `#` for this purpose.
|
|
|
+
|
|
|
+We could disallow use of _N_ `#`s as a delimiter if a lower value of _N_ would
|
|
|
+work. However, this would make the language brittle under maintenance: removing
|
|
|
+the last nested string literal from a quoted block of code would require
|
|
|
+changing the delimiters.
|
|
|
+
|
|
|
+We could use Rust-style raw strings, which add a leading `r`, permit zero `#`s
|
|
|
+to be used in raw strings, and do not provide a facility for escape sequences in
|
|
|
+raw strings. There are several reasons to prefer Swift-style raw strings:
|
|
|
+
|
|
|
+- Swift raw strings are not a distinct language feature; rather, they are a
|
|
|
+ generalization of non-raw strings -- non-raw strings are simply the case
|
|
|
+ where the number of `#` characters in the delimiters and escape sequence
|
|
|
+ intorducer is zero.
|
|
|
+- Permitting escape sequences even in raw strings means that there is no loss
|
|
|
+ of functionality when using a raw string, and code changes under maintenance
|
|
|
+ that would require use of a facility only available by way of an escape
|
|
|
+ sequence (such as inclusion of trailing whitespace in a block string
|
|
|
+ literal) do not force a reversion to a non-raw string literal.
|
|
|
+
|
|
|
+There are also several reasons to prefer the Rust-style approach:
|
|
|
+
|
|
|
+- In the most common case, a Rust raw string will be one character shorter.
|
|
|
+- Less lexical space is used: Swift-style raw strings remove the possibility
|
|
|
+ of using `#` as a prefix operator, whereas the leading `r` in Rust-style raw
|
|
|
+ strings does not.
|
|
|
+- Certain character sequences are hard to express in Swift-style raw strings.
|
|
|
+ Specifically, string literals such as `"\\################"` cannot readily
|
|
|
+ be expressed as a raw string. Such string literals are extremely rare, but
|
|
|
+ not nonexistent, in one large sample C++ corpus.
|
|
|
+
|
|
|
+If the final issue is concerning, we have a path to address it, by specifying
|
|
|
+that `\` followed by <i>N</i>+1 or more `#`s is left alone, just like `\`
|
|
|
+followed by <i>N</i>-1 or fewer `#`s is left alone. Formally, this can be
|
|
|
+accomplished by defining `\#` as an escape sequence that expands to itself, that
|
|
|
+is, to a backslash followed by <i>N</i>+1 `#` characters.
|
|
|
+
|
|
|
+#### Trailing whitespace
|
|
|
+
|
|
|
+We could preserve trailing whitespace in at least raw block string literals, and
|
|
|
+perhaps in all block string literals. However, this would mean that visually
|
|
|
+identical programs could have different meanings, and even that transformations
|
|
|
+performed automatically on save by some editors (removing trailing whitespace)
|
|
|
+could change the meaning of a program. It might also mean that raw string
|
|
|
+literals are no longer a generalization of non-raw string literals.
|
|
|
+
|
|
|
+Under this proposal, trailing whitespace can be included in a block string
|
|
|
+literal by following it with `\n\`:
|
|
|
+
|
|
|
+```
|
|
|
+var String: authors = """markdown
|
|
|
+ *Authors*: \n\
|
|
|
+ Me <me@example.com> \n\
|
|
|
+ Someone Else <them@example.com>
|
|
|
+
|
|
|
+""";
|
|
|
+```
|
|
|
+
|
|
|
+In a single-`#` raw string literal, the same can be accomplished with the
|
|
|
+more-verbose terminator `\#n\#`, and so on.
|
|
|
+
|
|
|
+#### Line separators
|
|
|
+
|
|
|
+Raw block string literals could preserve the form of vertical whitespace used to
|
|
|
+terminate each line. This would allow uncommon forms of vertical whitespace (for
|
|
|
+example, vertical tab and form feed) to be included in raw string literals, but
|
|
|
+would create a risk that the meaning of a program would be different when the
|
|
|
+source code is checked out on an operating system that uses line feed as a line
|
|
|
+terminator versus when the source code is checked out on an operating system
|
|
|
+that uses a different line terminator (such as carriage return followed by line
|
|
|
+feed). This would also mean that raw string literals are no longer a
|
|
|
+generalization of non-raw string literals.
|
|
|
+
|
|
|
+### Internal whitespace
|
|
|
+
|
|
|
+We could allow raw tab characters in string literals. However, raw tab
|
|
|
+characters harm the readability of the program, and we would like to encourage
|
|
|
+the use of `\t` escapes instead in situations where they are available, even if
|
|
|
+this means that the more verbose form `\#t` needs to be used in raw string
|
|
|
+literals.
|