2 месяцев назад · d39fdfcfad
--- a/proposals/p6710.md
+++ b/proposals/p6710.md
@@ -0,0 +1,567 @@
 
				+# `char` redesign
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+[Pull request](https://github.com/carbon-language/carbon-lang/pull/6710)
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+-   [Abstract](#abstract)
			
 
				+-   [Problem](#problem)
			
 
				+-   [Background](#background)
			
 
				+-   [Proposal](#proposal)
			
 
				+-   [Details](#details)
			
 
				+    -   [Add a `char` type literal](#add-a-char-type-literal)
			
 
				+        -   [Escape sequences](#escape-sequences)
			
 
				+    -   [Add a `Core.CharLiteral` type for character literals](#add-a-corecharliteral-type-for-character-literals)
			
 
				+    -   [Operators](#operators)
			
 
				+        -   [Conversion operators](#conversion-operators)
			
 
				+        -   [Comparison operators](#comparison-operators)
			
 
				+        -   [Arithmetic operators](#arithmetic-operators)
			
 
				+            -   [`char` integer parameters](#char-integer-parameters)
			
 
				+            -   [Overflow semantics](#overflow-semantics)
			
 
				+            -   [Preferring i32 returns](#preferring-i32-returns)
			
 
				+    -   [Revoke and replace proposal #1964: Character Literals](#revoke-and-replace-proposal-1964-character-literals)
			
 
				+-   [Rationale](#rationale)
			
 
				+-   [Future work](#future-work)
			
 
				+-   [Alternatives considered](#alternatives-considered)
			
 
				+    -   [Align `char` fully with C++, or make it fully valid](#align-char-fully-with-c-or-make-it-fully-valid)
			
 
				+    -   [Raw character literals](#raw-character-literals)
			
 
				+    -   [Disallow hex escape sequences in character literals](#disallow-hex-escape-sequences-in-character-literals)
			
 
				+    -   [Allow grapheme clusters in character literals](#allow-grapheme-clusters-in-character-literals)
			
 
				+    -   [Reuse string literal syntax for character literals](#reuse-string-literal-syntax-for-character-literals)
			
 
				+        -   [Treat single-character string literals as a third "text literal" type](#treat-single-character-string-literals-as-a-third-text-literal-type)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Abstract
			
 
				+
			
 
				+-   Add a `char` type literal mapping to `Core.Char` and equivalent to C++'s
			
 
				+    `char`.
			
 
				+    -   8 bits, unsigned, treated as a single UTF-8
			
 
				+        [code unit](https://en.wikipedia.org/wiki/Character_encoding#Code_unit).
			
 
				+-   Add a `Core.CharLiteral` type for character literals, similar to
			
 
				+    `Core.IntLiteral`.
			
 
				+-   Allow operations for `char` and `Core.CharLiteral` which reinforce the
			
 
				+    "character" concept, versus an integer value.
			
 
				+-   Revokes and replaces
			
 
				+    [#1964: Character Literals](https://github.com/carbon-language/carbon-lang/pull/1964).
			
 
				+
			
 
				+## Problem
			
 
				+
			
 
				+`char` is an important type due to its common use in C++ code. However, the
			
 
				+related proposal
			
 
				+[#1964: Character Literals](https://github.com/carbon-language/carbon-lang/pull/1964)
			
 
				+has several issues, including:
			
 
				+
			
 
				+-   Lacks a decision for `char` handling; it is not mentioned in proposal #1964.
			
 
				+    -   Similarly, decides there are character literals, but more detail is
			
 
				+        needed for implementation.
			
 
				+-   Type literal naming no longer reflects naming consensus.
			
 
				+    -   `Char8` seems potentially more equivalent to `std::char8_t` instead of
			
 
				+        `char`, and for interop purposes these are slightly different types.
			
 
				+        Similar applies to `Char16` and `Char32`.
			
 
				+    -   As a design direction, we have been lowercasing type literals (such as
			
 
				+        `u8`).
			
 
				+-   Conflicting statements about behavior.
			
 
				+    -   For example, "Rationale" states that `var b: u8 = 'a' + 1` would be
			
 
				+        supported, while "Operations" states that `+` is returning a character
			
 
				+        literal (not a `u8`).
			
 
				+    -   For character literals, states "Escape sequences which would result in
			
 
				+        non-UTF-8 encodings or more than one code point are not included."
			
 
				+        However, it goes on to say that `let smiley: Char16 = '\u{1F600}'` is
			
 
				+        valid even though `1F600` would require multiple code units in both
			
 
				+        UTF-8 and UTF-16.
			
 
				+-   Unclear that it gives us a good UTF plan.
			
 
				+    -   Does not decide what a single character in a Carbon string is.
			
 
				+    -   No consideration regarding interop with the `std::char32_t` family of
			
 
				+        types or [ICU](https://github.com/unicode-org/icu) compatibility.
			
 
				+
			
 
				+In other words, it's likely we want something similar to `Char32`, but it may be
			
 
				+named something like `Core.Char32` and have slightly different type behaviors
			
 
				+than decided in #1964. On the other hand, we need something compatible with the
			
 
				+C++ `char` in order to proceed with basic C++ interop, and #1964 doesn't provide
			
 
				+that.
			
 
				+
			
 
				+## Background
			
 
				+
			
 
				+-   [Proposal #1964: Character Literals](https://github.com/carbon-language/carbon-lang/pull/1964)
			
 
				+    is fundamental, and a lot of the underlying thoughts still apply. In
			
 
				+    particular, we still want character types to be distinct from numeric types.
			
 
				+-   [Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
			
 
				+    is important because we want character and string literals to have mirrored
			
 
				+    escaping concepts.
			
 
				+-   [Proposal #5448: Carbon &lt;-> C++ Interop: Primitive Types](https://github.com/carbon-language/carbon-lang/pull/5448)
			
 
				+    left the question of character type mappings open. This proposal aims to
			
 
				+    answer it for `char`.
			
 
				+-   [Issue #5903: Built-in character type questions](https://github.com/carbon-language/carbon-lang/issues/5903)
			
 
				+    addressed type questions.
			
 
				+-   [Issue #5922: Built-in character operators](https://github.com/carbon-language/carbon-lang/issues/5922)
			
 
				+    addressed operators.
			
 
				+
			
 
				+## Proposal
			
 
				+
			
 
				+The way `char` will work is:
			
 
				+
			
 
				+-   Add a `char` type literal.
			
 
				+    -   Carbon's `str` type will use `char` for elements.
			
 
				+    -   For interop, map Carbon's `char` to C++'s `char`.
			
 
				+-   Add a `Core.CharLiteral` type for character literals, similar to
			
 
				+    `Core.IntLiteral`.
			
 
				+-   Provide operators which are consistent with the character concept.
			
 
				+
			
 
				+This proposal additionally revokes and replaces proposal #1964, rather than
			
 
				+trying to define which parts we are keeping and which are changing.
			
 
				+
			
 
				+## Details
			
 
				+
			
 
				+### Add a `char` type literal
			
 
				+
			
 
				+`char` is intended to offer a basic construct for Carbon's strings that is both
			
 
				+compatible with UTF-8, and has high fidelity with C++ strings.
			
 
				+
			
 
				+In support of that, important notes are:
			
 
				+
			
 
				+-   `char` itself will be a
			
 
				+    [type literal](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/words.md#type-literals).
			
 
				+-   `char` notionally represents a UTF-8 code unit.
			
 
				+    -   It can contain invalid code units, as long as it remains 8 bits. We do
			
 
				+        not assume runtime validation.
			
 
				+-   `char` will be backed by `Core.Char`, in the prelude.
			
 
				+    -   `Core.Char` will adapt `u8`.
			
 
				+-   C++ interoperability will transparently map `char` and `Cpp.char` on API
			
 
				+    boundaries.
			
 
				+    -   When used with Carbon, C++ `char` will be unsigned by default
			
 
				+        (`-funsigned-char`). A program can switch back to signed
			
 
				+        (`-fno-unsigned-char`), and Carbon will maintain interoperability but
			
 
				+        bits will be interpreted differently in each language.
			
 
				+
			
 
				+#### Escape sequences
			
 
				+
			
 
				+Escape sequences are the same as for a string literal. Only one escape sequence
			
 
				+may be provided in a character literal.
			
 
				+
			
 
				+### Add a `Core.CharLiteral` type for character literals
			
 
				+
			
 
				+`Core.CharLiteral` is the type of a character literal, similar to how
			
 
				+`Core.IntLiteral` is the type of integer literals. It abstractly represents a
			
 
				+single Unicode code point. This gives us a compile-time structure for characters
			
 
				+that can be typed and referred to in programs.
			
 
				+
			
 
				+Semantics of a character literal will be equivalent to a
			
 
				+[simple string literal](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#simple-and-block-string-literals),
			
 
				+except that:
			
 
				+
			
 
				+-   A character literal has a validated Unicode code point value.
			
 
				+-   The enclosing character is `'`.
			
 
				+-   The contents are precisely one character or escape sequence.
			
 
				+    -   The `\x` escape sequence is limited to values up to `7F`, where the
			
 
				+        UTF-8 code unit and Unicode code point values are identical.
			
 
				+
			
 
				+An important detail of the character literal type is it gives us a way to track
			
 
				+constant values at compile time. For example, `'a' + 1` has a constant value of
			
 
				+`b`. This means we can diagnose uses of character literals that don't represent
			
 
				+a valid Unicode code point, such as `'a' + 0xFFFFFF`.
			
 
				+
			
 
				+### Operators
			
 
				+
			
 
				+The goal of provided operators is to provide a set of operators which map to
			
 
				+common operations a user would want to do. It is a non-goal to support use of
			
 
				+`char` as an arbitrary byte or integer: developers should use `u8` for that.
			
 
				+
			
 
				+In general, `char` and `Core.CharLiteral` operators are intended to be mirrors
			
 
				+of each other.
			
 
				+
			
 
				+#### Conversion operators
			
 
				+
			
 
				+-   `char`
			
 
				+    -   `ImplicitAs`: None
			
 
				+    -   `ExplicitAs`: To/from `u8`, plus the set of `ImplicitAs` for `u8`.
			
 
				+        -   For example, `u8` has `ImplicitAs` to `u16`, so `char` has
			
 
				+            `ExplicitAs` to `u16`.
			
 
				+-   `Core.CharLiteral`
			
 
				+    -   `ImplicitAs`: to `char` only
			
 
				+    -   `ExplicitAs`: To/from the set of `ImplicitAs` for `i32` and `u32`.
			
 
				+        -   For example, `i32` has `ImplicitAs` to `i64`, so `Core.CharLiteral`
			
 
				+            has `ExplicitAs` to `i64`.
			
 
				+        -   For example, `i64` does not have `ImplicitAs` to `i32`; conversion
			
 
				+            requires two casts, `((i64_val as i32) as Core.CharLiteral)`.
			
 
				+
			
 
				+Casting from a `char` to a `Core.CharLiteral` is not supported.
			
 
				+
			
 
				+See also
			
 
				+[implicit numeric conversions](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/expressions/implicit_conversions.md#data-types).
			
 
				+
			
 
				+#### Comparison operators
			
 
				+
			
 
				+-   `char`
			
 
				+    -   `EqWith` and `OrderedWith` when both operands are `char`.
			
 
				+    -   `ImplicitAs` should allow substituting one operand with
			
 
				+        `Core.CharLiteral`.
			
 
				+-   `Core.CharLiteral`
			
 
				+    -   `EqWith` and `OrderedWith` when operands are `Core.CharLiteral`.
			
 
				+
			
 
				+#### Arithmetic operators
			
 
				+
			
 
				+-   `char`
			
 
				+    -   `AddWith`: `char + &lt;integer> -> char` (with reversible operands)
			
 
				+        -   Equivalent to `(((char as i16) + &lt;integer>) as u8) as char)`
			
 
				+    -   `SubWith`:
			
 
				+        -   `char - &lt;integer> -> char` (non-reversible operands)
			
 
				+            -   Equivalent to `(((char as i16) - &lt;integer>) as u8) as char)`
			
 
				+        -   `char - char -> i32`
			
 
				+            -   Equivalent to `(lhs as i32) - (rhs as i32)`.
			
 
				+            -   `ImplicitAs` should allow substituting one operand with
			
 
				+                `Core.CharLiteral`.
			
 
				+-   `Core.CharLiteral`
			
 
				+    -   `AddWith`: `Core.CharLiteral + &lt;integer> -> Core.CharLiteral` (with
			
 
				+        reversible operands)
			
 
				+    -   `SubWith`:
			
 
				+        -   `Core.CharLiteral - &lt;integer> -> Core.CharLiteral`
			
 
				+            (non-reversible operands)
			
 
				+        -   `Core.CharLiteral - Core.CharLiteral -> i32`
			
 
				+            -   Provides a unicode code point delta.
			
 
				+
			
 
				+##### `char` integer parameters
			
 
				+
			
 
				+Arbitrary integers are supported for most of these operations. For example, we
			
 
				+want to allow addition of negative numbers, even though the representation of
			
 
				+`char` is unsigned, without requiring additional casts.
			
 
				+
			
 
				+##### Overflow semantics
			
 
				+
			
 
				+Operations will use error overflow semantics,
			
 
				+[similar to signed integers](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/expressions/arithmetic.md#overflow-and-other-error-conditions).
			
 
				+For example, `(('a' as char) + 500)` is invalid code because it causes `char`
			
 
				+overflow. That's why conversions are to signed values (for example,
			
 
				+`char as i16`).
			
 
				+
			
 
				+##### Preferring i32 returns
			
 
				+
			
 
				+In arithmetic, `i32` returns are preferred for deltas because they should be
			
 
				+valid for unicode code points. Even though `char` is only 8-bits, using `i32`
			
 
				+for returns there too creates consistency with `Core.CharLiteral`.
			
 
				+
			
 
				+### Revoke and replace proposal #1964: Character Literals
			
 
				+
			
 
				+This revokes proposal #1964 for simplicity. Rather than trying to detail which
			
 
				+decisions still apply and which don't, this proposal is acting from an
			
 
				+assumption that all decisions there no longer apply. We can still benefit by
			
 
				+pointing towards the rationale in explicitly maintaining decisions, but we want
			
 
				+to go through that step.
			
 
				+
			
 
				+## Rationale
			
 
				+
			
 
				+-   [Performance-critical software](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#performance-critical-software)
			
 
				+    -   The intent is that Carbon's main string type privileges UTF-8 over other
			
 
				+        potential encodings. A `char` represents a single code unit within that,
			
 
				+        and is consequently efficient to access. It can also be invalid, meaning
			
 
				+        we don't guarantee performing runtime validation for users (avoiding
			
 
				+        performance overhead), and that users might be able to use it for other
			
 
				+        encodings.
			
 
				+-   [Software and language evolution](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#software-and-language-evolution)
			
 
				+    -   `Core.CharLiteral` is designed as a Unicode code point, and even though
			
 
				+        this design doesn't include a way to use values over `7F`, we anticipate
			
 
				+        those will be added in the future. It's being provided as a building
			
 
				+        block for more elaborate Unicode functionality, including both UTF-16
			
 
				+        and UTF-32, even as we prioritize UTF-8.
			
 
				+-   [Code that is easy to read, understand, and write](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
			
 
				+    -   Character literal syntax mirrors string literal syntax. The main
			
 
				+        divergence is `\x80` and higher similar escapes, which are not supported
			
 
				+        due to potentially ambiguous behavior, still in furtherance of this
			
 
				+        goal.
			
 
				+-   [Practical safety and testing mechanisms](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#practical-safety-and-testing-mechanisms)
			
 
				+    -   Restricting the set of operators valid for `char` gives us a way to do
			
 
				+        different sorts of validation that can be more character-oriented than
			
 
				+        if we treated it as an arbitrary byte.
			
 
				+    -   Treating `Core.CharLiteral` as a valid Unicode character allows us to
			
 
				+        provide static checking for some operations, such as `'a' + 1` resulting
			
 
				+        in another valid Unicode code point; more is also transitively possible,
			
 
				+        including involving `char`.
			
 
				+-   [Interoperability with and migration from existing C++ code](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code)
			
 
				+    -   Modeling `char` as a UTF-8 code unit creates behavior which is very
			
 
				+        similar to C++, but still shifts towards a more character-oriented
			
 
				+        approach. We do expect some migration friction as a consequence (as
			
 
				+        use-cases might need either more casts, or to switch to a byte type).
			
 
				+
			
 
				+## Future work
			
 
				+
			
 
				+There's still significant future work, including:
			
 
				+
			
 
				+-   `signed char`, `unsigned char`
			
 
				+-   `std::char8_t`, `std::char16_t`, `std::char32_t`
			
 
				+-   UTF-16 and UTF-32 support
			
 
				+
			
 
				+It should not be assumed that there's any restriction on the designs of those
			
 
				+features, particularly no restrictions from #1964.
			
 
				+
			
 
				+## Alternatives considered
			
 
				+
			
 
				+### Align `char` fully with C++, or make it fully valid
			
 
				+
			
 
				+Alternatives were discussed in
			
 
				+[zygoloid's comment on #5903](https://github.com/carbon-language/carbon-lang/issues/5903#issuecomment-3494068591).
			
 
				+
			
 
				+The comment notes that three options were proposed:
			
 
				+
			
 
				+1. `char` is fully aligned with C++.
			
 
				+
			
 
				+    There is no universal convention for what the value in a `char` means, and
			
 
				+    the numerical encoding of Unicode characters into `char` sequences might
			
 
				+    even be platform-dependent. For example, we might use some code page on
			
 
				+    Windows, EBCDIC on some IBM targets, and probably UTF-8 everywhere else.
			
 
				+    Likely the encoding would match what a character literal in C++ code would
			
 
				+    do for that target. Even when the target normally uses UTF-8, it would be
			
 
				+    reasonable to use an array of `char` as the type of the output buffer when
			
 
				+    transcoding from UTF-8 to some other encoding, and generally an encoded text
			
 
				+    buffer (in any encoding) would typically be represented as an array of
			
 
				+    `char`. It might also be reasonable to use an array of `char` for things
			
 
				+    that aren't necessarily text, such as file contents.
			
 
				+
			
 
				+2. `char` models a UTF-8 code unit, although it may not necessarily be valid,
			
 
				+   and may appear in a sequence that is not a valid UTF-8 encoding.
			
 
				+
			
 
				+    As with the first option, `char` can represent an integer in [0, 255], although
			
 
				+    it is not an integer type. Higher-level abstractions would likely (eventually)
			
 
				+    be provided to represent different views of the code unit sequence as (for example)
			
 
				+    a sequence of code points or a sequence of graphemes, but the fundamental model
			
 
				+    exposes the encoding. Functions taking `char` or `char` sequences would assume
			
 
				+    UTF-8 encoding, and would need to consider how to handle invalid `char`s and
			
 
				+    invalid `char` sequences.
			
 
				+
			
 
				+3. Use a foundation that enforces Unicode string validity, for some definition
			
 
				+   of "Unicode string validity".
			
 
				+
			
 
				+    The `char` type is a Unicode character. Strings would notionally be a
			
 
				+    sequence of Unicode characters, possibly also maintaining some higher-level
			
 
				+    string invariants. String indexing, if it exists, would likely treat the
			
 
				+    string as a sequence of Unicode characters. String invariants would be
			
 
				+    enforced by type conversion into the string type rather than within the
			
 
				+    string operations, and certain classes of invalid strings would be
			
 
				+    unrepresentable.
			
 
				+
			
 
				+Rationale as evaluated are:
			
 
				+
			
 
				+-   **Privilege UTF-8 over other encodings:** UTF-8 is
			
 
				+    [typically the best choice](https://utf8everywhere.org/) for representing
			
 
				+    text, even when targeting languages where characters are 3 bytes in UTF-8
			
 
				+    but 2 in UTF-16, and even on Windows where the system APIs typically operate
			
 
				+    primarily in UTF-16 or UCS-2. We should create affordances that encourage
			
 
				+    use of UTF-8 (such as having the `char` type be conventionally UTF-8).
			
 
				+    -   Our overall goal to support (only) modern environments and a general
			
 
				+        desire for consistency and portability argues against supporting
			
 
				+        non-Unicode encodings for character types.
			
 
				+    -   Having _some_ convention for the meaning of the value of a `char` seems
			
 
				+        important, and the lack of one in C++ has been a notable problem over
			
 
				+        time, leading to the addition of `char8_t` et al, which have not been
			
 
				+        entirely satisfactory solutions due to the existing widespread usage of
			
 
				+        plain `char`.
			
 
				+-   **Do not privilege any particular meaning of "validity":** There are many
			
 
				+    different ways in which you can view a sequence of UTF-8 code units as being
			
 
				+    valid or invalid. For example: Can a string start with a combining
			
 
				+    character? Can it have mismatched LRE/RLE/PDF characters in it? Can it be
			
 
				+    unnormalized, or must it be in NFC, or in NFD? Can it contain unassigned
			
 
				+    Unicode characters? Can it contain PUA characters? Can it contain
			
 
				+    non-characters? Picking any set of answers to these questions as being our
			
 
				+    canonical notion of "validity" is somewhat arbitrary.
			
 
				+-   **Do not privilege any particular level for accessing elements of the string
			
 
				+    other than code units:** There are many different layers of abstraction at
			
 
				+    which you can interpret the contents of a string. The atoms that users want
			
 
				+    to interact with, such as glyphs or grapheme clusters in rendering, or
			
 
				+    combining characters when editing or performing substring searches, aren't
			
 
				+    in one-to-one correspondence with Unicode characters any more than they're
			
 
				+    in one-to-one correspondence with UTF-8 code units. So it's not clear that
			
 
				+    privileging Unicode-character-oriented access (or indeed any of the other
			
 
				+    higher-level Unicode views) is appropriate. However, code units are in
			
 
				+    direct correspondence with bytes of memory, which is directly relevant for
			
 
				+    low-level operations, so there is a reason to provide direct access to
			
 
				+    byte-level / code-unit-level operations.
			
 
				+    -   If string indexing operates on Unicode characters, it would either be
			
 
				+        non-constant-time or would require not storing strings as just a
			
 
				+        sequence of UTF-8. Having a constant-time indexing operation on strings
			
 
				+        seems very important (especially for interop and for meeting C++
			
 
				+        developers where they are), even though a lot of the desired
			
 
				+        functionality (perhaps all of it) can be provided with iterator- or
			
 
				+        cursor-like machinery instead.
			
 
				+-   **Enforcing validity is problematic for existing API structures:** Requiring
			
 
				+    strings to be valid UTF-8 presents difficulties when moving text into or out
			
 
				+    of other sources. For example, when reading text from a validly-encoded
			
 
				+    UTF-8 file into a text buffer, one would need to deal with a read that ends
			
 
				+    in the middle of an encoding of a character. I don't know how Rust deals
			
 
				+    with this, but it seems like it would create significant impedance mismatch
			
 
				+    with C-like buffered I/O utilities. Similarly, when interoperating with C++,
			
 
				+    it would create friction if our string representation requires strings to be
			
 
				+    valid UTF-8 encodings.
			
 
				+-   **We can allow additional invariants without requiring them:** For a
			
 
				+    known-to-be-valid UTF-8 sequence, a higher-level abstraction can be built,
			
 
				+    and similarly, yet-higher-level abstractions can be built for whatever other
			
 
				+    invariants we want to enforce. So using option 2 rather than option 3 as our
			
 
				+    foundation doesn't prevent enforcing invariants in the type system (but nor
			
 
				+    does it encourage doing so).
			
 
				+
			
 
				+This proposal is choosing option 2, that `char` models a UTF-8 code unit without
			
 
				+validation. In some sense, option 2 is still "fully aligned with C++", but with
			
 
				+C++'s `char8_t` rather than with C++'s `char`.
			
 
				+
			
 
				+### Raw character literals
			
 
				+
			
 
				+[Raw string literals](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#raw-string-literals)
			
 
				+use a `#` prefix. There's limited use for this in character literals;
			
 
				+technically, `'\\'` could instead be `#'\'#`, but that's longer and extra
			
 
				+characters may prove distracting. Raw string literals are more useful when
			
 
				+there's a longer character sequence, whereas character literals have one
			
 
				+character by definition. For simplicity, character literals won't have raw
			
 
				+syntax.
			
 
				+
			
 
				+### Disallow hex escape sequences in character literals
			
 
				+
			
 
				+A `\x##` escape sequence abstractly represents a UTF-8 code unit. Whereas values
			
 
				+over 7F are valid in string literals (allowing arbitrary byte values), these are
			
 
				+disallowed in character literals because we want a more validated Unicode
			
 
				+behavior. Developers could instead rely on `\u` escapes for `\x`.
			
 
				+
			
 
				+It can still be useful to allow `\x` escapes for low-range values because some
			
 
				+developers will still need to specify
			
 
				+[ANSI escapes](https://en.wikipedia.org/wiki/ANSI_escape_code). Carbon
			
 
				+[drops support for some escape sequences](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#escape-sequences),
			
 
				+such as `\a`, and specifically advises `\x` as an alternative for developers
			
 
				+that need it. Requiring `\a` -> `\x07` -> `\u{07}` is incrementally more verbose
			
 
				+syntax, and developers may be confused why `"\x1B"` is allowed for strings but
			
 
				+`'\u{1B}'` is required for characters.
			
 
				+
			
 
				+Values over 7F are ambiguous between an arbitrary byte value and a Unicode code
			
 
				+point, and so should be invalid. However, where both interpretations are
			
 
				+identical for UTF-8 (values up to and including 7F), we will allow `\x` escape
			
 
				+sequences.
			
 
				+
			
 
				+### Allow grapheme clusters in character literals
			
 
				+
			
 
				+This proposal carries forward the decision in #1964
			
 
				+[to not support grapheme clusters](https://github.com/carbon-language/carbon-lang/pull/1964/files#diff-192d5568d8c1d15e68abe0c46cc52cc0b375a372d1dad8d2154d09f8b29666c5R340)
			
 
				+in character literals.
			
 
				+
			
 
				+### Reuse string literal syntax for character literals
			
 
				+
			
 
				+Instead of using single quotes (for example, `'a'`), we could use string literal
			
 
				+syntax with a conversion (for example, `"a" as char`) for character literals.
			
 
				+This was proposed because it would free up the single quote for other,
			
 
				+unspecified syntax uses.
			
 
				+
			
 
				+For background, character literals are common in C++. For example, in
			
 
				+SourceGraph search statistics (some of these are in comments -- a search
			
 
				+limitation):
			
 
				+
			
 
				+-   `'(.|\\.)'`:
			
 
				+    [46.2 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:50000000+/%27%28.%7C%5C%5C.%29%27/&patternType=keyword&sm=0)
			
 
				+-   `<<`:
			
 
				+    [over 100 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:100000000+/+%3C%3C+/&patternType=keyword&sm=0)
			
 
				+-   `>>`:
			
 
				+    [10.4 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:50000000+/+%3E%3E+/&patternType=keyword&sm=0)
			
 
				+-   `%`:
			
 
				+    [5.3 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:10000000+/+%25+/&patternType=keyword&sm=0)
			
 
				+
			
 
				+This creates several disadvantages for removing character literals in Carbon:
			
 
				+
			
 
				+-   **Migrating C++ developers to Carbon:** The frequency of use can be expected
			
 
				+    to have trained developers to expect single quotes to be used for
			
 
				+    characters, especially the C++ developers that Carbon is targeting.
			
 
				+    Repurposing them would create a friction for C++ developers to need to
			
 
				+    understand the different meanings of the same syntax in each of C++ and
			
 
				+    Carbon, something Carbon prefers to avoid.
			
 
				+
			
 
				+-   **Increased runtime error risks:** Runtime errors could take the form of
			
 
				+    simple increased overhead, such as converting a string literal to a `str`
			
 
				+    then to a `char`. However, they could also be more insidious, such as doing
			
 
				+    `[0]` on a string literal and not validating that the string is exactly one
			
 
				+    character (this would also likely return a null byte for `""[0]`). By having
			
 
				+    a character literal type, Carbon encourages developers to stay within guide
			
 
				+    rails that make it easier to get compile-time behavior and program
			
 
				+    validation.
			
 
				+
			
 
				+-   **Block string literal use:** We already have another use for single quotes
			
 
				+    in Carbon:
			
 
				+    [block string literals](/docs/design/lexical_conventions/string_literals.md).
			
 
				+    The syntax may need to change along with removing character literals, to
			
 
				+    make room for other uses of single quotes.
			
 
				+
			
 
				+    -   If retained, it would constrain uses of single quotes. For example, a
			
 
				+        unary operator syntax has overlap (that is, if `'a` and `''a` are valid,
			
 
				+        then `'''a` is ambiguous).
			
 
				+
			
 
				+    -   The choice of single quotes in proposal
			
 
				+        [#1360: Change raw string literal syntax](https://github.com/carbon-language/carbon-lang/pull/1360)
			
 
				+        was made accounting for single quotes in character literals, and that
			
 
				+        commonality would be lost.
			
 
				+
			
 
				+-   **Tooling:** The prevalence of single quotes being used for either strings
			
 
				+    or characters also affects their treatment in tools not specialized to
			
 
				+    Carbon: they expect them to be used for strings. For example, Rust's use of
			
 
				+    single quotes for lifetime annotations has been observed to break
			
 
				+    language-agnostic syntax highlighting.
			
 
				+
			
 
				+While a compelling proposal for a different use of single quotes may come up in
			
 
				+the future, freeing up the character for other purposes is insufficient to
			
 
				+justify a different syntax for character literals.
			
 
				+
			
 
				+#### Treat single-character string literals as a third "text literal" type
			
 
				+
			
 
				+A related alternative with the same goal of eliminating single quotes for
			
 
				+character literals is that, rather than requiring single-character string
			
 
				+literals be explicitly converted to `char`, they could instead have a third type
			
 
				+of text literal. This would implicitly cast to either `str` or `char`.
			
 
				+
			
 
				+This approach would lead to three literal types: `StrLiteral`, `CharLiteral`,
			
 
				+and `TextLiteral`. The distinction of `CharLiteral` is important because we
			
 
				+still want to support arithmetic on character literals, such as `'a' + 1` (which
			
 
				+we would not want to be allowed for `StrLiteral`).
			
 
				+
			
 
				+The existence of a third type would be important for generic code, even when not
			
 
				+trying to use character literals. For example:
			
 
				+
			
 
				+```carbon
			
 
				+  fn StoreValue[U:! type](ref a: Optional(U), b: U) {
			
 
				+    a = b;
			
 
				+  }
			
 
				+
			
 
				+  fn StrLogic[T:! type](a: T) {
			
 
				+    var x: Optional(T) = a;
			
 
				+    StoreValue(x, "str");
			
 
				+  }
			
 
				+
			
 
				+  fn F() {
			
 
				+    StrLogic("a");
			
 
				+  }
			
 
				+```
			
 
				+
			
 
				+Here, `T` is deduced to be `TextLiteral`. However, `U` has no valid value: it's
			
 
				+passed `Optional(TextLiteral)`, while `"str"` is a `StrLiteral` (which should
			
 
				+not be convertible to `TextLiteral`). As a consequence, this code is invalid,
			
 
				+even though the same code would be valid if there were not `TextLiteral` type.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Avoids an explicit cast.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Shares most of the disadvantages of the primary explicit conversion
			
 
				+    approach.
			
 
				+    -   This includes the risk that developers will write `"..."[0]` instead of
			
 
				+        `"..." as char` when they need a character, although the frequency may
			
 
				+        be reduced.
			
 
				+-   Having additional types in common literals could lead to programmer errors
			
 
				+    in deducing generic types, as described above.
			
 
				+-   Implicit casts cause more operator ambiguity.
			
 
				+    -   How are operators that have different meanings for string and character
			
 
				+        literals handled, such as `Cpp.std.cout <<` or `<=>`?
			
 
				+    -   In Carbon, we'd probably still want string operators to work; for
			
 
				+        example, `"a" + "b" => "ab"`, and that can be compile-time. Is `"a" + 1`
			
 
				+        a pointer to the null byte as it is in C++ (similar to `&("a"[1])`), a
			
 
				+        character addition (`'a' + 1 => 'b'`), or does it require an explicit
			
 
				+        cast in order to ensure behavior is deliberate?