|
|
@@ -0,0 +1,567 @@
|
|
|
+# `char` redesign
|
|
|
+
|
|
|
+<!--
|
|
|
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
|
|
|
+Exceptions. See /LICENSE for license information.
|
|
|
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
|
|
|
+-->
|
|
|
+
|
|
|
+[Pull request](https://github.com/carbon-language/carbon-lang/pull/6710)
|
|
|
+
|
|
|
+<!-- toc -->
|
|
|
+
|
|
|
+## Table of contents
|
|
|
+
|
|
|
+- [Abstract](#abstract)
|
|
|
+- [Problem](#problem)
|
|
|
+- [Background](#background)
|
|
|
+- [Proposal](#proposal)
|
|
|
+- [Details](#details)
|
|
|
+ - [Add a `char` type literal](#add-a-char-type-literal)
|
|
|
+ - [Escape sequences](#escape-sequences)
|
|
|
+ - [Add a `Core.CharLiteral` type for character literals](#add-a-corecharliteral-type-for-character-literals)
|
|
|
+ - [Operators](#operators)
|
|
|
+ - [Conversion operators](#conversion-operators)
|
|
|
+ - [Comparison operators](#comparison-operators)
|
|
|
+ - [Arithmetic operators](#arithmetic-operators)
|
|
|
+ - [`char` integer parameters](#char-integer-parameters)
|
|
|
+ - [Overflow semantics](#overflow-semantics)
|
|
|
+ - [Preferring i32 returns](#preferring-i32-returns)
|
|
|
+ - [Revoke and replace proposal #1964: Character Literals](#revoke-and-replace-proposal-1964-character-literals)
|
|
|
+- [Rationale](#rationale)
|
|
|
+- [Future work](#future-work)
|
|
|
+- [Alternatives considered](#alternatives-considered)
|
|
|
+ - [Align `char` fully with C++, or make it fully valid](#align-char-fully-with-c-or-make-it-fully-valid)
|
|
|
+ - [Raw character literals](#raw-character-literals)
|
|
|
+ - [Disallow hex escape sequences in character literals](#disallow-hex-escape-sequences-in-character-literals)
|
|
|
+ - [Allow grapheme clusters in character literals](#allow-grapheme-clusters-in-character-literals)
|
|
|
+ - [Reuse string literal syntax for character literals](#reuse-string-literal-syntax-for-character-literals)
|
|
|
+ - [Treat single-character string literals as a third "text literal" type](#treat-single-character-string-literals-as-a-third-text-literal-type)
|
|
|
+
|
|
|
+<!-- tocstop -->
|
|
|
+
|
|
|
+## Abstract
|
|
|
+
|
|
|
+- Add a `char` type literal mapping to `Core.Char` and equivalent to C++'s
|
|
|
+ `char`.
|
|
|
+ - 8 bits, unsigned, treated as a single UTF-8
|
|
|
+ [code unit](https://en.wikipedia.org/wiki/Character_encoding#Code_unit).
|
|
|
+- Add a `Core.CharLiteral` type for character literals, similar to
|
|
|
+ `Core.IntLiteral`.
|
|
|
+- Allow operations for `char` and `Core.CharLiteral` which reinforce the
|
|
|
+ "character" concept, versus an integer value.
|
|
|
+- Revoke and replace
|
|
|
+ [#1964: Character Literals](https://github.com/carbon-language/carbon-lang/pull/1964).
|
|
|
+
|
|
|
+## Problem
|
|
|
+
|
|
|
+`char` is an important type due to its common use in C++ code. However, the
|
|
|
+related proposal
|
|
|
+[#1964: Character Literals](https://github.com/carbon-language/carbon-lang/pull/1964)
|
|
|
+has several issues, including:
|
|
|
+
|
|
|
+- Lacks a decision for `char` handling; it is not mentioned in proposal #1964.
|
|
|
+ - Similarly, decides there are character literals, but more detail is
|
|
|
+ needed for implementation.
|
|
|
+- Type literal naming no longer reflects naming consensus.
|
|
|
+    - `Char8` seems closer to `std::char8_t` than to `char`, and for interop
|
|
|
+      purposes these are slightly different types. The same applies to
|
|
|
+      `Char16` and `Char32`.
|
|
|
+ - As a design direction, we have been lowercasing type literals (such as
|
|
|
+ `u8`).
|
|
|
+- Conflicting statements about behavior.
|
|
|
+ - For example, "Rationale" states that `var b: u8 = 'a' + 1` would be
|
|
|
+ supported, while "Operations" states that `+` is returning a character
|
|
|
+ literal (not a `u8`).
|
|
|
+ - For character literals, states "Escape sequences which would result in
|
|
|
+ non-UTF-8 encodings or more than one code point are not included."
|
|
|
+ However, it goes on to say that `let smiley: Char16 = '\u{1F600}'` is
|
|
|
+ valid even though `1F600` would require multiple code units in both
|
|
|
+ UTF-8 and UTF-16.
|
|
|
+- Unclear that it gives us a good UTF plan.
|
|
|
+ - Does not decide what a single character in a Carbon string is.
|
|
|
+ - No consideration regarding interop with the `std::char32_t` family of
|
|
|
+ types or [ICU](https://github.com/unicode-org/icu) compatibility.
|
|
|
+
|
|
|
+In other words, it's likely we want something similar to `Char32`, but it may be
|
|
|
+named something like `Core.Char32` and have slightly different type behaviors
|
|
|
+than decided in #1964. On the other hand, we need something compatible with the
|
|
|
+C++ `char` in order to proceed with basic C++ interop, and #1964 doesn't provide
|
|
|
+that.
|
|
|
+
|
|
|
+## Background
|
|
|
+
|
|
|
+- [Proposal #1964: Character Literals](https://github.com/carbon-language/carbon-lang/pull/1964)
|
|
|
+ is fundamental, and a lot of the underlying thoughts still apply. In
|
|
|
+ particular, we still want character types to be distinct from numeric types.
|
|
|
+- [Proposal #199: String literals](https://github.com/carbon-language/carbon-lang/pull/199)
|
|
|
+ is important because we want character and string literals to have mirrored
|
|
|
+ escaping concepts.
|
|
|
+- [Proposal #5448: Carbon <-> C++ Interop: Primitive Types](https://github.com/carbon-language/carbon-lang/pull/5448)
|
|
|
+ left the question of character type mappings open. This proposal aims to
|
|
|
+ answer it for `char`.
|
|
|
+- [Issue #5903: Built-in character type questions](https://github.com/carbon-language/carbon-lang/issues/5903)
|
|
|
+ addressed type questions.
|
|
|
+- [Issue #5922: Built-in character operators](https://github.com/carbon-language/carbon-lang/issues/5922)
|
|
|
+ addressed operators.
|
|
|
+
|
|
|
+## Proposal
|
|
|
+
|
|
|
+This proposal will:
|
|
|
+
|
|
|
+- Add a `char` type literal.
|
|
|
+ - Carbon's `str` type will use `char` for elements.
|
|
|
+ - For interop, map Carbon's `char` to C++'s `char`.
|
|
|
+- Add a `Core.CharLiteral` type for character literals, similar to
|
|
|
+ `Core.IntLiteral`.
|
|
|
+- Provide operators which are consistent with the character concept.
|
|
|
+
|
|
|
+This proposal additionally revokes and replaces proposal #1964, rather than
|
|
|
+trying to define which parts we are keeping and which are changing.
|
|
|
+
|
|
|
+## Details
|
|
|
+
|
|
|
+### Add a `char` type literal
|
|
|
+
|
|
|
+`char` is intended to offer a basic construct for Carbon's strings that is both
|
|
|
+compatible with UTF-8, and has high fidelity with C++ strings.
|
|
|
+
|
|
|
+In support of that, note the following:
|
|
|
+
|
|
|
+- `char` itself will be a
|
|
|
+ [type literal](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/words.md#type-literals).
|
|
|
+- `char` notionally represents a UTF-8 code unit.
|
|
|
+ - It can contain invalid code units, as long as it remains 8 bits. We do
|
|
|
+ not assume runtime validation.
|
|
|
+- `char` will be backed by `Core.Char`, in the prelude.
|
|
|
+ - `Core.Char` will adapt `u8`.
|
|
|
+- C++ interoperability will transparently map `char` and `Cpp.char` on API
|
|
|
+ boundaries.
|
|
|
+ - When used with Carbon, C++ `char` will be unsigned by default
|
|
|
+ (`-funsigned-char`). A program can switch back to signed
|
|
|
+ (`-fno-unsigned-char`), and Carbon will maintain interoperability but
|
|
|
+ bits will be interpreted differently in each language.
|
|
|
+
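The signedness point above can be illustrated from the C++ side. The following
is a minimal sketch, not part of the proposal: the helper names
(`PlainCharIsSigned`, `AsUnsigned`, `AsSigned`) are hypothetical, and it only
shows how one 8-bit pattern reads under each interpretation.

```cpp
#include <cassert>
#include <climits>
#include <type_traits>

// Reports whether plain `char` is signed in this compilation. With GCC and
// Clang, `-funsigned-char` makes this false.
constexpr bool PlainCharIsSigned() { return std::is_signed<char>::value; }

// Reads one 8-bit pattern as an unsigned value. Converting a negative value
// to `unsigned char` is well-defined in C++ (it wraps modulo 256).
inline int AsUnsigned(signed char bits) {
  return static_cast<unsigned char>(bits);
}

// Reads the same 8-bit pattern as a signed value.
inline int AsSigned(signed char bits) { return bits; }
```

For the bit pattern `0xE9` (a UTF-8 lead byte), `AsUnsigned` yields 233 while
`AsSigned` yields -23, which is the kind of divergence a program would see if
it switched C++ `char` back to signed.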
|
|
|
+#### Escape sequences
|
|
|
+
|
|
|
+Escape sequences are the same as for a string literal. Only one escape sequence
|
|
|
+may be provided in a character literal.
|
|
|
+
|
|
|
+### Add a `Core.CharLiteral` type for character literals
|
|
|
+
|
|
|
+`Core.CharLiteral` is the type of a character literal, similar to how
|
|
|
+`Core.IntLiteral` is the type of integer literals. It abstractly represents a
|
|
|
+single Unicode code point. This gives us a compile-time structure for characters
|
|
|
+that can be typed and referred to in programs.
|
|
|
+
|
|
|
+Semantics of a character literal will be equivalent to a
|
|
|
+[simple string literal](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#simple-and-block-string-literals),
|
|
|
+except that:
|
|
|
+
|
|
|
+- A character literal has a validated Unicode code point value.
|
|
|
+- The enclosing character is `'`.
|
|
|
+- The contents are precisely one character or escape sequence.
|
|
|
+ - The `\x` escape sequence is limited to values up to `7F`, where the
|
|
|
+ UTF-8 code unit and Unicode code point values are identical.
|
|
|
+
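The `7F` limit on `\x` follows from how UTF-8 encodes code points. As a rough
illustration in C++, here is a sketch of a UTF-8 encoder; `EncodeUtf8` is a
hypothetical helper written for this example and assumes its input is already
a valid Unicode scalar value.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Encodes a Unicode scalar value as UTF-8 code units. Assumes `cp` is valid
// (at most 0x10FFFF and not a surrogate); no validation is performed.
inline std::string EncodeUtf8(std::uint32_t cp) {
  std::string out;
  if (cp <= 0x7F) {
    // One code unit whose value equals the code point.
    out += static_cast<char>(cp);
  } else if (cp <= 0x7FF) {
    out += static_cast<char>(0xC0 | (cp >> 6));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else if (cp <= 0xFFFF) {
    out += static_cast<char>(0xE0 | (cp >> 12));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  } else {
    out += static_cast<char>(0xF0 | (cp >> 18));
    out += static_cast<char>(0x80 | ((cp >> 12) & 0x3F));
    out += static_cast<char>(0x80 | ((cp >> 6) & 0x3F));
    out += static_cast<char>(0x80 | (cp & 0x3F));
  }
  return out;
}
```

Code point `7F` encodes as the single byte `7F`, but code point `80` encodes
as the two bytes `C2 80`, so a byte value of `80` no longer matches any code
point value. `1F600` needs four code units (`F0 9F 98 80`).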
|
|
|
+An important detail of the character literal type is that it gives us a way to
|
|
|
+track constant values at compile time. For example, `'a' + 1` has a constant
|
|
|
+value of `'b'`. This means we can diagnose operations on character literals
|
|
|
+whose result isn't a valid Unicode code point, such as `'a' + 0xFFFFFF`.
|
|
|
+
|
|
|
+### Operators
|
|
|
+
|
|
|
+The provided operators are intended to cover the common operations a user would
|
|
|
+want to perform on characters. It is a non-goal to support use of
|
|
|
+`char` as an arbitrary byte or integer: developers should use `u8` for that.
|
|
|
+
|
|
|
+In general, `char` and `Core.CharLiteral` operators are intended to be mirrors
|
|
|
+of each other.
|
|
|
+
|
|
|
+#### Conversion operators
|
|
|
+
|
|
|
+- `char`
|
|
|
+ - `ImplicitAs`: None
|
|
|
+ - `ExplicitAs`: To/from `u8`, plus the set of `ImplicitAs` for `u8`.
|
|
|
+ - For example, `u8` has `ImplicitAs` to `u16`, so `char` has
|
|
|
+ `ExplicitAs` to `u16`.
|
|
|
+- `Core.CharLiteral`
|
|
|
+    - `ImplicitAs`: To `char` only
|
|
|
+ - `ExplicitAs`: To/from the set of `ImplicitAs` for `i32` and `u32`.
|
|
|
+ - For example, `i32` has `ImplicitAs` to `i64`, so `Core.CharLiteral`
|
|
|
+ has `ExplicitAs` to `i64`.
|
|
|
+ - For example, `i64` does not have `ImplicitAs` to `i32`; conversion
|
|
|
+ requires two casts, `((i64_val as i32) as Core.CharLiteral)`.
|
|
|
+
|
|
|
+Casting from a `char` to a `Core.CharLiteral` is not supported.
|
|
|
+
|
|
|
+See also
|
|
|
+[implicit numeric conversions](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/expressions/implicit_conversions.md#data-types).
|
|
|
+
|
|
|
+#### Comparison operators
|
|
|
+
|
|
|
+- `char`
|
|
|
+ - `EqWith` and `OrderedWith` when both operands are `char`.
|
|
|
+ - `ImplicitAs` should allow substituting one operand with
|
|
|
+ `Core.CharLiteral`.
|
|
|
+- `Core.CharLiteral`
|
|
|
+ - `EqWith` and `OrderedWith` when operands are `Core.CharLiteral`.
|
|
|
+
|
|
|
+#### Arithmetic operators
|
|
|
+
|
|
|
+- `char`
|
|
|
+ - `AddWith`: `char + <integer> -> char` (with reversible operands)
|
|
|
+        - Equivalent to `((((char as i16) + <integer>) as u8) as char)`
|
|
|
+ - `SubWith`:
|
|
|
+ - `char - <integer> -> char` (non-reversible operands)
|
|
|
+            - Equivalent to `((((char as i16) - <integer>) as u8) as char)`
|
|
|
+ - `char - char -> i32`
|
|
|
+ - Equivalent to `(lhs as i32) - (rhs as i32)`.
|
|
|
+ - `ImplicitAs` should allow substituting one operand with
|
|
|
+ `Core.CharLiteral`.
|
|
|
+- `Core.CharLiteral`
|
|
|
+ - `AddWith`: `Core.CharLiteral + <integer> -> Core.CharLiteral` (with
|
|
|
+ reversible operands)
|
|
|
+ - `SubWith`:
|
|
|
+ - `Core.CharLiteral - <integer> -> Core.CharLiteral`
|
|
|
+ (non-reversible operands)
|
|
|
+ - `Core.CharLiteral - Core.CharLiteral -> i32`
|
|
|
+            - Provides a Unicode code point delta.
|
|
|
+
|
|
|
+##### `char` integer parameters
|
|
|
+
|
|
|
+Arbitrary integers are supported for most of these operations. For example, we
|
|
|
+want to allow addition of negative numbers, even though the representation of
|
|
|
+`char` is unsigned, without requiring additional casts.
|
|
|
+
|
|
|
+##### Overflow semantics
|
|
|
+
|
|
|
+Operations will use error overflow semantics,
|
|
|
+[similar to signed integers](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/expressions/arithmetic.md#overflow-and-other-error-conditions).
|
|
|
+For example, `(('a' as char) + 500)` is invalid code because it causes `char`
|
|
|
+overflow. That's why conversions are to signed values (for example,
|
|
|
+`char as i16`).
|
|
|
+
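The `char` arithmetic rules and overflow semantics above can be modeled in C++
as follows. This is only a sketch under stated assumptions: `CodeUnit`,
`AddChar`, and `SubChars` are hypothetical names, and returning `false` stands
in for whatever form the overflow error takes in Carbon.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical stand-in for Carbon's `char`: an 8-bit unsigned code unit.
using CodeUnit = std::uint8_t;

// Models `char + <integer> -> char`. Returns false, leaving `*out` untouched,
// where the proposal treats the operation as an overflow error.
inline bool AddChar(CodeUnit c, std::int64_t delta, CodeUnit* out) {
  // Any delta outside [-255, 255] overflows for every 8-bit input. Rejecting
  // it first also keeps the addition below from overflowing `int64_t`.
  if (delta < -255 || delta > 255) return false;
  const std::int64_t wide = static_cast<std::int64_t>(c) + delta;
  if (wide < 0 || wide > 255) return false;
  *out = static_cast<CodeUnit>(wide);
  return true;
}

// Models `char - char -> i32`, that is, `(lhs as i32) - (rhs as i32)`.
inline std::int32_t SubChars(CodeUnit lhs, CodeUnit rhs) {
  return static_cast<std::int32_t>(lhs) - static_cast<std::int32_t>(rhs);
}
```

In this model, adding 1 to `'a'` succeeds and adding a negative delta works
without extra casts, while adding 500 is rejected. Subtracting two characters
can produce a negative delta, which is why the result is signed.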
|
|
|
+##### Preferring i32 returns
|
|
|
+
|
|
|
+In arithmetic, `i32` returns are preferred for deltas because they should be
|
|
|
+valid for Unicode code points. Even though `char` is only 8 bits, using `i32`
|
|
|
+for returns there too creates consistency with `Core.CharLiteral`.
|
|
|
+
|
|
|
+### Revoke and replace proposal #1964: Character Literals
|
|
|
+
|
|
|
+This revokes proposal #1964 for simplicity. Rather than trying to detail which
|
|
|
+decisions still apply and which don't, this proposal acts on the assumption
|
|
|
+that none of the decisions there still apply. Where we explicitly retain a
|
|
|
+decision, we can still point to the rationale in #1964, but we want each
|
|
|
+retained decision to go through that step.
|
|
|
+
|
|
|
+## Rationale
|
|
|
+
|
|
|
+- [Performance-critical software](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#performance-critical-software)
|
|
|
+ - The intent is that Carbon's main string type privileges UTF-8 over other
|
|
|
+ potential encodings. A `char` represents a single code unit within that,
|
|
|
+ and is consequently efficient to access. It can also be invalid, meaning
|
|
|
+ we don't guarantee performing runtime validation for users (avoiding
|
|
|
+ performance overhead), and that users might be able to use it for other
|
|
|
+ encodings.
|
|
|
+- [Software and language evolution](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#software-and-language-evolution)
|
|
|
+ - `Core.CharLiteral` is designed as a Unicode code point, and even though
|
|
|
+ this design doesn't include a way to use values over `7F`, we anticipate
|
|
|
+ those will be added in the future. It's being provided as a building
|
|
|
+ block for more elaborate Unicode functionality, including both UTF-16
|
|
|
+ and UTF-32, even as we prioritize UTF-8.
|
|
|
+- [Code that is easy to read, understand, and write](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#code-that-is-easy-to-read-understand-and-write)
|
|
|
+    - Character literal syntax mirrors string literal syntax. The main
|
|
|
+      divergence is that `\x80` and higher `\x` escapes are not supported,
|
|
|
+      because their behavior is potentially ambiguous; this restriction is
|
|
|
+      still in furtherance of this goal.
|
|
|
+- [Practical safety and testing mechanisms](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#practical-safety-and-testing-mechanisms)
|
|
|
+ - Restricting the set of operators valid for `char` gives us a way to do
|
|
|
+ different sorts of validation that can be more character-oriented than
|
|
|
+ if we treated it as an arbitrary byte.
|
|
|
+ - Treating `Core.CharLiteral` as a valid Unicode character allows us to
|
|
|
+ provide static checking for some operations, such as `'a' + 1` resulting
|
|
|
+ in another valid Unicode code point; more is also transitively possible,
|
|
|
+ including involving `char`.
|
|
|
+- [Interoperability with and migration from existing C++ code](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/project/goals.md#interoperability-with-and-migration-from-existing-c-code)
|
|
|
+ - Modeling `char` as a UTF-8 code unit creates behavior which is very
|
|
|
+ similar to C++, but still shifts towards a more character-oriented
|
|
|
+ approach. We do expect some migration friction as a consequence (as
|
|
|
+ use-cases might need either more casts, or to switch to a byte type).
|
|
|
+
|
|
|
+## Future work
|
|
|
+
|
|
|
+There's still significant future work, including:
|
|
|
+
|
|
|
+- `signed char`, `unsigned char`
|
|
|
+- `std::char8_t`, `std::char16_t`, `std::char32_t`
|
|
|
+- UTF-16 and UTF-32 support
|
|
|
+
|
|
|
+It should not be assumed that there's any restriction on the designs of those
|
|
|
+features, particularly no restrictions from #1964.
|
|
|
+
|
|
|
+## Alternatives considered
|
|
|
+
|
|
|
+### Align `char` fully with C++, or make it fully valid
|
|
|
+
|
|
|
+Alternatives were discussed in
|
|
|
+[zygoloid's comment on #5903](https://github.com/carbon-language/carbon-lang/issues/5903#issuecomment-3494068591).
|
|
|
+
|
|
|
+The comment notes that three options were proposed:
|
|
|
+
|
|
|
+1. `char` is fully aligned with C++.
|
|
|
+
|
|
|
+ There is no universal convention for what the value in a `char` means, and
|
|
|
+ the numerical encoding of Unicode characters into `char` sequences might
|
|
|
+ even be platform-dependent. For example, we might use some code page on
|
|
|
+ Windows, EBCDIC on some IBM targets, and probably UTF-8 everywhere else.
|
|
|
+ Likely the encoding would match what a character literal in C++ code would
|
|
|
+ do for that target. Even when the target normally uses UTF-8, it would be
|
|
|
+ reasonable to use an array of `char` as the type of the output buffer when
|
|
|
+ transcoding from UTF-8 to some other encoding, and generally an encoded text
|
|
|
+ buffer (in any encoding) would typically be represented as an array of
|
|
|
+ `char`. It might also be reasonable to use an array of `char` for things
|
|
|
+ that aren't necessarily text, such as file contents.
|
|
|
+
|
|
|
+2. `char` models a UTF-8 code unit, although it may not necessarily be valid,
|
|
|
+ and may appear in a sequence that is not a valid UTF-8 encoding.
|
|
|
+
|
|
|
+ As with the first option, `char` can represent an integer in [0, 255], although
|
|
|
+ it is not an integer type. Higher-level abstractions would likely (eventually)
|
|
|
+ be provided to represent different views of the code unit sequence as (for example)
|
|
|
+ a sequence of code points or a sequence of graphemes, but the fundamental model
|
|
|
+ exposes the encoding. Functions taking `char` or `char` sequences would assume
|
|
|
+ UTF-8 encoding, and would need to consider how to handle invalid `char`s and
|
|
|
+ invalid `char` sequences.
|
|
|
+
|
|
|
+3. Use a foundation that enforces Unicode string validity, for some definition
|
|
|
+ of "Unicode string validity".
|
|
|
+
|
|
|
+ The `char` type is a Unicode character. Strings would notionally be a
|
|
|
+ sequence of Unicode characters, possibly also maintaining some higher-level
|
|
|
+ string invariants. String indexing, if it exists, would likely treat the
|
|
|
+ string as a sequence of Unicode characters. String invariants would be
|
|
|
+ enforced by type conversion into the string type rather than within the
|
|
|
+ string operations, and certain classes of invalid strings would be
|
|
|
+ unrepresentable.
|
|
|
+
|
|
|
+The rationale, as evaluated, is:
|
|
|
+
|
|
|
+- **Privilege UTF-8 over other encodings:** UTF-8 is
|
|
|
+ [typically the best choice](https://utf8everywhere.org/) for representing
|
|
|
+ text, even when targeting languages where characters are 3 bytes in UTF-8
|
|
|
+ but 2 in UTF-16, and even on Windows where the system APIs typically operate
|
|
|
+ primarily in UTF-16 or UCS-2. We should create affordances that encourage
|
|
|
+ use of UTF-8 (such as having the `char` type be conventionally UTF-8).
|
|
|
+ - Our overall goal to support (only) modern environments and a general
|
|
|
+ desire for consistency and portability argues against supporting
|
|
|
+ non-Unicode encodings for character types.
|
|
|
+ - Having _some_ convention for the meaning of the value of a `char` seems
|
|
|
+ important, and the lack of one in C++ has been a notable problem over
|
|
|
+      time, leading to the addition of `char8_t` et al., which have not been
|
|
|
+ entirely satisfactory solutions due to the existing widespread usage of
|
|
|
+ plain `char`.
|
|
|
+- **Do not privilege any particular meaning of "validity":** There are many
|
|
|
+ different ways in which you can view a sequence of UTF-8 code units as being
|
|
|
+ valid or invalid. For example: Can a string start with a combining
|
|
|
+ character? Can it have mismatched LRE/RLE/PDF characters in it? Can it be
|
|
|
+ unnormalized, or must it be in NFC, or in NFD? Can it contain unassigned
|
|
|
+ Unicode characters? Can it contain PUA characters? Can it contain
|
|
|
+ non-characters? Picking any set of answers to these questions as being our
|
|
|
+ canonical notion of "validity" is somewhat arbitrary.
|
|
|
+- **Do not privilege any particular level for accessing elements of the string
|
|
|
+ other than code units:** There are many different layers of abstraction at
|
|
|
+ which you can interpret the contents of a string. The atoms that users want
|
|
|
+ to interact with, such as glyphs or grapheme clusters in rendering, or
|
|
|
+ combining characters when editing or performing substring searches, aren't
|
|
|
+ in one-to-one correspondence with Unicode characters any more than they're
|
|
|
+ in one-to-one correspondence with UTF-8 code units. So it's not clear that
|
|
|
+ privileging Unicode-character-oriented access (or indeed any of the other
|
|
|
+ higher-level Unicode views) is appropriate. However, code units are in
|
|
|
+ direct correspondence with bytes of memory, which is directly relevant for
|
|
|
+ low-level operations, so there is a reason to provide direct access to
|
|
|
+ byte-level / code-unit-level operations.
|
|
|
+ - If string indexing operates on Unicode characters, it would either be
|
|
|
+ non-constant-time or would require not storing strings as just a
|
|
|
+ sequence of UTF-8. Having a constant-time indexing operation on strings
|
|
|
+ seems very important (especially for interop and for meeting C++
|
|
|
+ developers where they are), even though a lot of the desired
|
|
|
+ functionality (perhaps all of it) can be provided with iterator- or
|
|
|
+ cursor-like machinery instead.
|
|
|
+- **Enforcing validity is problematic for existing API structures:** Requiring
|
|
|
+ strings to be valid UTF-8 presents difficulties when moving text into or out
|
|
|
+ of other sources. For example, when reading text from a validly-encoded
|
|
|
+ UTF-8 file into a text buffer, one would need to deal with a read that ends
|
|
|
+ in the middle of an encoding of a character. I don't know how Rust deals
|
|
|
+ with this, but it seems like it would create significant impedance mismatch
|
|
|
+ with C-like buffered I/O utilities. Similarly, when interoperating with C++,
|
|
|
+ it would create friction if our string representation requires strings to be
|
|
|
+ valid UTF-8 encodings.
|
|
|
+- **We can allow additional invariants without requiring them:** For a
|
|
|
+ known-to-be-valid UTF-8 sequence, a higher-level abstraction can be built,
|
|
|
+ and similarly, yet-higher-level abstractions can be built for whatever other
|
|
|
+ invariants we want to enforce. So using option 2 rather than option 3 as our
|
|
|
+ foundation doesn't prevent enforcing invariants in the type system (but nor
|
|
|
+ does it encourage doing so).
|
|
|
+
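The indexing trade-off described above can be made concrete with a short C++
sketch. The helper names (`IsContinuation`, `CountCodePoints`) are
hypothetical, and the code assumes well-formed UTF-8 input.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// True for UTF-8 continuation bytes, which have the bit pattern 10xxxxxx.
inline bool IsContinuation(char c) {
  return (static_cast<unsigned char>(c) & 0xC0) == 0x80;
}

// Counts code points by scanning every code unit. Each code point has exactly
// one non-continuation byte, so this is linear in the string length.
inline std::size_t CountCodePoints(const std::string& s) {
  std::size_t n = 0;
  for (char c : s) {
    if (!IsContinuation(c)) ++n;
  }
  return n;
}
```

Accessing code unit `s[i]` is constant-time, whereas finding the i-th code
point needs a scan like the one in `CountCodePoints`. A buffered read that
stops on a continuation byte has also stopped in the middle of a character,
which is the situation the validity discussion above is concerned with.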
|
|
|
+This proposal is choosing option 2, that `char` models a UTF-8 code unit without
|
|
|
+validation. In some sense, option 2 is still "fully aligned with C++", but with
|
|
|
+C++'s `char8_t` rather than with C++'s `char`.
|
|
|
+
|
|
|
+### Raw character literals
|
|
|
+
|
|
|
+[Raw string literals](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#raw-string-literals)
|
|
|
+use a `#` prefix. There's limited use for this in character literals;
|
|
|
+technically, `'\\'` could instead be `#'\'#`, but that's longer and extra
|
|
|
+characters may prove distracting. Raw string literals are more useful when
|
|
|
+there's a longer character sequence, whereas character literals have one
|
|
|
+character by definition. For simplicity, character literals won't have raw
|
|
|
+syntax.
|
|
|
+
|
|
|
+### Disallow hex escape sequences in character literals
|
|
|
+
|
|
|
+A `\x##` escape sequence abstractly represents a UTF-8 code unit. Whereas values
|
|
|
+over 7F are valid in string literals (allowing arbitrary byte values), these are
|
|
|
+disallowed in character literals because we want a more validated Unicode
|
|
|
+behavior. Developers could instead use `\u` escapes in place of `\x`.
|
|
|
+
|
|
|
+It can still be useful to allow `\x` escapes for low-range values because some
|
|
|
+developers will still need to specify
|
|
|
+[ANSI escapes](https://en.wikipedia.org/wiki/ANSI_escape_code). Carbon
|
|
|
+[drops support for some escape sequences](https://github.com/carbon-language/carbon-lang/blob/trunk/docs/design/lexical_conventions/string_literals.md#escape-sequences),
|
|
|
+such as `\a`, and specifically advises `\x` as an alternative for developers
|
|
|
+that need it. Requiring `\a` -> `\x07` -> `\u{07}` is incrementally more verbose
|
|
|
+syntax, and developers may be confused why `"\x1B"` is allowed for strings but
|
|
|
+`'\u{1B}'` is required for characters.
|
|
|
+
|
|
|
+Values over 7F are ambiguous between an arbitrary byte value and a Unicode code
|
|
|
+point, and so should be invalid. However, where both interpretations are
|
|
|
+identical for UTF-8 (values up to and including 7F), we will allow `\x` escape
|
|
|
+sequences.
|
|
|
+
|
|
|
+### Allow grapheme clusters in character literals
|
|
|
+
|
|
|
+This proposal carries forward the decision in #1964
|
|
|
+[to not support grapheme clusters](https://github.com/carbon-language/carbon-lang/pull/1964/files#diff-192d5568d8c1d15e68abe0c46cc52cc0b375a372d1dad8d2154d09f8b29666c5R340)
|
|
|
+in character literals.
|
|
|
+
|
|
|
+### Reuse string literal syntax for character literals
|
|
|
+
|
|
|
+Instead of using single quotes (for example, `'a'`), we could use string literal
|
|
|
+syntax with a conversion (for example, `"a" as char`) for character literals.
|
|
|
+This was proposed because it would free up the single quote for other,
|
|
|
+unspecified syntax uses.
|
|
|
+
|
|
|
+For background, character literals are common in C++. For example,
|
|
|
+Sourcegraph search counts for C++ code are as follows (some matches are in
|
|
|
+comments, which is a search limitation):
|
|
|
+
|
|
|
+- `'(.|\\.)'`:
|
|
|
+ [46.2 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:50000000+/%27%28.%7C%5C%5C.%29%27/&patternType=keyword&sm=0)
|
|
|
+- `<<`:
|
|
|
+ [over 100 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:100000000+/+%3C%3C+/&patternType=keyword&sm=0)
|
|
|
+- `>>`:
|
|
|
+ [10.4 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:50000000+/+%3E%3E+/&patternType=keyword&sm=0)
|
|
|
+- `%`:
|
|
|
+ [5.3 million](https://sourcegraph.com/search?q=context:global+lang:c%2B%2B+count:10000000+/+%25+/&patternType=keyword&sm=0)
|
|
|
+
|
|
|
+This creates several disadvantages for removing character literals in Carbon:
|
|
|
+
|
|
|
+- **Migrating C++ developers to Carbon:** The frequency of use can be expected
|
|
|
+ to have trained developers to expect single quotes to be used for
|
|
|
+ characters, especially the C++ developers that Carbon is targeting.
|
|
|
+    Repurposing them would create friction, because C++ developers would need
|
|
|
+    to understand different meanings for the same syntax in C++ and Carbon,
|
|
|
+    something Carbon prefers to avoid.
|
|
|
+
|
|
|
+- **Increased runtime error risks:** Runtime errors could take the form of
|
|
|
+ simple increased overhead, such as converting a string literal to a `str`
|
|
|
+ then to a `char`. However, they could also be more insidious, such as doing
|
|
|
+ `[0]` on a string literal and not validating that the string is exactly one
|
|
|
+ character (this would also likely return a null byte for `""[0]`). By having
|
|
|
+ a character literal type, Carbon encourages developers to stay within guide
|
|
|
+ rails that make it easier to get compile-time behavior and program
|
|
|
+ validation.
|
|
|
+
|
|
|
+- **Block string literal use:** We already have another use for single quotes
|
|
|
+ in Carbon:
|
|
|
+ [block string literals](/docs/design/lexical_conventions/string_literals.md).
|
|
|
+ The syntax may need to change along with removing character literals, to
|
|
|
+ make room for other uses of single quotes.
|
|
|
+
|
|
|
+ - If retained, it would constrain uses of single quotes. For example, a
|
|
|
+ unary operator syntax has overlap (that is, if `'a` and `''a` are valid,
|
|
|
+ then `'''a` is ambiguous).
|
|
|
+
|
|
|
+ - The choice of single quotes in proposal
|
|
|
+ [#1360: Change raw string literal syntax](https://github.com/carbon-language/carbon-lang/pull/1360)
|
|
|
+ was made accounting for single quotes in character literals, and that
|
|
|
+ commonality would be lost.
|
|
|
+
|
|
|
+- **Tooling:** The prevalence of single quotes being used for either strings
|
|
|
+ or characters also affects their treatment in tools not specialized to
|
|
|
+ Carbon: they expect them to be used for strings. For example, Rust's use of
|
|
|
+ single quotes for lifetime annotations has been observed to break
|
|
|
+ language-agnostic syntax highlighting.
|
|
|
+
|
|
|
+While a compelling proposal for a different use of single quotes may come up in
|
|
|
+the future, freeing up the character for other purposes is insufficient to
|
|
|
+justify a different syntax for character literals.
|
|
|
+
|
|
|
+#### Treat single-character string literals as a third "text literal" type
|
|
|
+
|
|
|
+A related alternative with the same goal of eliminating single quotes for
|
|
|
+character literals is that, rather than requiring single-character string
|
|
|
+literals be explicitly converted to `char`, they could instead have a third type
|
|
|
+of text literal. This would implicitly cast to either `str` or `char`.
|
|
|
+
|
|
|
+This approach would lead to three literal types: `StrLiteral`, `CharLiteral`,
|
|
|
+and `TextLiteral`. The distinction of `CharLiteral` is important because we
|
|
|
+still want to support arithmetic on character literals, such as `'a' + 1` (which
|
|
|
+we would not want to be allowed for `StrLiteral`).
|
|
|
+
|
|
|
+The existence of a third type would be important for generic code, even when not
|
|
|
+trying to use character literals. For example:
|
|
|
+
|
|
|
+```carbon
|
|
|
+ fn StoreValue[U:! type](ref a: Optional(U), b: U) {
|
|
|
+ a = b;
|
|
|
+ }
|
|
|
+
|
|
|
+ fn StrLogic[T:! type](a: T) {
|
|
|
+ var x: Optional(T) = a;
|
|
|
+ StoreValue(x, "str");
|
|
|
+ }
|
|
|
+
|
|
|
+ fn F() {
|
|
|
+ StrLogic("a");
|
|
|
+ }
|
|
|
+```
|
|
|
+
|
|
|
+Here, `T` is deduced to be `TextLiteral`. However, `U` has no valid value: it's
|
|
|
+passed `Optional(TextLiteral)`, while `"str"` is a `StrLiteral` (which should
|
|
|
+not be convertible to `TextLiteral`). As a consequence, this code is invalid,
|
|
|
+even though the same code would be valid if there were no `TextLiteral` type.
|
|
|
+
|
|
|
+Advantages:
|
|
|
+
|
|
|
+- Avoids an explicit cast.
|
|
|
+
|
|
|
+Disadvantages:
|
|
|
+
|
|
|
+- Shares most of the disadvantages of the primary explicit conversion
|
|
|
+ approach.
|
|
|
+ - This includes the risk that developers will write `"..."[0]` instead of
|
|
|
+ `"..." as char` when they need a character, although the frequency may
|
|
|
+ be reduced.
|
|
|
+- Having additional types in common literals could lead to programmer errors
|
|
|
+ in deducing generic types, as described above.
|
|
|
+- Implicit casts cause more operator ambiguity.
|
|
|
+ - How are operators that have different meanings for string and character
|
|
|
+ literals handled, such as `Cpp.std.cout <<` or `<=>`?
|
|
|
+ - In Carbon, we'd probably still want string operators to work; for
|
|
|
+ example, `"a" + "b" => "ab"`, and that can be compile-time. Is `"a" + 1`
|
|
|
+ a pointer to the null byte as it is in C++ (similar to `&("a"[1])`), a
|
|
|
+ character addition (`'a' + 1 => 'b'`), or does it require an explicit
|
|
|
+ cast in order to ensure behavior is deliberate?
|
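For reference, the C++ behaviors mentioned in the last point can be checked
with a small sketch. The function names are hypothetical, and the comparison
of `'a' + 1` with `'b'` assumes an ASCII-compatible execution character set.

```cpp
#include <cassert>
#include <type_traits>

// In C++, `'a' + 1` is promoted to `int`; the result is not a character type.
static_assert(std::is_same<decltype('a' + 1), int>::value,
              "char + int yields int in C++");

// In C++, a string literal decays to a pointer, so advancing one element past
// the start of "a" reaches its null terminator. This uses the `&"a"[1]` form,
// which is equivalent to `"a" + 1`.
inline char CharAfterA() { return *(&"a"[1]); }

// Indexing an empty string literal reads its null terminator.
inline char FirstOfEmpty() { return ""[0]; }
```

Both functions return a null byte. That result is easy to produce by accident
when string literals stand in for characters, which is part of the motivation
for keeping a distinct character literal type.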