# `char` redesign

This proposal introduces:

- A `char` type literal mapping to `Core.Char` and equivalent to C++'s
  `char`.
- A `Core.CharLiteral` type for character literals, similar to
  `Core.IntLiteral`.
- Names (`char` and `Core.CharLiteral`) which reinforce the "character"
  concept, versus an integer value.

## Problem

`char` is an important type due to its common use in C++ code. However, the
related proposal #1964: Character Literals has several issues, including:
- `char` handling is not addressed; it is not mentioned in proposal #1964.
- `Char8` seems potentially more equivalent to `std::char8_t` instead of
  `char`, and for interop purposes these are slightly different types. Similar
  applies to `Char16` and `Char32`.
- Inconsistencies between sections: `var b: u8 = 'a' + 1` would be supported,
  while "Operations" states that `+` returns a character literal (not a `u8`).
- `let smiley: Char16 = '\u{1F600}'` is valid even though `1F600` would
  require multiple code units in both UTF-8 and UTF-16.
- It does not address the `std::char32_t` family of types or ICU
  compatibility.

In other words, it's likely we want something similar to `Char32`, but it may
be named something like `Core.Char32` and have slightly different type
behaviors than decided in #1964. On the other hand, we need something
compatible with the C++ `char` in order to proceed with basic C++ interop, and
#1964 doesn't provide that.
## Proposal

The way `char` will work is:

- There will be a `char` type literal.
- The `str` type will use `char` for elements.
- Interop will map `char` to C++'s `char`.
- There will be a `Core.CharLiteral` type for character literals, similar to
  `Core.IntLiteral`.

This proposal additionally revokes and replaces proposal #1964, rather than
trying to define which parts we are keeping and which are changing.
### `char` type literal

`char` is intended to offer a basic construct for Carbon's strings that is
both compatible with UTF-8 and has high fidelity with C++ strings.

In support of that, important notes are:

- `char` itself will be a type literal.
- `char` notionally represents a UTF-8 code unit.
- `char` will be backed by `Core.Char`, in the prelude.
- `Core.Char` will adapt `u8`.
- Interop will convert between Carbon's `char` and `Cpp.char` on API
  boundaries.
- `char` will be unsigned by default (`-funsigned-char`). A program can
  switch back to signed (`-fno-unsigned-char`), and Carbon will maintain
  interoperability, but bits will be interpreted differently in each language.

Escape sequences are the same as for a string literal. Only one escape
sequence may be provided in a character literal.
### `Core.CharLiteral` type for character literals

`Core.CharLiteral` is the type of a character literal, similar to how
`Core.IntLiteral` is the type of integer literals. It abstractly represents a
single Unicode code point. This gives us a compile-time structure for
characters that can be typed and referred to in programs.

Semantics of a character literal will be equivalent to a simple string
literal, except that:

- The delimiters are single quotation marks, `'`.
- The `\x` escape sequence is limited to values up to 7F, where the UTF-8
  code unit and Unicode code point values are identical.

An important detail of the character literal type is that it gives us a way
to track constant values at compile time. For example, `'a' + 1` has a
constant value of `'b'`. This means we can diagnose uses of character
literals that don't represent a valid Unicode code point, such as
`'a' + 0xFFFFFF`.
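A sketch of the compile-time tracking described above, assuming the proposed
semantics (illustrative only; exact diagnostics are not specified here):

```carbon
// `Core.CharLiteral` arithmetic is evaluated at compile time.
let b: char = 'a' + 1;  // Constant value 'b'.
// Diagnosed at compile time: 0x61 + 0xFFFFFF exceeds the maximum
// Unicode code point (0x10FFFF).
// let bad: char = 'a' + 0xFFFFFF;
```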
### Operations

The goal is to provide a set of operators which map to common operations a
user would want to do. It is a non-goal to support use of `char` as an
arbitrary byte or integer: developers should use `u8` for that.

In general, `char` and `Core.CharLiteral` operators are intended to be
mirrors of each other.
#### Conversions

For `char`:

- `ImplicitAs`: None.
- `ExplicitAs`: To/from `u8`, plus the set of `ImplicitAs` for `u8`. For
  example, `u8` has `ImplicitAs` to `u16`, so `char` has `ExplicitAs` to
  `u16`.

For `Core.CharLiteral`:

- `ImplicitAs`: to `char` only.
- `ExplicitAs`: To/from the set of `ImplicitAs` for `i32` and `u32`. For
  example, `i32` has `ImplicitAs` to `i64`, so `Core.CharLiteral` has
  `ExplicitAs` to `i64`. Note `i64` does not have `ImplicitAs` to `i32`;
  conversion requires two casts, `((i64_val as i32) as Core.CharLiteral)`.

Casting from a `char` to a `Core.CharLiteral` is not supported.

See also implicit numeric conversions.
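A sketch of the conversion rules under the proposed semantics (illustrative,
not compiler-checked):

```carbon
var c: char = 'a';           // Core.CharLiteral has ImplicitAs to char.
var byte: u8 = c as u8;      // char requires an explicit cast, even to u8.
var wide: u16 = c as u16;    // Allowed because u8 has ImplicitAs to u16.
// var oops: u8 = c;         // Error: char has no ImplicitAs.
let code: i32 = 'a' as i32;  // Core.CharLiteral casts explicitly to i32.
```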
#### Comparison

For `char`:

- `EqWith` and `OrderedWith` when both operands are `char`. `ImplicitAs`
  should allow substituting one operand with `Core.CharLiteral`.

For `Core.CharLiteral`:

- `EqWith` and `OrderedWith` when operands are `Core.CharLiteral`.

#### Arithmetic

For `char`:

- `AddWith`: `char + <integer> -> char` (with reversible operands).
  Equivalent to `(((char as i16) + <integer>) as u8) as char`.
- `SubWith`: `char - <integer> -> char` (non-reversible operands).
  Equivalent to `(((char as i16) - <integer>) as u8) as char`.
- `SubWith`: `char - char -> i32`. Equivalent to
  `(lhs as i32) - (rhs as i32)`.
- `ImplicitAs` should allow substituting one operand with
  `Core.CharLiteral`.

For `Core.CharLiteral`:

- `AddWith`: `Core.CharLiteral + <integer> -> Core.CharLiteral` (with
  reversible operands).
- `SubWith`: `Core.CharLiteral - <integer> -> Core.CharLiteral`
  (non-reversible operands).
- `SubWith`: `Core.CharLiteral - Core.CharLiteral -> i32`.
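The operator set above can be sketched as follows, assuming the proposed
semantics (illustrative, not compiler-checked):

```carbon
var c: char = 'a';
var next: char = c + 1;   // AddWith: char + integer -> char.
var also: char = 1 + c;   // Addition operands are reversible.
var prev: char = c - 1;   // SubWith: char - integer -> char.
var delta: i32 = 'z' - c; // char - char -> i32 (ImplicitAs substitutes
                          // the Core.CharLiteral operand); here 25.
// var bad: char = 1 - c; // Error: subtraction operands not reversible.
```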
#### `char` integer parameters

Arbitrary integers are supported for most of these operations. For example,
we want to allow addition of negative numbers, even though the representation
of `char` is unsigned, without requiring additional casts.

Operations will use error overflow semantics, similar to signed integers. For
example, `('a' as char) + 500` is invalid code because it causes `char`
overflow. That's why conversions are to signed values (for example,
`char as i16`).

In arithmetic, `i32` returns are preferred for deltas because they should be
valid for Unicode code points. Even though `char` is only 8 bits, using
`i32` for returns there too creates consistency with `Core.CharLiteral`.
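A sketch of the overflow behavior, assuming the proposed error semantics
(illustrative only):

```carbon
var c: char = 'a';
// Negative integers are allowed without casts, despite unsigned storage;
// the operation is modeled as (((c as i16) + -1) as u8) as char.
var prev: char = c + (-1);
// Error overflow semantics: diagnosed rather than wrapping, because
// 0x61 + 500 doesn't fit in a char.
// var overflow: char = c + 500;
```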
### Revoking #1964

This revokes proposal #1964 for simplicity. Rather than trying to detail
which decisions still apply and which don't, this proposal acts from the
assumption that none of the decisions there still apply. We can still benefit
by pointing towards the rationale there when explicitly maintaining a
decision, but we want to go through that step deliberately.
## Rationale

- `char` represents a single code unit within a UTF-8 string, and is
  consequently efficient to access. It can also be invalid, meaning we don't
  guarantee performing runtime validation for users (avoiding performance
  overhead), and that users might be able to use it for other encodings.
- `Core.CharLiteral` is designed as a Unicode code point, and even though
  this design doesn't include a way to use values over 7F, we anticipate
  those will be added in the future. It's being provided as a building block
  for more elaborate Unicode functionality, including both UTF-16 and UTF-32,
  even as we prioritize UTF-8.
- `\x80` and higher escapes are not supported due to potentially ambiguous
  behavior, still in furtherance of this goal.
- `char` gives us a way to do different sorts of validation that can be more
  character-oriented than if we treated it as an arbitrary byte.
- Treating `Core.CharLiteral` as a valid Unicode character allows us to
  provide static checking for some operations, such as `'a' + 1` resulting in
  another valid Unicode code point; more is also transitively possible,
  including involving `char`.
- Treating `char` as a UTF-8 code unit creates behavior which is very similar
  to C++, but still shifts towards a more character-oriented approach. We do
  expect some migration friction as a consequence (as use-cases might need
  either more casts, or to switch to a byte type).

There's still significant future work, including:

- `signed char` and `unsigned char`.
- `std::char8_t`, `std::char16_t`, and `std::char32_t`.

It should not be assumed that there's any restriction on the designs of those
features; in particular, there are no restrictions from #1964.
## Alternatives considered

### Align `char` fully with C++, or make it fully valid

Alternatives were discussed in zygoloid's comment on #5903. The comment notes
that three options were proposed:
1. `char` is fully aligned with C++.

   There is no universal convention for what the value in a `char` means, and
   the numerical encoding of Unicode characters into `char` sequences might
   even be platform-dependent. For example, we might use some code page on
   Windows, EBCDIC on some IBM targets, and probably UTF-8 everywhere else.
   Likely the encoding would match what a character literal in C++ code would
   do for that target. Even when the target normally uses UTF-8, it would be
   reasonable to use an array of `char` as the type of the output buffer when
   transcoding from UTF-8 to some other encoding, and generally an encoded
   text buffer (in any encoding) would typically be represented as an array
   of `char`. It might also be reasonable to use an array of `char` for
   things that aren't necessarily text, such as file contents.

2. `char` models a UTF-8 code unit, although it may not necessarily be valid,
   and may appear in a sequence that is not a valid UTF-8 encoding.

   As with the first option, `char` can represent an integer in [0, 255],
   although it is not an integer type. Higher-level abstractions would likely
   (eventually) be provided to represent different views of the code unit
   sequence as (for example) a sequence of code points or a sequence of
   graphemes, but the fundamental model exposes the encoding. Functions
   taking `char` or `char` sequences would assume UTF-8 encoding, and would
   need to consider how to handle invalid `char`s and invalid `char`
   sequences.

3. Use a foundation that enforces Unicode string validity, for some
   definition of "Unicode string validity".

   The `char` type is a Unicode character. Strings would notionally be a
   sequence of Unicode characters, possibly also maintaining some
   higher-level string invariants. String indexing, if it exists, would
   likely treat the string as a sequence of Unicode characters. String
   invariants would be enforced by type conversion into the string type
   rather than within the string operations, and certain classes of invalid
   strings would be unrepresentable.
The rationale as evaluated:

- Having the `char` type be conventionally UTF-8.
- A `char` type seems important, and the lack of one in C++ has been a
  notable problem over time, leading to the addition of `char8_t` et al,
  which have not been entirely satisfactory solutions due to the existing
  widespread usage of plain `char`.

This proposal is choosing option 2: `char` models a UTF-8 code unit without
validation. In some sense, option 2 is still "fully aligned with C++", but
with C++'s `char8_t` rather than with C++'s `char`.
### No raw character literal syntax

Raw string literals use a `#` prefix. There's limited use for this in
character literals; technically, `'\\'` could instead be `#'\'#`, but that's
longer and the extra characters may prove distracting. Raw string literals
are more useful when there's a longer character sequence, whereas character
literals have one character by definition. For simplicity, character literals
won't have raw syntax.
### Allow `\x` escapes up to 7F

A `\x##` escape sequence abstractly represents a UTF-8 code unit. Whereas
values over 7F are valid in string literals (allowing arbitrary byte values),
these are disallowed in character literals because we want more validated
Unicode behavior. Developers could instead rely on `\u` escapes in place of
`\x`.

It can still be useful to allow `\x` escapes for low-range values because
some developers will still need to specify ANSI escapes. Carbon drops support
for some escape sequences, such as `\a`, and specifically advises `\x` as an
alternative for developers that need it. Requiring `\a` -> `\x07` -> `\u{07}`
is incrementally more verbose syntax, and developers may be confused why
`"\x1B"` is allowed for strings but `'\u{1B}'` is required for characters.

Values over 7F are ambiguous between an arbitrary byte value and a Unicode
code point, and so should be invalid. However, where both interpretations are
identical for UTF-8 (values up to and including 7F), we will allow `\x`
escape sequences.
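The boundary can be sketched as follows, assuming the proposed rules
(illustrative, not compiler-checked):

```carbon
var esc: char = '\x1B';     // OK: 1B <= 7F, code unit and code point match.
var bell: char = '\x07';    // Replacement for C++'s '\a'.
var same: char = '\u{1B}';  // Equivalent spelling via a \u escape.
// var high: char = '\x80'; // Error: ambiguous between byte value and
                            // Unicode code point.
```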
### No grapheme clusters

This proposal carries forward the decision in #1964 to not support grapheme
clusters in character literals.
### Use string literal syntax with a conversion

Instead of using single quotes (for example, `'a'`), we could use string
literal syntax with a conversion (for example, `"a" as char`) for character
literals. This was proposed because it would free up the single quote for
other, unspecified syntax uses.

For background, character literals are common in C++. For example, in
SourceGraph search statistics (some of these are in comments -- a search
limitation):

- `'(.|\\.)'`: 46.2 million
- `<<`: over 100 million
- `>>`: 10.4 million
- `%`: 5.3 million

This creates several disadvantages for removing character literals in Carbon:
- Migrating C++ developers to Carbon: The frequency of use can be expected to
  have trained developers to expect single quotes to be used for characters,
  especially the C++ developers that Carbon is targeting. Repurposing them
  would create friction for C++ developers, who would need to understand the
  different meanings of the same syntax in each of C++ and Carbon, something
  Carbon prefers to avoid.
- Increased runtime error risks: Runtime errors could take the form of simple
  increased overhead, such as converting a string literal to a `str` then to
  a `char`. However, they could also be more insidious, such as doing `[0]`
  on a string literal and not validating that the string is exactly one
  character (this would also likely return a null byte for `""[0]`). By
  having a character literal type, Carbon encourages developers to stay
  within guide rails that make it easier to get compile-time behavior and
  program validation.
- Block string literal use: We already have another use for single quotes in
  Carbon: block string literals. That syntax may need to change along with
  removing character literals, to make room for other uses of single quotes.
  If retained, it would constrain uses of single quotes. For example, a unary
  operator syntax has overlap (that is, if `'a` and `''a` are valid, then
  `'''a` is ambiguous). The choice of single quotes in proposal #1360: Change
  raw string literal syntax was made accounting for single quotes in
  character literals, and that commonality would be lost.
- Tooling: The prevalence of single quotes being used for either strings or
  characters also affects their treatment in tools not specialized to Carbon:
  they expect them to be used for strings. For example, Rust's use of single
  quotes for lifetime annotations has been observed to break
  language-agnostic syntax highlighting.
While a compelling proposal for a different use of single quotes may come up in the future, freeing up the character for other purposes is insufficient to justify a different syntax for character literals.
### Add a third text literal type

A related alternative with the same goal of eliminating single quotes for
character literals is that, rather than requiring single-character string
literals be explicitly converted to `char`, they could instead have a third
type of text literal. This would implicitly convert to either `str` or
`char`.

This approach would lead to three literal types: `StrLiteral`, `CharLiteral`,
and `TextLiteral`. The distinction of `CharLiteral` is important because we
still want to support arithmetic on character literals, such as `'a' + 1`
(which we would not want to be allowed for `StrLiteral`).
The existence of a third type would be important for generic code, even when
not trying to use character literals. For example:

```carbon
fn StoreValue[U:! type](ref a: Optional(U), b: U) {
  a = b;
}

fn StrLogic[T:! type](a: T) {
  var x: Optional(T) = a;
  StoreValue(x, "str");
}

fn F() {
  StrLogic("a");
}
```

Here, `T` is deduced to be `TextLiteral`. However, `U` has no valid value:
it's passed `Optional(TextLiteral)`, while `"str"` is a `StrLiteral` (which
should not be convertible to `TextLiteral`). As a consequence, this code is
invalid, even though the same code would be valid if there were no
`TextLiteral` type.
Advantages:

- It would free up the single quote for other, unspecified syntax uses.

Disadvantages:

- Developers may use `"..."[0]` instead of `"..." as char` when they need a
  character, although the frequency may be reduced.
- How would it interact with `Cpp.std.cout <<` or `<=>`?
- It creates ambiguity for `+`: `"a" + "b" => "ab"`, and that can be
  compile-time. Is `"a" + 1` a pointer to the null byte as it is in C++
  (similar to `&("a"[1])`), a character addition (`'a' + 1 => 'b'`), or does
  it require an explicit cast in order to ensure behavior is deliberate?