This proposal specifies lexical rules for numeric constants in Carbon.
We wish to cover literals for two categories of types: integer types and real number types.
Real number types may include additional values (infinities and NaN values). We do not provide a notation to express such values.
In C++, the following syntaxes are used:
- Integer literals
  - `12345` (decimal)
  - `0x1FE` (hexadecimal)
  - `0123` (octal)
  - `0b1010` (binary)
- Decimal floating-point literals
  - `123.`
  - `.123`
  - `123.456`
  - `123.e456` (= $123 \times 10^{456}$)
  - `.123e456`
  - `123.456e789`
  - `123e456` (no decimal point)
  - optionally, a `+` or `-` after `e`
- Hexadecimal floating-point literals
  - `0x123.p456` (= $123_{16} \times 2^{456}$)
  - `0x.123p456`
  - `0x123.456p789`
  - `0x123p456` (no hexadecimal point)
  - optionally, a `+` or `-` after `p`
- Digit separators (`'`) may appear between any two digits
- Suffixes
  - `U` (unsigned) and `L` (long) or `LL` (long long) for integers
    (order-independent, but `LUL` disallowed)
  - `F` (float) or `L` (long double) for real numbers
  - `_` for non-standard-library literals

C++ numeric literals are case-insensitive, except in the suffix of a
user-defined literal. Negative numbers are formed by applying a unary -
operator to a non-negative literal.
The type of a literal in C++ depends primarily on its syntax and its suffix.
However, for integer literals, the type also depends on the value; the language
rules attempt to pick a type large enough to fit the value. An unsigned type
is always used if a U suffix is present, is never used for a decimal literal
without a U suffix, and otherwise may or may not be used depending on whether
the value happens to fit into an unsigned type but not into a signed type of the
same width.
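
The value-dependent selection described above can be sketched in Python. This is an illustration only: the bit widths assume an LP64 platform (32-bit `int`, 64-bit `long` and `long long`), the function name `cpp_literal_type` is hypothetical, and literals with an `L` or `LL` suffix are not modeled.

```python
# Candidate types in the order C++ tries them, with ASSUMED LP64 bit widths.
SIGNED = [("int", 32), ("long", 64), ("long long", 64)]
UNSIGNED = [("unsigned int", 32), ("unsigned long", 64), ("unsigned long long", 64)]


def cpp_literal_type(value: int, is_decimal: bool, has_u_suffix: bool) -> str:
    """Return the first candidate type that can hold a non-negative `value`."""
    if has_u_suffix:
        candidates = UNSIGNED
    elif is_decimal:
        # A decimal literal without U never gets an unsigned type.
        candidates = SIGNED
    else:
        # Hex, octal, and binary literals interleave signed and unsigned types.
        candidates = [t for pair in zip(SIGNED, UNSIGNED) for t in pair]
    for name, bits in candidates:
        limit = 2**bits if name.startswith("unsigned") else 2 ** (bits - 1)
        if value < limit:
            return name
    raise ValueError("literal does not fit in any candidate type")
```

Under these assumptions, the same value 4294967295 is typed differently depending on how it is spelled: `unsigned int` when written as `0xFFFFFFFF`, but `long` when written in decimal.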
Other languages use somewhat different rules, but the broad lexical structure above -- an optional prefix for the base, a value, an optional exponent, and an optional suffix -- is common across a large number of languages.
We allow these syntaxes:
- Integer literals
  - `12345` (decimal)
  - `0x1FE` (hexadecimal)
  - `0b1010` (binary)
- Real number literals
  - `123.456` (digits on both sides of the `.`)
  - `123.456e789` (optional `+` or `-` after the `e`)
  - `0x1.2p123` (optional `+` or `-` after the `p`)
- Digit separators (`_`) may be used, but only in conventional locations

Note that real number literals always contain a `.` with digits on both sides,
and integer literals never contain a `.`.
Literals are case-sensitive.
No support is proposed for literals with type suffixes, but without prejudice: this proposal proposes neither the inclusion nor the absence of such literals.
Decimal integers are written as a non-zero decimal digit followed by zero or
more additional decimal digits, or as a single 0.
Integers in other bases are written as a 0 followed by a base specifier
character, followed by a sequence of digits in the corresponding base. The
available base specifiers and corresponding bases are:
| Base specifier | Base | Digits |
|---|---|---|
| b | 2 | 0 and 1 |
| x | 16 | 0 ... 9, A ... F |
The above table is case-sensitive. For example, 0b1 and 0x1A are valid, and
0B1, 0X1A, and 0x1a are invalid.
A zero at the start of a literal can never be followed by another digit: either
the literal is 0, the 0 begins a base specifier, or the next character is a
decimal point (see below).
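
The integer rules above are small enough to capture in a single regular expression. The following Python sketch is one possible reading, not part of the proposal: the names `INTEGER_LITERAL` and `parse_integer_literal` are hypothetical, and digit separators are deliberately left out.

```python
import re

# Decimal: a single 0, or a non-zero digit followed by more digits.
# Other bases: 0b + binary digits, or 0x + uppercase hexadecimal digits.
INTEGER_LITERAL = re.compile(r"(?:0|[1-9][0-9]*|0b[01]+|0x[0-9A-F]+)")


def parse_integer_literal(text: str) -> int:
    """Return the value of `text`, or raise ValueError if it is not valid."""
    if not INTEGER_LITERAL.fullmatch(text):
        raise ValueError(f"invalid integer literal: {text!r}")
    if text.startswith("0b"):
        return int(text[2:], 2)
    if text.startswith("0x"):
        return int(text[2:], 16)
    return int(text)
```

With this reading, `0x1FE` and `0b1010` parse, while `0123`, `0B1`, `0X1A`, and `0x1a` are all rejected.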
Real numbers are written as a decimal or hexadecimal integer followed by a
period (.) followed by a sequence of one or more decimal or hexadecimal
digits, respectively. A digit is required on each side of the period. 0. and
.3 are both invalid.
A real number can be followed by an exponent character, an optional + or -
(defaulting to + if absent), and a character sequence matching the grammar of
a decimal integer with some value N. For a decimal real number, the exponent
character is e, and the effect is to multiply the given value by
$10^{\pm N}$. For a hexadecimal real number, the exponent character
is p, and the effect is to multiply the given value by
$2^{\pm N}$. The exponent suffix is optional for both decimal and
hexadecimal real numbers.
Note that a decimal integer followed by e is not a real number literal. For
example, 3e10 is not a valid literal.
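
As a rough illustration of the real number grammar and its meaning, the Python sketch below validates a literal and computes its exact mathematical value as a fraction. The names `REAL_LITERAL` and `real_literal_value` are hypothetical, digit separators are ignored, and nothing here is taken from an actual Carbon implementation.

```python
import re
from fractions import Fraction

DEC_INT = r"(?:0|[1-9][0-9]*)"
REAL_LITERAL = re.compile(
    # Decimal form: integer, period, digits, optional e exponent.
    rf"(?:(?P<dec>{DEC_INT})\.(?P<dfrac>[0-9]+)(?:e(?P<dexp>[+-]?{DEC_INT}))?"
    # Hexadecimal form: 0x, digits, period, digits, optional p exponent.
    rf"|0x(?P<hex>[0-9A-F]+)\.(?P<hfrac>[0-9A-F]+)(?:p(?P<hexp>[+-]?{DEC_INT}))?)"
)


def real_literal_value(text: str) -> Fraction:
    """Return the exact value of a real number literal as a Fraction."""
    m = REAL_LITERAL.fullmatch(text)
    if m is None:
        raise ValueError(f"invalid real number literal: {text!r}")
    if m["dec"] is not None:
        digits, frac, exp, base, radix = m["dec"], m["dfrac"], m["dexp"], 10, 10
    else:
        digits, frac, exp, base, radix = m["hex"], m["hfrac"], m["hexp"], 16, 2
    # Digits after the period scale the mantissa down by base ** len(frac).
    mantissa = Fraction(int(digits + frac, base), base ** len(frac))
    # e multiplies by a power of 10; p multiplies by a power of 2.
    return mantissa * Fraction(radix) ** int(exp or "0")
```

Keeping the value exact at this stage means rounding to a concrete type can be handled as a separate step.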
When a real number literal is interpreted as a value of a real number type, its value is the representable real number closest to the value of the literal. In the case of a tie, the conversion to the real number type is invalid.
The decimal real number syntax allows for any decimal fraction to be expressed -- that is, any number of the form $a \times 10^{-b}$, where a is an integer and b is a non-negative integer. Because the decimal fractions are dense in the reals and the set of values of the real number type is assumed to be discrete, every value of the real number type can be expressed as a real number literal. However, for certain applications, directly expressing the intended real number representation may be more convenient than producing a decimal equivalent that is known to convert to the intended value. Hexadecimal real number literals are provided in order to permit values of binary floating or fixed point real number types to be expressed directly.
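
To make the convenience argument concrete, the following Python snippet compares the two spellings of a single binary64 value. It uses Python's own `float.hex` and `float.fromhex`, whose notation differs from the syntax proposed here (Python prints lowercase hexadecimal digits), so treat it as an analogy rather than Carbon code.

```python
from decimal import Decimal

x = 0.1  # the binary64 value nearest to one tenth

# The hexadecimal form spells out the significand bits and the binary exponent.
hex_form = x.hex()
round_tripped = float.fromhex(hex_form)

# The exact decimal expansion of the same binary value is a long digit string
# that differs from the short decimal spelling "0.1".
exact_decimal = Decimal(x)
differs_from_short_form = exact_decimal != Decimal("0.1")
```

The hexadecimal form identifies the stored value in 13 fractional digits, whereas the short decimal form relies on rounding to reach it.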
As described above, a real number literal that lies exactly between two
representable values for its target type is invalid. Such ties are extremely
unlikely to occur by accident: for example, when interpreting a literal as
Float64, 1. would need to be followed by exactly 53 decimal digits (followed
by zero or more 0s) to land exactly half-way between two representable values,
and the probability of 1. followed by a random 53-digit sequence resulting in
such a tie is one in $5^{53}$, or about
0.000000000000000000000000000000000009%. For Float32, it's about
0.000000000000001%, and even for a typical Float16 implementation with 10
fractional bits, it's around 0.00001%.
Ties are much easier to express as hexadecimal floating-point literals: for
example, 0x1.0000_0000_0000_08p+0 is exactly half way between 1.0 and the
smallest Float64 value greater than 1.0, which is 0x1.0000_0000_0000_1p+0.
Whether written in decimal or hexadecimal, a tie provides very strong evidence that the developer intended to express a precise floating-point value, and provided one bit too much precision (or one bit too little, depending on whether they expected some rounding to occur), so rejecting the literal seems like a better option than accepting it and making an arbitrary choice between the two possible values.
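
A tie check of this kind can be sketched in Python using exact rational arithmetic. The function name `is_float64_tie` is hypothetical, Python's `float` stands in for Float64, and the sketch assumes a finite value well inside the representable range (it does not handle overflow).

```python
import math
from fractions import Fraction


def is_float64_tie(value: Fraction) -> bool:
    """True if `value` is exactly half-way between two adjacent binary64 values."""
    nearest = float(value)  # CPython rounds this integer division correctly
    if Fraction(nearest) == value:
        return False  # exactly representable, so there is nothing to round
    # Step from the rounded result toward the exact value to find the
    # representable neighbor on the other side (math.nextafter: Python 3.9+).
    toward = math.inf if Fraction(nearest) < value else -math.inf
    other = math.nextafter(nearest, toward)
    return abs(value - Fraction(nearest)) == abs(Fraction(other) - value)
```

For example, the value of 0x1.0000_0000_0000_08p+0 is $1 + 2^{-53}$, which this check reports as a tie, while one tenth is merely inexact and is not a tie.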
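If digit separators (_) are included in literals, they must meet the
respective condition:

- For decimal integers, digit separators occur every three digits, counting from the right. For example, 2_147_483_648.
- For hexadecimal integers, digit separators occur every four digits, counting from the right. For example, 0x7FFF_FFFF.
- For real number literals, digit separators can appear in the decimal and hexadecimal integer portions (before the period and after the optional e or p) as described in the previous bullets. For example, 2_147.483648e12_345 or 0x1_00CA.FEF00Dp+24.
- For binary literals, digit separators can appear between any two digits. For example, 0b1_000_101_11.

2020-09-15: core team meeting selected Alternative 0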
As an alternative to the rule proposed above, we could consider different restrictions on where digit separators can appear:
Alternative 0: as presented above.
Alternative 1: allow any digit groupings (for example, 123_4567_89).
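Pro:

- Allows groupings that are conventional in a specific domain, for example var Date: d = 01_12_1983;,
  or var Int64: time_in_microseconds = 123456_000000;.
- Accommodates grouping conventions other than groups of three, such as the Indian numbering system (1_23_45_678).

Con: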
Alternative 2: as above, but additionally require binary digits to be grouped in 4s.
Pro:
Con:
When used to express literals involving bit-fields, arbitrary grouping may be desirable. For example:
var Float32: flt_max =
BitCast(Float32, 0b0_11111110_11111111111111111111111);
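
The bit-field layout behind this example can be checked with a short Python sketch. It assumes the IEEE 754 binary32 layout (1 sign bit, 8 exponent bits, 23 fraction bits) and uses `struct` as a stand-in for BitCast; the helper name `float32_from_bits` is hypothetical.

```python
import struct


def float32_from_bits(bits: int) -> float:
    """Reinterpret a 32-bit pattern as an IEEE 754 binary32 value."""
    return struct.unpack(">f", struct.pack(">I", bits))[0]


# Sign (1 bit), exponent (8 bits), fraction (23 bits), assembled field by field.
sign, exponent, fraction = 0b0, 0b11111110, (1 << 23) - 1
bits = (sign << 31) | (exponent << 23) | fraction
flt_max = float32_from_bits(bits)
```

Grouping the digits as 1, 8, and 23 bits mirrors these three fields, which a fixed grouping in fours would cut across.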
Alternative 3: allow any regular grouping.
Pro:
Con:
There are a number of different design choices we could make, as divergences from the above proposal. Those choices, along with the arguments that led to choosing the proposed design rather than each alternative, are presented below.
No support is proposed for octal literals. In practice, their appearance in C
and C++ code in a sample corpus consisted of (in decreasing order of commonality
and excluding 0 literals):
- file permissions,
- cases where decimal was evidently intended (for example, CivilDay(2020, 04, 01)), and
- other uses.

The number of intentional uses of octal literals, other than in file permissions, was negligible. We considered the following alternatives:
Baseline: This proposal suggests that we do not support octal literals. Octal literals are rare and mostly obsolescent. File permissions can be supported in some other way.
Alternative 1: Follow C and C++, and use 0 as the base prefix for octal.
Pro:
Con:
Alternative 2: Use 0o as the base prefix for octal.
Pro:
Con:
If we decide we want to introduce octal literals at a later date, use of alternative 2 is suggested.
We could permit leading 0s in decimal integers (and in floating-point
numbers).
Pro:

- Would allow leading 0s to be used to align columns of numbers.

Con:
We could add an (optional) base specifier 0d for decimal integers.
Pro:

- Padding with leading 0s could be achieved by
  using 0d000123.

Con:
We could permit an e in decimal literals to express large powers of 10.
Pro:

- Many uses of 1e6 in our sample C++ corpus intend to form an integer
  literal instead of a floating-point literal.

Con:

- Conflicts with the C++ convention of e
  indicating a floating-point constant.

We suggest that this syntax is not added at this point. However, it should be reconsidered at a later date, once developers are used to the requirement that real literals always contain a period.
We could make base specifiers case-insensitive.
Pro:
Con:
- 0B1 is easily mistaken for 081
- 0B1 can be confused with 0xB1
- 0O17 is easily mistaken for 0017

We could make the digit sequence in hexadecimal integers case-insensitive.
Pro:

- Some tools, such as md5, will print lowercase.

Con:

- Lowercase hexadecimal digits are less visually distinct from the x base
  specifier (for example, the digit sequence is more visually distinct in
  0xAC than in 0xac).

We could require the digit sequence in hexadecimal integers to be written
using lowercase letters a..f.
Pro:

- Some tools, such as md5, will print lowercase.
- B and D are more likely to be confused with 8 and 0 than b and d
  are.

Con:

- Lowercase hexadecimal digits are less visually distinct from the x base
  specifier (for example, the digit sequence is more visually distinct in
  0xAC than in 0xac).

We could allow real numbers with no digits on one side of the period (3. or
.5).
Pro:
Con:
- Interferes with a tup.0 syntax that may be useful for indexing tuples.
- Interferes with a 0.ToString() syntax that may be useful for performing
  member access on literals.

See also the section on floating-point literals in the Google style guide, which argues for the same rule.
We could allow a real number with no e or p to omit a period (1e100).
Pro:
Con:
We could allow the e or p to be written in uppercase.
Pro:
- Some developers may prefer to write E, to avoid confusion with the constant e.

Con:

- E may be confused with a hexadecimal digit.

We could require a p in a hexadecimal real number literal.
Pro:
Con:
We could arbitrarily pick one of the two values when a real number is exactly half-way between two representable values.
Pro:
Con:
2020-09-15: core team meeting chose to forward digit separator to painter
2020-10-05: painter selected Alternative 2: _ as digit separator
There are various different characters we could attempt to use as a digit separator. The options we considered are:
Alternative 0: ' as a digit separator.
Pro:
Con:
- ' is also likely to be used to introduce character literals.

Alternative 1: , as a digit separator.
Pro:
Con:
- It would be confusing if f(1, 234) called f with two arguments
  but f(1,234) called f with a single argument.

Alternative 2: _ as a digit separator.
Pro:
Con:
Alternative 3: whitespace as a digit separator.
Pro:
Con:
- A call such as f(1, 23, 4 567) may be interpreted as three
  separate numerical arguments instead of four arguments with a missing comma.

Alternative 4: . as digit separator, , as decimal point.
Pro:
Con:
- Like , as a digit separator, , as a decimal point is problematic.
- Developers for whom , is the decimal point in regular writing are likely
  already accustomed to using . as the decimal point in programming
  environments, and the converse is not true.

Alternative 5: No digit separator syntax.
Pro:
Con:
The proposal provides a syntax that is sufficiently close to that used both by C++ and many other languages to be very familiar. However, it selects a reasonably minimal subset of the syntaxes. This minimal approach provides benefits directly in line with both the simplicity and readability goals of Carbon:
That said, it still provides sufficient variations to address important use cases for the goal of not leaving room for a lower level language:
The primary aesthetic benefit of ' to the painter is consistency with C++.
However, its rare usage in C++ at this point reduces this advantage to a very
small one, while there is broad convergence amongst other languages around _.
The choice here has no risk of significant meaning or building up patterns of
reading for users that might be disrupted by the change, and so it seems
reasonable to simply converge with other languages to end up in the less
surprising and more conventional syntax space.
Placement restrictions of digit separators:
Use _ or ' as the digit separator character:
_.