Carbon needs a set of tokens to represent operators.
Some languages have a fixed set of operator tokens. For example:
In C++, and, or, etc. are lexical synonyms for the
corresponding symbolic operators &&, ||, etc.
Other languages have extensible rules for defining operators, including the facility for a developer to define operators that aren't part of the base language. For example:
Operator tokens can be formed by various rules, for example:
At each lexing step, form the longest known operator token. Under this rule,
a += b is three tokens and
a =+ b is four tokens, because there are +, =, and += operators, but
there is no =+ operator. This approach is sometimes known as "max munch".
At each lexing step, treat the longest sequence of symbolic characters as a single operator token. Under this rule, a =+ b
would be invalid instead of meaning a = (+b).
Carbon has a fixed set of tokens that represent operators, defined by the language specification. Developers cannot define new tokens to represent new operators; there may be facilities to overload operators, but that is outside the scope of this proposal. There are two kinds of tokens that represent operators:
Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the longest symbolic token defined by the language specification that appears starting at the current input position is lexed, if any.
Not all uses of symbolic tokens within the Carbon grammar will be as operators.
For example, we will have ( and ) tokens that serve to delimit various
grammar productions, and we may not want to consider . to be an operator,
because its right "operand" is not an expression.
When a symbolic token is used as an operator, we use the presence or absence of
whitespace around the symbolic token to determine its fixity, in the same way we
expect a human reader to recognize it. For example, we want a* - 4 to treat
the * as a unary operator and the - as a binary operator, while a * -4
results in the reverse. This largely requires whitespace on only one side of a
unary operator and on both sides of a binary operator. However, we'd also like
to support binary operators where a lack of whitespace reflects precedence, such
as 2*x*x + 3*x + 1, where doing so is straightforward. The rules we use to
achieve this are:
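There can be no whitespace between a unary operator and its operand.
The whitespace around a binary operator must be consistent: either there is whitespace on both sides or on neither side.
If there is whitespace on neither side of a binary operator, the token before the operator
must be an identifier, a literal, or any kind of closing bracket (for example,
), ], or }), and the token after the operator
must be an identifier, a literal, or any kind of opening bracket (for
example, (, [, or {).
This proposal includes an initial set of symbolic tokens covering only the grammar productions that have been approved so far. This list should be extended by proposals that use additional symbolic tokens.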
Two kinds of operator tokens are proposed. These two kinds are intended for different uses, not as alternate spellings of the same functionality:
+, *, <, and so on.
and, or, throw,
yield, and operators closely connected to these, such as not. It is
important that these stand out from other operators as they have action
that goes beyond evaluating their operands and computing a value.
as.
The example operators in this section are included only to motivate the two kinds of operator token; those specific operators are not proposed as part of this proposal.
The following is the initial list of symbolic tokens recognized in a Carbon source file:
(
)
{
}
[
]
,
.
;
:
*
&
=
->
=>
This list is expected to grow over time as more symbolic tokens are required by language proposals.
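As an illustration of the "max munch" rule applied to this list, the following Python sketch picks the longest listed token at each input position. The names SYMBOLIC_TOKENS, longest_symbolic_token, and lex_symbols are hypothetical and are not part of any Carbon tooling; identifiers and literals are deliberately not handled so that the example stays short.

```python
# Initial symbolic token list, copied from this proposal.
SYMBOLIC_TOKENS = ["(", ")", "{", "}", "[", "]", ",", ".", ";", ":",
                   "*", "&", "=", "->", "=>"]


def longest_symbolic_token(text, pos, tokens=SYMBOLIC_TOKENS):
    """Return the longest token from `tokens` starting at text[pos], or None."""
    best = None
    for tok in tokens:
        if text.startswith(tok, pos) and (best is None or len(tok) > len(best)):
            best = tok
    return best


def lex_symbols(text, tokens=SYMBOLIC_TOKENS):
    """Lex a string that contains only symbolic characters and whitespace."""
    out = []
    pos = 0
    while pos < len(text):
        if text[pos].isspace():
            pos += 1
            continue
        tok = longest_symbolic_token(text, pos, tokens)
        if tok is None:
            # No listed token starts here, so nothing can be lexed.
            raise ValueError(f"no symbolic token at position {pos}: {text[pos]!r}")
        out.append(tok)
        pos += len(tok)
    return out


print(lex_symbols("==>"))                   # ['=', '=>']
print(lex_symbols("->*"))                   # ['->', '*']
print(lex_symbols("=+", ["+", "=", "+="]))  # ['=', '+']
```

The last call uses the +, =, += token set from the earlier example rather than Carbon's list: =+ splits into two tokens because no =+ token is defined.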
We wish to support the use of the same symbolic token as a prefix operator, an
infix operator, and a postfix operator, in some cases. In particular, we have
decided in #523
that the * operator should support all three uses; this operator will be
introduced in a future proposal. In order to support such usage, we want a rule
that allows us to simply and unambiguously parse operators that might have all
three fixities.
For example, given the expression a * - b, there are two possible parses:
a * (- b), multiplying a by the negation of b.
(a *) - b, subtracting b from the pointer type a *.
Our chosen rule to distinguish such cases is to consider the presence or absence
of whitespace, as we think this strikes a good balance between simplicity and
expressiveness for the programmer and simplicity and good support for error
recovery in the implementation. a * -b uses the first interpretation, a* - b
uses the second interpretation, and other combinations (a*-b, a *- b,
a* -b, a * - b, a*- b, a *-b) are rejected as errors.
In general, we require whitespace to be present or absent around the operator to
indicate its fixity, as this is a cue that a human reader would use to
understand the code: binary operators have whitespace on both sides, and unary
operators lack whitespace between the operator and its operand. We also make
allowance for omitting the whitespace around a binary operator in cases where it
aids readability to do so, such as in expressions like 2*x*x + 3*x + 1: for an
operator with whitespace on neither side, if the token immediately before the
operator indicates it is the end of an operand, and the token immediately after
the operator indicates it is the beginning of an operand, the operator is
treated as binary.
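The rule can be checked against the eight spellings of a * - b discussed above. The Python sketch below is one possible reading of the rule, not the toolchain's implementation: the helper names are hypothetical, the identifier and literal patterns are rough approximations, and it assumes that whitespace on only one side of an operator places the operand on the other side.

```python
import re

CLOSING_BRACKETS = {")", "]", "}"}
OPENING_BRACKETS = {"(", "[", "{"}


def is_name_or_literal(tok):
    """Rough approximation: identifier, integer literal, or string literal."""
    return bool(re.fullmatch(r'[A-Za-z_]\w*|\d+|"[^"]*"', tok))


def ends_operand(tok):
    return is_name_or_literal(tok) or tok in CLOSING_BRACKETS


def starts_operand(tok):
    return is_name_or_literal(tok) or tok in OPENING_BRACKETS


def allowed_fixities(ws_before, ws_after, prev_tok, next_tok):
    """Fixities an operator may take, given whitespace and neighboring tokens."""
    if ws_before and ws_after:
        return {"infix"}
    if ws_before:   # operand must be on the right
        return {"prefix"}
    if ws_after:    # operand must be on the left
        return {"postfix"}
    if ends_operand(prev_tok) and starts_operand(next_tok):
        return {"infix"}
    return {"prefix", "postfix"}


def interpret(ws1, ws2, ws3):
    """Interpret `a * - b`; ws1..ws3 say whether each of the three gaps has whitespace."""
    star = allowed_fixities(ws1, ws2, "a", "-")
    minus = allowed_fixities(ws2, ws3, "*", "b")
    if "infix" in star and "prefix" in minus:
        return "a * (-b)"
    if "postfix" in star and "infix" in minus:
        return "(a*) - b"
    return "error"


print(interpret(True, True, False))   # a * -b  -> a * (-b)
print(interpret(False, True, True))   # a* - b  -> (a*) - b
print(interpret(True, True, True))    # a * - b -> error
```

Under these assumptions, the remaining five whitespace combinations also produce "error", matching the list of rejected spellings.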
We define the set of tokens that constitutes the beginning or end of an operand as:
Identifiers, as in x*x + y*y.
Literals, as in 3*x + 4*y or "foo"+s.
Closing brackets before the operator and opening brackets after it, as in f()*(n + 3) or
args[3]*{.real=4, .imag=1}.
For error recovery purposes, this rule functions best if no expression context can be preceded by a token that looks like the end of an operand and no expression context can be followed by a token that looks like the start of an operand. One known exception to this is in function definitions:
fn F(p: Int *) -> Int * { return p; }
Both occurrences of Int * here are erroneous. The first is easy to detect and
diagnose, but the second is more challenging, if {...} is a valid expression
form. We expect to be able to easily distinguish between code blocks starting
with { and expressions starting with { for all cases other than {}.
However, the code block {} is not a reasonable body for a function with a
return type, so we expect errors involving a combination of misplaced whitespace
and {} to be rare, and we should be able to recover well from the remaining
cases.
From the perspective of token formation, the whitespace rule means that there are four variants of each symbolic token:
(, is also a binary variant of the token.
When used in non-operator contexts, any variant of a symbolic token is acceptable. When used in operator contexts, only a binary variant of a token can be used as a binary operator, only a prefix or unary variant of a token can be used as a prefix operator, and only a postfix or unary variant of a token can be used as a postfix operator.
This whitespace rule has been
implemented in the Carbon toolchain
for all operators by tracking the presence or absence of trailing whitespace as
part of a token, and
in executable semantics
for the * operator by forming four different token variants as described
above.
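A minimal Python sketch of the first approach, recording whitespace on each token and deriving the variant afterwards, might look like the following. It is not the toolchain's code. The Token class, the regular expression, and the function names are hypothetical; + and - are included only so that this document's examples can be lexed; start and end of input are assumed to count as whitespace; and any opening bracket after an unspaced operator is assumed to start an operand.

```python
import re
from dataclasses import dataclass

# Identifiers, integer literals, string literals, then symbolic tokens.
# Two-character tokens come first so that the longest match wins.
TOKEN_RE = re.compile(r'[A-Za-z_]\w*|\d+|"[^"]*"|->|=>|[(){}\[\],.;:*&=+\-]')


@dataclass
class Token:
    text: str
    ws_before: bool  # whitespace or start of input precedes the token
    ws_after: bool   # whitespace or end of input follows the token


def lex(source):
    tokens, pos = [], 0
    while pos < len(source):
        if source[pos].isspace():
            pos += 1
            continue
        m = TOKEN_RE.match(source, pos)
        if m is None:
            raise ValueError(f"unexpected character {source[pos]!r}")
        end = m.end()
        tokens.append(Token(
            m.group(),
            ws_before=(pos == 0 or source[pos - 1].isspace()),
            ws_after=(end == len(source) or source[end].isspace()),
        ))
        pos = end
    return tokens


def is_name_or_literal(tok):
    return tok.text[0].isalnum() or tok.text[0] in '_"'


def variant(tokens, i):
    """Classify tokens[i] as a binary, prefix, postfix, or unary variant."""
    tok = tokens[i]
    if tok.ws_before and tok.ws_after:
        return "binary"
    if tok.ws_before:
        return "prefix"
    if tok.ws_after:
        return "postfix"
    prev_ok = i > 0 and (is_name_or_literal(tokens[i - 1])
                         or tokens[i - 1].text in {")", "]", "}"})
    next_ok = i + 1 < len(tokens) and (is_name_or_literal(tokens[i + 1])
                                       or tokens[i + 1].text in {"(", "[", "{"})
    return "binary" if prev_ok and next_ok else "unary"


def operator_variants(source, operators):
    """Return (text, variant) for each token whose text is in `operators`."""
    tokens = lex(source)
    return [(t.text, variant(tokens, i))
            for i, t in enumerate(tokens) if t.text in operators]


print(operator_variants("x = -*p;", {"-", "*"}))
# [('-', 'prefix'), ('*', 'unary')]
```

In 2*x*x + 3*x + 1, every * and + is classified as a binary variant by this sketch, whether or not it is surrounded by whitespace.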
The choice to disallow whitespace between a unary operator and its operand is experimental.
Software and language evolution
Code that is easy to read, understand, and write
2*x*x + 3*x + 1 to use the
absence of whitespace to improve readability. Because the language
officially sanctions both choices, the formatting tool can be expected
to preserve the user's choice.
x = -*p;.
Interoperability with and migration from existing C++ code
* operator to be used for all of
multiplication, dereference, and pointer type formation, as in C++,
while still permitting Carbon to treat type expressions as expressions.
We could lex the longest sequence of symbolic characters rather than lexing only the longest known operator.
Advantages:
Disadvantages:
Int** would lex as Int followed by a single
** token, and **p would lex as a single ** token followed by p, if
there is no ** operator. While we could define **, ***, and so on as
operators, doing so would add complexity and inconsistency to the language
rules.
We could support an extensible operator set, giving the developer the option to add new operators.
Advantages:
Disadvantages:
We could apply different whitespace restrictions or no whitespace restrictions. See #520 for discussion of the alternatives and the leads' decision.
We could require whitespace around a binary operator followed by [ or {. In
particular, for examples such as:
fn F() -> Int*{ return Null; }
var n: Int = pointer_to_array^[i];
... this would allow us to form a unary operator instead of a binary operator, which is likely to be more in line with the developer's expectations.
Advantages:
^ dereference operator, or similarly any other
postfix operator producing an array, without creating surprises for pointers
to arrays.
{ of a function body to be consistently
omitted if desired.
Disadvantages:
arr[i]*3.
[ or {, for
example Time.Now()+{.seconds = 3} or names+["Lrrr"].