4 年之前 · db2efeafeb
--- a/proposals/README.md
+++ b/proposals/README.md
@@ -59,5 +59,6 @@ request:
 
				 -   [0538 - `return` with no argument](p0538.md)
			
 
				 -   [0540 - Remove `Void`](p0540.md)
			
 
				 -   [0555 - Operator precedence](p0555.md)
			
 
				+-   [0601 - Operator tokens](p0601.md)
			
 
				 
			
 
				 <!-- endproposals -->
			
--- a/proposals/p0601.md
+++ b/proposals/p0601.md
@@ -0,0 +1,351 @@
 
				+# Operator tokens
			
 
				+
			
 
				+<!--
			
 
				+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
			
 
				+Exceptions. See /LICENSE for license information.
			
 
				+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
			
 
				+-->
			
 
				+
			
 
				+[Pull request](https://github.com/carbon-language/carbon-lang/pull/601)
			
 
				+
			
 
				+<!-- toc -->
			
 
				+
			
 
				+## Table of contents
			
 
				+
			
 
				+-   [Problem](#problem)
			
 
				+-   [Background](#background)
			
 
				+-   [Proposal](#proposal)
			
 
				+-   [Details](#details)
			
 
				+    -   [Two kinds of operator tokens](#two-kinds-of-operator-tokens)
			
 
				+    -   [Symbolic token list](#symbolic-token-list)
			
 
				+    -   [Whitespace](#whitespace)
			
 
				+-   [Rationale based on Carbon's goals](#rationale-based-on-carbons-goals)
			
 
				+-   [Alternatives considered](#alternatives-considered)
			
 
				+
			
 
				+<!-- tocstop -->
			
 
				+
			
 
				+## Problem
			
 
				+
			
 
				+Carbon needs a set of tokens to represent operators.
			
 
				+
			
 
				+## Background
			
 
				+
			
 
				+Some languages have a fixed set of operator tokens. For example:
			
 
				+
			
 
				+-   [C++ operators](https://eel.is/c++draft/lex.operators)
			
 
				+    -   The keyword operators `and`, `or`, etc. are lexical synonyms for
			
 
				+        corresponding symbolic operators `&&`, `||`, etc.
			
 
				+-   [Rust operators](https://doc.rust-lang.org/book/appendix-02-operators.html)
			
 
				+
			
 
				+Other languages have extensible rules for defining operators, including the
			
 
				+facility for a developer to define operators that aren't part of the base
			
 
				+language. For example:
			
 
				+
			
 
				+-   [Swift operator rules](https://docs.swift.org/swift-book/ReferenceManual/LexicalStructure.html#ID418)
			
 
				+-   [Haskell operator rules](https://www.haskell.org/onlinereport/haskell2010/haskellch2.html#dx7-18008)
			
 
				+
			
 
				+Operators tokens can be formed by various rules, for example:
			
 
				+
			
 
				+-   At each lexing step, form the longest known operator token possible from the
			
 
				+    remaining character sequence. For example, in C++, `a += b` is 3 tokens and
			
 
				+    `a =+ b` is four tokens, because there are `+`, `=`, and `+=` operators, but
			
 
				+    there is no `=+` operator. This approach is sometimes known as "max munch".
			
 
				+-   At each lexing step, treat the longest sequence of operator-like characters
			
 
				+    possible as an operator. The program is invalid if there is no such
			
 
				+    operator. For example, in a C++-like language using this approach, `a =+ b`
			
 
				+    would be invalid instead of meaning `a = (+b)`.
			
 
				+-   Use semantic information to determine how to split a sequence of operator
			
 
				+    characters into one or more operators, for example based on the types of the
			
 
				+    operands.
			
 
				+
			
 
				+## Proposal
			
 
				+
			
 
				+Carbon has a fixed set of tokens that represent operators, defined by the
			
 
				+language specification. Developers cannot define new tokens to represent new
			
 
				+operators; there may be facilities to overload operators, but that is outside
			
 
				+the scope of this proposal. There are two kinds of tokens that represent
			
 
				+operators:
			
 
				+
			
 
				+-   _Symbolic tokens_ consist of one or more symbol characters. In particular,
			
 
				+    such a token contains no characters that are valid in identifiers, no quote
			
 
				+    characters, and no whitespace.
			
 
				+-   _Keywords_ follow the lexical rules for words.
			
 
				+
			
 
				+Symbolic tokens are lexed using a "max munch" rule: at each lexing step, the
			
 
				+longest symbolic token defined by the language specification that appears
			
 
				+starting at the current input position is lexed, if any.
			
 
				+
			
 
				+Not all uses of symbolic tokens within the Carbon grammar will be as operators.
			
 
				+For example, we will have `(` and `)` tokens that serve to delimit various
			
 
				+grammar productions, and we may not want to consider `.` to be an operator,
			
 
				+because its right "operand" is not an expression.
			
 
				+
			
 
				+When a symbolic token is used as an operator, we use the presence or absence of
			
 
				+whitespace around the symbolic token to determine its fixity, in the same way we
			
 
				+expect a human reader to recognize them. For example, we want `a* - 4` to treat
			
 
				+the `*` as a unary operator and the `-` as a binary operator, while `a * -4`
			
 
				+results in the reverse. This largely requires whitespace on only one side of a
			
 
				+unary operator and on both sides of a binary operator. However, we'd also like
			
 
				+to support binary operators where a lack of whitespace reflects precedence such
			
 
				+as`2*x*x + 3*x + 1` where doing so is straightforward. The rules we use to
			
 
				+achieve this are:
			
 
				+
			
 
				+-   There can be no whitespace between a unary operator and its operand.
			
 
				+-   The whitespace around a binary operator must be consistent: either there is
			
 
				+    whitespace on both sides or on neither side.
			
 
				+-   If there is whitespace on neither side of a binary operator, the token
			
 
				+    before the operator must be an identifier, a literal, or any kind of closing
			
 
				+    bracket (for example, `)`, `]`, or `}`), and the token after the operator
			
 
				+    must be an identifier, a literal, or any kind of opening bracket (for
			
 
				+    example, `(`, `[`, or `{`).
			
 
				+
			
 
				+This proposal includes an initial set of symbolic tokens covering only the
			
 
				+grammar productions that have been approved so far. This list should be extended
			
 
				+by proposals that use additional symbolic tokens.
			
 
				+
			
 
				+## Details
			
 
				+
			
 
				+### Two kinds of operator tokens
			
 
				+
			
 
				+Two kinds of operator tokens are proposed. These two kinds are intended for
			
 
				+different uses, not as alternate spellings of the same functionality:
			
 
				+
			
 
				+-   Symbolic tokens are intended to be used for widely-recognized operators,
			
 
				+    such as the mathematical operators `+`, `*`, `<`, and so on.
			
 
				+    -   Symbolic tokens used as operators would generally be expected to also be
			
 
				+        meaningful for some user-defined types, and should be candidates for
			
 
				+        being made overloadable once we support operator overloading.
			
 
				+-   Keywords are intended to be used for cases such as the following:
			
 
				+    -   Operators that perform flow control, such as `and`, `or`, `throw`,
			
 
				+        `yield`, and operators closely connected to these, such as `not`. It is
			
 
				+        important that these stand out from other operators as they have action
			
 
				+        that goes beyond evaluating their operands and computing a value.
			
 
				+    -   Operators that are rare and that we do not want to spend our finite
			
 
				+        symbolic token budget on, such as perhaps xor or bit rotate.
			
 
				+    -   Operators with very low precedence, and perhaps certain operators with
			
 
				+        very high precedence.
			
 
				+    -   Special-purpose operators for which there is no conventional established
			
 
				+        symbol and for which we do not want to invent one, such as `as`.
			
 
				+
			
 
				+The example operators in this section are included only to motivate the two
			
 
				+kinds of operator token; those specific operators are not proposed as part of
			
 
				+this proposal.
			
 
				+
			
 
				+### Symbolic token list
			
 
				+
			
 
				+The following is the initial list of symbolic tokens recognized in a Carbon
			
 
				+source file:
			
 
				+
			
 
				+|     |      |      |     |     |     |
			
 
				+| --- | ---- | ---- | --- | --- | --- |
			
 
				+| `(` | `)`  | `{`  | `}` | `[` | `]` |
			
 
				+| `,` | `.`  | `;`  | `:` | `*` | `&` |
			
 
				+| `=` | `->` | `=>` |     |     |     |
			
 
				+
			
 
				+This list is expected to grow over time as more symbolic tokens are required by
			
 
				+language proposals.
			
 
				+
			
 
				+### Whitespace
			
 
				+
			
 
				+We wish to support the use of the same symbolic token as a prefix operator, an
			
 
				+infix operator, and a postfix operator, in some cases. In particular, we have
			
 
				+[decided in #523](https://github.com/carbon-language/carbon-lang/issues/523)
			
 
				+that the `*` operator should support all three uses; this operator will be
			
 
				+introduced in a future proposal. In order to support such usage, we want a rule
			
 
				+that allows us to simply and unambiguously parse operators that might have all
			
 
				+three fixities.
			
 
				+
			
 
				+For example, given the expression `a * - b`, there are two possible parses:
			
 
				+
			
 
				+-   As `a * (- b)`, multiplying `a` by the negation of `b`.
			
 
				+-   As `(a *) - b`, subtracting `b` from the pointer type `a *`.
			
 
				+
			
 
				+Our chosen rule to distinguish such cases is to consider the presence or absence
			
 
				+of whitespace, as we think this strikes a good balance between simplicity and
			
 
				+expressiveness for the programmer and simplicity and good support for error
			
 
				+recovery in the implementation. `a * -b` uses the first interpretation, `a* - b`
			
 
				+uses the second interpretation, and other combinations (`a*-b`, `a *- b`,
			
 
				+`a* -b`, `a * - b`, `a*- b`, `a *-b`) are rejected as errors.
			
 
				+
			
 
				+In general, we require whitespace to be present or absent around the operator to
			
 
				+indicate its fixity, as this is a cue that a human reader would use to
			
 
				+understand the code: binary operators have whitespace on both sides, and unary
			
 
				+operators lack whitespace between the operator and its operand. We also make
			
 
				+allowance for omitting the whitespace around a binary operator in cases where it
			
 
				+aids readability to do so, such as in expressions like `2*x*x + 3*x + 1`: for an
			
 
				+operator with whitespace on neither side, if the token immediately before the
			
 
				+operator indicates it is the end of an operand, and the token immediately after
			
 
				+the operator indicates it is the beginning of an operand, the operator is
			
 
				+treated as binary.
			
 
				+
			
 
				+We define the set of tokens that constitutes the beginning or end of an operand
			
 
				+as:
			
 
				+
			
 
				+-   Identifiers, as in `x*x + y*y`.
			
 
				+-   Literals, as in `3*x + 4*y` or `"foo"+s`.
			
 
				+-   Brackets of any kind, facing away from the operator, as in `f()*(n + 3)` or
			
 
				+    `args[3]*{.real=4, .imag=1}`.
			
 
				+
			
 
				+For error recovery purposes, this rule functions best if no expression context
			
 
				+can be preceded by a token that looks like the end of an operand and no
			
 
				+expression context can be followed by a token that looks like the start of an
			
 
				+operand. One known exception to this is in function definitions:
			
 
				+
			
 
				+```
			
 
				+fn F(p: Int *) -> Int * { return p; }
			
 
				+```
			
 
				+
			
 
				+Both occurrences of `Int *` here are erroneous. The first is easy to detect and
			
 
				+diagnose, but the second is more challenging, if `{...}` is a valid expression
			
 
				+form. We expect to be able to easily distinguish between code blocks starting
			
 
				+with `{` and expressions starting with `{` for all cases other than `{}`.
			
 
				+However, the code block `{}` is not a reasonable body for a function with a
			
 
				+return type, so we expect errors involving a combination of misplaced whitespace
			
 
				+and `{}` to be rare, and we should be able to recover well from the remaining
			
 
				+cases.
			
 
				+
			
 
				+From the perspective of token formation, the whitespace rule means that there
			
 
				+are four _variants_ of each symbolic token:
			
 
				+
			
 
				+-   A symbolic token with whitespace on both sides is a _binary_ variant of the
			
 
				+    token.
			
 
				+-   A symbolic token with whitespace on neither side, where the preceding token
			
 
				+    is an identifier, literal, or closing bracket, and the following token is an
			
 
				+    identifier, literal, or `(`, is also a _binary_ variant of the token.
			
 
				+-   A symbolic token with whitespace on neither side that does not satisfy the
			
 
				+    preceding rule is a _unary_ variant of the token.
			
 
				+-   A symbolic token with whitespace on the left side only is a _prefix_ variant
			
 
				+    of the token.
			
 
				+-   A symbolic token with whitespace on the right side only is a _postfix_
			
 
				+    variant of the token.
			
 
				+
			
 
				+When used in non-operator contexts, any variant of a symbolic token is
			
 
				+acceptable. When used in operator contexts, only a binary variant of a token can
			
 
				+be used as a binary operator, only a prefix or unary variant of a token can be
			
 
				+used as a prefix operator, and only a postfix or unary variant of a token can be
			
 
				+used as a postfix operator.
			
 
				+
			
 
				+This whitespace rule has been
			
 
				+[implemented in the Carbon toolchain](https://github.com/carbon-language/carbon-lang/pull/576)
			
 
				+for all operators by tracking the presence or absence of trailing whitespace as
			
 
				+part of a token, and
			
 
				+[in executable semantics](https://github.com/carbon-language/carbon-lang/commit/04d3a885ae01a779aadb19f51ec7a5a12ffe295c)
			
 
				+for the `*` operator by forming four different token variants as described
			
 
				+above.
			
 
				+
			
 
				+The choice to disallow whitespace between a unary operator and its operand is
			
 
				+_experimental_.
			
 
				+
			
 
				+## Rationale based on Carbon's goals
			
 
				+
			
 
				+-   Software and language evolution
			
 
				+
			
 
				+    -   By not allowing user-defined operators, we reduce the possibility that
			
 
				+        operators added to the language later will conflict with existing uses
			
 
				+        in programs. Due to the use of a max munch rule, we might add an
			
 
				+        operator that causes existing code to be interpreted differently, but
			
 
				+        such problems will be easy to detect and resolve, because we know the
			
 
				+        operator set in advance.
			
 
				+
			
 
				+-   Code that is easy to read, understand, and write
			
 
				+
			
 
				+    -   The fixed operator set means that developers don't need to understand an
			
 
				+        unbounded and extensible number of operators and precedence rules. The
			
 
				+        fixed operator set encourages functionality that does not correspond to
			
 
				+        a well-known operator symbol to be exposed by way of a named operation
			
 
				+        instead of a symbol, improving readability among developers not familiar
			
 
				+        with a codebase.
			
 
				+    -   Requiring whitespace to be used consistently around operators reduces
			
 
				+        the possibility for confusing formatting.
			
 
				+    -   Permitting whitespace on either both sides of a binary operator or on
			
 
				+        neither side allows expressions such as `2*x*x + 3*x + 1` to use the
			
 
				+        absence of whitespace to improve readability. Because the language
			
 
				+        officially sanctions both choices, the formatting tool can be expected
			
 
				+        to preserve the user's choice.
			
 
				+    -   The choice to lex the longest known symbolic token rather than the
			
 
				+        longest sequence of symbolic characters makes it easier to write
			
 
				+        expressions involving a series of prefix or postfix operators, such as
			
 
				+        `x = -*p;`.
			
 
				+
			
 
				+-   Interoperability with and migration from existing C++ code
			
 
				+
			
 
				+    -   The fixed operator set makes a mapping between Carbon operators and C++
			
 
				+        operators easier, by avoiding any desire to map arbitrary user-defined
			
 
				+        Carbon operators into a C++ form.
			
 
				+    -   The choice of a fixed operator set and a "max munch" rule will be
			
 
				+        familiar to C++ developers, as it is the same approach taken by C++.
			
 
				+    -   The whitespace rule permits the `*` operator to be used for all of
			
 
				+        multiplication, dereference, and pointer type formation, as in C++,
			
 
				+        while still permitting Carbon to treat type expressions as expressions.
			
 
				+
			
 
				+## Alternatives considered
			
 
				+
			
 
				+We could lex the longest sequence of symbolic characters rather than lexing only
			
 
				+the longest known operator.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Adding new operators could be done without any change to the lexing rules.
			
 
				+-   If unknown operators are rejected, adding new operators would carry no risk
			
 
				+    of changing the meaning of existing valid code.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   Sequences of prefix or postfix operators would require parentheses or
			
 
				+    whitespace. For example, `Int**` would lex as `Int` followed by a single
			
 
				+    `**` token, and `**p` would lex as a single `**` token followed by `p`, if
			
 
				+    there is no `**` operator. While we could define `**`, `***`, and so on as
			
 
				+    operators, doing so would add complexity and inconsistency to the language
			
 
				+    rules.
			
 
				+
			
 
				+We could support an extensible operator set, giving the developer the option to
			
 
				+add new operators.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   This would increase expressivity, especially for embedded domain-specific
			
 
				+    languages.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   This would harm readability, at least for those unfamiliar with the code
			
 
				+    using the operators.
			
 
				+-   This could harm our ability to evolve the language, by admitting the
			
 
				+    possibility of a custom operator colliding with a newly-introduced standard
			
 
				+    operator, although this risk could be reduced by providing a separate
			
 
				+    lexical syntax for custom operators.
			
 
				+-   We would need to either lex the longest sequence of symbolic characters we
			
 
				+    can, which has the same disadvantage discussed for that approach above, or
			
 
				+    use a more sophisticated rule to determine how to split operators -- perhaps
			
 
				+    based on what operator overloads are in scope -- increasing complexity.
			
 
				+
			
 
				+We could apply different whitespace restrictions or no whitespace restrictions.
			
 
				+See [#520](https://github.com/carbon-language/carbon-lang/issues/520) for
			
 
				+discussion of the alternatives and the leads decision.
			
 
				+
			
 
				+We could require whitespace around a binary operator followed by `[` or `{`. In
			
 
				+particular, for examples such as:
			
 
				+
			
 
				+```
			
 
				+fn F() -> Int*{ return Null; }
			
 
				+var n: Int = pointer_to_array^[i];
			
 
				+```
			
 
				+
			
 
				+... this would allow us to form a unary operator instead of a binary operator,
			
 
				+which is likely to be more in line with the developer's expectations.
			
 
				+
			
 
				+Advantages:
			
 
				+
			
 
				+-   Room to add a postfix `^` dereference operator, or similarly any other
			
 
				+    postfix operator producing an array, without creating surprises for pointers
			
 
				+    to arrays.
			
 
				+-   Allows the whitespace before the `{` of a function body to be consistently
			
 
				+    omitted if desired.
			
 
				+
			
 
				+Disadvantages:
			
 
				+
			
 
				+-   The rule would be more complex, and would be asymmetric: we must allow
			
 
				+    closing square brackets before unspaced binary operators to permit things
			
 
				+    like `arr[i]*3`.
			
 
				+-   Would interact badly with expression forms that begin with a `[` or `{`, for
			
 
				+    example `Time.Now()+{.seconds = 3}` or `names+["Lrrr"]`.