# Parse ## Table of contents - [Overview](#overview) - [Parse stack](#parse-stack) - [Postorder tree](#postorder-tree) - [Bracketing inside the tree](#bracketing-inside-the-tree) - [Visual example](#visual-example) - [Handling invalid parses](#handling-invalid-parses) - [How is this accomplished?](#how-is-this-accomplished) - [Introducer](#introducer) - [Optional modifiers before an introducer](#optional-modifiers-before-an-introducer) - [Something required in context](#something-required-in-context) - [Optional clauses](#optional-clauses) - [Case 1: introducer to optional clause is used as parent node](#case-1-introducer-to-optional-clause-is-used-as-parent-node) - [Case 2: parent node is required token after optional clause, with different parent node kinds for different options](#case-2-parent-node-is-required-token-after-optional-clause-with-different-parent-node-kinds-for-different-options) - [Case 3: optional sibling](#case-3-optional-sibling) - [Operators](#operators) - [Alternatives considered](#alternatives-considered) - [Restrictive parsing](#restrictive-parsing) ## Overview Parsing uses tokens to produce a parse tree that faithfully represents the tree structure of the source program, interpreted according to the Carbon grammar. No semantics are associated with the tree structure at this level, and no name lookup is performed. The parse tree's structure corresponds to the grammar of the Carbon language. On valid input, there will be a 1:1 correspondence between parse tree nodes and tokens. A parse tree is considered _structurally valid_ if all nodes have the number of children that their node kind requires. On invalid input, nodes may be added that don't correspond to a token to maintain a structurally valid parse tree. When a parse tree node is marked as having an error, it will still be structurally valid, but its children may not match a valid grammar. Code trying to handle children of erroneous nodes must be prepared to handle atypical structures, but it may still be helpful for tools such as syntax highlighters or refactoring tools. In general, we favor doing the checking for whether something is allowed _in a particular context_ in [the check stage](check) instead of the parse stage, unless the context is very local. This is for a few reasons: - We anticipate that the parse stage will be used to operate on invalid code while still preserving as much of the intent of the author as possible, for example in an IDE or a code formatter. - To keep as much code out of the parse stage as possible, so it is simple and fast. - We are building all the infrastructure to keep track of context in the check stage. These reasons explain what local context is okay: where we already have the contextual information at hand so there is no performance cost, and we can output a parse tree that still captures faithfully what the user wrote. Examples: - All declaration modifiers are allowed in any order on any declaration in the parse stage. Diagnosing duplicated modifiers, modifiers that conflict with other modifiers, or modifiers that can't be used on a particular declaration is postponed until the check stage. - Rejecting a keyword after `fn` where a name is expected is done at the parse stage. ## Parse stack The core parser loop is `Parse::Tree::Parse`. In the loop, it pops the next state off the stack, and dispatches to the appropriate `Handle` function. A typical handler function pops the state first, leaving the stack ready for the next state. It may add nodes to the parse tree, based on the current code. If it needs to trigger other states, it will push them onto the stack; because it's a stack, the _next_ state is always pushed _last_. Operator expressions store information about current operator precedence in the stack as well. While this isn't necessary for most parser states, and could be stored separately, it's currently together because it has no impact on the size of a stack entry and is thus more efficient to store in one place. ## Postorder tree The parse tree's storage layout is in postorder. For example, given the code: ```carbon fn foo() -> f64 { return 42; } ``` The node order is (with indentation to indicate nesting): ```yaml [ {kind: 'FileStart', text: ''}, {kind: 'FunctionIntroducer', text: 'fn'}, {kind: 'Name', text: 'foo'}, {kind: 'ParamListStart', text: '('}, {kind: 'ParamList', text: ')', subtree_size: 2}, {kind: 'Literal', text: 'f64'}, {kind: 'ReturnType', text: '->', subtree_size: 2}, {kind: 'FunctionDefinitionStart', text: '{', subtree_size: 7}, {kind: 'ReturnStatementStart', text: 'return'}, {kind: 'Literal', text: '42'}, {kind: 'ReturnStatement', text: ';', subtree_size: 3}, {kind: 'FunctionDefinition', text: '}', subtree_size: 11}, {kind: 'FileEnd', text: ''}, ] ``` In this example, `FileStart`, `FunctionDefinition`, and `FileEnd` are "root" nodes for the tree. Function components are children of `FunctionDefinition`. It's produced in this way because it's an efficient layout to produce with vectorized storage, requiring little context to be maintained during parsing. Because it's stored in postorder, it's also most efficient to process the parsed output in postorder; this affects checking. The parse tree is printed in postorder by default because it matches how the parse tree is expected to be processed within the toolchain , and so can make it easier to reason about. However, the `--preorder` flag may be used in contexts where a preorder representation would be easier to handle. ## Bracketing inside the tree The parse tree is designed to be walked in postorder by checking, allowing checking to be more efficient. To support this, checking sometimes requires context on the meaning of a node when it is encountered. Each `ParseNodeKind` has either a bracketing node, or a specific child count. This helps document and enforce the expected tree structure. When a bracketing node is indicated, it is the opening bracket: it will always be the first child of the parent, and that will be the only time it occurs in the parent's children (it may still occur in children of children). When checking encounters the opening bracket, this means it can make contextual decisions for the later children of the node. Nodes can also have a specific child count, for example, infix operators always have two children: the lhs and rhs expressions. Many nodes have a child count of 0; this just means they're leaf nodes, and will never have children. Because the tree structure is always valid, these are treated as contracts. Some nodes exist only to be used to construct valid tree structures for invalid input, such as `InvalidParse`. Although each subtree's size is also tracked as part of the node, we're currently trying to avoid relying on it and may eliminate it if it turns out to be unnecessary and a meaningful cost for the compiler. ## Visual example To try to explain the transition from code to Parse Tree, consider the statement: ```carbon var x: i32 = y + 1; ``` Lexing creates distinct tokens for each syntactic element, which will form the basis of the parse tree:
Tokens: +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ | var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; | +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+First the `var` keyword is used as a "bracketing" node (VariableIntroducer). When this is seen in a postorder traversal, it tells us to expect the basics of a variable declaration structure.
Tokens:
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| x | | : | | i32 | | = | | y | | + | | 1 | | ; |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Parse tree:
+-----+
| var |
+-----+
Next, we can consider the pattern binding. Here, `x` is the identifier and `i32`
is the type expression. The `:` provides a parent node that must always contain
two children, the name and type expression. Because it always has two direct
children, it doesn't need to be bracketed.
Tokens:
+-----+ +-----+ +-----+ +-----+ +-----+
| = | | y | | + | | 1 | | ; |
+-----+ +-----+ +-----+ +-----+ +-----+
Parse tree:
+-----+ +-----+
| x | | i32 |
+-----+ +-----+
| |
+-------+-------+
|
+-----+ +-----+
| var | | : |
+-----+ +-----+
We use the `=` as a separator (instead of a node with children like `:`) to help
indicate the transition from binding to assignment expression, which is
important for expression parsing during checking.
Tokens:
+-----+ +-----+ +-----+ +-----+
| y | | + | | 1 | | ; |
+-----+ +-----+ +-----+ +-----+
Parse tree:
+-----+ +-----+
| x | | i32 |
+-----+ +-----+
| |
+-------+-------+
|
+-----+ +-----+ +-----+
| var | | : | | = |
+-----+ +-----+ +-----+
The expression is a subtree with `+` as the parent, and the two operands as
child nodes.
Tokens:
+-----+
| ; |
+-----+
Parse tree:
+-----+ +-----+ +-----+ +-----+
| x | | i32 | | y | | 1 |
+-----+ +-----+ +-----+ +-----+
| | | |
+-------+-------+ +-------+-------+
| |
+-----+ +-----+ +-----+ +-----+
| var | | : | | = | | + |
+-----+ +-----+ +-----+ +-----+
Finally, the `;` is used as the "root" of the variable declaration. It's
explicitly tracked as the `;` for a variable declaration so that it's
unambiguously bracketed by `var`.
Tokens:
Parse tree:
+-----+ +-----+ +-----+ +-----+
| x | | i32 | | y | | 1 |
+-----+ +-----+ +-----+ +-----+
| | | |
+-------+-------+ +-------+-------+
| |
+-----+ +-----+ +-----+ +-----+
| var | | : | | = | | + |
+-----+ +-----+ +-----+ +-----+
| | | |
+-----------------------+-------+-----------------------+-------+
|
+-----+
| ; |
+-----+
This is the completed parse tree.
In storage, this tree will be flat and in postorder. Because the order hasn't
changed much from the original code, we can do the reordering for postorder with
a minimal number of nodes being delayed for later output: it will be linear with
respect to the depth of the parse tree.
Tokens:
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| var | | x | | : | | i32 | | = | | y | | + | | 1 | | ; |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
Parse tree:
+-----+ +-----+ +-----+ +-----+
| x | | i32 | | y | | 1 |
+-----+ +-----+ +-----+ +-----+
| | | |
+-------+-------+ +-------+-------+
| |
+-----+ +-----+ +-----+ +-----+
| var | | : | | = | | + |
+-----+ +-----+ +-----+ +-----+
| | | |
+-----------------------+-------+-----------------------+-------+
|
+-----+
| ; |
+-----+
Flattened for storage:
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
| var | | x | | i32 | | : | | = | | y | | 1 | | + | | ; |
+-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+ +-----+
The structural concepts of bracketing nodes (`var` and `;`) and parent nodes
with a known child count (`:` and `+` with 2 children, but also `=` with 0
children) will allow checking to reconstruct the tree as it encounters nodes
during the postorder.
There are other structures that could have been used here, such as `=` being
parent of the `var` and pattern nodes, and `;` being the parent of the `=` and
assignment expression nodes. In that example alternative, the storage order
would be the same; it would only change the tree representation. The current
structure is influenced by choices in checking.
## Handling invalid parses
On an invalid parse, the output tree should still try to mirror the intended
tree structure when possible. There's a balance here, and it's not expected to
try too hard to make things correct, but outputting nodes is preferred. There
are `InvalidParse` nodes which may be used to provide a node when the planned
node kind is too difficult to get correct child counts (bracketed subtrees may
not need an `InvalidParse` node).
When marking a child node with `has_error=true`, parent nodes may also be marked
with `has_error=true`, but try to be conservative about this. As a rule of
thumb, if checking could continue on a parent node without needing the child
node to be fully checked (possibly with incomplete information), then the parent
node should not be marked as `has_error=true`. The goal remains providing
something similar to a well-formed parse tree.
In general, a parent node must have the immediate children described in
[parse/typed_nodes.h](/toolchain/parse/typed_nodes.h), unless it is marked
`has_error=true`. If this is violated for a particular parse tree, an error will
be raised in `Tree::Verify`. Note that an `InvalidParse` node is allowed as a
declaration or expression, and an `InvalidParseSubtree` is allowed as a
declaration. These invalid nodes can be added to more node categories as needed.
Child states may indicate an error to their parent using `ReturnErrorOnState`.
This is particularly intended for when a child state emits a diagnostic, to
prevent the parent state from emitting redundant diagnostics; for example, an
invalid expression might have more invalid tokens following it, and the parent
might skip those without emitting diagnostics.
## How is this accomplished?
The specific approach to producing the desired tree depends on the kind of
grammar rule being implemented, as well as the desired output tree structure.
### Introducer
**Example:** `if (c) { ... }`
Here `if` is the introducer. Many other possible introducers could occur in that
position, such as `while` or `var`, and we want to dispatch based on which token
is present. See
[parse/handle_statement.cpp](/toolchain/parse/handle_statement.cpp).
The first step is to identify the introducer token, typically using a `switch`
or `if` on the `Lex::TokenKind` at the current position:
```cpp
switch (context.PositionKind()) {
case Lex::TokenKind::___: {
...
break;
}
...
}
```
There should be a `default:` (or `else`) case so every kind of token is handled.
This may be an error, in which case:
- A [diagnostic](diagnostics.md) should be emitted.
- An invalid parse node should be added, using something like:
```cpp
context.AddLeafNode(NodeKind::InvalidParse, context.Consume(),
/*has_error=*/true);
```
- At least one node should be consumed, particularly if it will continue with
this state at this position, to avoid an infinite loop.
The default case may also be delegated to another state. For example, in the
state where a statement is expected, if no keyword introducer is recognized, it
switches to the expression-statement state.
Depending on the introducer, different actions can be taken. The most common
case is to:
- Call `context.PushState(StateKind::___);` to mark the beginning of the
statement or declaration and indicate the state that will handle the tokens
after the introducer.
- Call `context.AddLeafNode(NodeKind::___, context.Consume());` to output a
bracketing node for this introducer.
The next state can then add sibling nodes until it gets to the end of the
declaration or statement. The last token, often a semicolon `;`, is used as a
parent node to match the bracketing node of the introducer.
If the introducer token won't be used as a bracketing node, it can be
temporarily skipped after `context.PushState` by calling
`context.ConsumeAndDiscard()` instead of `context.AddLeafNode`. It must be added
to the output tree as a node by some later state, unless an error occurs. For
example, a `for` statement uses the `for` token as the root of the tree -- it
doesn't need a bracketing node since it has a fixed child count. Note that the
token was saved when the state was pushed, and can be retrieved when adding a
node as in this example:
```cpp
auto state = context.PopState();
context.AddNode(NodeKind::ForStatement, state.token, state.subtree_start,
state.has_error);
```
If this state is for an element of a scope like the statements in a code block,
most introducer tokens indicate that the current state should be repeated, to
handle the next statement, but some other token, like a close curly brace (`}`)
means that the state should be exited.
### Optional modifiers before an introducer
**Example:** `virtual fn Foo();`
Here `fn` is the introducer, and `virtual` is an optional modifier that appears
before. See
[parse/handle_decl_scope_loop.cpp](/toolchain/parse/handle_decl_scope_loop.cpp).
Use this pattern when the goal is to produce a subtree that starts with the
introducer as a bracketing node, as in the previous case, followed by nodes for
any modifiers. Note that bracketing is needed here, since the optional modifier
nodes mean that there is not a fixed child count for the parent node. This means
shuffling the introducer node before an unknown number of modifier nodes. This
is accomplished by emitting a placeholder node for the introducer, processing
all the modifiers until reaching the introducer, filling in the placeholder with
the information about the introducer, and then finishing the rest of the
declaration or statement.
- **Step 1**: Save the current value of `context.tree().size()`. This could be
accomplished by calling `context.PushState()`, which saves that value in the
`subtree_start` field of `Context::State`; or by constructing a
`Context::State` value directly, as is done in
[parse/handle_decl_scope_loop.cpp](/toolchain/parse/handle_decl_scope_loop.cpp).
This marks the position of the placeholder node we are going to replace, as
well as the beginning of the subtree we are eventually going to emit for
this declaration or statement.
- **Step 2**: Emit the placeholder node using
`context.AddLeafNode(NodeKind::Placeholder, *context.position());`. The
`NodeKind` and `Lex::TokenIndex` values will be overwritten later.
- **Step 3**: Process tokens until we hit the introducer. All of the nodes we
emit at this point will appear as siblings after the introducer token in the
output tree.
- **Step 4 - success**: If an introducer token is found, replace the
placeholder node using something like:
```cpp
context.ReplacePlaceholderNode(state.subtree_start, introducer_kind,
context.Consume());
```
- `state.subtree_start` is the value of `context.tree().size()` saved in
step 1, which marks the position of the placeholder node in the output
parse tree.
- `introducer_kind` is the `NodeKind` for the introducer of this
declaration or statement, a leaf node that will act as a bracketing node
at the beginning of the subtree for this declaration or statement
- **Step 4 - error**: If we run into something other than a modifier or
introducer before finding an introducer, we need to do error handling:
```cpp
context.ReplacePlaceholderNode(subtree_start, NodeKind::InvalidParseStart,
*context.position(), /*has_error=*/true);
```
- Emit a [diagnostic](diagnostics.md).
- Replace the placeholder node (similar to step 4) with an
`InvalidParseStart` node. It will be associated with the unexpected
token that triggered this error.
- Consume input token up to the likely end of the end of the current
statement or declaration. For example, we might consume up to a `;` or a
token at a lesser indent level using `context.SkipPastLikelyEnd(...)`.
It is important that we consume at least one token in the error case,
otherwise we could have an infinite loop of generating the same error on
the same token.
- Emit a `InvalidParseSubtree` node. This will be the parent of any
emitted modifier nodes, and will be bracketed by the `InvalidParseStart`
node emitted above. It should be associated with the last token
consumed.
```cpp
// Set `iter` to the last token consumed, one before the current position.
auto iter = context.position();
--iter;
context.AddNode(NodeKind::InvalidParseSubtree, *iter, subtree_start,
/*has_error=*/true);
```
- **Step 5**: (If success at step 4) Push whatever states are to be used to
parse the rest of the declaration. The first state pushed (the last state to
be processed) will handle the end of this declaration. That pushed state
should have a `subtree_start` field set to the value of
`context.tree().size()` saved in step 1.
- **Step 6**: When handling the state for the end of the declaration, emit the
root node of subtree:
```cpp
state = context.PopState();
context.AddNode(NodeKind::___, context.Consume(),
state.subtree_start, state.has_error);
```
- This `state.subtree_start` will mark everything since the bracketing
introducer node as the children of this node.
### Something required in context
TODO
Example: name after introducer
[parse/handle_decl_name_and_params.cpp](/toolchain/parse/handle_decl_name_and_params.cpp)
Example: "`[` _implicit parameter list_ `]`" after `impl forall`
[parse/handle_impl.cpp](/toolchain/parse/handle_impl.cpp)
### Optional clauses
#### Case 1: introducer to optional clause is used as parent node
**Example:** The optional `->