This document focuses on the implementation of pattern matching. See here for more on the design and fundamental concepts.
The SemIR for a pattern-matching operation is emitted in three steps:
The SemIR emitted in the pattern step primarily consists of pattern
instructions, which are instructions that describe the pattern itself. For
example, given the pattern (x: i32, y:i32), the pattern step might emit the
following SemIR:
%x.patt: %pattern_type.7ce = binding_pattern x [concrete]
%y.patt: %pattern_type.7ce = binding_pattern y [concrete]
%.loc4_21: %pattern_type.511 = tuple_pattern (%x.patt, %y.patt) [concrete]
Pattern instructions do not represent executable code, and are generally ignored
during lowering. Instead, they descriptively represent the pattern itself as a
kind of constant value, and their primary consumer is the match step. The type
of a pattern instruction is a pattern type, which is represented by a
PatternType instruction. For example, the constants block might define the
types in the above SemIR like so:
%i32: type = class_type @Int, @Int(%int_32) [concrete]
%pattern_type.7ce: type = pattern_type %i32 [concrete]
%tuple.type: type = tuple_type (%i32, %i32) [concrete]
%pattern_type.511: type = pattern_type %tuple.type [concrete]
We can read this as saying that the type of %x.patt and %y.patt is "pattern
that matches an i32 scrutinee", and the type of %.loc4_21 is "pattern that
matches a (i32, i32) scrutinee".
Pattern instructions are only emitted during the pattern step, but that step can
emit non-pattern instructions as well. For example, in a pattern like
(x: i32, a + b), i32 and a + b are ordinary expressions, and so their
SemIR must be emitted during the initial traversal of the parse tree, as with
any other expression.
All the pattern instructions for a given full-pattern are grouped together in a
distinct block that contains only pattern instructions. Consequently,
Check::Context maintains pattern_block_stack as a separate InstBlockStack
for pattern blocks, and operations like AddInst automatically put
newly-created pattern insts on that stack.
The SemIR produced in the first two steps is (like most SemIR) generally in post-order, reflecting the order of the parse tree. However, the match step traversal is performed pre-order, starting with the root instruction of the pattern and traversing into its dependencies.
In some cases it is necessary for the pattern step to allocate instructions that
won't actually be emitted until the match step, because they are responsible for
performing pattern matching. When that happens, they are allocated but not added
to a block, and their IDs are stored in the Check::Context so that they can be
spliced into the current block at the appropriate point in the match step.
Currently this happens in two cases, which are handled using two maps in
Check::Context from pattern instruction IDs to the corresponding match
instruction IDs:
A name binding can be used within the same pattern that declares it:
match (x) {
case (n: i32, n) => ...
For this to work, the name n needs to be added to the scope as soon as we
handle its declaration, and it needs to resolve to the ValueBinding
instruction that binds a value to that name. This means that the
ValueBinding instruction needs to be allocated during the pattern step,
even though it is part of matching, not part of the pattern.
Context::bind_name_map stores these ValueBindings, keyed by the
corresponding ValueBindingPattern instruction.
A var pattern allocates storage during matching, which is represented by a
VarStorage instruction. This instruction must be allocated during the
pattern step, so that it can be used as the output parameter of scrutinee
expression evaluation during the scrutinee step. Context::var_storage_map
stores these VarStorage instructions, keyed by the corresponding
VarPattern instruction.
As noted earlier, the pattern step can also emit non-pattern instructions to
evaluate expressions that are embedded in the pattern, such as the type
expressions of binding patterns, and expressions that are used as patterns
themselves (although those have not been implemented yet). The parse tree
doesn't mark these situations in advance: any given subpattern might turn out to
be one that emits non-pattern instructions. To handle these situations, we
speculatively push an instruction block onto the (non-pattern) stack whenever we
are about to begin handling a subpattern, and then pop it at the end of the
subpattern, with different treatment depending on whether the subpattern turned
out to involve a subexpression. This is handled by BeginSubpattern,
ConsumeSubpatternExpr, EndSubpattern, and EndEmptySubpattern.
One further complication here is that the type expression can contain control
flow (such as an if expression). Consequently, we can't represent the type
expression SemIR as a single block; instead, we represent the SemIR for a given
type expression as a
single-entry, single-exit (SE/SE) region,
potentially consisting of multiple blocks.
Note: The original motivation for rigorously excluding non-pattern instructions from the pattern block may no longer apply. In particular, it may make sense to put non-pattern instructions in the pattern block when they represent an expression that is part of the pattern. If so, substantial parts of this design might change. See issue #5351.
At the same time as all of that, we have to manage the pattern block stack as well. We attempt to do this precisely rather than speculatively, by leveraging the parser to precisely mark the nodes immediately before full-patterns, and pushing the pattern block stack when we handle those nodes. We then rely on signals from both the parser and the node stack to determine when to pop from the pattern block stack.
In the case of let and var decls, this is fairly straightforward: the
beginning is marked by the LetIntroducer or VarIntroducer node, and the end
is marked by the LetInitializer or VarInitializer, or by the VarDecl in
the case of a var decl with no initializer. Similarly, the beginning of an
impl forall parameter list is marked by the Forall node, and the end is
marked by the ImplDecl or ImplDefinitionStart.
The case of a parameterized name (such as Bar(y: i32)) is more challenging.
The node immediately before the start of the full-pattern is an identifier, but
an identifier doesn't necessarily mark the start of a full-pattern. We've solved
that by having the parser mark identifier nodes that are followed by
full-patterns (using lookahead). Rather than use additional storage for what is
logically a single bit of data, we effectively smuggle that bit into the kind
enum by having separate node kinds IdentifierNameMaybeBeforeSignature and
IdentifierNameNotBeforeSignature.
If the parameterized name is a name qualifier (such as the first part of
Foo(X:! i32).Bar(y: i32)), the node immediately after it will be the qualifier
node. As of this writing, we bifurcate qualifier nodes into
NameQualifierWithParams and NameQualifierWithoutParams, much like we do with
identifier names, but we don't actually use that information, and instead use
the presence of parameters on the node stack to determine whether to pop the
pattern block stack.
Open question: should we re-combine the two qualifier node kinds?
If the parameterized name is not part of a name qualifier, the node immediately
after it will be a *Decl or *DefinitionStart node of the appropriate kind
(for example FunctionDecl or FunctionDefinitionStart if the introducer was
fn). Note that this means the pattern block is still on the stack while
handling the return type of a function. This is intentional, because we model
the return type as declaring an output parameter (see below), which makes it
functionally part of the parameter pattern.
Call parameters and argumentsSemIR models a function call as a Call instruction, which has an instruction
block consisting of one instruction per argument. Correspondingly, the SemIR
representation of a function has a block consisting of one instruction per
parameter. We refer to these as Call arguments and Call parameters,
because they don't necessarily correspond to the colloquial meaning of
"arguments" and "parameters" (which are sometimes referred to as syntactic
arguments and parameters).
For example, consider this function:
fn F(T:! type, U:! type) -> Core.String;
The Call instruction is a runtime-phase operation, so it notionally runs after
compile-time parameters have already been bound to values. As a result, a Call
instruction calling F does not pass values for either T or U. On the other
hand, it does pass a reference to the storage that F should construct the
return value in. So although we would colloquially say that F takes two
parameters of type type, it has a single Call parameter of type
Core.String.
If Carbon supports general patterns in function parameter lists, that introduces
additional ways that Call parameters can diverge from the colloquial meaning.
For example:
fn G(x: i32, var (y: i32, z: i32));
fn H(x: i32, (y: i32, var z: i32));
A var pattern converts the scrutinee to a durable reference expression, and
then performs further pattern matching on the object it refers to. As a result,
G has two Call parameters: a value corresponding to x, and a reference to
an object of type (i32, i32), corresponding to both y and z. On the other
hand, H has 3 Call parameters: values corresponding to x and y, and a
reference corresponding to z.
The Call parameters define the API boundary between the caller and callee at
the SemIR level. As a result, responsibility for matching the arguments against
the parameter list is split between the caller and the callee. Continuing the
example from above, given the call G(0, (x, y)), the caller is responsible for
converting 0 to i32, and for initializing a new (i32, i32) object from
(x, y), but the callee is responsible for binding the name x to its first
Call parameter, and for destructuring its second Call parameter and binding
the names y and z to its elements.
In SemIR we represent this situation with special ParamPattern instructions,
which mark the boundary: there is exactly one ParamPattern instruction for
each Call parameter, which matches the entire corresponding Call argument.
If a ParamPattern has a subpattern, it is matched on the callee side, and
everything above it is matched primarily on the caller side. There are multiple
kinds of ParamPattern instruction, which correspond to different ways of
passing a parameter (such as by reference or by value).
When performing callee-side pattern matching, we do not have an actual scrutinee
expression. Instead, for each ParamPattern instruction we generate a
corresponding Param instruction, which reads from the corresponding entry in
the Call argument list, and we use that as the scrutinee of the
ParamPattern. Every ParamPattern kind has a corresponding Param kind.
If a function has a declared return type, the function takes an additional
Call parameter, which points to the storage that should be initialized with
the return value. This Call parameter is represented as ReturnSlotPattern
instruction with an OutParamPattern instruction as a subpattern. The
ReturnSlotPattern also represents the return type declaration itself, such as
in FunctionFields. The SemIR that matches these patterns consists of a
ReturnSlot instruction, which binds the special name NameId::ReturnSlot to
the OutParam instruction representing the storage passed by the caller.
This structure is analogous to the handling of an ordinary by-value parameter,
which is represented in the Call parameters as an WrapperBindingPattern
instruction with a ValueParamPattern subpattern, and in the pattern-matching
SemIR as a ValueBinding instruction that binds the parameter name to the
ValueParam instruction representing the argument passed by the caller.
Note that if the return type does not have an in-place value representation
(meaning that the return value should not be passed in memory), these
instructions will all still be generated, but the SemIR for return statements
will not access the ReturnSlot, and the Call argument list will not contain
an argument corresponding to the OutParamPattern (and so it will be one
element shorter than the Call parameter list). However, the
ReturnSlotPattern is still used, in its other role as a representation of the
return type declaration. This leads to a potentially confusing situation, where
the term "return slot" sometimes refers to the ReturnSlotPattern (for example
in FunctionFields::return_slot_pattern), which is present for any function
with a declared return type, and sometimes refers to the actual storage provided
by the caller (for example in ReturnTypeInfo::has_return_slot), which is
present only if the return type has an in-place value representation.
TODO: When the return type isn't in-place, the
OutParamPatternshould probably not be in theCallparameter list (for consistency with theCallargument list), and possibly theOutParamPattern,OutParam, andReturnSlotinstructions should not be emitted in the first place. Furthermore, we should find a way to resolve the inconsistent "return slot" terminology.