# Pattern matching ## Table of contents - [Overview](#overview) - [Pattern instructions](#pattern-instructions) - [Instruction ordering](#instruction-ordering) - [Parser-driven pattern block pushing](#parser-driven-pattern-block-pushing) - [Function parameters](#function-parameters) - [`Call` parameters and arguments](#call-parameters-and-arguments) - [Caller and callee matching](#caller-and-callee-matching) - [The return slot](#the-return-slot) ## Overview This document focuses on the implementation of pattern matching. See [here](/docs/design/pattern_matching.md) for more on the design and fundamental concepts. The SemIR for a pattern-matching operation is emitted in three steps: 1. **Pattern:** Traverse the parse tree of the pattern to emit SemIR that abstractly describes the pattern. 2. **Scrutinee:** Traverse the parse tree of the scrutinee expression to emit SemIR that evaluates it. 3. **Match:** Traverse the pattern SemIR from step 1 (sometimes in conjunction with the scrutinee SemIR) to emit SemIR that actually performs pattern matching. ## Pattern instructions The SemIR emitted in the pattern step primarily consists of _pattern instructions_, which are instructions that describe the pattern itself. For example, given the pattern `(x: i32, y:i32)`, the pattern step might emit the following SemIR: ``` %x.patt: %pattern_type.7ce = binding_pattern x [concrete] %y.patt: %pattern_type.7ce = binding_pattern y [concrete] %.loc4_21: %pattern_type.511 = tuple_pattern (%x.patt, %y.patt) [concrete] ``` Pattern instructions do not represent executable code, and are generally ignored during lowering. Instead, they descriptively represent the pattern itself as a kind of constant value, and their primary consumer is the match step. The type of a pattern instruction is a _pattern type_, which is represented by a `PatternType` instruction. For example, the `constants` block might define the types in the above SemIR like so: ``` %i32: type = class_type @Int, @Int(%int_32) [concrete] %pattern_type.7ce: type = pattern_type %i32 [concrete] %tuple.type: type = tuple_type (%i32, %i32) [concrete] %pattern_type.511: type = pattern_type %tuple.type [concrete] ``` We can read this as saying that the type of `%x.patt` and `%y.patt` is "pattern that matches an `i32` scrutinee", and the type of `%.loc4_21` is "pattern that matches a `(i32, i32)` scrutinee". Pattern instructions are only emitted during the pattern step, but that step can emit non-pattern instructions as well. For example, in a pattern like `(x: i32, a + b)`, `i32` and `a + b` are ordinary expressions, and so their SemIR must be emitted during the initial traversal of the parse tree, as with any other expression. All the pattern instructions for a given full-pattern are grouped together in a distinct block that contains only pattern instructions. Consequently, `Check::Context` maintains `pattern_block_stack` as a separate `InstBlockStack` for pattern blocks, and provides separate methods like `AddPatternInst` for adding instructions to it. ## Instruction ordering The SemIR produced in the first two steps is (like most SemIR) generally in post-order, reflecting the order of the parse tree. However, the match step traversal is performed pre-order, starting with the root instruction of the pattern and traversing into its dependencies. In some cases it is necessary for the pattern step to allocate instructions that won't actually be emitted until the match step, because they are responsible for performing pattern matching. When that happens, they are allocated but not added to a block, and their IDs are stored in the `Check::Context` so that they can be spliced into the current block at the appropriate point in the match step. Currently this happens in two cases, which are handled using two maps in `Check::Context` from pattern instruction IDs to the corresponding match instruction IDs: - A name binding can be used within the same pattern that declares it: ```carbon match (x) { case (n: i32, n) => ... ``` For this to work, the name `n` needs to be added to the scope as soon as we handle its declaration, and it needs to resolve to the `ValueBinding` instruction that binds a value to that name. This means that the `ValueBinding` instruction needs to be allocated during the pattern step, even though it is part of matching, not part of the pattern. `Context::bind_name_map` stores these `ValueBinding`s, keyed by the corresponding `ValueBindingPattern` instruction. - A `var` pattern allocates storage during matching, which is represented by a `VarStorage` instruction. This instruction must be allocated during the pattern step, so that it can be used as the output parameter of scrutinee expression evaluation during the scrutinee step. `Context::var_storage_map` stores these `VarStorage` instructions, keyed by the corresponding `VarPattern` instruction. As noted earlier, the pattern step can also emit non-pattern instructions to evaluate expressions that are embedded in the pattern, such as the type expressions of binding patterns, and expressions that are used as patterns themselves (although those have not been implemented yet). The parse tree doesn't mark these situations in advance: any given subpattern might turn out to be one that emits non-pattern instructions. To handle these situations, we speculatively push an instruction block onto the (non-pattern) stack whenever we are about to begin handling a subpattern, and then pop it at the end of the subpattern, with different treatment depending on whether the subpattern turned out to be a subexpression. This is handled by `BeginSubpattern`, `EndSubpatternAsExpr`, and `EndSubpatternAsNonExpr`. One further complication here is that the type expression can contain control flow (such as an `if` expression). Consequently, we can't represent the type expression SemIR as a single block; instead, we represent the SemIR for a given type expression as a [single-entry, single-exit (SE/SE) region](https://en.wikipedia.org/wiki/Single-entry_single-exit), potentially consisting of multiple blocks. > **Note:** The original motivation for rigorously excluding non-pattern > instructions from the pattern block may no longer apply. In particular, it may > make sense to put non-pattern instructions in the pattern block when they > represent an expression that is part of the pattern. If so, substantial parts > of this design might change. See > [issue #5351](https://github.com/carbon-language/carbon-lang/issues/5351). ## Parser-driven pattern block pushing At the same time as all of that, we have to manage the _pattern_ block stack as well. We attempt to do this precisely rather than speculatively, by leveraging the parser to precisely mark the nodes immediately before full-patterns, and pushing the pattern block stack when we handle those nodes. We then rely on signals from both the parser and the node stack to determine when to pop from the pattern block stack. In the case of `let` and `var` decls, this is fairly straightforward: the beginning is marked by the `LetIntroducer` or `VarIntroducer` node, and the end is marked by the `LetInitializer` or `VarInitializer`, or by the `VarDecl` in the case of a `var` decl with no initializer. Similarly, the beginning of an `impl forall` parameter list is marked by the `Forall` node, and the end is marked by the `ImplDecl` or `ImplDefinitionStart`. The case of a parameterized name (such as `Bar(y: i32)`) is more challenging. The node immediately before the start of the full-pattern is an identifier, but an identifier doesn't necessarily mark the start of a full-pattern. We've solved that by having the parser mark identifier nodes that are followed by full-patterns (using lookahead). Rather than use additional storage for what is logically a single bit of data, we effectively smuggle that bit into the kind enum by having separate node kinds `IdentifierNameBeforeParams` and `IdentifierNameNotBeforeParams`. If the parameterized name is a name qualifier (such as the first part of `Foo(X:! i32).Bar(y: i32)`), the node immediately after it will be the qualifier node. As of this writing, we bifurcate qualifier nodes into `NameQualifierWithParams` and `NameQualifierWithoutParams`, much like we do with identifier names, but we don't actually use that information, and instead use the presence of parameters on the node stack to determine whether to pop the pattern block stack. > **Open question:** should we re-combine the two qualifier node kinds? If the parameterized name is not part of a name qualifier, the node immediately after it will be a `*Decl` or `*DefinitionStart` node of the appropriate kind (for example `FunctionDecl` or `FunctionDefinitionStart` if the introducer was `fn`). Note that this means the pattern block is still on the stack while handling the return type of a function. This is intentional, because we model the return type as declaring an output parameter (see below), which makes it functionally part of the parameter pattern. ## Function parameters ### `Call` parameters and arguments SemIR models a function call as a `Call` instruction, which has an instruction block consisting of one instruction per argument. Correspondingly, the SemIR representation of a function has a block consisting of one instruction per parameter. We refer to these as _`Call` arguments_ and _`Call` parameters_, because they don't necessarily correspond to the colloquial meaning of "arguments" and "parameters" (which are sometimes referred to as _syntactic_ arguments and parameters). For example, consider this function: ```carbon fn F(T:! type, U:! type) -> Core.String; ``` The `Call` instruction is a runtime-phase operation, so it notionally runs after compile-time parameters have already been bound to values. As a result, a `Call` instruction calling `F` does not pass values for either `T` or `U`. On the other hand, it does pass a reference to the storage that `F` should construct the return value in. So although we would colloquially say that `F` takes two parameters of type `type`, it has a single `Call` parameter of type `Core.String`. If Carbon supports general patterns in function parameter lists, that introduces additional ways that `Call` parameters can diverge from the colloquial meaning. For example: ```carbon fn G(x: i32, var (y: i32, z: i32)); fn H(x: i32, (y: i32, var z: i32)); ``` A `var` pattern converts the scrutinee to a durable reference expression, and then performs further pattern matching on the object it refers to. As a result, `G` has two `Call` parameters: a value corresponding to `x`, and a reference to an object of type `(i32, i32)`, corresponding to both `y` and `z`. On the other hand, `H` has 3 `Call` parameters: values corresponding to `x` and `y`, and a reference corresponding to `z`. ### Caller and callee matching The `Call` parameters define the API boundary between the caller and callee at the SemIR level. As a result, responsibility for matching the arguments against the parameter list is split between the caller and the callee. Continuing the example from above, given the call `G(0, (x, y))`, the caller is responsible for converting `0` to `i32`, and for initializing a new `(i32, i32)` object from `(x, y)`, but the callee is responsible for binding the name `x` to its first `Call` parameter, and for destructuring its second `Call` parameter and binding the names `y` and `z` to its elements. In SemIR we represent this situation with special `ParamPattern` instructions, which mark the boundary: there is exactly one `ParamPattern` instruction for each `Call` parameter, which matches the entire corresponding `Call` argument. The subpatterns of the `ParamPattern`s are matched on the callee side, and everything above them is matched on the caller side. There are multiple kinds of `ParamPattern` instruction, which correspond to different ways of passing a parameter (such as by reference or by value). When performing callee-side pattern matching, we do not have an actual scrutinee expression. Instead, for each `ParamPattern` instruction we generate a corresponding `Param` instruction, which reads from the corresponding entry in the `Call` argument list, and we use that as the scrutinee of the `ParamPattern`. Every `ParamPattern` kind has a corresponding `Param` kind. ### The return slot If a function has a declared return type, the function takes an additional `Call` parameter, which points to the storage that should be initialized with the return value. This `Call` parameter is represented as an `OutParamPattern` instruction with a `ReturnSlotPattern` instruction as a subpattern. The `ReturnSlotPattern` also represents the return type declaration itself, such as in `FunctionFields`. The SemIR that matches these patterns consists of a `ReturnSlot` instruction, which binds the special name `NameId::ReturnSlot` to the `OutParam` instruction representing the storage passed by the caller. This structure is analogous to the handling of an ordinary by-value parameter, which is represented in the `Call` parameters as a `ValueParamPattern` instruction with a `ValueBindingPattern`, and in the pattern-matching SemIR as a `ValueBinding` instruction that binds the parameter name to the `ValueParam` instruction representing the argument passed by the caller. Note that if the return type does not have an in-place value representation (meaning that the return value should not be passed in memory), these instructions will all still be generated, but the SemIR for `return` statements will not access the `ReturnSlot`, and the `Call` argument list will not contain an argument corresponding to the `OutParamPattern` (and so it will be one element shorter than the `Call` parameter list). However, the `ReturnSlotPattern` is still used, in its other role as a representation of the return type declaration. This leads to a potentially confusing situation, where the term "return slot" sometimes refers to the `ReturnSlotPattern` (for example in `FunctionFields::return_slot_pattern`), which is present for any function with a declared return type, and sometimes refers to the actual storage provided by the caller (for example in `ReturnTypeInfo::has_return_slot`), which is present only if the return type has an in-place value representation. > **TODO:** When the return type isn't in-place, the `OutParamPattern` should > probably not be in the `Call` parameter list (for consistency with the `Call` > argument list), and possibly the `OutParamPattern`, `OutParam`, and > `ReturnSlot` instructions should not be emitted in the first place. > Furthermore, we should find a way to resolve the inconsistent "return slot" > terminology.