
Revise README and add subdir READMEs (#1023)

Geoff Romer 4 years ago
parent
commit
6c3e1f6d7c

+ 105 - 176
executable_semantics/README.md

@@ -6,134 +6,118 @@ Exceptions. See /LICENSE for license information.
 SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
 -->
 
-This directory contains a work-in-progress executable semantics. It started as
-an executable semantics for Featherweight C and it is migrating into an
-executable semantics for the Carbon language. It includes a parser, type
-checker, and abstract machine.
-
-This language currently includes several kinds of values: integer, booleans,
-functions, and structs. A kind of safe union, called a `choice`, is in progress.
-Regarding control-flow, it includes if statements, while loops, break, continue,
-function calls, and a variant of `switch` called `match` is in progress.
-
-The grammar of the language matches the one in Proposal
-[#162](https://github.com/carbon-language/carbon-lang/pull/162). The type
-checker and abstract machine do not yet have a corresponding proposal.
-Nevertheless they are present here to help test the parser but should not be
-considered definitive.
-
-The parser is implemented using the flex and bison parser generator tools.
-
--   [`lexer.lpp`](syntax/lexer.lpp) the lexer specification
--   [`parser.ypp`](syntax/parser.ypp) the grammar
-
-The parser translates program text into an abstract syntax tree (AST), defined
-in the [ast](ast/) subdirectory. The `UnimplementedExpression` node type can be
-used to define new expression syntaxes without defining their semantics, and the
-same techniques can be applied to other kinds of AST nodes as needed. See the
-handling of the `UNIMPL_EXAMPLE` token for an example of how this is done, and
-see [`unimplemented_example_test.cpp`](syntax/unimplemented_example_test.cpp)
-for an example of how to test it.
-
-The [type checker](interpreter/type_checker.h) defines what it means for an AST
-to be a valid program. The type checker prints an error and exits if the AST is
-invalid.
-
-The parser and type checker together specify the static (compile-time)
-semantics.
-
-The dynamic (run-time) semantics is specified by an abstract machine. Abstract
-machines have several positive characteristics that make them good for
-specification:
-
--   abstract machines operate on the AST of the program (and not some
-    lower-level representation such as bytecode) so they directly connect the
-    program to its behavior
-
--   abstract machines can easily handle language features with complex
-    control-flow, such as goto, exceptions, coroutines, and even first-class
-    continuations.
-
-The one down-side of abstract machines is that they are not as simple as a
-definitional interpreter (a recursive function that interprets the program), but
-it is more difficult to handle complex control flow in a definitional
-interpreter.
-
-[InterpProgram()](interpreter/interpreter.h) runs an abstract machine using the
-[interpreter](interpreter/), as described below.
-
-## Abstract Machine
-
-The abstract machine implements a state-transition system. The state is defined
-by the `State` structure, which includes three components: the procedure call
-stack, the heap, and the function definitions. The `Step` function updates the
-state by executing a little bit of the program. The `Step` function is called
-repeatedly to execute the entire program.
-
-An implementation of the language (such as a compiler) must be observationally
-equivalent to this abstract machine. The notion of observation is different for
-each language, and can include things like input and output. This language is
-currently so simple that the only thing that is observable is the final result,
-an integer. So an implementation must produce the same final result as the one
-produces by the abstract machine. In particular, an implementation does **not**
-have to mimic each step of the abstract machine and does not have to use the
-same kinds of data structures to store the state of the program.
-
-A procedure call frame, defined by the `Frame` structure, includes a pointer to
-the function being called, the environment that maps variables to their
-addresses, and a to-do list of actions. Each action corresponds to an expression
-or statement in the program. The `Action` structure represents an action. An
-action often spawns other actions that needs to be completed first and
-afterwards uses their results to complete its action. To keep track of this
-process, each action includes a position field `pos` that stores an integer that
-starts at `-1` and increments as the action makes progress. For example, suppose
-the action associated with an addition expression `e1 + e2` is at the top of the
-to-do list:
-
-    (e1 + e2) [-1] :: ...
-
-When this action kicks off (in the `StepExp` function), it increments `pos` to
-`0` and pushes `e1` onto the to-do list, so the top of the todo list now looks
-like:
-
-    e1 [-1] :: (e1 + e2) [0] :: ...
-
-Skipping over the processing of `e1`, it eventually turns into an integer value
-`n1`:
-
-    n1 :: (e1 + e2) [0]
-
-Because there is a value at the top of the to-do list, the `Step` function
-invokes `HandleValue` which then dispatches on the next action on the to-do
-list, in this case the addition. The addition action spawns an action for
-subexpression `e2`, increments `pos` to `1`, and remembers `n1`.
-
-    e2 [-1] :: (e1 + e2) [1](n1) :: ...
+`executable_semantics` is an implementation of Carbon whose primary purpose is
+to act as a clear specification of the language. As an extension of that goal,
+it can also be used as a platform for prototyping and validating changes to the
+language. Consequently, it prioritizes straightforward, readable code over
+performance, diagnostic quality, and other conventional implementation
+priorities. In other words, its intended audience is people working on the
+design of Carbon, and it is not intended for real-world Carbon programming on
+any scale. See the [`toolchain`](/toolchain/) directory for a separate
+implementation that's focused on the needs of Carbon users.
+
+## Overview
+
+`executable_semantics` represents Carbon code using an abstract syntax tree
+(AST), which is defined in the [`ast`](ast/) directory. The [`syntax`](syntax/)
+directory contains the lexer and parser, which define how the AST is generated
+from Carbon code. The [`interpreter`](interpreter/) directory contains the
+remainder of the implementation.
+
+`executable_semantics` is an interpreter rather than a compiler, although it
+attempts to separate compile time from run time, since that separation is an
+important constraint on Carbon's design.
+
+## Programming conventions
+
+The class hierarchies in `executable_semantics` are built to support
+[LLVM-style RTTI](https://llvm.org/docs/HowToSetUpLLVMStyleRTTI.html), and
+define a `kind` accessor that returns an enum identifying the concrete type.
+`executable_semantics` typically relies less on virtual dispatch, and more on
+using `kind` as the key of a `switch` and then down-casting in the individual
+cases. As a result, adding a new derived class to a hierarchy requires updating
+existing code to handle it. It is generally better to avoid defining `default`
+cases for RTTI switches, so that the compiler can help ensure the code is
+updated when a new type is added.
+
+`executable_semantics` never uses plain pointer types directly. Instead, we use
+the [`Nonnull<T*>`](common/nonnull.h) alias for pointers that are not nullable,
+or `std::optional<Nonnull<T*>>` for pointers that are nullable.
+
+Many of the most commonly-used objects in `executable_semantics` have lifetimes
+that are tied to the lifespan of the entire Carbon program. We manage the
+lifetimes of those objects by allocating them through an
+[`Arena`](common/arena.h) object, which can allocate objects of arbitrary types,
+and retains ownership of them. As of this writing, all of `executable_semantics`
+uses a single `Arena` object, but we may introduce multiple `Arena`s for
+different lifetime groups in the future.
+
+For simplicity, `executable_semantics` generally treats all errors as fatal.
+Errors caused by bugs in the user-provided Carbon code should be reported with
+the macros in [`error.h`](common/error.h). Errors caused by bugs in
+`executable_semantics` itself should be reported with
+[`CHECK` or `FATAL`](../common/check.h).
 
-Skipping over the processing of `e2`, it eventually turns into an integer value
-`n2`:
+## Example Programs (Regression Tests)
+
+The [`testdata/`](testdata/) subdirectory includes some example programs with
+expected output.
+
+These tests make use of LLVM's
+[lit](https://llvm.org/docs/CommandGuide/lit.html) and
+[FileCheck](https://llvm.org/docs/CommandGuide/FileCheck.html). Tests have
+boilerplate at the top:
+
+```carbon
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// RUN: executable_semantics %s 2>&1 | \
+// RUN:   FileCheck --match-full-lines --allow-unused-prefixes=false %s
+// RUN: executable_semantics --trace %s 2>&1 | \
+// RUN:   FileCheck --match-full-lines --allow-unused-prefixes %s
+// AUTOUPDATE: executable_semantics %s
+// CHECK: result: 0
 
-    n2 :: (e1 + e2) [1](n1) :: ...
+package ExecutableSemanticsTest api;
+```
 
-Again the `Step` function invokes `HandleValue` and dispatches to the addition
-action which performs the arithmetic and pushes the result on the to-do list.
-Let `n3` be the sum of `n1` and `n2`.
+To explain this boilerplate:
 
-    n3 :: ...
+-   The standard copyright is expected.
+-   The `RUN` lines indicate two commands for `lit` to execute using the file:
+    one without `--trace` output, one with.
+    -   Output is piped to `FileCheck` for verification.
+    -   Setting `--allow-unused-prefixes` to false when processing the ordinary
+        output, and true when handling the `--trace` output, allows us to omit
+        the tracing output from the `CHECK` lines, while ensuring they cover all
+        non-tracing output.
+    -   Setting `--match-full-lines` in both cases indicates that each `CHECK`
+        line must match a complete output line, with no extra characters before
+        or after the `CHECK` pattern.
+    -   `RUN:` will be followed by the `not` command when failure is expected.
+        In particular, `RUN: not executable_semantics ...`.
+    -   `%s` is a
+        [`lit` substitution](https://llvm.org/docs/CommandGuide/lit.html#substitutions)
+        for the path to the given test file.
+-   The `AUTOUPDATE` line indicates that `CHECK` lines will be automatically
+    inserted immediately below by the `./update_checks.py` script.
+-   The `CHECK` lines indicate expected output, verified by `FileCheck`.
+    -   Where a `CHECK` line contains text like `{{.*}}`, the double curly
+        braces indicate a contained regular expression.
+-   The `package` is required in all test files, per normal Carbon syntax rules.
 
-The heap is an array of values. It is used to store anything that is mutable,
-including function parameters and local variables. An address is simply an index
-into the array. The assignment operation stores the value of the right-hand side
-into the heap at the index specified by the address of the left-hand side
-lvalue.
+### Useful commands
 
-Function calls push a new frame on the stack and the `return` statement pops a
-frame off the stack. The parameter passing semantics is call-by-value, so the
-machine applies `CopyVal` to the incoming arguments and the outgoing return
-value. Also, the machine kills the values stored in the parameters and local
-variables when the function call is complete.
+-   `./update_checks.py` -- Updates expected output.
+-   `bazel test :executable_semantics_lit_test --test_output=errors` -- Runs
+    tests and prints any errors.
+-   `bazel test :executable_semantics_lit_test --test_output=errors --test_arg=--filter=basic_syntax/.*`
+    -- Only runs tests in the `basic_syntax` directory; `--filter` is a regular
+    expression.
 
-## Experimental: Delimited Continuations
+## Experimental feature: Delimited Continuations
 
 Delimited continuations provide a kind of resumable exception with first-class
 continuations. The point of experimenting with this feature is not to say that
@@ -214,58 +198,3 @@ We adapted the feature to operate in a more imperative manner. The
 `shift` to pause and capture the continuation object. The `__run` feature is
 equivalent to calling the continuation. The `__await` feature is equivalent to a
 `shift` except that it updates the continuation in place.
-
-## Example Programs (Regression Tests)
-
-The [`testdata/`](testdata/) subdirectory includes some example programs with
-expected output.
-
-These tests make use of LLVM's
-[lit](https://llvm.org/docs/CommandGuide/lit.html) and
-[FileCheck](https://llvm.org/docs/CommandGuide/FileCheck.html). Tests have
-boilerplate at the top:
-
-```carbon
-// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
-// Exceptions. See /LICENSE for license information.
-// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
-//
-// RUN: executable_semantics %s 2>&1 | \
-// RUN:   FileCheck --match-full-lines --allow-unused-prefixes=false %s
-// RUN: executable_semantics --trace %s 2>&1 | \
-// RUN:   FileCheck --match-full-lines --allow-unused-prefixes %s
-// AUTOUPDATE: executable_semantics %s
-// CHECK: result: 0
-
-package ExecutableSemanticsTest api;
-```
-
-To explain this boilerplate:
-
--   The standard copyright is expected.
--   The `RUN` lines indicate two commands for `lit` to execute using the file:
-    one without `--trace` output, one with.
-    -   Output is piped to `FileCheck` for verification.
-    -   `-allow-unused-prefixes` controls that output of the command without
-        `--trace` should _precisely_ match `CHECK` lines, whereas the command
-        with `--trace` will be a superset.
-    -   `RUN:` will be followed by the `not` command when failure is expected.
-        In particular, `RUN: not executable_semantics ...`.
-    -   `%s` is a
-        [`lit` substitution](https://llvm.org/docs/CommandGuide/lit.html#substitutions)
-        for the path to the given test file.
--   The `AUTOUPDATE` line indicates that `CHECK` lines will be automatically
-    inserted immediately below by the `./update_checks.py` script.
--   The `CHECK` lines indicate expected output, verified by `FileCheck`.
-    -   Where a `CHECK` line contains text like `{{.*}}`, the double curly
-        braces indicate a contained regular expression.
--   The `package` is required in all test files, per normal Carbon syntax rules.
-
-Useful commands are:
-
--   `./update_checks.py` -- Updates expected output.
--   `bazel test :executable_semantics_lit_test --test_output=errors` -- Runs
-    tests and prints any errors.
--   `bazel test :executable_semantics_lit_test --test_output=errors --test_arg=--filter=basic_syntax/.*`
-    -- Only runs tests in the `basic_syntax` directory; `--filter` is a regular
-    expression.
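
As an illustration outside the diff itself, the "Programming conventions" section added to `executable_semantics/README.md` above (a `kind` accessor used as a `switch` key, `Nonnull<T*>` in place of plain pointers, and allocation through an `Arena`) might look roughly like the sketch below. All names here (`ExpressionKind`, `IntLiteral`, `Negation`, `Evaluate`, `Arena::New`) are hypothetical, and `Nonnull` is reduced to a plain alias; the real definitions in `common/nonnull.h` and `common/arena.h` may differ.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Assumption: a documentation-only alias; it does not enforce non-nullness.
template <typename T>
using Nonnull = T;

// Assumption: a minimal arena that owns every object it creates until the
// arena itself is destroyed.
class Arena {
 public:
  template <typename T, typename... Args>
  auto New(Args&&... args) -> Nonnull<T*> {
    auto owned = std::make_shared<T>(std::forward<Args>(args)...);
    T* raw = owned.get();
    // shared_ptr<void> remembers how to destroy the original T.
    owned_.push_back(std::move(owned));
    return raw;
  }

 private:
  std::vector<std::shared_ptr<void>> owned_;
};

// The enum that `kind()` returns; one enumerator per concrete class.
enum class ExpressionKind { IntLiteral, Negation };

class Expression {
 public:
  auto kind() const -> ExpressionKind { return kind_; }

 protected:
  explicit Expression(ExpressionKind kind) : kind_(kind) {}

 private:
  ExpressionKind kind_;
};

class IntLiteral : public Expression {
 public:
  explicit IntLiteral(int value)
      : Expression(ExpressionKind::IntLiteral), value_(value) {}
  auto value() const -> int { return value_; }

 private:
  int value_;
};

class Negation : public Expression {
 public:
  explicit Negation(Nonnull<const Expression*> operand)
      : Expression(ExpressionKind::Negation), operand_(operand) {}
  auto operand() const -> Nonnull<const Expression*> { return operand_; }

 private:
  Nonnull<const Expression*> operand_;
};

// Dispatches on `kind()` and down-casts in each case. There is deliberately
// no `default` case, so a compiler warning such as -Wswitch can flag this
// switch when a new ExpressionKind is added.
auto Evaluate(Nonnull<const Expression*> e) -> int {
  switch (e->kind()) {
    case ExpressionKind::IntLiteral:
      return static_cast<const IntLiteral*>(e)->value();
    case ExpressionKind::Negation:
      return -Evaluate(static_cast<const Negation*>(e)->operand());
  }
  assert(false && "invalid ExpressionKind");
  return 0;
}
```

With these definitions, `arena.New<Negation>(arena.New<IntLiteral>(5))` yields a node that `Evaluate` maps to `-5`, and both nodes live as long as `arena` does.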

+ 34 - 0
executable_semantics/ast/README.md

@@ -0,0 +1,34 @@
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+The code in this directory defines the AST that represents Carbon code in the
+rest of `executable_semantics`.
+
+All node types in the AST are derived from [`AstNode`](ast_node.h), and use
+[LLVM-style RTTI](https://llvm.org/docs/HowToSetUpLLVMStyleRTTI.html) to support
+safe down-casting and similar operations. Each abstract class `Foo` in the
+hierarchy has a `kind` method which returns an enum `FooKind` that identifies the
+concrete type of the object, and a `FooKind` value can be safely `static_cast`ed
+to `BarKind` if that value represents a type that's derived from both `Foo` and
+`Bar`.
+
+We rely on code generation to help enforce those invariants, so every node type
+must be described in [`ast_rtti.txt`](ast_rtti.txt). See the documentation in
+[`gen_rtti.py`](../gen_rtti.py), the code generation script, for details about
+the file format and generated code.
+
+The AST class hierarchy is structured in a fairly unsurprising way, with
+abstract classes such as `Statement` and `Expression`, and concrete classes
+representing individual syntactic constructs, such as `If` for if-statements.
+
+Sometimes it is useful to work with a subset of node types that "cuts across"
+the primary class hierarchy. Rather than deal with the pitfalls of multiple
+inheritance, we handle these cases using a form of type erasure: we specify a
+notional interface that those types conform to, and then define a "view" class
+that behaves like a pointer to an instance of that interface. Types declare that
+they model an interface `Foo` by defining a public static member named
+`ImplementsCarbonFoo`. See [NamedEntityView](static_scope.h) for an example of
+this pattern.
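
As an illustration outside the diff itself, the "view" pattern described at the end of `ast/README.md` above could be sketched as follows. This is not the actual `NamedEntityView`: the interface name `Named`, the node classes, and the choice of a `static constexpr bool` as the marker member are all assumptions made for the example.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <type_traits>
#include <utility>

// Two hypothetical, otherwise unrelated node types that both have a name.
// Each opts in to the notional `Named` interface with a static marker.
class FunctionDeclaration {
 public:
  static constexpr bool ImplementsCarbonNamed = true;
  explicit FunctionDeclaration(std::string name) : name_(std::move(name)) {}
  auto name() const -> const std::string& { return name_; }

 private:
  std::string name_;
};

class VariableBinding {
 public:
  static constexpr bool ImplementsCarbonNamed = true;
  explicit VariableBinding(std::string name) : name_(std::move(name)) {}
  auto name() const -> const std::string& { return name_; }

 private:
  std::string name_;
};

// A node type that does not model `Named`: it has no marker member.
class IntLiteral {};

// Behaves like a pointer to "some node that models Named". The concrete type
// is erased; only the operations of the notional interface remain.
class NamedView {
 public:
  // Only participates in overload resolution for types carrying the marker.
  template <typename NodeType,
            typename = std::enable_if_t<NodeType::ImplementsCarbonNamed>>
  NamedView(const NodeType* node)
      : name_([node]() -> const std::string& { return node->name(); }) {}

  auto name() const -> const std::string& { return name_(); }

 private:
  std::function<const std::string&()> name_;
};
```

A `NamedView` can be built from a `FunctionDeclaration*` or a `VariableBinding*` without the two classes sharing a base, while an attempt to build one from an `IntLiteral*` is rejected at compile time.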

+ 108 - 0
executable_semantics/interpreter/README.md

@@ -0,0 +1,108 @@
+# `executable_semantics` execution
+
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+The code in this directory defines all phases of program execution after
+parsing, including typechecking, name resolution, and execution. The overall
+flow can be found in [`ExecProgram`](exec_program.cpp), which executes each
+phase in sequence.
+
+## The Carbon abstract machine
+
+Execution is specified in terms of an abstract machine, which executes minimal
+program steps in a loop until the program terminates. The state of the Carbon
+program (including the stack as well as the heap) is represented explicitly in
+C++ data structures, rather than implicitly in the C++ call stack. The
+[`Interpreter`](interpreter.cpp) class represents an instance of the abstract
+machine, and is responsible for maintaining those data structures and
+implementing the steps of abstract machine execution.
+
+The control-flow state of the abstract machine is encapsulated in an
+[`ActionStack`](action_stack.h) object, which represents a stack of
+[`Action`s](action.h). An `Action` represents a self-contained computation (such
+as evaluation of an expression or execution of a statement) as a state machine,
+and the abstract machine proceeds by repeatedly executing the next state
+transition of the `Action` at the top of the stack. Executing a step may modify
+the internal state of the `Action`, and may also modify the `Action` stack, for
+example by pushing a new `Action` onto it. When an `Action` is done executing,
+it can optionally produce a value as its result, which is made available to the
+`Action` below it on the stack.
+
+Carbon values are represented as [`Value`](value.h) objects, both at compile
+time and at run time. Note that in Carbon, a type is a kind of value, so types
+are represented as `Value`s. More subtly, `Value` can also represent information
+that isn't a true Carbon value, but needs to be propagated through channels that
+use `Value`. Most notably, certain kinds of `Value` are used to represent the
+result of "evaluating" a `Pattern`, which evaluates all the subexpressions
+nested within it, while preserving the structure of the non-expression parts for
+use in pattern matching.
+
+`Value`s are always immutable. The abstract machine's mutable memory is
+represented using the [`Heap`](heap.h) class, which is essentially a mapping of
+[`Address`es](address.h) to `Value`s.
+
+### Example
+
+To evaluate the expression `((1 + 2) + 3)`, the interpreter starts by pushing an
+`Action` onto the stack that corresponds to the whole expression:
+
+`((1 + 2) + 3)`[0]{} :: ...
+
+In this notation, we're expressing the stack as a sequence of `Action`s
+separated by `::`, with the top at the left, and representing each `Action` as
+the expression it evaluates, followed by its state. The state of an `Action`
+currently consists of:
+
+-   An integer `pos`, which is initially 0 and usually counts the number of
+    steps executed. Here that's denoted with a number in square brackets.
+-   A vector `results`, which collects the results of any sub-`Action`s spawned
+    by the `Action`. Here that's denoted by a list in curly braces.
+
+Then the interpreter proceeds by repeatedly taking the next step of the `Action`
+at the top of the stack. For expression `Action`s, `pos` typically identifies
+the operand that the next step should begin evaluation of. In this case, that
+operand is the expression `(1 + 2)`, so we push a new `Action` onto the stack,
+and increment `pos` on the old one:
+
+`(1 + 2)`[0]{} :: `((1 + 2) + 3)`[1]{} :: ...
+
+The next step spawns an action to evaluate `1`:
+
+`1`[0]{} :: `(1 + 2)`[1]{} :: `((1 + 2) + 3)`[1]{} :: ...
+
+That expression can be fully evaluated in a single step, so the next step
+evaluates it, appends the result to the next `Action` down the stack, and pops
+the now-completed `Action` off the stack:
+
+`(1 + 2)`[1]{1} :: `((1 + 2) + 3)`[1]{} :: ...
+
+The top `Action`'s `pos` is 1, so the next step begins evaluation of the second
+operand:
+
+`2`[0]{} :: `(1 + 2)`[2]{1} :: `((1 + 2) + 3)`[1]{} :: ...
+
+Which again can be evaluated immediately:
+
+`(1 + 2)`[2]{1, 2} :: `((1 + 2) + 3)`[1]{} :: ...
+
+This expression has two operands, so now that `pos` is 2, all operands have been
+evaluated, and their results are in the corresponding entries of `results`.
+Thus, the next step can compute the expression value, passing it down to the
+parent `Action` and popping the completed action as before:
+
+`((1 + 2) + 3)`[1]{3} :: ...
+
+Evaluation now proceeds to the second operand:
+
+`3`[0]{} :: `((1 + 2) + 3)`[2]{3} :: ...
+
+Which, again, can be evaluated immediately:
+
+`((1 + 2) + 3)`[2]{3, 3} :: ...
+
+`pos` now indicates that all subexpressions have been evaluated, so the next
+step computes the final result of 6, and passes it down the stack.
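
As an illustration outside the diff itself, the walkthrough in `interpreter/README.md` above can be imitated with a toy abstract machine. This is a minimal sketch under stated assumptions, not the real `Interpreter` or `ActionStack`: `Expr`, `Lit`, `Add`, `Step`, and `Evaluate` are hypothetical names, and only integer literals and binary `+` are supported. Each `Action` carries the `pos` counter and `results` vector described in the text.

```cpp
#include <cassert>
#include <memory>
#include <utility>
#include <vector>

// Hypothetical expression tree: an integer literal or a binary `+`.
struct Expr {
  bool is_literal = true;
  int literal = 0;
  std::unique_ptr<Expr> lhs;
  std::unique_ptr<Expr> rhs;
};

auto Lit(int n) -> std::unique_ptr<Expr> {
  auto e = std::make_unique<Expr>();
  e->literal = n;
  return e;
}

auto Add(std::unique_ptr<Expr> l, std::unique_ptr<Expr> r)
    -> std::unique_ptr<Expr> {
  auto e = std::make_unique<Expr>();
  e->is_literal = false;
  e->lhs = std::move(l);
  e->rhs = std::move(r);
  return e;
}

// One in-progress computation: the expression, how far it has got, and the
// results of the sub-actions it has spawned so far.
struct Action {
  explicit Action(const Expr* e) : expr(e) {}
  const Expr* expr;
  int pos = 0;
  std::vector<int> results;
};

// Executes one state transition of the Action at the top of the stack.
// Returns true, and sets `final_result`, once the last Action completes.
auto Step(std::vector<Action>& stack, int& final_result) -> bool {
  Action& top = stack.back();
  const Expr* expr = top.expr;
  int value = 0;
  if (expr->is_literal) {
    // A literal is fully evaluated in a single step.
    value = expr->literal;
  } else if (top.pos < 2) {
    // `pos` identifies the operand to start evaluating next.
    const Expr* operand = (top.pos == 0) ? expr->lhs.get() : expr->rhs.get();
    ++top.pos;
    stack.push_back(Action(operand));  // `top` is invalid after this line.
    return false;
  } else {
    // Both operands are done; their values are in `results`.
    value = top.results[0] + top.results[1];
  }
  // The Action is complete: pop it and hand its value to the Action below.
  stack.pop_back();
  if (stack.empty()) {
    final_result = value;
    return true;
  }
  stack.back().results.push_back(value);
  return false;
}

// Runs the machine to completion on a single expression.
auto Evaluate(const Expr& root) -> int {
  std::vector<Action> stack;
  stack.emplace_back(&root);
  int result = 0;
  while (!Step(stack, result)) {
  }
  return result;
}
```

Evaluating `Add(Add(Lit(1), Lit(2)), Lit(3))` with this sketch passes through the same stack shapes as the walkthrough and produces 6. Keeping the control state in an explicit `std::vector<Action>` rather than in the C++ call stack is the property the README attributes to the abstract machine.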

+ 17 - 0
executable_semantics/syntax/README.md

@@ -0,0 +1,17 @@
+<!--
+Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+Exceptions. See /LICENSE for license information.
+SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+-->
+
+The code in this directory is responsible for translating Carbon source code to
+the AST defined in [`ast`](../ast/). It consists primarily of a Flex lexer
+defined in [`lexer.lpp`](lexer.lpp) and a Bison grammar defined in
+[`parser.ypp`](parser.ypp).
+
+It is possible to define and test a new expression syntax without defining its
+semantics by using the `UnimplementedExpression` AST node type, and the same
+techniques can be applied to other kinds of AST nodes as needed. See the
+handling of the `UNIMPL_EXAMPLE` token for an example of how this is done, and
+see [`unimplemented_example_test.cpp`](unimplemented_example_test.cpp) for an
+example of how to test it.