
Merge parser library from the toolchain repository. (#214)

Only change is to update the path to the fuzzer build extension.

Original main commit message:

> Add an initial parser library. (#30)
>
> This library builds a parse tree, very similar to a concrete syntax
> tree. There are no semantics here, simply introducing the basic
> syntactic structure.
>
> The current focus has been on the APIs and the data structures used to
> represent the parse tree, and not on the actual code doing the
> parsing. The parsing code aims to be a reasonably efficient and
> reasonably easy to understand recursive descent parser. But there
> is likely much that can be done to improve this code path. A notable
> area where very little thought has been given yet is emitting good
> diagnostics and doing good recovery in the event of parse errors.
>
> Also, this code does not try to match the current under-discussion
> grammar closely. It is only partial and reflects discussions from some
> time ago. It should be updated incrementally to reflect the current
> expected grammar.
>
> The data structure used for the parse tree is unusual. The first
> constraint is that there is a precise one-to-one correspondence
> between the tokens produced by the lexer and the nodes in the parse
> tree. Every token results in exactly one node. In that way, the parse
> tree can be thought of as merely shaping the token stream into a tree.
>
> Each node is also represented with a fixed set of data that is densely
> packed. Combined with the exact relationship to tokens, this allows us
> to fully allocate the parse tree's storage, and to use a dense array
> rather than a pointer-based tree structure.
>
> The tree structure itself is implicitly defined by tracking the size
> of each subtree rooted at a particular node. See the code comments for
> more details (and I'm happy to add more comments where necessary). The
> goal is to minimize the number of allocations (one) and the working
> set size of the tree as a whole, and to optimize common iteration
> patterns. The tree is stored in postorder. This allows depth-first
> postorder iteration as well as topological iteration by walking in
> reverse.
>
> Building the parse tree in postorder is a natural consequence of the
> grammar being LR rather than LL, which is a consequence of supporting
> infix operators.
>
> As with the Lexer, the parser supports an API for operating on the
> parse tree, as well as the ability to print the tree in both
> a human-readable and machine-readable format (YAML-based). It includes
> significant unit tests and a fuzz tester. The fuzzer's corpus will be
> in a follow-up commit.
>
> This is the largest chunk of code already written by several of us
> prior to open sourcing. (There are a few more pieces, but they are
> significantly smaller and less interesting.) If there are major things
> that folks would like to see happen here, it may make sense to move
> them into issues for tracking. I have tried to update the code to
> follow the style guidelines, but apologies if I missed anything, just
> let me know. We also have issues #19 and #29 to track things that
> already came up with the lexer.

Co-authored-by: Jon Meow <46229924+jonmeow@users.noreply.github.com>
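
The implicit tree structure described in the commit message (a dense postorder array where each node records only the size of its subtree) can be sketched roughly as follows. This is a minimal illustration, not the library's code: `FlatNode`, `SubtreeRange`, `Children`, `Roots`, and `MakeExampleTree` are hypothetical names, and the real node record also carries a kind, a token, and an error flag.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical stand-in for the per-node record. `subtree_size` counts the
// node itself, so a leaf has size 1.
struct FlatNode {
  std::string text;
  int subtree_size;
};

// Half-open index range [first, last) covering `n` and all of its
// descendants. Postorder places a node directly after its whole subtree.
inline auto SubtreeRange(const std::vector<FlatNode>& nodes, int n)
    -> std::pair<int, int> {
  int last = n + 1;
  return {last - nodes[n].subtree_size, last};
}

// Direct children of `n`, last child first (reverse postorder). The last
// child sits immediately before its parent; each earlier sibling is found
// by hopping back over the current child's entire subtree.
inline auto Children(const std::vector<FlatNode>& nodes, int n)
    -> std::vector<int> {
  std::vector<int> result;
  int stop = n - nodes[n].subtree_size;  // Index just before the subtree.
  for (int child = n - 1; child != stop; child -= nodes[child].subtree_size) {
    assert(child > stop && "Child subtree extends past its parent's subtree");
    result.push_back(child);
  }
  return result;
}

// Top-level nodes, found the same way by treating the end of the array as
// an imaginary parent.
inline auto Roots(const std::vector<FlatNode>& nodes) -> std::vector<int> {
  std::vector<int> result;
  for (int root = static_cast<int>(nodes.size()) - 1; root >= 0;
       root -= nodes[root].subtree_size) {
    result.push_back(root);
  }
  return result;
}

// Example tree in postorder: `e` is the root with children `b` and `d`;
// `a` is the child of `b` and `c` is the child of `d`.
inline auto MakeExampleTree() -> std::vector<FlatNode> {
  return {{"a", 1}, {"b", 2}, {"c", 1}, {"d", 2}, {"e", 5}};
}
```

For the example tree, `Children(nodes, 4)` yields indices 3 and 1 (`d`, then `b`), and `SubtreeRange(nodes, 3)` covers indices 2 and 3. No parent or child pointers are stored; walking the array backwards visits every parent before its children.
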
Chandler Carruth, 5 years ago
Parent
Current commit
3512c2218f

+ 86 - 0
parser/BUILD

@@ -0,0 +1,86 @@
+# Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+# Exceptions. See /LICENSE for license information.
+# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+load("@rules_cc//cc:defs.bzl", "cc_library", "cc_test")
+load("//bazel/fuzzing:rules.bzl", "cc_fuzz_test")
+
+package(default_visibility = ["//visibility:public"])
+
+cc_library(
+    name = "parse_node_kind",
+    srcs = ["parse_node_kind.cpp"],
+    hdrs = ["parse_node_kind.h"],
+    textual_hdrs = ["parse_node_kind.def"],
+    deps = ["@llvm-project//llvm:Support"],
+)
+
+cc_test(
+    name = "parse_node_kind_test",
+    srcs = ["parse_node_kind_test.cpp"],
+    deps = [
+        ":parse_node_kind",
+        "@llvm-project//llvm:Support",
+        "@llvm-project//llvm:gtest",
+        "@llvm-project//llvm:gtest_main",
+    ],
+)
+
+cc_library(
+    name = "parse_tree",
+    srcs = [
+        "parse_tree.cpp",
+        "parser_impl.cpp",
+        "parser_impl.h",
+    ],
+    hdrs = ["parse_tree.h"],
+    deps = [
+        ":parse_node_kind",
+        "@llvm-project//llvm:Support",
+        "//diagnostics:diagnostic_emitter",
+        "//lexer:token_kind",
+        "//lexer:tokenized_buffer",
+    ],
+)
+
+cc_library(
+    name = "parse_test_helpers",
+    testonly = 1,
+    hdrs = ["parse_test_helpers.h"],
+    deps = [
+        ":parse_node_kind",
+        ":parse_tree",
+        "@llvm-project//llvm:Support",
+        "@llvm-project//llvm:gmock",
+        "//lexer:tokenized_buffer",
+    ],
+)
+
+cc_test(
+    name = "parse_tree_test",
+    srcs = ["parse_tree_test.cpp"],
+    deps = [
+        ":parse_node_kind",
+        ":parse_test_helpers",
+        ":parse_tree",
+        "@llvm-project//llvm:Support",
+        "@llvm-project//llvm:gmock",
+        "@llvm-project//llvm:gtest",
+        "@llvm-project//llvm:gtest_main",
+        "//diagnostics:diagnostic_emitter",
+        "//lexer:tokenized_buffer",
+        "//lexer:tokenized_buffer_test_helpers",
+    ],
+)
+
+cc_fuzz_test(
+    name = "parse_tree_fuzzer",
+    srcs = ["parse_tree_fuzzer.cpp"],
+    corpus = glob(["fuzzer_corpus/*"]),
+    deps = [
+        ":parse_tree",
+        "@llvm-project//llvm:Support",
+        "//diagnostics:diagnostic_emitter",
+        "//lexer:tokenized_buffer",
+    ],
+)

+ 0 - 0
parser/fuzzer_corpus/empty


+ 19 - 0
parser/parse_node_kind.cpp

@@ -0,0 +1,19 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include "parser/parse_node_kind.h"
+
+#include "llvm/ADT/StringRef.h"
+
+namespace Carbon {
+
+auto ParseNodeKind::GetName() const -> llvm::StringRef {
+  static constexpr llvm::StringLiteral Names[] = {
+#define CARBON_PARSE_NODE_KIND(Name) #Name,
+#include "parser/parse_node_kind.def"
+  };
+  return Names[static_cast<int>(kind)];
+}
+
+}  // namespace Carbon

+ 25 - 0
parser/parse_node_kind.def

@@ -0,0 +1,25 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+//
+// Note that this is an X-macro header.
+//
+// It does not use `#include` guards, and instead is designed to be `#include`ed
+// after the x-macro is defined in order for its inclusion to expand to the
+// desired output. The x-macro for this header is `CARBON_PARSE_NODE_KIND`. The
+// definition provided will be removed at the end of this file to clean up.
+
+#ifndef CARBON_PARSE_NODE_KIND
+#error "Must define the x-macro to use this file."
+#endif
+
+CARBON_PARSE_NODE_KIND(CodeBlockEnd)
+CARBON_PARSE_NODE_KIND(CodeBlock)
+CARBON_PARSE_NODE_KIND(DeclarationEnd)
+CARBON_PARSE_NODE_KIND(EmptyDeclaration)
+CARBON_PARSE_NODE_KIND(FunctionDeclaration)
+CARBON_PARSE_NODE_KIND(Identifier)
+CARBON_PARSE_NODE_KIND(ParameterListEnd)
+CARBON_PARSE_NODE_KIND(ParameterList)
+
+#undef CARBON_PARSE_NODE_KIND
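
The X-macro technique this `.def` file relies on can be shown in a single file. The sketch below is an assumption-laden variant: instead of re-including a separate file, it passes the per-entry macro as an argument to a list macro so that the example is self-contained. `EXAMPLE_NODE_KINDS`, `ExampleKind`, and `GetExampleName` are hypothetical names that do not appear in the commit.

```cpp
#include <cassert>
#include <cstdint>
#include <string_view>

// Hypothetical single-file stand-in for a `.def` file: the list of kinds is
// written once and expanded with whatever macro `X` the caller supplies.
#define EXAMPLE_NODE_KINDS(X) \
  X(CodeBlock)                \
  X(FunctionDeclaration)      \
  X(Identifier)

// First expansion: one enumerator per entry.
enum class ExampleKind : std::uint8_t {
#define EXAMPLE_ENUMERATOR(Name) Name,
  EXAMPLE_NODE_KINDS(EXAMPLE_ENUMERATOR)
#undef EXAMPLE_ENUMERATOR
};

// Second expansion: one string per entry, in the same order, so the
// enumerator value can index straight into the table.
inline auto GetExampleName(ExampleKind kind) -> std::string_view {
  static constexpr std::string_view names[] = {
#define EXAMPLE_NAME(Name) #Name,
      EXAMPLE_NODE_KINDS(EXAMPLE_NAME)
#undef EXAMPLE_NAME
  };
  return names[static_cast<int>(kind)];
}
```

Adding a kind means touching only the list; the enumerator and its printable name stay in sync automatically. The file-based form in this commit achieves the same effect by defining `CARBON_PARSE_NODE_KIND` before each `#include` of the `.def` file.
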

+ 69 - 0
parser/parse_node_kind.h

@@ -0,0 +1,69 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef PARSER_PARSE_NODE_KIND_H_
+#define PARSER_PARSE_NODE_KIND_H_
+
+#include <cstdint>
+#include <iterator>
+
+#include "llvm/ADT/StringRef.h"
+
+namespace Carbon {
+
+// A class wrapping an enumeration of the different kinds of nodes in the parse
+// tree.
+//
+// Rather than using a raw enumerator for each distinct kind of node produced by
+// the parser, we wrap the enumerator in a class to expose a richer API
+// including bidirectional mappings to string spellings of the different kinds
+// and any relevant classification.
+//
+// Instances of this type should always be created using the `constexpr` static
+// member functions. These instances are designed specifically to be usable in
+// `case` labels of `switch` statements just like an enumerator would.
+class ParseNodeKind {
+ public:
+  // The formatting for this macro is weird due to a `clang-format` bug. See
+  // https://bugs.llvm.org/show_bug.cgi?id=48320 for details.
+#define CARBON_PARSE_NODE_KIND(Name) \
+  static constexpr auto Name()->ParseNodeKind { return KindEnum::Name; }
+#include "parser/parse_node_kind.def"
+
+  // The default constructor is deleted as objects of this type should always be
+  // constructed using the above factory functions for each unique kind.
+  ParseNodeKind() = delete;
+
+  auto operator==(const ParseNodeKind& rhs) const -> bool {
+    return kind == rhs.kind;
+  }
+  auto operator!=(const ParseNodeKind& rhs) const -> bool {
+    return kind != rhs.kind;
+  }
+
+  // Gets a friendly name for the node kind for logging or debugging.
+  [[nodiscard]] auto GetName() const -> llvm::StringRef;
+
+ private:
+  enum class KindEnum : uint8_t {
+#define CARBON_PARSE_NODE_KIND(Name) Name,
+#include "parser/parse_node_kind.def"
+  };
+
+  constexpr ParseNodeKind(KindEnum k) : kind(k) {}
+
+  // Enable conversion to our private enum, including in a `constexpr` context,
+  // to enable usage in `switch` and `case`. The enum remains private and
+  // nothing else should be using this.
+  explicit constexpr operator KindEnum() const { return kind; }
+
+  KindEnum kind;
+};
+
+// We expect the parse node kind to fit compactly into 8 bits.
+static_assert(sizeof(ParseNodeKind) == 1, "Kind objects include padding!");
+
+}  // namespace Carbon
+
+#endif  // PARSER_PARSE_NODE_KIND_H_
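
The class comment above says kind objects are meant to work in `switch` statements and `case` labels. A minimal sketch of that pattern follows, using a hypothetical `ExampleKind` class with two made-up kinds. One assumption is stated loudly: C++ only considers non-explicit conversion functions when converting a class-type `switch` condition, and the conversion must be accessible to the caller, so this sketch declares the conversion operator public and non-explicit, unlike the private, explicit declaration in the header.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical wrapper, modeled loosely on the header above.
class ExampleKind {
  // The enum itself stays private; callers only see the factory functions.
  enum class KindEnum : std::uint8_t { Leaf, Branch };

 public:
  static constexpr auto Leaf() -> ExampleKind {
    return ExampleKind(KindEnum::Leaf);
  }
  static constexpr auto Branch() -> ExampleKind {
    return ExampleKind(KindEnum::Branch);
  }

  // Assumption for this sketch: public and non-explicit, so that both the
  // `switch` condition and the `case` labels can convert to the enum.
  constexpr operator KindEnum() const { return kind_; }

 private:
  constexpr explicit ExampleKind(KindEnum kind) : kind_(kind) {}

  KindEnum kind_;
};

static_assert(sizeof(ExampleKind) == 1, "Kind objects include padding!");

// Each `case` label is a constant expression produced by a `constexpr`
// factory function and then converted to the private enum.
inline auto HasChildren(ExampleKind kind) -> bool {
  switch (kind) {
    case ExampleKind::Leaf():
      return false;
    case ExampleKind::Branch():
      return true;
  }
  return false;
}
```

Callers never name the private enum, yet the compiler can still check `switch` coverage against its enumerators.
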

+ 25 - 0
parser/parse_node_kind_test.cpp

@@ -0,0 +1,25 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include "parser/parse_node_kind.h"
+
+#include <cstring>
+
+#include "gtest/gtest.h"
+#include "llvm/ADT/StringRef.h"
+
+namespace Carbon {
+
+namespace {
+
+// Not much to test here, so just verify that the API compiles and returns the
+// data in the `.def` file.
+#define CARBON_PARSE_NODE_KIND(Name)                   \
+  TEST(ParseNodeKindTest, Name) {                      \
+    EXPECT_EQ(#Name, ParseNodeKind::Name().GetName()); \
+  }
+#include "parser/parse_node_kind.def"
+
+}  // namespace
+}  // namespace Carbon

+ 262 - 0
parser/parse_test_helpers.h

@@ -0,0 +1,262 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef PARSER_PARSE_TEST_HELPERS_H_
+#define PARSER_PARSE_TEST_HELPERS_H_
+
+#include <ostream>
+#include <string>
+#include <vector>
+
+#include "gmock/gmock.h"
+#include "lexer/tokenized_buffer.h"
+#include "llvm/ADT/STLExtras.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "parser/parse_node_kind.h"
+#include "parser/parse_tree.h"
+
+namespace Carbon {
+
+// Enable printing a parse tree from Google Mock.
+inline void PrintTo(const ParseTree& tree, std::ostream* output) {
+  std::string text;
+  llvm::raw_string_ostream text_stream(text);
+  tree.Print(text_stream);
+  *output << "\n" << text_stream.str() << "\n";
+}
+
+namespace Testing {
+
+// An aggregate used to describe an expected parse tree.
+//
+// This type is designed to be used via aggregate initialization with designated
+// initializers. The latter make it easy to default everything and then override
+// the desired aspects when writing an expectation in a test.
+struct ExpectedNode {
+  ParseNodeKind kind = ParseNodeKind::EmptyDeclaration();
+  std::string text;
+  bool has_error = false;
+  bool skip_subtree = false;
+  std::vector<ExpectedNode> children;
+};
+
+// Implementation of a matcher for a parse tree based on a tree of expected
+// nodes.
+//
+// Don't create this directly, instead use `MatchParseTreeNodes` to construct a
+// matcher based on this.
+class ExpectedNodesMatcher
+    : public ::testing::MatcherInterface<const ParseTree&> {
+ public:
+  ExpectedNodesMatcher(llvm::SmallVector<ExpectedNode, 0> nodes)
+      : expected_nodes(std::move(nodes)) {}
+
+  auto MatchAndExplain(const ParseTree& tree,
+                       ::testing::MatchResultListener* output_ptr) const
+      -> bool override;
+  auto DescribeTo(std::ostream* output_ptr) const -> void override;
+
+ private:
+  auto MatchExpectedNode(const ParseTree& tree, ParseTree::Node n,
+                         int postorder_index, ExpectedNode expected_node,
+                         ::testing::MatchResultListener& output) const -> bool;
+
+  llvm::SmallVector<ExpectedNode, 0> expected_nodes;
+};
+
+// Implementation of the Google Mock interface for matching (and explaining any
+// failure).
+inline auto ExpectedNodesMatcher::MatchAndExplain(
+    const ParseTree& tree, ::testing::MatchResultListener* output_ptr) const
+    -> bool {
+  auto& output = *output_ptr;
+  bool matches = true;
+  const auto rpo = llvm::reverse(tree.Postorder());
+  const auto nodes_begin = rpo.begin();
+  const auto nodes_end = rpo.end();
+  auto nodes_it = nodes_begin;
+  llvm::SmallVector<const ExpectedNode*, 16> expected_node_stack;
+  for (const ExpectedNode& en : expected_nodes)
+    expected_node_stack.push_back(&en);
+  while (!expected_node_stack.empty()) {
+    if (nodes_it == nodes_end)
+      // We'll check the size outside the loop.
+      break;
+
+    ParseTree::Node n = *nodes_it++;
+    int postorder_index = n.GetIndex();
+
+    const ExpectedNode& expected_node = *expected_node_stack.pop_back_val();
+
+    if (!MatchExpectedNode(tree, n, postorder_index, expected_node, output))
+      matches = false;
+
+    if (expected_node.skip_subtree) {
+      assert(expected_node.children.empty() &&
+             "Must not skip an expected subtree while specifying expected "
+             "children!");
+      nodes_it = llvm::reverse(tree.Postorder(n)).end();
+      continue;
+    }
+
+    // We want to make sure we don't end up with unsynchronized walks, so skip
+    // ahead in the tree to ensure that the number of children of this node and
+    // the expected number of children match.
+    int num_children =
+        std::distance(tree.Children(n).begin(), tree.Children(n).end());
+    if (num_children != static_cast<int>(expected_node.children.size())) {
+      output
+          << "\nParse node (postorder index #" << postorder_index << ") has "
+          << num_children << " children, expected "
+          << expected_node.children.size()
+          << ". Skipping this subtree to avoid any unsynchronized tree walk.";
+      matches = false;
+      nodes_it = llvm::reverse(tree.Postorder(n)).end();
+      continue;
+    }
+
+    // Push the children onto the stack to continue matching. The expectation
+    // is in preorder, but we visit the parse tree in reverse postorder. This
+    // causes the siblings to be visited in reverse order from the expected
+    // list. However, we use a stack which inherently does this reverse for us
+    // so we simply append to the stack here.
+    for (const ExpectedNode& child_expected_node : expected_node.children)
+      expected_node_stack.push_back(&child_expected_node);
+  }
+
+  // We don't directly check the size because we allow expectations to skip
+  // subtrees. Instead, we need to check that we successfully processed all of
+  // the actual tree and consumed all of the expected tree.
+  if (nodes_it != nodes_end) {
+    assert(expected_node_stack.empty() &&
+           "If we have unmatched nodes in the input tree, should only finish "
+           "having fully processed expected tree.");
+    output << "\nFinished processing expected nodes and there are still "
+           << (nodes_end - nodes_it) << " unexpected nodes.";
+    matches = false;
+  } else if (!expected_node_stack.empty()) {
+    output << "\nProcessed all " << (nodes_end - nodes_begin)
+           << " nodes and still have " << expected_node_stack.size()
+           << " expected nodes that were unmatched.";
+    matches = false;
+  }
+
+  return matches;
+}
+
+// Implementation of the Google Mock interface for describing the expected node
+// tree.
+//
+// This is designed to describe the expected tree node structure in as similar
+// of a format to the parse tree's print format as is reasonable. There is both
+// more and less information, so it won't be exact, but should be close enough
+// to make it easy to visually compare the two.
+inline auto ExpectedNodesMatcher::DescribeTo(std::ostream* output_ptr) const
+    -> void {
+  auto& output = *output_ptr;
+  output << "Matches expected node pattern:\n[\n";
+
+  // We want to walk these in RPO instead of in preorder to match the printing
+  // of the actual parse tree.
+  llvm::SmallVector<std::pair<const ExpectedNode*, int>, 16>
+      expected_node_stack;
+  for (const ExpectedNode& expected_node : llvm::reverse(expected_nodes))
+    expected_node_stack.push_back({&expected_node, 0});
+
+  while (!expected_node_stack.empty()) {
+    const ExpectedNode& expected_node = *expected_node_stack.back().first;
+    int depth = expected_node_stack.back().second;
+    expected_node_stack.pop_back();
+    for (int indent_count = 0; indent_count < depth; ++indent_count)
+      output << "  ";
+    output << "{kind: '" << expected_node.kind.GetName().str() << "'";
+    if (!expected_node.text.empty())
+      output << ", text: '" << expected_node.text << "'";
+    if (expected_node.has_error)
+      output << ", has_error: yes";
+    if (expected_node.skip_subtree)
+      output << ", skip_subtree: yes";
+
+    if (!expected_node.children.empty()) {
+      assert(!expected_node.skip_subtree &&
+             "Must not have children and skip a subtree!");
+      output << ", children: [\n";
+      for (const ExpectedNode& child_expected_node :
+           llvm::reverse(expected_node.children))
+        expected_node_stack.push_back({&child_expected_node, depth + 1});
+      // If we have children, we know we're not popping off.
+      continue;
+    }
+
+    // If this is some form of leaf we'll at least need to close it. It may also
+    // be the last sibling of its parent, and we'll need to close any parents as
+    // we pop up.
+    output << "}";
+    if (!expected_node_stack.empty()) {
+      assert(depth >= expected_node_stack.back().second &&
+             "Cannot have an increase in depth on a leaf node!");
+      // The distance we need to pop is the difference in depth.
+      int pop_depth = depth - expected_node_stack.back().second;
+      for (int pop_count = 0; pop_count < pop_depth; ++pop_count)
+        // Close both the children array and the node mapping.
+        output << "]}";
+    }
+    output << "\n";
+  }
+  output << "]\n";
+}
+
+inline auto ExpectedNodesMatcher::MatchExpectedNode(
+    const ParseTree& tree, ParseTree::Node n, int postorder_index,
+    ExpectedNode expected_node, ::testing::MatchResultListener& output) const
+    -> bool {
+  bool matches = true;
+
+  ParseNodeKind kind = tree.GetNodeKind(n);
+  if (kind != expected_node.kind) {
+    output << "\nParse node (postorder index #" << postorder_index << ") is a "
+           << kind.GetName().str() << ", expected a "
+           << expected_node.kind.GetName().str() << ".";
+    matches = false;
+  }
+
+  if (tree.HasErrorInNode(n) != expected_node.has_error) {
+    output << "\nParse node (postorder index #" << postorder_index << ") "
+           << (tree.HasErrorInNode(n) ? "has an error"
+                                      : "does not have an error")
+           << ", expected that it "
+           << (expected_node.has_error ? "has an error"
+                                       : "does not have an error")
+           << ".";
+    matches = false;
+  }
+
+  llvm::StringRef node_text = tree.GetNodeText(n);
+  if (!expected_node.text.empty() && node_text != expected_node.text) {
+    output << "\nParse node (postorder index #" << postorder_index
+           << ") is spelled '" << node_text.str() << "', expected '"
+           << expected_node.text << "'.";
+    matches = false;
+  }
+
+  return matches;
+}
+
+// Creates a matcher for a parse tree using a tree of expected nodes.
+//
+// This is intended to be used with a braced initializer list style aggregate
+// initializer for an argument, allowing it to describe a tree structure via
+// nested `ExpectedNode` objects.
+inline auto MatchParseTreeNodes(
+    llvm::SmallVector<ExpectedNode, 0> expected_nodes)
+    -> ::testing::Matcher<const ParseTree&> {
+  return ::testing::MakeMatcher(
+      new ExpectedNodesMatcher(std::move(expected_nodes)));
+}
+
+}  // namespace Testing
+}  // namespace Carbon
+
+#endif  // PARSER_PARSE_TEST_HELPERS_H_

+ 194 - 0
parser/parse_tree.cpp

@@ -0,0 +1,194 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include "parser/parse_tree.h"
+
+#include <cstdlib>
+
+#include "lexer/token_kind.h"
+#include "llvm/ADT/ArrayRef.h"
+#include "llvm/ADT/Optional.h"
+#include "llvm/ADT/Sequence.h"
+#include "llvm/ADT/SmallSet.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/iterator.h"
+#include "llvm/Support/raw_ostream.h"
+#include "parser/parse_node_kind.h"
+#include "parser/parser_impl.h"
+
+namespace Carbon {
+
+auto ParseTree::Parse(TokenizedBuffer& tokens, DiagnosticEmitter& emitter)
+    -> ParseTree {
+  // Delegate to the parser.
+  return Parser::Parse(tokens, emitter);
+}
+
+auto ParseTree::Postorder() const -> llvm::iterator_range<PostorderIterator> {
+  return {PostorderIterator(Node(0)),
+          PostorderIterator(Node(node_impls.size()))};
+}
+
+auto ParseTree::Postorder(Node n) const
+    -> llvm::iterator_range<PostorderIterator> {
+  // The postorder ends after this node, the root, and begins at the start of
+  // its subtree.
+  int end_index = n.index + 1;
+  int start_index = end_index - node_impls[n.index].subtree_size;
+  return {PostorderIterator(Node(start_index)),
+          PostorderIterator(Node(end_index))};
+}
+
+auto ParseTree::Children(Node n) const
+    -> llvm::iterator_range<SiblingIterator> {
+  int end_index = n.index - node_impls[n.index].subtree_size;
+  return {SiblingIterator(*this, Node(n.index - 1)),
+          SiblingIterator(*this, Node(end_index))};
+}
+
+auto ParseTree::Roots() const -> llvm::iterator_range<SiblingIterator> {
+  return {SiblingIterator(*this, Node(static_cast<int>(node_impls.size()) - 1)),
+          SiblingIterator(*this, Node(-1))};
+}
+
+auto ParseTree::HasErrorInNode(Node n) const -> bool {
+  return node_impls[n.index].has_error;
+}
+
+auto ParseTree::GetNodeKind(Node n) const -> ParseNodeKind {
+  return node_impls[n.index].kind;
+}
+
+auto ParseTree::GetNodeToken(Node n) const -> TokenizedBuffer::Token {
+  return node_impls[n.index].token;
+}
+
+auto ParseTree::GetNodeText(Node n) const -> llvm::StringRef {
+  return tokens->GetTokenText(node_impls[n.index].token);
+}
+
+auto ParseTree::Print(llvm::raw_ostream& output) const -> void {
+  output << "[\n";
+  // The parse tree is stored in postorder, but the most natural order to
+  // visualize is preorder. This is a tree, so the preorder can be constructed
+  // by reversing the order of each level of siblings within an RPO. The sibling
+  // iterators are directly built around RPO and so can be used with a stack to
+  // produce preorder.
+
+  // The roots, like siblings, are in RPO (so reversed), but we add them in
+  // order here because we'll pop off the stack effectively reversing them.
+  llvm::SmallVector<std::pair<Node, int>, 16> node_stack;
+  for (Node n : Roots())
+    node_stack.push_back({n, 0});
+
+  while (!node_stack.empty()) {
+    Node n;
+    int depth;
+    std::tie(n, depth) = node_stack.pop_back_val();
+    auto& n_impl = node_impls[n.GetIndex()];
+
+    for (int unused_indent : llvm::seq(0, depth)) {
+      (void)unused_indent;
+      output << "  ";
+    }
+
+    output << "{node_index: " << n.index << ", kind: '" << n_impl.kind.GetName()
+           << "', text: '" << tokens->GetTokenText(n_impl.token) << "'";
+
+    if (n_impl.has_error)
+      output << ", has_error: yes";
+
+    if (n_impl.subtree_size > 1) {
+      output << ", subtree_size: " << n_impl.subtree_size;
+      // Has children, so we descend.
+      output << ", children: [\n";
+      // We append the children in order here as well because they will get
+      // reversed when popped off the stack.
+      for (Node sibling_n : Children(n))
+        node_stack.push_back({sibling_n, depth + 1});
+      continue;
+    }
+
+    // This node is finished, so close it up.
+    assert(n_impl.subtree_size == 1 &&
+           "Subtree size must always be a positive integer!");
+    output << "}";
+
+    int next_depth = node_stack.empty() ? 0 : node_stack.back().second;
+    assert(next_depth <= depth && "Cannot have the next depth increase!");
+    for (int close_children_count : llvm::seq(0, depth - next_depth)) {
+      (void)close_children_count;
+      output << "]}";
+    }
+
+    // We always end with a comma and a new line as we'll move to the next node
+    // at whatever the current level ends up being.
+    output << ",\n";
+  }
+  output << "]\n";
+}
+
+auto ParseTree::Verify() const -> bool {
+  // Verify basic tree structure invariants.
+  llvm::SmallVector<ParseTree::Node, 16> ancestors;
+  for (Node n : llvm::reverse(Postorder())) {
+    auto& n_impl = node_impls[n.GetIndex()];
+
+    if (n_impl.has_error && !has_errors) {
+      llvm::errs()
+          << "Node #" << n.GetIndex()
+          << " has errors, but the tree is not marked as having any.\n";
+      return false;
+    }
+
+    if (n_impl.subtree_size > 1) {
+      if (!ancestors.empty()) {
+        auto parent_n = ancestors.back();
+        auto& parent_n_impl = node_impls[parent_n.GetIndex()];
+        int end_index = n.GetIndex() - n_impl.subtree_size;
+        int parent_end_index = parent_n.GetIndex() - parent_n_impl.subtree_size;
+        if (parent_end_index > end_index) {
+          llvm::errs() << "Node #" << n.GetIndex() << " has a subtree size of "
+                       << n_impl.subtree_size
+                       << " which extends beyond its parent's (node #"
+                       << parent_n.GetIndex() << ") subtree (size "
+                       << parent_n_impl.subtree_size << ")\n";
+          return false;
+        }
+      }
+      // Has children, so we descend.
+      ancestors.push_back(n);
+      continue;
+    }
+
+    if (n_impl.subtree_size < 1) {
+      llvm::errs() << "Node #" << n.GetIndex()
+                   << " has an invalid subtree size of " << n_impl.subtree_size
+                   << "!\n";
+      return false;
+    }
+
+    // We're going to pop off some levels of the tree. Check each ancestor to
+    // make sure the offsets are correct.
+    int next_index = n.GetIndex() - 1;
+    while (!ancestors.empty()) {
+      ParseTree::Node parent_n = ancestors.back();
+      if ((parent_n.GetIndex() -
+           node_impls[parent_n.GetIndex()].subtree_size) != next_index)
+        break;
+      ancestors.pop_back();
+    }
+  }
+  if (!ancestors.empty()) {
+    llvm::errs()
+        << "Finished walking the parse tree and there are still ancestors:\n";
+    for (Node ancestor_n : ancestors)
+      llvm::errs() << "  Node #" << ancestor_n.GetIndex() << "\n";
+    return false;
+  }
+
+  return true;
+}
+
+}  // namespace Carbon

+ 356 - 0
parser/parse_tree.h

@@ -0,0 +1,356 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef PARSER_PARSE_TREE_H_
+#define PARSER_PARSE_TREE_H_
+
+#include <iterator>
+
+#include "diagnostics/diagnostic_emitter.h"
+#include "lexer/tokenized_buffer.h"
+#include "llvm/ADT/SmallVector.h"
+#include "llvm/ADT/StringRef.h"
+#include "llvm/ADT/iterator.h"
+#include "llvm/ADT/iterator_range.h"
+#include "parser/parse_node_kind.h"
+
+namespace Carbon {
+
+// A tree of parsed tokens based on the language grammar.
+//
+// This is a purely syntactic parse tree without any semantics yet attached. It
+// is based on the token stream and the grammar of the language without even
+// name lookup.
+//
+// The tree is designed to make depth-first traversal especially efficient, with
+// postorder and reverse postorder (RPO, a topological order) not even requiring
+// extra state.
+//
+// The nodes of the tree follow a flyweight pattern and are handles into the
+// tree. The tree itself must be available to query for information about those
+// nodes.
+//
+// Nodes also have a precise one-to-one correspondence to tokens from the parsed
+// token stream. Each node can be thought of as the tree-position of a
+// particular token from the stream.
+//
+// The tree is immutable once built, but is designed to support reasonably
+// efficient patterns that build a new tree with a specific transformation
+// applied.
+class ParseTree {
+ public:
+  class Node;
+  class PostorderIterator;
+  class SiblingIterator;
+
+  // Parses the token buffer into a `ParseTree`.
+  //
+  // This is the factory function which is used to build parse trees.
+  static auto Parse(TokenizedBuffer& tokens, DiagnosticEmitter& emitter)
+      -> ParseTree;
+
+  // Tests whether there are any errors in the parse tree.
+  [[nodiscard]] auto HasErrors() const -> bool { return has_errors; }
+
+  // Returns the number of nodes in this parse tree.
+  [[nodiscard]] auto Size() const -> int { return node_impls.size(); }
+
+  // Returns an iterable range over the parse tree nodes in depth-first
+  // postorder.
+  [[nodiscard]] auto Postorder() const
+      -> llvm::iterator_range<PostorderIterator>;
+
+  // Returns an iterable range over the parse tree node and all of its
+  // descendants in depth-first postorder.
+  [[nodiscard]] auto Postorder(Node n) const
+      -> llvm::iterator_range<PostorderIterator>;
+
+  // Returns an iterable range over the direct children of a node in the parse
+  // tree. This is a forward range, but is constant time to increment. The order
+  // of children is the same as would be found in a reverse postorder traversal.
+  [[nodiscard]] auto Children(Node n) const
+      -> llvm::iterator_range<SiblingIterator>;
+
+  // Returns an iterable range over the roots of the parse tree. This is a
+  // forward range, but is constant time to increment. The order of roots is the
+  // same as would be found in a reverse postorder traversal.
+  [[nodiscard]] auto Roots() const -> llvm::iterator_range<SiblingIterator>;
+
+  // Tests whether a particular node contains an error and may not match the
+  // full expected structure of the grammar.
+  [[nodiscard]] auto HasErrorInNode(Node n) const -> bool;
+
+  // Returns the kind of the given parse tree node.
+  [[nodiscard]] auto GetNodeKind(Node n) const -> ParseNodeKind;
+
+  // Returns the token the given parse tree node models.
+  [[nodiscard]] auto GetNodeToken(Node n) const -> TokenizedBuffer::Token;
+
+  // Returns the text backing the token for the given node.
+  //
+  // This is a convenience method for chaining from a node through its token to
+  // the underlying source text.
+  [[nodiscard]] auto GetNodeText(Node n) const -> llvm::StringRef;
+
+  // Prints a description of the parse tree to the provided `raw_ostream`.
+  //
+  // While the parse tree is represented as a postorder sequence, we print it in
+  // preorder to make it easier to visualize and read. The node indices are the
+  // postorder indices. The printout represents each node as a YAML record,
+  // with children nested within it.
+  //
+  // A single node without children is formatted as:
+  // ```
+  // {node_index: 0, kind: 'foo', text: '...'}
+  // ```
+  // A node with two children, one of them with an error:
+  // ```
+  // {node_index: 2, kind: 'foo', text: '...', children: [
+  //   {node_index: 0, kind: 'bar', text: '...', has_error: yes},
+  //   {node_index: 1, kind: 'baz', text: '...'}]}
+  // ```
+  // The top level is formatted as an array of these nodes.
+  // ```
+  // [
+  // {node_index: 1, kind: 'foo', text: '...'},
+  // {node_index: 0, kind: 'foo', text: '...'},
+  // ...
+  // ]
+  // ```
+  //
+  // This can be parsed as YAML using tools like `python-yq` combined with `jq`
+  // on the command line. The format is also reasonably amenable to other
+  // line-oriented shell tools from `grep` to `awk`.
+  auto Print(llvm::raw_ostream& output) const -> void;
+
+  // Verifies the parse tree structure.
+  //
+  // This tries to check any invariants of the parse tree structure and write
+  // out information about it to stderr. Returns false if anything fails to
+  // verify. This is primarily intended to be used as a debugging aid. A typical
+  // usage is to `assert` on the result. This routine doesn't directly assert so
+  // that it can be used even when asserts are disabled or within a debugger.
+  [[nodiscard]] auto Verify() const -> bool;
+
+ private:
+  class Parser;
+  friend Parser;
+
+  // The in-memory representation of data used for a particular node in the
+  // tree.
+  struct NodeImpl {
+    // The kind of this node. Note that this is only a single byte.
+    ParseNodeKind kind;
+
+    // We have 3 bytes of padding here that we can pack flags or other compact
+    // data into.
+
+    // Whether this node is or contains a parse error.
+    //
+    // When this is true, this node and its children may not have the expected
+    // grammatical production structure. Prior to reasoning about any specific
+    // subtree structure, this flag must be checked.
+    //
+    // Not every node in the path from the root to an error will have this field
+    // set to true. However, any node structure that fails to conform to the
+    // expected grammatical production will be contained within a subtree with
+    // this flag set. Whether parents of that subtree also have it set is
+    // optional (and will depend on the particular parse implementation
+    // strategy). The goal is that you can rely on grammar-based structural
+    // invariants *until* you encounter a node with this set.
+    bool has_error = false;
+
+    // The token root of this node.
+    TokenizedBuffer::Token token;
+
+    // The size of this node's subtree of the parse tree. This is the number of
+    // nodes (and thus tokens) that are covered by this node (and its
+    // descendants) in the parse tree.
+    //
+    // During a *reverse* postorder (RPO) traversal of the parse tree, this can
+    // also be thought of as the offset to the next non-descendant node. When
+    // this node is not the first child of its parent (which is the last child
+    // visited in RPO), that is the offset to the next sibling. When this node
+    // *is* the first child of its parent, this will be an offset to the node's
+    // parent's next sibling, or, if the parent is also a first child, the
+    // grandparent's next sibling, and so on.
+    //
+    // This field should always be a positive integer as at least this node is
+    // part of its subtree.
+    int32_t subtree_size;
+
+    explicit NodeImpl(ParseNodeKind k, TokenizedBuffer::Token t,
+                      int subtree_size_arg)
+        : kind(k), token(t), subtree_size(subtree_size_arg) {}
+  };
+
+  static_assert(sizeof(NodeImpl) == 12,
+                "Unexpected size of node implementation!");
+
+  // Wires up the reference to the tokenized buffer. The global `parse` routine
+  // should be used to actually parse the tokens into a tree.
+  explicit ParseTree(TokenizedBuffer& tokens_arg) : tokens(&tokens_arg) {}
+
+  // Depth-first postorder sequence of node implementation data.
+  llvm::SmallVector<NodeImpl, 0> node_impls;
+
+  TokenizedBuffer* tokens;
+
+  // Indicates if any errors were encountered while parsing.
+  //
+  // This doesn't indicate how much of the tree is structurally accurate with
+  // respect to the grammar. That can be identified by looking at the
+  // `has_error` flag for a given node (see above for details). This simply
+  // indicates that some errors were encountered somewhere. A key implication is
+  // that when this is true we do *not* have the expected 1:1 mapping between
+  // tokens and parsed nodes as some tokens may have been skipped.
+  bool has_errors = false;
+};
+
+// A lightweight handle representing a node in the tree.
+//
+// Objects of this type are small and cheap to copy and store. They don't
+// contain any of the information about the node, and serve as a handle that
+// can be used with the underlying tree to query for detailed information.
+//
+// That said, nodes can be compared and are part of a depth-first postorder
+// sequence across all nodes in the parse tree.
+class ParseTree::Node {
+ public:
+  // Node handles are default constructible, but such a node cannot be used
+  // for anything. It just allows it to be initialized later through
+  // assignment. Any other operation on a default constructed node is an
+  // error.
+  Node() = default;
+
+  friend auto operator==(Node lhs, Node rhs) -> bool {
+    return lhs.index == rhs.index;
+  }
+  friend auto operator!=(Node lhs, Node rhs) -> bool {
+    return lhs.index != rhs.index;
+  }
+  friend auto operator<(Node lhs, Node rhs) -> bool {
+    return lhs.index < rhs.index;
+  }
+  friend auto operator<=(Node lhs, Node rhs) -> bool {
+    return lhs.index <= rhs.index;
+  }
+  friend auto operator>(Node lhs, Node rhs) -> bool {
+    return lhs.index > rhs.index;
+  }
+  friend auto operator>=(Node lhs, Node rhs) -> bool {
+    return lhs.index >= rhs.index;
+  }
+
+  // Returns an opaque integer identifier of the node in the tree. Clients
+  // should not expect any particular semantics from this value.
+  //
+  // FIXME: Maybe we can switch to stream operator overloads?
+  [[nodiscard]] auto GetIndex() const -> int { return index; }
+
+ private:
+  friend ParseTree;
+  friend Parser;
+  friend PostorderIterator;
+  friend SiblingIterator;
+
+  // Constructs a node with a specific index into the parse tree's postorder
+  // sequence of node implementations.
+  explicit Node(int index_arg) : index(index_arg) {}
+
+  // The index of this node's implementation in the postorder sequence.
+  int32_t index;
+};
+
+// A random-access iterator to the depth-first postorder sequence of parse nodes
+// in the parse tree. It produces `ParseTree::Node` objects which are opaque
+// handles and must be used in conjunction with the `ParseTree` itself.
+class ParseTree::PostorderIterator
+    : public llvm::iterator_facade_base<PostorderIterator,
+                                        std::random_access_iterator_tag, Node,
+                                        int, Node*, Node> {
+ public:
+  // Default construction is only provided to satisfy iterator requirements. It
+  // produces an unusable iterator, and you must assign a valid iterator to it
+  // before performing any operations.
+  PostorderIterator() = default;
+
+  auto operator==(const PostorderIterator& rhs) const -> bool {
+    return node == rhs.node;
+  }
+  auto operator<(const PostorderIterator& rhs) const -> bool {
+    return node < rhs.node;
+  }
+
+  auto operator*() const -> Node { return node; }
+
+  auto operator-(const PostorderIterator& rhs) const -> int {
+    return node.index - rhs.node.index;
+  }
+
+  auto operator+=(int offset) -> PostorderIterator& {
+    node.index += offset;
+    return *this;
+  }
+  auto operator-=(int offset) -> PostorderIterator& {
+    node.index -= offset;
+    return *this;
+  }
+
+ private:
+  friend class ParseTree;
+
+  Node node;
+
+  explicit PostorderIterator(Node n) : node(n) {}
+};
+
+// A forward iterator across the siblings at a particular level in the parse
+// tree. It produces `ParseTree::Node` objects which are opaque handles and must
+// be used in conjunction with the `ParseTree` itself.
+//
+// While this is a forward iterator and may not have good locality within the
+// `ParseTree` data structure, it is still constant time to increment and
+// suitable for algorithms relying on that property.
+//
+// The siblings are discovered through a reverse postorder (RPO) tree traversal
+// (which is made constant time through cached distance information), and so the
+// relative order of siblings matches their RPO order.
+class ParseTree::SiblingIterator
+    : public llvm::iterator_facade_base<
+          SiblingIterator, std::forward_iterator_tag, Node, int, Node*, Node> {
+ public:
+  SiblingIterator() = default;
+
+  auto operator==(const SiblingIterator& rhs) const -> bool {
+    return node == rhs.node;
+  }
+  auto operator<(const SiblingIterator& rhs) const -> bool {
+    // Note that child iterators walk in reverse compared to the postorder
+    // index.
+    return node > rhs.node;
+  }
+
+  auto operator*() const -> Node { return node; }
+
+  using iterator_facade_base::operator++;
+  auto operator++() -> SiblingIterator& {
+    node.index -= std::abs(tree->node_impls[node.index].subtree_size);
+    return *this;
+  }
+
+ private:
+  friend class ParseTree;
+
+  const ParseTree* tree;
+
+  Node node;
+
+  explicit SiblingIterator(const ParseTree& tree_arg, Node n)
+      : tree(&tree_arg), node(n) {}
+};
+
+}  // namespace Carbon
+
+#endif  // PARSER_PARSE_TREE_H_
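
Reviewer sketch (not part of this commit): the `subtree_size` comments above describe how sibling iteration falls out of the postorder layout. The following is a minimal, self-contained illustration of that idea, assuming the tree is reduced to a bare vector of subtree sizes. `SubtreeSizeSketch` and its members are hypothetical names chosen for illustration and are not the `ParseTree` API; the real iterators return lazy ranges rather than vectors.

```cpp
#include <cstdint>
#include <vector>

// Hypothetical stand-in for the parse tree: only the per-node subtree sizes,
// stored in depth-first postorder, exactly one entry per node.
struct SubtreeSizeSketch {
  std::vector<std::int32_t> subtree_sizes;

  // Collects the direct children of node `n` in reverse postorder. The last
  // child sits immediately before `n`; each step skips over one whole child
  // subtree by subtracting that child's subtree size.
  auto Children(std::int32_t n) const -> std::vector<std::int32_t> {
    std::vector<std::int32_t> result;
    // One before the first index covered by `n`'s subtree.
    std::int32_t end = n - subtree_sizes[n];
    for (std::int32_t i = n - 1; i > end; i -= subtree_sizes[i]) {
      result.push_back(i);
    }
    return result;
  }

  // Collects the roots the same way, starting from the last node overall.
  auto Roots() const -> std::vector<std::int32_t> {
    std::vector<std::int32_t> result;
    for (auto i = static_cast<std::int32_t>(subtree_sizes.size()) - 1; i >= 0;
         i -= subtree_sizes[i]) {
      result.push_back(i);
    }
    return result;
  }
};
```

For the `fn F();` tree exercised in the tests, the sizes in postorder would be `{1, 1, 2, 1, 5}`, so the children of node 4 come out as 3, 2, 0: the declaration end, the parameter list, then the identifier.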

+ 60 - 0
parser/parse_tree_fuzzer.cpp

@@ -0,0 +1,60 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include <cstddef>
+#include <cstring>
+
+#include "diagnostics/diagnostic_emitter.h"
+#include "lexer/tokenized_buffer.h"
+#include "llvm/ADT/StringRef.h"
+#include "parser/parse_tree.h"
+
+namespace Carbon {
+
+// NOLINTNEXTLINE: Match the documented fuzzer entry point declaration style.
+extern "C" int LLVMFuzzerTestOneInput(const unsigned char* data,
+                                      std::size_t size) {
+  // We need two bytes of data to compute a file name length.
+  if (size < 2)
+    return 0;
+  unsigned short raw_filename_length;
+  std::memcpy(&raw_filename_length, data, 2);
+  data += 2;
+  size -= 2;
+  std::size_t filename_length = raw_filename_length;
+
+  // We need enough data to populate this filename length.
+  if (size < filename_length)
+    return 0;
+  llvm::StringRef filename(reinterpret_cast<const char*>(data),
+                           filename_length);
+  data += filename_length;
+  size -= filename_length;
+
+  // The rest of the data is the source text.
+  auto source = SourceBuffer::CreateFromText(
+      llvm::StringRef(reinterpret_cast<const char*>(data), size), filename);
+
+  // Use a real diagnostic emitter to get lazy codepaths to execute.
+  DiagnosticEmitter emitter = NullDiagnosticEmitter();
+
+  // Lex the input.
+  auto tokens = TokenizedBuffer::Lex(source, emitter);
+  if (tokens.HasErrors())
+    return 0;
+
+  // Now parse it into a tree. Note that parsing will (when asserts are enabled)
+  // walk the entire tree to verify it so we don't have to do that here.
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  if (tree.HasErrors())
+    return 0;
+
+  // In the absence of parse errors, we should have exactly as many nodes as
+  // tokens.
+  assert(tree.Size() == tokens.Size() && "Unexpected number of tree nodes!");
+
+  return 0;
+}
+
+}  // namespace Carbon
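
Reviewer sketch (not part of this commit): the fuzzer entry point above splits its input into a two-byte length prefix, a filename of that length, and the remaining bytes as source text. The same framing can be isolated into a small helper for illustration. `SplitInput` and `SplitFuzzerInput` are hypothetical names, and `std::uint16_t` is used here in place of the `unsigned short` in the fuzzer, on the assumption that both are two bytes wide on the target.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <optional>
#include <string_view>

// Hypothetical result type: views into the caller's buffer, no copies.
struct SplitInput {
  std::string_view filename;
  std::string_view source;
};

// Splits `data` into a filename and source text using a two-byte,
// native-endian length prefix. Returns nullopt when the buffer is too short
// to hold the prefix or the filename it announces.
inline auto SplitFuzzerInput(const unsigned char* data, std::size_t size)
    -> std::optional<SplitInput> {
  if (size < 2) {
    return std::nullopt;
  }
  std::uint16_t raw_length;
  std::memcpy(&raw_length, data, 2);
  data += 2;
  size -= 2;

  std::size_t filename_length = raw_length;
  if (size < filename_length) {
    return std::nullopt;
  }
  const char* text = reinterpret_cast<const char*>(data);
  return SplitInput{
      std::string_view(text, filename_length),
      std::string_view(text + filename_length, size - filename_length)};
}
```

Reading the prefix with `memcpy` rather than a pointer cast avoids alignment and aliasing problems, at the cost of making the encoded length depend on the host byte order.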

+ 534 - 0
parser/parse_tree_test.cpp

@@ -0,0 +1,534 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include "parser/parse_tree.h"
+
+#include <forward_list>
+
+#include "diagnostics/diagnostic_emitter.h"
+#include "gmock/gmock.h"
+#include "gtest/gtest.h"
+#include "lexer/tokenized_buffer.h"
+#include "lexer/tokenized_buffer_test_helpers.h"
+#include "llvm/ADT/Sequence.h"
+#include "llvm/Support/SourceMgr.h"
+#include "llvm/Support/YAMLParser.h"
+#include "parser/parse_node_kind.h"
+#include "parser/parse_test_helpers.h"
+
+namespace Carbon {
+namespace {
+
+using Carbon::Testing::IsKeyValueScalars;
+using Carbon::Testing::MatchParseTreeNodes;
+using ::testing::Eq;
+using ::testing::Ne;
+using ::testing::NotNull;
+using ::testing::StrEq;
+
+struct ParseTreeTest : ::testing::Test {
+  std::forward_list<SourceBuffer> source_storage;
+  std::forward_list<TokenizedBuffer> token_storage;
+  DiagnosticEmitter emitter = NullDiagnosticEmitter();
+
+  auto GetSourceBuffer(llvm::Twine t) -> SourceBuffer& {
+    source_storage.push_front(SourceBuffer::CreateFromText(t.str()));
+    return source_storage.front();
+  }
+
+  auto GetTokenizedBuffer(llvm::Twine t) -> TokenizedBuffer& {
+    token_storage.push_front(TokenizedBuffer::Lex(GetSourceBuffer(t), emitter));
+    return token_storage.front();
+  }
+};
+
+TEST_F(ParseTreeTest, Empty) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  EXPECT_THAT(tree.Postorder().begin(), Eq(tree.Postorder().end()));
+}
+
+TEST_F(ParseTreeTest, EmptyDeclaration) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(";");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  auto it = tree.Postorder().begin();
+  auto end = tree.Postorder().end();
+  ASSERT_THAT(it, Ne(end));
+  ParseTree::Node n = *it++;
+  EXPECT_THAT(it, Eq(end));
+
+  // Directly test the main API so that we get easier to understand errors in
+  // simple cases than what the custom matcher will produce.
+  EXPECT_FALSE(tree.HasErrorInNode(n));
+  EXPECT_THAT(tree.GetNodeKind(n), Eq(ParseNodeKind::EmptyDeclaration()));
+  auto t = tree.GetNodeToken(n);
+  ASSERT_THAT(tokens.Tokens().begin(), Ne(tokens.Tokens().end()));
+  EXPECT_THAT(t, Eq(*tokens.Tokens().begin()));
+  EXPECT_THAT(tokens.GetTokenText(t), Eq(";"));
+
+  EXPECT_THAT(tree.Postorder(n).begin(), Eq(tree.Postorder().begin()));
+  EXPECT_THAT(tree.Postorder(n).end(), Eq(tree.Postorder().end()));
+  EXPECT_THAT(tree.Children(n).begin(), Eq(tree.Children(n).end()));
+}
+
+TEST_F(ParseTreeTest, BasicFunctionDeclaration) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  EXPECT_THAT(
+      tree, MatchParseTreeNodes(
+                {{.kind = ParseNodeKind::FunctionDeclaration(),
+                  .text = "fn",
+                  .children = {
+                      {ParseNodeKind::Identifier(), "F"},
+                      {.kind = ParseNodeKind::ParameterList(),
+                       .text = "(",
+                       .children = {{ParseNodeKind::ParameterListEnd(), ")"}}},
+                      {ParseNodeKind::DeclarationEnd(), ";"}}}}));
+}
+
+TEST_F(ParseTreeTest, NoDeclarationIntroducerOrSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("foo bar baz");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree.Postorder().begin(), Eq(tree.Postorder().end()));
+}
+
+TEST_F(ParseTreeTest, NoDeclarationIntroducerWithSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("foo;");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree,
+              MatchParseTreeNodes({{.kind = ParseNodeKind::EmptyDeclaration(),
+                                    .text = ";",
+                                    .has_error = true}}));
+}
+
+TEST_F(ParseTreeTest, JustFunctionIntroducerAndSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn;");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree, MatchParseTreeNodes(
+                        {{.kind = ParseNodeKind::FunctionDeclaration(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, RepeatedFunctionIntroducerAndSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn fn;");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree, MatchParseTreeNodes(
+                        {{.kind = ParseNodeKind::FunctionDeclaration(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationWithNoSignatureOrSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn foo");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree,
+              MatchParseTreeNodes(
+                  {{.kind = ParseNodeKind::FunctionDeclaration(),
+                    .has_error = true,
+                    .children = {{ParseNodeKind::Identifier(), "foo"}}}}));
+}
+
+TEST_F(ParseTreeTest,
+       FunctionDeclarationWithIdentifierInsteadOfSignatureAndSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn foo bar;");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree, MatchParseTreeNodes(
+                        {{.kind = ParseNodeKind::FunctionDeclaration(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::Identifier(), "foo"},
+                                       {ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationWithSingleIdentifierParameterList) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn foo(bar);");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  // Note: this might become valid depending on the parameter syntax. This test
+  // shouldn't be taken as a sign it should remain invalid.
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(),
+            .has_error = true,
+            .children = {{ParseNodeKind::Identifier(), "foo"},
+                         {.kind = ParseNodeKind::ParameterList(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::ParameterListEnd()}}},
+                         {ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationWithoutName) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn ();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree, MatchParseTreeNodes(
+                        {{.kind = ParseNodeKind::FunctionDeclaration(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest,
+       FunctionDeclarationWithoutNameAndManyTokensToSkipInGroupedSymbols) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn (a tokens c d e f g h i j k l m n o p q r s t u v w x y z);");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(tree, MatchParseTreeNodes(
+                        {{.kind = ParseNodeKind::FunctionDeclaration(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationSkipToNewlineWithoutSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn ()\n"
+      "fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(), .has_error = true},
+           {.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {{ParseNodeKind::Identifier(), "F"},
+                         {.kind = ParseNodeKind::ParameterList(),
+                          .children = {{ParseNodeKind::ParameterListEnd()}}},
+                         {ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationSkipIndentedNewlineWithSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn (x,\n"
+      "    y,\n"
+      "    z);\n"
+      "fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(),
+            .has_error = true,
+            .children = {{ParseNodeKind::DeclarationEnd()}}},
+           {.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {{ParseNodeKind::Identifier(), "F"},
+                         {.kind = ParseNodeKind::ParameterList(),
+                          .children = {{ParseNodeKind::ParameterListEnd()}}},
+                         {ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationSkipIndentedNewlineWithoutSemi) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn (x,\n"
+      "    y,\n"
+      "    z)\n"
+      "fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(), .has_error = true},
+           {.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {{ParseNodeKind::Identifier(), "F"},
+                         {.kind = ParseNodeKind::ParameterList(),
+                          .children = {{ParseNodeKind::ParameterListEnd()}}},
+                         {ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationSkipIndentedNewlineUntilOutdent) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "  fn (x,\n"
+      "      y,\n"
+      "      z)\n"
+      "fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(), .has_error = true},
+           {.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {{ParseNodeKind::Identifier(), "F"},
+                         {.kind = ParseNodeKind::ParameterList(),
+                          .children = {{ParseNodeKind::ParameterListEnd()}}},
+                         {ParseNodeKind::DeclarationEnd()}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDeclarationSkipWithoutSemiToCurly) {
+  // FIXME: We don't have a grammar construct that uses curlies yet so this just
+  // won't parse at all. Once it does, we should ensure that the close brace
+  // gets properly parsed for the struct (or whatever other curly-braced syntax
+  // we have grouping function declarations) despite the invalid function
+  // declaration missing a semicolon.
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "struct X { fn () }\n"
+      "fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_TRUE(tree.HasErrors());
+}
+
+TEST_F(ParseTreeTest, BasicFunctionDefinition) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn F() {\n"
+      "}");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  EXPECT_THAT(
+      tree, MatchParseTreeNodes(
+                {{.kind = ParseNodeKind::FunctionDeclaration(),
+                  .children = {
+                      {ParseNodeKind::Identifier(), "F"},
+                      {.kind = ParseNodeKind::ParameterList(),
+                       .children = {{ParseNodeKind::ParameterListEnd()}}},
+                      {.kind = ParseNodeKind::CodeBlock(),
+                       .text = "{",
+                       .children = {{ParseNodeKind::CodeBlockEnd(), "}"}}}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDefinitionWithNestedBlocks) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn F() {\n"
+      "  {\n"
+      "    {{}}\n"
+      "  }\n"
+      "}");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {
+                {ParseNodeKind::Identifier(), "F"},
+                {.kind = ParseNodeKind::ParameterList(),
+                 .children = {{ParseNodeKind::ParameterListEnd()}}},
+                {.kind = ParseNodeKind::CodeBlock(),
+                 .children = {
+                     {.kind = ParseNodeKind::CodeBlock(),
+                      .children = {{.kind = ParseNodeKind::CodeBlock(),
+                                    .children =
+                                        {{.kind = ParseNodeKind::CodeBlock(),
+                                          .children = {{ParseNodeKind::
+                                                            CodeBlockEnd()}}},
+                                         {ParseNodeKind::CodeBlockEnd()}}},
+                                   {ParseNodeKind::CodeBlockEnd()}}},
+                     {ParseNodeKind::CodeBlockEnd()}}}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDefinitionWithIdentifierInStatements) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn F() {\n"
+      "  bar\n"
+      "}");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  // Note: this might become valid depending on the expression syntax. This test
+  // shouldn't be taken as a sign it should remain invalid.
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {{ParseNodeKind::Identifier(), "F"},
+                         {.kind = ParseNodeKind::ParameterList(),
+                          .children = {{ParseNodeKind::ParameterListEnd()}}},
+                         {.kind = ParseNodeKind::CodeBlock(),
+                          .has_error = true,
+                          .children = {{ParseNodeKind::CodeBlockEnd()}}}}}}));
+}
+
+TEST_F(ParseTreeTest, FunctionDefinitionWithIdentifierInNestedBlock) {
+  TokenizedBuffer tokens = GetTokenizedBuffer(
+      "fn F() {\n"
+      "  {bar}\n"
+      "}");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  // Note: this might become valid depending on the expression syntax. This test
+  // shouldn't be taken as a sign it should remain invalid.
+  EXPECT_TRUE(tree.HasErrors());
+  EXPECT_THAT(
+      tree,
+      MatchParseTreeNodes(
+          {{.kind = ParseNodeKind::FunctionDeclaration(),
+            .children = {
+                {ParseNodeKind::Identifier(), "F"},
+                {.kind = ParseNodeKind::ParameterList(),
+                 .children = {{ParseNodeKind::ParameterListEnd()}}},
+                {.kind = ParseNodeKind::CodeBlock(),
+                 .children = {{.kind = ParseNodeKind::CodeBlock(),
+                               .has_error = true,
+                               .children = {{ParseNodeKind::CodeBlockEnd()}}},
+                              {ParseNodeKind::CodeBlockEnd()}}}}}}));
+}
+
+auto GetAndDropLine(llvm::StringRef& s) -> std::string {
+  auto newline_offset = s.find_first_of('\n');
+  llvm::StringRef line = s.slice(0, newline_offset);
+
+  if (newline_offset != llvm::StringRef::npos)
+    s = s.substr(newline_offset + 1);
+  else
+    s = "";
+
+  return line.str();
+}
+
+TEST_F(ParseTreeTest, Printing) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  std::string print_storage;
+  llvm::raw_string_ostream print_stream(print_storage);
+  tree.Print(print_stream);
+  llvm::StringRef print = print_stream.str();
+  EXPECT_THAT(GetAndDropLine(print), StrEq("["));
+  EXPECT_THAT(GetAndDropLine(print),
+              StrEq("{node_index: 4, kind: 'FunctionDeclaration', text: 'fn', "
+                    "subtree_size: 5, children: ["));
+  EXPECT_THAT(GetAndDropLine(print),
+              StrEq("  {node_index: 0, kind: 'Identifier', text: 'F'},"));
+  EXPECT_THAT(GetAndDropLine(print),
+              StrEq("  {node_index: 2, kind: 'ParameterList', text: '(', "
+                    "subtree_size: 2, children: ["));
+  EXPECT_THAT(GetAndDropLine(print),
+              StrEq("    {node_index: 1, kind: 'ParameterListEnd', "
+                    "text: ')'}]},"));
+  EXPECT_THAT(GetAndDropLine(print),
+              StrEq("  {node_index: 3, kind: 'DeclarationEnd', text: ';'}]},"));
+  EXPECT_THAT(GetAndDropLine(print), StrEq("]"));
+  EXPECT_TRUE(print.empty()) << print;
+}
+
+TEST_F(ParseTreeTest, PrintingAsYAML) {
+  TokenizedBuffer tokens = GetTokenizedBuffer("fn F();");
+  ParseTree tree = ParseTree::Parse(tokens, emitter);
+  EXPECT_FALSE(tree.HasErrors());
+  std::string print_output;
+  llvm::raw_string_ostream print_stream(print_output);
+  tree.Print(print_stream);
+  print_stream.flush();
+
+  // Parse the output into a YAML stream. This will print errors to stderr.
+  llvm::SourceMgr source_manager;
+  llvm::yaml::Stream yaml_stream(print_output, source_manager);
+  auto di = yaml_stream.begin();
+  auto* root_node = llvm::dyn_cast<llvm::yaml::SequenceNode>(di->getRoot());
+  ASSERT_THAT(root_node, NotNull());
+
+  // The root node is just an array of top-level parse nodes.
+  auto ni = root_node->begin();
+  auto ne = root_node->end();
+  auto* node = llvm::dyn_cast<llvm::yaml::MappingNode>(&*ni);
+  ASSERT_THAT(node, NotNull());
+  auto nkvi = node->begin();
+  auto nkve = node->end();
+  EXPECT_THAT(&*nkvi, IsKeyValueScalars("node_index", "4"));
+  ++nkvi;
+  EXPECT_THAT(&*nkvi, IsKeyValueScalars("kind", "FunctionDeclaration"));
+  ++nkvi;
+  EXPECT_THAT(&*nkvi, IsKeyValueScalars("text", "fn"));
+  ++nkvi;
+  EXPECT_THAT(&*nkvi, IsKeyValueScalars("subtree_size", "5"));
+  ++nkvi;
+  auto* children_node = llvm::dyn_cast<llvm::yaml::KeyValueNode>(&*nkvi);
+  ASSERT_THAT(children_node, NotNull());
+  auto* children_key_node =
+      llvm::dyn_cast<llvm::yaml::ScalarNode>(children_node->getKey());
+  ASSERT_THAT(children_key_node, NotNull());
+  EXPECT_THAT(children_key_node->getRawValue(), StrEq("children"));
+  auto* children_value_node =
+      llvm::dyn_cast<llvm::yaml::SequenceNode>(children_node->getValue());
+  ASSERT_THAT(children_value_node, NotNull());
+
+  auto ci = children_value_node->begin();
+  auto ce = children_value_node->end();
+  ASSERT_THAT(ci, Ne(ce));
+  node = llvm::dyn_cast<llvm::yaml::MappingNode>(&*ci);
+  ASSERT_THAT(node, NotNull());
+  auto ckvi = node->begin();
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("node_index", "0"));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("kind", "Identifier"));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("text", "F"));
+  ++ckvi;
+  EXPECT_THAT(ckvi, Eq(node->end()));
+
+  ++ci;
+  ASSERT_THAT(ci, Ne(ce));
+  node = llvm::dyn_cast<llvm::yaml::MappingNode>(&*ci);
+  ASSERT_THAT(node, NotNull());
+  ckvi = node->begin();
+  auto ckve = node->end();
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("node_index", "2"));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("kind", "ParameterList"));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("text", "("));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("subtree_size", "2"));
+  ++ckvi;
+  children_node = llvm::dyn_cast<llvm::yaml::KeyValueNode>(&*ckvi);
+  ASSERT_THAT(children_node, NotNull());
+  children_key_node =
+      llvm::dyn_cast<llvm::yaml::ScalarNode>(children_node->getKey());
+  ASSERT_THAT(children_key_node, NotNull());
+  EXPECT_THAT(children_key_node->getRawValue(), StrEq("children"));
+  children_value_node =
+      llvm::dyn_cast<llvm::yaml::SequenceNode>(children_node->getValue());
+  ASSERT_THAT(children_value_node, NotNull());
+
+  auto c2_i = children_value_node->begin();
+  auto c2_e = children_value_node->end();
+  ASSERT_THAT(c2_i, Ne(c2_e));
+  node = llvm::dyn_cast<llvm::yaml::MappingNode>(&*c2_i);
+  ASSERT_THAT(node, NotNull());
+  auto c2_kvi = node->begin();
+  EXPECT_THAT(&*c2_kvi, IsKeyValueScalars("node_index", "1"));
+  ++c2_kvi;
+  EXPECT_THAT(&*c2_kvi, IsKeyValueScalars("kind", "ParameterListEnd"));
+  ++c2_kvi;
+  EXPECT_THAT(&*c2_kvi, IsKeyValueScalars("text", ")"));
+  ++c2_kvi;
+  EXPECT_THAT(c2_kvi, Eq(node->end()));
+  ++c2_i;
+  EXPECT_THAT(c2_i, Eq(c2_e));
+  ++ckvi;
+  EXPECT_THAT(ckvi, Eq(ckve));
+
+  ++ci;
+  ASSERT_THAT(ci, Ne(ce));
+  node = llvm::dyn_cast<llvm::yaml::MappingNode>(&*ci);
+  ASSERT_THAT(node, NotNull());
+  ckvi = node->begin();
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("node_index", "3"));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("kind", "DeclarationEnd"));
+  ++ckvi;
+  EXPECT_THAT(&*ckvi, IsKeyValueScalars("text", ";"));
+  ++ckvi;
+  EXPECT_THAT(ckvi, Eq(node->end()));
+  ++ci;
+  EXPECT_THAT(ci, Eq(ce));
+
+  ++nkvi;
+  EXPECT_THAT(nkvi, Eq(nkve));
+  ++ni;
+  EXPECT_THAT(ni, Eq(ne));
+  ++di;
+  EXPECT_THAT(di, Eq(yaml_stream.end()));
+}
+
+}  // namespace
+}  // namespace Carbon

+ 360 - 0
parser/parser_impl.cpp

@@ -0,0 +1,360 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#include "parser/parser_impl.h"
+
+#include <cstdlib>
+
+#include "lexer/token_kind.h"
+#include "lexer/tokenized_buffer.h"
+#include "llvm/ADT/Optional.h"
+#include "llvm/Support/raw_ostream.h"
+#include "parser/parse_node_kind.h"
+#include "parser/parse_tree.h"
+
+namespace Carbon {
+
+auto ParseTree::Parser::Parse(TokenizedBuffer& tokens,
+                              DiagnosticEmitter& /*unused*/) -> ParseTree {
+  ParseTree tree(tokens);
+
+  // We expect to have a 1:1 correspondence between tokens and tree nodes, so
+  // reserve the space we expect to need here to avoid allocation and copying
+  // overhead.
+  tree.node_impls.reserve(tokens.Size());
+
+  Parser parser(tree, tokens);
+  while (parser.position != parser.end)
+    parser.ParseDeclaration();
+
+  assert(tree.Verify() && "Parse tree built but does not verify!");
+  return tree;
+}
+
+auto ParseTree::Parser::Consume(TokenKind kind) -> TokenizedBuffer::Token {
+  TokenizedBuffer::Token t = *position;
+  assert(tokens.GetKind(t) == kind && "The current token is the wrong kind!");
+  ++position;
+  return t;
+}
+
+auto ParseTree::Parser::ConsumeIf(TokenKind kind)
+    -> llvm::Optional<TokenizedBuffer::Token> {
+  if (tokens.GetKind(*position) != kind)
+    return {};
+
+  return *position++;
+}
+
+auto ParseTree::Parser::AddLeafNode(ParseNodeKind kind,
+                                    TokenizedBuffer::Token token) -> Node {
+  Node n(tree.node_impls.size());
+  tree.node_impls.push_back(NodeImpl(kind, token, /*SubtreeSize=*/1));
+  return n;
+}
+
+auto ParseTree::Parser::ConsumeAndAddLeafNodeIf(TokenKind t_kind,
+                                                ParseNodeKind n_kind)
+    -> llvm::Optional<Node> {
+  auto t = ConsumeIf(t_kind);
+  if (!t)
+    return {};
+
+  return AddLeafNode(n_kind, *t);
+}
+
+auto ParseTree::Parser::MarkNodeError(Node n) -> void {
+  tree.node_impls[n.index].has_error = true;
+  tree.has_errors = true;
+}
+
+// A marker for the start of a node's subtree.
+//
+// This is used to track the size of the node's subtree and ensure at least one
+// parse node is added. It can be used repeatedly if multiple subtrees start at
+// the same position.
+struct ParseTree::Parser::SubtreeStart {
+  int tree_size;
+  bool node_added = false;
+
+  ~SubtreeStart() {
+    assert(node_added && "Never added a node for a subtree region!");
+  }
+};
+
+auto ParseTree::Parser::StartSubtree() -> SubtreeStart {
+  return {static_cast<int>(tree.node_impls.size())};
+}
+
+auto ParseTree::Parser::AddNode(ParseNodeKind n_kind, TokenizedBuffer::Token t,
+                                SubtreeStart& start, bool has_error) -> Node {
+  // The size of the subtree is the change in size from when we started this
+  // subtree to now, but including the node we're about to add.
+  int tree_stop_size = tree.node_impls.size() + 1;
+  int subtree_size = tree_stop_size - start.tree_size;
+
+  Node n(tree.node_impls.size());
+  tree.node_impls.push_back(NodeImpl(n_kind, t, subtree_size));
+  if (has_error)
+    MarkNodeError(n);
+
+  start.node_added = true;
+  return n;
+}
+
+auto ParseTree::Parser::SkipMatchingGroup() -> bool {
+  assert(position != end && "Cannot skip at the end!");
+  TokenizedBuffer::Token t = *position;
+  TokenKind t_kind = tokens.GetKind(t);
+  if (!t_kind.IsOpeningSymbol())
+    return false;
+
+  position = std::next(
+      TokenizedBuffer::TokenIterator(tokens.GetMatchedClosingToken(t)));
+  return true;
+}
+
+auto ParseTree::Parser::SkipPastLikelyDeclarationEnd(
+    TokenizedBuffer::Token skip_root, bool is_inside_declaration)
+    -> llvm::Optional<Node> {
+  if (position == end)
+    return {};
+
+  TokenizedBuffer::Line root_line = tokens.GetLine(skip_root);
+  int root_line_indent = tokens.GetIndentColumnNumber(root_line);
+
+  // We will keep scanning through tokens on the same line as the root or
+  // lines with greater indentation than root's line.
+  auto is_same_line_or_indent_greater_than_root =
+      [&](TokenizedBuffer::Token t) {
+        TokenizedBuffer::Line l = tokens.GetLine(t);
+        if (l == root_line)
+          return true;
+
+        return tokens.GetIndentColumnNumber(l) > root_line_indent;
+      };
+
+  do {
+    TokenKind current_kind = tokens.GetKind(*position);
+    if (current_kind == TokenKind::CloseCurlyBrace())
+      // Immediately bail out if we hit an unmatched close curly; this will
+      // pop us up a level of the syntax grouping.
+      return {};
+
+    // If we find a semicolon, we want to parse it to end the declaration.
+    if (current_kind == TokenKind::Semi()) {
+      TokenizedBuffer::Token semi = *position++;
+
+      // Add a node for the semicolon. If we're inside of a declaration, this
+      // is a declaration ending semicolon, otherwise it simply forms an empty
+      // declaration.
+      return AddLeafNode(is_inside_declaration
+                             ? ParseNodeKind::DeclarationEnd()
+                             : ParseNodeKind::EmptyDeclaration(),
+                         semi);
+    }
+
+    // Skip over any matching group of tokens.
+    if (SkipMatchingGroup())
+      continue;
+
+    // Otherwise just step forward one token.
+    ++position;
+  } while (position != end &&
+           is_same_line_or_indent_greater_than_root(*position));
+
+  return {};
+}
+
+auto ParseTree::Parser::ParseFunctionSignature() -> Node {
+  assert(position != end && "Cannot parse past the end!");
+
+  TokenizedBuffer::Token open_paren = Consume(TokenKind::OpenParen());
+  assert(position != end &&
+         "The lexer ensures we always have a closing paren!");
+  auto start = StartSubtree();
+
+  // FIXME: Add support for parsing parameters.
+
+  bool has_errors = false;
+  auto close_paren = ConsumeIf(TokenKind::CloseParen());
+  if (!close_paren) {
+    llvm::errs() << "ERROR: unexpected token before the close of the "
+                    "parameters on line "
+                 << tokens.GetLineNumber(*position) << "!\n";
+    has_errors = true;
+
+    // We can trivially skip to the actual close parenthesis from here.
+    close_paren = tokens.GetMatchedClosingToken(open_paren);
+    position = std::next(TokenizedBuffer::TokenIterator(*close_paren));
+  }
+  AddLeafNode(ParseNodeKind::ParameterListEnd(), *close_paren);
+
+  // FIXME: Implement parsing of a return type.
+
+  return AddNode(ParseNodeKind::ParameterList(), open_paren, start, has_errors);
+}
+
+auto ParseTree::Parser::ParseCodeBlock() -> Node {
+  assert(position != end && "Cannot parse past the end!");
+
+  TokenizedBuffer::Token open_curly = Consume(TokenKind::OpenCurlyBrace());
+  assert(position != end &&
+         "The lexer ensures we always have a closing curly!");
+  auto start = StartSubtree();
+
+  bool has_errors = false;
+
+  // Loop over all the different possibly nested elements in the code block.
+  for (;;) {
+    switch (tokens.GetKind(*position)) {
+      default:
+        // FIXME: Add support for parsing more expressions & statements.
+        llvm::errs() << "ERROR: unexpected token before the close of the "
+                        "function definition on line "
+                     << tokens.GetLineNumber(*position) << "!\n";
+        has_errors = true;
+
+        // We can trivially skip to the actual close curly brace from here.
+        position = TokenizedBuffer::TokenIterator(
+            tokens.GetMatchedClosingToken(open_curly));
+        // Now fall through to the close curly brace handling code.
+        LLVM_FALLTHROUGH;
+
+      case TokenKind::CloseCurlyBrace():
+        break;
+
+      case TokenKind::OpenCurlyBrace():
+        // FIXME: We should consider avoiding recursion here with some side
+        // stack.
+        ParseCodeBlock();
+        continue;
+    }
+
+    // We only continue looping with `continue` above.
+    break;
+  }
+
+  // We always reach here having set our position in the token stream to the
+  // close curly brace.
+  AddLeafNode(ParseNodeKind::CodeBlockEnd(),
+              Consume(TokenKind::CloseCurlyBrace()));
+
+  return AddNode(ParseNodeKind::CodeBlock(), open_curly, start, has_errors);
+}
+
+auto ParseTree::Parser::ParseFunctionDeclaration() -> Node {
+  assert(position != end && "Cannot parse past the end!");
+
+  TokenizedBuffer::Token function_intro_token = Consume(TokenKind::FnKeyword());
+  auto start = StartSubtree();
+  auto add_error_function_node = [&] {
+    return AddNode(ParseNodeKind::FunctionDeclaration(), function_intro_token,
+                   start, /*has_error=*/true);
+  };
+
+  if (position == end) {
+    llvm::errs() << "ERROR: File ended with a function introducer on line "
+                 << tokens.GetLineNumber(function_intro_token) << "!\n";
+    return add_error_function_node();
+  }
+
+  auto name_n = ConsumeAndAddLeafNodeIf(TokenKind::Identifier(),
+                                        ParseNodeKind::Identifier());
+  if (!name_n) {
+    llvm::errs() << "ERROR: Function declaration with no name on line "
+                 << tokens.GetLineNumber(function_intro_token) << "!\n";
+    // FIXME: We could change the lexer to allow us to synthesize certain
+    // kinds of tokens and try to "recover" here, but unclear that this is
+    // really useful.
+    SkipPastLikelyDeclarationEnd(function_intro_token);
+    return add_error_function_node();
+  }
+  if (position == end) {
+    llvm::errs() << "ERROR: File ended after a function introducer and "
+                    "identifier on line "
+                 << tokens.GetLineNumber(function_intro_token) << "!\n";
+    return add_error_function_node();
+  }
+
+  TokenizedBuffer::Token open_paren = *position;
+  if (tokens.GetKind(open_paren) != TokenKind::OpenParen()) {
+    llvm::errs()
+        << "ERROR: Missing open parenthesis in declaration of function '"
+        << tokens.GetTokenText(tree.GetNodeToken(*name_n)) << "' on line "
+        << tokens.GetLineNumber(function_intro_token) << "!\n";
+    SkipPastLikelyDeclarationEnd(function_intro_token);
+    return add_error_function_node();
+  }
+  assert(std::next(position) != end &&
+         "Unbalanced parentheses should be rejected by the lexer.");
+  TokenizedBuffer::Token close_paren =
+      tokens.GetMatchedClosingToken(open_paren);
+
+  Node signature_n = ParseFunctionSignature();
+  assert(*std::prev(position) == close_paren &&
+         "Should have parsed through the close paren, whether successfully "
+         "or with errors.");
+  if (tree.node_impls[signature_n.index].has_error) {
+    // Don't try to parse more of the function declaration, but consume a
+    // declaration ending semicolon if found (without going to a new line).
+    SkipPastLikelyDeclarationEnd(function_intro_token);
+    return add_error_function_node();
+  }
+
+  // See if we should parse a definition which is represented as a code block.
+  if (tokens.GetKind(*position) == TokenKind::OpenCurlyBrace()) {
+    ParseCodeBlock();
+  } else if (!ConsumeAndAddLeafNodeIf(TokenKind::Semi(),
+                                      ParseNodeKind::DeclarationEnd())) {
+    llvm::errs() << "ERROR: Function declaration not terminated by a "
+                    "semicolon on line "
+                 << tokens.GetLineNumber(close_paren) << "!\n";
+    if (tokens.GetLine(*position) == tokens.GetLine(close_paren))
+      // Only need to skip if we've not already found a new line.
+      SkipPastLikelyDeclarationEnd(function_intro_token);
+    return add_error_function_node();
+  }
+
+  // Successfully parsed the function, add that node.
+  return AddNode(ParseNodeKind::FunctionDeclaration(), function_intro_token,
+                 start);
+}
+
+auto ParseTree::Parser::ParseEmptyDeclaration() -> Node {
+  assert(position != end && "Cannot parse past the end!");
+  return AddLeafNode(ParseNodeKind::EmptyDeclaration(),
+                     Consume(TokenKind::Semi()));
+}
+
+auto ParseTree::Parser::ParseDeclaration() -> llvm::Optional<Node> {
+  assert(position != end && "Cannot parse past the end!");
+  TokenizedBuffer::Token t = *position;
+  switch (tokens.GetKind(t)) {
+    case TokenKind::FnKeyword():
+      return ParseFunctionDeclaration();
+    case TokenKind::Semi():
+      return ParseEmptyDeclaration();
+  }
+
+  // We didn't recognize an introducer for a valid declaration.
+  llvm::errs() << "ERROR: Unrecognized declaration introducer '"
+               << tokens.GetTokenText(t) << "' on line "
+               << tokens.GetLineNumber(t) << "!\n";
+
+  // Skip forward past any end of a declaration we simply didn't understand so
+  // that we can find the start of the next declaration or the end of a scope.
+  if (auto found_semi_n =
+          SkipPastLikelyDeclarationEnd(t, /*is_inside_declaration=*/false)) {
+    MarkNodeError(*found_semi_n);
+    return *found_semi_n;
+  }
+
+  // Nothing, not even a semicolon found. We still need to mark that an error
+  // occurred though.
+  tree.has_errors = true;
+  return {};
+}
+
+}  // namespace Carbon
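
Illustrative note (not part of the commit): the bookkeeping in `StartSubtree`, `AddLeafNode`, and `AddNode` above can be sketched in isolation. The sketch below assumes, as the expectations in the test file suggest, that children are appended before their parent and that `subtree_size` counts the node itself plus all descendants. `SketchNode`, `SketchTree`, `Children`, and `BuildFnF` are hypothetical names invented for this example; they are not the library's API.

```cpp
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for the parser's dense node storage.
struct SketchNode {
  std::string kind;
  int subtree_size;  // This node plus all of its descendants.
};

struct SketchTree {
  std::vector<SketchNode> nodes;

  // Records where a subtree begins, in the spirit of `StartSubtree`.
  auto Start() const -> int { return static_cast<int>(nodes.size()); }

  // Appends a childless node; its subtree is just itself.
  auto AddLeaf(std::string kind) -> int {
    nodes.push_back({std::move(kind), 1});
    return static_cast<int>(nodes.size()) - 1;
  }

  // Appends a parent after its children. Everything appended since `start`
  // belongs to the new node's subtree, plus the node itself.
  auto AddParent(std::string kind, int start) -> int {
    int size = static_cast<int>(nodes.size()) + 1 - start;
    nodes.push_back({std::move(kind), size});
    return static_cast<int>(nodes.size()) - 1;
  }

  // Recovers the direct children of `n`, in source order, using only the
  // stored subtree sizes: walk backwards from `n - 1`, hopping over each
  // child's whole subtree.
  auto Children(int n) const -> std::vector<int> {
    assert(n >= 0 && n < static_cast<int>(nodes.size()));
    std::vector<int> result;
    int first = n - nodes[n].subtree_size + 1;
    for (int c = n - 1; c >= first; c -= nodes[c].subtree_size)
      result.insert(result.begin(), c);
    return result;
  }
};

// Builds the nodes for `fn F();` in the same order the parser above would.
auto BuildFnF() -> SketchTree {
  SketchTree tree;
  int fn_start = tree.Start();                      // After consuming `fn`.
  tree.AddLeaf("Identifier");                       // Index 0.
  int params_start = tree.Start();                  // After consuming `(`.
  tree.AddLeaf("ParameterListEnd");                 // Index 1.
  tree.AddParent("ParameterList", params_start);    // Index 2, size 2.
  tree.AddLeaf("DeclarationEnd");                   // Index 3.
  tree.AddParent("FunctionDeclaration", fn_start);  // Index 4, size 5.
  return tree;
}
```

With this layout, `BuildFnF()` yields the same node indices and subtree sizes that the test above expects for `fn F();`, and no per-node child pointers are needed.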

+ 134 - 0
parser/parser_impl.h

@@ -0,0 +1,134 @@
+// Part of the Carbon Language project, under the Apache License v2.0 with LLVM
+// Exceptions. See /LICENSE for license information.
+// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
+
+#ifndef PARSER_PARSER_IMPL_H_
+#define PARSER_PARSER_IMPL_H_
+
+#include "diagnostics/diagnostic_emitter.h"
+#include "lexer/token_kind.h"
+#include "lexer/tokenized_buffer.h"
+#include "llvm/ADT/Optional.h"
+#include "parser/parse_node_kind.h"
+#include "parser/parse_tree.h"
+
+namespace Carbon {
+
+class ParseTree::Parser {
+ public:
+  // Parses the tokens into a parse tree, emitting any errors encountered.
+  //
+  // This is the entry point to the parser implementation.
+  static auto Parse(TokenizedBuffer& tokens, DiagnosticEmitter& de)
+      -> ParseTree;
+
+ private:
+  struct SubtreeStart;
+
+  ParseTree& tree;
+  TokenizedBuffer& tokens;
+
+  TokenizedBuffer::TokenIterator position;
+  TokenizedBuffer::TokenIterator end;
+
+  explicit Parser(ParseTree& tree_arg, TokenizedBuffer& tokens_arg)
+      : tree(tree_arg),
+        tokens(tokens_arg),
+        position(tokens.Tokens().begin()),
+        end(tokens.Tokens().end()) {}
+
+  // Requires (and asserts) that the current position matches the provided
+  // `kind`. Returns the current token and advances to the next position.
+  auto Consume(TokenKind kind) -> TokenizedBuffer::Token;
+
+  // If the current position's token matches this `kind`, returns it and
+  // advances to the next position. Otherwise returns an empty optional.
+  auto ConsumeIf(TokenKind kind) -> llvm::Optional<TokenizedBuffer::Token>;
+
+  // Adds a node to the parse tree that is fully parsed, has no children
+  // ("leaf"), and has a subsequent sibling.
+  //
+  // This sets up the next sibling of the node to be the next node in the parse
+  // tree's preorder sequence.
+  auto AddLeafNode(ParseNodeKind kind, TokenizedBuffer::Token token) -> Node;
+
+  // Composes `ConsumeIf` and `AddLeafNode`, propagating the failure case
+  // through the optional.
+  auto ConsumeAndAddLeafNodeIf(TokenKind t_kind, ParseNodeKind n_kind)
+      -> llvm::Optional<Node>;
+
+  // Marks the node `n` as having a parse error and records that the tree
+  // contains a node with a parse error.
+  auto MarkNodeError(Node n) -> void;
+
+  // Start parsing one (or more) subtrees of nodes.
+  //
+  // This returns a marker representing the start position. It also enforces
+  // that at least *some* node is added using this starting position. Multiple
+  // nodes can be added if they share a start position though.
+  auto StartSubtree() -> SubtreeStart;
+
+  // Add a node to the parse tree that potentially has a subtree larger than
+  // itself.
+  //
+  // Requires a start marker be passed to compute the size of the subtree rooted
+  // at this node.
+  auto AddNode(ParseNodeKind n_kind, TokenizedBuffer::Token t,
+               SubtreeStart& start, bool has_error = false) -> Node;
+
+  // If the current token is an opening symbol for a matched group, skips
+  // forward to one past the matched closing symbol and returns true. Otherwise,
+  // returns false.
+  auto SkipMatchingGroup() -> bool;
+
+  // Skips forward to move past the likely end of a declaration.
+  //
+  // Looks forward, skipping over any matched symbol groups, to find the next
+  // position that is likely past the end of a declaration. This is a heuristic
+  // and should only be called when skipping past parse errors.
+  //
+  // The strategy for recognizing when we have likely passed the end of a
+  // declaration:
+  // - If we get to a close curly brace, we likely ended the entire context of
+  //   declarations.
+  // - If we get to a semicolon, that should have ended the declaration.
+  // - If we get to a new line from the `skip_root` token, but with the same or
+  //   less indentation, there is likely a missing semicolon. Continued
+  //   declarations across multiple lines should be indented.
+  //
+  // If we find a semicolon based on this skipping, we try to build a parse node
+  // to represent it and will return that node. Otherwise we will return an
+  // empty optional. If `is_inside_declaration` is true (the default) we build a
+  // node that marks the end of the declaration we are inside. Otherwise we
+  // build an empty declaration node.
+  auto SkipPastLikelyDeclarationEnd(TokenizedBuffer::Token skip_root,
+                                    bool is_inside_declaration = true)
+      -> llvm::Optional<Node>;
+
+  // Parses the signature of the function, consisting of a parameter list and an
+  // optional return type. Returns the root node of the signature which must be
+  // based on the open parenthesis of the parameter list.
+  auto ParseFunctionSignature() -> Node;
+
+  // Parses a block of code: `{ ... }`.
+  //
+  // These can form the definition for a function or be nested within a function
+  // definition. These contain variable declarations and statements.
+  auto ParseCodeBlock() -> Node;
+
+  // Parses a function declaration with an optional definition. Returns the
+  // function parse node which is based on the `fn` introducer keyword.
+  auto ParseFunctionDeclaration() -> Node;
+
+  // Parses and returns an empty declaration node from a single semicolon token.
+  auto ParseEmptyDeclaration() -> Node;
+
+  // Tries to parse a declaration. If a declaration, even an empty one after
+  // skipping errors, can be parsed, it is returned. There may be parse errors
+  // even when a node is returned.
+  auto ParseDeclaration() -> llvm::Optional<Node>;
+};
+
+}  // namespace Carbon
+
+#endif  // PARSER_PARSER_IMPL_H_
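
Illustrative note (not part of the commit): the three stopping rules documented for `SkipPastLikelyDeclarationEnd` can be sketched over a simplified token list. This is a minimal sketch under stated assumptions: each token carries its line number and that line's indentation directly, and the skipping of matched symbol groups is left out. `Tok`, `SketchToken`, `SkipResult`, and `SkipPastLikelyEnd` are hypothetical names for this example only.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

enum class Tok { Semi, CloseCurly, Other };

// Hypothetical token record; the real lexer stores this differently.
struct SketchToken {
  Tok kind;
  int line;    // Line number the token appears on.
  int indent;  // Indentation column of that line.
};

enum class SkipResult { FoundSemi, HitCloseCurly, HitNewLine, HitEnd };

// Advances `pos` past the likely end of the declaration that began at
// `root`, and reports which rule stopped the scan.
auto SkipPastLikelyEnd(const std::vector<SketchToken>& toks, std::size_t root,
                       std::size_t& pos) -> SkipResult {
  assert(root < toks.size() && "The root token must exist!");
  if (pos >= toks.size())
    return SkipResult::HitEnd;

  const SketchToken& r = toks[root];
  do {
    // Rule 1: a close curly likely ends the enclosing context; stay on it.
    if (toks[pos].kind == Tok::CloseCurly)
      return SkipResult::HitCloseCurly;

    // Rule 2: a semicolon ends the declaration; step past it.
    if (toks[pos].kind == Tok::Semi) {
      ++pos;
      return SkipResult::FoundSemi;
    }

    ++pos;
    // Rule 3: keep going only while on the root's line or on a line that
    // is indented further than the root's line.
  } while (pos != toks.size() &&
           (toks[pos].line == r.line || toks[pos].indent > r.indent));

  return pos == toks.size() ? SkipResult::HitEnd : SkipResult::HitNewLine;
}
```

For example, with three non-semicolon tokens on line 1 followed by a token on line 2 at the same indentation, a scan started at the second token stops at the line-2 token and reports `HitNewLine`, which corresponds to the "likely a missing semicolon" case in the comment above.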