
Introduce two speed-of-light benchmarks. (#3270)

The goal of these kinds of benchmarks is to help calibrate other
benchmarks and expectations. They benchmark the underlying hardware
capabilities that we can't avoid, and help illustrate bounds for what is
possible. The term "speed-of-light benchmark" refers to measuring how
fast things could possibly run.

The first is a simple memory bandwidth measurement in the best case
scenario -- using `strcpy` over the buffer. This still does a minimal
number of writes to memory and examines each byte of input to see if it
is null, but can cheat in every way possible to run at the maximum speed
of hardware. To a certain extent, we never expect to get close to this
speed, but it's a good illustration of how much headroom the hardware
has available.
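
The idea can be sketched outside a benchmark harness with a bare timing
loop (a simplified illustration, not the benchmark itself; the function
name, buffer size, and iteration count are arbitrary assumptions):

```cpp
#include <chrono>
#include <cstring>
#include <string>
#include <vector>

// Roughly estimate copy bandwidth in bytes/second by timing repeated
// `strcpy` calls over a null-terminated source buffer.
auto EstimateStrCpyBandwidth(const std::string& source, int iterations)
    -> double {
  // Destination needs room for the trailing null byte.
  std::vector<char> dest(source.size() + 1);
  auto start = std::chrono::steady_clock::now();
  for (int i = 0; i < iterations; ++i) {
    // `strcpy` examines every input byte (looking for the null) and writes
    // every byte out -- the minimal per-byte work a lexer must also do.
    std::strcpy(dest.data(), source.c_str());
  }
  auto elapsed = std::chrono::duration<double>(
                     std::chrono::steady_clock::now() - start)
                     .count();
  return static_cast<double>(source.size()) * iterations / elapsed;
}
```

A real harness (like the Google Benchmark code in the diff below) also
has to defeat the optimizer with `benchmark::DoNotOptimize`; without
that, the compiler may elide some of the copies.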

The second is potentially more interesting. This illustrates how fast a
byte-by-byte dispatch loop can potentially be. It uses the technique
that I'm hoping to use in the lexer itself of guaranteed tail recursion
to achieve this with a very small code footprint. The performance of
this technique, even when running in this extremely minimal setting to
establish bounds, is hugely dependent on the number of distinct dispatch
targets, and so the benchmark covers a healthy range of target counts to
show the spread of performance that we might expect when running in a
byte-by-byte mode.
Note that we should expect the lexer to be *faster* than this
"speed-of-light" whenever it is able to lex in larger granules than
byte-wise. But for complex, dense token sequences that force looking at
every byte, this shows the "worst case" "speed-of-light" in a sense.
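
A minimal, self-contained sketch of the guaranteed-tail-recursion
dispatch technique (the names `Step`, `Stop`, and `kTable` here are
illustrative, not from the benchmark, and `[[clang::musttail]]` is a
Clang-specific attribute):

```cpp
#include <array>
#include <cstddef>

// Each entry in a 256-entry table is a handler for one byte value. A handler
// does its work and then tail-calls the handler for the *next* byte, so the
// "loop" is really a chain of indirect jumps through the table. With
// `[[clang::musttail]]` the compiler guarantees each call reuses the same
// stack frame.
using HandlerT = void (*)(const char* text, std::size_t& index,
                          std::size_t& count);
extern const std::array<HandlerT, 256> kTable;

// Handler for the null terminator: simply stop the chain.
void Stop(const char* /*text*/, std::size_t& /*index*/,
          std::size_t& /*count*/) {}

// Handler for every other byte: "process" it and dispatch on the next one.
void Step(const char* text, std::size_t& index, std::size_t& count) {
  ++count;
  ++index;
  [[clang::musttail]] return kTable[static_cast<unsigned char>(text[index])](
      text, index, count);
}

const std::array<HandlerT, 256> kTable = [] {
  std::array<HandlerT, 256> table;
  table.fill(&Step);
  table[0] = &Stop;  // The null byte ends the tail-dispatch chain.
  return table;
}();
```

Dispatching on the first byte of a string then walks the whole string
without any explicit loop, which is the shape the benchmark below
instantiates with varying numbers of distinct handler functions.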

On my recent AMD cloud VM instance, I get the following results running
the main lexer benchmark with these changes included:

```
-------------------------------------------------------------------------------------------------------------------------
Benchmark                                            Time             CPU   Iterations bytes_per_second tokens_per_second
-------------------------------------------------------------------------------------------------------------------------
BM_ValidKeywords                               3169403 ns      3169283 ns          221        188.44M/s        31.5529M/s
BM_ValidIdentifiers<1, 64, false>             12486725 ns     12486445 ns           51       117.953M/s        8.00868M/s
BM_ValidIdentifiers<1, 1, true>                3950455 ns      3950298 ns          178       72.4252M/s        25.3145M/s
BM_ValidIdentifiers<3, 5, true>               15562294 ns     15561178 ns           45       36.7712M/s        6.42625M/s
BM_ValidIdentifiers<3, 16, true>              16118656 ns     16118374 ns           44       68.0412M/s         6.2041M/s
BM_ValidIdentifiers<12, 64, true>             19116271 ns     19116258 ns           35       199.541M/s        5.23115M/s
BM_ValidMix/10/40                              7074336 ns      7073795 ns           93       140.744M/s        14.1367M/s
BM_ValidMix/25/30                              6790722 ns      6790006 ns          102       131.793M/s        14.7275M/s
BM_ValidMix/50/20                              5960514 ns      5960443 ns          118       112.594M/s        16.7773M/s
BM_ValidMix/75/10                              4325546 ns      4325556 ns          159       102.559M/s        23.1184M/s
BM_SpeedOfLightStrCpy                            24339 ns        24339 ns        29650       35.9049G/s        4.10858G/s
BM_SpeedOfLightDispatch<1>                     1756051 ns      1755800 ns          398       509.668M/s        56.9541M/s
BM_SpeedOfLightDispatch<2>                     1611973 ns      1611725 ns          436       555.228M/s        62.0453M/s
BM_SpeedOfLightDispatch<4>                     2064280 ns      2063990 ns          326       433.565M/s        48.4498M/s
BM_SpeedOfLightDispatch<8>                     2484055 ns      2483946 ns          280       360.263M/s        40.2585M/s
BM_SpeedOfLightDispatch<16>                    4550963 ns      4550894 ns          155       196.637M/s        21.9737M/s
BM_SpeedOfLightDispatch<32>                    6507077 ns      6507090 ns          107       137.523M/s        15.3679M/s
BM_SpeedOfLightDispatch<MaxDispatchTargets>    9071198 ns      9071217 ns           77       98.6499M/s        11.0239M/s
```

Even though we're not lexing anything in the speed-of-light benchmark,
the tokens-per-second measure is still meaningful because we *generated*
the token stream and know how many tokens we put into it. The dispatch
technique easily exceeds 10 million tokens/second, but we need to
do substantially better than that to lex at 10 million lines/second.
Fortunately, when the lexer is consuming more than one-byte tokens,
we're already faster than this. And the bytes-per-second numbers from
all but the worst case dispatch scenario are promising.
Chandler Carruth, 2 years ago
Commit c7e6238fa8
1 changed file with 181 additions and 0 deletions:
  toolchain/lex/tokenized_buffer_benchmark.cpp

@@ -5,6 +5,7 @@
 #include <benchmark/benchmark.h>
 
 #include <algorithm>
+#include <utility>
 
 #include "absl/random/random.h"
 #include "common/check.h"
@@ -291,6 +292,8 @@ class LexerBenchHelper {
     return result;
   }
 
+  auto source_text() -> llvm::StringRef { return source_.text(); }
+
  private:
   auto MakeSourceBuffer(llvm::StringRef text) -> SourceBuffer {
     CARBON_CHECK(fs_.addFile(filename_, /*ModificationTime=*/0,
@@ -379,5 +382,183 @@ BENCHMARK(BM_ValidMix)
     ->Args({50, 20})
     ->Args({75, 10});
 
+// This is a speed-of-light benchmark that should reflect memory bandwidth
+// (ideally) of simply reading all the source code. For speed-of-light we use
+// `strcpy` -- this both examines every byte of the input looking for a null to
+// end the copy, and also writes to a data structure of roughly the same size as
+// the input. This routine is one we expect to be *very* well optimized and give
+// a good approximation of the fastest possible lexer given the physical
+// constraints of the machine. Note that which particular source we use as input
+// here isn't especially interesting, so we just pick one and should update it
+// to reflect whatever distribution is most realistic long-term. The
+// bytes/second throughput is the important output of this routine.
+auto BM_SpeedOfLightStrCpy(benchmark::State& state) -> void {
+  std::string source =
+      RandomMixedSeq(/*symbol_percent=*/25, /*keyword_percent=*/30);
+
+  // A buffer to write the null-terminated contents of `source` into.
+  llvm::OwningArrayRef<char> buffer(source.size() + 1);
+
+  for (auto _ : state) {
+    const char* text = source.data();
+    benchmark::DoNotOptimize(text);
+    strcpy(buffer.data(), text);
+    benchmark::DoNotOptimize(buffer.data());
+  }
+
+  state.SetBytesProcessed(state.iterations() * source.size());
+  state.counters["tokens_per_second"] = benchmark::Counter(
+      NumTokens, benchmark::Counter::kIsIterationInvariantRate);
+}
+BENCHMARK(BM_SpeedOfLightStrCpy);
+
+// This is a speed-of-light benchmark that builds up a best-case byte-wise table
+// dispatch using guaranteed tail recursion. The goal is both to ensure the
+// general technique can reasonably hit the level of performance we need and to
+// establish how far from this speed of light the actual lexer currently sits.
+//
+// A major impact on the observed performance of this technique is how many
+// different functions are reached in this dispatch loop. This benchmark
+// infrastructure tries to bracket the range of performance this technique
+// affords with different numbers of dispatch target functions.
+using DispatchPtrT = auto (*)(ssize_t& index, const char* text, char* buffer)
+    -> void;
+using DispatchTableT = std::array<DispatchPtrT, 256>;
+
+template <const DispatchTableT& Table>
+auto BasicDispatch(ssize_t& index, const char* text, char* buffer) -> void {
+  *buffer = text[index];
+  ++index;
+  [[clang::musttail]] return Table[static_cast<unsigned char>(text[index])](
+      index, text, buffer);
+}
+
+template <const DispatchTableT& Table, char C>
+auto SpecializedDispatch(ssize_t& index, const char* text, char* buffer)
+    -> void {
+  CARBON_CHECK(C == text[index]);
+  *buffer = C;
+  ++index;
+  [[clang::musttail]] return Table[static_cast<unsigned char>(text[index])](
+      index, text, buffer);
+}
+
+// A sample of the symbol characters used in Carbon code. Doesn't need to be
+// perfect, as we just need to have a reasonably large # of distinct dispatch
+// functions.
+constexpr char DispatchSpecializableSymbols[] = {
+    '!', '%', '(', ')', '*', '+', ',', '-', '.', ':',
+    ';', '<', '=', '>', '?', '[', ']', '{', '}', '~',
+};
+
+// Create an array of all the characters we can specialize dispatch over --
+// [0-9A-Za-z] and the symbols above. Similar to the above symbols, doesn't need
+// to be exhaustive.
+constexpr std::array<char, 26 * 2 + 10 + sizeof(DispatchSpecializableSymbols)>
+    DispatchSpecializableChars = []() constexpr {
+      constexpr int Size = sizeof(DispatchSpecializableChars);
+      std::array<char, Size> chars = {};
+      int i = 0;
+      for (char c = '0'; c <= '9'; ++c) {
+        chars[i] = c;
+        ++i;
+      }
+      for (char c = 'A'; c <= 'Z'; ++c) {
+        chars[i] = c;
+        ++i;
+      }
+      for (char c = 'a'; c <= 'z'; ++c) {
+        chars[i] = c;
+        ++i;
+      }
+      for (char c : DispatchSpecializableSymbols) {
+        chars[i] = c;
+        ++i;
+      }
+      CARBON_CHECK(i == Size);
+      return chars;
+    }();
+
+// Instantiate a number of specialized dispatch functions for characters in the
+// array above, and assign those function addresses to the character's entry in
+// the provided table. The provided `tmp_table` is a temporary that will
+// eventually initialize the provided `Table` constant, so the constant is what
+// we propagate to the instantiated function and the temporary is the one we
+// initialize.
+template <const DispatchTableT& Table, size_t... Indices>
+constexpr auto SpecializeDispatchTable(
+    DispatchTableT& tmp_table, std::index_sequence<Indices...> /*indices*/)
+    -> void {
+  static_assert(sizeof...(Indices) <= sizeof(DispatchSpecializableChars));
+  ((tmp_table[static_cast<unsigned char>(DispatchSpecializableChars[Indices])] =
+        &SpecializedDispatch<Table, DispatchSpecializableChars[Indices]>),
+   ...);
+}
+
+// The maximum number of dispatch targets is the size of the array + 1 (for the
+// base case target).
+constexpr int MaxDispatchTargets = sizeof(DispatchSpecializableChars) + 1;
+
+// Dispatch tables with a provided number of distinct dispatch targets. There
+// will always be one additional target for the null byte to end the loop.
+template <int NumDispatchTargets>
+constexpr DispatchTableT DispatchTable = []() constexpr {
+  static_assert(NumDispatchTargets > 0, "Need at least one dispatch target.");
+  static_assert(NumDispatchTargets <= MaxDispatchTargets,
+                "Limited number of dispatch targets available.");
+
+  DispatchTableT tmp_table = {};
+  // Start with the basic dispatch target.
+  for (int i = 0; i < 256; ++i) {
+    tmp_table[i] = &BasicDispatch<DispatchTable<NumDispatchTargets>>;
+  }
+  if constexpr (NumDispatchTargets > 1) {
+    // Add additional dispatch targets from our specializable array.
+    SpecializeDispatchTable<DispatchTable<NumDispatchTargets>>(
+        tmp_table, std::make_index_sequence<NumDispatchTargets - 1>());
+  }
+  // Special case the null byte index to end the tail-dispatch.
+  tmp_table[0] =
+      +[](ssize_t& index, const char* text, char* /*buffer*/) -> void {
+    CARBON_CHECK(text[index] == '\0');
+    return;
+  };
+  return tmp_table;
+}();
+
+template <int NumDispatchTargets>
+auto BM_SpeedOfLightDispatch(benchmark::State& state) -> void {
+  std::string source =
+      RandomMixedSeq(/*symbol_percent=*/25, /*keyword_percent=*/30);
+
+  // A buffer to write to, simulating some minimal write traffic.
+  llvm::OwningArrayRef<char> buffer(source.size());
+
+  for (auto _ : state) {
+    const char* text = source.data();
+    benchmark::DoNotOptimize(text);
+
+    // Use `ssize_t` to minimize indexing overhead.
+    ssize_t i = 0;
+    // The dispatch table tail-recurses through the entire string.
+    DispatchTable<NumDispatchTargets>[static_cast<unsigned char>(text[i])](
+        i, text, buffer.data());
+    CARBON_CHECK(i == static_cast<ssize_t>(source.size()));
+
+    benchmark::DoNotOptimize(buffer.data());
+  }
+
+  state.SetBytesProcessed(state.iterations() * source.size());
+  state.counters["tokens_per_second"] = benchmark::Counter(
+      NumTokens, benchmark::Counter::kIsIterationInvariantRate);
+}
+BENCHMARK(BM_SpeedOfLightDispatch<1>);
+BENCHMARK(BM_SpeedOfLightDispatch<2>);
+BENCHMARK(BM_SpeedOfLightDispatch<4>);
+BENCHMARK(BM_SpeedOfLightDispatch<8>);
+BENCHMARK(BM_SpeedOfLightDispatch<16>);
+BENCHMARK(BM_SpeedOfLightDispatch<32>);
+BENCHMARK(BM_SpeedOfLightDispatch<MaxDispatchTargets>);
+
 }  // namespace
 }  // namespace Carbon::Lex