Parcourir la source

Add up to 10 retries to our test action. (#3401)

This captures the exit code of Bazel and checks for success or permanent
errors on each attempt. It also sleeps a small amount between attempts.
We should be able to increase the retries and sleeps as needed to
minimize flakiness here, and Bazel should even persist incremental
progress efficiently. Hopefully this helps reduce the failure rate of
our CI.

It also changes how we build on a `push` to use a single Bazel clause to
hold this logic.

Managed to get one of the download failures when testing this, and the
retry logic worked but there was a bug in the success logic.

Otherwise seems to work:
- Synthetic failure:
https://github.com/carbon-language/carbon-lang/actions/runs/6872431707/job/18690894069
- Success:
https://github.com/carbon-language/carbon-lang/actions/runs/6872461061

---------

Co-authored-by: josh11b <josh11b@users.noreply.github.com>
Chandler Carruth il y a 2 ans
Parent
commit
49d46bd8a4
1 fichiers modifiés avec 45 ajouts et 9 suppressions
  1. 45 9
      .github/workflows/tests.yaml

+ 45 - 9
.github/workflows/tests.yaml

@@ -215,15 +215,13 @@ jobs:
       # Build and run all targets on branch pushes to ensure we always have a
       # clean tree. We don't expect this to be an interactive path and so don't
       # optimize the latency of this step.
-      - name: Test (${{ matrix.build_mode }})
+      - name: Compute impacted pull request targets (for push)
         if:
           steps.filter.outputs.ignore == 'false' && github.event_name == 'push'
         env:
-          # 'libtool_check_unique failed to generate' workaround.
-          # https://github.com/bazelbuild/bazel/issues/14113#issuecomment-999794586
-          BAZEL_USE_CPP_ONLY_TOOLCHAIN: 1
+          TARGETS_FILE: ${{ runner.temp }}/targets
         run: |
-          bazelisk test -c ${{ matrix.build_mode }} //...
+          echo "//..." >$TARGETS_FILE
 
       # Compute the set of possible rules impacted by this change using
       # Bazel-based diffing. This lets PRs and the merge queue have a much more
@@ -248,8 +246,7 @@ jobs:
 
       # Build and run just the tests impacted by the PR or merge group.
       - name: Test (${{ matrix.build_mode }})
-        if:
-          steps.filter.outputs.ignore == 'false' && github.event_name != 'push'
+        if: steps.filter.outputs.ignore == 'false'
         env:
           # 'libtool_check_unique failed to generate' workaround.
           # https://github.com/bazelbuild/bazel/issues/14113#issuecomment-999794586
@@ -261,8 +258,47 @@ jobs:
           # build, so simply inject an explicit no-op test target.
           echo "//scripts:no_op_test" >> $TARGETS_FILE
 
-          bazelisk test -c ${{ matrix.build_mode }} \
-            --target_pattern_file=$TARGETS_FILE
+          for i in {1..5}; do
+            if (( $i == 4 )); then
+              # Decrease the jobs sharply if we see repeated failures to try to
+              # work around transient network errors even if it makes things
+              # slower.
+              echo "build --jobs=4" >>user.bazelrc
+            fi
+
+            bazel_exit=0
+            bazelisk test -c ${{ matrix.build_mode }} \
+              --target_pattern_file=$TARGETS_FILE || bazel_exit=$?
+
+            # If we succeed, we're done.
+            if (( $bazel_exit == 0 )); then
+              break
+            fi
+
+            # Several error codes are reliably permanent, break immediately.
+            # `1`  -- The build failed.
+            # `2`  -- Command line or environment problem.
+            # `3`  -- Tests failed or timed out, we don't retry at this layer
+            #         on execution timeout.
+            # `4`  -- No tests found, which should be impossible here.
+            # `8`  -- Explicitly interrupted build.
+            #
+            # Note that `36` is documented as "likely permanent", but we retry
+            # it as most of our transient failures actually produce that error
+            # code.
+            if (( $bazel_exit == 1 || $bazel_exit == 2 || $bazel_exit == 3 || \
+                  $bazel_exit == 4 || $bazel_exit == 8 || $bazel_exit == 8 ))
+            then
+              break
+            fi
+
+            echo "Retrying a failed build as it may be transient..."
+            # Also sleep a bit to try to skip over transient machine load.
+            sleep $i
+          done
+
+          # Propagate the Bazel exit code.
+          exit $bazel_exit
 
       # See "Disk space before build".
       - name: Disk space after build