PortableRunner tests: surface worker-thread exceptions on main thread after wait_until_finish() (fixes #35211) #36485
Conversation
Codecov Report
✅ All modified and coverable lines are covered by tests.
@@ Coverage Diff @@
## master #36485 +/- ##
============================================
- Coverage 55.07% 55.07% -0.01%
Complexity 1666 1666
============================================
Files 1059 1059
Lines 165387 165387
Branches 1190 1190
============================================
- Hits 91088 91083 -5
- Misses 72124 72129 +5
Partials 2175 2175
Flags with carried forward coverage won't be shown.
Checks are failing. Will not request review until checks are succeeding. If you'd like to override that behavior, comment.
Assigning reviewers: R: @damccorm for label python. Note: If you would like to opt out of this review, comment.
The PR bot will only process comments in the main thread (not review comments).
Reviewers are already assigned to this PR: @damccorm
with self.assertRaisesRegex(Exception, 'Failed assert'):
  with self.create_pipeline() as p:
    assert_that(p | beam.Create(['a', 'b']), equal_to(['a']))
with patch_portable_runner_for_test():  # pylint: disable=not-context-manager
Thanks - this is a good find.
With that said, I think that fixing the test is probably the wrong move here. If this test is flaky, it indicates that the underlying runner is not doing the right thing (the runner should fail this test every time without any patches). So I think we need to address the behavior in the underlying runner, not in the test.
Totally agree that the issue isn't the test.
I changed PortableRunner.wait_until_finish() so it re-raises worker errors instead of just logging them on the message stream. Before, failures could get buried in logs and the timing made the run look “successful” sometimes, which caused the flake. Now the error bubbles up, the job fails deterministically, and the test passes without any test changes.
The changes are in portable_runner.py, plus a tiny companion tweak in local_job_service.py.
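To illustrate the idea, here is a minimal, self-contained sketch of a job handle that records the first error seen by its message-stream reader thread and re-raises it on the main thread from wait_until_finish(). This is not Beam's actual implementation; the class and method names mirror the runner's API shape but the internals are hypothetical.

```python
# Hypothetical sketch: re-raise worker-thread errors from wait_until_finish()
# instead of only logging them, so the job fails deterministically.
import threading


class PipelineResult:
    """Minimal stand-in for a portable runner's job handle."""

    def __init__(self):
        self._error = None  # first error captured from the message stream
        self._done = threading.Event()

    def _on_message(self, message):
        # Called from the message-stream reader thread. Previously an error
        # here might only be logged; now we remember it so the main thread
        # can surface it.
        if isinstance(message, Exception) and self._error is None:
            self._error = message
        self._done.set()

    def wait_until_finish(self):
        self._done.wait()
        if self._error is not None:
            # Re-raise on the main thread instead of burying it in logs.
            raise RuntimeError('Pipeline failed') from self._error
        return 'DONE'
```

With this shape, a failure posted from any background thread makes wait_until_finish() raise on the caller's thread, which is exactly what lets assertRaisesRegex in the test observe it reliably.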
damccorm left a comment:
Letting the remaining tests run before merging, but this generally looks good to me - thanks! This is a nice find
Thanks for reviewing!
It seems the current failures are unrelated to the modifications in this PR. If I’ve overlooked something, please let me know and I’ll address it promptly.
Context / Issue
In PortableRunnerTest::test_assert_that, a failing assert_that raised on a worker thread was only surfaced as a PytestUnhandledThreadExceptionWarning, so the test could pass intermittently.
What this change does
- Captures worker-thread exceptions via threading.excepthook during the test run.
- Wraps beam.Pipeline.run(...).wait_until_finish() so that any captured worker-thread exception is rethrown on the main thread at the end of the run.
- Updates test_assert_that to use this shim and assert the expected failure via assertRaisesRegex, turning the flaky pass into a deterministic failure for the negative case.
Why
Ensures the failure path for assert_that is exercised reliably and reported as a proper test failure, rather than a warning emitted from a background thread.
Summary
The failure originally occurred on a worker thread rather than the main thread, causing pytest to emit a PytestUnhandledThreadExceptionWarning instead of failing the test. This fix intercepts those background-thread exceptions via threading.excepthook and re-throws them on the main thread after wait_until_finish(), so pytest now treats them as proper test failures. The warning no longer appears because the exception is no longer “unhandled,” eliminating the nondeterministic pass/fail behavior.
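The capture-and-rethrow mechanism described above can be sketched as a small context manager. This is an illustrative sketch, not the PR's actual helper: the name rethrow_worker_exceptions is hypothetical, but threading.excepthook (Python 3.8+) is the real hook it relies on.

```python
# Hypothetical sketch: record uncaught worker-thread exceptions via
# threading.excepthook, then re-raise the first one on the main thread
# once the guarded block (e.g. a pipeline run) completes.
import contextlib
import threading


@contextlib.contextmanager
def rethrow_worker_exceptions():
    captured = []
    previous_hook = threading.excepthook

    def hook(args):
        # Runs on the failing worker thread; just record the exception.
        captured.append(args.exc_value)

    threading.excepthook = hook
    try:
        yield
        if captured:
            # Surface the first worker failure on the main thread, so
            # pytest sees a real failure instead of a warning.
            raise captured[0]
    finally:
        threading.excepthook = previous_hook
```

Because the exception is re-raised on the main thread, it is no longer "unhandled" from pytest's perspective, which is why the PytestUnhandledThreadExceptionWarning disappears.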
Verification
Stress run x200 (no flakes):
Result:
200/200 passes; no thread-exception warnings.