mixedbrain: exercise standalone Nexus in mixed-version cluster#10251
Open
stephanos wants to merge 21 commits into
…10201)

## Summary
- Fix `schedule_action_delay` for CHASM schedules: `DesiredTime` is nil for most starts (only set when blocked behind overlap), causing the metric to record ~56 years (now minus epoch). Use `cmp.Or(start.DesiredTime, start.ActualTime)` to fall back to `ActualTime`, matching V1 behavior.
- Add `schedule_generate_latency` timer metric to measure the delay between when a scheduled action was due and when the generator buffered it. Only recorded for non-manual (non-backfill) actions.
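The fallback described above can be sketched in isolation. This is a minimal illustration, not the code from the commit: it uses `*time.Time` in place of whatever timestamp type the schedule start actually carries, and `effectiveStart` and `at` are hypothetical names. `cmp.Or` (Go 1.22+) returns its first non-zero argument, and a nil pointer counts as zero.

```go
package main

import (
	"cmp"
	"time"
)

// at is a hypothetical helper that returns a pointer to a fixed instant.
func at(sec int64) *time.Time {
	t := time.Unix(sec, 0)
	return &t
}

// effectiveStart returns desired when it is set and actual otherwise.
// Assumption: both arguments stand in for start.DesiredTime and
// start.ActualTime, modelled here as plain *time.Time values.
func effectiveStart(desired, actual *time.Time) *time.Time {
	// cmp.Or yields the first argument that is not the zero value (nil here).
	return cmp.Or(desired, actual)
}
```

With a nil desired time the delay is then measured from the actual start rather than from the Unix epoch, which is what produced the ~56-year readings.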
## Summary
- Emit the `EventBlobSize` for `UpdateWorkflowExecution` requests, tagged with `namespace`.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## What
Deduplicates `CancelOutstandingWorkerPolls` RPCs by destination matching host during `ShutdownWorker`. Uses `Route()` on the matching client to determine which host each partition maps to, then sends only one RPC per unique host instead of one per partition.

## Why
With N partitions across H matching hosts (H << N), the current code sends N RPCs per task type when H would suffice, because the RPC cancels all pollers for the `workerInstanceKey` on the target host regardless of which partition was used for routing.

## How did you test it?
Unit test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
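The deduplication idea can be sketched as follows. This is an illustration under stated assumptions: `uniqueHosts` is a hypothetical name, and the `route` callback is a stand-in for the matching client's `Route()`, whose real signature is not shown in this PR.

```go
package main

import "sort"

// uniqueHosts resolves every partition to a host through route and returns
// each distinct host exactly once, sorted so the result is deterministic.
// Assumption: route is a simplified stand-in for the matching client's Route().
func uniqueHosts(partitions []string, route func(partition string) string) []string {
	seen := make(map[string]struct{}, len(partitions))
	var hosts []string
	for _, p := range partitions {
		h := route(p)
		if _, ok := seen[h]; ok {
			continue // this host already receives an RPC
		}
		seen[h] = struct{}{}
		hosts = append(hosts, h)
	}
	sort.Strings(hosts)
	return hosts
}
```

A caller would then issue one cancel RPC per returned host, so the RPC count drops from N to H.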
## What
Clear `StartedClock` on activity retry/pause. To do this, refactored the code that clears per-attempt fields into a single `ClearActivityStartedState` helper, and updated all code paths.

## Why
`StartedClock` is a per-attempt field introduced in #9233 to reconstruct task tokens for cancel worker commands. It was not being cleared when the activity leaves the started state (retry or pause), leaving a stale value during backoff. This can cause cancel commands to be unnecessarily dispatched for activities not currently running on any worker.

## How did you test it?
- Unit tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed?
Replaces (almost) all uses of `s.Eventually` with `s.EventuallyWithT`.

## Why?
Assertions are often used inside `s.Eventually` here, and that's not safe as it aborts the test immediately.
31e9b97 to 4757265
## What changed?
Use a metrics handler without `header_callsite` tag because Prometheus rejects re-registering the same metric with a different label set and logs `error in prometheus reporter ... event_blob_size ... has different label names`.

## Why?
Fix found from regression

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Fixes a regression introduced in #10223
## What changed?
Added a short protorequire package subsection to docs/development/testing.md documenting protorequire.ProtoEqual and the new protorequire.IgnoreFields option, with a minimal usage example.

## Why?
Follow-up to PR #9937. Without a doc entry, the new IgnoreFields helper is undiscoverable and contributors will keep reaching for the verbose cmp.Diff pattern.

## How did you test it?
- [X] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
4757265 to 3565bde
## What
Add `SyncMatchOutcome` enum to the hooks API (NotMatched, Success, RateLimited) and plumb rate limiting signal from the matcher through to hooks. Keep `IsSyncMatch` as deprecated for backwards compatibility.

## Why
Hook consumers (e.g. scaling operators) need to distinguish rate limiting from genuine lack of pollers when deciding whether to scale up workers.

## How did you test it?
Unit tests: rate-limited and non-rate-limited scenarios, multiple hooks invocation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
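To illustrate why a three-valued outcome helps hook consumers, here is a rough sketch. Only the three outcome names come from the description above. The Go identifiers, the `String` method, and the `shouldScaleUp` policy are assumptions made for illustration and may not match the actual hooks API.

```go
package main

// SyncMatchOutcome is a hypothetical rendering of the enum named in the
// description; the real constant names in the hooks API may differ.
type SyncMatchOutcome int

const (
	NotMatched  SyncMatchOutcome = iota // no poller was available
	Success                             // task was handed to a waiting poller
	RateLimited                         // dispatch was held back by a rate limit
)

// String returns a readable name for logging.
func (o SyncMatchOutcome) String() string {
	switch o {
	case NotMatched:
		return "NotMatched"
	case Success:
		return "Success"
	case RateLimited:
		return "RateLimited"
	default:
		return "Unknown"
	}
}

// shouldScaleUp is a hypothetical consumer policy: add workers only when the
// task found no poller, not when it was merely rate limited.
func shouldScaleUp(o SyncMatchOutcome) bool {
	return o == NotMatched
}
```

With a boolean `IsSyncMatch` alone, the `NotMatched` and `RateLimited` cases would be indistinguishable, and a scaler following this policy would add workers that the rate limit would leave idle.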
## What changed?
Log message for nexus operation cancellation invocation

## Why?
Makes the log distinct from errors during operation invocation.

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Reverting dropping tasks when feature flag disabled -> returning error
Interface change for active or not based on business id

## Why?
regression

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Adds TestPGXSimpleProtocol, a new entry in the persistence integration suite that runs `PostgreSQLSuite` under the `postgres12_pgx` plugin with `default_query_exec_mode=simple_protocol`.

## Why?
Regression coverage for issues like [#9804](#9804). With pgx ≤ v5.9.1, current_executions.state/status (proto-enum-typed int32 fields with a String() method) were text-encoded via fmt.Stringer and rejected by `Postgres` on simple/exec protocol, the path users land on behind PgBouncer in transaction pooling. pgx v5.9.2 fixed it upstream; this test makes sure we notice if pgx is ever downgraded or if a similar issue sneaks in.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [X] added new functional test(s)

Verified locally: passes on pgx v5.9.2; fails reproducibly on v5.9.1 with invalid input syntax for type integer: "Created" matching [#9804](#9804).

## Potential risks
Adds one additional pass through PostgreSQLSuite to the Integration test job. Job timeout is 15 min, so should be fine, but worth observing.
## What changed?
Add flag to skip setting up ES cluster settings. Eg: `temporal-elasticsearch-tool setup-schema --skip-cluster-settings`

## Why?
Allow users to not use our provided cluster settings (eg: avoid overwriting their cluster settings). #9857

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
a67de50 to b806468
## What changed?
Revert `OperatorRateBurstImpl.Burst()` to `baseRateBurstFn.Burst()`.

## Why?
We don't have to reduce the burst value for operator priority

## How did you test it?
- [x] built
- [x] covered by existing tests
## What changed?
Added validation of the user metadata in the `StartNexusOperationExecutionRequest`.

## Why?
All other fields are validated.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Introduces shared constants for default long poll timeout/buffer and applies them to SAA and SANO.

## Why?
After a long internal technical discussion, 60s was determined as the default for the timeout.
## What changed?
New counter for dispatched tasks, tagged with their dispatch result.

## Why?
Recent debugging around incidents has made us desire this information.
## What changed?
Add `WithTags(metrics.CommandType(Unspecified))` to the call sites for Workflow Update event_blob_size metric emission, which was missing from the patch (#10253) for the problem introduced in #10223.

## Why?
Fixes a warning and dropped metric when the metrics handler has different shapes

## How did you test it?
- [ ] built
- [X] run locally and tested manually: Ran server with `make start`, created a simple workflow with an update handler, started the worker, started a workflow, and sent the update. Grepped server logs for `error in prometheus reporter` and `event_blob_size` and curled the metrics endpoint `curl -s http://127.0.0.1:8000/metrics > /tmp/snapshot.txt`
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal, fixing bug
When a payload with json/plain encoding had additional metadata fields, those fields were silently dropped during nexus serialization. Fall back to x-temporal-payload to preserve the full payload, matching the behavior already in place for other encodings.
…10229)

## What
Extend `common/config/config_template_embedded.yaml` with (1) SQLite option (2) option to override `rpcAddress` and/or `httpAddress`

## Why
https://github.com/temporalio/omes needs to be able to start a Temporal server with sqlite and custom addresses. See temporalio/omes#348
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
b806468 to e224cd9
e224cd9 to 43498ac
Drives omes' standalone-Nexus path through a two-node cluster pairing current against the previous-minor OSS release (TestMixedBrainOSS) or the latest non-rc cloud release (TestMixedBrainCloud).

Frontend traffic flows through a round-robin TCP proxy whose connections rotate every 15s, forcing gRPC clients to redial and re-pin to a different backend periodically. Without rotation, HTTP/2 multiplexing pins every RPC to one frontend for the whole run, masking the new-on-current-only RPCs (StartNexusOperationExecution) behind a single backend's behavior.

LogMonitor (omes/devserver) is wired with MustNotMatch on the current server's log so any "failed assertion: " soft-asserts fail the test. Release-side soft-asserts are out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
43498ac to e3fad36
Summary
Adds a mixed-version validation step for the standalone Nexus feature
landing in temporalio/omes#339. Builds on #10216
(stephanos/omes-server-dl), so the omes devserver integration is its
base.
Concretely, mixedbrain now:
via `devserver.Options.DynamicConfigValues`.
throughput_stress scenario, so half the requests target the current
server (which serves them) and the other half target the release
server (which returns `Unimplemented` — tolerated by omes#339).
cloned omes binary has the `Unimplemented` handling.
This is a stacked PR: #10216 must merge first.
Test plan
🤖 Generated with Claude Code