mixedbrain: exercise standalone Nexus in mixed-version cluster#10251

Open
stephanos wants to merge 21 commits into stephanos/omes-server-dl from stephanos/sano-mixed-brain-test

Conversation

@stephanos
Contributor

Summary

Adds a mixed-version validation step for the standalone Nexus feature
landing in temporalio/omes#339. Builds on #10216
(stephanos/omes-server-dl), so the omes devserver integration is its
base.

Concretely, mixedbrain now:

  • Enables `nexusoperation.enableStandalone` on the current-source server
    via `devserver.Options.DynamicConfigValues`.
  • Passes `--option include-standalone-nexus=true` to the omes
    throughput_stress scenario, so half the requests target the current
    server (which serves them) and the other half target the release
    server (which returns `Unimplemented` — tolerated by omes#339).
  • Bumps the omes go.mod pin to stephanos/sano-load-testing's tip so the
    cloned omes binary has the `Unimplemented` handling.

This is a stacked PR: #10216 must merge first.

Test plan

  • `cd tests/mixedbrain && go build ./...` clean
  • `cd tests/mixedbrain && go test -run TestResolveReleaseVersion` clean
  • CI's Mixed brain test job passes against postgres12

🤖 Generated with Claude Code

chaptersix and others added 5 commits May 12, 2026 15:07
…10201)

## Summary
- Fix `schedule_action_delay` for CHASM schedules: `DesiredTime` is nil
for most starts (only set when blocked behind overlap), causing the
metric to record ~56 years (now minus epoch). Use
`cmp.Or(start.DesiredTime, start.ActualTime)` to fall back to
`ActualTime`, matching V1 behavior.
- Add `schedule_generate_latency` timer metric to measure the delay
between when a scheduled action was due and when the generator buffered
it. Only recorded for non-manual (non-backfill) actions.
## Summary

- Emit the `EventBlobSize` for `UpdateWorkflowExecution` requests, tagged
with `namespace`.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## What
Deduplicates `CancelOutstandingWorkerPolls` RPCs by destination matching
host during `ShutdownWorker`. Uses `Route()` on the matching client to
determine which host each partition maps to, then sends only one RPC per
unique host instead of one per partition.

## Why
With N partitions across H matching hosts (H << N), the current code
sends N RPCs per task type when H would suffice — the RPC cancels all
pollers for the `workerInstanceKey` on the target host regardless of
which partition was used for routing.
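
The deduplication can be sketched as follows. This is an illustration only: `uniqueHosts` and the `route` callback are hypothetical stand-ins for the matching client's `Route()` lookup, and the partition and host names are made up.

```go
package main

import (
	"fmt"
	"sort"
)

// uniqueHosts resolves every partition to its host via route and returns
// each host exactly once, so the caller can send one RPC per host.
func uniqueHosts(partitions []string, route func(string) (string, error)) ([]string, error) {
	seen := make(map[string]struct{}, len(partitions))
	var hosts []string
	for _, p := range partitions {
		h, err := route(p)
		if err != nil {
			return nil, err
		}
		if _, ok := seen[h]; ok {
			continue // this host already gets an RPC
		}
		seen[h] = struct{}{}
		hosts = append(hosts, h)
	}
	sort.Strings(hosts) // stable order for logging and tests
	return hosts, nil
}

func main() {
	partitions := []string{"tq/0", "tq/1", "tq/2", "tq/3"}
	// Hypothetical routing: even partitions on host-a, odd on host-b.
	route := func(p string) (string, error) {
		if (p[len(p)-1]-'0')%2 == 0 {
			return "host-a", nil
		}
		return "host-b", nil
	}
	hosts, err := uniqueHosts(partitions, route)
	fmt.Println(hosts, err)
}
```

With four partitions on two hosts, the sketch yields two targets instead of four.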

## How did you test it?
Unit test

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What

Clear `StartedClock` on activity retry/pause. To do this, the code that
clears per-attempt fields was refactored into a single
`ClearActivityStartedState` helper, and all code paths were updated to
use it.

## Why

`StartedClock` is a per-attempt field introduced in #9233 to reconstruct
task tokens for cancel worker commands. It was not being cleared when
the activity leaves the started state (retry or pause), leaving a stale
value during backoff. This can cause cancel commands to be unnecessarily
dispatched for activities not currently running on any worker.

## How did you test it?

- Unit tests

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed?

Replaces (almost) all use of `s.Eventually` with `s.EventuallyWithT`.

## Why?

Assertions are often used inside `s.Eventually` callbacks here, which is
not safe: a failed assertion aborts the test immediately instead of
letting the condition be retried.
@stephanos stephanos requested review from a team as code owners May 13, 2026 19:02
@stephanos stephanos force-pushed the stephanos/sano-mixed-brain-test branch from 31e9b97 to 4757265 Compare May 13, 2026 21:40
spkane31 and others added 2 commits May 13, 2026 22:24
## What changed?
Use a metrics handler without `header_callsite` tag because Prometheus
rejects re-registering the same metric with a different label set and
logs `error in prometheus reporter ... event_blob_size ... has different
label names`.

## Why?
Fixes an issue found as a regression.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Fixes a regression introduced in #10223
## What changed?
Added a short protorequire package subsection to
docs/development/testing.md documenting protorequire.ProtoEqual and the
new protorequire.IgnoreFields option, with a minimal usage example.

## Why?
Follow-up to PR #9937. Without a doc entry, the new IgnoreFields helper
is undiscoverable and contributors will keep reaching for the verbose
cmp.Diff pattern.


## How did you test it?
- [X] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
@stephanos stephanos force-pushed the stephanos/sano-mixed-brain-test branch from 4757265 to 3565bde Compare May 14, 2026 01:50
rkannan82 and others added 5 commits May 13, 2026 20:10
## What
Add `SyncMatchOutcome` enum to the hooks API (NotMatched, Success,
RateLimited) and plumb rate limiting signal from the matcher through to
hooks. Keep `IsSyncMatch` as deprecated for backwards compatibility.

## Why
Hook consumers (e.g. scaling operators) need to distinguish rate
limiting from genuine lack of pollers when deciding whether to scale up
workers.
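
A hedged sketch of how a hook consumer might use the three outcomes. The type layout, numeric values, and the `shouldScaleUp` policy are assumptions for illustration; only the outcome names come from the description above.

```go
package main

import "fmt"

// SyncMatchOutcome mirrors the three outcomes named in the description.
// The underlying type and ordering are assumptions.
type SyncMatchOutcome int

const (
	NotMatched SyncMatchOutcome = iota
	Success
	RateLimited
)

// shouldScaleUp is a hypothetical consumer policy: only a genuine lack of
// pollers suggests adding workers. A rate-limited task would not be
// dispatched faster by more workers.
func shouldScaleUp(o SyncMatchOutcome) bool {
	return o == NotMatched
}

func main() {
	for _, o := range []SyncMatchOutcome{NotMatched, Success, RateLimited} {
		fmt.Println(o, shouldScaleUp(o))
	}
}
```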

## How did you test it?
Unit tests — rate-limited and non-rate-limited scenarios, multiple hooks
invocation.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

---------

Co-authored-by: Claude Opus 4.6 <noreply@anthropic.com>
## What changed?
Log message for nexus operation cancellation invocation

## Why?
Makes the log distinct from errors during operation invocation. 

## How did you test it?
- [x] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Reverts dropping tasks when the feature flag is disabled; an error is
returned instead.

Changes the interface for determining active status based on business ID.

## Why?
regression

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)
## What changed?
Adds TestPGXSimpleProtocol — a new entry in the persistence integration
suite that runs `PostgreSQLSuite` under the `postgres12_pgx` plugin with
`default_query_exec_mode=simple_protocol`.

## Why?
Regression coverage for issues like
#9804. With pgx ≤
v5.9.1, current_executions.state/status (proto-enum-typed int32 fields
with a String() method) were text-encoded via fmt.Stringer and rejected
by `Postgres` on simple/exec protocol, the path users land on behind
PgBouncer in transaction pooling. pgx v5.9.2 fixed it upstream; this
test makes sure we notice if pgx is ever downgraded or if a similar
issue sneaks in.

## How did you test it?
- [X] built
- [X] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [X] added new functional test(s)

Verified locally: passes on pgx v5.9.2; fails reproducibly on v5.9.1
with `invalid input syntax for type integer: "Created"`, matching
#9804.

## Potential risks
Adds one additional pass through PostgreSQLSuite to the Integration test
job. Job timeout is 15 min, so should be fine, but worth observing.
## What changed?
Add a flag to skip setting up ES cluster settings, e.g.:
`temporal-elasticsearch-tool setup-schema --skip-cluster-settings`

## Why?
Allow users to opt out of our provided cluster settings (e.g. to avoid
overwriting their own cluster settings).
#9857

## How did you test it?
- [x] built
- [x] run locally and tested manually
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
@stephanos stephanos force-pushed the stephanos/sano-mixed-brain-test branch from a67de50 to b806468 Compare May 14, 2026 19:38
prathyushpv and others added 8 commits May 14, 2026 20:29
## What changed?
Revert `OperatorRateBurstImpl.Burst()` to `baseRateBurstFn.Burst()`.

## Why?
We don't have to reduce the burst value for operator priority

## How did you test it?
- [x] built
- [x] covered by existing tests
## What changed?

Added validation of the user metadata in the
`StartNexusOperationExecutionRequest`.

## Why?

All other fields are validated.

## How did you test it?
- [ ] built
- [ ] run locally and tested manually
- [ ] covered by existing tests
- [x] added new unit test(s)
- [ ] added new functional test(s)
## What changed?

Introduces shared constants for default long poll timeout/buffer and
applies them to SAA and SANO.

## Why?

After a long internal technical discussion, 60s was determined as the
default for the timeout.
## What changed?
New counter for dispatched tasks, tagged with their dispatch result.

## Why?
Recent incident debugging showed that we need this information.
## What changed?
Add `WithTags(metrics.CommandType(Unspecified))` to the call sites for
Workflow Update event_blob_size metric emission, which was missing in
the patch #10253 for the problem introduced in #10223.

## Why?
Fixes a warning and dropped metric when the metrics handler has
different shapes

## How did you test it?
- [ ] built
- [X] run locally and tested manually: Ran server with `make start`,
created a simple workflow with an update handler, start the worker,
start a workflow, and send the update. Grepped server logs for `error in
prometheus reporter` and `event_blob_size` and curled the metrics
endpoint `curl -s http://127.0.0.1:8000/metrics > /tmp/snapshot.txt`
- [ ] covered by existing tests
- [ ] added new unit test(s)
- [ ] added new functional test(s)

## Potential risks
Minimal, fixing bug
When a payload with json/plain encoding had additional metadata fields,
those fields were silently dropped during nexus serialization. Fall back
to x-temporal-payload to preserve the full payload, matching the
behavior already in place for other encodings.
…10229)

## What

Extend `common/config/config_template_embedded.yaml` with

(1) SQLite option
(2) option to override `rpcAddress` and/or `httpAddress`

## Why

https://github.com/temporalio/omes needs to be able to start a Temporal
server with sqlite and custom addresses.

See temporalio/omes#348
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stephanos stephanos force-pushed the stephanos/sano-mixed-brain-test branch from b806468 to e224cd9 Compare May 15, 2026 00:10
@stephanos stephanos requested review from a team as code owners May 15, 2026 00:10
@stephanos stephanos force-pushed the stephanos/sano-mixed-brain-test branch from e224cd9 to 43498ac Compare May 15, 2026 00:24
Drives omes' standalone-Nexus path through a two-node cluster pairing
current against the previous-minor OSS release (TestMixedBrainOSS) or
the latest non-rc cloud release (TestMixedBrainCloud).

Frontend traffic flows through a round-robin TCP proxy whose
connections rotate every 15s, forcing gRPC clients to redial and
re-pin to a different backend periodically — without rotation,
HTTP/2 multiplexing pins every RPC to one frontend for the whole
run, masking the new-on-current-only RPCs (StartNexusOperationExecution)
behind a single backend's behavior.

LogMonitor (omes/devserver) is wired with MustNotMatch on the current
server's log so any "failed assertion: " soft-asserts fail the test.
Release-side soft-asserts are out of scope.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@stephanos stephanos force-pushed the stephanos/sano-mixed-brain-test branch from 43498ac to e3fad36 Compare May 15, 2026 00:28