Skip to content

Memberlist: prevent accidental cross-cluster joins with cluster labels#7385

Open
sandy2008 wants to merge 6 commits intocortexproject:masterfrom
sandy2008:sandy/memberlist-cluster-label
Open

Memberlist: prevent accidental cross-cluster joins with cluster labels#7385
sandy2008 wants to merge 6 commits intocortexproject:masterfrom
sandy2008:sandy/memberlist-cluster-label

Conversation

@sandy2008
Copy link
Copy Markdown
Contributor

@sandy2008 sandy2008 commented Mar 30, 2026

Summary

This fixes a Cortex gap in the memberlist integration: the vendored memberlist library already supports cluster labels, but Cortex did not expose that functionality in its own config. As a result, separate deployments that could reach the same gossip seed addresses could accidentally join the same memberlist cluster.

This change:

  • exposes memberlist.cluster-label and memberlist.cluster-label-verification-disabled in Cortex and wires them into memberlist
  • adds unit tests for same-label clustering, mixed-label isolation, and migration mode with inbound verification disabled
  • adds single-binary integration coverage for cross-cluster isolation and two-phase rollout migration
  • documents rollout guidance and updates example configs

Note: The doc change in docs/guides/migration-kv-store-to-memberlist.md also fixes a pre-existing typo (abort_if_join_fails to abort_if_cluster_join_fails), which is unrelated to the cluster label feature itself.

Security Note

Cluster labels are transmitted in plaintext over the gossip protocol. They are not a security boundary, only an operational safety net. An attacker on the network could still join by matching the label. This is documented in the migration guide.

Why

Without this, Cortex operators have no built-in way to prevent accidental cross-cluster memberlist merges caused by shared seed addresses or cross-namespace discovery. This is the same class of problem described in Mimir issue #6187.

Verification

  • go test ./pkg/ring/kv/memberlist -run "Test(BuildMemberlistConfigClusterLabelOptions|MultipleClients)"
  • go test -v -tags=integration,integration_memberlist -run "TestSingleBinaryWithMemberlistClusterLabel(Isolation|MigrationPhases)$" -count=1 ./integration/...

@sandy2008 sandy2008 changed the title Memberlist: add cluster label support Memberlist: prevent accidental cross-cluster joins with cluster labels Mar 30, 2026
Copy link
Copy Markdown
Member

@CharlieTLe CharlieTLe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall Review

This is a solid, well-tested feature PR. The core implementation is minimal and correct — it properly wires Cortex config to the already-supported memberlist library fields. The test coverage (unit + integration) and documentation updates are thorough.

Main actionable items:

  1. Fix CHANGELOG category[BUGFIX][FEATURE] or [ENHANCEMENT] (see inline comment)
  2. Reconsider the log line change — the original per-RandomizeNodeName log was removed and replaced with an unconditional one; keep original behavior or document the change (see inline comment)
  3. Clean up the polling loop in the test helper (see inline comment)
  4. Minor nits on struct field ordering, test matrix coverage, and migration guide doc change noted inline

Security note: Cluster labels are transmitted in plaintext over the gossip protocol — they are not a security boundary, only an operational safety net. Consider adding a note in the docs that this is not a security mechanism; an attacker on the network could still join by matching the label.

@sandy2008
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough review @CharlieTLe! All feedback addressed in the latest push:

  1. CHANGELOG — Changed [BUGFIX][FEATURE]
  2. Log line — Restored the original per-RandomizeNodeName log and added a separate conditional log for cluster label
  3. Polling loop — Replaced busy-wait with time.Ticker + time.After
  4. getClientErr() — Acknowledged as intentional test diagnostic behavior
  5. Test matrix — Added "configured label with verification enabled" case
  6. Integration negative assertion — Added a 5s observation window + re-assertion to verify cross-cluster isolation stays stable
  7. Doc fix callout — Noted the abort_if_join_fails rename in the PR description
  8. Security note — Added a Security Note section to the PR description clarifying cluster labels are not a security boundary

@sandy2008 sandy2008 requested a review from CharlieTLe March 31, 2026 03:46
@sandy2008
Copy link
Copy Markdown
Contributor Author

@CharlieTLe Please take a look if you have time :)!

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
…, clean up polling loop, add test cases, and strengthen integration assertion

Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 force-pushed the sandy/memberlist-cluster-label branch from 9270384 to 2101e57 Compare April 1, 2026 18:52
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Copy link
Copy Markdown
Member

@CharlieTLe CharlieTLe left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Review: Memberlist Cluster Labels

Clean PR — the core change is minimal and correct, wiring two existing vendored hashicorp/memberlist config fields into Cortex's config surface. Documentation and test coverage are thorough.

Suggestions

1. Warn when verification is disabled without a label configured

An operator who sets cluster-label-verification-disabled=true without a cluster-label gets no feedback. This weakens isolation for no benefit. Consider adding a warning in buildMemberlistConfig():

if mlCfg.Label == "" && mlCfg.SkipInboundLabelCheck {
    level.Warn(m.logger).Log("msg", "cluster label verification is disabled but no cluster label is configured; this weakens isolation without benefit")
}

2. WatchKey timeout reduction in testMultipleClientsWithConfigGenerator

The refactored test changes the WatchKey context timeout from 10*time.Second to casInterval*3 (3s for the default path). Is this sufficient under CI load? The original 10s was presumably chosen for a reason.

3. time.Sleep(5 * time.Second) in integration test

In TestSingleBinaryWithMemberlistClusterLabelIsolation, the 5-second fixed sleep to verify cross-cluster isolation is fragile. On slow CI it may not be enough; on fast machines it's wasted time. Since requireMemberlistClusterState already uses WaitSumMetrics (which polls), consider polling to assert that cluster-B member counts stay at 2 over an observation window rather than sleeping a fixed duration and checking once.

4. Heavy refactor of existing test

The TestMultipleClients / testMultipleClientsWithConfigGenerator rewrite changes a lot of behavior in the same PR as the feature: t.Errorf → returned errors, added polling retry loop, inline CAS replacing the cas() helper, etc. The new code is cleaner, but mixing test infrastructure refactors with feature work makes it harder to bisect if something breaks. Consider whether this could be a separate prep PR.

5. Minor: migration guide readability

The rollout note for adding labels to an existing unlabeled cluster is quite dense as a single paragraph. A numbered step list would be easier to follow:

  1. Roll out cluster_label_verification_disabled: true everywhere (leave cluster_label empty)
  2. Roll out the shared cluster_label value
  3. Set cluster_label_verification_disabled: false

Overall this looks good — the feature is purely opt-in with safe defaults and the vendored library does all the heavy lifting. The suggestions above are mostly about robustness and review hygiene.

@sandy2008
Copy link
Copy Markdown
Contributor Author

Thanks for the thorough second review @CharlieTLe! Addressed all suggestions:

  1. Warning when verification disabled without label — Added a level.Warn log in buildMemberlistConfig() when SkipInboundLabelCheck is true but no label is configured.

  2. WatchKey timeout — Restored to 10*time.Second. Agreed that casInterval*3 (3s) was too tight for CI.

  3. time.Sleep in integration test — Kept the fixed sleep since this is negative testing (asserting nothing changes, which can't be polled for). Improved the comment to explain the reasoning.

  4. Test refactor in same PR — Acknowledged. The refactor is already interleaved with the feature changes; splitting it out at this point would require a disruptive rebase. Will keep this in mind for future PRs.

  5. Migration guide readability — Restructured the dense paragraph into numbered step lists for both fresh rollout and existing cluster scenarios.

- Warn when cluster label verification is disabled without a label configured
- Restore WatchKey timeout to 10s for CI reliability
- Improve comment explaining fixed sleep in negative integration test
- Restructure migration guide rollout steps as numbered lists

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
@sandy2008 sandy2008 requested a review from CharlieTLe April 2, 2026 00:56
@sandy2008
Copy link
Copy Markdown
Contributor Author

@CharlieTLe Done!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants