Memberlist: prevent accidental cross-cluster joins with cluster labels#7385
Memberlist: prevent accidental cross-cluster joins with cluster labels#7385sandy2008 wants to merge 6 commits intocortexproject:masterfrom
Conversation
CharlieTLe
left a comment
There was a problem hiding this comment.
Overall Review
This is a solid, well-tested feature PR. The core implementation is minimal and correct — it properly wires Cortex config to the already-supported memberlist library fields. The test coverage (unit + integration) and documentation updates are thorough.
Main actionable items:
- Fix CHANGELOG category —
[BUGFIX]→[FEATURE]or[ENHANCEMENT](see inline comment) - Reconsider the log line change — the original per-
RandomizeNodeNamelog was removed and replaced with an unconditional one; keep original behavior or document the change (see inline comment) - Clean up the polling loop in the test helper (see inline comment)
- Minor nits on struct field ordering, test matrix coverage, and migration guide doc change noted inline
Security note: Cluster labels are transmitted in plaintext over the gossip protocol — they are not a security boundary, only an operational safety net. Consider adding a note in the docs that this is not a security mechanism; an attacker on the network could still join by matching the label.
|
Thanks for the thorough review @CharlieTLe! All feedback addressed in the latest push:
|
|
@CharlieTLe Please take a look if you have time :)! |
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
…, clean up polling loop, add test cases, and strengthen integration assertion Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
9270384 to
2101e57
Compare
Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
CharlieTLe
left a comment
There was a problem hiding this comment.
Review: Memberlist Cluster Labels
Clean PR — the core change is minimal and correct, wiring two existing vendored hashicorp/memberlist config fields into Cortex's config surface. Documentation and test coverage are thorough.
Suggestions
1. Warn when verification is disabled without a label configured
An operator who sets cluster-label-verification-disabled=true without a cluster-label gets no feedback. This weakens isolation for no benefit. Consider adding a warning in buildMemberlistConfig():
if mlCfg.Label == "" && mlCfg.SkipInboundLabelCheck {
level.Warn(m.logger).Log("msg", "cluster label verification is disabled but no cluster label is configured; this weakens isolation without benefit")
}2. WatchKey timeout reduction in testMultipleClientsWithConfigGenerator
The refactored test changes the WatchKey context timeout from 10*time.Second to casInterval*3 (3s for the default path). Is this sufficient under CI load? The original 10s was presumably chosen for a reason.
3. time.Sleep(5 * time.Second) in integration test
In TestSingleBinaryWithMemberlistClusterLabelIsolation, the 5-second fixed sleep to verify cross-cluster isolation is fragile. On slow CI it may not be enough; on fast machines it's wasted time. Since requireMemberlistClusterState already uses WaitSumMetrics (which polls), consider polling to assert that cluster-B member counts stay at 2 over an observation window rather than sleeping a fixed duration and checking once.
4. Heavy refactor of existing test
The TestMultipleClients / testMultipleClientsWithConfigGenerator rewrite changes a lot of behavior in the same PR as the feature: t.Errorf → returned errors, added polling retry loop, inline CAS replacing the cas() helper, etc. The new code is cleaner, but mixing test infrastructure refactors with feature work makes it harder to bisect if something breaks. Consider whether this could be a separate prep PR.
5. Minor: migration guide readability
The rollout note for adding labels to an existing unlabeled cluster is quite dense as a single paragraph. A numbered step list would be easier to follow:
- Roll out
cluster_label_verification_disabled: trueeverywhere (leavecluster_labelempty)- Roll out the shared
cluster_labelvalue- Set
cluster_label_verification_disabled: false
Overall this looks good — the feature is purely opt-in with safe defaults and the vendored library does all the heavy lifting. The suggestions above are mostly about robustness and review hygiene.
|
Thanks for the thorough second review @CharlieTLe! Addressed all suggestions:
|
- Warn when cluster label verification is disabled without a label configured - Restore WatchKey timeout to 10s for CI reliability - Improve comment explaining fixed sleep in negative integration test - Restructure migration guide rollout steps as numbered lists Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com> Signed-off-by: Sandy Chen <Yuxuan.Chen@morganstanley.com>
|
@CharlieTLe Done! |
Summary
This fixes a Cortex gap in the memberlist integration: the vendored memberlist library already supports cluster labels, but Cortex did not expose that functionality in its own config. As a result, separate deployments that could reach the same gossip seed addresses could accidentally join the same memberlist cluster.
This change:
memberlist.cluster-labelandmemberlist.cluster-label-verification-disabledin Cortex and wires them into memberlistNote: The doc change in
docs/guides/migration-kv-store-to-memberlist.mdalso fixes a pre-existing typo (abort_if_join_failstoabort_if_cluster_join_fails), which is unrelated to the cluster label feature itself.Security Note
Cluster labels are transmitted in plaintext over the gossip protocol. They are not a security boundary, only an operational safety net. An attacker on the network could still join by matching the label. This is documented in the migration guide.
Why
Without this, Cortex operators have no built-in way to prevent accidental cross-cluster memberlist merges caused by shared seed addresses or cross-namespace discovery. This is the same class of problem described in Mimir issue #6187.
Verification
go test ./pkg/ring/kv/memberlist -run "Test(BuildMemberlistConfigClusterLabelOptions|MultipleClients)"go test -v -tags=integration,integration_memberlist -run "TestSingleBinaryWithMemberlistClusterLabel(Isolation|MigrationPhases)$" -count=1 ./integration/...