Allow dynamic configuration of ringpop replica points #9702

Open
aromanovich wants to merge 1 commit into temporalio:main from aromanovich:master

Conversation

@aromanovich
Contributor

What changed?

Added a configurable ReplicaPoints variable to the ringpop service resolver. Instead of a hardcoded constant (replicaPoints = 100), the number of virtual nodes per host is now
controlled by a var ReplicaPoints func() int that can be overridden.

When the value returned by ReplicaPoints() changes, the hash ring is rebuilt and a ChangedEvent is emitted -- even if membership itself hasn't changed.
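The rebuild-on-change behavior described above can be sketched as a minimal runnable Go program. This is an assumed shape, not the actual resolver code: the resolver compares the current `ReplicaPoints()` value against the one it last built the ring with, and rebuilds (emitting an event) when they differ, even with stable membership.

```go
package main

import "fmt"

// ReplicaPoints mirrors the overridable var from the PR; the default
// matches the old hard-coded constant.
var ReplicaPoints = func() int { return 100 }

type resolver struct {
	members    []string
	lastPoints int
	events     int // stands in for emitted ChangedEvents
}

// refresh rebuilds the ring when the replica-point count changes,
// even if membership is unchanged.
func (r *resolver) refresh() {
	if points := ReplicaPoints(); points != r.lastPoints {
		// buildRing(r.members, points) would go here in the real resolver.
		r.lastPoints = points
		r.events++
	}
}

// run exercises the sketch: one refresh with an unchanged value,
// one after an override, returning the number of events emitted.
func run() int {
	ReplicaPoints = func() int { return 100 }
	r := &resolver{members: []string{"host1:7933"}, lastPoints: 100}
	r.refresh() // same value: no rebuild
	ReplicaPoints = func() int { return 110 }
	r.refresh() // value changed: rebuild + ChangedEvent
	return r.events
}

func main() {
	fmt.Println("ChangedEvents emitted:", run())
}
```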

Why?

The default of 100 replica points is not optimal for all combinations of shard and history node counts and may lead to uneven shard distribution across history nodes.

For example, 4096 shards on 9 history nodes with 100 replica points gives a standard deviation of around 50 shards per node. Increasing replica points to 500 lowers the standard deviation to around 30 shards. The optimal replica-point count is roughly shards / nodes, which equalizes the number of "shard points" and "node points" on the ring.

For 4096 shards and 9 history nodes, the distribution improved as follows:

  • with 100 replica points: 370, 380, 408, 442, 452, 488, 510, 520, 526 (shards per node);
  • with 500 replica points: 426, 444, 446, 446, 450, 452, 458, 474, 500.
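The replica-points/std-dev relationship can be made concrete with a self-contained toy simulation (my own sketch, not ringpop code) of a consistent hash ring with virtual nodes named "{host}{i}". It uses FNV hashing, whereas ringpop uses a different hash function, so the absolute numbers will not match the ones above, but the trend (more points, lower std dev) should be visible.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"sort"
)

type point struct {
	hash uint32
	node string
}

func h(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}

// buildRing places `points` virtual nodes per host, named "{host}{i}".
func buildRing(nodes []string, points int) []point {
	var ring []point
	for _, n := range nodes {
		for i := 0; i < points; i++ {
			ring = append(ring, point{h(fmt.Sprintf("%s%d", n, i)), n})
		}
	}
	sort.Slice(ring, func(a, b int) bool { return ring[a].hash < ring[b].hash })
	return ring
}

// owner returns the node of the first ring point clockwise from the key.
func owner(ring []point, key string) string {
	x := h(key)
	i := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= x })
	if i == len(ring) {
		i = 0 // wrap around
	}
	return ring[i].node
}

// assign maps numShards shards onto nodes, returning shards-per-node counts.
func assign(numShards int, nodes []string, points int) map[string]int {
	ring := buildRing(nodes, points)
	counts := map[string]int{}
	for s := 0; s < numShards; s++ {
		counts[owner(ring, fmt.Sprintf("shard-%d", s))]++
	}
	return counts
}

// stddev computes the population standard deviation of the counts.
func stddev(counts map[string]int) float64 {
	var sum, n float64
	for _, c := range counts {
		sum += float64(c)
		n++
	}
	mean := sum / n
	var v float64
	for _, c := range counts {
		d := float64(c) - mean
		v += d * d
	}
	return math.Sqrt(v / n)
}

func main() {
	var nodes []string
	for i := 0; i < 9; i++ {
		nodes = append(nodes, fmt.Sprintf("10.0.0.%d:7933", i))
	}
	for _, p := range []int{100, 500} {
		fmt.Printf("points=%d stddev=%.1f\n", p, stddev(assign(4096, nodes, p)))
	}
}
```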

This patch allows setting a custom ringpop.ReplicaPoints (e.g. to a dynamic config value getter) before starting the server, and then gradually increasing replica points to the desired value (to avoid reshuffling too many shards at once).

How did you test it?

  • built
  • ran locally and tested manually
  • covered by existing tests
  • added new unit test(s)
  • added new functional test(s)

Potential risks

None if the var ReplicaPoints func() int is not modified.
If the server is built with an overridden ReplicaPoints, it may cause ring inconsistency across cluster nodes during the deployment process.

@aromanovich aromanovich requested review from a team as code owners March 26, 2026 21:22
Contributor

@dnr dnr left a comment


Interesting. I've done tests with up to 200 points and didn't see much improved balancing, but those numbers for using 500 look promising.

I have some concerns with the implementation though:

  1. Changing points at runtime seems unnecessary. It could potentially cause a lot of movement at once, and conflict between pods. It seems more sane to just read a value at startup and use that. The value should not have to change often.
  2. A global variable is not a great approach. First, there's no synchronization. Second, the general pattern in the server is to get config values from either static config or dynamic config. Sometimes values that are "static" are read from dynamic config for convenience. In this case I think both may be available. Maybe we can put it next to system.enableRingpopTLS in dynamic config? For things that need to be injected as code, we can use ServerOptions+fx, but this is just one piece of data, so that doesn't seem necessary.

@aromanovich
Contributor Author

If we make it a static config option, changing it will require a rolling deployment, which is slow-ish in most production environments. For example:

  • if the frontend gets deployed first and receives an increased replicaPoints value, while the history nodes own shards according to an old value, it will cause ShardOwnershipLost errors. They will be retried, but at the cost of increased latency;
  • if the history nodes are updated gradually with an increased replicaPoints value, updated nodes can start conflicting with not-yet-updated nodes for controlling relocated shards.

This disruption can be minimized:

  • by changing replicaPoints on all frontend and history nodes at roughly the same time;
  • by choosing small replicaPoints increments. Hash ring nodes are named {host:port}{i}, where i is an index from [0, replicaPoints-1]. Increasing replicaPoints just adds new points, leaving current ones in place. According to my math (and my practice with the cluster with 4096 shards and 9 history nodes), an increment from 100 to 110 points causes 8% of shards to relocate; an increment from 100 to 500 causes 70% of shards to relocate.
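The relocation math in the last bullet can be sanity-checked with a toy ring (my own sketch, using FNV instead of ringpop's hash, so the percentages will differ from the ones quoted above): growing from p1 to p2 points only adds new points, so the only shards that move are those whose nearest clockwise point is one of the new points.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

type point struct {
	hash uint32
	node string
}

func h(s string) uint32 {
	f := fnv.New32a()
	f.Write([]byte(s))
	return f.Sum32()
}

// buildRing places `points` virtual nodes per host, named "{host:port}{i}"
// as described above; increasing `points` only adds new ring points.
func buildRing(nodes []string, points int) []point {
	var ring []point
	for _, n := range nodes {
		for i := 0; i < points; i++ {
			ring = append(ring, point{h(fmt.Sprintf("%s%d", n, i)), n})
		}
	}
	sort.Slice(ring, func(a, b int) bool { return ring[a].hash < ring[b].hash })
	return ring
}

// owner returns the node of the first ring point clockwise from the key.
func owner(ring []point, key string) string {
	x := h(key)
	i := sort.Search(len(ring), func(i int) bool { return ring[i].hash >= x })
	if i == len(ring) {
		i = 0 // wrap around
	}
	return ring[i].node
}

// relocatedFraction is the share of shards whose owner changes when the
// replica-point count goes from p1 to p2 with fixed membership.
func relocatedFraction(numShards int, nodes []string, p1, p2 int) float64 {
	r1, r2 := buildRing(nodes, p1), buildRing(nodes, p2)
	moved := 0
	for s := 0; s < numShards; s++ {
		k := fmt.Sprintf("shard-%d", s)
		if owner(r1, k) != owner(r2, k) {
			moved++
		}
	}
	return float64(moved) / float64(numShards)
}

func main() {
	var nodes []string
	for i := 0; i < 9; i++ {
		nodes = append(nodes, fmt.Sprintf("10.0.0.%d:7933", i))
	}
	for _, p2 := range []int{110, 500} {
		fmt.Printf("100 -> %d points: %.1f%% of shards relocate\n",
			p2, 100*relocatedFraction(4096, nodes, 100, p2))
	}
}
```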

If we make it a dynamic config (let's say system.ringpopReplicaPoints?), it will allow one to make a series of increases that cause only brief and acceptable service degradation.

The current hacky approach (which is not great, I totally agree) is used to achieve exactly that:

import (
	"go.uber.org/fx"

	"go.temporal.io/server/common/dynamicconfig"
	"go.temporal.io/server/common/resource"
)

// RingpopReplicaPoints is the proposed setting; it does not exist upstream yet.
resource.DefaultOptions = fx.Options(resource.DefaultOptions, fx.Decorate(func(dc *dynamicconfig.Collection) *dynamicconfig.Collection {
	ringpop.ReplicaPoints = dynamicconfig.RingpopReplicaPoints.Get(dc)
	return dc
}))

// start the server

If you approve, I can implement a proper system.ringpopReplicaPoints dynamic config option.

@dnr
Contributor

dnr commented Mar 27, 2026

Note that besides history shards, matching task queues also use ringpop consistent hashing to locate owners, and they don't have the history shard linger mechanism to mitigate temporary disagreement: it'll cause flapping instead and a lot of task dispatch latency (though I suppose using excess partitions can mitigate that somewhat).

If at all possible, I would prefer to treat ringpop as a black box and not make assumptions, e.g. I would like to say that all nodes must use the same ReplicaPoints at all times instead of trying to reason about exactly what the setting does. This means it couldn't be changed on a running cluster without downtime, which seems like a reasonable restriction for a very low-level routing setting like this. (Of course, you can bring up a new cluster, migrate namespaces, and turn down the old one.) So, a dynamic config, but just read once at startup.

If you really wanted to be able to make the change live, using the GradualChange mechanism would be the best way to minimize disruption, to synchronize the flip on all nodes. Would it be acceptable to include only the basic (static) dynamic config in an official Temporal release, and you could use a fork with GradualChange for a short time to do the migration?
