Allow dynamic configuration of ringpop replica points#9702
aromanovich wants to merge 1 commit into temporalio:main from …
Conversation
**dnr** left a comment
Interesting. I've done tests with up to 200 points and didn't see much improved balancing, but those numbers for using 500 look promising.
I have some concerns with the implementation though:
- Changing points at runtime seems unnecessary. It could potentially cause a lot of movement at once, and conflict between pods. It seems more sane to just read a value at startup and use that. The value should not have to change often.
- A global variable is not a great approach. First, there's no synchronization. Second, the general pattern in the server is to get config values from either static config or dynamic config. Sometimes values that are "static" are read from dynamic config for convenience. In this case I think both may be available. Maybe we can put it next to `system.enableRingpopTLS` in dynamic config? For things that need to be injected as code, we can use `ServerOptions` + fx, but this is just one piece of data, so that doesn't seem necessary.
If we make it a static config option, its change will go through a rolling deployment process, which is slow-ish in most production environments. For example, …

This disruption can be minimized …

If we make it a dynamic config (let's say …), … The current hacky approach (which is not great, I totally agree) is used to achieve exactly that:

```go
import "go.temporal.io/server/common/resource"

resource.DefaultOptions = fx.Options(resource.DefaultOptions, fx.Decorate(func(dc *dynamicconfig.Collection) *dynamicconfig.Collection {
	ringpop.ReplicaPoints = dynamicconfig.RingpopReplicaPoints.Get(dc)
	return dc
}))
// start the server
```

If you approve, I can implement a proper …
Note that besides history shards, matching task queues also use ringpop consistent hashing to locate owners, and they don't have the history shard linger mechanism to mitigate temporary disagreement: it'll cause flapping instead and a lot of task dispatch latency (though I suppose using excess partitions can mitigate that somewhat).

If at all possible, I would prefer to treat ringpop as a black box and not make assumptions, e.g. I would like to say that all nodes must use the same `ReplicaPoints` at all times instead of trying to reason about exactly what the setting does. This means it couldn't be changed on a running cluster without downtime, which seems like a reasonable restriction for a very low-level routing setting like this. (Of course, you can bring up a new cluster, migrate namespaces, and turn down the old one.) So: a dynamic config, but read just once at startup.

If you really wanted to be able to make the change live, using the GradualChange mechanism would be the best way to minimize disruption, to synchronize the flip on all nodes. Would it be acceptable to include only the basic (static) dynamic config in an official Temporal release, and you could use a fork with GradualChange for a short time to do the migration?
What changed?

Added a configurable `ReplicaPoints` variable to the ringpop service resolver. Instead of a hardcoded constant (`replicaPoints = 100`), the number of virtual nodes per host is now controlled by a `var ReplicaPoints func() int` that can be overridden. When the value returned by `ReplicaPoints()` changes, the hash ring is rebuilt and a `ChangedEvent` is emitted, even if membership itself hasn't changed.

Why?
The default of 100 replica points is not optimal for all combinations of shard and history node counts and may lead to uneven shard distribution across history nodes. For example, 4096 shards on 9 history nodes with 100 replica points gives a standard deviation of around 50 shards; increasing replica points to 500 lowers the standard deviation to around 30 shards. The optimal number of replica points is roughly `shards / nodes`, which equates the number of "shard points" and "node points" on the ring. For 4096 shards and 9 history nodes, distribution improved as follows: …
This patch allows setting a custom `ringpop.ReplicaPoints` (e.g. to a dynamic config value getter) before starting the server, and then gradually increasing replica points to the desired value (to avoid reshuffling too many shards at once).

How did you test it?
Potential risks
None, if `var ReplicaPoints func() int` is not modified. If the server is built with an overridden `ReplicaPoints`, it may cause ring inconsistency across cluster nodes during the deployment process.