Description
What version of gRPC-Java are you using?
1.76.3 (regression from 1.70.0).
What is your environment?
Java 21, Linux (GKE), proxyless gRPC with xDS (SotW ADS) from a custom control plane.
One application has startup code that calls LoadBalancerRegistry.deregister() / register() to replace a ServiceLoader-discovered provider with a differently-configured instance.
We acknowledge that our application is likely at fault for creating gRPC channels during framework initialization, before all startup code has finished running.
What did you expect to see?
We'd expect CdsLoadBalancer2 to be resilient to registry mutations.
What did you see instead?
INTERNAL: CdsLb for xdstp://...: Unable to parse the LB config:
Status{code=INTERNAL, description=Failed to parse child policy in wrr_locality LB policy:
{childPolicy=[{els={}}]}}
Cause: None of [els] specified by Service Config are available.
The channel enters TRANSIENT_FAILURE and does not recover.
Steps to reproduce the bug
Setup: The xDS control plane sends a CDS load_balancing_policy with wrr_locality whose endpoint_picking_policy contains three policies in fallback order:
- A custom LB policy (els) registered as a TypedStruct
- LeastRequest
- RoundRobin
The custom policy els has a LoadBalancerProvider registered via ServiceLoader. Application startup code deregisters it and conditionally re-registers a reconfigured instance.
We think that the failure sequence is this:
- Phase 1 (CDS update): XdsClusterResource parses a CDS response. LoadBalancingPolicyConverter finds the custom provider in the registry, selects it, and produces {"wrr_locality_experimental": {"childPolicy": [{"els": {}}]}}. Validation via WrrLocalityLoadBalancerProvider.parseLoadBalancingPolicyConfig succeeds. The raw JSON map is stored in CdsUpdate.lbPolicyConfig.
- Registry mutation: Application startup code calls registry.deregister(oldProvider), then conditionally registry.register(newProvider). The custom provider is absent from the registry.
- Phase 2 (any xDS update): CdsLoadBalancer2.acceptResolvedAddresses re-parses the stored raw config. selectLbPolicyFromList calls getProvider("els"), which returns null, producing the error. The channel enters TRANSIENT_FAILURE and does not recover.
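The sequence above can be modeled with a small, self-contained sketch. This is not gRPC's code: `RegistryRaceSketch`, its `Registry` class, and both method names are hypothetical stand-ins, and the built-in policy names are only illustrative strings. It assumes, as described above, that Phase 1 stores only the selected policy name and that Phase 2 looks that name up again in whatever the registry contains at that moment.

```java
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Simplified model of the failure sequence; hypothetical names, not gRPC classes. */
final class RegistryRaceSketch {

  /** Hypothetical mutable registry: policy name -> provider instance. */
  static final class Registry {
    private final Map<String, Object> providers = new ConcurrentHashMap<>();

    void register(String name, Object provider) {
      providers.put(name, provider);
    }

    void deregister(String name) {
      providers.remove(name);
    }

    boolean has(String name) {
      return providers.containsKey(name);
    }
  }

  /** Phase 1: pick the first policy in the fallback list that the registry knows. */
  static String selectAtCdsParse(Registry registry, List<String> fallbackOrder) {
    for (String name : fallbackOrder) {
      if (registry.has(name)) {
        return name; // only the selected name is stored; the fallbacks are dropped
      }
    }
    throw new IllegalStateException("None of " + fallbackOrder + " are available");
  }

  /** Phase 2: resolve the stored name against the registry as it is now. */
  static String reparseStoredConfig(Registry registry, String storedName) {
    return registry.has(storedName)
        ? "OK"
        : "INTERNAL: None of [" + storedName + "] specified by Service Config are available.";
  }

  public static void main(String[] args) {
    Registry registry = new Registry();
    registry.register("els", new Object());
    registry.register("least_request_experimental", new Object());
    registry.register("round_robin", new Object());

    // Phase 1: the custom policy is present, so it wins and is stored alone.
    String stored = selectAtCdsParse(
        registry, List.of("els", "least_request_experimental", "round_robin"));

    // Registry mutation: startup code removes the custom provider.
    registry.deregister("els");

    // Phase 2: the stored config no longer resolves, and no fallback remains.
    System.out.println(stored + " -> " + reparseStoredConfig(registry, stored));
  }
}
```

In this model the fallback list no longer helps in Phase 2, because the stored config names only the policy that was selected in Phase 1.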
Critically, Phase 1 runs only on CDS updates, which are rare (they occur when the cluster config changes), whereas Phase 2 runs on every XdsConfig change, including EDS updates, which can be very frequent.
The failure branch of the re-parsing carries the comment "Should be impossible, because XdsClusterResource validated this" (CdsLoadBalancer2.java:136), but the underlying assumption that the registry is immutable after validation does not hold.
Why this didn't happen before #12140:
Probably because before #12140 CdsLoadBalancer2.acceptResolvedAddresses had an early return on subsequent calls.
```java
// v1.70.0
if (this.resolvedAddresses != null) {
  return Status.OK;
}
```
Suggested action
Ideally, the LB policy config would not be validated against the registry a second time once the policy has already been selected during CDS parsing. If that is not fixed, it would still help to document that deregistering a load balancer provider at runtime can break the application.
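To illustrate the first option, here is a minimal sketch of "resolve once, reuse afterwards". It is not a proposed patch and does not use gRPC's API: `ResolveOnceSketch`, `Registry`, `Provider`, `ResolvedPolicy`, `validateOnce`, and `applyUpdate` are all hypothetical names. The assumption is that the validation step could hand over the resolved provider itself, so later updates never need to consult the registry.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of resolving the provider once at validation time; hypothetical names. */
final class ResolveOnceSketch {

  /** Hypothetical stand-in for a load balancer provider. */
  interface Provider {
    String describe();
  }

  /** Hypothetical mutable registry: policy name -> provider. */
  static final class Registry {
    private final Map<String, Provider> providers = new ConcurrentHashMap<>();

    void register(String name, Provider provider) {
      providers.put(name, provider);
    }

    void deregister(String name) {
      providers.remove(name);
    }

    Provider get(String name) {
      return providers.get(name);
    }
  }

  /** Validation result that keeps the provider reference, not just its name. */
  static final class ResolvedPolicy {
    final String name;
    final Provider provider;

    ResolvedPolicy(String name, Provider provider) {
      this.name = name;
      this.provider = provider;
    }
  }

  /** Runs once per CDS update: the only place the registry is consulted. */
  static ResolvedPolicy validateOnce(Registry registry, String name) {
    Provider provider = registry.get(name);
    if (provider == null) {
      throw new IllegalArgumentException("No provider registered for " + name);
    }
    return new ResolvedPolicy(name, provider);
  }

  /** Runs on every later update: uses the captured provider, no registry lookup. */
  static String applyUpdate(ResolvedPolicy resolved) {
    return "OK via " + resolved.provider.describe();
  }

  public static void main(String[] args) {
    Registry registry = new Registry();
    registry.register("els", () -> "els-v1");

    ResolvedPolicy resolved = validateOnce(registry, "els");
    registry.deregister("els"); // mutation after validation

    System.out.println(applyUpdate(resolved));
  }
}
```

One trade-off of this approach is that the channel keeps using the provider instance captured at validation time, so a re-registered, reconfigured instance would only take effect on the next validation.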
Thanks!