PrometheusConverter self-webhook race at startup leaves cluster permanently under-converted (no retry) #2118

@hkf57

Description

Summary

When the operator pod starts in a cluster that already has prometheus-operator resources (ServiceMonitor, PodMonitor, PrometheusRule, etc.), the PrometheusConverter immediately iterates them and tries to CREATE the corresponding VMServiceScrape / VMPodScrape / VMRule resources. Each create call goes through the operator's own ValidatingWebhookConfiguration (failurePolicy: Fail), whose target is the operator's own service.

If this is a fresh start of the only operator pod, the service has zero endpoints (the operator pod is not yet Ready). Every CREATE fails with:

failed calling webhook "vmservicescrapes.operator.victoriametrics.com":
no endpoints available for service "victoriametrics-victoria-metrics-operator"

The converter logs the error and moves on. There is no retry queue, no requeue with backoff, and no re-iteration of source resources once the pod becomes Ready. The cluster is left with a silent, permanent under-conversion.

Symptom

vmagent only scrapes the VM stack's own templated VMServiceScrapes (~10 K8s SD targets in our case). All converted-from-ServiceMonitor targets — kube-state-metrics, kubelet, node-exporter, kube-controller-manager, prometheus-operator self-monitoring, every microservice ServiceMonitor — are silently absent. No errors after startup; everything looks healthy.

Reproduction

  1. Cluster with prometheus-operator already deployed (so ServiceMonitors / PodMonitors / PrometheusRules exist).
  2. Deploy this operator with victoria-metrics-operator.operator.disable_prometheus_converter: false.
  3. Observe operator pod startup logs.

Expected: VMServiceScrapes / VMPodScrapes / VMRules are created for each source resource.

Actual: most/all conversion attempts fail with no endpoints available for service "<operator-svc>". The converter logs the failures and never retries. Resulting cluster state has only the bundled VM-self VMServiceScrapes; nothing converted from prometheus-operator's resources.

Observed in production

We hit this in three of four production Kubernetes clusters that deployed the VM stack on 2026-04-29. The fourth cluster ("CMH") happened to win the race and converted cleanly.

  • Operator: v0.68.3
  • Helm chart: victoria-metrics-k8s-stack 0.72.5 (operator subchart 0.59.3)
  • Affected cluster: K8s 1.34.4-eks-f69f56f
  • Healthy cluster: same operator/chart/K8s version

Affected operator startup logs:

{"level":"info","ts":"2026-04-29T03:30:12Z","logger":"controller.PrometheusConverter","msg":"waiting for api resource","discovery_group":"monitoring.coreos.com/v1","discovery_kind":"ServiceMonitor"}
{"level":"info","ts":"2026-04-29T03:30:17Z","logger":"controller.PrometheusConverter","msg":"object discovered","discovery_group":"monitoring.coreos.com/v1","discovery_kind":"ServiceMonitor"}
{"level":"error","ts":"2026-04-29T03:30:17Z","logger":"controller.PrometheusConverter","msg":"cannot create VMServiceScrape","vmservicescrape":"<name>","namespace":"<ns>","error":"Internal error occurred: failed calling webhook \"vmservicescrapes.operator.victoriametrics.com\": failed to call webhook: Post \"https://victoriametrics-victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmservicescrape?timeout=10s\": no endpoints available for service \"victoriametrics-victoria-metrics-operator\""}

(The pattern repeats once per ServiceMonitor / PodMonitor / PrometheusRule in the cluster — 556 such errors in our affected cluster, 0 in the healthy cluster.)

Workaround

kubectl rollout restart deploy/<operator>. The chart's RollingUpdate strategy keeps the old pod serving the webhook while the new pod boots, so the new converter's create calls succeed against the old pod's endpoint. Verified post-restart in our affected cluster: VMServiceScrape count went from 6 (just bundled) to 73 within ~10 seconds.

This works as a manual remediation but doesn't help fresh deployments.
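For reference, the exact remediation we ran (names taken from the log lines above — substitute your own release name and namespace):

```shell
# Restart the operator. The chart's RollingUpdate strategy keeps the old
# pod's webhook endpoint alive while the new pod's converter re-runs its
# startup iteration over ServiceMonitors / PodMonitors / PrometheusRules.
kubectl -n monitoring rollout restart deploy/victoriametrics-victoria-metrics-operator
kubectl -n monitoring rollout status deploy/victoriametrics-victoria-metrics-operator

# Watch the converted-resource count climb once the new pod is Ready.
kubectl get vmservicescrapes -A --no-headers | wc -l
```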

Suggested fixes (any of these would close the gap)

  1. Retry on transient API errors. The PrometheusConverter should requeue create attempts that fail with no endpoints available, connection refused, i/o timeout, etc., with exponential backoff. A simple client.IsRetryableError(err)-style check would catch this category.
  2. Wait for the operator's own service to have at least one Ready endpoint before starting the converter loop. The converter is racing the kubelet's readiness probe + endpoints controller; pausing for EndpointSlice readiness eliminates the race.
  3. Re-iterate source resources after the pod becomes Ready. Even with no retry on the original failures, a single re-list-and-reconcile after mgr.Elected() fires would catch everything missed during the startup race.
  4. Self-targeted webhook bypass. If the validating webhook target is the operator's own service, the converter could create resources via a path that doesn't traverse the webhook (e.g. dry-run + manual validation + create with --validate=false semantics on the controller-runtime client). Riskier, mentioned for completeness.

Severity

In our experience the symptom is silent — no operator-level error after the startup window, and CRD reconciliation reports 0 errors thereafter — so the under-conversion is invisible to anyone not specifically counting vmservicescrapes against servicemonitors. We only noticed because users reported missing data when querying against vmagent.
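A rough way to catch the under-conversion without waiting for user reports (assumes the converter is expected to translate every ServiceMonitor; adjust for any conversion filters you run with):

```shell
# Compare source count to converted count; a large gap between the two
# numbers means this cluster lost the startup race.
kubectl get servicemonitors.monitoring.coreos.com -A --no-headers | wc -l
kubectl get vmservicescrapes.operator.victoriametrics.com -A --no-headers | wc -l
```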

Happy to test any candidate fix in our staging clusters before it's released.
