Summary
When the operator pod starts in a cluster that already has prometheus-operator resources (ServiceMonitor, PodMonitor, PrometheusRule, etc.), the PrometheusConverter immediately iterates them and tries to CREATE the corresponding VMServiceScrape / VMPodScrape / VMRule resources. Each create call goes through the operator's own ValidatingWebhookConfiguration (failurePolicy: Fail), whose target is the operator's own service.
If this is a fresh start of the only operator pod, the service has zero endpoints (the operator pod is not yet Ready). Every CREATE fails with:
failed calling webhook "vmservicescrapes.operator.victoriametrics.com":
no endpoints available for service "victoriametrics-victoria-metrics-operator"
The converter logs the error and moves on. There is no retry queue, no requeue with backoff, and no re-iteration of source resources once the pod becomes Ready. The cluster is left with a silent, permanent under-conversion.
Symptom
vmagent only scrapes the VM stack's own templated VMServiceScrapes (~10 K8s SD targets in our case). All targets that should come from converted ServiceMonitors — kube-state-metrics, kubelet, node-exporter, kube-controller-manager, prometheus-operator self-monitoring, every microservice ServiceMonitor — are silently absent. No errors appear after startup; everything looks healthy.
Reproduction
- Cluster with prometheus-operator already deployed (so ServiceMonitors / PodMonitors / PrometheusRules exist).
- Deploy this operator with
victoria-metrics-operator.operator.disable_prometheus_converter: false.
- Observe operator pod startup logs.
Expected: VMServiceScrapes / VMPodScrapes / VMRules are created for each source resource.
Actual: most/all conversion attempts fail with no endpoints available for service "<operator-svc>". The converter logs the failures and never retries. Resulting cluster state has only the bundled VM-self VMServiceScrapes; nothing converted from prometheus-operator's resources.
Observed in production
We hit this in three of four production Kubernetes clusters that deployed the VM stack on 2026-04-29. The fourth cluster ("CMH") happened to win the race and converted cleanly.
- Operator: v0.68.3
- Helm chart:
victoria-metrics-k8s-stack 0.72.5 (operator subchart 0.59.3)
- Affected cluster: K8s 1.34.4-eks-f69f56f
- Healthy cluster: same operator/chart/K8s version
Affected operator startup logs:
{"level":"info","ts":"2026-04-29T03:30:12Z","logger":"controller.PrometheusConverter","msg":"waiting for api resource","discovery_group":"monitoring.coreos.com/v1","discovery_kind":"ServiceMonitor"}
{"level":"info","ts":"2026-04-29T03:30:17Z","logger":"controller.PrometheusConverter","msg":"object discovered","discovery_group":"monitoring.coreos.com/v1","discovery_kind":"ServiceMonitor"}
{"level":"error","ts":"2026-04-29T03:30:17Z","logger":"controller.PrometheusConverter","msg":"cannot create VMServiceScrape","vmservicescrape":"<name>","namespace":"<ns>","error":"Internal error occurred: failed calling webhook \"vmservicescrapes.operator.victoriametrics.com\": failed to call webhook: Post \"https://victoriametrics-victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmservicescrape?timeout=10s\": no endpoints available for service \"victoriametrics-victoria-metrics-operator\""}
(The pattern repeats once per ServiceMonitor / PodMonitor / PrometheusRule in the cluster — 556 such errors in our affected cluster, 0 in the healthy cluster.)
Workaround
kubectl rollout restart deploy/<operator>. The chart's RollingUpdate strategy keeps the old pod serving the webhook while the new pod boots, so the new converter's create calls succeed against the old pod's endpoint. Verified post-restart in our affected cluster: VMServiceScrape count went from 6 (just bundled) to 73 within ~10 seconds.
This works as a manual remediation but doesn't help fresh deployments.
Suggested fixes (any of these would close the gap)
- Retry on transient API errors. The PrometheusConverter should requeue create attempts that fail with no endpoints available, connection refused, i/o timeout, etc., with exponential backoff. A simple client.IsRetryableError(err)-style check would catch this category; see the first sketch after this list.
- Wait for the operator's own service to have at least one Ready endpoint before starting the converter loop. The converter is racing the kubelet's readiness probe + endpoints controller; pausing for EndpointSlice readiness eliminates the race (second sketch below).
- Re-iterate source resources after the pod becomes Ready. Even with no retry on the original failures, a single re-list-and-reconcile after mgr.Elected() fires would catch everything missed during the startup race (third sketch below).
- Self-targeted webhook bypass. If the validating webhook target is the operator's own service, the converter could create resources via a path that doesn't traverse the webhook (e.g. dry-run + manual validation + create with
--validate=false semantics on the controller-runtime client). Riskier, mentioned for completeness.
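A minimal sketch of the retry option, assuming client-go and controller-runtime; isTransient and createWithRetry are illustrative names, not the operator's real functions:

```go
// Illustrative only: retry the converter's create calls with exponential
// backoff when the failure looks transient (webhook backend not ready yet).
package converter

import (
	"context"
	"strings"
	"time"

	apierrors "k8s.io/apimachinery/pkg/api/errors"
	"k8s.io/apimachinery/pkg/util/wait"
	"k8s.io/client-go/util/retry"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// isTransient reports whether a create error is worth retrying, e.g. the
// apiserver wrapping "no endpoints available" into an internal error.
func isTransient(err error) bool {
	if apierrors.IsInternalError(err) || apierrors.IsServiceUnavailable(err) ||
		apierrors.IsTimeout(err) || apierrors.IsServerTimeout(err) || apierrors.IsTooManyRequests(err) {
		return true
	}
	msg := err.Error()
	return strings.Contains(msg, "no endpoints available") ||
		strings.Contains(msg, "connection refused") ||
		strings.Contains(msg, "i/o timeout")
}

// createWithRetry creates obj, retrying transient failures instead of
// logging once and dropping the object.
func createWithRetry(ctx context.Context, c client.Client, obj client.Object) error {
	backoff := wait.Backoff{Steps: 8, Duration: time.Second, Factor: 2.0, Jitter: 0.1}
	return retry.OnError(backoff, isTransient, func() error {
		err := c.Create(ctx, obj)
		if apierrors.IsAlreadyExists(err) {
			return nil // converted on a previous attempt
		}
		return err
	})
}
```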
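A minimal sketch of the endpoint-wait option. The namespace and service name are passed in (in the report above they would be monitoring / victoriametrics-victoria-metrics-operator); in the real operator this would likely need an uncached client if it runs before the manager's cache starts:

```go
// Illustrative only: block the converter loop until the operator's own
// webhook Service has at least one ready endpoint.
package converter

import (
	"context"
	"time"

	discoveryv1 "k8s.io/api/discovery/v1"
	"k8s.io/apimachinery/pkg/util/wait"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

func waitForWebhookEndpoints(ctx context.Context, c client.Client, ns, svc string) error {
	return wait.PollUntilContextTimeout(ctx, 2*time.Second, 5*time.Minute, true,
		func(ctx context.Context) (bool, error) {
			var slices discoveryv1.EndpointSliceList
			if err := c.List(ctx, &slices,
				client.InNamespace(ns),
				client.MatchingLabels{discoveryv1.LabelServiceName: svc},
			); err != nil {
				return false, nil // keep polling on transient list errors
			}
			for _, s := range slices.Items {
				for _, ep := range s.Endpoints {
					if ep.Conditions.Ready != nil && *ep.Conditions.Ready {
						return true, nil
					}
				}
			}
			return false, nil
		})
}
```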
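A minimal sketch of the post-election resync. startupResync and the resync callback are hypothetical, standing in for whatever list-and-convert entry point the converter already has:

```go
// Illustrative only: a controller-runtime Runnable that re-runs the
// conversion pass once after this replica wins leader election, catching
// objects skipped during the startup race.
package converter

import (
	"context"

	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

type startupResync struct {
	mgr    ctrl.Manager
	resync func(ctx context.Context, c client.Client) error // placeholder for re-list-and-convert
}

func (r *startupResync) Start(ctx context.Context) error {
	select {
	case <-r.mgr.Elected(): // the trigger suggested above
	case <-ctx.Done():
		return ctx.Err()
	}
	return r.resync(ctx, r.mgr.GetClient())
}

// wired up in main with: mgr.Add(&startupResync{mgr: mgr, resync: reconvertAll})
```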
Severity
In our experience the symptom is silent — no operator-level error after the startup window, and CRD reconciliation reports 0 errors thereafter — so the under-conversion is invisible to anyone not specifically counting vmservicescrapes against servicemonitors. We only noticed because users reported missing data when querying against vmagent.
Happy to test any candidate fix in our staging clusters before it's released.