PrometheusConverter self-webhook race at startup leaves cluster permanently under-converted (no retry) #2118

@hkf57

Description

Summary

When the operator pod starts in a cluster that already has prometheus-operator resources (ServiceMonitor, PodMonitor, PrometheusRule, etc.), the PrometheusConverter immediately iterates them and tries to CREATE the corresponding VMServiceScrape / VMPodScrape / VMRule resources. Each create call goes through the operator's own ValidatingWebhookConfiguration (failurePolicy: Fail), whose target is the operator's own service.

If this is a fresh start of the only operator pod, the service has zero endpoints (the operator pod is not yet Ready). Every CREATE fails with:

failed calling webhook "vmservicescrapes.operator.victoriametrics.com":
no endpoints available for service "victoriametrics-victoria-metrics-operator"

The converter logs the error and moves on. There is no retry queue, no requeue with backoff, and no re-iteration of source resources once the pod becomes Ready. The cluster is left with a silent, permanent under-conversion.

Symptom

vmagent only scrapes the VM stack's own templated VMServiceScrapes (~10 K8s SD targets in our case). All converted-from-ServiceMonitor targets — kube-state-metrics, kubelet, node-exporter, kube-controller-manager, prometheus-operator self-monitoring, every microservice ServiceMonitor — are silently absent. No errors after startup; everything looks healthy.

Reproduction

  1. Cluster with prometheus-operator already deployed (so ServiceMonitors / PodMonitors / PrometheusRules exist).
  2. Deploy this operator with victoria-metrics-operator.operator.disable_prometheus_converter: false.
  3. Observe operator pod startup logs.

Expected: VMServiceScrapes / VMPodScrapes / VMRules are created for each source resource.

Actual: most/all conversion attempts fail with no endpoints available for service "<operator-svc>". The converter logs the failures and never retries. Resulting cluster state has only the bundled VM-self VMServiceScrapes; nothing converted from prometheus-operator's resources.

Observed in production

We hit this in three of four production Kubernetes clusters that deployed the VM stack on 2026-04-29. The fourth cluster ("CMH") happened to win the race and converted cleanly.

  • Operator: v0.68.3
  • Helm chart: victoria-metrics-k8s-stack 0.72.5 (operator subchart 0.59.3)
  • Affected cluster: K8s 1.34.4-eks-f69f56f
  • Healthy cluster: same operator/chart/K8s version

Affected operator startup logs:

{"level":"info","ts":"2026-04-29T03:30:12Z","logger":"controller.PrometheusConverter","msg":"waiting for api resource","discovery_group":"monitoring.coreos.com/v1","discovery_kind":"ServiceMonitor"}
{"level":"info","ts":"2026-04-29T03:30:17Z","logger":"controller.PrometheusConverter","msg":"object discovered","discovery_group":"monitoring.coreos.com/v1","discovery_kind":"ServiceMonitor"}
{"level":"error","ts":"2026-04-29T03:30:17Z","logger":"controller.PrometheusConverter","msg":"cannot create VMServiceScrape","vmservicescrape":"<name>","namespace":"<ns>","error":"Internal error occurred: failed calling webhook \"vmservicescrapes.operator.victoriametrics.com\": failed to call webhook: Post \"https://victoriametrics-victoria-metrics-operator.monitoring.svc:9443/validate-operator-victoriametrics-com-v1beta1-vmservicescrape?timeout=10s\": no endpoints available for service \"victoriametrics-victoria-metrics-operator\""}

(The pattern repeats once per ServiceMonitor / PodMonitor / PrometheusRule in the cluster — 556 such errors in our affected cluster, 0 in the healthy cluster.)

Workaround

kubectl rollout restart deploy/<operator>. The chart's RollingUpdate strategy keeps the old pod serving the webhook while the new pod boots, so the new converter's create calls succeed against the old pod's endpoint. Verified post-restart in our affected cluster: VMServiceScrape count went from 6 (just bundled) to 73 within ~10 seconds.

This works as a manual remediation but doesn't help fresh deployments.
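For reference, the exact remediation we ran (names taken from the log lines above — substitute your own release name and namespace):

```shell
# Restart the operator. The chart's RollingUpdate strategy keeps the old
# pod's webhook endpoint alive while the new pod's converter re-runs its
# startup iteration over ServiceMonitors / PodMonitors / PrometheusRules.
kubectl -n monitoring rollout restart deploy/victoriametrics-victoria-metrics-operator
kubectl -n monitoring rollout status deploy/victoriametrics-victoria-metrics-operator

# Watch the converted-resource count climb once the new pod is Ready.
kubectl get vmservicescrapes -A --no-headers | wc -l
```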

Suggested fixes (any of these would close the gap)

  1. Retry on transient API errors. The PrometheusConverter should requeue create attempts that fail with no endpoints available, connection refused, i/o timeout, etc., with exponential backoff. A simple client.IsRetryableError(err)-style check would catch this category.
  2. Wait for the operator's own service to have at least one Ready endpoint before starting the converter loop. The converter is racing the kubelet's readiness probe + endpoints controller; pausing for EndpointSlice readiness eliminates the race.
  3. Re-iterate source resources after the pod becomes Ready. Even with no retry on the original failures, a single re-list-and-reconcile after mgr.Elected() fires would catch everything missed during the startup race.
  4. Self-targeted webhook bypass. If the validating webhook target is the operator's own service, the converter could create resources via a path that doesn't traverse the webhook (e.g. dry-run + manual validation + create with --validate=false semantics on the controller-runtime client). Riskier, mentioned for completeness.

Severity

In our experience the symptom is silent — no operator-level error after the startup window, and CRD reconciliation reports 0 errors thereafter — so the under-conversion is invisible to anyone not specifically counting vmservicescrapes against servicemonitors. We only noticed because users reported missing data when querying against vmagent.
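A rough way to catch the under-conversion without waiting for user reports (assumes the converter is expected to translate every ServiceMonitor; adjust for any conversion filters you run with):

```shell
# Compare source count to converted count; a large gap between the two
# numbers means this cluster lost the startup race.
kubectl get servicemonitors.monitoring.coreos.com -A --no-headers | wc -l
kubectl get vmservicescrapes.operator.victoriametrics.com -A --no-headers | wc -l
```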

Happy to test any candidate fix in our staging clusters before it's released.
