vector: drop Wants=cloud-final from populator to break systemd ordering cycle #254
Merged
#249 added After= AND Wants= cloud-final.service to the populator unit. The Wants= half pulled cloud-final into the dep graph and created a cycle:

  vector.service Wants populate-vector-env.service
  populate-vector-env.service Wants cloud-final.service
  cloud-final.service Before multi-user.target
  multi-user.target Wants vector.service

At boot, systemd resolves this by silently deleting vector.service/start. Vector never starts, no log, no error.

Observed on a prod worker after #249 merged: load=10, vector inactive, journal:

  "cloud-final.service: Job vector.service/start deleted to break ordering cycle starting with cloud-final.service/start"

Drop cloud-final from Wants=. Keep it in After=: that alone is what fixes the original race, and it avoids forcing cloud-final into our dep graph.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
#249 added `After=cloud-final.service` + `Wants=cloud-final.service` to populate-vector-env.service to fix a race where the populator ran before cloud-init wrote /etc/opensandbox/worker.env. Symptoms: empty env → no KV fetch → empty vector.env → Vector with no Axiom creds.

#254-v1 tried to break the resulting systemd cycle by dropping just the Wants=. Tested on dev: cycle still fires, vector still inactive.

Real root cause: on this Azure image, BOTH cloud-final.service and cloud-init.target declare `After=multi-user.target`. So ANY ordering dependency on a cloud-init unit from a unit WantedBy=multi-user.target (which populate-vector-env is) creates a cycle. systemd resolves it by silently deleting vector.service/start.

This commit:

1. Reverts the unit-file changes from #249. Back to After=/Wants= network-online.target only, same as before #249, no cycle.
2. Fixes the original race at the script level. When neither /etc/opensandbox/worker.env nor server.env exists, the script now exits 1 instead of 0, so Restart=on-failure on the unit retries. With RestartSec=10s and StartLimitBurst=5 / IntervalSec=120, that's a ~50s retry budget (5 × 10s), plenty for cloud-init to land worker.env on Azure. Once worker.env exists but VAULT_NAME is still unset, the script exits 0 (treating this as "host genuinely doesn't have KV configured", e.g. dev VMs without managed identity).

Validated on dev (opensandbox-dev-tf-worker):

- before patch: reboot → vector inactive, "ordering cycle" in journal
- after patch: reboot → vector active, populator active, no cycle

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
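The exit-code policy described in item 2 of this commit message can be sketched roughly as below. This is not the actual populator script, which is not shown on this page: the function name `populator_decision`, its directory argument, and the `retry`/`skip`/`fetch` labels are hypothetical and exist only to make the three cases explicit. The file names and `VAULT_NAME` come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch: decide what the populator should do, given the
# directory that holds the role env files (e.g. /etc/opensandbox).
# Prints one of:
#   retry  -> no env file yet; the script would exit 1 so systemd retries
#   skip   -> env file exists but VAULT_NAME is unset; the script would exit 0
#   fetch  -> env file exists and names a vault; proceed to the KV fetch
populator_decision() {
    env_dir="$1"
    env_file=""
    for name in worker.env server.env; do
        if [ -f "$env_dir/$name" ]; then
            env_file="$env_dir/$name"
            break
        fi
    done

    if [ -z "$env_file" ]; then
        echo retry
        return 0
    fi

    # Source in a subshell so the file's variables do not leak into the caller.
    # VAULT_NAME is unset first so only the file's value is considered.
    vault=$( unset VAULT_NAME; . "$env_file"; printf '%s' "${VAULT_NAME:-}" )

    if [ -z "$vault" ]; then
        echo skip
    else
        echo fetch
    fi
}
```

A caller could map `retry` to `exit 1` and `skip` to `exit 0`, which matches the behaviour described for this commit.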
motatoes added a commit that referenced this pull request on May 16, 2026
…#256)

#254 made the populator exit 1 when the role env file was missing, so systemd's Restart=on-failure could retry.

Hit a real bug on prod (osb-worker-c0741893): vector.service has Restart=always. Each restart re-requests the populator unit. systemd counts these as start attempts against the populator's StartLimitBurst=5 / IntervalSec=120, but they all land in <2 seconds (faster than RestartSec=10s can pace them). Burst tripped, populator enters `failed`, vector also enters `failed`.

Journal:

  00:43:08 populator: Start request repeated too quickly
  00:43:08 populator: Failed ... 5 more in 2 seconds
  00:44:00 worker.env written by cloud-init (too late, populator dead)

The systemd retry mechanism doesn't compose well when other units re-request you faster than your RestartSec= can pace.

Fix: poll inside the script. Single systemd invocation, internal wait up to 90s, source the env file when it appears. No restart-budget interaction. Behaves identically on dev (env file already exists → break out immediately on first iteration).

Why the test in #254 didn't catch this: Dev's bootstrap.sh writes /etc/opensandbox/worker.env BEFORE Vector's install step. So at boot, worker.env always exists for the populator. The dev test confirmed the cycle was gone, not that the retry mechanism worked under cloud-init delay. Reproducing this on dev would have needed an artificial delay (e.g. systemd-run --on-active=60s touch worker.env); that would catch this in future.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
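The in-script polling described in this commit might look roughly like the sketch below. It is not the actual script: the function name `wait_for_env_file`, its arguments, and the 1-second poll interval are assumptions made for illustration. Only the env file paths and the 90s ceiling come from the commit message.

```shell
#!/bin/sh
# Hypothetical sketch: wait for one of the candidate env files to appear.
# Usage: wait_for_env_file TIMEOUT_SECONDS FILE...
# Prints the first file found and returns 0; returns 1 on timeout.
wait_for_env_file() {
    timeout="$1"
    shift
    waited=0
    while :; do
        for f in "$@"; do
            if [ -f "$f" ]; then
                printf '%s\n' "$f"
                return 0
            fi
        done
        # The deadline is checked after looking for the files, so a file
        # that already exists is returned on the first iteration.
        if [ "$waited" -ge "$timeout" ]; then
            return 1
        fi
        sleep 1
        waited=$((waited + 1))
    done
}
```

In a script this could be called as `env_file=$(wait_for_env_file 90 /etc/opensandbox/worker.env /etc/opensandbox/server.env)`, then `. "$env_file"` on success. Because the wait happens inside a single invocation, it does not consume any of the unit's start-limit budget.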
Summary
PR #249 added both `After=cloud-final.service` AND `Wants=cloud-final.service` to `populate-vector-env.service`. The `Wants=` half pulled cloud-final into Vector's dependency graph and created an ordering cycle that systemd silently resolves by deleting `vector.service/start` — Vector never boots, no log, no error.
Reproduction on prod
After #249 merged and a fresh worker rolled, observed on one prod worker: load=10, vector inactive.
In the early boot journal:
```
cloud-final.service: Found dependency on vector.service/start
cloud-final.service: Job vector.service/start deleted to break ordering cycle
starting with cloud-final.service/start
```
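Because systemd drops the job without marking the unit as failed, the journal message is the only trace. A minimal sketch of a check that extracts the deleted jobs from journal text follows. `deleted_jobs` is a hypothetical helper, and the pattern assumes the message wording shown in the excerpt above.

```shell
#!/bin/sh
# Hypothetical helper: read journal text on stdin and print each job that
# systemd deleted to break an ordering cycle (one job per line).
deleted_jobs() {
    sed -n 's/.*Job \([^ ]*\) deleted to break ordering cycle.*/\1/p'
}
```

On a booted host it could be fed from the current boot's journal, for example `journalctl -b --no-pager | deleted_jobs`. Empty output means no job was dropped for this reason.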
Why `After=` is enough
`After=cloud-final.service` already gives the ordering needed to fix the original cloud-init race that #249 was solving. `Wants=` adds a "pull into dep graph" semantic we don't actually need: cloud-final.service is a stock cloud-init unit that's always present, so there is no need to "want" it.
Test plan
🤖 Generated with Claude Code