vector: poll for worker.env inside populator instead of systemd retry #256
Merged
#254 made the populator exit 1 when the role env file was missing, so systemd's Restart=on-failure could retry.

Hit a real bug on prod (osb-worker-c0741893): vector.service has Restart=always. Each restart re-requests the populator unit. systemd counts these as start attempts against the populator's StartLimitBurst=5 / IntervalSec=120, but they all land in <2 seconds (faster than RestartSec=10s can pace them). Burst tripped, populator enters `failed`, vector also enters `failed`.

Journal:

    00:43:08 populator: Start request repeated too quickly
    00:43:08 populator: Failed
    ... 5 more in 2 seconds
    00:44:00 worker.env written by cloud-init (too late, populator dead)

The systemd retry mechanism doesn't compose well when other units re-request you faster than your RestartSec= can pace.

Fix: poll inside the script. Single systemd invocation, internal wait up to 90s, source the env file when it appears. No restart-budget interaction. Behaves identically on dev (env file already exists → break out immediately on first iteration).

Why the test in #254 didn't catch this: Dev's bootstrap.sh writes /etc/opensandbox/worker.env BEFORE Vector's install step. So at boot, worker.env always exists for the populator. The dev test confirmed the cycle was gone, not that the retry mechanism worked under cloud-init delay. To reproduce on dev would have needed an artificial delay (e.g. systemd-run --on-active=60s touch worker.env); that would catch this in future.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request on May 18, 2026
#257)

#256 introduced a 90s internal poll for worker.env. Hit a follow-up issue on prod (osb-worker-0b42c8be): cloud-init wrote worker.env at +4 minutes into boot, our 90s poll gave up at +1.5 minutes. Populator exited 0 with "no KV configured", vector ran without env file, failed, restart-looped into a failed state, and the late env arrival had no effect.

Two changes:

1. Bump the poll deadline from 90s to 600s. Azure cloud-init on Standard_D-series VMs takes 3-5 minutes in observed cases; 10 minutes covers the long tail with margin.
2. Add ExecStartPost on the service that does

       systemctl --no-block reset-failed vector.service
       systemctl --no-block restart vector.service

   so when populator finally writes vector.env (potentially after vector has already exhausted its restart budget), vector is reset-failed and restarted. --no-block avoids the deadlock with vector's After=populator dep.

What we explored but didn't ship: systemd Path units (populate-vector-env.path watching worker.env or boot-finished). Eight dev-test reboots surfaced: ordering cycles, RemainAfterExit no-op on path triggers, Wants= cascade re-triggers, and dir-level inotify storms (50-250 starts/sec when cloud-init writes any file in the watched directory). Concluded the path-unit-on-shared-dir interaction is the wrong tool for this.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request on May 19, 2026
The previous iteration (since reverted in #259) shipped a 600s synchronous in-script wait on worker.env. On Azure that deadlocked the boot: cloud-final.service is ordered After=multi-user.target on Ubuntu Azure images, and writing /etc/opensandbox/worker.env is what cloud-final does. multi-user.target couldn't reach active while the populator was waiting (vector.service wants populator, multi-user wants vector). Every new Azure worker was reaped at exactly 600s by scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time (dev hosts, image bake, reboot of a healthy VM), the populator pulls real creds from Key Vault and writes vector.env synchronously (unchanged behavior).
- If neither role env exists (Azure first boot, cloud-final hasn't run yet), the populator:
  1. writes a stub vector.env with all expected variables defined but empty, so `vector validate` passes and the service can start (the axiom sink fails its healthcheck and buffers to disk),
  2. starts a new companion unit populate-vector-env-wait.service (not WantedBy=multi-user.target, so it doesn't block boot),
  3. exits 0 in ~1s.

The wait unit polls /etc/opensandbox/{worker,server}.env every 5s for up to 30 min (past Azure cloud-init's worst-case ~5 min), then re-runs the main populator (which now finds the role env file and goes through the synchronous path) and does `systemctl reset-failed + restart vector.service` so the disk buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh header):

- #249 After=cloud-final → systemd cycle, vector dropped silently.
- #254 exit 1 + Restart=on-failure → vector's restart-burst burnt the StartLimitBurst budget in <2s.
- #256 internal 90s poll → multi-user blocked 90s, populator gave up before cloud-final arrived at ~4 min anyway.
- #257 internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:

- systemd .path unit watching the specific worker.env file (not the dir): would work, but adds a third unit and still needs the same decoupling between vector.service and the populator at boot time that this approach already achieves more directly.
- Type=forking + setsid + disown in one unit: the detached child can be killed by systemd on unit stop unless KillMode=process, which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
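The stub-file step in the commit above ("all expected variables defined but empty") could look roughly like the sketch below. This is not taken from populate-vector-env.sh: the helper name `write_stub_env` and the variable names passed to it (`AXIOM_TOKEN`, `AXIOM_DATASET`) are hypothetical and only illustrate the idea. The file is written to a temporary path and renamed into place so that a reader never sees a half-written env file.

```bash
# Hypothetical helper, not the real populator code.
# write_stub_env OUT_PATH VAR_NAME...
# Writes OUT_PATH with every VAR_NAME defined but empty.
write_stub_env() {
  local out=$1
  shift
  local tmp name
  # Create the temp file next to the target so the final mv is a same-filesystem rename.
  tmp=$(mktemp "${out}.XXXXXX") || return 1
  for name in "$@"; do
    printf '%s=\n' "$name" >> "$tmp"
  done
  # The real file will later hold credentials, so keep the stub's mode restrictive too.
  chmod 600 "$tmp"
  mv -f "$tmp" "$out"
}
```

Called as `write_stub_env /path/to/vector.env AXIOM_TOKEN AXIOM_DATASET`, it produces a file with one `NAME=` line per variable, which the later synchronous run of the populator would overwrite with real values.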
Summary
#254 (just merged) made the populator `exit 1` when `/etc/opensandbox/{worker,server}.env` was missing, expecting systemd's `Restart=on-failure` to retry until cloud-init lands the file. It hit a follow-up bug on prod that the in-place dev test in #254 didn't catch.
What broke on prod
Observed on `osb-worker-c0741893` (rotated to the post-#254 AMI):
```
vector.service: failed
populate-vector-env.service: failed
worker.env mtime: 00:44:00 (cloud-init wrote it)
populator journal:
00:43:08 populator: Start request repeated too quickly
00:43:08 populator: Failed with result 'exit-code'
00:43:09 populator: Start request repeated too quickly
00:43:09 populator: Failed
... 5 more in <2 seconds
```
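Mechanism:
- `vector.service` has `Restart=always`, and each restart re-requests the populator unit.
- systemd counts those requests as start attempts against the populator's `StartLimitBurst=5` / `IntervalSec=120`.
- They all land in under 2 seconds, faster than `RestartSec=10s` can pace them.
- The burst limit trips, the populator enters `failed`, and vector enters `failed` with it.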
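To make the start-limit arithmetic concrete, here is a simplified model of a burst/interval limiter. It is a sketch of the general windowed-counter idea, not systemd's actual code: the function name `first_refused` and the exact window semantics are assumptions. It takes a burst, an interval in seconds, and a list of start-request timestamps (in seconds), and prints the index of the first request that would be refused.

```bash
# Simplified, assumed model of a start-rate limit (not systemd source).
# first_refused BURST INTERVAL T1 T2 ...
# Prints the 1-based index of the first refused request, or "none".
first_refused() {
  local burst=$1 interval=$2
  shift 2
  local begin="" count=0 i=0 t
  for t in "$@"; do
    i=$((i + 1))
    if [ -z "$begin" ] || [ $((t - begin)) -gt "$interval" ]; then
      # No window open yet, or the old window has expired: start a new one.
      begin=$t
      count=1
    elif [ "$count" -lt "$burst" ]; then
      # Still within budget for the current window.
      count=$((count + 1))
    else
      # Budget exhausted inside the window: this request is refused.
      echo "$i"
      return 0
    fi
  done
  echo none
}
```

Under this model, `first_refused 5 120 0 0 0 1 1 1` reports the sixth request as refused, which matches the shape of the journal above: once re-requests arrive within a couple of seconds of each other, the budget of 5 is gone almost immediately.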
Fix
Poll inside the script instead of relying on systemd retry. Single invocation, internal wait up to 90s. No restart-budget interaction:
```bash
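# Wait up to 90s for cloud-init to write a role env file.
DEADLINE=$(( $(date +%s) + 90 ))
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  # Either role file is enough; stop waiting as soon as one exists.
  if [ -f /etc/opensandbox/worker.env ] || [ -f /etc/opensandbox/server.env ]; then
    break
  fi
  log "waiting for cloud-init to write env file..."
  sleep 5
done
# Source whichever role file is present (neither, if the wait timed out).
[ -f /etc/opensandbox/worker.env ] && . /etc/opensandbox/worker.env
[ -f /etc/opensandbox/server.env ] && . /etc/opensandbox/server.env
VAULT_NAME="${OPENSANDBOX_AZURE_KEY_VAULT_NAME:-}"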
```
Behaves identically on dev (env file already exists → first iteration breaks out, no wait).
Why #254 dev test didn't catch this
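The dev test in #254 confirmed that the systemd ordering cycle was gone. It did NOT exercise the cloud-init-delay path because:
- Dev's `bootstrap.sh` writes `/etc/opensandbox/worker.env` BEFORE Vector's install step, so at boot `worker.env` always exists for the populator.
- Reproducing the delay on dev would have needed an artificial one (e.g. `systemd-run --on-active=60s touch worker.env`), which the test did not include.

Adding a runbook for that simulation would close this gap going forward.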
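One way to exercise the delayed-env path without rebooting a host is to run the same polling logic against a scratch directory and create the file late. A minimal sketch, assuming a hypothetical helper `wait_for_env` that takes the directory, timeout, and poll interval as parameters (the real script hardcodes `/etc/opensandbox`, 90s, and 5s):

```bash
# Hypothetical, parameterized version of the polling loop for local testing.
# wait_for_env DIR TIMEOUT_SECONDS [INTERVAL_SECONDS]
# Returns 0 once DIR/worker.env or DIR/server.env exists, 1 on timeout.
wait_for_env() {
  local dir=$1 timeout=$2 interval=${3:-5}
  local deadline=$(( $(date +%s) + timeout ))
  while [ "$(date +%s)" -lt "$deadline" ]; do
    if [ -f "$dir/worker.env" ] || [ -f "$dir/server.env" ]; then
      return 0
    fi
    sleep "$interval"
  done
  # Final check: the file may have appeared during the last sleep.
  [ -f "$dir/worker.env" ] || [ -f "$dir/server.env" ]
}
```

A delayed write can then be simulated with something like `( sleep 3; touch "$tmp/worker.env" ) &` followed by `wait_for_env "$tmp" 10 1`, where `$tmp` is a directory from `mktemp -d`. Running it against an empty directory with a short timeout covers the give-up branch.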
Test plan
🤖 Generated with Claude Code