vector: poll for worker.env inside populator instead of systemd retry by motatoes · Pull Request #256 · diggerhq/opencomputer

motatoes · 2026-05-16T01:06:27Z

Summary

#254 (just merged) made the populator `exit 1` when `/etc/opensandbox/{worker,server}.env` was missing, expecting systemd's `Restart=on-failure` to retry until cloud-init lands the file. It hit a follow-up bug on prod that the in-place dev test in #254 didn't catch.

What broke on prod

Observed on `osb-worker-c0741893` (rotated to the post-#254 AMI):

```
vector.service: failed
populate-vector-env.service: failed
worker.env mtime: 00:44:00 (cloud-init wrote it)
populator journal:
00:43:08 populator: Start request repeated too quickly
00:43:08 populator: Failed with result 'exit-code'
00:43:09 populator: Start request repeated too quickly
00:43:09 populator: Failed
... 5 more in <2 seconds
```

Mechanism:

`vector.service` has `Restart=always`.
Each Vector restart re-requests `populate-vector-env.service` (Wants= dep).
systemd counts those re-requests against the populator's `StartLimitBurst=5` / `IntervalSec=120`.
Vector restarts faster than `RestartSec=10s` can pace the populator's own retries — burst exhausts in 2 seconds.
Populator enters `failed`. Vector also `failed`. Both stuck until manual intervention.
~50 seconds later, cloud-init writes worker.env. Too late — both units are dead.
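
The burst arithmetic above can be sketched with a small model. This is only an approximation of a start rate limit for illustration, not systemd's actual implementation; the function name `first_refused_start` and the integer-second timestamps are assumptions made for the example.

```bash
# Simplified model of a start rate limit (an approximation for illustration,
# not systemd's actual implementation).
# Usage: first_refused_start <burst> <interval_sec> <timestamp>...
# Prints the timestamp of the first refused start request, or "ok".
first_refused_start() {
  local burst="$1" interval="$2"
  shift 2
  local window_start="" count=0 t
  for t in "$@"; do
    # Open a new window on the first request, or once the old one has expired.
    if [ -z "$window_start" ] || [ $((t - window_start)) -ge "$interval" ]; then
      window_start="$t"
      count=0
    fi
    # Once the window already holds <burst> starts, the next one is refused.
    if [ "$count" -ge "$burst" ]; then
      echo "$t"
      return 0
    fi
    count=$((count + 1))
  done
  echo "ok"
}

# Six start requests within 2 seconds: the sixth (at t=2) is refused.
first_refused_start 5 120 0 0 1 1 1 2
# Starts spaced 30s apart roll over into a fresh window and are never refused.
first_refused_start 5 120 0 30 60 90 120 150
```

Under this model, who issues the start request is irrelevant: requests triggered by Vector's restarts consume the same budget as the populator's own retries.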

Fix

Poll inside the script instead of relying on systemd retry. Single invocation, internal wait up to 90s. No restart-budget interaction:

```bash
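# Wait up to 90s for cloud-init to write a role env file.
DEADLINE=$(( $(date +%s) + 90 ))
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
  if [ -f /etc/opensandbox/worker.env ] || [ -f /etc/opensandbox/server.env ]; then
    break
  fi
  # log: helper assumed to be defined earlier in the populator script.
  log "waiting for cloud-init to write env file..."
  sleep 5
done
# Source whichever role env file exists; VAULT_NAME stays empty if neither appeared.
[ -f /etc/opensandbox/worker.env ] && . /etc/opensandbox/worker.env
[ -f /etc/opensandbox/server.env ] && . /etc/opensandbox/server.env
VAULT_NAME="${OPENSANDBOX_AZURE_KEY_VAULT_NAME:-}"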
```

Behaves identically on dev (env file already exists → first iteration breaks out, no wait).
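
The same polling pattern could be factored into a reusable helper so the timeout and candidate paths are explicit. This is a minimal sketch, not the code shipped in this PR; the function name `wait_for_any_file` and its argument order are assumptions.

```bash
# Poll until any of the given files exists, or until the timeout elapses.
# Usage: wait_for_any_file <timeout_sec> <interval_sec> <path>...
# Prints the first path found and returns 0; returns 1 on timeout.
wait_for_any_file() {
  local timeout="$1" interval="$2"
  shift 2
  local deadline=$(( $(date +%s) + timeout ))
  local f
  while :; do
    for f in "$@"; do
      if [ -f "$f" ]; then
        printf '%s\n' "$f"
        return 0
      fi
    done
    # Check the deadline only after looking, so an existing file never waits.
    if [ "$(date +%s)" -ge "$deadline" ]; then
      return 1
    fi
    sleep "$interval"
  done
}

# Example call (commented out so sourcing this sketch has no side effects):
# env_file=$(wait_for_any_file 90 5 /etc/opensandbox/worker.env /etc/opensandbox/server.env) \
#   && . "$env_file"
```

Because the existence check runs before the deadline check, a host where the env file is already present returns on the first pass without sleeping, matching the dev behavior described above.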

Why #254 dev test didn't catch this

The dev test in #254 confirmed that the systemd ordering cycle was gone. It did NOT exercise the cloud-init-delay path because:

Dev cluster's `bootstrap.sh` writes `/etc/opensandbox/worker.env` BEFORE Vector's install step.
So at boot on dev, `worker.env` always exists when the populator runs — the script's "missing env file" branch is never hit.
The cycle test happened to pass because it didn't depend on retry timing; the retry-burst test would have needed an artificial cloud-init delay (e.g. `systemd-run --on-active=60s touch /etc/opensandbox/worker.env` before reboot) to reproduce.

Adding a runbook for that simulation would close this gap going forward.

Test plan

Apply on dev — `worker.env` exists, populator should run once and succeed immediately (no wait).
Simulate cloud-init delay on dev: `rm /etc/opensandbox/worker.env` + `systemd-run --on-active=45 sh -c 'echo OPENSANDBOX_AZURE_KEY_VAULT_NAME=opencomputer-dev-kv > /etc/opensandbox/worker.env'` + reboot. Confirm populator waits, succeeds at ~45s, vector starts.
Roll to prod via AMI rebake + worker rotation. Confirm `vector.service: active` on freshly-booted prod workers.

🤖 Generated with Claude Code

#254 made the populator exit 1 when the role env file was missing, so systemd's Restart=on-failure could retry.

Hit a real bug on prod (osb-worker-c0741893): vector.service has Restart=always. Each restart re-requests the populator unit. systemd counts these as start attempts against the populator's StartLimitBurst=5 / IntervalSec=120, but they all land in <2 seconds (faster than RestartSec=10s can pace them). Burst tripped, populator enters `failed`, vector also enters `failed`.

Journal:

    00:43:08 populator: Start request repeated too quickly
    00:43:08 populator: Failed
    ... 5 more in 2 seconds
    00:44:00 worker.env written by cloud-init (too late, populator dead)

The systemd retry mechanism doesn't compose well when other units re-request you faster than your RestartSec= can pace.

Fix: poll inside the script. Single systemd invocation, internal wait up to 90s, source the env file when it appears. No restart-budget interaction. Behaves identically on dev (env file already exists → break out immediately on first iteration).

Why the test in #254 didn't catch this: Dev's bootstrap.sh writes /etc/opensandbox/worker.env BEFORE Vector's install step. So at boot, worker.env always exists for the populator. The dev test confirmed the cycle was gone, not that the retry mechanism worked under cloud-init delay. To reproduce on dev would have needed an artificial delay (e.g. systemd-run --on-active=60s touch worker.env); would catch this in future.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

breardon2011

Approve

#257)

#256 introduced a 90s internal poll for worker.env. Hit a follow-up issue on prod (osb-worker-0b42c8be): cloud-init wrote worker.env at +4 minutes into boot, our 90s poll gave up at +1.5 minutes. Populator exited 0 with "no KV configured", vector ran without env file, failed, restart-looped into a failed state, and the late env arrival had no effect.

Two changes:

1. Bump the poll deadline from 90s to 600s. Azure cloud-init on Standard_D-series VMs takes 3-5 minutes in observed cases; 10 minutes covers the long tail with margin.
2. Add ExecStartPost on the service that does

       systemctl --no-block reset-failed vector.service
       systemctl --no-block restart vector.service

   so when populator finally writes vector.env (potentially after vector has already exhausted its restart budget), vector is reset-failed and restarted. --no-block avoids the deadlock with vector's After=populator dep.

What we explored but didn't ship: systemd Path units (populate-vector-env.path watching worker.env or boot-finished). Eight dev-test reboots surfaced: ordering cycles, RemainAfterExit no-op on path triggers, Wants= cascade re-triggers, and dir-level inotify storms (50-250 starts/sec when cloud-init writes any file in the watched directory). Concluded the path-unit-on-shared-dir interaction is the wrong tool for this.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
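
The reset-and-restart step from that commit message could be wrapped in a small helper. This is a sketch under assumptions: the function name `kick_unit` and the `SYSTEMCTL` override variable are hypothetical and not part of the repository; only the two `systemctl --no-block` invocations come from the commit message.

```bash
# SYSTEMCTL can be overridden (e.g. with a stub) to exercise this without systemd.
SYSTEMCTL="${SYSTEMCTL:-systemctl}"

# Clear a unit's failed state, then queue a restart without waiting for the job.
# --no-block returns as soon as the job is enqueued, so the caller is not held
# up by ordering dependencies between the two units.
kick_unit() {
  local unit="$1"
  "$SYSTEMCTL" --no-block reset-failed "$unit"
  "$SYSTEMCTL" --no-block restart "$unit"
}

# Example call (commented out so sourcing this sketch has no side effects):
# kick_unit vector.service
```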

The previous iteration (since reverted in #259) shipped a 600s synchronous in-script wait on worker.env. On Azure that deadlocked the boot: cloud-final.service is ordered After=multi-user.target on Ubuntu Azure images, and writing /etc/opensandbox/worker.env is what cloud-final does. multi-user.target couldn't reach active while the populator was waiting (vector.service wants populator, multi-user wants vector). Every new Azure worker was reaped at exactly 600s by scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time (dev hosts, image bake, reboot of a healthy VM), the populator pulls real creds from Key Vault and writes vector.env synchronously (unchanged behavior).
- If neither role env exists (Azure first boot, cloud-final hasn't run yet), the populator:
  1. writes a stub vector.env with all expected variables defined but empty, so `vector validate` passes and the service can start (the axiom sink fails its healthcheck and buffers to disk),
  2. starts a new companion unit populate-vector-env-wait.service (not WantedBy=multi-user.target, so it doesn't block boot),
  3. exits 0 in ~1s.

The wait unit polls /etc/opensandbox/{worker,server}.env every 5s for up to 30 min (past Azure cloud-init's worst-case ~5 min), then re-runs the main populator (which now finds the role env file and goes through the synchronous path) and does `systemctl reset-failed + restart vector.service` so the disk buffer flushes into Axiom with the real token.

Why prior approaches failed (full history in populate-vector-env.sh header):

- #249 After=cloud-final → systemd cycle, vector dropped silently.
- #254 exit 1 + Restart=on-failure → vector's restart-burst burnt the StartLimitBurst budget in <2s.
- #256 internal 90s poll → multi-user blocked 90s, populator gave up before cloud-final arrived at ~4 min anyway.
- #257 internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:

- systemd .path unit watching the specific worker.env file (not the dir): would work, but adds a third unit and still needs the same decoupling between vector.service and the populator at boot time that this approach already achieves more directly.
- Type=forking + setsid + disown in one unit: the detached child can be killed by systemd on unit stop unless KillMode=process, which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
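
The "stub vector.env" step described in that commit message might look roughly like the following. This is a sketch, not the contents of populate-vector-env.sh: the function names `role_env_present` and `write_stub_env` are hypothetical, and the variable names in the example call (`AXIOM_TOKEN`, `AXIOM_DATASET`) are placeholders, since the commit message does not list the real ones.

```bash
# Return 0 if either role env file exists in the given directory.
role_env_present() {
  [ -f "$1/worker.env" ] || [ -f "$1/server.env" ]
}

# Write an env file in which every listed variable is defined but empty.
# Usage: write_stub_env <path> <VAR>...
write_stub_env() {
  local path="$1"
  shift
  local tmp="$path.tmp.$$" name
  : > "$tmp"
  chmod 600 "$tmp"
  for name in "$@"; do
    printf '%s=\n' "$name" >> "$tmp"
  done
  # Rename into place so a reader never sees a half-written file.
  mv "$tmp" "$path"
}

# Example flow (commented out; paths and variable names are placeholders):
# if ! role_env_present /etc/opensandbox; then
#   write_stub_env /path/to/vector.env AXIOM_TOKEN AXIOM_DATASET
#   systemctl --no-block start populate-vector-env-wait.service
#   exit 0
# fi
```

Writing to a temporary file and renaming it is a design choice for the sketch: it means Vector either sees the previous file or the complete stub, never a partial write.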

motatoes marked this pull request as ready for review May 16, 2026 01:11

breardon2011 approved these changes May 16, 2026

motatoes merged commit 68d60b3 into main May 16, 2026
1 check passed

motatoes mentioned this pull request May 16, 2026

vector: bump populator poll to 600s + ExecStartPost vector restart #257

Merged

3 tasks

motatoes mentioned this pull request May 18, 2026

vector: detach populator from boot when role env is missing #260

Merged

12 tasks

motatoes merged 1 commit into main from fix/populator-poll-for-worker-env
