
vector: poll for worker.env inside populator instead of systemd retry #256

Merged
motatoes merged 1 commit into main from fix/populator-poll-for-worker-env on May 16, 2026

Conversation

@motatoes
Contributor

Summary

#254 (just merged) made the populator `exit 1` when `/etc/opensandbox/{worker,server}.env` was missing, expecting systemd's `Restart=on-failure` to retry until cloud-init lands the file. That change hit a follow-up bug on prod that the in-place dev test in #254 didn't catch.

What broke on prod

Observed on `osb-worker-c0741893` (rotated to the post-#254 AMI):

```
vector.service: failed
populate-vector-env.service: failed
worker.env mtime: 00:44:00 (cloud-init wrote it)
populator journal:
00:43:08 populator: Start request repeated too quickly
00:43:08 populator: Failed with result 'exit-code'
00:43:09 populator: Start request repeated too quickly
00:43:09 populator: Failed
... 5 more in <2 seconds
```

Mechanism:

  1. `vector.service` has `Restart=always`.
  2. Each Vector restart re-requests `populate-vector-env.service` (Wants= dep).
  3. systemd counts those re-requests against the populator's `StartLimitBurst=5` / `StartLimitIntervalSec=120`.
  4. Vector restarts faster than the populator's `RestartSec=10s` can pace its own retries, so the burst budget is exhausted in under 2 seconds.
  5. The populator enters `failed`, and Vector does too. Both stay stuck until manual intervention.
  6. About 50 seconds later, cloud-init writes `worker.env`. By then both units are dead.
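
To make the burst arithmetic above concrete, here is a toy model of start rate limiting. It is a simplified fixed-window counter written for illustration, not systemd's actual accounting, and the function name `start_limit_trip` is hypothetical:

```bash
#!/usr/bin/env bash
# Toy model of a start rate limit (an illustration, NOT systemd's code).
# Usage: start_limit_trip BURST INTERVAL_SECONDS TIMESTAMP...
# Timestamps are start requests in ascending seconds.
# Prints the timestamp at which the limit trips, or "ok" if it never does.
start_limit_trip() {
    local burst=$1 interval=$2
    shift 2
    local begin="" count=0 t
    for t in "$@"; do
        if [ -z "$begin" ] || (( t - begin > interval )); then
            # First request, or the previous window has expired: open a new window.
            begin=$t
            count=1
        elif (( count < burst )); then
            # Still inside the window and under budget.
            count=$(( count + 1 ))
        else
            # Budget exhausted inside the window: the unit would enter "failed".
            echo "$t"
            return 0
        fi
    done
    echo "ok"
}

# Six requests within about one second, as in the prod journal above.
start_limit_trip 5 120 0 0 0 1 1 1        # prints 1
# The same six requests spaced 40 seconds apart never trip the limit.
start_limit_trip 5 120 0 40 80 120 160 200  # prints ok
```

The point of the model is only that the budget is consumed by whoever requests the start, so a fast-restarting dependent unit can exhaust it long before the populator's own `RestartSec=` pacing matters.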

Fix

Poll inside the script instead of relying on systemd retry. Single invocation, internal wait up to 90s. No restart-budget interaction:

```bash
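# Wait up to 90s for cloud-init to write a role env file.
DEADLINE=$(( $(date +%s) + 90 ))
while [ "$(date +%s)" -lt "$DEADLINE" ]; do
    # Either role file is enough to proceed.
    if [ -f /etc/opensandbox/worker.env ] || [ -f /etc/opensandbox/server.env ]; then
        break
    fi
    log "waiting for cloud-init to write env file..."
    sleep 5
done
# Source whichever role env file exists (neither exists if the deadline passed).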
[ -f /etc/opensandbox/worker.env ] && . /etc/opensandbox/worker.env
[ -f /etc/opensandbox/server.env ] && . /etc/opensandbox/server.env
VAULT_NAME="${OPENSANDBOX_AZURE_KEY_VAULT_NAME:-}"
```

Behaves identically on dev (env file already exists → first iteration breaks out, no wait).
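
The same polling idea can be factored into a helper that reports a timeout through its exit status, so the caller can tell "file arrived" apart from "gave up". This is a sketch under assumptions: `wait_for_any_file` and its argument order are hypothetical and are not part of the populator script in this PR.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not from the PR).
# Usage: wait_for_any_file TIMEOUT_SECONDS INTERVAL_SECONDS FILE...
# Prints the first existing file and returns 0, or returns 1 on timeout.
wait_for_any_file() {
    local timeout=$1 interval=$2
    shift 2
    local deadline=$(( $(date +%s) + timeout ))
    local f
    while :; do
        for f in "$@"; do
            if [ -f "$f" ]; then
                printf '%s\n' "$f"
                return 0
            fi
        done
        # Check the deadline after scanning, so an existing file is found
        # immediately even when the timeout is 0.
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}
```

A caller could then write `if envfile=$(wait_for_any_file 90 5 /etc/opensandbox/worker.env /etc/opensandbox/server.env); then . "$envfile"; fi` and handle the timeout branch explicitly.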

Why #254 dev test didn't catch this

The dev test in #254 confirmed that the systemd ordering cycle was gone. It did NOT exercise the cloud-init-delay path because:

  • Dev cluster's `bootstrap.sh` writes `/etc/opensandbox/worker.env` BEFORE Vector's install step.
  • So at boot on dev, `worker.env` always exists when the populator runs — the script's "missing env file" branch is never hit.
  • The cycle test happened to pass because it didn't depend on retry timing; the retry-burst test would have needed an artificial cloud-init delay (e.g. `systemd-run --on-active=60s touch /etc/opensandbox/worker.env` before reboot) to reproduce.

Adding a runbook for that simulation would close this gap going forward.
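
As a starting point for such a runbook, the delayed-write scenario can be rehearsed in a temporary directory before involving a real host. This is only a sketch: the function name `simulate_delayed_env`, the one-second poll interval, and the `example-kv` value are assumptions, and a background subshell stands in for cloud-init.

```bash
#!/usr/bin/env bash
# Hypothetical rehearsal of the cloud-init-delay path (runs entirely in a temp dir).
# Usage: simulate_delayed_env DELAY_SECONDS TIMEOUT_SECONDS
# Prints "found" if the env file appears before the deadline, else "timeout".
simulate_delayed_env() {
    local delay=$1 timeout=$2
    local dir envfile deadline writer result=timeout
    dir=$(mktemp -d)
    envfile="$dir/worker.env"
    # Stand-in for cloud-init: write the env file after the delay.
    ( sleep "$delay"; echo 'OPENSANDBOX_AZURE_KEY_VAULT_NAME=example-kv' > "$envfile" ) &
    writer=$!
    deadline=$(( $(date +%s) + timeout ))
    while [ "$(date +%s)" -lt "$deadline" ]; do
        if [ -f "$envfile" ]; then
            result=found
            break
        fi
        sleep 1
    done
    # Let the background writer finish before cleaning up.
    wait "$writer" 2>/dev/null
    rm -rf "$dir"
    echo "$result"
}
```

For example, `simulate_delayed_env 1 5` exercises the path where the file arrives inside the poll window, and `simulate_delayed_env 3 1` exercises the path where the poll gives up first.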

Test plan

  • Apply on dev — `worker.env` exists, populator should run once and succeed immediately (no wait).
  • Simulate cloud-init delay on dev: `rm /etc/opensandbox/worker.env` + `systemd-run --on-active=45 sh -c 'echo OPENSANDBOX_AZURE_KEY_VAULT_NAME=opencomputer-dev-kv > /etc/opensandbox/worker.env'` + reboot. Confirm populator waits, succeeds at ~45s, vector starts.
  • Roll to prod via AMI rebake + worker rotation. Confirm `vector.service: active` on freshly-booted prod workers.

🤖 Generated with Claude Code

#254 made the populator exit 1 when the role env file was missing, so
systemd's Restart=on-failure could retry. Hit a real bug on prod
(osb-worker-c0741893):

  vector.service has Restart=always. Each restart re-requests the
  populator unit. systemd counts these as start attempts against the
  populator's StartLimitBurst=5 / IntervalSec=120 — but they all land
  in <2 seconds (faster than RestartSec=10s can pace them). Burst
  tripped, populator enters `failed`, vector also enters `failed`.

  Journal:
    00:43:08 populator: Start request repeated too quickly
    00:43:08 populator: Failed
    ... 5 more in 2 seconds
    00:44:00 worker.env written by cloud-init (too late, populator dead)

The systemd retry mechanism doesn't compose well when other units
re-request you faster than your RestartSec= can pace.

Fix: poll inside the script. Single systemd invocation, internal wait
up to 90s, source the env file when it appears. No restart-budget
interaction. Behaves identically on dev (env file already exists →
break out immediately on first iteration).

Why the test in #254 didn't catch this:
  Dev's bootstrap.sh writes /etc/opensandbox/worker.env BEFORE Vector's
  install step. So at boot, worker.env always exists for the populator.
  The dev test confirmed the cycle was gone, not that the retry
  mechanism worked under cloud-init delay. To reproduce on dev would
  have needed an artificial delay (e.g. systemd-run --on-active=60s
  touch worker.env) — would catch this in future.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@motatoes motatoes marked this pull request as ready for review May 16, 2026 01:11
Contributor

@breardon2011 breardon2011 left a comment


Approve

@motatoes motatoes merged commit 68d60b3 into main May 16, 2026
1 check passed
motatoes added a commit that referenced this pull request May 18, 2026
#257)

#256 introduced a 90s internal poll for worker.env. Hit a follow-up
issue on prod (osb-worker-0b42c8be): cloud-init wrote worker.env at
+4 minutes into boot, our 90s poll gave up at +1.5 minutes. Populator
exited 0 with "no KV configured", vector ran without env file, failed,
restart-looped into a failed state, and the late env arrival had no
effect.

Two changes:

1. Bump the poll deadline from 90s to 600s. Azure cloud-init on
   Standard_D-series VMs takes 3-5 minutes in observed cases; 10
   minutes covers the long tail with margin.

2. Add ExecStartPost on the service that does
     systemctl --no-block reset-failed vector.service
     systemctl --no-block restart vector.service
   so when populator finally writes vector.env (potentially after
   vector has already exhausted its restart budget), vector is
   reset-failed and restarted. --no-block avoids the deadlock with
   vector's After=populator dep.
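
The second change can be sketched as a small shell function that an ExecStartPost= line might call. The two `systemctl` invocations are the ones quoted above; the function name `kick_vector` and the `SYSTEMCTL` override (useful for a dry run) are hypothetical additions for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical wrapper around the two commands quoted in the commit message.
# SYSTEMCTL can be overridden (for example with "echo") for a dry run.
SYSTEMCTL=${SYSTEMCTL:-systemctl}

kick_vector() {
    # Clear vector's failed state so its restart budget no longer blocks a start.
    "$SYSTEMCTL" --no-block reset-failed vector.service
    # Queue the restart without waiting for it, since vector is ordered after
    # the populator and a blocking call from the populator could deadlock.
    "$SYSTEMCTL" --no-block restart vector.service
}
```

Running it with `SYSTEMCTL=echo` prints the two command lines instead of executing them, which is a cheap way to check the wiring before touching a real host.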

What we explored but didn't ship:
  systemd Path units (populate-vector-env.path watching worker.env
  or boot-finished). Eight dev-test reboots surfaced: ordering
  cycles, RemainAfterExit no-op on path triggers, Wants= cascade
  re-triggers, and dir-level inotify storms (50-250 starts/sec when
  cloud-init writes any file in the watched directory). Concluded
  the path-unit-on-shared-dir interaction is the wrong tool for this.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
motatoes added a commit that referenced this pull request May 19, 2026
The previous iteration (since reverted in #259) shipped a 600s
synchronous in-script wait on worker.env. On Azure that deadlocked the
boot: cloud-final.service is ordered After=multi-user.target on Ubuntu
Azure images, and writing /etc/opensandbox/worker.env is what
cloud-final does. multi-user.target couldn't reach active while the
populator was waiting (vector.service wants populator, multi-user
wants vector). Every new Azure worker was reaped at exactly 600s by
scaler.go's pendingWorkerTTL=10min.

This change makes the populator exit fast in *all* boot paths:

- If /etc/opensandbox/{worker,server}.env exists at populator-run time
  (dev hosts, image bake, reboot of a healthy VM), the populator pulls
  real creds from Key Vault and writes vector.env synchronously —
  unchanged behavior.

- If neither role env exists (Azure first boot, cloud-final hasn't
  run yet), the populator:
    1. writes a stub vector.env with all expected variables defined
       but empty, so `vector validate` passes and the service can
       start (the axiom sink fails its healthcheck and buffers to
       disk),
    2. starts a new companion unit populate-vector-env-wait.service
       (not WantedBy=multi-user.target, so it doesn't block boot),
    3. exits 0 in ~1s.

  The wait unit polls /etc/opensandbox/{worker,server}.env every 5s
  for up to 30 min (past Azure cloud-init's worst-case ~5 min), then
  re-runs the main populator (which now finds the role env file and
  goes through the synchronous path) and does
  `systemctl reset-failed + restart vector.service` so the disk
  buffer flushes into Axiom with the real token.
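
The stub-file step in the first-boot branch could look roughly like the following. This is a minimal sketch: the helper name `write_stub_env` and the variable names in the usage line are hypothetical, and the real list of variables lives in the populator script, which is not shown here.

```bash
#!/usr/bin/env bash
# Hypothetical helper: write an env file with every listed variable defined
# but empty, replacing the target atomically.
# Usage: write_stub_env TARGET_FILE VAR_NAME...
write_stub_env() {
    local target=$1
    shift
    local tmp name
    # Create the temp file next to the target so the final mv is a rename
    # on the same filesystem.
    tmp=$(mktemp "${target}.XXXXXX") || return 1
    for name in "$@"; do
        printf '%s=\n' "$name" >> "$tmp"
    done
    chmod 600 "$tmp"
    mv -f "$tmp" "$target"
}
```

A call such as `write_stub_env /tmp/vector.env AXIOM_TOKEN AXIOM_DATASET` (placeholder names) would produce a file with two empty assignments. Writing to a temp file and renaming it means a concurrently starting service never reads a half-written file.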

Why prior approaches failed (full history in populate-vector-env.sh
header):
  #249  After=cloud-final → systemd cycle, vector dropped silently.
  #254  exit 1 + Restart=on-failure → vector's restart-burst burnt
        the StartLimitBurst budget in <2s.
  #256  internal 90s poll → multi-user blocked 90s, populator gave up
        before cloud-final arrived at ~4 min anyway.
  #257  internal 600s poll → boot deadlock, every Azure worker reaped.

What we explored but didn't ship:
  - systemd .path unit watching the specific worker.env file (not the
    dir): would work, but adds a third unit and still needs the same
    decoupling between vector.service and the populator at boot time
    that this approach already achieves more directly.
  - Type=forking + setsid + disown in one unit: the detached child
    can be killed by systemd on unit stop unless KillMode=process,
    which has subtler semantics than a clean separate unit.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
2 participants