vector: bump populator poll to 600s + ExecStartPost vector restart#257

Open
motatoes wants to merge 1 commit into main from fix/populator-path-unit
@motatoes
Contributor

Summary

Two-line behavioral fix to the populator that addresses the prod symptom from #256: cloud-init takes longer than 90s on Azure prod workers, so the populator exits before the env file arrives, and vector enters a failed state and never recovers.

Changes

  1. Poll deadline 90s → 600s in `populate-vector-env.sh`. Observed on `osb-worker-0b42c8be`: cloud-init wrote worker.env at +4 minutes into boot, while the 90s budget from #256 (vector: poll for worker.env inside populator instead of systemd retry) had already expired at +1.5 minutes. Azure cloud-init on Standard_D-series VMs takes 3–5 min in practice; 10 minutes covers the long tail with margin.

  2. `ExecStartPost` on `populate-vector-env.service` that runs `systemctl --no-block reset-failed vector.service` followed by `systemctl --no-block restart vector.service`. By the time the populator finally writes vector.env, vector may already be in a failed state from its earlier restart loops. `reset-failed` clears that state and `restart` picks up the new env. `--no-block` avoids a deadlock with vector's `After=populator` dependency.
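
The two changes above can be sketched roughly as follows. This is an illustration, not the contents of `populate-vector-env.sh`: the function names `wait_for_env` and `kick_vector`, the 5s poll interval, the `SYSTEMCTL` override for dry runs, and passing the file path as an argument are all assumptions. Only the 600s deadline and the two `systemctl --no-block` commands come from this PR.

```shell
#!/bin/sh
# Sketch only: helper names, poll interval, and SYSTEMCTL override are assumptions.

# Poll until the file at $1 exists and is non-empty, or until $2 seconds pass.
# Returns 0 if the file arrived inside the deadline, 1 on timeout.
wait_for_env() {
    env_file=$1
    deadline=${2:-600}   # 600s is the new budget from this PR
    interval=${3:-5}     # assumed poll interval
    waited=0
    while [ "$waited" -lt "$deadline" ]; do
        if [ -s "$env_file" ]; then
            return 0
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    # One last check at the deadline itself.
    [ -s "$env_file" ]
}

# The two commands the ExecStartPost step issues, in order.
# SYSTEMCTL can be set to "echo" to print the commands instead of running them.
kick_vector() {
    ${SYSTEMCTL:-systemctl} --no-block reset-failed vector.service
    ${SYSTEMCTL:-systemctl} --no-block restart vector.service
}
```

Running `SYSTEMCTL=echo kick_vector` prints the two commands without touching systemd, which is handy for checking the order before deploying. The failure mode stays bounded: `wait_for_env` returns non-zero once the deadline passes instead of waiting forever.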

What we explored but didn't ship

systemd Path unit (`populate-vector-env.path` watching either worker.env or cloud-init's boot-finished marker). Eight dev-test reboots surfaced four distinct subtle bugs in succession:

  1. `After=cloud-final.service` creates a systemd ordering cycle (cloud-init declares After=multi-user.target on this Azure image)
  2. `RemainAfterExit=yes` makes path-unit's `systemctl start` a no-op
  3. Vector's `Wants=populator` cascades on vector restart, burns populator's StartLimit in <1s
  4. Dir-level inotify storms: 50–250 populator starts per second when cloud-init writes any file in the watched directory (even watching boot-finished in /var/lib/cloud/instance/ tripped this)

Each fix surfaced the next bug. We concluded that a path unit watching a shared directory is the wrong tool for this; the poll approach is simpler and has a known, bounded failure mode (timeout at 10 min).
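
A start storm like the one in item 4 can be quantified by bucketing start log lines per second. The sketch below assumes each input line begins with a second-resolution timestamp as its first whitespace-separated field (for example, output of `journalctl -o short-iso`). `starts_per_second` is a hypothetical helper, and the match string is an assumption about how the unit's start is logged on a given host.

```shell
#!/bin/sh
# Sketch only: helper name and log format are assumptions.

# Read log lines on stdin, keep those containing the fixed string $1,
# and print "<count> <timestamp>" per second, busiest second first.
starts_per_second() {
    pattern=${1:-Starting}
    grep -F -- "$pattern" \
        | awk '{ count[$1]++ } END { for (t in count) print count[t], t }' \
        | sort -rn
}
```

Usage would look like `journalctl -b -o short-iso | starts_per_second 'Starting' | head`, with the pattern adjusted to whatever the unit's start message actually reads.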

Test plan

  • Dev reboot with fake-cloud-init writing worker.env at +240s: populator polls correctly, finds the file inside its budget, exits cleanly
  • AMI rebake + prod worker rotation: confirm vector reaches `active` on freshly-booted prod workers (worker.env arriving at +3-5min from cloud-init)
  • In-place patch on existing prod workers: run new script + service file via az run-command, restart populate-vector-env.service, confirm vector starts
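
The first test-plan item can be approximated locally with a small stand-in for cloud-init. This is a sketch under assumptions: `fake_cloud_init` and `file_arrived_within` are hypothetical helper names, the file contents are a placeholder, and the delay is a parameter so the +240s case can be shortened for a quick check.

```shell
#!/bin/sh
# Sketch only: helper names and file contents are assumptions.

# Write a placeholder env file to $1 after $2 seconds, in the background.
fake_cloud_init() {
    target=$1
    delay=${2:-240}      # +240s matches the test plan
    (
        sleep "$delay"
        # Write to a temp name, then rename, so a poller never
        # observes a half-written file.
        printf 'PLACEHOLDER_KEY=placeholder\n' > "$target.tmp"
        mv "$target.tmp" "$target"
    ) &
}

# Return 0 if $1 exists and is non-empty within $2 seconds (checked once
# per second), 1 otherwise.
file_arrived_within() {
    path=$1
    budget=$2
    elapsed=0
    while [ "$elapsed" -lt "$budget" ]; do
        [ -s "$path" ] && return 0
        sleep 1
        elapsed=$((elapsed + 1))
    done
    [ -s "$path" ]
}
```

For the scenario in the plan, `fake_cloud_init /tmp/worker.env 240` followed by `file_arrived_within /tmp/worker.env 600` should succeed, while a 90s budget should time out.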

🤖 Generated with Claude Code

#256 introduced a 90s internal poll for worker.env. Hit a follow-up
issue on prod (osb-worker-0b42c8be): cloud-init wrote worker.env at
+4 minutes into boot, our 90s poll gave up at +1.5 minutes. Populator
exited 0 with "no KV configured", vector ran without env file, failed,
restart-looped into a failed state, and the late env arrival had no
effect.

Two changes:

1. Bump the poll deadline from 90s to 600s. Azure cloud-init on
   Standard_D-series VMs takes 3-5 minutes in observed cases; 10
   minutes covers the long tail with margin.

2. Add ExecStartPost on the service that does
     systemctl --no-block reset-failed vector.service
     systemctl --no-block restart vector.service
   so when populator finally writes vector.env (potentially after
   vector has already exhausted its restart budget), vector is
   reset-failed and restarted. --no-block avoids the deadlock with
   vector's After=populator dep.

What we explored but didn't ship:
  systemd Path units (populate-vector-env.path watching worker.env
  or boot-finished). Eight dev-test reboots surfaced: ordering
  cycles, RemainAfterExit no-op on path triggers, Wants= cascade
  re-triggers, and dir-level inotify storms (50-250 starts/sec when
  cloud-init writes any file in the watched directory). Concluded
  the path-unit-on-shared-dir interaction is the wrong tool for this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>