vector: bump populator poll to 600s + ExecStartPost vector restart #257
Open
motatoes wants to merge 1 commit into
#256 introduced a 90s internal poll for worker.env. Hit a follow-up issue on prod (osb-worker-0b42c8be): cloud-init wrote worker.env at +4 minutes into boot, our 90s poll gave up at +1.5 minutes. Populator exited 0 with "no KV configured", vector ran without env file, failed, restart-looped into a failed state, and the late env arrival had no effect.

Two changes:

1. Bump the poll deadline from 90s to 600s. Azure cloud-init on Standard_D-series VMs takes 3-5 minutes in observed cases; 10 minutes covers the long tail with margin.

2. Add ExecStartPost on the service that does

       systemctl --no-block reset-failed vector.service
       systemctl --no-block restart vector.service

   so when populator finally writes vector.env (potentially after vector has already exhausted its restart budget), vector is reset-failed and restarted. --no-block avoids the deadlock with vector's After=populator dep.

What we explored but didn't ship: systemd Path units (populate-vector-env.path watching worker.env or boot-finished). Eight dev-test reboots surfaced: ordering cycles, RemainAfterExit no-op on path triggers, Wants= cascade re-triggers, and dir-level inotify storms (50-250 starts/sec when cloud-init writes any file in the watched directory). Concluded the path-unit-on-shared-dir interaction is the wrong tool for this.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Two-line behavioral fix to populator that addresses the prod symptom from #256: cloud-init takes longer than 90s on Azure prod workers, populator exits before env file arrives, vector enters failed state and never recovers.
Changes
Poll deadline 90s → 600s in `populate-vector-env.sh`. Observed on `osb-worker-0b42c8be`: cloud-init wrote worker.env at +4 minutes into boot; #256's 90s budget expired at +1.5 minutes. Azure cloud-init on Standard_D-series VMs takes 3–5 min in practice; 10 minutes covers the long tail with margin.
`ExecStartPost` on the populator service that runs `systemctl --no-block reset-failed vector.service` + `systemctl --no-block restart vector.service`. When populator finally writes vector.env, vector may already be in failed state from earlier restart-loops. `reset-failed` clears that state, and `restart` picks up the new env. `--no-block` avoids a deadlock with vector's `After=populator` dep: a blocking restart would wait on the populator start job, which does not complete until `ExecStartPost` itself returns.
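The deadline-bounded poll from the first change can be sketched roughly as below. This is a minimal illustration, not the contents of `populate-vector-env.sh`: the function name `wait_for_file`, the 5-second default interval, and the "exists and is non-empty" readiness check are all assumptions made for the example.

```shell
#!/bin/sh
# Sketch only: wait_for_file, its arguments, and the non-empty check are
# hypothetical and not taken from the actual populator script.

# wait_for_file PATH DEADLINE_SECONDS [INTERVAL_SECONDS]
# Returns 0 as soon as PATH exists and is non-empty.
# Returns 1 once roughly DEADLINE_SECONDS have been spent sleeping.
wait_for_file() {
    path=$1
    deadline=$2
    interval=${3:-5}
    waited=0
    while [ ! -s "$path" ]; do
        if [ "$waited" -ge "$deadline" ]; then
            return 1
        fi
        sleep "$interval"
        waited=$((waited + interval))
    done
    return 0
}

# Example call with the 600s budget from this PR (the path is a placeholder):
#   wait_for_file /path/to/worker.env 600 || echo "worker.env never arrived"
```

With a bounded loop like this, the only failure mode is the timeout itself, which is the property the PR relies on when preferring polling over path units.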
What we explored but didn't ship
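systemd Path unit (`populate-vector-env.path` watching either worker.env or cloud-init's boot-finished marker). Eight dev-test reboots surfaced four distinct subtle bugs in succession: ordering cycles, `RemainAfterExit` being a no-op on path triggers, `Wants=` cascade re-triggers, and directory-level inotify storms (50-250 starts/sec when cloud-init writes any file in the watched directory).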
Each fix surfaced the next bug. Concluded path-unit-on-shared-dir interaction is the wrong tool for this; the poll approach is simpler with a known-bounded failure mode (timeout at 10 min).
Test plan
🤖 Generated with Claude Code