
Fix app redeployment prior to finish #85

Open
joshhalley wants to merge 2 commits into develop from pr/johalley/deploycontinues

Conversation

@joshhalley
Contributor

Description

Returns an error when WaitForAppStatus times out during pod deployment instead of logging a warning and continuing to the next container.

Problem

In DeployPod, after submitting an app configuration and issuing the install RPC, the code waited up to 120 seconds for the app to reach DEPLOYED status. If the timeout was exceeded, it logged a warning and continued execution:

if err := d.WaitForAppStatus(ctx, appConfig.AppName(), "DEPLOYED", 120*time.Second); err != nil {
    log.G(ctx).Warnf("App %s did not reach DEPLOYED within timeout: %v (will continue)", appConfig.AppName(), err)
}

This caused two problems:

  1. For multi-container pods: the next container's install would be submitted while the previous one was still in an indeterminate state, potentially overwhelming the device or causing silent install failures.
  2. For all pods: CreatePod returned success to the VK library even though the app might not have been installed. Subsequent lifecycle steps (activate, start) driven by the reconciler would operate against an app in an unknown state, often failing silently and leaving the app stuck.

Fix

Single-line change: replace the warn-and-continue with a hard error return:

return fmt.Errorf("app %s did not reach DEPLOYED status: %w", appConfig.AppName(), err)

The VK library will see the CreatePod failure and handle retry/backoff, rather than the provider silently advancing past a failed install.

Related

  • Identified as item 3.3 in the architecture & code review
  • WaitForAppStatus already respects context cancellation, so the timeout path is the only unhandled error case

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

Note: One pre-existing test failure (TestGetResourceConfig_FromRequests) exists on develop due to the ephemeral-storage test bug tracked in pr/johalley/TestGetResourceConfig_FromRequests. It is unrelated to this change.

Contributor

@cunningr cunningr left a comment

I think I understand mostly what the issue is here, but I have a test question:

Due to the way the IOx AppHosting lifecycle behaves, the reconciler is designed to keep trying to progress an App container through the state machine even if an RPC times out, since state transitions often take a long time and the IOx device cannot process them concurrently. So we just keep trying to move things on.

By returning this hard error, do we risk any race condition loop where we keep deleting an incomplete App because it took longer than some arbitrary timeout for the IOx device to progress states?
