
Fix app redeployment prior to finish #85

Open
joshhalley wants to merge 2 commits into develop from pr/johalley/deploycontinues

Conversation

@joshhalley
Contributor

Description

Returns an error when WaitForAppStatus times out during pod deployment instead of logging a warning and continuing to the next container.

Problem

In DeployPod, after submitting an app configuration and issuing the install RPC, the code waited up to 120 seconds for the app to reach DEPLOYED status. If the timeout was exceeded, it logged a warning and continued execution:

if err := d.WaitForAppStatus(ctx, appConfig.AppName(), "DEPLOYED", 120*time.Second); err != nil {
    log.G(ctx).Warnf("App %s did not reach DEPLOYED within timeout: %v (will continue)", appConfig.AppName(), err)
}

This caused two problems:

  1. For multi-container pods: the next container's install would be submitted while the previous one was still in an indeterminate state, potentially overwhelming the device or causing silent install failures.
  2. For all pods: CreatePod returned success to the VK library even though the app might not have been installed. Subsequent lifecycle steps (activate, start) driven by the reconciler would operate against an app in an unknown state, often failing silently and leaving the app stuck.

Fix

Single-line change: replace the warn-and-continue with a hard error return:

return fmt.Errorf("app %s did not reach DEPLOYED status: %w", appConfig.AppName(), err)

The VK library will see the CreatePod failure and handle retry/backoff, rather than the provider silently advancing past a failed install.

Related

  • Identified as item 3.3 in the architecture & code review
  • WaitForAppStatus already respects context cancellation, so the timeout path is the only unhandled error case

Type of Change

  • Bug Fix
  • New Feature
  • Breaking Change
  • Refactor
  • Documentation
  • Other (please describe)

Checklist

  • I have read the contributing guidelines
  • Existing issues have been referenced (where applicable)
  • I have verified this change is not present in other open pull requests
  • Functionality is documented
  • All code style checks pass
  • New code contribution is covered by automated tests
  • All new and existing tests pass

Note: One pre-existing test failure (TestGetResourceConfig_FromRequests) exists on develop due to the ephemeral-storage test bug tracked in pr/johalley/TestGetResourceConfig_FromRequests. It is unrelated to this change.

Contributor

@cunningr cunningr left a comment

I think I understand mostly what the issue is here, but I have a test question:

Due to the way the IOx AppHosting lifecycle behaves, the reconciler is designed to keep trying to progress an App container through the state machine even if an RPC times out, since state transitions often take a long time and the IOx device cannot process them concurrently. So we just keep trying to move things on.

By returning this hard error, do we risk any race condition loop where we keep deleting an incomplete App because it took longer than some arbitrary timeout for the IOx device to progress states?
