conductor: fix on-error cleanup skipped when source task cancelled-after-completion #443

Merged
Dany9966 merged 1 commit into cloudbase:master from claudiubelu:fix-resource-leak
May 27, 2026

Conversation

@claudiubelu
Member

When resources are deployed concurrently on the source (`deploy_source`) and on the target (`deploy_target`), and `deploy_target` fails while `deploy_source` is still running, the conductor marks `deploy_source` as `CANCELLING` and sends it a cancel signal.

If `deploy_source` finishes its work before the signal kills it, `task_completed()` is sent for a task in the `CANCELLING` state. The conductor saves the result (so `source_resources`, including the resource ID, is saved in the DB) and marks the task `CANCELED_AFTER_COMPLETION`.
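
The race described above can be sketched as a small state transition. This is a simplified model rather than the conductor's code: `Task`, `on_task_completed`, and the string status values are hypothetical stand-ins. It assumes only the behaviour described here, namely that the result is saved and the status becomes `CANCELED_AFTER_COMPLETION` when completion arrives during `CANCELLING`.

```python
from dataclasses import dataclass, field

# Hypothetical status values; the real constants in Coriolis may differ.
TASK_STATUS_RUNNING = "RUNNING"
TASK_STATUS_CANCELLING = "CANCELLING"
TASK_STATUS_COMPLETED = "COMPLETED"
TASK_STATUS_CANCELED_AFTER_COMPLETION = "CANCELED_AFTER_COMPLETION"


@dataclass
class Task:
    """Illustrative stand-in for a conductor task record."""
    status: str = TASK_STATUS_RUNNING
    result: dict = field(default_factory=dict)


def on_task_completed(task: Task, result: dict) -> None:
    """Model what happens when a worker reports completion."""
    # The result is stored in both branches, so the IDs of any created
    # resources are recorded even if a cancel was already requested.
    task.result = dict(result)
    if task.status == TASK_STATUS_CANCELLING:
        task.status = TASK_STATUS_CANCELED_AFTER_COMPLETION
    else:
        task.status = TASK_STATUS_COMPLETED


deploy_source = Task()
# deploy_target failed, so the conductor asks deploy_source to cancel.
deploy_source.status = TASK_STATUS_CANCELLING
# deploy_source finishes before the cancel signal takes effect.
on_task_completed(deploy_source, {"source_resources": {"id": "res-1"}})
```

After this sequence the task holds a saved resource ID but is not in the `COMPLETED` state, which is the situation the cleanup check has to handle.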

The on-error cleanup scheduling logic in `_advance_execution_state` checks:

elif TASK_STATUS_COMPLETED in non_error_parents.values():

`CANCELED_AFTER_COMPLETION` != `COMPLETED`, so the check is False. `DELETE_TRANSFER_SOURCE_RESOURCES` is therefore unscheduled, and the source resource is leaked.
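
A minimal reproduction of why the membership test misses this case. It assumes string-valued status constants and a `non_error_parents` dict that maps parent task names to their statuses; both are assumptions for illustration, and the real values and shape in Coriolis may differ.

```python
# Hypothetical status values, for illustration only.
TASK_STATUS_COMPLETED = "COMPLETED"
TASK_STATUS_CANCELED_AFTER_COMPLETION = "CANCELED_AFTER_COMPLETION"

# Assumed shape: parent task name -> status of that parent.
non_error_parents = {
    "deploy_source": TASK_STATUS_CANCELED_AFTER_COMPLETION,
}

# The membership test only matches the exact COMPLETED status, so a parent
# that finished while being cancelled does not count.
should_schedule_cleanup = (
    TASK_STATUS_COMPLETED in non_error_parents.values())
```

Here `should_schedule_cleanup` evaluates to `False` even though the parent saved a result that refers to a live resource.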

This change adds `CLEANUP_TASK_TRIGGER_STATUSES`, which covers both `COMPLETED` and `CANCELED_AFTER_COMPLETION`, since both mean the task ran to completion and may have created resources that need cleaning up. The on-error scheduling check now uses that constant.
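
One way such a check could look is sketched below. This is not the actual diff: the `frozenset` type, the string values, and the helper name `parents_trigger_cleanup` are assumptions made for illustration. Only the constant name `CLEANUP_TASK_TRIGGER_STATUSES` and the two statuses it covers come from the description above.

```python
# Hypothetical status values, for illustration only.
TASK_STATUS_COMPLETED = "COMPLETED"
TASK_STATUS_CANCELED_AFTER_COMPLETION = "CANCELED_AFTER_COMPLETION"
TASK_STATUS_CANCELED = "CANCELED"

# Statuses meaning "the task ran to completion and may have created
# resources". The container type is an assumption.
CLEANUP_TASK_TRIGGER_STATUSES = frozenset({
    TASK_STATUS_COMPLETED,
    TASK_STATUS_CANCELED_AFTER_COMPLETION,
})


def parents_trigger_cleanup(non_error_parents: dict) -> bool:
    """Return True if any parent finished in a cleanup-triggering status."""
    return any(
        status in CLEANUP_TASK_TRIGGER_STATUSES
        for status in non_error_parents.values())


# A parent that completed during cancellation now triggers cleanup.
late_completion = parents_trigger_cleanup(
    {"deploy_source": TASK_STATUS_CANCELED_AFTER_COMPLETION})
# A parent that was cancelled before finishing still does not.
plain_cancel = parents_trigger_cleanup(
    {"deploy_source": TASK_STATUS_CANCELED})
```

Keeping the statuses in one named constant means a later "ran to completion" status only needs to be added in one place.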

This change also adds assertions in `coriolis/tests/integration/test_failure_recovery.py` that check in the DB that `DELETE_TRANSFER_SOURCE_RESOURCES` / `DELETE_TRANSFER_TARGET_RESOURCES` reached a completed status and that the source/target resources were zeroed out in the action info.
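
The kind of assertion described above might be shaped roughly as follows. This is a hedged sketch and not the code in `test_failure_recovery.py`: the helper name, the list-of-dicts task representation, the `target_resources` key, and the reading of "zeroed out" as "empty or `None`" are all assumptions.

```python
# Hypothetical status value, for illustration only.
TASK_STATUS_COMPLETED = "COMPLETED"

CLEANUP_TASK_TYPES = (
    "DELETE_TRANSFER_SOURCE_RESOURCES",
    "DELETE_TRANSFER_TARGET_RESOURCES",
)


def assert_cleanup_ran(tasks: list, action_info: dict) -> None:
    """Check that both cleanup tasks completed and resources were cleared.

    `tasks` is assumed to be a list of dicts with 'task_type' and 'status'
    keys, standing in for rows read back from the DB.
    """
    statuses = {t["task_type"]: t["status"] for t in tasks}
    for task_type in CLEANUP_TASK_TYPES:
        assert statuses.get(task_type) == TASK_STATUS_COMPLETED, (
            f"{task_type} did not complete: {statuses.get(task_type)}")
    # "Zeroed out" is interpreted here as empty or None (an assumption).
    for key in ("source_resources", "target_resources"):
        assert not action_info.get(key), f"{key} was not cleared"


# Example data standing in for the state after a recovered failure.
example_tasks = [
    {"task_type": "DELETE_TRANSFER_SOURCE_RESOURCES", "status": "COMPLETED"},
    {"task_type": "DELETE_TRANSFER_TARGET_RESOURCES", "status": "COMPLETED"},
]
example_action_info = {"source_resources": None, "target_resources": None}
assert_cleanup_ran(example_tasks, example_action_info)
```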

Comment thread coriolis/tests/integration/test_failure_recovery.py Outdated
@Dany9966 merged commit 8dc0bd6 into cloudbase:master on May 27, 2026
5 checks passed