Problem
Deploying ephemeral stacks (e.g. commit-fa647ca... for CI/testing) leaves orphaned CloudFormation stacks in the account. These accumulate over time and cause several problems:
- ENIs from AgentCore/Lambda runtimes linger for 20-45+ minutes after the runtime is deleted, blocking VPC/Security Group teardown and causing DELETE_FAILED stacks
- Orphaned resources (DynamoDB tables, Cognito pools, log groups, secrets) accumulate costs
- Manual cleanup is tedious and error-prone
Proposal
An automated cleanup system that runs on a schedule and deletes ephemeral stacks that have exceeded their max age.
Design
Scheduled trigger: EventBridge rule firing every N hours (configurable, default: 4h).
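Because the interval is configurable, the schedule expression has to be built from N. A minimal sketch, assuming whole hours only; `rate_expression` is a hypothetical helper name, not something from the existing script. EventBridge rate expressions use the singular unit when the value is 1, which is easy to get wrong when N comes from configuration.

```python
def rate_expression(hours: int) -> str:
    """Build an EventBridge rate expression for a whole number of hours."""
    if hours < 1:
        raise ValueError("hours must be a positive integer")
    # EventBridge expects "rate(1 hour)" but "rate(4 hours)".
    unit = "hour" if hours == 1 else "hours"
    return f"rate({hours} {unit})"
```

With the default of 4 this yields the same `rate(4 hours)` value used elsewhere in this proposal.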
Compute: Lambda function (Python or Node) that:
- Lists all CloudFormation stacks matching description "ABCA Development Stack"
- Skips stacks with termination protection enabled (permanent environments: -dev, -tst, -stg, -prd, -prod)
- Skips stacks in *_IN_PROGRESS states
- For stacks exceeding max age:
  - Force-detaches Hyperplane ENIs (Lambda/AgentCore) attached to the stack's security groups
  - Waits briefly for detachment
  - Deletes now-available ENIs
  - Initiates delete-stack
- Logs actions to CloudWatch for audit
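The skip rules above could be sketched as a single pure function that returns the reason a stack is left alone, which doubles as the audit log message. This assumes each `stack` dict has the shape of an entry in the CloudFormation DescribeStacks response (`Description`, `EnableTerminationProtection`, `StackStatus`), and that "matching" means exact equality; the existing script may match differently. `skip_reason` is a hypothetical name.

```python
from typing import Optional


def skip_reason(stack: dict, description_filter: str) -> Optional[str]:
    """Return why a stack is skipped, or None if it may be considered for deletion."""
    # Assumption: exact match on the description, the most conservative reading.
    if stack.get("Description", "") != description_filter:
        return "description does not match"
    # Permanent environments are protected via termination protection.
    if stack.get("EnableTerminationProtection", False):
        return "termination protection enabled"
    # Never interfere with a create/update/delete that is still running.
    if stack.get("StackStatus", "").endswith("_IN_PROGRESS"):
        return "operation in progress"
    return None
```

Returning a reason instead of a boolean keeps the safety rails visible in the logs: every skipped stack gets a line explaining why it was not touched.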
Max age behavior
| MAX_AGE_HOURS env var | Behavior |
| --- | --- |
| Set to N | Delete ephemeral stacks older than N hours |
| Missing or 0 | Delete all non-protected ephemeral stacks (default) |
The default of zero means: if you deployed it without termination protection, you intended it to be temporary. The scheduler cleans up everything that isn't explicitly protected. This is the safest default for cost — permanent stacks opt in to protection via the -dev/-prd suffix convention.
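The age rule can be expressed as a small predicate. This is a sketch under a few assumptions: `creation_time` is the timezone-aware `CreationTime` that DescribeStacks returns, "older than N hours" is read as strictly greater than N, and any non-positive value is treated like 0. `exceeds_max_age` is a hypothetical name.

```python
from datetime import datetime, timedelta, timezone


def exceeds_max_age(creation_time: datetime, max_age_hours: int, now: datetime) -> bool:
    """True if a stack created at creation_time is eligible for deletion by age."""
    # 0 (the default) means there is no age threshold: every
    # non-protected ephemeral stack is eligible.
    if max_age_hours <= 0:
        return True
    return now - creation_time > timedelta(hours=max_age_hours)
```

Passing `now` in explicitly, rather than calling `datetime.now(timezone.utc)` inside the function, keeps the rule easy to test.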
Termination protection convention (from PR #70)
cdk/src/main.ts enables terminationProtection for stacks whose name ends in a known environment suffix:
const protectedEnvs = ['dev', 'tst', 'stg', 'prd', 'prod'];
const isProtected = protectedEnvs.some(env => stackName.endsWith(`-${env}`));
All other stacks (e.g. commit-abc123, feature-xyz, scott-test) are ephemeral and eligible for cleanup.
Infrastructure
EventBridge Rule (rate: every 4h)
  └─► Lambda Function
        ├─ cloudformation:ListStacks, DescribeStacks, DeleteStack, ListStackResources
        ├─ ec2:DescribeNetworkInterfaces, DetachNetworkInterface, DeleteNetworkInterface
        └─ logs:CreateLogGroup, CreateLogStream, PutLogEvents
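For reference, the permissions in the diagram could be written as an IAM policy document along these lines. This is a sketch, not the policy from any existing stack: the `Sid` values and the `Resource: "*"` scoping are assumptions, and `logs:CreateLogStream` is included because Lambda needs it to write to a log stream.

```python
import json


def cleanup_lambda_policy() -> dict:
    """IAM policy document covering the actions the cleanup Lambda calls directly."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Sid": "CloudFormation",
                "Effect": "Allow",
                "Action": [
                    "cloudformation:ListStacks",
                    "cloudformation:DescribeStacks",
                    "cloudformation:DeleteStack",
                    "cloudformation:ListStackResources",
                ],
                "Resource": "*",  # assumption: could be narrowed to a stack ARN pattern
            },
            {
                "Sid": "EniCleanup",
                "Effect": "Allow",
                "Action": [
                    "ec2:DescribeNetworkInterfaces",
                    "ec2:DetachNetworkInterface",
                    "ec2:DeleteNetworkInterface",
                ],
                "Resource": "*",
            },
            {
                "Sid": "Logging",
                "Effect": "Allow",
                "Action": [
                    "logs:CreateLogGroup",
                    "logs:CreateLogStream",
                    "logs:PutLogEvents",
                ],
                "Resource": "*",
            },
        ],
    }


if __name__ == "__main__":
    print(json.dumps(cleanup_lambda_policy(), indent=2))
```

One thing to check during implementation: `cloudformation:DeleteStack` only starts the deletion. Unless the stacks are deleted through a CloudFormation service role, CloudFormation uses the caller's credentials to delete the underlying resources, so the Lambda role would likely also need delete permissions for the resource types in the stack (DynamoDB, Cognito, log groups, secrets, and so on).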
Could be deployed as:
- A separate lightweight CDK stack (EphemeralCleanupStack) with its own termination protection
- Or a construct added to the main ABCA stack (simpler, but couples lifecycle)
Configuration (environment variables on the Lambda)
| Variable | Default | Description |
| --- | --- | --- |
| MAX_AGE_HOURS | 0 | Stacks older than this are deleted. 0 = delete all non-protected. |
| STACK_DESCRIPTION_FILTER | ABCA Development Stack | Only target stacks matching this description |
| SCHEDULE_RATE | rate(4 hours) | EventBridge schedule expression |
| DRY_RUN | false | Log what would be deleted without acting |
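Reading these variables inside the Lambda might look like the sketch below. `CleanupConfig` and `load_config` are hypothetical names, and two things are assumptions: which strings count as true for DRY_RUN, and that SCHEDULE_RATE is consumed at deploy time by the EventBridge rule rather than read by the function, so it is left out here.

```python
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class CleanupConfig:
    max_age_hours: int
    stack_description_filter: str
    dry_run: bool


def load_config(env: Mapping[str, str]) -> CleanupConfig:
    """Parse the Lambda's environment variables, applying the documented defaults."""
    raw_age = env.get("MAX_AGE_HOURS", "").strip()
    # Missing or empty is treated the same as 0: no age threshold.
    max_age_hours = int(raw_age) if raw_age else 0
    if max_age_hours < 0:
        raise ValueError("MAX_AGE_HOURS must be >= 0")
    # Assumption: accept a few common spellings of "true".
    dry_run = env.get("DRY_RUN", "false").strip().lower() in ("1", "true", "yes")
    description = env.get("STACK_DESCRIPTION_FILTER", "ABCA Development Stack")
    return CleanupConfig(max_age_hours, description, dry_run)
```

In the handler this would be called as `load_config(os.environ)`. A non-numeric MAX_AGE_HOURS raises `ValueError` rather than silently falling back to 0, since falling back would mean "delete everything".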
Existing script
scripts/cleanup-ephemeral-stacks.sh (added in PR #70) implements the core logic as a CLI tool. The Lambda would be a port of this with the same safety rails.
ENI cleanup detail
AgentCore and Lambda create "Hyperplane ENIs" managed by the service. After runtime deletion, AWS's internal GC releases them in 20-45 minutes — but CloudFormation times out waiting. The cleanup sequence:
- ec2:DescribeNetworkInterfaces filtered by the stack's security group IDs
- ec2:DetachNetworkInterface --force for each in-use ENI
- Wait 15s
- ec2:DeleteNetworkInterface for each now-available ENI
- cloudformation:DeleteStack
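The sequence above could be ported roughly as follows. This is a sketch, not a transcription of scripts/cleanup-ephemeral-stacks.sh: `ec2` and `cfn` are assumed to be boto3 EC2 and CloudFormation clients (or test doubles with the same methods), the function names are hypothetical, pagination is omitted, and there is no per-ENI error handling. A real port would probably catch `ClientError` on each detach, since the service may refuse to detach an ENI it manages.

```python
import time
from typing import Callable, Iterable, List


def security_group_ids(resource_summaries: Iterable[dict]) -> List[str]:
    """Pick security group IDs out of ListStackResources 'StackResourceSummaries'."""
    return [
        r["PhysicalResourceId"]
        for r in resource_summaries
        if r.get("ResourceType") == "AWS::EC2::SecurityGroup" and r.get("PhysicalResourceId")
    ]


def _enis(ec2, group_ids: List[str], status: str) -> List[dict]:
    # Filter by the stack's security groups and by ENI status.
    resp = ec2.describe_network_interfaces(
        Filters=[
            {"Name": "group-id", "Values": list(group_ids)},
            {"Name": "status", "Values": [status]},
        ]
    )
    return resp.get("NetworkInterfaces", [])


def cleanup_enis_and_delete_stack(
    ec2,
    cfn,
    stack_name: str,
    group_ids: List[str],
    dry_run: bool = False,
    wait_seconds: float = 15,
    sleep: Callable[[float], None] = time.sleep,
) -> List[str]:
    """Run the ENI cleanup sequence and return the actions taken (or planned)."""
    actions: List[str] = []
    if group_ids:
        # Force-detach every ENI that is still in use.
        for eni in _enis(ec2, group_ids, "in-use"):
            attachment_id = eni.get("Attachment", {}).get("AttachmentId")
            if not attachment_id:
                continue
            actions.append(f"detach {eni['NetworkInterfaceId']}")
            if not dry_run:
                ec2.detach_network_interface(AttachmentId=attachment_id, Force=True)
        # Give the detachments a moment to complete.
        if not dry_run:
            sleep(wait_seconds)
        # Delete whatever is available now.
        for eni in _enis(ec2, group_ids, "available"):
            actions.append(f"delete {eni['NetworkInterfaceId']}")
            if not dry_run:
                ec2.delete_network_interface(NetworkInterfaceId=eni["NetworkInterfaceId"])
    actions.append(f"delete-stack {stack_name}")
    if not dry_run:
        cfn.delete_stack(StackName=stack_name)
    return actions
```

The returned list is what would be written to CloudWatch. In dry-run mode only the read-only describe calls are made, so ENIs that would become available after a detach do not show up as planned deletions.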
Acceptance criteria
- MAX_AGE_HOURS=0 deletes all non-protected stacks (default)
- DRY_RUN mode for testing