feat: automated ephemeral stack cleanup (scheduled) #72

@scottschreckengaust

Problem

Deploying ephemeral stacks (e.g. commit-fa647ca... for CI/testing) leaves orphaned CloudFormation stacks in the account. These accumulate over time and:

  1. ENIs from AgentCore/Lambda runtimes linger for 20-45+ minutes after the runtime is deleted, blocking VPC/Security Group teardown and causing DELETE_FAILED stacks
  2. Orphaned resources (DynamoDB tables, Cognito pools, log groups, secrets) accumulate costs
  3. Manual cleanup is tedious and error-prone

Proposal

An automated cleanup system that runs on a schedule and deletes ephemeral stacks that have exceeded their max age.

Design

Scheduled trigger: EventBridge rule firing every N hours (configurable, default: 4h).
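
As a small illustration of the configurable interval, the helper below builds an EventBridge rate expression from an hour count. The function name `rateExpression` is hypothetical; the one detail it encodes is that rate expressions use the singular unit when the value is 1.

```typescript
// Build an EventBridge rate expression such as "rate(4 hours)".
// Hypothetical helper, not part of the existing code base.
function rateExpression(hours: number): string {
  if (!Number.isInteger(hours) || hours < 1) {
    throw new Error(`hours must be a positive integer, got ${hours}`);
  }
  // EventBridge expects "hour" for 1 and "hours" for anything larger.
  const unit = hours === 1 ? "hour" : "hours";
  return `rate(${hours} ${unit})`;
}

console.log(rateExpression(4)); // rate(4 hours)
```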

Compute: Lambda function (Python or Node) that:

  1. Lists all CloudFormation stacks matching description "ABCA Development Stack"
  2. Skips stacks with termination protection enabled (permanent environments: -dev, -tst, -stg, -prd, -prod)
  3. Skips stacks in *_IN_PROGRESS states
  4. For stacks exceeding max age:
    • Force-detaches Hyperplane ENIs (Lambda/AgentCore) attached to the stack's security groups
    • Waits briefly for detachment
    • Deletes now-available ENIs
    • Initiates delete-stack
  5. Logs actions to CloudWatch for audit
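
Steps 1-3 above amount to a screening pass over the stack list. The sketch below shows one way that pass might look as a pure function. The `StackInfo` shape and the name `screenStacks` are assumptions made for illustration (they are not AWS SDK types), and "matching" is interpreted here as a substring match on the description. The max-age check from step 4 is deliberately left out.

```typescript
// Hypothetical, trimmed-down view of a CloudFormation stack.
interface StackInfo {
  name: string;
  description: string;
  status: string; // e.g. "CREATE_COMPLETE", "DELETE_IN_PROGRESS"
  terminationProtection: boolean;
}

interface Verdict {
  name: string;
  eligible: boolean;
  reason: string; // doubles as the audit log message
}

function screenStacks(stacks: StackInfo[], descriptionFilter: string): Verdict[] {
  return stacks
    // Step 1: only consider stacks whose description matches the filter.
    .filter((s) => s.description.includes(descriptionFilter))
    .map((s): Verdict => {
      // Step 2: never touch protected (permanent) environments.
      if (s.terminationProtection) {
        return { name: s.name, eligible: false, reason: "termination protection enabled" };
      }
      // Step 3: leave stacks alone while CloudFormation is still working on them.
      if (s.status.endsWith("_IN_PROGRESS")) {
        return { name: s.name, eligible: false, reason: `stack is ${s.status}` };
      }
      return { name: s.name, eligible: true, reason: "unprotected and idle" };
    });
}
```

Returning a verdict for skipped stacks as well as eligible ones keeps the audit trail complete: every stack that matched the description gets a logged reason.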

Max age behavior

| MAX_AGE_HOURS env var | Behavior |
|---|---|
| Set to N | Delete ephemeral stacks older than N hours |
| Missing or 0 | Delete all non-protected ephemeral stacks (default) |

The default of zero means: if you deployed a stack without termination protection, you intended it to be temporary. The scheduler cleans up everything that isn't explicitly protected. This is the safest default for cost, because permanent stacks opt in to protection via the environment suffix convention (-dev, -tst, -stg, -prd, -prod).
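
The age rule can be expressed as a single comparison. This is a minimal sketch, assuming the stack's creation time is used as its age reference; the name `exceedsMaxAge` is hypothetical.

```typescript
const MS_PER_HOUR = 60 * 60 * 1000;

// Returns true when a stack should be treated as expired.
// maxAgeHours of 0 (the default) means "no age limit": every
// non-protected ephemeral stack counts as expired.
function exceedsMaxAge(createdAt: Date, maxAgeHours: number, now: Date = new Date()): boolean {
  if (maxAgeHours <= 0) {
    return true;
  }
  const ageMs = now.getTime() - createdAt.getTime();
  return ageMs > maxAgeHours * MS_PER_HOUR;
}

const created = new Date("2024-01-01T06:00:00Z");
const now = new Date("2024-01-01T12:00:00Z"); // the stack is 6 hours old
console.log(exceedsMaxAge(created, 0, now)); // true
console.log(exceedsMaxAge(created, 4, now)); // true
console.log(exceedsMaxAge(created, 8, now)); // false
```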

Termination protection convention (from PR #70)

cdk/src/main.ts enables terminationProtection for stacks whose name ends in a known environment suffix:

const protectedEnvs = ['dev', 'tst', 'stg', 'prd', 'prod'];
const isProtected = protectedEnvs.some(env => stackName.endsWith(`-${env}`));

All other stacks (e.g. commit-abc123, feature-xyz, scott-test) are ephemeral and eligible for cleanup.
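
To make the convention concrete, the snippet below wraps the same suffix check in a function and runs it over a few names. `isProtectedName` and the names `abca-dev` and `abca-prod` are made up for illustration. Note that `scott-test` stays ephemeral because `-test` is not one of the listed suffixes.

```typescript
const protectedEnvs = ["dev", "tst", "stg", "prd", "prod"];

// Same check as in cdk/src/main.ts, wrapped for reuse (hypothetical helper).
function isProtectedName(stackName: string): boolean {
  return protectedEnvs.some((env) => stackName.endsWith(`-${env}`));
}

for (const name of ["abca-dev", "abca-prod", "commit-abc123", "feature-xyz", "scott-test"]) {
  console.log(`${name}: ${isProtectedName(name) ? "protected" : "ephemeral"}`);
}
```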

Infrastructure

EventBridge Rule (rate: every 4h)
  └─► Lambda Function
        ├─ cloudformation:ListStacks, DescribeStacks, DeleteStack, ListStackResources
        ├─ ec2:DescribeNetworkInterfaces, DetachNetworkInterface, DeleteNetworkInterface
        └─ logs:CreateLogGroup, CreateLogStream, PutLogEvents

Could be deployed as:

  • A separate lightweight CDK stack (EphemeralCleanupStack) with its own termination protection
  • Or a construct added to the main ABCA stack (simpler, but couples lifecycle)
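
The permissions in the diagram could be written as an IAM policy document along these lines. This is a sketch only: `cleanupPolicy` is a hypothetical helper, and `Resource: "*"` is a placeholder assumption that a real deployment would likely scope down. `logs:CreateLogStream` is included because a Lambda function needs it to write log events.

```typescript
interface PolicyStatement {
  Effect: "Allow";
  Action: string[];
  Resource: string;
}

interface PolicyDocument {
  Version: string;
  Statement: PolicyStatement[];
}

function cleanupPolicy(): PolicyDocument {
  return {
    Version: "2012-10-17",
    Statement: [
      {
        Effect: "Allow",
        Action: [
          "cloudformation:ListStacks",
          "cloudformation:DescribeStacks",
          "cloudformation:DeleteStack",
          "cloudformation:ListStackResources",
        ],
        Resource: "*", // assumption: not yet scoped to specific stacks
      },
      {
        Effect: "Allow",
        Action: [
          "ec2:DescribeNetworkInterfaces",
          "ec2:DetachNetworkInterface",
          "ec2:DeleteNetworkInterface",
        ],
        Resource: "*",
      },
      {
        Effect: "Allow",
        Action: ["logs:CreateLogGroup", "logs:CreateLogStream", "logs:PutLogEvents"],
        Resource: "*",
      },
    ],
  };
}

console.log(JSON.stringify(cleanupPolicy(), null, 2));
```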

Configuration (environment variables on the Lambda)

| Variable | Default | Description |
|---|---|---|
| MAX_AGE_HOURS | 0 | Stacks older than this are deleted. 0 = delete all non-protected. |
| STACK_DESCRIPTION_FILTER | ABCA Development Stack | Only target stacks matching this description |
| SCHEDULE_RATE | rate(4 hours) | EventBridge schedule expression |
| DRY_RUN | false | Log what would be deleted without acting |
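
One way the Lambda might read these variables is sketched below, with the defaults from the table. `CleanupConfig` and `loadConfig` are hypothetical names. The sketch assumes an unparseable or negative MAX_AGE_HOURS should fail loudly instead of silently falling back to 0, since 0 means "delete everything unprotected". SCHEDULE_RATE is read only for completeness; the schedule itself is set on the EventBridge rule at deploy time.

```typescript
interface CleanupConfig {
  maxAgeHours: number;
  stackDescriptionFilter: string;
  scheduleRate: string;
  dryRun: boolean;
}

// Takes the environment as a plain map so it can be exercised without a Lambda runtime.
function loadConfig(env: Record<string, string | undefined>): CleanupConfig {
  const maxAgeHours = Number(env.MAX_AGE_HOURS ?? "0");
  if (!Number.isFinite(maxAgeHours) || maxAgeHours < 0) {
    throw new Error(`MAX_AGE_HOURS must be a non-negative number, got "${env.MAX_AGE_HOURS}"`);
  }
  return {
    maxAgeHours,
    stackDescriptionFilter: env.STACK_DESCRIPTION_FILTER ?? "ABCA Development Stack",
    scheduleRate: env.SCHEDULE_RATE ?? "rate(4 hours)",
    // Only the literal string "true" (any casing) enables dry-run mode.
    dryRun: (env.DRY_RUN ?? "false").toLowerCase() === "true",
  };
}

console.log(loadConfig({}));
console.log(loadConfig({ MAX_AGE_HOURS: "12", DRY_RUN: "true" }));
```

In a Lambda handler this would typically be called as `loadConfig(process.env)`.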

Existing script

scripts/cleanup-ephemeral-stacks.sh (added in PR #70) implements the core logic as a CLI tool. The Lambda would be a port of this with the same safety rails.

ENI cleanup detail

AgentCore and Lambda create "Hyperplane ENIs" managed by the service. After runtime deletion, AWS's internal garbage collection releases them in 20-45+ minutes, but CloudFormation times out waiting before that happens and the stack ends up in DELETE_FAILED. The cleanup sequence:

  1. ec2:DescribeNetworkInterfaces filtered by the stack's security group IDs
  2. ec2:DetachNetworkInterface --force for each in-use ENI
  3. Wait 15s
  4. ec2:DeleteNetworkInterface for each now-available ENI
  5. cloudformation:DeleteStack
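
Steps 1-4 of the sequence above can be sketched against an injected client so the logic is testable without AWS access. `Eni`, `Ec2Like` and the function names are hypothetical stand-ins, not AWS SDK types. Two details are worth noting: `DetachNetworkInterface` takes an attachment ID, not an ENI ID, and a detach call may be rejected for service-managed ENIs, so this sketch logs failures and carries on. Step 5 (`DeleteStack`) is left to the caller.

```typescript
// Hypothetical, trimmed-down ENI record.
interface Eni {
  id: string;
  status: string; // "in-use" or "available"
  attachmentId?: string;
}

// Hypothetical thin wrapper over the three EC2 calls used here.
interface Ec2Like {
  describeEnis(securityGroupIds: string[]): Promise<Eni[]>;
  forceDetach(attachmentId: string): Promise<void>;
  deleteEni(eniId: string): Promise<void>;
}

function attachmentsToDetach(enis: Eni[]): string[] {
  const ids: string[] = [];
  for (const eni of enis) {
    if (eni.status === "in-use" && eni.attachmentId) ids.push(eni.attachmentId);
  }
  return ids;
}

function deletableEnis(enis: Eni[]): string[] {
  return enis.filter((e) => e.status === "available").map((e) => e.id);
}

const sleep = (ms: number) => new Promise<void>((resolve) => setTimeout(resolve, ms));

// Returns the audit log lines; with dryRun set, nothing is modified.
async function cleanupEnis(
  ec2: Ec2Like,
  securityGroupIds: string[],
  waitMs: number = 15000,
  dryRun: boolean = false,
): Promise<string[]> {
  const log: string[] = [];
  // Steps 1-2: find in-use ENIs and force-detach them.
  for (const attachmentId of attachmentsToDetach(await ec2.describeEnis(securityGroupIds))) {
    log.push(`detach ${attachmentId}`);
    if (!dryRun) {
      await ec2.forceDetach(attachmentId).catch((err) => log.push(`detach failed: ${err}`));
    }
  }
  // Step 3: give the detachments a moment to complete.
  if (!dryRun) await sleep(waitMs);
  // Step 4: re-query and delete whatever is now available.
  for (const eniId of deletableEnis(await ec2.describeEnis(securityGroupIds))) {
    log.push(`delete ${eniId}`);
    if (!dryRun) {
      await ec2.deleteEni(eniId).catch((err) => log.push(`delete failed: ${err}`));
    }
  }
  return log;
}
```

Re-querying after the wait, instead of assuming every detach succeeded, means only ENIs that actually reached the available state are deleted.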

Acceptance criteria

  • Lambda function that cleans up ephemeral stacks on schedule
  • EventBridge rule with configurable schedule
  • ENI force-detach before stack deletion
  • Termination protection honored (never deletes protected stacks)
  • MAX_AGE_HOURS=0 deletes all non-protected stacks (default)
  • CloudWatch logging of all actions
  • DRY_RUN mode for testing
  • Deployed as CDK construct (either standalone stack or integrated)
