
fix: resolve SQLite connection pool timeout after idle period#3148

Open
hedypamungkas wants to merge 12 commits into tailcallhq:main from hedypamungkas:fix/connection-pool-timeout-idle

Conversation

@hedypamungkas

Summary

Fixes the `Failed to get connection from pool: timed out waiting for connection` error that occurs when a user resumes a Forge conversation after several hours of idle time. The fix adds connection health validation, removes unnecessary warm connections, and implements pool self-healing.

Context

When a user leaves a Forge terminal idle for hours and then resumes, the first database operation fails with a connection pool timeout. Opening a new terminal works fine because it creates a fresh DatabasePool with no stale connections.

Root Cause

The issue is caused by the interaction between idle_timeout, min_idle, and the lack of connection health validation:

  1. `idle_timeout: 600s` — after 10 minutes idle, r2d2 evicts connections from the pool
  2. `min_idle: Some(1)` — the pool tries to maintain at least 1 idle connection, creating new ones after eviction; these replacements can themselves become stale
  3. No health check on acquire — `on_acquire` only runs PRAGMAs and never validates that existing connections are still alive
  4. `connection_timeout: 5s` — too aggressive; if a recreated connection is stale, checkout fails quickly

After hours of idle, the pool has cycled through many create/evict cycles. The SQLite WAL file may have been modified by other Forge processes, or OS-level resource cleanup may have invalidated the connection. When the user resumes, the first checkout hits this stale connection and times out.
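The interaction of these settings can be made concrete with a small sketch. `PoolConfig`, `old_config`, and `fixed_config` below are hypothetical names for illustration only; the real project configures these values through r2d2's builder in `pool.rs`.

```rust
use std::time::Duration;

// Hypothetical config struct mirroring the pool settings discussed above.
#[derive(Debug, PartialEq)]
struct PoolConfig {
    idle_timeout: Duration,
    min_idle: Option<u32>,
    connection_timeout: Duration,
}

// The problematic defaults described in the root-cause analysis.
fn old_config() -> PoolConfig {
    PoolConfig {
        idle_timeout: Duration::from_secs(600),     // evicts after 10 min idle
        min_idle: Some(1),                          // recreates connections that can go stale
        connection_timeout: Duration::from_secs(5), // too aggressive to recover from a stale checkout
    }
}

// The defaults after this PR's Phase 1 changes.
fn fixed_config() -> PoolConfig {
    PoolConfig {
        idle_timeout: Duration::from_secs(600),
        min_idle: None,                              // no warm connections left to go stale
        connection_timeout: Duration::from_secs(15), // room to create a fresh connection
    }
}

fn main() {
    assert_eq!(fixed_config().min_idle, None);
    assert!(fixed_config().connection_timeout > old_config().connection_timeout);
    println!("config sketch ok");
}
```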

Distinction from PR #3033

PR #3033 (already merged) fixes concurrent write contention by moving SQLite ops to spawn_blocking. This PR addresses a different scenario: a single user resuming after idle — no concurrent writes involved.

Changes

Phase 1: Connection Health Check on Acquire

| Change | File | Rationale |
| --- | --- | --- |
| Add `SELECT 1` health check in `on_acquire` | `pool.rs:164-170` | Catches stale connections at checkout time, before they propagate as errors |
| Change `min_idle: Some(1)` → `min_idle: None` | `pool.rs:33` | For a CLI tool that sits idle for hours, maintaining warm connections is counterproductive |
| Increase `connection_timeout`: 5s → 15s | `pool.rs:34` | Gives adequate time for fresh connection creation without noticeable user delay |
| Add `PRAGMA wal_checkpoint(TRUNCATE)` | `pool.rs:184-189` | Ensures a clean WAL state after long idle periods |
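The core of the health check is validating a connection at checkout time before handing it out. The sketch below is a stdlib-only mock of that idea: `Connection`, `execute_select_1`, and the free function `on_acquire` are stand-ins, not the real rusqlite/r2d2 types (the actual implementation lives in a `CustomizeConnection` hook in `pool.rs`).

```rust
// Stdlib-only sketch of health validation at checkout time.
struct Connection {
    alive: bool,
}

impl Connection {
    /// Stand-in for running `SELECT 1`: succeeds only if the connection is live.
    fn execute_select_1(&self) -> Result<(), String> {
        if self.alive {
            Ok(())
        } else {
            Err("stale connection".to_string())
        }
    }
}

/// Mirrors the `on_acquire` hook: validate before the pool hands the
/// connection to the caller, so staleness surfaces here, not later.
fn on_acquire(conn: &Connection) -> Result<(), String> {
    conn.execute_select_1()
}

fn main() {
    let fresh = Connection { alive: true };
    let stale = Connection { alive: false };
    assert!(on_acquire(&fresh).is_ok());
    // A stale connection is rejected at checkout instead of failing mid-query.
    assert!(on_acquire(&stale).is_err());
    println!("health check sketch ok");
}
```

When `on_acquire` returns an error, r2d2 discards that connection and tries another, which is what turns a would-be query failure into a transparent retry.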

Phase 2: Pool Self-Healing (Safety Net)

| Change | File | Rationale |
| --- | --- | --- |
| Add `recreate_pool()` method | `pool.rs:112-132` | Rebuilds the pool from scratch as a last-resort recovery mechanism |
| Self-healing `get_connection()` | `pool.rs:78-105` | After all retries fail, attempts pool recreation before one final checkout |
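The retry-then-recreate flow can be sketched with stdlib types only. `Pool`, `checkout`, and the retry count below are mock stand-ins chosen for illustration; the real `get_connection()` in `pool.rs` wraps an r2d2 pool and returns real connections.

```rust
use std::sync::Mutex;

// Mock pool: either healthy (checkout succeeds) or stale (checkout times out).
struct Pool {
    healthy: bool,
}

impl Pool {
    fn checkout(&self) -> Result<&'static str, &'static str> {
        if self.healthy {
            Ok("conn")
        } else {
            Err("timed out waiting for connection")
        }
    }
}

struct DatabasePool {
    // Wrapping the pool in a Mutex is what makes swapping it out safe.
    pool: Mutex<Pool>,
}

impl DatabasePool {
    /// Last-resort recovery: replace the pool with a fresh one.
    fn recreate_pool(&self) {
        *self.pool.lock().unwrap() = Pool { healthy: true };
    }

    fn get_connection(&self) -> Result<&'static str, &'static str> {
        const RETRIES: usize = 3; // illustrative retry count
        for _ in 0..RETRIES {
            if let Ok(conn) = self.pool.lock().unwrap().checkout() {
                return Ok(conn);
            }
        }
        // All retries failed: rebuild the pool, then make one final attempt.
        self.recreate_pool();
        self.pool.lock().unwrap().checkout()
    }
}

fn main() {
    // Simulate a pool gone fully stale after hours of idle.
    let db = DatabasePool { pool: Mutex::new(Pool { healthy: false }) };
    assert_eq!(db.get_connection(), Ok("conn"));
    println!("self-healing sketch ok");
}
```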

Key Implementation Details

  • DatabasePool now wraps the pool in Mutex<DbPool> and stores PoolConfig to enable safe pool recreation
  • Pool recreation reuses build_pool() which re-runs migrations, ensuring a fully initialized fresh pool
  • SELECT 1 on SQLite is sub-millisecond — the latency trade-off is negligible for guaranteed connection validity
  • The Mutex is only held during connection checkout, not during the entire operation, so concurrent access is not blocked
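The last point — the lock covers checkout only, not the whole database operation — comes down to guard scope. A minimal stdlib sketch (the `Vec<u32>` pool and integer "connections" are mock stand-ins):

```rust
use std::sync::Mutex;

struct DatabasePool {
    pool: Mutex<Vec<u32>>,
}

impl DatabasePool {
    fn get_connection(&self) -> Option<u32> {
        // The MutexGuard is a temporary dropped at the end of this statement,
        // so the lock is held only for the checkout itself.
        self.pool.lock().unwrap().pop()
    }
}

fn main() {
    let db = DatabasePool { pool: Mutex::new(vec![1, 2]) };
    let first = db.get_connection().unwrap();
    // The pool lock is already released here: a concurrent caller can
    // check out while `first` is still in use.
    let second = db.get_connection().unwrap();
    assert_eq!((first, second), (2, 1));
}
```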

Testing

All 295 tests pass (including 5 new tests):

```shell
# Run all tests
cargo test -p forge_repo

# Run just the new pool tests
cargo test -p forge_repo -- pool::tests
```

New Tests Added

| Test | What It Verifies |
| --- | --- |
| `test_idle_eviction_recovery` | A pool with a 100ms idle timeout recovers after eviction by creating fresh connections |
| `test_health_check_on_acquire` | 5 consecutive `get_connection()` calls all succeed with `SELECT 1` validation |
| `test_pool_config_defaults` | Asserts the new defaults: `min_idle: None`, `connection_timeout: 15s` |
| `test_pool_recreation_after_simulated_failure` | `recreate_pool()` works correctly on a file-based database |
| `test_wal_checkpoint_on_acquire` | `PRAGMA wal_checkpoint(TRUNCATE)` doesn't error on in-memory databases |

Verification

  • cargo check -p forge_repo — passed
  • cargo clippy -p forge_repo -- -D warnings — 0 warnings
  • cargo test -p forge_repo — 295 passed, 0 failed
  • cargo insta test -p forge_repo --accept — 295 passed, no snapshot changes

Risks and Mitigations

| Risk | Mitigation |
| --- | --- |
| Health check adds latency on every checkout | `SELECT 1` on SQLite is sub-millisecond |
| Removing `min_idle` makes the first query after idle slightly slower | SQLite connection creation is fast (< 50ms), far better than a 5s timeout failure |
| Pool recreation could lose in-flight operations | Recreation only happens after all retries fail, meaning no operations are in-flight |
| WAL checkpoint could block if another process holds the lock | `busy_timeout = 30000` ensures SQLite waits up to 30s for locks |

- Add SELECT 1 health check in SqliteCustomizer::on_acquire to catch stale connections
- Remove min_idle: Some(1) to avoid keeping stale connections alive during idle
- Increase connection_timeout from 5s to 15s for adequate fresh connection creation
- Add PRAGMA wal_checkpoint(TRUNCATE) to ensure clean WAL state after idle
- Add recreate_pool() method for last-resort pool recovery
- Modify get_connection() to attempt pool recreation after all retries fail
- Add 5 new tests covering idle eviction, health check, pool recreation, and WAL checkpoint

Co-Authored-By: ForgeCode <noreply@forgecode.dev>
@CLAassistant

CLAassistant commented Apr 24, 2026

CLA assistant check
All committers have signed the CLA.

@github-actions github-actions Bot added the type: fix Iterations on existing features or infrastructure. label Apr 24, 2026
@amitksingh1490
Contributor

@hedypamungkas This is causing significant performance degradation in bootup, can you please check?

@github-actions

Action required: PR inactive for 5 days.
Status update or closure in 10 days.

@github-actions github-actions Bot added the state: inactive No current action needed/possible; issue fixed, out of scope, or superseded. label May 10, 2026
@github-actions github-actions Bot removed the state: inactive No current action needed/possible; issue fixed, out of scope, or superseded. label May 10, 2026
@laststylebender14 laststylebender14 self-assigned this May 11, 2026