
feat: add Daytona sandbox provider as self-hosted Modal alternative#361

Open
raghavpillai wants to merge 9 commits into ColeMurray:main from raghavpillai:feat/daytona-sandbox-provider

Conversation

@raghavpillai raghavpillai commented Mar 16, 2026

Summary

Closes #359

  • Adds packages/daytona-infra — a FastAPI backend that implements the same sandbox HTTP API as modal-infra using the Daytona SDK for container management and S3/MinIO for snapshots
  • Adds DaytonaClient and DaytonaSandboxProvider to the control plane, matching the existing ModalClient / ModalSandboxProvider interfaces
  • SessionDO and the repo-image routes pick the provider at runtime: set DAYTONA_API_URL + DAYTONA_API_SECRET to use Daytona; otherwise they fall back to Modal
  • Terraform variables for daytona_api_url and daytona_api_secret, wired into control plane worker bindings

packages/daytona-infra

  • Sandbox lifecycle: create, warm, snapshot (workspace tarball to S3), restore from snapshot
  • Repo image builds: async build with callback to control plane on completion
  • Auth: GitHub App token generation, HMAC-signed internal tokens
  • Supervisor: OpenCode server + WebSocket bridge process management with crash recovery (ported from modal-infra)
  • Dockerfiles: Dockerfile (API server) and Dockerfile.sandbox (dev environment base image with Node 22, Python 3.12, OpenCode, Playwright)

Control plane

  • DaytonaClient (sandbox/daytona-client.ts) — HTTP client with HMAC auth, request logging, error classification
  • DaytonaSandboxProvider (sandbox/providers/daytona-provider.ts) — SandboxProvider implementation with transient/permanent error classification for circuit breaker
  • Provider selection in SessionDO.createLifecycleManager() and repo-images.ts

No changes to existing Modal behavior — if DAYTONA_API_URL is unset, everything works exactly as before.

Review feedback addressed

  • Renamed modal_object_id to provider_object_id in web API responses and updated DaytonaClient to read the new key
  • Implemented actual S3 delete in /api/delete-provider-image
  • GHES proxy: removed HMAC auth requirement (OAuth paths need to forward GitHub tokens), forwards Authorization header, uses GHES_CA_BUNDLE env var for TLS verification, path allowlist (login/oauth/, api/v3/)
  • Error responses now return HTTP 500 via JSONResponse instead of HTTP 200 with success: false
  • Replaced subprocess.run with asyncio.create_subprocess_exec in manager.py
  • DaytonaClient now throws DaytonaApiError on success: false responses instead of returning error objects

Add packages/daytona-infra — a self-hosted FastAPI backend that
implements the same sandbox HTTP API surface as modal-infra using
the Daytona SDK for container management and S3/MinIO for snapshots.

Includes: sandbox lifecycle (create/warm/snapshot/restore), repo
image builds, GitHub App token generation, OpenCode supervisor
entrypoint, and WebSocket bridge.

Adds the client and provider classes that let the control plane
communicate with the Daytona infra API using the same interface
as the existing Modal provider.

- SessionDO prefers Daytona when DAYTONA_API_URL + DAYTONA_API_SECRET
  are set, falls back to Modal otherwise
- Repo image builds use the same provider selection logic

- Add daytona_api_url / daytona_api_secret terraform variables
- Thread DAYTONA_API_URL and DAYTONA_API_SECRET into CF Worker env

greptile-apps bot commented Mar 16, 2026

Greptile Summary

Adds Daytona as a self-hosted alternative to Modal for sandbox management, including a new packages/daytona-infra FastAPI service and corresponding DaytonaClient/DaytonaSandboxProvider in the control plane. The control plane cleanly selects the provider at runtime based on environment variables, and existing Modal behavior is unaffected.

  • Critical: Response key mismatch between the Daytona API server and client — web_api.py returns modal_object_id in both create-sandbox and restore-sandbox responses, but daytona-client.ts reads provider_object_id. This causes providerObjectId to always be undefined, breaking snapshot and restore operations that depend on the provider's internal sandbox ID.
  • api_delete_provider_image is a no-op — it logs the request and returns deleted: True without actually deleting the S3 object, leading to unbounded storage growth.
  • Error responses in web_api.py return HTTP 200 with success: false, which prevents the control plane's circuit breaker from correctly classifying transient vs. permanent errors via status codes.
  • The GHES reverse proxy disables TLS certificate verification (verify=False), unlike the existing modal-infra codebase.
  • manager.py uses synchronous subprocess.run() inside an async handler to start the sandbox entrypoint, which will block the event loop.
Last reviewed commit: 143e7da

```python
                "success": True,
                "data": {
                    "sandbox_id": handle.sandbox_id,
                    "modal_object_id": handle.provider_object_id,  # Keep same key for API compat
```

Response key mismatch causes providerObjectId to always be undefined

The Python API returns modal_object_id (line 174) for backward compatibility, but DaytonaClient.createSandbox() in daytona-client.ts:168,181 parses the response expecting provider_object_id. Since the Daytona client is the only consumer of this Daytona API, it will never find provider_object_id in the response and modalObjectId will always be undefined.

This means the control plane will never have a valid providerObjectId for Daytona sandboxes, which will break snapshot and restore operations that need the provider's internal sandbox ID.

The same mismatch exists in api_restore_sandbox at line 403.

Either the Python response keys should use provider_object_id, or the TypeScript client should read modal_object_id. Since this is the Daytona-specific path, the cleaner fix is:

Suggested change:

```diff
- "modal_object_id": handle.provider_object_id,  # Keep same key for API compat
+ "provider_object_id": handle.provider_object_id,
```

```python
                "success": True,
                "data": {
                    "sandbox_id": handle.sandbox_id,
                    "modal_object_id": handle.provider_object_id,  # Keep same key for API compat
```

Same response key mismatch as create-sandbox

Same issue as above — daytona-client.ts:256 reads provider_object_id from the response, but this returns modal_object_id. The restore result's modalObjectId will always be undefined.

Suggested change:

```diff
- "modal_object_id": handle.provider_object_id,  # Keep same key for API compat
+ "provider_object_id": handle.provider_object_id,  # Daytona client expects this key
```

Comment on lines +633 to +647
```python
    try:
        # For S3-based snapshots, we can delete the object.
        # Log the request for auditability regardless.
        log.info(
            "image.delete_requested",
            provider_image_id=provider_image_id,
        )

        return {
            "success": True,
            "data": {
                "provider_image_id": provider_image_id,
                "deleted": True,
            },
        }
```

api_delete_provider_image does not actually delete the S3 object

This endpoint logs the request and returns "deleted": True, but never calls s3.delete_object(). The control plane calls this to clean up replaced provider images after a new build succeeds. Since it silently claims success without doing anything, S3 will accumulate stale snapshot tarballs indefinitely.

The actual S3 deletion should be implemented here (get the S3 client and bucket from request.app.state and call delete_object).
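A hedged sketch of what the fix could look like. The key layout (`snapshots/{id}.tar.gz`) and the helper name are assumptions for illustration; the real code would pull the S3 client and bucket from `request.app.state` as the comment suggests, and with boto3 the call is `s3_client.delete_object(Bucket=..., Key=...)`:

```python
def delete_provider_image(s3_client, bucket: str, provider_image_id: str) -> bool:
    """Delete the snapshot tarball for a provider image; return False on failure."""
    key = f"snapshots/{provider_image_id}.tar.gz"  # assumed key layout
    try:
        # boto3-style client: delete_object is idempotent, so deleting an
        # already-gone key still succeeds.
        s3_client.delete_object(Bucket=bucket, Key=key)
        return True
    except Exception:
        return False
```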


```python
    }
    headers["Host"] = ghes_hostname

    async with httpx.AsyncClient(verify=False, timeout=30.0) as client:
```

TLS verification disabled on GHES proxy

verify=False disables TLS certificate verification for all requests proxied to the GHES instance. This makes the connection vulnerable to man-in-the-middle attacks. The existing modal-infra package does not use verify=False anywhere.

If the GHES instance uses a self-signed or internal CA certificate, the preferred approach is to set the SSL_CERT_FILE or REQUESTS_CA_BUNDLE environment variable to point to the CA bundle, or pass the CA path via verify="/path/to/ca-bundle.crt".


Comment on lines +179 to +183
```python
    except Exception as e:
        outcome = "error"
        http_status = 500
        log.error("api.error", exc=e, endpoint_name="api_create_sandbox")
        return {"success": False, "error": str(e)}
```

Error responses return HTTP 200 with success: false

When api_create_sandbox (and several other endpoints) catch a general Exception, they return an HTTP 200 with {"success": False, "error": "..."}. The DaytonaClient in the control plane checks response.ok first (line 161 of daytona-client.ts) and only falls through to parse the JSON body if the status is 2xx. This means a server-side error here will be parsed as a "successful" response with success=false, which is handled — but the DaytonaApiError with its status-code-based transient/permanent classification in the provider won't fire. All errors from these endpoints will be classified as generic permanent errors, even if the root cause is transient (e.g., Daytona SDK timeout).

Consider returning a proper HTTP 500 status instead:

```python
from fastapi.responses import JSONResponse
return JSONResponse(status_code=500, content={"success": False, "error": str(e)})
```

This would let the control plane's circuit breaker correctly classify the error.
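To make the stakes concrete, a status-code classifier along the lines this comment describes might look like the sketch below. The name and exact thresholds are illustrative, not the actual DaytonaApiError logic — the point is that a 200-with-`success: false` body never reaches this path at all:

```python
def classify_error(status_code: int) -> str:
    """Classify an HTTP failure for circuit-breaker purposes."""
    if status_code in (408, 429) or 500 <= status_code < 600:
        return "transient"   # timeouts, rate limits, server faults: retry may help
    if 400 <= status_code < 500:
        return "permanent"   # client error: retrying will not help
    return "permanent"       # 2xx body-level failures lose the status signal
```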


Comment on lines +188 to +203
```python
        import subprocess
        result = subprocess.run(
            [
                "docker", "exec", "-d",
                "-e", "PYTHONPATH=/app",
                sandbox.id,
                "python", "-m", "sandbox.entrypoint",
            ],
            capture_output=True, text=True, timeout=30,
        )
        if result.returncode == 0:
            log.info("sandbox.entrypoint_started", sandbox_id=sandbox_id, provider_id=sandbox.id)
        else:
            log.warning("sandbox.entrypoint_start_failed", sandbox_id=sandbox_id, stderr=result.stderr[:500])
    except Exception as e:
        log.warning("sandbox.entrypoint_start_error", sandbox_id=sandbox_id, error=str(e))
```

Synchronous subprocess.run blocks the async event loop

subprocess.run() is a blocking call. In an async FastAPI handler, this will block the entire event loop for up to 30 seconds (the timeout), preventing other requests from being served. The rest of the manager correctly uses async/await patterns.

Consider using asyncio.create_subprocess_exec instead, consistent with how the entrypoint and bridge modules handle subprocess calls:

```python
proc = await asyncio.create_subprocess_exec(
    "docker", "exec", "-d",
    "-e", "PYTHONPATH=/app",
    sandbox.id,
    "python", "-m", "sandbox.entrypoint",
    stdout=asyncio.subprocess.PIPE,
    stderr=asyncio.subprocess.PIPE,
)
stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=30)
```

- Fix DaytonaClient reading provider_object_id instead of modal_object_id
  (field name mismatch — server sends modal_object_id for API compat)
- Fix getLatestSnapshot snake_case to camelCase field mapping
- Throw DaytonaApiError on success:false responses for correct circuit
  breaker error classification (was generic Error → always permanent)
- Add require_auth() to /ghes-proxy endpoint
- Add path allowlist to /ghes-proxy (only login/oauth/ and api/v3/)
- Replace verify=False with configurable GHES_CA_BUNDLE
- Implement actual S3 delete in delete-provider-image (was a no-op)
- Pass explicit api_base to generate_installation_token
```python
    return f"https://{host}/api/v3"


def generate_jwt(app_id: str, private_key: str) -> str:
```
ColeMurray (Owner):

we should likely introduce a shared sandbox-infra package to avoid duplicating this in the N many providers

```python
    EXPIRED = "expired"


class Repository(BaseModel):
```
ColeMurray (Owner):

ditto with this, it appears duplicated from the modal infra

- Rename API response key modal_object_id to provider_object_id across services
- Throw DaytonaApiError on API failures and missing data instead of returning error objects in client
- Convert infra sandbox entrypoint subprocess call to asyncio and capture stdout/stderr
- Use JSONResponse for error responses in web_api endpoints
- Remove auth requirement from GHES proxy and stop stripping Authorization header
Addresses @ColeMurray's review feedback to avoid duplicating auth,
types, models, and logging across sandbox providers.

New package: packages/sandbox-shared
- sandbox_shared.auth.github_app — GitHub App JWT + installation tokens
- sandbox_shared.auth.internal — HMAC token generation/verification
  (parameterized env var name instead of hardcoded MODAL/DAYTONA)
- sandbox_shared.sandbox.types — SandboxStatus, SessionConfig, event models
- sandbox_shared.sandbox.log_config — structured JSON logging
  (parameterized service name via set_default_service())
- sandbox_shared.registry.models — Snapshot, Repository, SnapshotMetadata
- sandbox_shared.sandbox.bridge — agent bridge (WebSocket, PR creation)
- sandbox_shared.sandbox.tools/ — OpenCode custom tool scripts

daytona-infra now imports from sandbox-shared via thin re-export
shims. modal-infra is untouched for now (migration is a follow-up
to avoid breaking the Modal deployment pipeline).
…red)

sandbox-shared was copied from modal-infra which doesn't have
resolve_api_base (that's a GHES addition). The re-export shim
was trying to import it from sandbox-shared, causing ImportError
at startup. Move it back to a local definition in daytona-infra.
Move resolve_api_base into sandbox-shared so daytona-infra imports
it cleanly. No more local copies — all GitHub App auth lives in
sandbox-shared.
@ColeMurray (Owner) commented:

@raghavpillai i've added the shared sandbox-runtime which should simplify a lot of this PR:
https://github.com/ColeMurray/background-agents/tree/main/packages/sandbox-runtime

@ColeMurray (Owner) left a review:


rebase with sandbox-runtime and remove duplicates


Development

Successfully merging this pull request may close these issues.

feat: Daytona sandbox provider (self-hosted alternative to Modal)
