feat: add Daytona sandbox provider as self-hosted Modal alternative #361
raghavpillai wants to merge 9 commits into ColeMurray:main
Conversation
Add packages/daytona-infra — a self-hosted FastAPI backend that implements the same sandbox HTTP API surface as modal-infra using the Daytona SDK for container management and S3/MinIO for snapshots. Includes: sandbox lifecycle (create/warm/snapshot/restore), repo image builds, GitHub App token generation, OpenCode supervisor entrypoint, and WebSocket bridge.
Adds the client and provider classes that let the control plane communicate with the Daytona infra API using the same interface as the existing Modal provider.
- SessionDO prefers Daytona when DAYTONA_API_URL + DAYTONA_API_SECRET are set, falls back to Modal otherwise
- Repo image builds use the same provider selection logic
- Add daytona_api_url / daytona_api_secret terraform variables
- Thread DAYTONA_API_URL and DAYTONA_API_SECRET into CF Worker env
Greptile Summary

Adds Daytona as a self-hosted alternative to Modal for sandbox management, including a new packages/daytona-infra backend and matching control-plane provider.
Prompt To Fix All With AI

This is a comment left during a code review.
Path: packages/daytona-infra/src/web_api.py
Line: 174
Comment:
**Response key mismatch causes `providerObjectId` to always be `undefined`**
The Python API returns `modal_object_id` (line 174) for backward compatibility, but `DaytonaClient.createSandbox()` in `daytona-client.ts:168,181` parses the response expecting `provider_object_id`. Since the Daytona client is the only consumer of this Daytona API, it will never find `provider_object_id` in the response and `modalObjectId` will always be `undefined`.
This means the control plane will never have a valid `providerObjectId` for Daytona sandboxes, which will break snapshot and restore operations that need the provider's internal sandbox ID.
The same mismatch exists in `api_restore_sandbox` at line 403.
Either the Python response keys should use `provider_object_id`, or the TypeScript client should read `modal_object_id`. Since this is the Daytona-specific path, the cleaner fix is:
```suggestion
"provider_object_id": handle.provider_object_id,
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/daytona-infra/src/web_api.py
Line: 403
Comment:
**Same response key mismatch as `create-sandbox`**
Same issue as above — `daytona-client.ts:256` reads `provider_object_id` from the response, but this returns `modal_object_id`. The restore result's `modalObjectId` will always be `undefined`.
```suggestion
"provider_object_id": handle.provider_object_id, # Daytona client expects this key
```
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/daytona-infra/src/web_api.py
Line: 633-647
Comment:
**`api_delete_provider_image` does not actually delete the S3 object**
This endpoint logs the request and returns `"deleted": True`, but never calls `s3.delete_object()`. The control plane calls this to clean up replaced provider images after a new build succeeds. Since it silently claims success without doing anything, S3 will accumulate stale snapshot tarballs indefinitely.
The actual S3 deletion should be implemented here (get the S3 client and bucket from `request.app.state` and call `delete_object`).
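A minimal sketch of that fix, assuming the S3 client and bucket name live at `request.app.state.s3` and `request.app.state.s3_bucket` (attribute names and the object-key scheme are assumptions; match them to wherever the snapshot tarball was uploaded):

```python
# Hedged sketch: actually delete the snapshot object before reporting success.
def delete_provider_image(request, provider_image_id: str) -> dict:
    s3 = request.app.state.s3            # boto3-style S3 client from app state
    bucket = request.app.state.s3_bucket
    key = f"snapshots/{provider_image_id}.tar.gz"  # hypothetical key scheme
    s3.delete_object(Bucket=bucket, Key=key)
    return {
        "success": True,
        "data": {"provider_image_id": provider_image_id, "deleted": True},
    }
```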
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/daytona-infra/src/web_api.py
Line: 752
Comment:
**TLS verification disabled on GHES proxy**
`verify=False` disables TLS certificate verification for all requests proxied to the GHES instance. This makes the connection vulnerable to man-in-the-middle attacks. The existing `modal-infra` package does not use `verify=False` anywhere.
If the GHES instance uses a self-signed or internal CA certificate, the preferred approach is to set the `SSL_CERT_FILE` or `REQUESTS_CA_BUNDLE` environment variable to point to the CA bundle, or pass the CA path via `verify="/path/to/ca-bundle.crt"`.
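One way to sketch the CA-bundle approach, assuming a `GHES_CA_BUNDLE` environment variable (the call-site comment is an assumption about where the proxy builds its client):

```python
import os

# Hedged sketch: compute the httpx `verify` argument from an optional
# GHES_CA_BUNDLE env var instead of hard-coding verify=False. Returns the
# CA bundle path if set, else True (system trust store); never False.
def ghes_verify():
    ca_bundle = os.environ.get("GHES_CA_BUNDLE")
    return ca_bundle if ca_bundle else True

# Assumed call site in the proxy handler:
#   async with httpx.AsyncClient(verify=ghes_verify(), timeout=30.0) as client:
#       ...
```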
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/daytona-infra/src/web_api.py
Line: 179-183
Comment:
**Error responses return HTTP 200 with `success: false`**
When `api_create_sandbox` (and several other endpoints) catch a general `Exception`, they return an HTTP 200 with `{"success": False, "error": "..."}`. The `DaytonaClient` in the control plane checks `response.ok` first (line 161 of `daytona-client.ts`) and only falls through to parse the JSON body if the status is 2xx. This means a server-side error here will be parsed as a "successful" response with `success=false`, which is handled — but the `DaytonaApiError` with its status-code-based transient/permanent classification in the provider won't fire. All errors from these endpoints will be classified as generic permanent errors, even if the root cause is transient (e.g., Daytona SDK timeout).
Consider returning a proper HTTP 500 status instead:
```python
from fastapi.responses import JSONResponse
return JSONResponse(status_code=500, content={"success": False, "error": str(e)})
```
This would let the control plane's circuit breaker correctly classify the error.
How can I resolve this? If you propose a fix, please make it concise.
---
This is a comment left during a code review.
Path: packages/daytona-infra/src/sandbox/manager.py
Line: 188-203
Comment:
**Synchronous `subprocess.run` blocks the async event loop**
`subprocess.run()` is a blocking call. In an async FastAPI handler, this will block the entire event loop for up to 30 seconds (the timeout), preventing other requests from being served. The rest of the manager correctly uses `async`/`await` patterns.
Consider using `asyncio.create_subprocess_exec` instead, consistent with how the entrypoint and bridge modules handle subprocess calls:
```python
proc = await asyncio.create_subprocess_exec(
"docker", "exec", "-d",
"-e", "PYTHONPATH=/app",
sandbox.id,
"python", "-m", "sandbox.entrypoint",
stdout=asyncio.subprocess.PIPE,
stderr=asyncio.subprocess.PIPE,
)
stdout, stderr = await asyncio.wait_for(proc.communicate(), timeout=30)
```
How can I resolve this? If you propose a fix, please make it concise.

Last reviewed commit: 143e7da
- Fix DaytonaClient reading provider_object_id instead of modal_object_id (field name mismatch — server sends modal_object_id for API compat)
- Fix getLatestSnapshot snake_case to camelCase field mapping
- Throw DaytonaApiError on success:false responses for correct circuit breaker error classification (was generic Error → always permanent)
- Add require_auth() to /ghes-proxy endpoint
- Add path allowlist to /ghes-proxy (only login/oauth/ and api/v3/)
- Replace verify=False with configurable GHES_CA_BUNDLE
- Implement actual S3 delete in delete-provider-image (was a no-op)
- Pass explicit api_base to generate_installation_token
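The /ghes-proxy path allowlist mentioned in the commit message could look roughly like this sketch; the two prefixes come from the commit message, while the helper name and normalization are hypothetical:

```python
# Hedged sketch: only forward proxy requests whose path falls under the
# allowed GHES prefixes (OAuth login flow and the v3 REST API).
GHES_ALLOWED_PREFIXES = ("login/oauth/", "api/v3/")

def is_allowed_ghes_path(path: str) -> bool:
    normalized = path.lstrip("/")  # tolerate a leading slash from the client
    return normalized.startswith(GHES_ALLOWED_PREFIXES)
```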
```python
return f"https://{host}/api/v3"

def generate_jwt(app_id: str, private_key: str) -> str:
```
we should likely introduce a shared sandbox-infra package to avoid duplicating this in the N many providers
```python
EXPIRED = "expired"

class Repository(BaseModel):
```
ditto with this, it appears duplicated from the modal infra
- Rename API response key modal_object_id to provider_object_id across services
- Throw DaytonaApiError on API failures and missing data instead of returning error objects in client
- Convert infra sandbox entrypoint subprocess call to asyncio and capture stdout/stderr
- Use JSONResponse for error responses in web_api endpoints
- Remove auth requirement from GHES proxy and stop stripping Authorization header
Addresses @ColeMurray's review feedback to avoid duplicating auth, types, models, and logging across sandbox providers. New package: packages/sandbox-shared

- sandbox_shared.auth.github_app — GitHub App JWT + installation tokens
- sandbox_shared.auth.internal — HMAC token generation/verification (parameterized env var name instead of hardcoded MODAL/DAYTONA)
- sandbox_shared.sandbox.types — SandboxStatus, SessionConfig, event models
- sandbox_shared.sandbox.log_config — structured JSON logging (parameterized service name via set_default_service())
- sandbox_shared.registry.models — Snapshot, Repository, SnapshotMetadata
- sandbox_shared.sandbox.bridge — agent bridge (WebSocket, PR creation)
- sandbox_shared.sandbox.tools/ — OpenCode custom tool scripts

daytona-infra now imports from sandbox-shared via thin re-export shims. modal-infra is untouched for now (migration is a follow-up to avoid breaking the Modal deployment pipeline).
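A rough sketch of what a parameterized helper like sandbox_shared.auth.internal might look like. Only the idea of passing the env var name (instead of hardcoding MODAL/DAYTONA prefixes) comes from the commit message; the token format (hex SHA-256 HMAC) and function names here are assumptions:

```python
import hashlib
import hmac
import os

# Hedged sketch: HMAC token helpers that take the secret's env var name as a
# parameter, so the same code serves both Modal and Daytona deployments.
def sign(message: str, secret_env: str) -> str:
    secret = os.environ[secret_env].encode()
    return hmac.new(secret, message.encode(), hashlib.sha256).hexdigest()

def verify(message: str, token: str, secret_env: str) -> bool:
    # Constant-time comparison to avoid timing side channels.
    return hmac.compare_digest(sign(message, secret_env), token)
```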
…red) sandbox-shared was copied from modal-infra which doesn't have resolve_api_base (that's a GHES addition). The re-export shim was trying to import it from sandbox-shared, causing ImportError at startup. Move it back to a local definition in daytona-infra.
Move resolve_api_base into sandbox-shared so daytona-infra imports it cleanly. No more local copies — all GitHub App auth lives in sandbox-shared.
@raghavpillai I've added the shared sandbox-runtime, which should simplify a lot of this PR.
ColeMurray
left a comment
rebase with sandbox-runtime and remove duplicates
Summary
Closes #359
- packages/daytona-infra — a FastAPI backend that implements the same sandbox HTTP API as modal-infra, using the Daytona SDK for container management and S3/MinIO for snapshots
- DaytonaClient and DaytonaSandboxProvider added to the control plane, matching the existing ModalClient / ModalSandboxProvider interfaces
- SessionDO and repo-image routes pick the provider at runtime: set DAYTONA_API_URL + DAYTONA_API_SECRET for Daytona, otherwise falls back to Modal
- New terraform variables daytona_api_url and daytona_api_secret, wired into control plane worker bindings

packages/daytona-infra
- Implements the same sandbox HTTP API surface as modal-infra
- Dockerfile (API server) and Dockerfile.sandbox (dev environment base image with Node 22, Python 3.12, OpenCode, Playwright)

Control plane
- DaytonaClient (sandbox/daytona-client.ts) — HTTP client with HMAC auth, request logging, error classification
- DaytonaSandboxProvider (sandbox/providers/daytona-provider.ts) — SandboxProvider implementation with transient/permanent error classification for circuit breaker
- Provider selection in SessionDO.createLifecycleManager() and repo-images.ts

No changes to existing Modal behavior — if DAYTONA_API_URL is unset, everything works exactly as before.

Review feedback addressed
- Renamed modal_object_id to provider_object_id in web API responses and updated DaytonaClient to read the new key
- Implemented the actual S3 delete in /api/delete-provider-image
- GHES_CA_BUNDLE env var for TLS verification, path allowlist (login/oauth/, api/v3/)
- Proper error statuses instead of HTTP 200 with success: false
- Replaced subprocess.run with asyncio.create_subprocess_exec in manager.py
- Throw DaytonaApiError on success: false responses instead of returning error objects