feat: GPU Container Batch Mode & E2E Testing (#15)
cicorias merged 7 commits into Azure-Samples:main
Conversation
- job.rs: Use temp_folder/colmap/ and temp_folder/images/ instead of temp_folder/reconstruction/ and temp_folder/frames/ so the gsplat backend natively finds colmap/sparse/0 without fallback path searching. Read frame paths from the images/ dir to ensure correct workspace derivation.
- compose.yaml: Mount container-test/config.1.yaml for local testing
- container-test/config.1.yaml: Test config with default training params
- run_gpu_local_01.sh: Original GPU local run script (south_building)
- run_gpu_local_02.sh: Simplified GPU local run script (my_scene)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from `docker compose up` (watch mode, runs forever) to `docker-compose run --rm` (batch mode, processes once and exits). Add RUN_MODE=batch and BATCH_INPUT_PREFIX env vars so the container downloads from the local input dir, processes, and exits cleanly.

- run_gpu_local_01.sh: south_building scene
- run_gpu_local_02.sh: my_scene

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
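To illustrate the batch/watch split, here is a minimal sketch of an entrypoint branching on these variables. Only the RUN_MODE and BATCH_INPUT_PREFIX names come from this commit; the dispatch logic and messages are illustrative, not the actual container entrypoint.

```shell
# Hypothetical entrypoint dispatch; the real entrypoint is not shown in this PR.
RUN_MODE=batch
BATCH_INPUT_PREFIX=input/my_scene

if [ "$RUN_MODE" = "batch" ]; then
  # Batch mode: download from the input prefix, process once, then exit.
  MODE_MSG="batch: processing $BATCH_INPUT_PREFIX once, then exiting"
else
  # Watch mode: keep polling for new uploads (the old 'docker compose up' behavior).
  MODE_MSG="watch: polling for new uploads"
fi
echo "$MODE_MSG"
```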
Standalone script that runs the full GPU pipeline in a Docker container against Azurite (no compose). Steps:

1. Download test data (South Building)
2. Build GPU Docker image
3. Start Azurite via docker run on a dedicated network
4. Upload test videos via azurite_helper.py
5. Run GPU container in batch mode (gsplat on NVIDIA GPU)
6. Verify outputs (PLY, SPLAT, manifest) in Azurite
7. Cleanup (network, containers)

Usage: ./scripts/e2e/04-run-e2e_gpu_container.sh [--skip-build]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- --keep-azurite: leave Azurite + network running after the script exits
- --reuse-azurite: skip Azurite startup and connect to an existing container from a prior run (detected by container name)

Useful for iterating on the GPU pipeline without waiting for Azurite setup/teardown on each run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the processor’s temporary workspace layout to match gsplat expectations and adds scripts/configuration to run GPU batch-mode E2E tests against Azurite.
Changes:
- Updated job pipeline temp directory layout (frames/ → images/, reconstruction/ → colmap/) and adjusted training frame discovery accordingly.
- Added a standalone GPU container E2E script that runs the full pipeline against Azurite without docker-compose.
- Updated local GPU run scripts and compose.yaml to include Azurite and revised config mounts; added an .acrignore for ACR build context trimming.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/processor/job.rs | Renames temp subdirectories and changes training frame enumeration to read from temp_folder/images for gsplat workspace derivation. |
| scripts/e2e/04-run-e2e_gpu_container.sh | New end-to-end GPU container test script orchestrating Azurite + batch-mode run + verification. |
| run_gpu_local_01.sh | New local GPU batch-mode helper script for south_building. |
| run_gpu_local_02.sh | New local GPU batch-mode helper script for my_scene. |
| run-gpu-local.sh | Adjusts local GPU run script config path handling. |
| compose.yaml | Adds Azurite service + health-gated dependencies; updates config mounts for CPU/GPU services; adds Azurite volume. |
| .acrignore | Excludes build artifacts and dev/test directories from ACR build context uploads. |
```shell
for arg in "$@"; do
  case "$arg" in
    --skip-build) SKIP_BUILD=true ;;
    --keep-azurite) KEEP_AZURITE=true ;;
    --reuse-azurite) REUSE_AZURITE=true ;;
    --timeout) TIMEOUT="$2"; shift ;;
```
The CLI parsing loop is incorrect for flags with values: in `for arg in "$@"`, `--timeout` assigns `TIMEOUT="$2"` (which depends on the current positional parameters, not the current arg), and `shift` does not advance the for iterator. Example: `--skip-build --timeout 600` will set `TIMEOUT` to `--timeout`. Consider switching to a `while [ $# -gt 0 ]; do case "$1" in ... esac; done` style parser and validating that `--timeout` has a value (to avoid `set -u` crashes).
Suggested change:

```shell
while [ "$#" -gt 0 ]; do
  case "$1" in
    --skip-build)
      SKIP_BUILD=true
      shift
      ;;
    --keep-azurite)
      KEEP_AZURITE=true
      shift
      ;;
    --reuse-azurite)
      REUSE_AZURITE=true
      shift
      ;;
    --timeout)
      if [ "$#" -lt 2 ]; then
        echo "Error: --timeout requires a value (in seconds)" >&2
        exit 1
      fi
      TIMEOUT="$2"
      shift 2
      ;;
    *)
      echo "Error: Unknown option: $1" >&2
      exit 1
      ;;
```
```shell
if [ ! -f "container-test/confg.yaml" ]; then
  cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"
```
This script uses container-test/confg.yaml (missing the i) instead of the existing container-test/config.yaml used elsewhere (e.g., README and .github/instructions/docker-batch-e2e.instructions.md). Unless there is a deliberate new config file being introduced, consider switching back to container-test/config.yaml to avoid maintaining two similar config files and to match the documented workflow.
Suggested change:

```shell
if [ ! -f "container-test/config.yaml" ]; then
  cp config.example.yaml container-test/config.yaml
fi
echo " Config: container-test/config.yaml"
```
```shell
if [ ! -f "container-test/confg.yaml" ]; then
  cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"
```
This script uses container-test/confg.yaml instead of the existing container-test/config.yaml used by the rest of the repo/documentation. If this isn't intentional, prefer the standard filename to avoid confusion and config drift.
Suggested change:

```shell
if [ ! -f "container-test/config.yaml" ]; then
  cp config.example.yaml container-test/config.yaml
fi
echo " Config: container-test/config.yaml"
```
```diff
 if [ ! -f "container-test/confg.yaml" ]; then
   cp config.example.yaml container-test/confg.yaml
   echo "✓ Config file created at container-test/confg.yaml"
 else
   echo "✓ Config file already exists"
 fi
-echo " (Edit container-test/config.1.yaml to adjust training parameters)"
+echo " (Edit container-test/confg.yaml to adjust training parameters)"
```
run-gpu-local.sh now creates/uses container-test/confg.yaml, which diverges from the existing container-test/config.yaml referenced by the README and Docker batch E2E instructions. Unless a new config file is being intentionally introduced, consider using the standard container-test/config.yaml filename (or renaming consistently) to avoid breaking documented commands and requiring users to maintain multiple configs.
```diff
 depends_on:
   azurite:
     condition: service_healthy
 volumes:
   - ./output/data:/data
-  - ./container-test/config.yaml:/app/config.yaml:ro
+  - ./container-test/confg.yaml:/app/config.yaml:ro
   - ./output/tmp:/tmp/3dgs-work
```
Compose now mounts ./container-test/confg.yaml into the processor containers, but the repo contains container-test/config.yaml and the documentation/examples consistently refer to that filename. Unless confg.yaml is a new intentional file (not included in this PR), this will break docker compose up out-of-the-box. Consider reverting to container-test/config.yaml (or adding the new file and updating docs consistently).
```diff
 volumes:
   - ./output/data:/data
-  - ./container-test/config.yaml:/app/config.yaml:ro
+  - ./container-test/confg.yaml:/app/config.yaml:ro
   - ./output/tmp:/tmp/3dgs-work
```
Same issue for the GPU service: mounting ./container-test/confg.yaml is inconsistent with the existing container-test/config.yaml used elsewhere, and will fail unless users create a new file with that name. Prefer the existing filename or add the intended config file to the repo and update references consistently.
```rust
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0
let mut frame_paths: Vec<PathBuf> = std::fs::read_dir(&temp_folder.join("images"))
    .context("Failed to read image dir for training")?
    .filter_map(|e| e.ok().map(|e| e.path()))
    .filter(|p| p.extension().and_then(|e| e.to_str()) == Some("jpg"))
    .collect();
frame_paths.sort();
```
Training now reads frames from ${temp_folder}/images via read_dir(), but this directory is only created/populated inside the reconstruction stage. If the job resumes from a checkpoint at ProcessingStage::Training and ${temp_folder}/images is missing/empty, training will fail even though frame_sets are available. Consider ensuring the images dir exists and is populated (e.g., create_dir_all + copy_frames_to_colmap_dir(&frame_sets, &image_dir) when needed) or falling back to frame_sets when read_dir yields no JPGs; also avoid silently skipping read_dir entry errors (filter_map(|e| e.ok())).
Suggested change:

```rust
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0.
// If the directory is missing or empty (e.g., when resuming from a checkpoint),
// fall back to using the already-loaded frame_sets.
let image_dir = temp_folder.join("images");
let mut frame_paths: Vec<PathBuf> = if image_dir.exists() {
    let read_dir = std::fs::read_dir(&image_dir)
        .with_context(|| format!("Failed to read image dir for training: {}", image_dir.display()))?;
    let mut paths = Vec::new();
    for entry_result in read_dir {
        let entry = entry_result
            .with_context(|| format!("Failed to read entry in image dir: {}", image_dir.display()))?;
        let path = entry.path();
        if path.extension().and_then(|e| e.to_str()) == Some("jpg") {
            paths.push(path);
        }
    }
    paths
} else {
    Vec::new()
};
// Fallback: if no JPGs were found in image_dir, derive frame paths from frame_sets.
if frame_paths.is_empty() {
    frame_paths = frame_sets
        .iter()
        .flat_map(|set: &FrameSet| set.frames.iter().map(|frame| frame.path.clone()))
        .collect();
}
frame_paths.sort();
if frame_paths.is_empty() {
    anyhow::bail!("No frames available for training (image dir empty and frame_sets empty)");
}
```
```diff
 // Collect all frame paths from all videos
-let image_dir = temp_folder.join("frames");
+let image_dir = temp_folder.join("images");
 std::fs::create_dir_all(&image_dir).context("Failed to create frames directory")?;
```
The error context string still says "Failed to create frames directory" even though the directory is now images/. Updating the message will make failures easier to diagnose.
Suggested change:

```rust
std::fs::create_dir_all(&image_dir).context("Failed to create images directory")?;
```
```shell
docker run --rm --gpus all \
  --name "$PROCESSOR_CONTAINER" \
  --network "$NETWORK_NAME" \
  -v "$PROJECT_ROOT/container-test/confg.yaml:/app/config.yaml:ro" \
```
The container config mount points at container-test/confg.yaml, but the repo currently contains container-test/config.yaml (and docs reference that filename). As written, the docker run will fail unless the user manually creates confg.yaml. Consider using container-test/config.yaml (or adding/renaming the intended file) and keeping filenames consistent across compose/scripts/docs.
Suggested change:

```shell
-v "$PROJECT_ROOT/container-test/config.yaml:/app/config.yaml:ro" \
```
GPU Container Batch Mode & E2E Testing
Aligns the processor's temp directory layout with gsplat conventions and adds standalone GPU container E2E testing against Azurite.
Changes:
- Pipeline fix — src/processor/job.rs
- GPU container E2E test — scripts/e2e/04-run-e2e_gpu_container.sh
- Batch mode run scripts — run_gpu_local_01.sh, run_gpu_local_02.sh
- Other