
feat: GPU Container Batch Mode & E2E Testing #15

Merged
cicorias merged 7 commits into Azure-Samples:main from cicorias:vishnu-local-clean
Mar 26, 2026

Conversation

@cicorias
Member

GPU Container Batch Mode & E2E Testing

Aligns the processor's temp directory layout with gsplat conventions and adds standalone GPU container E2E testing against Azurite.

Changes

Pipeline fix — src/processor/job.rs

  • Renamed temp directories to match gsplat's expected workspace layout:
      • temp_folder/frames/ → temp_folder/images/
      • temp_folder/reconstruction/ → temp_folder/colmap/
  • Frame paths for training now read from the images/ directory, so the gsplat backend derives the correct workspace and finds colmap/sparse/0 natively, eliminating fallback path searching.
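The layout contract described above can be sketched in shell (paths are illustrative; mktemp stands in for the processor's temp folder):

```shell
# Sketch of the workspace layout gsplat expects (illustrative paths).
# The backend takes the images dir, treats its parent as the workspace,
# and looks for colmap/sparse/0 under that workspace.
temp_folder="$(mktemp -d)"
mkdir -p "$temp_folder/images" "$temp_folder/colmap/sparse/0"

image_dir="$temp_folder/images"
workspace="$(dirname "$image_dir")"   # == $temp_folder

if [ -d "$workspace/colmap/sparse/0" ]; then
  echo "sparse model found natively, no fallback search needed"
fi
```

With the old frames/ and reconstruction/ names, the same parent-directory derivation would not find colmap/sparse/0, which is what forced the fallback path searching.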

GPU container E2E test — scripts/e2e/04-run-e2e_gpu_container.sh

  • Standalone script that runs the full GPU pipeline in a Docker container against Azurite (no compose).
  • Steps: download test data → build GPU image → start Azurite → upload videos → run batch processor → verify outputs → cleanup.
  • Supports --skip-build, --keep-azurite (persist Azurite between runs), and --reuse-azurite (connect to existing Azurite from a prior run).
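The Azurite lifecycle flags boil down to a small decision, sketched here (the function name and printed strings are illustrative, not the script's actual code):

```shell
# Illustrative sketch of how --reuse-azurite / --keep-azurite shape a run.
# azurite_plan prints the startup action, then the teardown action.
azurite_plan() {
  local reuse="$1" keep="$2" already_running="$3"
  if [ "$reuse" = true ] && [ "$already_running" = true ]; then
    echo "startup: connect to existing Azurite container"
  else
    echo "startup: start fresh Azurite on dedicated network"
  fi
  if [ "$keep" = true ]; then
    echo "teardown: leave Azurite + network running"
  else
    echo "teardown: remove containers and network"
  fi
}

azurite_plan true false true   # --reuse-azurite, prior run still up
```

Combining --keep-azurite on one run with --reuse-azurite on the next is what makes fast iteration possible: Azurite setup/teardown is paid once instead of on every run.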

Batch mode run scripts — run_gpu_local_01.sh, run_gpu_local_02.sh

  • Use docker-compose run --rm (not docker compose up) for batch mode execution.
  • Set RUN_MODE=batch + BATCH_INPUT_PREFIX so the container processes once and exits.
  • _01 uses south_building scene, _02 uses my_scene.
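The batch-mode invocation pattern the two scripts share can be sketched as follows (the service name processor-gpu and the input-prefix convention are assumptions for illustration):

```shell
# Hypothetical sketch: build the docker-compose batch invocation for a scene.
# 'run --rm' (not 'up') makes the container process once and exit;
# RUN_MODE=batch + BATCH_INPUT_PREFIX scope the pass to one scene's inputs.
batch_cmd() {
  local scene="$1"
  echo docker-compose run --rm \
    -e RUN_MODE=batch \
    -e BATCH_INPUT_PREFIX="input/${scene}/" \
    processor-gpu
}

batch_cmd south_building   # what run_gpu_local_01.sh effectively does
batch_cmd my_scene         # what run_gpu_local_02.sh effectively does
```

Using `run` instead of `up` matters because `up` keeps the service in watch mode indefinitely, which defeats a scripted E2E pass that needs a clean exit code.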

Other

  • compose.yaml — Added Azurite service dependency for GPU profile; mount config.1.yaml.
  • container-test/config.1.yaml — Test config with default training hyperparameters.
  • .acrignore — Excludes build artifacts from ACR context uploads.
  • run-gpu-local.sh — Minor path fix.

cicorias and others added 7 commits March 26, 2026 13:17
- job.rs: Use temp_folder/colmap/ and temp_folder/images/ instead of
  temp_folder/reconstruction/ and temp_folder/frames/ so the gsplat
  backend natively finds colmap/sparse/0 without fallback path searching.
  Read frame paths from images/ dir to ensure correct workspace derivation.
- compose.yaml: Mount container-test/config.1.yaml for local testing
- container-test/config.1.yaml: Test config with default training params
- run_gpu_local_01.sh: Original GPU local run script (south_building)
- run_gpu_local_02.sh: Simplified GPU local run script (my_scene)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from 'docker compose up' (watch mode, runs forever) to
'docker-compose run --rm' (batch mode, processes once and exits).
Add RUN_MODE=batch and BATCH_INPUT_PREFIX env vars so the container
downloads from local input dir, processes, and exits cleanly.

- run_gpu_local_01.sh: south_building scene
- run_gpu_local_02.sh: my_scene

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Standalone script that runs the full GPU pipeline in a Docker container
against Azurite (no compose). Steps:
1. Download test data (South Building)
2. Build GPU Docker image
3. Start Azurite via docker run on a dedicated network
4. Upload test videos via azurite_helper.py
5. Run GPU container in batch mode (gsplat on NVIDIA GPU)
6. Verify outputs (PLY, SPLAT, manifest) in Azurite
7. Cleanup (network, containers)

Usage: ./scripts/e2e/04-run-e2e_gpu_container.sh [--skip-build]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
--keep-azurite: leave Azurite + network running after script exits
--reuse-azurite: skip Azurite startup, connect to existing container
  from a prior run (detects by container name)

Useful for iterating on the GPU pipeline without waiting for Azurite
setup/teardown on each run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings March 26, 2026 19:20
@cicorias cicorias merged commit 0ec37f4 into Azure-Samples:main Mar 26, 2026
4 checks passed
@cicorias cicorias deleted the vishnu-local-clean branch March 26, 2026 19:20
Contributor

Copilot AI left a comment


Pull request overview

This PR updates the processor’s temporary workspace layout to match gsplat expectations and adds scripts/configuration to run GPU batch-mode E2E tests against Azurite.

Changes:

  • Updated the job pipeline temp directory layout (frames/ → images/, reconstruction/ → colmap/) and adjusted training frame discovery accordingly.
  • Added a standalone GPU container E2E script that runs the full pipeline against Azurite without docker-compose.
  • Updated local GPU run scripts and compose.yaml to include Azurite and revised config mounts; added an .acrignore for ACR build context trimming.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.

Summary per file:

  • src/processor/job.rs: Renames temp subdirectories and changes training frame enumeration to read from temp_folder/images for gsplat workspace derivation.
  • scripts/e2e/04-run-e2e_gpu_container.sh: New end-to-end GPU container test script orchestrating Azurite + batch-mode run + verification.
  • run_gpu_local_01.sh: New local GPU batch-mode helper script for south_building.
  • run_gpu_local_02.sh: New local GPU batch-mode helper script for my_scene.
  • run-gpu-local.sh: Adjusts local GPU run script config path handling.
  • compose.yaml: Adds Azurite service + health-gated dependencies; updates config mounts for CPU/GPU services; adds Azurite volume.
  • .acrignore: Excludes build artifacts and dev/test directories from ACR build context uploads.

Comment on lines +45 to +50
for arg in "$@"; do
case "$arg" in
--skip-build) SKIP_BUILD=true ;;
--keep-azurite) KEEP_AZURITE=true ;;
--reuse-azurite) REUSE_AZURITE=true ;;
--timeout) TIMEOUT="$2"; shift ;;

Copilot AI Mar 26, 2026


The CLI parsing loop is incorrect for flags with values: in for arg in "$@", --timeout assigns TIMEOUT="$2" (which depends on the current positional params, not the current arg) and shift does not advance the for iterator. Example: --skip-build --timeout 600 will set TIMEOUT to --timeout. Consider switching to a while [ $# -gt 0 ]; do case "$1" ...; shift ...; done style parser and validating that --timeout has a value (to avoid set -u crashes).

Suggested change
for arg in "$@"; do
case "$arg" in
--skip-build) SKIP_BUILD=true ;;
--keep-azurite) KEEP_AZURITE=true ;;
--reuse-azurite) REUSE_AZURITE=true ;;
--timeout) TIMEOUT="$2"; shift ;;
while [ "$#" -gt 0 ]; do
case "$1" in
--skip-build)
SKIP_BUILD=true
shift
;;
--keep-azurite)
KEEP_AZURITE=true
shift
;;
--reuse-azurite)
REUSE_AZURITE=true
shift
;;
--timeout)
if [ "$#" -lt 2 ]; then
echo "Error: --timeout requires a value (in seconds)" >&2
exit 1
fi
TIMEOUT="$2"
shift 2
;;
*)
echo "Error: Unknown option: $1" >&2
exit 1
;;
esac
done

Comment on lines +43 to +46
if [ ! -f "container-test/confg.yaml" ]; then
cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"

Copilot AI Mar 26, 2026


This script uses container-test/confg.yaml (missing the i) instead of the existing container-test/config.yaml used elsewhere (e.g., README and .github/instructions/docker-batch-e2e.instructions.md). Unless there is a deliberate new config file being introduced, consider switching back to container-test/config.yaml to avoid maintaining two similar config files and to match the documented workflow.

Suggested change
if [ ! -f "container-test/confg.yaml" ]; then
cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"
if [ ! -f "container-test/config.yaml" ]; then
cp config.example.yaml container-test/config.yaml
fi
echo " Config: container-test/config.yaml"

Comment on lines +43 to +46
if [ ! -f "container-test/confg.yaml" ]; then
cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"

Copilot AI Mar 26, 2026


This script uses container-test/confg.yaml instead of the existing container-test/config.yaml used by the rest of the repo/documentation. If this isn't intentional, prefer the standard filename to avoid confusion and config drift.

Suggested change
if [ ! -f "container-test/confg.yaml" ]; then
cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"
if [ ! -f "container-test/config.yaml" ]; then
cp config.example.yaml container-test/config.yaml
fi
echo " Config: container-test/config.yaml"

Comment on lines +48 to +54
if [ ! -f "container-test/confg.yaml" ]; then
cp config.example.yaml container-test/confg.yaml
echo "✓ Config file created at container-test/confg.yaml"
else
echo "✓ Config file already exists"
fi
echo " (Edit container-test/config.1.yaml to adjust training parameters)"
echo " (Edit container-test/confg.yaml to adjust training parameters)"

Copilot AI Mar 26, 2026


run-gpu-local.sh now creates/uses container-test/confg.yaml, which diverges from the existing container-test/config.yaml referenced by the README and Docker batch E2E instructions. Unless a new config file is being intentionally introduced, consider using the standard container-test/config.yaml filename (or renaming consistently) to avoid breaking documented commands and requiring users to maintain multiple configs.

Comment on lines +39 to 45
depends_on:
azurite:
condition: service_healthy
volumes:
- ./output/data:/data
- ./container-test/config.yaml:/app/config.yaml:ro
- ./container-test/confg.yaml:/app/config.yaml:ro
- ./output/tmp:/tmp/3dgs-work

Copilot AI Mar 26, 2026


Compose now mounts ./container-test/confg.yaml into the processor containers, but the repo contains container-test/config.yaml and the documentation/examples consistently refer to that filename. Unless confg.yaml is a new intentional file (not included in this PR), this will break docker compose up out-of-the-box. Consider reverting to container-test/config.yaml (or adding the new file and updating docs consistently).

Comment on lines 84 to 87
volumes:
- ./output/data:/data
- ./container-test/config.yaml:/app/config.yaml:ro
- ./container-test/confg.yaml:/app/config.yaml:ro
- ./output/tmp:/tmp/3dgs-work

Copilot AI Mar 26, 2026


Same issue for the GPU service: mounting ./container-test/confg.yaml is inconsistent with the existing container-test/config.yaml used elsewhere, and will fail unless users create a new file with that name. Prefer the existing filename or add the intended config file to the repo and update references consistently.

Comment on lines +501 to 508
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0
let mut frame_paths: Vec<PathBuf> = std::fs::read_dir(&temp_folder.join("images"))
.context("Failed to read image dir for training")?
.filter_map(|e| e.ok().map(|e| e.path()))
.filter(|p| p.extension().and_then(|e| e.to_str()) == Some("jpg"))
.collect();
frame_paths.sort();


Copilot AI Mar 26, 2026


Training now reads frames from ${temp_folder}/images via read_dir(), but this directory is only created/populated inside the reconstruction stage. If the job resumes from a checkpoint at ProcessingStage::Training and ${temp_folder}/images is missing/empty, training will fail even though frame_sets are available. Consider ensuring the images dir exists and is populated (e.g., create_dir_all + copy_frames_to_colmap_dir(&frame_sets, &image_dir) when needed) or falling back to frame_sets when read_dir yields no JPGs; also avoid silently skipping read_dir entry errors (filter_map(|e| e.ok())).

Suggested change
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0
let mut frame_paths: Vec<PathBuf> = std::fs::read_dir(&temp_folder.join("images"))
.context("Failed to read image dir for training")?
.filter_map(|e| e.ok().map(|e| e.path()))
.filter(|p| p.extension().and_then(|e| e.to_str()) == Some("jpg"))
.collect();
frame_paths.sort();
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0.
// If the directory is missing or empty (e.g., when resuming from a checkpoint),
// fall back to using the already-loaded frame_sets.
let image_dir = temp_folder.join("images");
let mut frame_paths: Vec<PathBuf> = if image_dir.exists() {
let read_dir = std::fs::read_dir(&image_dir)
.with_context(|| format!("Failed to read image dir for training: {}", image_dir.display()))?;
let mut paths = Vec::new();
for entry_result in read_dir {
let entry = entry_result
.with_context(|| format!("Failed to read entry in image dir: {}", image_dir.display()))?;
let path = entry.path();
if path.extension().and_then(|e| e.to_str()) == Some("jpg") {
paths.push(path);
}
}
paths
} else {
Vec::new()
};
// Fallback: if no JPGs were found in image_dir, derive frame paths from frame_sets.
if frame_paths.is_empty() {
frame_paths = frame_sets
.iter()
.flat_map(|set: &FrameSet| {
set.frames
.iter()
.map(|frame| frame.path.clone())
})
.collect();
}
frame_paths.sort();
if frame_paths.is_empty() {
anyhow::bail!("No frames available for training (image dir empty and frame_sets empty)");
}

// Collect all frame paths from all videos
let image_dir = temp_folder.join("frames");
let image_dir = temp_folder.join("images");
std::fs::create_dir_all(&image_dir).context("Failed to create frames directory")?;

Copilot AI Mar 26, 2026


The error context string still says "Failed to create frames directory" even though the directory is now images/. Updating the message will make failures easier to diagnose.

Suggested change
std::fs::create_dir_all(&image_dir).context("Failed to create frames directory")?;
std::fs::create_dir_all(&image_dir).context("Failed to create images directory")?;

docker run --rm --gpus all \
--name "$PROCESSOR_CONTAINER" \
--network "$NETWORK_NAME" \
-v "$PROJECT_ROOT/container-test/confg.yaml:/app/config.yaml:ro" \

Copilot AI Mar 26, 2026


The container config mount points at container-test/confg.yaml, but the repo currently contains container-test/config.yaml (and docs reference that filename). As written, the docker run will fail unless the user manually creates confg.yaml. Consider using container-test/config.yaml (or adding/renaming the intended file) and keeping filenames consistent across compose/scripts/docs.

Suggested change
-v "$PROJECT_ROOT/container-test/confg.yaml:/app/config.yaml:ro" \
-v "$PROJECT_ROOT/container-test/config.yaml:/app/config.yaml:ro" \
