feat: GPU Container Batch Mode & E2E Testing (#15)
cicorias merged 7 commits into Azure-Samples:main
Conversation
- job.rs: Use temp_folder/colmap/ and temp_folder/images/ instead of temp_folder/reconstruction/ and temp_folder/frames/ so the gsplat backend natively finds colmap/sparse/0 without fallback path searching. Read frame paths from the images/ dir to ensure correct workspace derivation.
- compose.yaml: Mount container-test/config.1.yaml for local testing
- container-test/config.1.yaml: Test config with default training params
- run_gpu_local_01.sh: Original GPU local run script (south_building)
- run_gpu_local_02.sh: Simplified GPU local run script (my_scene)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Switch from `docker compose up` (watch mode, runs forever) to `docker-compose run --rm` (batch mode, processes once and exits). Add RUN_MODE=batch and BATCH_INPUT_PREFIX env vars so the container downloads from the local input dir, processes, and exits cleanly.

- run_gpu_local_01.sh: south_building scene
- run_gpu_local_02.sh: my_scene

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
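To illustrate the batch/watch split, here is a minimal sketch of an entrypoint branching on these variables. Only the RUN_MODE and BATCH_INPUT_PREFIX names come from this commit; the dispatch logic and messages are illustrative, not the actual container entrypoint.

```shell
# Hypothetical entrypoint dispatch; the real entrypoint is not shown in this PR.
RUN_MODE=batch
BATCH_INPUT_PREFIX=input/my_scene

if [ "$RUN_MODE" = "batch" ]; then
  # Batch mode: download from the input prefix, process once, then exit.
  MODE_MSG="batch: processing $BATCH_INPUT_PREFIX once, then exiting"
else
  # Watch mode: keep polling for new uploads (the old 'docker compose up' behavior).
  MODE_MSG="watch: polling for new uploads"
fi
echo "$MODE_MSG"
```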
Standalone script that runs the full GPU pipeline in a Docker container against Azurite (no compose). Steps:

1. Download test data (South Building)
2. Build GPU Docker image
3. Start Azurite via docker run on a dedicated network
4. Upload test videos via azurite_helper.py
5. Run GPU container in batch mode (gsplat on NVIDIA GPU)
6. Verify outputs (PLY, SPLAT, manifest) in Azurite
7. Cleanup (network, containers)

Usage: ./scripts/e2e/04-run-e2e_gpu_container.sh [--skip-build]

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
- --keep-azurite: leave Azurite + network running after the script exits
- --reuse-azurite: skip Azurite startup and connect to an existing container from a prior run (detected by container name)

Useful for iterating on the GPU pipeline without waiting for Azurite setup/teardown on each run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Pull request overview
This PR updates the processor’s temporary workspace layout to match gsplat expectations and adds scripts/configuration to run GPU batch-mode E2E tests against Azurite.
Changes:
- Updated job pipeline temp directory layout (frames/ → images/, reconstruction/ → colmap/) and adjusted training frame discovery accordingly.
- Added a standalone GPU container E2E script that runs the full pipeline against Azurite without docker-compose.
- Updated local GPU run scripts and compose.yaml to include Azurite and revised config mounts; added an .acrignore for ACR build context trimming.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 9 comments.
| File | Description |
|---|---|
| src/processor/job.rs | Renames temp subdirectories and changes training frame enumeration to read from temp_folder/images for gsplat workspace derivation. |
| scripts/e2e/04-run-e2e_gpu_container.sh | New end-to-end GPU container test script orchestrating Azurite + batch-mode run + verification. |
| run_gpu_local_01.sh | New local GPU batch-mode helper script for south_building. |
| run_gpu_local_02.sh | New local GPU batch-mode helper script for my_scene. |
| run-gpu-local.sh | Adjusts local GPU run script config path handling. |
| compose.yaml | Adds Azurite service + health-gated dependencies; updates config mounts for CPU/GPU services; adds Azurite volume. |
| .acrignore | Excludes build artifacts and dev/test directories from ACR build context uploads. |
```shell
for arg in "$@"; do
  case "$arg" in
    --skip-build) SKIP_BUILD=true ;;
    --keep-azurite) KEEP_AZURITE=true ;;
    --reuse-azurite) REUSE_AZURITE=true ;;
    --timeout) TIMEOUT="$2"; shift ;;
```
The CLI parsing loop is incorrect for flags with values: in `for arg in "$@"`, `--timeout` assigns `TIMEOUT="$2"` (which depends on the current positional parameters, not the current arg), and `shift` does not advance the for iterator. Example: `--skip-build --timeout 600` will set `TIMEOUT` to `--timeout`. Consider switching to a `while [ $# -gt 0 ]; do case "$1" in ... esac; done` style parser and validating that `--timeout` has a value (to avoid `set -u` crashes).
Suggested change:

```shell
while [ "$#" -gt 0 ]; do
  case "$1" in
    --skip-build)
      SKIP_BUILD=true
      shift
      ;;
    --keep-azurite)
      KEEP_AZURITE=true
      shift
      ;;
    --reuse-azurite)
      REUSE_AZURITE=true
      shift
      ;;
    --timeout)
      if [ "$#" -lt 2 ]; then
        echo "Error: --timeout requires a value (in seconds)" >&2
        exit 1
      fi
      TIMEOUT="$2"
      shift 2
      ;;
    *)
      echo "Error: Unknown option: $1" >&2
      exit 1
      ;;
```
```shell
if [ ! -f "container-test/confg.yaml" ]; then
  cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"
```
This script uses container-test/confg.yaml (missing the i) instead of the existing container-test/config.yaml used elsewhere (e.g., README and .github/instructions/docker-batch-e2e.instructions.md). Unless there is a deliberate new config file being introduced, consider switching back to container-test/config.yaml to avoid maintaining two similar config files and to match the documented workflow.
Suggested change:

```shell
if [ ! -f "container-test/config.yaml" ]; then
  cp config.example.yaml container-test/config.yaml
fi
echo " Config: container-test/config.yaml"
```
```shell
if [ ! -f "container-test/confg.yaml" ]; then
  cp config.example.yaml container-test/confg.yaml
fi
echo " Config: container-test/confg.yaml"
```
This script uses container-test/confg.yaml instead of the existing container-test/config.yaml used by the rest of the repo/documentation. If this isn't intentional, prefer the standard filename to avoid confusion and config drift.
Suggested change:

```shell
if [ ! -f "container-test/config.yaml" ]; then
  cp config.example.yaml container-test/config.yaml
fi
echo " Config: container-test/config.yaml"
```
```diff
 if [ ! -f "container-test/confg.yaml" ]; then
   cp config.example.yaml container-test/confg.yaml
   echo "✓ Config file created at container-test/confg.yaml"
 else
   echo "✓ Config file already exists"
 fi
-echo " (Edit container-test/config.1.yaml to adjust training parameters)"
+echo " (Edit container-test/confg.yaml to adjust training parameters)"
```
run-gpu-local.sh now creates/uses container-test/confg.yaml, which diverges from the existing container-test/config.yaml referenced by the README and Docker batch E2E instructions. Unless a new config file is being intentionally introduced, consider using the standard container-test/config.yaml filename (or renaming consistently) to avoid breaking documented commands and requiring users to maintain multiple configs.
```diff
 depends_on:
   azurite:
     condition: service_healthy
 volumes:
   - ./output/data:/data
-  - ./container-test/config.yaml:/app/config.yaml:ro
+  - ./container-test/confg.yaml:/app/config.yaml:ro
   - ./output/tmp:/tmp/3dgs-work
```
Compose now mounts ./container-test/confg.yaml into the processor containers, but the repo contains container-test/config.yaml and the documentation/examples consistently refer to that filename. Unless confg.yaml is a new intentional file (not included in this PR), this will break docker compose up out-of-the-box. Consider reverting to container-test/config.yaml (or adding the new file and updating docs consistently).
```diff
 volumes:
   - ./output/data:/data
-  - ./container-test/config.yaml:/app/config.yaml:ro
+  - ./container-test/confg.yaml:/app/config.yaml:ro
   - ./output/tmp:/tmp/3dgs-work
```
Same issue for the GPU service: mounting ./container-test/confg.yaml is inconsistent with the existing container-test/config.yaml used elsewhere, and will fail unless users create a new file with that name. Prefer the existing filename or add the intended config file to the repo and update references consistently.
```rust
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0
let mut frame_paths: Vec<PathBuf> = std::fs::read_dir(&temp_folder.join("images"))
    .context("Failed to read image dir for training")?
    .filter_map(|e| e.ok().map(|e| e.path()))
    .filter(|p| p.extension().and_then(|e| e.to_str()) == Some("jpg"))
    .collect();
frame_paths.sort();
```
Training now reads frames from ${temp_folder}/images via read_dir(), but this directory is only created/populated inside the reconstruction stage. If the job resumes from a checkpoint at ProcessingStage::Training and ${temp_folder}/images is missing/empty, training will fail even though frame_sets are available. Consider ensuring the images dir exists and is populated (e.g., create_dir_all + copy_frames_to_colmap_dir(&frame_sets, &image_dir) when needed) or falling back to frame_sets when read_dir yields no JPGs; also avoid silently skipping read_dir entry errors (filter_map(|e| e.ok())).
Suggested change:

```rust
// workspace (image_dir.parent() = temp_folder) and finds colmap/sparse/0.
// If the directory is missing or empty (e.g., when resuming from a checkpoint),
// fall back to using the already-loaded frame_sets.
let image_dir = temp_folder.join("images");
let mut frame_paths: Vec<PathBuf> = if image_dir.exists() {
    let read_dir = std::fs::read_dir(&image_dir)
        .with_context(|| format!("Failed to read image dir for training: {}", image_dir.display()))?;
    let mut paths = Vec::new();
    for entry_result in read_dir {
        let entry = entry_result
            .with_context(|| format!("Failed to read entry in image dir: {}", image_dir.display()))?;
        let path = entry.path();
        if path.extension().and_then(|e| e.to_str()) == Some("jpg") {
            paths.push(path);
        }
    }
    paths
} else {
    Vec::new()
};
// Fallback: if no JPGs were found in image_dir, derive frame paths from frame_sets.
if frame_paths.is_empty() {
    frame_paths = frame_sets
        .iter()
        .flat_map(|set: &FrameSet| set.frames.iter().map(|frame| frame.path.clone()))
        .collect();
}
frame_paths.sort();
if frame_paths.is_empty() {
    anyhow::bail!("No frames available for training (image dir empty and frame_sets empty)");
}
```
```diff
 // Collect all frame paths from all videos
-let image_dir = temp_folder.join("frames");
+let image_dir = temp_folder.join("images");
 std::fs::create_dir_all(&image_dir).context("Failed to create frames directory")?;
```
The error context string still says "Failed to create frames directory" even though the directory is now images/. Updating the message will make failures easier to diagnose.
Suggested change:

```rust
std::fs::create_dir_all(&image_dir).context("Failed to create images directory")?;
```
```shell
docker run --rm --gpus all \
  --name "$PROCESSOR_CONTAINER" \
  --network "$NETWORK_NAME" \
  -v "$PROJECT_ROOT/container-test/confg.yaml:/app/config.yaml:ro" \
```
The container config mount points at container-test/confg.yaml, but the repo currently contains container-test/config.yaml (and docs reference that filename). As written, the docker run will fail unless the user manually creates confg.yaml. Consider using container-test/config.yaml (or adding/renaming the intended file) and keeping filenames consistent across compose/scripts/docs.
Suggested change:

```shell
-v "$PROJECT_ROOT/container-test/config.yaml:/app/config.yaml:ro" \
```
GPU Container Batch Mode & E2E Testing
Aligns the processor's temp directory layout with gsplat conventions and adds standalone GPU container E2E testing against Azurite.
Changes:
- Pipeline fix — src/processor/job.rs
- GPU container E2E test — scripts/e2e/04-run-e2e_gpu_container.sh
- Batch mode run scripts — run_gpu_local_01.sh, run_gpu_local_02.sh
- Other