fix: allow Ray actor pools to scale to zero #1996

Open
nightcityblade wants to merge 3 commits into NVIDIA-NeMo:main from nightcityblade:fix/issue-1544

Conversation

@nightcityblade
Contributor

Description

Allow Ray Data actor stage concurrency to use a minimum actor count of 0 when it is derived from available resources.

This lets Ray Data scale an actor pool down and free resources for other stages in constrained clusters instead of pinning at least one actor.

Closes #1544.

Usage

calculate_concurrency_for_actors_for_stage(stage)
# returns (0, max_actors) for resource-derived actor pool concurrency
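
To illustrate the return shape, here is a minimal sketch of how a caller might interpret the value. Ray Data's `map_batches` takes a `concurrency` argument that can be either an int (fixed pool) or a `(min, max)` tuple (autoscaling pool). The helper name `describe_concurrency` is hypothetical and is not part of NeMo Curator or Ray.

```python
from typing import Tuple, Union

# Either a fixed pool size or a (min_actors, max_actors) range.
Concurrency = Union[int, Tuple[int, int]]


def describe_concurrency(concurrency: Concurrency) -> str:
    """Hypothetical helper: summarize how an actor pool would be sized."""
    if isinstance(concurrency, int):
        return f"fixed pool of {concurrency} actor(s)"
    min_actors, max_actors = concurrency
    if min_actors == 0:
        # With a minimum of 0 the pool is allowed to shrink to nothing.
        return f"autoscaling pool of 0 to {max_actors} actor(s); may release every actor"
    return (
        f"autoscaling pool of {min_actors} to {max_actors} actor(s); "
        f"always holds at least {min_actors}"
    )


print(describe_concurrency((0, 4)))
print(describe_concurrency((1, 4)))
print(describe_concurrency(3))
```

The first two calls show the difference this PR is about: with `(1, 4)` one actor stays pinned, while `(0, 4)` lets the pool give its resources back to other stages.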

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Testing

  • uv run ruff check nemo_curator/backends/experimental/ray_data/utils.py tests/backends/experimental/ray_data/test_utils.py
  • uv run ruff format --check nemo_curator/backends/experimental/ray_data/utils.py tests/backends/experimental/ray_data/test_utils.py
  • uv run --python 3.12 pytest tests/backends/experimental/ray_data/test_utils.py -q (blocked locally: NeMo-Curator currently only supports Linux systems; this machine is Darwin)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>
@copy-pr-bot

copy-pr-bot Bot commented May 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@greptile-apps
Contributor

greptile-apps Bot commented May 17, 2026

Greptile Summary

This PR fixes Ray actor pool concurrency to use 0 as the minimum instead of 1, allowing Ray Data to scale an actor pool fully down on constrained clusters. It also adds an explicit RuntimeError guard for the previously-silent (1, 0) case where resources are insufficient to fit even one actor.

  • calculate_concurrency_for_actors_for_stage now returns (0, max_actors) for resource-derived pools, replacing the old (1, max_actors) that pinned one actor even when the cluster was under pressure.
  • A RuntimeError with a detailed diagnostic message is raised when max_actors < 1, replacing the former silent (1, 0) return that would have passed an unusable concurrency range to Dataset.map_batches.

Confidence Score: 5/5

Safe to merge — the change is small, well-tested, and the RuntimeError guard cleanly replaces the former silent (1, 0) path.

The diff is confined to a single utility function and its tests. Returning (0, max_actors) is a deliberate Ray Data contract that allows the actor pool to scale down, and the new RuntimeError guard ensures the previously-dangerous insufficient-resources path is surfaced immediately rather than silently passed to Dataset.map_batches. All affected tests are updated consistently.

No files require special attention.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/backends/experimental/ray_data/utils.py | Changed return value from (1, max_actors) to (0, max_actors) and added a RuntimeError guard when max_actors < 1; docstring updated to match new contract. |
| tests/backends/experimental/ray_data/test_utils.py | Updated all resource-derived concurrency assertions from (1, N) to (0, N) and converted the insufficient-resources test to expect RuntimeError instead of (1, 0). |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[calculate_concurrency_for_actors_for_stage] --> B{num_workers set and positive?}
    B -- Yes --> C[return int: max of 1 and num_workers]
    B -- No --> D[get_available_cpu_gpu_resources]
    D --> E[Compute max_cpu_actors from CPU constraint]
    E --> F[Compute max_gpu_actors from GPU constraint]
    F --> G[max_actors = min of cpu and gpu limits]
    G --> H{max_actors less than 1?}
    H -- Yes --> I[raise RuntimeError with diagnostics]
    H -- No --> J[return tuple: 0 to max_actors for Ray Data scaling]
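
The flow above can be sketched in plain Python. This is a simplified stand-in and not the code in `utils.py`: the function name `sketch_actor_concurrency` and its parameters are assumptions, available resources are passed in directly instead of being read via `get_available_cpu_gpu_resources`, and each actor is assumed to need a positive CPU amount.

```python
from typing import Optional, Tuple, Union


def sketch_actor_concurrency(
    available_cpus: float,
    available_gpus: float,
    cpus_per_actor: float,
    gpus_per_actor: float,
    num_workers: Optional[int] = None,
) -> Union[int, Tuple[int, int]]:
    """Hypothetical re-creation of the flowchart's decision logic."""
    # Explicit worker count: return a fixed pool size, never below 1.
    if num_workers is not None and num_workers > 0:
        return max(1, num_workers)

    if cpus_per_actor <= 0:
        # Assumption made for this sketch only.
        raise ValueError("cpus_per_actor must be positive in this sketch")

    # How many actors fit under the CPU constraint.
    max_cpu_actors = int(available_cpus // cpus_per_actor)
    # A zero GPU requirement places no extra limit on the pool.
    if gpus_per_actor > 0:
        max_gpu_actors = int(available_gpus // gpus_per_actor)
    else:
        max_gpu_actors = max_cpu_actors

    max_actors = min(max_cpu_actors, max_gpu_actors)
    if max_actors < 1:
        # Fail loudly instead of handing back an unusable range.
        raise RuntimeError(
            "Not enough resources for a single actor: "
            f"need {cpus_per_actor} CPU / {gpus_per_actor} GPU per actor, "
            f"have {available_cpus} CPU / {available_gpus} GPU"
        )

    # Minimum of 0 lets the pool scale down fully.
    return (0, max_actors)
```

For example, with 8 CPUs and 2 GPUs available and actors that each need 2 CPUs and 1 GPU, the GPU constraint is the tighter one and the sketch returns `(0, 2)`. With only 1 CPU available for the same actor, it raises `RuntimeError` rather than returning a range with a maximum of 0.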

Reviews (3): Last reviewed commit: "fix: error on unschedulable actor concur..."

Comment thread nemo_curator/backends/experimental/ray_data/utils.py
@nightcityblade
Contributor Author

Thanks — I addressed the remaining Greptile docstring note in 848c112.

What changed:

  • clarified that the resource-derived return shape is (min_actors, max_actors)
  • documented that the minimum is now 0 for the actor-pool path so Ray Data can scale down fully on constrained clusters

Validation:

  • python3 -m py_compile nemo_curator/backends/experimental/ray_data/utils.py


Development

Successfully merging this pull request may close these issues.

Consider using a minimum of 0 resources in our Ray data actor pool strategy.

2 participants