fix: allow Ray actor pools to scale to zero by nightcityblade · Pull Request #1996 · NVIDIA-NeMo/Curator

nightcityblade · 2026-05-17T15:15:14Z

Description

Allow Ray Data actor stage concurrency to use a minimum actor count of 0 when it is derived from available resources.

This lets Ray Data scale an actor pool down and free resources for other stages in constrained clusters instead of pinning at least one actor.

Closes #1544.

Usage

calculate_concurrency_for_actors_for_stage(stage)
# returns (0, max_actors) for resource-derived actor pool concurrency
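
The function can return two shapes: a single integer when a worker count is set explicitly, or a (min_actors, max_actors) tuple when concurrency is derived from resources. Callers may want to tell them apart. A minimal sketch, assuming only those two shapes; `describe_concurrency` is a hypothetical helper for illustration and is not part of NeMo Curator:

```python
from typing import Tuple, Union

Concurrency = Union[int, Tuple[int, int]]


def describe_concurrency(concurrency: Concurrency) -> str:
    """Summarize a concurrency value (hypothetical helper, not a Curator API)."""
    if isinstance(concurrency, tuple):
        # Resource-derived path: a (min_actors, max_actors) range.
        min_actors, max_actors = concurrency
        return f"autoscaling pool: {min_actors} to {max_actors} actors"
    # Explicit worker count: a fixed-size pool.
    return f"fixed pool: {concurrency} actors"


print(describe_concurrency((0, 8)))
print(describe_concurrency(4))
```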

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Testing

uv run ruff check nemo_curator/backends/experimental/ray_data/utils.py tests/backends/experimental/ray_data/test_utils.py
uv run ruff format --check nemo_curator/backends/experimental/ray_data/utils.py tests/backends/experimental/ray_data/test_utils.py
uv run --python 3.12 pytest tests/backends/experimental/ray_data/test_utils.py -q
(The pytest run was blocked locally: NeMo-Curator currently supports only Linux, and this machine runs Darwin.)

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

copy-pr-bot · 2026-05-17T15:15:17Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-17T15:20:32Z

Greptile Summary

This PR changes the minimum Ray actor pool concurrency from 1 to 0, allowing Ray Data to scale an actor pool fully down on constrained clusters. It also adds an explicit RuntimeError guard for the previously silent (1, 0) case, where resources are insufficient to fit even one actor.

calculate_concurrency_for_actors_for_stage now returns (0, max_actors) for resource-derived pools, replacing the old (1, max_actors) that pinned one actor even when the cluster was under pressure.
A RuntimeError with a detailed diagnostic message is raised when max_actors < 1, replacing the former silent (1, 0) return that would have passed an unusable concurrency range to Dataset.map_batches.
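
To illustrate why a (1, 0) range is unusable: no pool size can be at least 1 and at most 0 at the same time. The check below is a hedged sketch of that reasoning only; `is_schedulable_range` is a hypothetical name, and this is not how Ray Data or NeMo Curator validates the range:

```python
def is_schedulable_range(min_actors: int, max_actors: int) -> bool:
    """Return True if some pool size satisfies the range and can run work.

    Hypothetical illustration: the range must be ordered (min <= max)
    and must allow at least one actor at its upper bound.
    """
    return 0 <= min_actors <= max_actors and max_actors >= 1


print(is_schedulable_range(1, 0))  # old insufficient-resources result
print(is_schedulable_range(0, 4))  # new resource-derived shape
```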

Confidence Score: 5/5

Safe to merge — the change is small, well-tested, and the RuntimeError guard cleanly replaces the former silent (1, 0) path.

The diff is confined to a single utility function and its tests. Returning (0, max_actors) deliberately relies on the Ray Data contract that lets the actor pool scale down, and the new RuntimeError guard surfaces the previously dangerous insufficient-resources path immediately instead of silently passing it to Dataset.map_batches. All affected tests are updated consistently.

No files require special attention.

Important Files Changed

Filename	Overview
nemo_curator/backends/experimental/ray_data/utils.py	Changed return value from (1, max_actors) to (0, max_actors) and added a RuntimeError guard when max_actors < 1; docstring updated to match new contract.
tests/backends/experimental/ray_data/test_utils.py	Updated all resource-derived concurrency assertions from (1, N) to (0, N) and converted the insufficient-resources test to expect RuntimeError instead of (1, 0).

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[calculate_concurrency_for_actors_for_stage] --> B{num_workers set and positive?}
    B -- Yes --> C[return int: max of 1 and num_workers]
    B -- No --> D[get_available_cpu_gpu_resources]
    D --> E[Compute max_cpu_actors from CPU constraint]
    E --> F[Compute max_gpu_actors from GPU constraint]
    F --> G[max_actors = min of cpu and gpu limits]
    G --> H{max_actors less than 1?}
    H -- Yes --> I[raise RuntimeError with diagnostics]
    H -- No --> J[return tuple: 0 to max_actors for Ray Data scaling]
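
The decision flow above can be sketched in plain Python. This is an approximation under stated assumptions, not the code in utils.py: the function name, its parameters, and the error text are hypothetical, available resources are passed in directly rather than queried from the cluster, and the sketch assumes each actor requests a positive number of CPUs.

```python
from typing import Optional, Tuple, Union


def sketch_actor_concurrency(
    num_workers: Optional[int],
    cpus_per_actor: float,
    gpus_per_actor: float,
    available_cpus: float,
    available_gpus: float,
) -> Union[int, Tuple[int, int]]:
    """Hypothetical re-creation of the flowchart; not the Curator implementation."""
    # Explicit worker count: return a fixed pool size of at least 1.
    if num_workers is not None and num_workers > 0:
        return max(1, num_workers)

    # Assumption of this sketch: every actor needs some CPU.
    if cpus_per_actor <= 0:
        raise ValueError("this sketch assumes a positive CPU request per actor")

    # How many actors fit under each resource constraint.
    max_cpu_actors = int(available_cpus // cpus_per_actor)
    if gpus_per_actor > 0:
        max_gpu_actors = int(available_gpus // gpus_per_actor)
    else:
        # No GPU request, so GPUs do not constrain the pool.
        max_gpu_actors = max_cpu_actors

    max_actors = min(max_cpu_actors, max_gpu_actors)
    if max_actors < 1:
        # Fail loudly instead of returning an unusable range such as (1, 0).
        raise RuntimeError(
            f"cannot fit one actor: needs {cpus_per_actor} CPUs and "
            f"{gpus_per_actor} GPUs, but only {available_cpus} CPUs and "
            f"{available_gpus} GPUs are available"
        )

    # Minimum of 0 lets the pool scale down fully.
    return (0, max_actors)


print(sketch_actor_concurrency(None, 2, 1, 8, 2))
```

With 8 CPUs and 2 GPUs available and each actor needing 2 CPUs and 1 GPU, the GPU constraint is the tighter one, so the sketch yields a range of 0 to 2 actors.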

Reviews (3): Last reviewed commit: "fix: error on unschedulable actor concur..."

nightcityblade · 2026-05-18T03:05:05Z

Thanks — I addressed the remaining Greptile docstring note in 848c112.

What changed:

clarified that the resource-derived return shape is (min_actors, max_actors)
documented that the minimum is now 0 for the actor-pool path so Ray Data can scale down fully on constrained clusters

Validation:

python3 -m py_compile nemo_curator/backends/experimental/ray_data/utils.py

fix: allow Ray actor pools to scale to zero

ce00328

Signed-off-by: nightcityblade <nightcityblade@gmail.com>

nightcityblade requested review from abhinavg4, ayushdg, oyilmaz-nvidia and praateekmahajan as code owners May 17, 2026 15:15

github-actions Bot added the community-request label May 17, 2026

greptile-apps Bot reviewed May 17, 2026

Comment thread nemo_curator/backends/experimental/ray_data/utils.py

docs: clarify zero-min actor pool docstring

848c112

fix: error on unschedulable actor concurrency

c916bd2

fix: allow Ray actor pools to scale to zero #1996
nightcityblade wants to merge 3 commits into NVIDIA-NeMo:main from nightcityblade:fix/issue-1544

2 participants
