fix: allow Ray actor pools to scale to zero #1996
Signed-off-by: nightcityblade <nightcityblade@gmail.com>
Greptile Summary
This PR fixes Ray actor pool concurrency to use
Confidence Score: 5/5
Safe to merge: the change is small, well-tested, and the RuntimeError guard cleanly replaces the former silent (1, 0) path.
The diff is confined to a single utility function and its tests. Returning (0, max_actors) is a deliberate Ray Data contract that allows the actor pool to scale down, and the new RuntimeError guard ensures the previously dangerous insufficient-resources path is surfaced immediately rather than silently passed to Dataset.map_batches. All affected tests are updated consistently. No files require special attention.
Important Files Changed
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[calculate_concurrency_for_actors_for_stage] --> B{num_workers set and positive?}
B -- Yes --> C[return int: max of 1 and num_workers]
B -- No --> D[get_available_cpu_gpu_resources]
D --> E[Compute max_cpu_actors from CPU constraint]
E --> F[Compute max_gpu_actors from GPU constraint]
F --> G[max_actors = min of cpu and gpu limits]
G --> H{max_actors less than 1?}
H -- Yes --> I[raise RuntimeError with diagnostics]
H -- No --> J[return tuple: 0 to max_actors for Ray Data scaling]
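The decision flow above can be sketched in plain Python. This is only an illustration of the flowchart, not the code in the PR: the function name `sketch_concurrency`, its parameters, and the error message are hypothetical, and the available resources are passed in directly instead of being read from `get_available_cpu_gpu_resources`. It also assumes every actor needs a positive CPU share and that a GPU requirement of zero places no GPU limit.

```python
from __future__ import annotations


def sketch_concurrency(
    num_workers: int | None,
    cpus_per_actor: float,
    gpus_per_actor: float,
    available_cpus: float,
    available_gpus: float,
) -> int | tuple[int, int]:
    """Hypothetical stand-in that mirrors the flowchart's branches."""
    # Branch B: an explicit, positive worker count yields a fixed pool size.
    if num_workers is not None and num_workers > 0:
        return max(1, num_workers)

    # Assumption: each actor requests a positive amount of CPU.
    if cpus_per_actor <= 0:
        raise ValueError("cpus_per_actor must be positive in this sketch")

    # Steps E and F: how many actors fit under each resource constraint.
    max_cpu_actors = int(available_cpus // cpus_per_actor)
    if gpus_per_actor > 0:
        max_gpu_actors = int(available_gpus // gpus_per_actor)
    else:
        # Assumption: no GPU requirement means no GPU limit.
        max_gpu_actors = max_cpu_actors

    # Step G: the tighter constraint wins.
    max_actors = min(max_cpu_actors, max_gpu_actors)

    # Branch H: fail loudly when not even one actor can be scheduled.
    if max_actors < 1:
        raise RuntimeError(
            f"Cannot schedule any actor: needs {cpus_per_actor} CPU / "
            f"{gpus_per_actor} GPU per actor, but only {available_cpus} CPU / "
            f"{available_gpus} GPU are available."
        )

    # Step J: a minimum of 0 lets the pool scale down to zero actors.
    return (0, max_actors)
```

With 8 CPUs and 2 GPUs available, an actor needing 2 CPUs and 1 GPU gives a CPU limit of 4 and a GPU limit of 2, so the sketch returns `(0, 2)`; an actor needing 4 CPUs on a 2-CPU cluster raises `RuntimeError` instead of returning an unschedulable range.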
Reviews (3): Last reviewed commit: "fix: error on unschedulable actor concur..."
Thanks. I addressed the remaining Greptile docstring note in 848c112.
What changed:
Validation:
Description
Allow Ray Data actor stage concurrency to use a minimum actor count of 0 when it is derived from available resources.
This lets Ray Data scale an actor pool down and free resources for other stages in constrained clusters instead of pinning at least one actor.
Closes #1544.
Usage
Checklist
Testing
uv run ruff check nemo_curator/backends/experimental/ray_data/utils.py tests/backends/experimental/ray_data/test_utils.py
uv run ruff format --check nemo_curator/backends/experimental/ray_data/utils.py tests/backends/experimental/ray_data/test_utils.py
uv run --python 3.12 pytest tests/backends/experimental/ray_data/test_utils.py -q (blocked locally: NeMo-Curator currently only supports Linux systems; this machine is Darwin)