
[Do not merge] Grpc server v4 conda forge deps#995

Open
tmckayus wants to merge 20 commits into NVIDIA:release/26.04 from tmckayus:grpc-server-v4-conda-forge-deps

Conversation

@tmckayus
Contributor

testing dep changes, clone of #939

tmckayus added 20 commits March 6, 2026 16:53
cuopt uses embedded grpc client to solve problems on a remote server
* switch to use of argparse
* make switch statements explicit (no default)
* raise errors with cuopt_expects
* use a second standard logger for server main and workers so
that logging is uniform but server metadata is not interleaved
with solver output
The same parameters will be removed from solve_lp in another
change.
the previous build-from-source workaround is no longer necessary since
a bug in libabseil has been fixed
@tmckayus tmckayus requested review from a team as code owners March 25, 2026 20:42
@tmckayus tmckayus added the "feature request" (New feature or request), "non-breaking" (Introduces a non-breaking change), and "do not merge" (Do not merge if this flag is set) labels Mar 25, 2026

coderabbitai bot commented Mar 25, 2026

📝 Walkthrough


This pull request adds comprehensive gRPC-based remote execution support to cuOpt, enabling distributed optimization workloads across a client-server architecture. It includes complete protocol definitions, a full-featured gRPC client library, a multi-process server implementation with GPU worker management, environment-variable-driven client configuration, chunked array transfer support, TLS/mTLS security, and extensive test coverage.

Changes

Cohort / File(s) Summary
Documentation
GRPC_INTERFACE.md, GRPC_QUICK_START.md, GRPC_SERVER_ARCHITECTURE.md
Three comprehensive guides documenting the gRPC protocol surface, client/server setup, TLS configuration, and multi-process server architecture with worker process IPC design.
Protobuf Definitions
cpp/src/grpc/cuopt_remote.proto, cpp/src/grpc/cuopt_remote_service.proto
Protocol buffer message and service definitions for the CuOptRemoteService, including request/response types, chunked transfer messages, job lifecycle enums, and solver configuration schemas for LP and MIP.
gRPC Client Library
cpp/src/grpc/client/grpc_client.hpp, cpp/src/grpc/client/grpc_client.cpp, cpp/src/grpc/client/solve_remote.cpp
Public client API with configurable connection, polling, TLS, chunking, and log streaming; implementation of synchronous/async job submission, status checking, result retrieval, and remote solve entry points (solve_lp_remote, solve_mip_remote) with environment-variable configuration.
Problem/Settings Mappers
cpp/src/grpc/grpc_problem_mapper.hpp, cpp/src/grpc/grpc_problem_mapper.cpp, cpp/src/grpc/grpc_settings_mapper.hpp, cpp/src/grpc/grpc_settings_mapper.cpp
Bidirectional template converters between CPU optimization problems, solver settings, and protobuf messages; chunked array estimation, partitioning, and reconstruction support.
Solution/Service Mappers
cpp/src/grpc/grpc_solution_mapper.hpp, cpp/src/grpc/grpc_solution_mapper.cpp, cpp/src/grpc/grpc_service_mapper.hpp, cpp/src/grpc/grpc_service_mapper.cpp
Converters for LP/MIP solutions and chunked result transfer; factory functions for submit request construction.
gRPC Server Core
cpp/src/grpc/server/grpc_service_impl.cpp, cpp/src/grpc/server/grpc_server_types.hpp, cpp/src/grpc/server/grpc_server_main.cpp
Service implementation handling all RPC endpoints (submit, chunked upload/download, status, result, streaming); shared-memory/synchronization types; server initialization, TLS setup, worker spawning, and background thread orchestration.
Server Support Infrastructure
cpp/src/grpc/server/grpc_worker.cpp, cpp/src/grpc/server/grpc_worker_infra.cpp, cpp/src/grpc/server/grpc_job_management.cpp, cpp/src/grpc/server/grpc_pipe_io.cpp, cpp/src/grpc/server/grpc_pipe_serialization.hpp, cpp/src/grpc/server/grpc_server_logger.hpp, cpp/src/grpc/server/grpc_server_logger.cpp, cpp/src/grpc/server/grpc_server_threads.cpp, cpp/src/grpc/server/grpc_field_element_size.hpp, cpp/src/grpc/server/grpc_incumbent_proto.hpp
Worker process solver dispatch; pipe-based IPC for job/result/incumbent transfer; job queue submission and status tracking; result retrieval and session management background threads; logging infrastructure; field size metadata.
gRPC Unit Tests
cpp/tests/linear_programming/grpc/grpc_client_test.cpp, cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp, cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp, cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
Comprehensive mock-based client unit tests, pipe serialization round-trip validation, test helper utilities for mock injection and log capture.
gRPC Integration Tests
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp, cpp/tests/linear_programming/grpc/CMakeLists.txt
End-to-end server lifecycle management, concurrent/async solve scenarios, chunked transfer validation, error recovery, TLS/mTLS fixture management, and chunk validation tests; CMake test configuration.
C API Integration
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
Updated CPU-only test environment to start/manage a real cuopt_grpc_server subprocess and verify remote execution paths.
Build System Updates
build.sh, cpp/CMakeLists.txt, python/libcuopt/CMakeLists.txt, ci/build_wheel_libcuopt.sh, ci/utils/install_protobuf_grpc.sh
Added cuopt_grpc_server build target and command-line option; integrated protobuf/gRPC code generation; added Protobuf/gRPC toolchain install script; extended CMake to locate and link gRPC/protobuf; applied RPATH configuration to server executable.
Conda Environment Dependencies
conda/environments/all_cuda-129_arch-aarch64.yaml, conda/environments/all_cuda-129_arch-x86_64.yaml, conda/environments/all_cuda-131_arch-aarch64.yaml, conda/environments/all_cuda-131_arch-x86_64.yaml, conda/recipes/libcuopt/recipe.yaml, dependencies.yaml
Added gRPC, Protobuf, Abseil, c-ares, re2, openssl, and libuuid dependencies; updated libcuopt recipe to include cuopt_grpc_server binary in package contents.
Remote Solver Entry Points & Cleanup
cpp/include/cuopt/linear_programming/solve_remote.hpp, cpp/src/pdlp/solve.cu, cpp/src/pdlp/solve_remote.cu, cpp/src/mip_heuristics/solve.cu, cpp/src/pdlp/CMakeLists.txt, cpp/src/pdlp/cpu_optimization_problem.cpp, cpp/src/pdlp/optimization_problem.cu
Updated solve_lp_remote signature to remove problem_checking/pdlp_solver_mode parameters; replaced stub solve_remote.cu with functional gRPC client implementation; removed debug logging; removed solve_remote.cu from CMake sources.
Python Remote Execution Tests
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
Added gRPC server subprocess lifecycle management; new TLS/mTLS test fixtures with certificate generation; enhanced CLI and solve correctness assertions; environment-variable-driven remote configuration; hardened subprocess entry-point dispatch.

Estimated code review effort

🎯 5 (Critical) | ⏱️ ~120 minutes



@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 11

Note

Due to the large number of review comments, Critical severity comments were prioritized as inline comments.

🟠 Major comments (16)
cpp/include/cuopt/linear_programming/solve_remote.hpp-35-37 (1)

35-37: ⚠️ Potential issue | 🟠 Major

Don't silently ignore non-default LP solve flags on the remote path.

The interface overload still accepts problem_checking and use_pdlp_solver_mode, but removing them here means remote execution falls back to built-in behavior while local execution still honors caller input. Either carry those knobs in the remote request or fail fast when non-default values are passed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/include/cuopt/linear_programming/solve_remote.hpp` around lines 35 - 37,
The remote LP solver overload solve_lp_remote currently ignores
caller-controlled flags in pdlp_solver_settings_t (notably problem_checking and
use_pdlp_solver_mode); update the function to either accept and propagate those
flags in the remote request or validate settings and error when they are
non-default: modify the signature of
solve_lp_remote(cpu_optimization_problem_t<i_t,f_t> const& cpu_problem,
pdlp_solver_settings_t<i_t,f_t> const& settings) to include the relevant
settings on the remote payload and ensure the remote handler uses
problem_checking and use_pdlp_solver_mode, or add an early check inside
solve_lp_remote that throws or returns an error if settings.problem_checking or
settings.use_pdlp_solver_mode differ from their defaults so behavior is
consistent with local solves.
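The fail-fast option could look like the sketch below. The struct and its defaults are placeholders standing in for the relevant pdlp_solver_settings_t fields, not the real type:

```cpp
#include <cassert>
#include <stdexcept>

// Hypothetical mirror of the two pdlp_solver_settings_t fields that the
// remote path currently drops; field names and defaults are assumptions.
struct lp_settings {
  bool problem_checking = true;  // assumed default
  int pdlp_solver_mode  = 0;     // assumed default mode
};

// Fail fast when a caller passes flags the remote request cannot carry,
// instead of silently falling back to server-side defaults.
void check_remote_supported(const lp_settings& s)
{
  const lp_settings defaults{};
  if (s.problem_checking != defaults.problem_checking ||
      s.pdlp_solver_mode != defaults.pdlp_solver_mode) {
    throw std::invalid_argument(
      "solve_lp_remote: problem_checking/pdlp_solver_mode are not "
      "propagated to the server; use default values or solve locally");
  }
}
```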
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-428-477 (1)

428-477: ⚠️ Potential issue | 🟠 Major

Avoid a fixed port plus raw TCP connect for readiness.

If port_ is already occupied, the child can exit with EADDRINUSE while tcp_connect_check() still succeeds against the stale listener. The suite then points CUOPT_REMOTE_* at the wrong process and becomes flaky. Pick an ephemeral free port, or at least verify server_pid_ is still alive and answering a gRPC-specific probe before exporting the env vars.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 428 -
477, The current startup uses a fixed/derived port_ and then only
tcp_connect_check(port_, ...) which can hit a stale listener; change to pick an
ephemeral free port (e.g., create a temporary TCP socket, bind to port 0 to
discover the assigned port, read it into port_, close the socket) before forking
and launching the server (references: port_, server_pid_, server_path_, execl),
then after fork/exec use a gRPC-specific readiness probe (or at minimum verify
server_pid_ is still alive via waitpid/WNOHANG) instead of relying solely on
tcp_connect_check(); only call setenv("CUOPT_REMOTE_HOST", ...) and
setenv("CUOPT_REMOTE_PORT", ...) after the ephemeral port is confirmed and the
child process (server_pid_) is verified responsive.
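The ephemeral-port step might be sketched with plain POSIX sockets as below. The helper name is illustrative, and there is a small race window between closing the probe socket and the server rebinding the port:

```cpp
#include <arpa/inet.h>
#include <cassert>
#include <netinet/in.h>
#include <sys/socket.h>
#include <unistd.h>

// Bind to port 0 so the kernel assigns a free ephemeral port, then read
// it back with getsockname(). Returns -1 on failure.
int pick_free_port()
{
  int fd = socket(AF_INET, SOCK_STREAM, 0);
  if (fd < 0) { return -1; }
  sockaddr_in addr{};
  addr.sin_family      = AF_INET;
  addr.sin_addr.s_addr = htonl(INADDR_LOOPBACK);
  addr.sin_port        = 0;  // let the kernel choose
  if (bind(fd, reinterpret_cast<sockaddr*>(&addr), sizeof(addr)) != 0) {
    close(fd);
    return -1;
  }
  socklen_t len = sizeof(addr);
  getsockname(fd, reinterpret_cast<sockaddr*>(&addr), &len);
  int port = ntohs(addr.sin_port);
  close(fd);  // freed for the server child to bind (small race window)
  return port;
}
```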
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-420-478 (1)

420-478: ⚠️ Potential issue | 🟠 Major

Snapshot the original env before any early return.

SetUpTestSuite() returns on "binary not found", fork() failure, and startup timeout before orig_* / *_was_set_ are initialized. TearDownTestSuite() then falls into the unsetenv() branches and clobbers any pre-existing CUDA_VISIBLE_DEVICES, CUOPT_REMOTE_HOST, and CUOPT_REMOTE_PORT values when this suite is skipped.

Minimal fix
   static void SetUpTestSuite()
   {
+    const char* cv     = getenv("CUDA_VISIBLE_DEVICES");
+    const char* rh     = getenv("CUOPT_REMOTE_HOST");
+    const char* rp     = getenv("CUOPT_REMOTE_PORT");
+    orig_cuda_visible_ = cv ? cv : "";
+    orig_remote_host_  = rh ? rh : "";
+    orig_remote_port_  = rp ? rp : "";
+    cuda_was_set_      = (cv != nullptr);
+    host_was_set_      = (rh != nullptr);
+    port_was_set_      = (rp != nullptr);
+
     server_path_ = find_server_binary();
     if (server_path_.empty()) {
       skip_reason_ = "cuopt_grpc_server binary not found";
       return;
     }
@@
-    const char* cv     = getenv("CUDA_VISIBLE_DEVICES");
-    const char* rh     = getenv("CUOPT_REMOTE_HOST");
-    const char* rp     = getenv("CUOPT_REMOTE_PORT");
-    orig_cuda_visible_ = cv ? cv : "";
-    orig_remote_host_  = rh ? rh : "";
-    orig_remote_port_  = rp ? rp : "";
-    cuda_was_set_      = (cv != nullptr);
-    host_was_set_      = (rh != nullptr);
-    port_was_set_      = (rp != nullptr);
-
     setenv("CUDA_VISIBLE_DEVICES", "", 1);

As per coding guidelines, "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment."

Also applies to: 480-496

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 420 -
478, SetUpTestSuite currently returns early (when server binary missing, fork
fails, or startup times out) before orig_cuda_visible_, orig_remote_host_,
orig_remote_port_ and the boolean flags cuda_was_set_, host_was_set_,
port_was_set_ are initialized, causing TearDownTestSuite to clobber pre-existing
env vars; fix by snapshotting the original environment at the very start of
SetUpTestSuite (read getenv("CUDA_VISIBLE_DEVICES"),
getenv("CUOPT_REMOTE_HOST"), getenv("CUOPT_REMOTE_PORT") into orig_* and set the
*_was_set_ flags accordingly) so these variables are always initialized even if
the function returns early, then proceed with the existing logic that may
override env and start the server.
cpp/CMakeLists.txt-711-770 (1)

711-770: ⚠️ Potential issue | 🟠 Major

Gate cuopt_grpc_server in LP-only builds or compile out the MIP path.

grpc_worker.cpp unconditionally calls solve_mip() with mip_solver_settings_t, but the target is added without conditional gating. In LP-only builds, the MIP solver components are excluded while this target remains enabled, causing linker errors. Either wrap the target in an if(NOT BUILD_LP_ONLY) block or add conditional compilation guards in grpc_worker.cpp to skip MIP paths when LP-only mode is active.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 711 - 770, The cuopt_grpc_server target
exposes MIP-dependent code (grpc_worker.cpp calls solve_mip() with
mip_solver_settings_t) but is built even in LP-only configurations; either wrap
the entire target in CMake with an if(NOT BUILD_LP_ONLY) ... endif() so
cuopt_grpc_server is not added/linked when BUILD_LP_ONLY is true, or add
compile-time guards inside src/grpc/server/grpc_worker.cpp around calls to
solve_mip() and any use of mip_solver_settings_t (e.g. `#ifdef` BUILD_LP_ONLY /
`#ifndef` BUILD_LP_ONLY or equivalent CMake-generated define) so the MIP path is
excluded in LP-only builds and the target can link without MIP symbols.
cpp/src/grpc/server/grpc_server_main.cpp-195-204 (1)

195-204: ⚠️ Potential issue | 🟠 Major

These fixed SHM names make multiple servers interfere with each other.

Each instance unlinks and recreates the same SHM_JOB_QUEUE, SHM_RESULT_QUEUE, and SHM_CONTROL objects. Starting a second server on a different port will target the same shared-memory segments, so the two instances can tear down or overwrite each other's queues. Namespace the SHM names per server instance (for example by port/pid) and pass that namespace to workers.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 195 - 204, The
hardcoded shared-memory names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL
cause cross-instance interference; modify the server to generate a unique
namespace (e.g., include the server port and/or pid) and use that to build
per-instance names before calling shm_unlink and create_shared_memory. Update
the code paths that reference SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL (the
allocations assigning job_queue, result_queue, and shm_ctrl via
create_shared_memory) to use the namespaced strings, and ensure the same
generated namespace is passed to worker processes so they open the matching
segments instead of the global static names.
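A minimal sketch of the namespacing idea, assuming port plus PID is unique enough per host; the "/cuopt_*" prefixes are placeholders, not the server's actual segment names:

```cpp
#include <cassert>
#include <string>
#include <unistd.h>

// Per-instance shared-memory names: two servers on the same host get
// distinct segments, and the stored names allow symmetric teardown.
struct shm_namespace {
  std::string job_queue;
  std::string result_queue;
  std::string control;
};

shm_namespace make_shm_namespace(int port)
{
  const std::string suffix =
    "_" + std::to_string(port) + "_" + std::to_string(getpid());
  return {"/cuopt_job_queue" + suffix,
          "/cuopt_result_queue" + suffix,
          "/cuopt_control" + suffix};
}
```

The generated names would be created once at startup, stored in server state, and passed to worker processes (for example via argv or an environment variable) so they open the matching segments.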
cpp/src/grpc/client/grpc_client.cpp-863-869 (1)

863-869: ⚠️ Potential issue | 🟠 Major

Chunked upload still buffers the whole payload before sending.

build_array_chunk_requests() materializes a full std::vector<SendArrayChunkRequest> first, so the large-problem path keeps an extra copy of every chunk in memory. That defeats the main reason to chunk at all and can exhaust client RAM on the models most likely to use this path. Generate/send chunks incrementally instead of prebuilding them all.

As per coding guidelines, "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cpp` around lines 863 - 869, The code
currently calls build_array_chunk_requests(...) which returns a fully
materialized std::vector<SendArrayChunkRequest> (assigned to chunk_requests),
causing an extra full-copy of all chunks in memory; change this to produce/send
chunks incrementally by replacing build_array_chunk_requests with a streaming
API (e.g., an iterator/generator or a producer callback) so the caller can loop
and send each SendArrayChunkRequest as it is created instead of storing them
all; update the call site that assigns chunk_requests to instead iterate over
the generator or invoke the callback for each chunk (or change
build_array_chunk_requests to accept a visitor std::function<void(const
SendArrayChunkRequest&)> or return a lazy iterator) and ensure existing code
that references chunk_requests is adapted to send each chunk immediately to
avoid buffering the whole payload.
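The visitor variant could be shaped like this sketch, where the callback stands in for writing one SendArrayChunkRequest to the stream; at most one chunk is resident at a time:

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <functional>
#include <vector>

// Streaming chunker: each slice is handed to `send` as soon as it is
// computed, so no vector of prebuilt requests is ever materialized.
// The callback signature is an illustrative stand-in for sending a
// SendArrayChunkRequest over the wire.
void for_each_chunk(const std::vector<double>& values,
                    std::size_t chunk_elems,
                    const std::function<void(std::size_t offset,
                                             const double* data,
                                             std::size_t count)>& send)
{
  for (std::size_t off = 0; off < values.size(); off += chunk_elems) {
    const std::size_t n = std::min(chunk_elems, values.size() - off);
    send(off, values.data() + off, n);  // caller ships this chunk now
  }
}
```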
cpp/src/grpc/server/grpc_server_threads.cpp-147-153 (1)

147-153: ⚠️ Potential issue | 🟠 Major

Don't overload cancelled for transport failures.

Line 152 turns a pipe send failure into cancelled=true, but the rest of the server treats that bit as a user-initiated cancel. mark_worker_jobs_failed() will later report RESULT_CANCELLED / "Job was cancelled" for an internal I/O error, and if the worker stays alive this slot never gets failed or recycled here. Fail the job explicitly instead of reusing the cancellation flag.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 147 - 153, The code
currently sets job_queue[i].cancelled = true on a transport send failure, which
conflates I/O errors with user cancellations; instead, on send failure leave
cancelled untouched and mark the job explicitly failed (e.g., set a
job_queue[i].failed flag or job_queue[i].result = RESULT_IO_ERROR / an internal
I/O error code and any job_queue[i].status = JOB_FAILED if present), and ensure
any downstream cleanup uses that failure indicator (or call the existing
failure-path helper such as mark_worker_jobs_failed()/a similar job-fail helper
for this specific job) so the job is reported as an internal I/O failure rather
than RESULT_CANCELLED.
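One way to keep the two conditions apart is a distinct outcome value, as in this sketch; the enum and helper names are illustrative, not the server's actual types:

```cpp
#include <cassert>

// Keep user cancellation and internal transport failure distinct so
// downstream reporting never labels an I/O error as "Job was cancelled".
enum class job_outcome { pending, cancelled, io_error, done };

job_outcome on_dispatch(bool send_ok, bool user_cancel_requested)
{
  if (user_cancel_requested) { return job_outcome::cancelled; }
  if (!send_ok) { return job_outcome::io_error; }  // not a cancel
  return job_outcome::done;
}
```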
cpp/src/grpc/server/grpc_worker_infra.cpp-183-187 (1)

183-187: ⚠️ Potential issue | 🟠 Major

Crash cleanup should drop chunked payloads too.

This only erases pending_job_data. Chunked submissions are buffered in pending_chunked_data, so a worker crash before dispatch leaks the full chunk vector for that job. Clear the chunked map here as well.

🐛 Suggested fix
       {
         std::lock_guard<std::mutex> lock(pending_data_mutex);
         pending_job_data.erase(job_id);
+        pending_chunked_data.erase(job_id);
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 183 - 187, The
cleanup currently only erases pending_job_data and leaks chunked submissions;
update the same crash cleanup block to also remove the job's entry from
pending_chunked_data under the same lock (pending_data_mutex) so both maps are
cleared for job_id; ensure you call pending_chunked_data.erase(job_id) alongside
pending_job_data.erase(job_id) within the scope guarded by
std::lock_guard<std::mutex> on pending_data_mutex.
cpp/src/grpc/server/grpc_worker_infra.cpp-140-146 (1)

140-146: ⚠️ Potential issue | 🟠 Major

Keep worker_pids aligned with the logical worker id.

worker_monitor_thread() treats worker_pids[i] as worker i, but this loop only appends successful spawns. If worker 1 fails and worker 2 succeeds, PID 2 lands at index 1 and later restarts/pipe lookups target the wrong slot. Pre-size the vector and assign by worker_id.

🐛 Suggested fix
 void spawn_workers()
 {
+  worker_pids.assign(config.num_workers, 0);
   for (int i = 0; i < config.num_workers; ++i) {
     pid_t pid = spawn_worker(i, false);
-    if (pid < 0) { continue; }
-    worker_pids.push_back(pid);
+    if (pid > 0) { worker_pids[i] = pid; }
   }
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 140 - 146,
spawn_workers currently appends only successful spawns so worker_pids indices no
longer match logical worker ids; pre-size worker_pids to config.num_workers
(fill with a sentinel like -1) and then assign by index: call spawn_worker(i,
false) and set worker_pids[i] = pid (leave sentinel on failure) so
worker_monitor_thread() can safely treat worker_pids[i] as worker i; update any
logic that expects empty vector length accordingly.
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-33-35 (1)

33-35: ⚠️ Potential issue | 🟠 Major

The new dataset default is not propagated into the subprocess entrypoints.

The module-level default moved to ./datasets, but the child helpers later in this file still fall back to "./" when RAPIDS_DATASET_ROOT_DIR is unset. In that case the parent test resolves datasets from the new location while the subprocess still looks under the repo root and fails to open the MPS files. Apply the same default to each _impl_*_cpu_only() helper.

🐛 Suggested fix
-    dataset_root = os.environ.get("RAPIDS_DATASET_ROOT_DIR", "./")
+    dataset_root = os.environ.get("RAPIDS_DATASET_ROOT_DIR", "./datasets")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py`
around lines 33 - 35, The subprocess helper functions (the _impl_*_cpu_only()
helpers) still fall back to "./" for RAPIDS_DATASET_ROOT_DIR causing
subprocesses to look in the repo root; update each helper to use the same
module-level default by reading os.environ.get("RAPIDS_DATASET_ROOT_DIR",
RAPIDS_DATASET_ROOT_DIR) or otherwise defaulting to the RAPIDS_DATASET_ROOT_DIR
constant defined at module top, so subprocess entrypoints resolve datasets from
"./datasets" when the env var is unset.
cpp/src/grpc/client/grpc_client.cpp-1094-1108 (1)

1094-1108: ⚠️ Potential issue | 🟠 Major

Reject short or mismatched chunk replies before memcpy.

elem_offset advances by the requested chunk size, not by elements_in_chunk(). If the server returns a short chunk, this loop leaves a gap in array_bytes and still reports success. This block also never checks that the reply's download_id, field_id, and element_offset match the request. Treat any short/mismatched reply as a protocol error (or continue from the returned offset) before copying bytes.

🐛 Suggested guard
       cuopt::remote::GetResultChunkResponse chunk_resp;
       auto status = impl_->stub->GetResultChunk(&chunk_ctx, chunk_req, &chunk_resp);

       if (!status.ok()) {
         last_error_ = "GetResultChunk failed: " + status.error_message();
         return false;
       }
+
+      if (chunk_resp.download_id() != download_id || chunk_resp.field_id() != field_id ||
+          chunk_resp.element_offset() != elem_offset) {
+        last_error_ = "GetResultChunk: response metadata mismatch";
+        return false;
+      }

       int64_t elems_received = chunk_resp.elements_in_chunk();
       const auto& data       = chunk_resp.data();

-      if (elems_received < 0 || elems_received > elems_wanted ||
+      if (elems_received != elems_wanted ||
           elems_received > total_elems - elem_offset) {
         last_error_ = "GetResultChunk: invalid element count";
         return false;
       }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/client/grpc_client.cpp` around lines 1094 - 1108, The current
block copies chunk_resp.data() into array_bytes without verifying the reply
matches the request and without handling short chunks correctly; before the
memcpy in the GetResultChunk handling code, validate that
chunk_resp.download_id(), chunk_resp.field_id(), and chunk_resp.element_offset()
match the request's expected download_id/field_id and elem_offset (or
treat/accept the returned element_offset consistently), and ensure
elems_received is the value used to advance elem_offset (or reject if
elems_received != elems_wanted) — if any mismatch or a short chunk is detected,
set last_error_ to a descriptive protocol-error message and return false instead
of copying; update total_bytes_received and elem_offset only after successful
validation.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-312-320 (1)

312-320: ⚠️ Potential issue | 🟠 Major

Add SAN extension to the generated server certificate.

The C++ fixture generates a CN-only localhost certificate, while python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py includes an explicit SAN extension subjectAltName=DNS:localhost,IP:127.0.0.1. Modern OpenSSL respects SAN over CN; CN-only certs rely on fallback behavior that varies across versions and configurations. Align the C++ test by adding the same SAN extension via an -extfile parameter to the openssl x509 call.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 312
- 320, The server certificate generation in the test creates a CN-only cert
(server_key, server_csr, server_crt) which lacks a subjectAltName and can fail
hostname validation; update the openssl x509 invocation in the code that calls
run(...) (the block creating server_crt using ca_crt and ca_key) to include an
-extfile that defines subjectAltName=DNS:localhost,IP:127.0.0.1 (create a
temporary extfile or inline one before the run call) so the produced certificate
includes the SAN extension matching the Python test.
cpp/src/grpc/cuopt_remote_service.proto-314-331 (1)

314-331: ⚠️ Potential issue | 🟠 Major

Make incumbent pagination bounded by default.

Each Incumbent carries the full assignment vector, so treating max_count <= 0 as "no limit" can easily exceed max_message_bytes or force the server to build a very large response in memory. This is also the public client default today, so callers will hit it unless they remember to paginate manually. Require a positive page size here, and clamp or reject oversized pages server-side.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 331, The
Incumbent pagination currently allows max_count <= 0 to mean "no limit", which
can produce huge responses; change the handling so IncumbentRequest.max_count
must be positive and server-side code enforces a safe maximum: update the
service logic that processes IncumbentRequest to (a) treat non-positive
max_count as invalid (return an error) or replace it with a configurable
DEFAULT_PAGE_SIZE, and (b) clamp any requested max_count to a
SERVER_MAX_PAGE_SIZE before building the IncumbentResponse.incumbents list;
ensure these checks live where requests are decoded/handled (the RPC handler
that reads IncumbentRequest) and return a clear error or truncated page when the
client asks for an oversized page.
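The server-side policy might look like this sketch; the limit constant and return convention are illustrative defaults, not values taken from the server:

```cpp
#include <cassert>
#include <cstdint>

// Illustrative server cap on incumbents per page.
constexpr int64_t kMaxPageSize = 1000;

// Returns the effective page size, or -1 to signal an invalid request
// (non-positive max_count). Oversized requests are clamped rather than
// rejected, one of the two options suggested above.
int64_t effective_page_size(int64_t requested)
{
  if (requested <= 0) { return -1; }                    // require a positive page
  if (requested > kMaxPageSize) { return kMaxPageSize; }  // clamp oversized pages
  return requested;
}
```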
cpp/src/grpc/server/grpc_server_types.hpp-257-259 (1)

257-259: ⚠️ Potential issue | 🟠 Major

Namespace the shared-memory object names per server instance.

These fixed names prevent two servers on the same host from coexisting cleanly: the second instance will either attach to the first instance's queues or remove them during cleanup. Include something instance-specific here (for example port/PID/UUID) and keep it in server state for symmetric teardown.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 257 - 259, The
shared-memory names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL are
hard-coded and must be namespaced per server instance; change them to be
constructed with an instance-specific identifier (e.g., port, PID or generated
UUID) and store that identifier on the server object so teardown uses the exact
same names. Update the places that reference
SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL to build names using the stored
instance id (for example suffix or prefix) during creation, attach and cleanup
so two servers on the same host do not collide.
cpp/src/grpc/server/grpc_server_types.hpp-177-184 (1)

177-184: ⚠️ Potential issue | 🟠 Major

Initialize every pipe FD to -1.

If worker_pipes is resized/default-constructed before every field is assigned, these members are indeterminate. Any cleanup or partial-failure path that conditionally closes them can hit unrelated descriptors.

Suggested fix
 struct WorkerPipes {
-  int to_worker_fd;
-  int from_worker_fd;
-  int worker_read_fd;
-  int worker_write_fd;
-  int incumbent_from_worker_fd;
-  int worker_incumbent_write_fd;
+  int to_worker_fd = -1;
+  int from_worker_fd = -1;
+  int worker_read_fd = -1;
+  int worker_write_fd = -1;
+  int incumbent_from_worker_fd = -1;
+  int worker_incumbent_write_fd = -1;
 };
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 177 - 184, The struct
WorkerPipes currently leaves its file-descriptor members uninitialized; add
explicit initialization of every FD to -1 to avoid closing unrelated descriptors
during cleanup—either add in-class member initializers for to_worker_fd,
from_worker_fd, worker_read_fd, worker_write_fd, incumbent_from_worker_fd,
worker_incumbent_write_fd = -1 or provide a default constructor on WorkerPipes
that sets each of those members to -1; update any construction sites if
necessary to rely on the default-initialized values.
cpp/src/grpc/cuopt_remote.proto-251-271 (1)

251-271: 🛠️ Refactor suggestion | 🟠 Major

Add *_UNSPECIFIED = 0 to both ArrayFieldId and ResultFieldId enums, then shift all real field IDs up by 1.

In proto3, an omitted enum field deserializes to 0. With the current numbering, a missing or malformed field_id is indistinguishable from FIELD_A_VALUES and RESULT_PRIMAL_SOLUTION respectively, allowing malformed chunk requests to be routed to real arrays instead of being rejected. This must be fixed before the wire format ships.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` around lines 251 - 271, Add a zero-valued
sentinel to both enums and renumber existing members: introduce
ARRAY_FIELD_ID_UNSPECIFIED = 0 in the ArrayFieldId enum and
RESULT_FIELD_ID_UNSPECIFIED = 0 in the ResultFieldId enum, then increment every
existing enum value by 1 (e.g., FIELD_A_VALUES becomes 1, FIELD_A_INDICES
becomes 2, etc., and RESULT_PRIMAL_SOLUTION shifts from 0 to 1). Update any code
that constructs, parses, or switches on ArrayFieldId or ResultFieldId (search
for uses of ArrayFieldId and ResultFieldId, and for literal enum values like
FIELD_A_VALUES or RESULT_PRIMAL_SOLUTION) to use the new numeric values or
compare against the new _UNSPECIFIED sentinel to reject missing/zero field_id
cases. Ensure generated code and any RPC/serialization tests are
regenerated/updated to reflect the shifted enum values.
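The proto3 pitfall and the resulting check can be illustrated with a local enum mirroring the suggested renumbering (values here are for illustration, not the real generated code):

```cpp
#include <cassert>

// In proto3 an omitted enum field deserializes to 0, so value 0 must be
// a sentinel rather than a real array. After the suggested shift:
enum class array_field_id : int {
  unspecified = 0,  // reserved: "field_id not set / malformed"
  a_values    = 1,  // real fields start at 1
  a_indices   = 2,
};

// A decoded request whose field_id is the sentinel gets rejected
// instead of being routed to a real array.
bool is_valid_field(array_field_id id)
{
  return id != array_field_id::unspecified;
}
```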
🟡 Minor comments (3)
cpp/src/grpc/server/grpc_incumbent_proto.hpp-38-45 (1)

38-45: ⚠️ Potential issue | 🟡 Minor

Add INT_MAX guard before protobuf parse to prevent size truncation.

ParseFromArray() takes an int length parameter. Without a bounds check, oversized size_t input truncates silently instead of rejecting malformed IPC data cleanly. Add the guard:

if (size > static_cast<size_t>(std::numeric_limits<int>::max())) { return false; }

This mirrors the INT_MAX check in build_incumbent_proto() and cleanly rejects oversized input instead of letting the cast truncate it silently.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 38 - 45,
parse_incumbent_proto currently calls cuopt::remote::Incumbent::ParseFromArray
with size cast to int, which can truncate large size_t values; add a guard at
the start of parse_incumbent_proto to return false when size >
static_cast<size_t>(std::numeric_limits<int>::max()) so the function rejects
oversized inputs instead of silently truncating before calling ParseFromArray,
mirroring the INT_MAX check used in build_incumbent_proto.
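The guard pattern can be factored out as below. This is a hedged sketch, not the repository's code: `guarded_parse` and its `Message` parameter are hypothetical stand-ins for the protobuf `Incumbent` type used in the file.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <limits>

// Reject any byte count that cannot be represented as the int that
// protobuf's ParseFromArray() expects, instead of truncating it.
inline bool size_fits_parse_from_array(std::size_t size)
{
  return size <= static_cast<std::size_t>(std::numeric_limits<int>::max());
}

// Hypothetical guarded entry point; Message stands in for a protobuf type.
template <typename Message>
bool guarded_parse(Message& msg, const void* data, std::size_t size)
{
  if (!size_fits_parse_from_array(size)) return false;  // no silent truncation
  return msg.ParseFromArray(data, static_cast<int>(size));
}
```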
cpp/src/grpc/server/grpc_server_main.cpp-115-118 (1)

115-118: ⚠️ Potential issue | 🟡 Minor

--verbose is parsed but never affects config.verbose.

config.verbose is derived only from --quiet, so --verbose is effectively a no-op. Either drop the flag or include it in the final verbosity calculation so the CLI matches what it advertises.

Also applies to: 156-156

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 115 - 118, The
--verbose flag is parsed by program.add_argument("-v", "--verbose") but never
used when computing config.verbose (which is only set from --quiet); update the
final verbosity calculation to consider both flags (e.g., set config.verbose =
parsed_verbose && !parsed_quiet or apply precedence rules), or remove the
add_argument call if you intend not to support verbose; adjust the code
locations that compute config.verbose (refer to the config.verbose assignment
and the
program.add_argument("-q","--quiet")/program.add_argument("-v","--verbose")
usages) so the CLI behavior matches the advertised flags.
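One possible precedence rule for combining the two flags is sketched below. The policy itself (quiet wins, verbose alone enables verbose output) is an assumption for illustration, not the project's documented behavior.

```cpp
#include <cassert>

// Hypothetical helper: fold both parsed flags into the final verbosity
// so the CLI matches what it advertises. --quiet takes precedence.
inline bool compute_verbose(bool verbose_flag, bool quiet_flag)
{
  if (quiet_flag) return false;  // --quiet always silences
  return verbose_flag;           // otherwise honor --verbose
}
```

The config assignment would then read something like `config.verbose = compute_verbose(parsed_verbose, parsed_quiet);` rather than deriving it from `--quiet` alone.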
cpp/src/grpc/server/grpc_pipe_serialization.hpp-241-249 (1)

241-249: ⚠️ Potential issue | 🟡 Minor

Missing return value check for SerializeToArray.

SerializeToArray returns false on failure, but the return value is not checked. If serialization fails, the blob is returned anyway, containing zeroed or partially written bytes.

🐛 Proposed fix
 inline std::vector<uint8_t> serialize_submit_request_to_pipe(
   const cuopt::remote::SubmitJobRequest& request)
 {
   size_t byte_size = request.ByteSizeLong();
   if (byte_size == 0 || byte_size > static_cast<size_t>(std::numeric_limits<int>::max())) return {};
   std::vector<uint8_t> blob(byte_size);
-  request.SerializeToArray(blob.data(), static_cast<int>(byte_size));
+  if (!request.SerializeToArray(blob.data(), static_cast<int>(byte_size))) return {};
   return blob;
 }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 241 - 249, The
function serialize_submit_request_to_pipe does not check the boolean return
value of request.SerializeToArray, so a failed serialization can return an
uninitialized blob; update serialize_submit_request_to_pipe to capture the bool
result of request.SerializeToArray(blob.data(), static_cast<int>(byte_size)) and
if it returns false, return an empty vector (or otherwise handle the error
consistently with surrounding code) instead of returning the blob; ensure you
still validate byte_size as before and only return the populated blob when
SerializeToArray succeeds.
🧹 Nitpick comments (8)
conda/recipes/libcuopt/recipe.yaml (1)

205-209: Smoke-test the new server executable.

package_contents only proves the file was installed. A trivial launch check like cuopt_grpc_server --help would catch missing shared libraries or bad RPATHs for the newly shipped binary.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conda/recipes/libcuopt/recipe.yaml` around lines 205 - 209, Add a smoke-test
that actually runs the shipped server binary rather than just listing it: update
the recipe to include a test/run_test (or test/commands) entry that executes
"cuopt_grpc_server --help" (or an equivalent non-destructive invocation) after
install, so missing shared libs or bad RPATHs are caught; reference the existing
package_contents entry and the cuopt_grpc_server binary when adding this test.
cpp/CMakeLists.txt (1)

275-301: Use generator expressions to reference protoc and the gRPC plugin targets.

The current code reads IMPORTED_LOCATION properties directly from protobuf::protoc, which is fragile across different package configurations. The official protobuf documentation recommends using the generator expression $<TARGET_FILE:protobuf::protoc> instead. Similarly, the gRPC plugin target should be gRPC::grpc_cpp_plugin (the officially exported namespaced target from gRPCConfig.cmake), not the plain grpc_cpp_plugin. This ensures consistency across config-package installs and in-tree builds.

Specifically:

  • Replace the get_target_property calls for protoc with $<TARGET_FILE:protobuf::protoc>
  • Update the plugin check to use TARGET gRPC::grpc_cpp_plugin and $<TARGET_FILE:gRPC::grpc_cpp_plugin>
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/CMakeLists.txt` around lines 275 - 301, The current CMake logic reads
IMPORTED_LOCATION from protobuf::protoc and checks for a plain grpc_cpp_plugin
target; replace those fragile property reads with generator expressions and the
proper gRPC target: when TARGET protobuf::protoc is true set _PROTOBUF_PROTOC to
the generator expression $<TARGET_FILE:protobuf::protoc> (instead of using
get_target_property(IMPORTED_LOCATION*)), and when checking the gRPC plugin use
TARGET gRPC::grpc_cpp_plugin and set _GRPC_CPP_PLUGIN_EXECUTABLE to
$<TARGET_FILE:gRPC::grpc_cpp_plugin>; keep the fallback find_program/
find_package branches and the fatal-error checks unchanged.
cpp/tests/linear_programming/grpc/CMakeLists.txt (1)

119-124: Consider adding a test timeout for the integration test.

Integration tests that spawn external processes (the gRPC server) can hang indefinitely if the server fails to start or respond. Consider adding a timeout property to prevent CI pipeline hangs.

💡 Proposed enhancement
 add_test(
     NAME GRPC_INTEGRATION_TEST
     COMMAND ${CMAKE_COMMAND} -E env
         "CUOPT_GRPC_SERVER_PATH=$<TARGET_FILE:cuopt_grpc_server>"
         $<TARGET_FILE:GRPC_INTEGRATION_TEST>
 )
+
+set_tests_properties(GRPC_INTEGRATION_TEST PROPERTIES TIMEOUT 300)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 119 - 124, The
GRPC_INTEGRATION_TEST add_test currently can hang indefinitely; modify the CMake
test registration to set a timeout by adding a test property for
GRPC_INTEGRATION_TEST (or using set_tests_properties on the test name) and
assign a TIMEOUT value (seconds) appropriate for startup and test execution
(e.g., 60 or configurable via a variable). Locate the add_test registering
GRPC_INTEGRATION_TEST and follow it with
set_tests_properties(GRPC_INTEGRATION_TEST PROPERTIES TIMEOUT <seconds>) so the
spawned cuopt_grpc_server process is forcibly killed after the timeout.
GRPC_SERVER_ARCHITECTURE.md (1)

50-56: Consider documenting shared memory name collision risk for multi-instance deployments.

The shared memory segment names (/cuopt_job_queue, /cuopt_result_queue, /cuopt_control) are fixed. Running multiple server instances on the same host would cause collisions. Consider adding a note about this limitation or documenting how to use --port to differentiate instances if the segments are port-specific.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@GRPC_SERVER_ARCHITECTURE.md` around lines 50 - 56, Document that the shared
memory segment names `/cuopt_job_queue`, `/cuopt_result_queue`, and
`/cuopt_control` are fixed and will collide when multiple server instances run
on the same host; add a brief note in the "Shared Memory Segments" section
explaining the risk and recommend solutions (namespace segments per-instance
using a configurable prefix such as PID, port, or --port value, or allow a
CLI/config option to set custom segment names) and show an example naming
convention to avoid collisions.
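One naming convention for avoiding collisions is sketched below. Suffixing with the server's port is an assumption for illustration; a PID or a user-supplied prefix would work the same way.

```cpp
#include <cassert>
#include <string>

// Hypothetical per-instance segment naming: the fixed names become
// prefixes and the instance's port differentiates them. POSIX shm names
// must begin with '/' and contain no further slashes, which this
// scheme preserves.
inline std::string shm_segment_name(const std::string& base, int port)
{
  return base + "_" + std::to_string(port);
}
```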
cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)

848-860: Assert the reconstructed primal vector here too.

This test only checks the objective copied from the chunked header. A regression in RESULT_PRIMAL_SOLUTION stitching would still pass. Assert that the returned primal values are {1.5, 2.5} so the chunked download path exercises actual numerical reconstruction.

As per coding guidelines, "**/*test*.{cpp,cu,py}: Write tests validating numerical correctness of optimization results (not just 'runs without error')."

Also applies to: 871-876

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 848 -
860, The test mocks a GetResultChunk returning two primal values but never
asserts the reconstructed primal vector; update the test after the call that
verifies the objective to also fetch the returned primal solution and assert its
elements equal {1.5, 2.5} (i.e., add numeric equality checks for the
reconstructed primal vector produced by the code path that handles
cuopt::remote::RESULT_PRIMAL_SOLUTION). Apply the same additional assertions at
the second mocked-response site referenced (around lines 871-876) so both
chunked-download cases validate numerical reconstruction.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)

959-984: This test never proves log_callback was invoked.

received_logs is populated, but the assertions only check solve success. If streaming regresses and no callback fires, this test still passes. Assert at least one callback invocation or a known log line so the test covers the behavior in its name.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 959
- 984, The test SolveMIPWithLogCallback currently never asserts that
config.log_callback was invoked; modify the test to assert that received_logs
contains at least one entry (or contains a known log substring) after
client->solve_mip returns. Use the same log_mutex to safely read received_logs
(e.g., lock_guard on log_mutex) and then EXPECT_FALSE(received_logs.empty()) or
EXPECT_TRUE(any entry contains "presolve"/other known text) so the test fails if
streaming/callbacks regress.
cpp/src/grpc/server/grpc_service_impl.cpp (1)

437-438: Hardcoded element size assumes all result arrays are doubles.

The element size is hardcoded to sizeof(double), which works for the current LP/MIP solution arrays but could break if result arrays with different element types are added.

Consider fetching the element size from the array descriptor in ChunkedResultHeader that was set via add_result_array_descriptor() in populate_chunked_result_header_* for consistency.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 437 - 438, The code
currently hardcodes elem_size = sizeof(double) which assumes all result arrays
are doubles; instead query the element size from the ChunkedResultHeader's array
descriptor (the descriptor added via add_result_array_descriptor() in
populate_chunked_result_header_*), compute elem_size from that descriptor and
then derive array_size = total_bytes / elem_size so different element types are
handled correctly; update the logic that sets elem_size and array_size (replace
the hardcoded sizeof(double) usage) and ensure you handle any missing
descriptor/error cases.
cpp/src/grpc/server/grpc_job_management.cpp (1)

250-260: Redundant double-check of slot validity.

The slot validation at lines 257-260 duplicates the check at lines 251-252. While the intent appears to be guarding against ABA races, both checks use the same conditions without an intervening operation that could cause the state to change.

If defensive re-validation is desired, consider adding a comment explaining the race being guarded against.

♻️ Simplified version (if redundancy is unintentional)
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready.load(std::memory_order_acquire)) continue;
-    if (strcmp(job_queue[i].job_id, job_id.c_str()) != 0) continue;
-
-    // Re-validate the slot: the job_id could have changed between the
-    // initial check and now if the slot was recycled.  Load ready with
-    // acquire so we see all writes that published it.
-    if (!job_queue[i].ready.load(std::memory_order_acquire) ||
-        strcmp(job_queue[i].job_id, job_id.c_str()) != 0) {
+    // Load ready with acquire to see all writes that published the slot.
+    if (!job_queue[i].ready.load(std::memory_order_acquire)) continue;
+    if (strcmp(job_queue[i].job_id, job_id.c_str()) != 0) {
       continue;
     }
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 250 - 260, The loop
in grpc_job_management.cpp redundantly re-checks the same slot readiness and
job_id twice (the two uses of job_queue[i].ready.load(std::memory_order_acquire)
and strcmp(job_queue[i].job_id, job_id.c_str())), which either should be removed
or documented as an ABA defense; update the code by either deleting the second
identical validation (inside the comment block) to avoid duplicate checks, or if
you intend to defend against ABA/race windows keep the second check but replace
it with a clear comment explaining the exact race being guarded and prefer a
robust mechanism (e.g., a sequence/version atomic on job_queue entries) instead
of silent duplicate loads—refer to job_queue, MAX_JOBS, job_id,
ready.load(std::memory_order_acquire), and strcmp when making the change.
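The sequence/version idea mentioned above can be sketched as a seqlock-style slot. This is an illustrative pattern only, not the repository's JobSlot layout: the `version` member and the update/read helpers are hypothetical.

```cpp
#include <atomic>
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Sketch: each slot carries an atomic version bumped around every
// rewrite, so a reader can detect that the slot changed identity
// between its two checks (the ABA window the review describes).
struct JobSlot {
  std::atomic<std::uint64_t> version{0};  // even = stable, odd = mid-rewrite
  char job_id[64] = {0};

  // Writer: mark unstable, mutate, mark stable again.
  template <typename Fn>
  void update(Fn&& mutate) {
    version.fetch_add(1, std::memory_order_acq_rel);  // now odd
    mutate(*this);
    version.fetch_add(1, std::memory_order_release);  // even again
  }

  // Reader: retry until a consistent snapshot is observed.
  template <typename Fn>
  void read_consistent(Fn&& observe) const {
    for (;;) {
      std::uint64_t v1 = version.load(std::memory_order_acquire);
      if (v1 & 1) continue;  // writer in progress
      observe(*this);
      std::uint64_t v2 = version.load(std::memory_order_acquire);
      if (v1 == v2) return;  // slot did not change while we read
    }
  }
};
```

A reader that captures the version once, reads the job_id, then re-checks the version makes the double-check meaningful, unlike two back-to-back loads of the same flag.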

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e26b5406-c826-4a7f-abc9-dcc595a81d46

📥 Commits

Reviewing files that changed from the base of the PR and between f80d6cf and 1fd4002.

📒 Files selected for processing (56)
  • GRPC_INTERFACE.md
  • GRPC_QUICK_START.md
  • GRPC_SERVER_ARCHITECTURE.md
  • build.sh
  • ci/build_wheel_libcuopt.sh
  • ci/utils/install_protobuf_grpc.sh
  • conda/environments/all_cuda-129_arch-aarch64.yaml
  • conda/environments/all_cuda-129_arch-x86_64.yaml
  • conda/environments/all_cuda-131_arch-aarch64.yaml
  • conda/environments/all_cuda-131_arch-x86_64.yaml
  • conda/recipes/libcuopt/recipe.yaml
  • cpp/CMakeLists.txt
  • cpp/include/cuopt/linear_programming/solve_remote.hpp
  • cpp/src/grpc/client/grpc_client.cpp
  • cpp/src/grpc/client/grpc_client.hpp
  • cpp/src/grpc/client/solve_remote.cpp
  • cpp/src/grpc/cuopt_remote.proto
  • cpp/src/grpc/cuopt_remote_service.proto
  • cpp/src/grpc/grpc_problem_mapper.cpp
  • cpp/src/grpc/grpc_problem_mapper.hpp
  • cpp/src/grpc/grpc_service_mapper.cpp
  • cpp/src/grpc/grpc_service_mapper.hpp
  • cpp/src/grpc/grpc_settings_mapper.cpp
  • cpp/src/grpc/grpc_settings_mapper.hpp
  • cpp/src/grpc/grpc_solution_mapper.cpp
  • cpp/src/grpc/grpc_solution_mapper.hpp
  • cpp/src/grpc/server/grpc_field_element_size.hpp
  • cpp/src/grpc/server/grpc_incumbent_proto.hpp
  • cpp/src/grpc/server/grpc_job_management.cpp
  • cpp/src/grpc/server/grpc_pipe_io.cpp
  • cpp/src/grpc/server/grpc_pipe_serialization.hpp
  • cpp/src/grpc/server/grpc_server_logger.cpp
  • cpp/src/grpc/server/grpc_server_logger.hpp
  • cpp/src/grpc/server/grpc_server_main.cpp
  • cpp/src/grpc/server/grpc_server_threads.cpp
  • cpp/src/grpc/server/grpc_server_types.hpp
  • cpp/src/grpc/server/grpc_service_impl.cpp
  • cpp/src/grpc/server/grpc_worker.cpp
  • cpp/src/grpc/server/grpc_worker_infra.cpp
  • cpp/src/mip_heuristics/solve.cu
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/cpu_optimization_problem.cpp
  • cpp/src/pdlp/optimization_problem.cu
  • cpp/src/pdlp/solve.cu
  • cpp/src/pdlp/solve_remote.cu
  • cpp/tests/linear_programming/CMakeLists.txt
  • cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
  • cpp/tests/linear_programming/grpc/CMakeLists.txt
  • cpp/tests/linear_programming/grpc/grpc_client_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
  • cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp
  • cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
  • dependencies.yaml
  • python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
  • python/libcuopt/CMakeLists.txt
💤 Files with no reviewable changes (4)
  • cpp/src/pdlp/cpu_optimization_problem.cpp
  • cpp/src/pdlp/CMakeLists.txt
  • cpp/src/pdlp/optimization_problem.cu
  • cpp/src/pdlp/solve_remote.cu

Comment on lines +94 to +99
- openssl
- c-ares
- libgrpc
- libprotobuf
- libabseil
- re2

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "== top-level shared host requirements =="
sed -n '80,100p' conda/recipes/libcuopt/recipe.yaml

echo
echo "== libcuopt output host/run requirements =="
sed -n '165,194p' conda/recipes/libcuopt/recipe.yaml

echo
echo "== wheel build note for the same binary =="
sed -n '20,28p' ci/build_wheel_libcuopt.sh

Repository: NVIDIA/cuopt

Length of output: 1745


🏁 Script executed:

#!/bin/bash
# Check what cache.build.script does
rg -A 30 "cache\.build\.script" conda/recipes/libcuopt/recipe.yaml | head -60

# Also check the complete libcuopt output section to understand build flow
echo "=== Full libcuopt output section ===" 
sed -n '150,210p' conda/recipes/libcuopt/recipe.yaml

Repository: NVIDIA/cuopt

Length of output: 1699


🏁 Script executed:

# Find and display the cache.build.script section
sed -n '1,150p' conda/recipes/libcuopt/recipe.yaml | grep -A 50 "cache\.build\.script"

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Search for cache.build.script anywhere in the file
rg "cache\.build\.script" conda/recipes/libcuopt/recipe.yaml -A 40

# Also let's see the structure around lines 1-80 to understand the recipe better
echo "=== Top section of recipe ===" 
head -80 conda/recipes/libcuopt/recipe.yaml

Repository: NVIDIA/cuopt

Length of output: 3975


🏁 Script executed:

# Check if build.sh actually builds libcuopt/cuopt_grpc_server
echo "=== Checking build.sh for libcuopt build and cuopt_grpc_server ===" 
grep -n "libcuopt\|cuopt_grpc_server" build.sh | head -20

# Also check CMakeLists.txt to see if cuopt_grpc_server depends on libuuid
echo
echo "=== Checking CMakeLists.txt for cuopt_grpc_server and libuuid ===" 
find . -name "CMakeLists.txt" -exec grep -l "cuopt_grpc_server\|libuuid" {} \; 2>/dev/null

# Let's look at the actual CMakeLists in cpp directory
echo
echo "=== cpp/CMakeLists.txt search ===" 
grep -n "cuopt_grpc_server\|libuuid\|uuid" cpp/CMakeLists.txt 2>/dev/null || echo "Not found in cpp/CMakeLists.txt"

# Check cpp/build directory or subdirectories
find cpp -name "CMakeLists.txt" -type f 2>/dev/null | head -5

Repository: NVIDIA/cuopt

Length of output: 2611


🏁 Script executed:

# Verify the full top-level host requirements section to confirm libuuid is missing
echo "=== Complete top-level host requirements ===" 
sed -n '73,105p' conda/recipes/libcuopt/recipe.yaml | grep -E "^\s+- " | sort

# Double-check if libuuid appears anywhere in top-level requirements
echo
echo "=== Checking for libuuid in top-level section ===" 
sed -n '1,145p' conda/recipes/libcuopt/recipe.yaml | grep -i "libuuid"
echo "Exit code: $?"

Repository: NVIDIA/cuopt

Length of output: 729


Add libuuid to the shared cache host requirements.

cache.build.script builds libcuopt (line 23-30 in build.sh), which includes the cuopt_grpc_server target. The CMakeLists.txt explicitly requires uuid at link time (find_library(UUID_LIBRARY uuid REQUIRED) at line 749 in cpp/CMakeLists.txt). Because the output-specific libcuopt requirements are not present during the shared cache build, the conda build environment will fail at link time for cuopt_grpc_server unless libuuid is added to the top-level host requirements.

Also applies to: 174-190

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@conda/recipes/libcuopt/recipe.yaml` around lines 94 - 99, The shared-cache
host requirements for libcuopt are missing libuuid, which breaks linking of the
cuopt_grpc_server target during cache.build.script (build.sh lines that build
libcuopt); update the recipe.yaml host requirements list(s) that include
openssl, c-ares, libgrpc, libprotobuf, libabseil, re2 to also include libuuid so
the system-provided UUID library satisfies find_library(UUID_LIBRARY uuid
REQUIRED) in cpp/CMakeLists.txt and prevents link-time failures (also apply the
same addition to the other host requirements block referenced in the file).

package cuopt.remote;

// Import the existing message definitions
import "cuopt_remote.proto";

⚠️ Potential issue | 🔴 Critical

Fix the Buf import path.

buf resolves imports from the module root, so import "cuopt_remote.proto"; does not find cpp/src/grpc/cuopt_remote.proto. Update this import once the protos are moved/aligned with their package path.

🧰 Tools
🪛 Buf (1.66.1)

[error] 9-9: import "cuopt_remote.proto": file does not exist

(COMPILE)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, Update the import in
cuopt_remote_service.proto so Buf can resolve it from the module root: replace
the current import "cuopt_remote.proto" with the module-root-relative path that
matches where cuopt_remote.proto will live (i.e., the package-aligned path after
you move/align protos); modify the import statement in
cuopt_remote_service.proto (the existing import "cuopt_remote.proto") to the
correct module-root-relative path.


syntax = "proto3";

package cuopt.remote;

⚠️ Potential issue | 🔴 Critical

Move the proto files under a package-matching path.

buf lint will fail here because package cuopt.remote is not under a cuopt/remote/ directory. Either move this file (and cuopt_remote_service.proto) under a module-root-relative cuopt/remote/ path or rename the package/imports consistently.

🧰 Tools
🪛 Buf (1.66.1)

[error] 6-6: Files with package "cuopt.remote" must be within a directory "cuopt/remote" relative to root but were in directory "cpp/src/grpc".

(PACKAGE_DIRECTORY_MATCH)

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration
"package cuopt.remote" in cuopt_remote.proto (and the matching
cuopt_remote_service.proto) must live in a directory path that matches the
package; either move both files into a module-root-relative cuopt/remote/
directory so their filesystem path matches the package, or update the package
line and any imports to a package name that matches their current directory;
adjust imports in both cuopt_remote.proto and cuopt_remote_service.proto to
remain consistent after the move/rename so buf lint passes.

Comment on lines +139 to +146
// proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
// so the server keeps the library default (typically max()).
if (pb_settings.iteration_limit() >= 0) {
const auto limit = pb_settings.iteration_limit();
settings.iteration_limit = (limit > static_cast<int64_t>(std::numeric_limits<i_t>::max()))
? std::numeric_limits<i_t>::max()
: static_cast<i_t>(limit);
}

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check how clients set iteration_limit when using server defaults
rg -n "iteration_limit" --type cpp -C3 cpp/src/grpc/client/

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Check repository structure and find client-related code
find . -type d -name "*client*" | head -20

Repository: NVIDIA/cuopt

Length of output: 196


🏁 Script executed:

# Look for proto files to understand the iteration_limit schema
fd -e proto | head -20

Repository: NVIDIA/cuopt

Length of output: 127


🏁 Script executed:

# Examine the grpc_settings_mapper.cpp file more broadly to understand context
cat -n cpp/src/grpc/grpc_settings_mapper.cpp | head -100

Repository: NVIDIA/cuopt

Length of output: 5006


🏁 Script executed:

# Search for iteration_limit usage patterns across the codebase
rg -n "iteration_limit" --type cpp -C2 | head -100

Repository: NVIDIA/cuopt

Length of output: 8440


🏁 Script executed:

# Check the proto schema definition for iteration_limit
cat cpp/src/grpc/cuopt_remote.proto | grep -A 10 -B 5 "iteration_limit"

Repository: NVIDIA/cuopt

Length of output: 634


🏁 Script executed:

# Look at the test case mentioned in search results
sed -n '77,150p' cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp

Repository: NVIDIA/cuopt

Length of output: 3092


🏁 Script executed:

# Check if there are any gRPC-specific tests or documentation about client behavior
fd -e "test" -e "spec" | xargs grep -l "iteration_limit" 2>/dev/null | head -10

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Search for any existing bug reports or TODOs about this iteration_limit issue
rg -n "iteration_limit.*0" --type cpp -C2 cpp/src/grpc/

Repository: NVIDIA/cuopt

Length of output: 614


🏁 Script executed:

# Check if there are integration tests that test the from_proto behavior with 0 or unset values
fd -e "test.cpp" -e "test.hpp" | xargs grep -l "from_proto\|PDLPSolverSettings" 2>/dev/null | head -5

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Look at the complete from_proto function to see the full context
sed -n '120,180p' cpp/src/grpc/grpc_settings_mapper.cpp

Repository: NVIDIA/cuopt

Length of output: 3450


🏁 Script executed:

# Check if there are any other similar numeric field handling patterns to confirm consistency
sed -n '150,180p' cpp/src/grpc/grpc_settings_mapper.cpp

Repository: NVIDIA/cuopt

Length of output: 1745


Fix asymmetry in iteration_limit proto3 handling to prevent unintended zero-iteration behavior.

Lines 141–146 accept iteration_limit >= 0, which treats proto3's default value of 0 (when the field is omitted by the client) as a valid setting. This causes immediate solver termination (0 iterations) rather than using the library default (max()).

The proto schema explicitly documents the -1 sentinel for "unset" and warns that proto3 defaults to 0 when fields are omitted. The asymmetry is clear: to_proto() correctly converts max() to the -1 sentinel, but from_proto() incorrectly accepts 0 as explicitly set.

Fix: Either change the condition to if (pb_settings.iteration_limit() > 0) to treat 0 and negative values as unset, or validate against a valid range with a fallback default (matching the pattern used for presolver and pdlp_precision in lines 154–159 and 172–178).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/grpc_settings_mapper.cpp` around lines 139 - 146, The from-proto
handling for iteration_limit incorrectly treats proto3's default 0 as an
explicit setting; update the logic in grpc_settings_mapper.cpp that reads
pb_settings.iteration_limit() so it only accepts positive values (e.g., change
the guard to pb_settings.iteration_limit() > 0) or otherwise validate the range
and fall back to library default (std::numeric_limits<i_t>::max()) like the
presolver/pdlp_precision pattern; ensure settings.iteration_limit is set to
max() when the proto value is 0 or negative, and keep the existing cap that maps
values above numeric_limits<i_t>::max() down to that max.
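The suggested guard can be isolated into a small mapping function. This is a sketch under stated assumptions: `iteration_limit_from_proto` is a hypothetical helper, and instantiating `i_t` as int32_t is only one possibility.

```cpp
#include <cassert>
#include <cstdint>
#include <limits>

// Sketch of the fix: proto values <= 0 (including proto3's implicit 0
// for an omitted field) fall back to the library default, and values
// above the target type's range are capped at max(), matching the
// existing clamp in from_proto().
template <typename i_t>
i_t iteration_limit_from_proto(std::int64_t pb_value)
{
  if (pb_value <= 0) return std::numeric_limits<i_t>::max();  // unset / sentinel
  if (pb_value > static_cast<std::int64_t>(std::numeric_limits<i_t>::max()))
    return std::numeric_limits<i_t>::max();  // cap oversized values
  return static_cast<i_t>(pb_value);
}
```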

Comment on lines +33 to +75
bool read_from_pipe(int fd, void* data, size_t size, int timeout_ms)
{
uint8_t* ptr = static_cast<uint8_t*>(data);
size_t remaining = size;

// Poll once to enforce timeout before the first read. After data starts
// flowing, blocking read() is sufficient — if the writer dies the pipe
// closes and read() returns 0 (EOF). Avoids ~10k extra poll() syscalls
// per bulk transfer.
struct pollfd pfd = {fd, POLLIN, 0};
int pr;
do {
pr = poll(&pfd, 1, timeout_ms);
} while (pr < 0 && errno == EINTR);
if (pr < 0) {
SERVER_LOG_ERROR("[Server] poll() failed on pipe: %s", strerror(errno));
return false;
}
if (pr == 0) {
SERVER_LOG_ERROR("[Server] Timeout waiting for pipe data (waited %dms)", timeout_ms);
return false;
}
if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
SERVER_LOG_ERROR("[Server] Pipe error/hangup detected");
return false;
}

while (remaining > 0) {
ssize_t nread = ::read(fd, ptr, remaining);
if (nread > 0) {
ptr += nread;
remaining -= nread;
continue;
}
if (nread == 0) {
SERVER_LOG_ERROR("[Server] Pipe EOF (writer closed)");
return false;
}
if (errno == EINTR) continue;
SERVER_LOG_ERROR("[Server] Pipe read error: %s", strerror(errno));
return false;
}
return true;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🌐 Web query:

On POSIX pipes, can poll() return POLLIN and POLLHUP together while unread data remains, and does a single initial poll() followed by blocking read() enforce a timeout for the entire exact-length read?

💡 Result:

Yes: on POSIX pipes, poll can return both POLLIN and POLLHUP together while unread data remains in the pipe. This occurs when data has been written to the pipe and the write end is closed by the last writer: POLLIN indicates data is available to read without blocking, and POLLHUP indicates the pipe has been closed by the last writer. These events are explicitly not mutually exclusive per the POSIX specification. After reading all available data, a subsequent read will return 0 (EOF).

No: a single initial poll with a timeout followed by a blocking read does not enforce a timeout for the entire exact-length read. The poll timeout only applies to waiting for the pipe to become readable (i.e., data arrives or an EOF condition). Once poll returns (successfully or timed out), the subsequent blocking read has no timeout: if data is available but less than requested, read will block indefinitely waiting for more data (if the write end is still open) or until EOF (if closed). To time out an exact-length read, repeated non-blocking reads or additional polls after partial reads are needed.

Citations:


Poll inside the read loop with timeout enforcement; POLLIN and POLLHUP can occur together while data remains buffered.

After the initial poll returns, subsequent blocking read() calls are no longer bounded by timeout_ms, which can cause indefinite hangs on partial frames. Additionally, treating any POLLHUP as fatal immediately drops valid buffered data—on POSIX pipes, POLLIN and POLLHUP occur together when the writer closes after writing. This violates timeout guarantees and can wedge server threads. Implement polling in a loop with a shrinking deadline, only treating HUP as fatal after draining all readable bytes.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_pipe_io.cpp` around lines 33 - 75, In
read_from_pipe, replace the one-time initial poll with a looped poll+read that
enforces a shrinking deadline so the total wait never exceeds timeout_ms:
compute a deadline (now + timeout_ms), then in the while(remaining>0) loop call
poll with timeout = max(0, deadline - now) and only call read when poll reports
POLLIN (or POLLIN|POLLHUP); do not treat POLLHUP as fatal if POLLIN is also set
— continue reading until read() returns 0 (EOF) or remaining == 0; treat POLLHUP
as fatal only when poll reports POLLHUP without POLLIN or when read() returns 0
after draining buffered data; handle EINTR for poll/read and log
timeout/poll/read errors with context. Ensure all references are applied to the
read_from_pipe function and its local pfd/pr/remaining variables.

Comment on lines +186 to +216
// Chunked download session state (raw arrays from worker)
struct ChunkedDownloadState {
uint32_t problem_category = cuopt::remote::LP;
std::chrono::steady_clock::time_point created;
cuopt::remote::ChunkedResultHeader result_header;
std::map<int32_t, std::vector<uint8_t>> raw_arrays; // ResultFieldId -> raw bytes
};

// Per-array allocation cap for chunked uploads (4 GiB).
static constexpr int64_t kMaxChunkedArrayBytes = 4LL * 1024 * 1024 * 1024;

// Maximum concurrent chunked upload + download sessions (global across all clients).
static constexpr size_t kMaxChunkedSessions = 16;

// Stale session timeout: sessions with no activity for this long are reaped.
static constexpr int kSessionTimeoutSeconds = 300;

struct ChunkedUploadState {
uint32_t problem_category = cuopt::remote::LP;
cuopt::remote::ChunkedProblemHeader header;
struct FieldMeta {
int64_t total_elements = 0;
int64_t element_size = 0;
int64_t received_bytes = 0;
};
std::map<int32_t, FieldMeta> field_meta;
std::vector<cuopt::remote::ArrayChunk> chunks;
int64_t total_chunks = 0;
int64_t total_bytes = 0;
std::chrono::steady_clock::time_point last_activity;
};

⚠️ Potential issue | 🔴 Critical

Bound chunked sessions by total bytes, not just array count.

ChunkedUploadState::chunks and ChunkedDownloadState::raw_arrays retain full payloads in RAM. With a 4 GiB per-array cap and 16 concurrent sessions, one client can pin tens of GiB before a worker consumes the data or before a download finishes. Please enforce per-upload and global byte budgets, and reject or spill once total_bytes crosses them. As per coding guidelines, "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files" and "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems".

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 186 - 216, The
upload/download state currently holds whole payloads in
ChunkedUploadState::chunks and ChunkedDownloadState::raw_arrays allowing a
client to pin many GiB; add and enforce byte budgets: introduce per-session
limit (e.g. kMaxChunkedSessionBytes) and a global in-process limit (e.g.
kGlobalMaxChunkedBytes) and check them whenever you append to
ChunkedUploadState::chunks or increment ChunkedUploadState::total_bytes, and
whenever you insert into ChunkedDownloadState::raw_arrays or set its
result_header sizes; if adding the bytes would exceed the per-session or global
budget, reject the operation with an appropriate error/GRPC status (or spill to
disk) and do not retain the payload, and update any session bookkeeping (e.g.
increment/decrement global allocated bytes on accept/release) so that session
cleanup and ChunkedUploadState::total_bytes reflect true memory usage.

Comment on lines +103 to +137
pid_t spawn_worker(int worker_id, bool is_replacement)
{
std::lock_guard<std::mutex> lock(worker_pipes_mutex);

if (is_replacement) { close_worker_pipes_server(worker_id); }

if (!create_worker_pipes(worker_id)) {
SERVER_LOG_ERROR("[Server] Failed to create pipes for %s%d",
is_replacement ? "replacement worker " : "worker ",
worker_id);
return -1;
}

pid_t pid = fork();
if (pid < 0) {
SERVER_LOG_ERROR("[Server] Failed to fork %s%d",
is_replacement ? "replacement worker " : "worker ",
worker_id);
close_all_worker_pipes(worker_pipes[worker_id]);
return -1;
} else if (pid == 0) {
// Child: close all fds belonging to other workers.
for (int j = 0; j < static_cast<int>(worker_pipes.size()); ++j) {
if (j != worker_id) { close_all_worker_pipes(worker_pipes[j]); }
}
// Close the server-side ends of this worker's pipes (child uses the other ends).
close_and_reset(worker_pipes[worker_id].to_worker_fd);
close_and_reset(worker_pipes[worker_id].from_worker_fd);
close_and_reset(worker_pipes[worker_id].incumbent_from_worker_fd);
worker_process(worker_id);
_exit(0);
}

close_worker_pipes_child_ends(worker_id);
return pid;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

# Find files related to worker spawning and threading
fd -t f "grpc_worker|grpc_server" cpp/src/grpc/server/ | head -20

Repository: NVIDIA/cuopt

Length of output: 346


🏁 Script executed:

# Look for spawn_single_worker usage
rg "spawn_single_worker" --type cpp --type hpp -B 2 -A 5

Repository: NVIDIA/cuopt

Length of output: 83


🏁 Script executed:

# Check worker_monitor_thread definition
rg "worker_monitor_thread" --type cpp --type hpp -B 2 -A 10

Repository: NVIDIA/cuopt

Length of output: 83


🏁 Script executed:

# Check worker_pids usage pattern
rg "worker_pids" --type cpp --type hpp -B 2 -A 3

Repository: NVIDIA/cuopt

Length of output: 83


🏁 Script executed:

# Check pending_chunked_data cleanup/handling
rg "pending_chunked_data" --type cpp --type hpp -B 2 -A 3

Repository: NVIDIA/cuopt

Length of output: 83


🏁 Script executed:

# Examine the grpc_server_main.cpp to understand thread startup
fd -t f "grpc_server_main" cpp/src/grpc/server/ -exec cat {} \;

Repository: NVIDIA/cuopt

Length of output: 202


🏁 Script executed:

# Find spawn_single_worker callers - use correct file types
rg "spawn_single_worker" cpp/src/grpc/server/ -B 2 -A 5

Repository: NVIDIA/cuopt

Length of output: 2209


🏁 Script executed:

# Find worker_monitor_thread
rg "worker_monitor_thread" cpp/src/grpc/server/ -B 2 -A 15

Repository: NVIDIA/cuopt

Length of output: 4272


🏁 Script executed:

# Check worker_pids usage
rg "worker_pids" cpp/src/grpc/server/ -B 2 -A 3

Repository: NVIDIA/cuopt

Length of output: 3840


🏁 Script executed:

# Check pending_chunked_data
rg "pending_chunked_data" cpp/src/grpc/server/ -B 2 -A 3

Repository: NVIDIA/cuopt

Length of output: 3024


🏁 Script executed:

# Read grpc_server_main.cpp
cat cpp/src/grpc/server/grpc_server_main.cpp

Repository: NVIDIA/cuopt

Length of output: 12863


🏁 Script executed:

# Read grpc_server_threads.cpp
cat cpp/src/grpc/server/grpc_server_threads.cpp

Repository: NVIDIA/cuopt

Length of output: 15164


🏁 Script executed:

# Find mark_worker_jobs_failed implementation to check cleanup
rg "mark_worker_jobs_failed" cpp/src/grpc/server/ -A 30

Repository: NVIDIA/cuopt

Length of output: 5919


🏁 Script executed:

# Examine spawn_workers to verify indexing issue with failed spawns
rg "void spawn_workers" cpp/src/grpc/server/ -A 15

Repository: NVIDIA/cuopt

Length of output: 2427


🏁 Script executed:

# Check if worker_pids is pre-sized or if initial spawn failure leaves gaps
rg "worker_pids.*resize|worker_pids.*reserve|worker_pids.*size()" cpp/src/grpc/server/ -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 419


🏁 Script executed:

# Look at full mark_worker_jobs_failed to confirm pending_chunked_data handling
rg "mark_worker_jobs_failed" cpp/src/grpc/server/grpc_worker_infra.cpp -A 60

Repository: NVIDIA/cuopt

Length of output: 2714


🏁 Script executed:

# Check if is_chunked job can leave pending_chunked_data entries orphaned
rg "is_chunked" cpp/src/grpc/server/ -B 2 -A 2

Repository: NVIDIA/cuopt

Length of output: 3860


Forking from the live multithreaded server creates unsafe process state.

spawn_single_worker() is called from worker_monitor_thread() after the gRPC server and background threads (result_retrieval_thread, incumbent_retrieval_thread, session_reaper_thread, plus gRPC service threads) are already running. Line 115 in spawn_worker() calls fork() while the parent process is multithreaded, which violates POSIX fork safety: the child inherits all mutexes, allocator state, and other thread-local resources in an inconsistent state. Although the child calls _exit(0) immediately to avoid destructor issues, the fork itself risks undefined behavior if any thread other than the monitor thread held worker_pipes_mutex or was inside malloc/free during the fork.

Replace the fork-based respawning with posix_spawn (which is fork-safe and does not inherit multithreaded state), or move worker supervision into a separate single-threaded parent process.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 103 - 137,
spawn_worker currently calls fork() from a multithreaded process (called by
worker_monitor_thread) which is unsafe; replace the fork-based respawn with
posix_spawn to avoid inheriting thread/allocator state. Update spawn_worker to
construct posix_spawn_file_actions that (1) dup2 the worker-side pipe fds into
the expected child descriptors and close the server-side ends, (2) close all
other worker pipe fds via file actions, and (3) set up any required env/args to
invoke the same worker entry (the code path currently run by worker_process).
Use posix_spawnp/posix_spawn to launch the worker binary and return the child
pid, and remove the fork/child branch and direct calls to worker_process; keep
logic that closes child ends in the parent (close_worker_pipes_child_ends) and
preserve error logging around pipe creation and spawn failures while continuing
to honor is_replacement and the worker_pipes_mutex.

Comment on lines +154 to +180
static void store_simple_result(const std::string& job_id,
int worker_id,
ResultStatus status,
const char* error_message)
{
for (size_t i = 0; i < MAX_RESULTS; ++i) {
if (result_queue[i].ready.load(std::memory_order_acquire)) continue;
bool expected = false;
if (!result_queue[i].claimed.compare_exchange_strong(
expected, true, std::memory_order_acq_rel)) {
continue;
}
if (result_queue[i].ready.load(std::memory_order_acquire)) {
result_queue[i].claimed.store(false, std::memory_order_release);
continue;
}
copy_cstr(result_queue[i].job_id, job_id);
result_queue[i].status = status;
result_queue[i].data_size = 0;
result_queue[i].worker_index.store(worker_id, std::memory_order_relaxed);
copy_cstr(result_queue[i].error_message, error_message);
result_queue[i].error_message[sizeof(result_queue[i].error_message) - 1] = '\0';
result_queue[i].retrieved.store(false, std::memory_order_relaxed);
result_queue[i].ready.store(true, std::memory_order_release);
result_queue[i].claimed.store(false, std::memory_order_release);
break;
}

⚠️ Potential issue | 🔴 Critical

Don't drop completion metadata when result_queue is full.

Both store_simple_result() and publish_result() just fall out of the loop when every slot is busy. The job slot is still released afterward, so the server loses the only completion record and the client can wait forever for a result that was already produced. This needs retry/backoff or an explicit failure path before reset_job_slot() runs.

Also applies to: 408-441

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 154 - 180,
store_simple_result (and similarly publish_result) currently drops completion
metadata if all result_queue slots are busy; detect the “queue full” condition
after the loop and do not simply return — instead implement a retry loop with
short backoff (or a bounded number of attempts) that retries claiming a
result_queue slot, and if still unsuccessful before reset_job_slot() is called,
write an explicit failure record into the job slot (e.g., set status to a
special ERROR/QUEUE_FULL and copy an explanatory error_message) so the client is
notified; update store_simple_result, publish_result, and call sites around
reset_job_slot to either wait for the retry to finish or to ensure the explicit
failure record is stored before releasing the job slot.

Comment on lines +206 to +270
static DeserializedJob read_problem_from_pipe(int worker_id, const JobQueueEntry& job)
{
DeserializedJob dj;

int read_fd = worker_pipes[worker_id].worker_read_fd;
bool is_chunked_job = job.is_chunked.load();

auto pipe_recv_t0 = std::chrono::steady_clock::now();

if (is_chunked_job) {
// Chunked path: the server wrote a ChunkedProblemHeader followed by
// a set of raw typed arrays (constraint matrix, bounds, etc.).
// This avoids a single giant protobuf allocation for large problems.
cuopt::remote::ChunkedProblemHeader chunked_header;
std::map<int32_t, std::vector<uint8_t>> arrays;
if (!read_chunked_request_from_pipe(read_fd, chunked_header, arrays)) { return dj; }

if (config.verbose) {
int64_t total_bytes = 0;
for (const auto& [fid, data] : arrays) {
total_bytes += data.size();
}
log_pipe_throughput("pipe_job_recv", total_bytes, pipe_recv_t0);
SERVER_LOG_INFO(
"[Worker] IPC path: CHUNKED (%zu arrays, %ld bytes)", arrays.size(), total_bytes);
}
if (chunked_header.has_lp_settings()) {
map_proto_to_pdlp_settings(chunked_header.lp_settings(), dj.lp_settings);
}
if (chunked_header.has_mip_settings()) {
map_proto_to_mip_settings(chunked_header.mip_settings(), dj.mip_settings);
}
dj.enable_incumbents = chunked_header.enable_incumbents();
map_chunked_arrays_to_problem(chunked_header, arrays, dj.problem);
} else {
// Unary path: the entire SubmitJobRequest was serialized as a single
// protobuf blob. Simpler but copies more memory for large problems.
std::vector<uint8_t> request_data;
if (!recv_job_data_pipe(read_fd, job.data_size, request_data)) { return dj; }

if (config.verbose) {
log_pipe_throughput("pipe_job_recv", static_cast<int64_t>(request_data.size()), pipe_recv_t0);
}
cuopt::remote::SubmitJobRequest submit_request;
if (!submit_request.ParseFromArray(request_data.data(),
static_cast<int>(request_data.size())) ||
(!submit_request.has_lp_request() && !submit_request.has_mip_request())) {
return dj;
}
if (submit_request.has_lp_request()) {
const auto& req = submit_request.lp_request();
SERVER_LOG_INFO("[Worker] IPC path: UNARY LP (%zu bytes)", request_data.size());
map_proto_to_problem(req.problem(), dj.problem);
map_proto_to_pdlp_settings(req.settings(), dj.lp_settings);
} else {
const auto& req = submit_request.mip_request();
SERVER_LOG_INFO("[Worker] IPC path: UNARY MIP (%zu bytes)", request_data.size());
map_proto_to_problem(req.problem(), dj.problem);
map_proto_to_mip_settings(req.settings(), dj.mip_settings);
dj.enable_incumbents = req.has_enable_incumbents() ? req.enable_incumbents() : true;
}
}

dj.success = true;
return dj;

⚠️ Potential issue | 🔴 Critical

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
fd 'grpc_*mapper*.cpp' cpp/src | xargs rg -n "map_chunked_arrays_to_problem|map_proto_to_problem|map_proto_to_pdlp_settings|map_proto_to_mip_settings|throw |cuopt_expects"

Repository: NVIDIA/cuopt

Length of output: 44417


🏁 Script executed:

# Get context around the read_problem_from_pipe function and its call sites
sed -n '190,280p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 4612


🏁 Script executed:

# Also check lines 514-520 as mentioned in the review
sed -n '500,535p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 1813


🏁 Script executed:

# Search for try-catch blocks around the read_problem_from_pipe call
rg -B 10 -A 10 'read_problem_from_pipe' cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 1589


🏁 Script executed:

# Check the mapper implementations to see what exceptions they can throw
sed -n '144,250p' cpp/src/grpc/grpc_problem_mapper.cpp | head -100

Repository: NVIDIA/cuopt

Length of output: 5200


🏁 Script executed:

# Look at the complete grpc_worker.cpp worker loop to see full exception handling context
sed -n '470,540p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 3133


🏁 Script executed:

# Check if there's a try-catch wrapping the entire worker loop
rg -B 5 -A 15 'while.*job_queue.*ready' cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 38


🏁 Script executed:

# Check what happens around lines 514-520 in the actual file
wc -l cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 95


🏁 Script executed:

# Read around line 514-520 to see the second location mentioned
sed -n '500,530p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 1536


🏁 Script executed:

# Get a broader view of the worker_process function to see all exception handling
grep -n "try\|catch" cpp/src/grpc/server/grpc_worker.cpp

Repository: NVIDIA/cuopt

Length of output: 630


🏁 Script executed:

# Look at the try-catch blocks to see what they protect
sed -n '283,335p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 3265


🏁 Script executed:

# Check the second try-catch block
sed -n '346,395p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 2911


🏁 Script executed:

# Examine the complete flow from read_problem_from_pipe call to understand control flow
sed -n '490,550p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Repository: NVIDIA/cuopt

Length of output: 2696


Add exception handling around mapper calls in read_problem_from_pipe() to prevent unhandled exceptions from crashing the worker.

The mapping helper functions (map_proto_to_problem(), map_chunked_arrays_to_problem(), map_proto_to_pdlp_settings(), map_proto_to_mip_settings()) throw exceptions on invalid input (e.g., std::runtime_error for unknown variable types, std::invalid_argument for invalid settings). These exceptions are not caught in read_problem_from_pipe(), allowing a malformed request to crash the worker process before store_simple_result() and reset_job_slot() execute, leaving the job slot in a claimed state and preventing proper error reporting.

Wrap the mapper calls in a try-catch block (lines 233–239 for chunked path, lines 258–264 for unary path) and return success=false on exception, similar to how I/O errors are handled. This ensures all code paths properly signal failure and allow cleanup to proceed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 206 - 270,
read_problem_from_pipe currently calls map_proto_to_problem,
map_chunked_arrays_to_problem, map_proto_to_pdlp_settings and
map_proto_to_mip_settings without protection; wrap the mapping logic in both the
chunked and unary paths in a try-catch that catches std::exception (or (...) if
preferred), log the exception (e.g., SERVER_LOG_ERROR with the what() message),
set dj.success = false and return dj so the caller can clean up; ensure the try
covers the calls that populate dj (including enable_incumbents assignments) so
any thrown exception results in the same failure return path.

Comment on lines +421 to +430
result_slot = static_cast<int>(i);
ResultQueueEntry& result = result_queue[i];
copy_cstr(result.job_id, job_id);
result.status = sr.success ? RESULT_SUCCESS : RESULT_ERROR;
result.data_size = sr.success ? std::max<uint64_t>(result_total_bytes, 1) : 0;
result.worker_index.store(worker_id, std::memory_order_relaxed);
if (!sr.success) { copy_cstr(result.error_message, sr.error_message); }
result.retrieved.store(false, std::memory_order_relaxed);
result.ready.store(true, std::memory_order_release);
result.claimed.store(false, std::memory_order_release);

⚠️ Potential issue | 🔴 Critical

Publish ready=true only after the pipe payload is committed.

The shared-memory entry is marked RESULT_SUCCESS / ready before write_result_to_pipe() finishes. A reader can observe a ready result and start consuming the pipe, then block indefinitely if the later pipe write fails or only partially succeeds. Move the publish step after a successful pipe write, or add a distinct "payload ready" state.

As per coding guidelines, "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."

Also applies to: 443-464

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 421 - 430, The code marks
result.ready and RESULT_SUCCESS before the pipe payload is committed; fix by
deferring publication until write_result_to_pipe() completes successfully: call
write_result_to_pipe(...) first, check its return, then set
result.data_size/result.status (sr.success ? RESULT_SUCCESS : RESULT_ERROR) and
copy sr.error_message as needed, and only after a successful pipe write store
result.ready.store(true, std::memory_order_release); on write failure set
RESULT_ERROR and a proper error_message and publish ready only when the payload
is guaranteed committed (or add a separate payload_ready flag and publish that
after write_result_to_pipe()); update any related stores (result.worker_index,
result.retrieved, result.claimed) to maintain memory-ordering consistent with
publishing ready.


Labels

- `do not merge`: Do not merge if this flag is set
- `feature request`: New feature or request
- `non-breaking`: Introduces a non-breaking change
