[Do not merge] Grpc server v4 conda forge deps #995
tmckayus wants to merge 20 commits into NVIDIA:release/26.04 from
Conversation
cuOpt uses an embedded gRPC client to solve problems on a remote server
* switch to use of argparse
* make switch statements explicit (no default)
* raise errors with cuopt_expects
* use a second standard logger for server main and workers so that logging is uniform but server metadata is not interleaved with solver output
The same parameters will be removed from solve_lp in another change.
The previous workaround of building from source is no longer necessary since a bug in libabseil has been fixed.
📝 Walkthrough
This pull request adds comprehensive gRPC-based remote execution support to cuOpt, enabling distributed optimization workloads across a client-server architecture. It includes complete protocol definitions, a full-featured gRPC client library, a multi-process server implementation with GPU worker management, environment-variable-driven client configuration, chunked array transfer support, TLS/mTLS security, and extensive test coverage.
Estimated code review effort: 🎯 5 (Critical) | ⏱️ ~120 minutes
Actionable comments posted: 11
Note: Due to the large number of review comments, Critical severity comments were prioritized as inline comments.
🟠 Major comments (16)
cpp/include/cuopt/linear_programming/solve_remote.hpp-35-37 (1)
35-37: ⚠️ Potential issue | 🟠 Major
Don't silently ignore non-default LP solve flags on the remote path.
The interface overload still accepts `problem_checking` and `use_pdlp_solver_mode`, but removing them here means remote execution falls back to built-in behavior while local execution still honors caller input. Either carry those knobs in the remote request or fail fast when non-default values are passed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/include/cuopt/linear_programming/solve_remote.hpp` around lines 35 - 37, The remote LP solver overload solve_lp_remote currently ignores caller-controlled flags in pdlp_solver_settings_t (notably problem_checking and use_pdlp_solver_mode); update the function to either accept and propagate those flags in the remote request or validate settings and error when they are non-default: modify the signature of solve_lp_remote(cpu_optimization_problem_t<i_t,f_t> const& cpu_problem, pdlp_solver_settings_t<i_t,f_t> const& settings) to include the relevant settings on the remote payload and ensure the remote handler uses problem_checking and use_pdlp_solver_mode, or add an early check inside solve_lp_remote that throws or returns an error if settings.problem_checking or settings.use_pdlp_solver_mode differ from their defaults so behavior is consistent with local solves.
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-428-477 (1)
428-477: ⚠️ Potential issue | 🟠 Major
Avoid a fixed port plus raw TCP connect for readiness.
If `port_` is already occupied, the child can exit with `EADDRINUSE` while `tcp_connect_check()` still succeeds against the stale listener. The suite then points `CUOPT_REMOTE_*` at the wrong process and becomes flaky. Pick an ephemeral free port, or at least verify `server_pid_` is still alive and answering a gRPC-specific probe before exporting the env vars.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 428 - 477, The current startup uses a fixed/derived port_ and then only tcp_connect_check(port_, ...) which can hit a stale listener; change to pick an ephemeral free port (e.g., create a temporary TCP socket, bind to port 0 to discover the assigned port, read it into port_, close the socket) before forking and launching the server (references: port_, server_pid_, server_path_, execl), then after fork/exec use a gRPC-specific readiness probe (or at minimum verify server_pid_ is still alive via waitpid/WNOHANG) instead of relying solely on tcp_connect_check(); only call setenv("CUOPT_REMOTE_HOST", ...) and setenv("CUOPT_REMOTE_PORT", ...) after the ephemeral port is confirmed and the child process (server_pid_) is verified responsive.
cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp-420-478 (1)
420-478: ⚠️ Potential issue | 🟠 Major
Snapshot the original env before any early return.
`SetUpTestSuite()` returns on "binary not found", `fork()` failure, and startup timeout before `orig_*`/`*_was_set_` are initialized. `TearDownTestSuite()` then falls into the `unsetenv()` branches and clobbers any pre-existing `CUDA_VISIBLE_DEVICES`, `CUOPT_REMOTE_HOST`, and `CUOPT_REMOTE_PORT` values when this suite is skipped.
Minimal fix
```diff
 static void SetUpTestSuite() {
+  const char* cv = getenv("CUDA_VISIBLE_DEVICES");
+  const char* rh = getenv("CUOPT_REMOTE_HOST");
+  const char* rp = getenv("CUOPT_REMOTE_PORT");
+  orig_cuda_visible_ = cv ? cv : "";
+  orig_remote_host_  = rh ? rh : "";
+  orig_remote_port_  = rp ? rp : "";
+  cuda_was_set_ = (cv != nullptr);
+  host_was_set_ = (rh != nullptr);
+  port_was_set_ = (rp != nullptr);
+
   server_path_ = find_server_binary();
   if (server_path_.empty()) {
     skip_reason_ = "cuopt_grpc_server binary not found";
     return;
   }
@@
-  const char* cv = getenv("CUDA_VISIBLE_DEVICES");
-  const char* rh = getenv("CUOPT_REMOTE_HOST");
-  const char* rp = getenv("CUOPT_REMOTE_PORT");
-  orig_cuda_visible_ = cv ? cv : "";
-  orig_remote_host_  = rh ? rh : "";
-  orig_remote_port_  = rp ? rp : "";
-  cuda_was_set_ = (cv != nullptr);
-  host_was_set_ = (rh != nullptr);
-  port_was_set_ = (rp != nullptr);
-
   setenv("CUDA_VISIBLE_DEVICES", "", 1);
```
As per coding guidelines, "Ensure test isolation: prevent GPU state, cached memory, and global variables from leaking between test cases; verify each test independently initializes its environment."
Also applies to: 480-496
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp` around lines 420 - 478, SetUpTestSuite currently returns early (when server binary missing, fork fails, or startup times out) before orig_cuda_visible_, orig_remote_host_, orig_remote_port_ and the boolean flags cuda_was_set_, host_was_set_, port_was_set_ are initialized, causing TearDownTestSuite to clobber pre-existing env vars; fix by snapshotting the original environment at the very start of SetUpTestSuite (read getenv("CUDA_VISIBLE_DEVICES"), getenv("CUOPT_REMOTE_HOST"), getenv("CUOPT_REMOTE_PORT") into orig_* and set the *_was_set_ flags accordingly) so these variables are always initialized even if the function returns early, then proceed with the existing logic that may override env and start the server.
cpp/CMakeLists.txt-711-770 (1)
711-770: ⚠️ Potential issue | 🟠 Major
Gate `cuopt_grpc_server` in LP-only builds or compile out the MIP path.
`grpc_worker.cpp` unconditionally calls `solve_mip()` with `mip_solver_settings_t`, but the target is added without conditional gating. In LP-only builds, the MIP solver components are excluded while this target remains enabled, causing linker errors. Either wrap the target in an `if(NOT BUILD_LP_ONLY)` block or add conditional compilation guards in `grpc_worker.cpp` to skip MIP paths when LP-only mode is active.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/CMakeLists.txt` around lines 711 - 770, The cuopt_grpc_server target exposes MIP-dependent code (grpc_worker.cpp calls solve_mip() with mip_solver_settings_t) but is built even in LP-only configurations; either wrap the entire target in CMake with an if(NOT BUILD_LP_ONLY) ... endif() so cuopt_grpc_server is not added/linked when BUILD_LP_ONLY is true, or add compile-time guards inside src/grpc/server/grpc_worker.cpp around calls to solve_mip() and any use of mip_solver_settings_t (e.g. `#ifdef` BUILD_LP_ONLY / `#ifndef` BUILD_LP_ONLY or equivalent CMake-generated define) so the MIP path is excluded in LP-only builds and the target can link without MIP symbols.
cpp/src/grpc/server/grpc_server_main.cpp-195-204 (1)
195-204: ⚠️ Potential issue | 🟠 Major
These fixed SHM names make multiple servers interfere with each other.
Each instance unlinks and recreates the same `SHM_JOB_QUEUE`, `SHM_RESULT_QUEUE`, and `SHM_CONTROL` objects. Starting a second server on a different port will target the same shared-memory segments, so the two instances can tear down or overwrite each other's queues. Namespace the SHM names per server instance (for example by port/pid) and pass that namespace to workers.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 195 - 204, The hardcoded shared-memory names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL cause cross-instance interference; modify the server to generate a unique namespace (e.g., include the server port and/or pid) and use that to build per-instance names before calling shm_unlink and create_shared_memory. Update the code paths that reference SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL (the allocations assigning job_queue, result_queue, and shm_ctrl via create_shared_memory) to use the namespaced strings, and ensure the same generated namespace is passed to worker processes so they open the matching segments instead of the global static names.
cpp/src/grpc/client/grpc_client.cpp-863-869 (1)
863-869: ⚠️ Potential issue | 🟠 Major
Chunked upload still buffers the whole payload before sending.
`build_array_chunk_requests()` materializes a full `std::vector<SendArrayChunkRequest>` first, so the large-problem path keeps an extra copy of every chunk in memory. That defeats the main reason to chunk at all and can exhaust client RAM on the models most likely to use this path. Generate/send chunks incrementally instead of prebuilding them all.
As per coding guidelines, "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems."
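A minimal sketch of the incremental pattern the comment suggests, assuming hypothetical names (`ChunkRequest` stands in for `SendArrayChunkRequest`; the real client API may differ): a send callback consumes each chunk as it is built, so at most one chunk is resident at a time.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <string>
#include <vector>

// Hypothetical stand-in for SendArrayChunkRequest: one chunk's payload.
struct ChunkRequest {
  int64_t offset;
  std::string data;
};

// Instead of returning std::vector<ChunkRequest>, invoke `send` for each
// chunk as it is built, so only one chunk is materialized at a time.
// Returns the number of chunks sent, or -1 if `send` reports failure.
inline int64_t for_each_array_chunk(const uint8_t* bytes,
                                    size_t total_bytes,
                                    size_t chunk_bytes,
                                    const std::function<bool(const ChunkRequest&)>& send)
{
  int64_t chunks_sent = 0;
  for (size_t off = 0; off < total_bytes; off += chunk_bytes) {
    size_t len = std::min(chunk_bytes, total_bytes - off);
    ChunkRequest req;
    req.offset = static_cast<int64_t>(off);
    req.data.assign(reinterpret_cast<const char*>(bytes) + off, len);
    if (!send(req)) return -1;  // abort on transport failure
    ++chunks_sent;
  }
  return chunks_sent;
}
```

A transport failure (the callback returning false) aborts the loop immediately, so no further chunks are materialized.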
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cpp` around lines 863 - 869, The code currently calls build_array_chunk_requests(...) which returns a fully materialized std::vector<SendArrayChunkRequest> (assigned to chunk_requests), causing an extra full-copy of all chunks in memory; change this to produce/send chunks incrementally by replacing build_array_chunk_requests with a streaming API (e.g., an iterator/generator or a producer callback) so the caller can loop and send each SendArrayChunkRequest as it is created instead of storing them all; update the call site that assigns chunk_requests to instead iterate over the generator or invoke the callback for each chunk (or change build_array_chunk_requests to accept a visitor std::function<void(const SendArrayChunkRequest&)> or return a lazy iterator) and ensure existing code that references chunk_requests is adapted to send each chunk immediately to avoid buffering the whole payload.
cpp/src/grpc/server/grpc_server_threads.cpp-147-153 (1)
147-153: ⚠️ Potential issue | 🟠 Major
Don't overload `cancelled` for transport failures.
Line 152 turns a pipe send failure into `cancelled=true`, but the rest of the server treats that bit as a user-initiated cancel. `mark_worker_jobs_failed()` will later report `RESULT_CANCELLED`/"Job was cancelled" for an internal I/O error, and if the worker stays alive this slot never gets failed or recycled here. Fail the job explicitly instead of reusing the cancellation flag.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_threads.cpp` around lines 147 - 153, The code currently sets job_queue[i].cancelled = true on a transport send failure, which conflates I/O errors with user cancellations; instead, on send failure leave cancelled untouched and mark the job explicitly failed (e.g., set a job_queue[i].failed flag or job_queue[i].result = RESULT_IO_ERROR / an internal I/O error code and any job_queue[i].status = JOB_FAILED if present), and ensure any downstream cleanup uses that failure indicator (or call the existing failure-path helper such as mark_worker_jobs_failed()/a similar job-fail helper for this specific job) so the job is reported as an internal I/O failure rather than RESULT_CANCELLED.
cpp/src/grpc/server/grpc_worker_infra.cpp-183-187 (1)
183-187: ⚠️ Potential issue | 🟠 Major
Crash cleanup should drop chunked payloads too.
This only erases `pending_job_data`. Chunked submissions are buffered in `pending_chunked_data`, so a worker crash before dispatch leaks the full chunk vector for that job. Clear the chunked map here as well.
🐛 Suggested fix
```diff
 {
   std::lock_guard<std::mutex> lock(pending_data_mutex);
   pending_job_data.erase(job_id);
+  pending_chunked_data.erase(job_id);
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 183 - 187, The cleanup currently only erases pending_job_data and leaks chunked submissions; update the same crash cleanup block to also remove the job's entry from pending_chunked_data under the same lock (pending_data_mutex) so both maps are cleared for job_id; ensure you call pending_chunked_data.erase(job_id) alongside pending_job_data.erase(job_id) within the scope guarded by std::lock_guard<std::mutex> on pending_data_mutex.
cpp/src/grpc/server/grpc_worker_infra.cpp-140-146 (1)
140-146: ⚠️ Potential issue | 🟠 Major
Keep `worker_pids` aligned with the logical worker id.
`worker_monitor_thread()` treats `worker_pids[i]` as worker `i`, but this loop only appends successful spawns. If worker 1 fails and worker 2 succeeds, PID 2 lands at index 1 and later restarts/pipe lookups target the wrong slot. Pre-size the vector and assign by `worker_id`.
🐛 Suggested fix
```diff
 void spawn_workers()
 {
+  worker_pids.assign(config.num_workers, 0);
   for (int i = 0; i < config.num_workers; ++i) {
     pid_t pid = spawn_worker(i, false);
-    if (pid < 0) { continue; }
-    worker_pids.push_back(pid);
+    if (pid > 0) { worker_pids[i] = pid; }
   }
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 140 - 146, spawn_workers currently appends only successful spawns so worker_pids indices no longer match logical worker ids; pre-size worker_pids to config.num_workers (fill with a sentinel like -1) and then assign by index: call spawn_worker(i, false) and set worker_pids[i] = pid (leave sentinel on failure) so worker_monitor_thread() can safely treat worker_pids[i] as worker i; update any logic that expects empty vector length accordingly.
python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py-33-35 (1)
33-35: ⚠️ Potential issue | 🟠 Major
The new dataset default is not propagated into the subprocess entrypoints.
The module-level default moved to `./datasets`, but the child helpers later in this file still fall back to `"./"` when `RAPIDS_DATASET_ROOT_DIR` is unset. In that case the parent test resolves datasets from the new location while the subprocess still looks under the repo root and fails to open the MPS files. Apply the same default to each `_impl_*_cpu_only()` helper.
🐛 Suggested fix
```diff
-    dataset_root = os.environ.get("RAPIDS_DATASET_ROOT_DIR", "./")
+    dataset_root = os.environ.get("RAPIDS_DATASET_ROOT_DIR", "./datasets")
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` around lines 33 - 35, The subprocess helper functions (the _impl_*_cpu_only() helpers) still fall back to "./" for RAPIDS_DATASET_ROOT_DIR causing subprocesses to look in the repo root; update each helper to use the same module-level default by reading os.environ.get("RAPIDS_DATASET_ROOT_DIR", RAPIDS_DATASET_ROOT_DIR) or otherwise defaulting to the RAPIDS_DATASET_ROOT_DIR constant defined at module top, so subprocess entrypoints resolve datasets from "./datasets" when the env var is unset.
cpp/src/grpc/client/grpc_client.cpp-1094-1108 (1)
1094-1108: ⚠️ Potential issue | 🟠 Major
Reject short or mismatched chunk replies before `memcpy`.
`elem_offset` advances by the requested chunk size, not by `elements_in_chunk()`. If the server returns a short chunk, this loop leaves a gap in `array_bytes` and still reports success. This block also never checks that the reply's `download_id`, `field_id`, and `element_offset` match the request. Treat any short/mismatched reply as a protocol error (or continue from the returned offset) before copying bytes.
🐛 Suggested guard
```diff
 cuopt::remote::GetResultChunkResponse chunk_resp;
 auto status = impl_->stub->GetResultChunk(&chunk_ctx, chunk_req, &chunk_resp);
 if (!status.ok()) {
   last_error_ = "GetResultChunk failed: " + status.error_message();
   return false;
 }
+
+if (chunk_resp.download_id() != download_id || chunk_resp.field_id() != field_id ||
+    chunk_resp.element_offset() != elem_offset) {
+  last_error_ = "GetResultChunk: response metadata mismatch";
+  return false;
+}

 int64_t elems_received = chunk_resp.elements_in_chunk();
 const auto& data = chunk_resp.data();
-if (elems_received < 0 || elems_received > elems_wanted ||
+if (elems_received != elems_wanted ||
     elems_received > total_elems - elem_offset) {
   last_error_ = "GetResultChunk: invalid element count";
   return false;
 }
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/client/grpc_client.cpp` around lines 1094 - 1108, The current block copies chunk_resp.data() into array_bytes without verifying the reply matches the request and without handling short chunks correctly; before the memcpy in the GetResultChunk handling code, validate that chunk_resp.download_id(), chunk_resp.field_id(), and chunk_resp.element_offset() match the request's expected download_id/field_id and elem_offset (or treat/accept the returned element_offset consistently), and ensure elems_received is the value used to advance elem_offset (or reject if elems_received != elems_wanted) — if any mismatch or a short chunk is detected, set last_error_ to a descriptive protocol-error message and return false instead of copying; update total_bytes_received and elem_offset only after successful validation.
cpp/tests/linear_programming/grpc/grpc_integration_test.cpp-312-320 (1)
312-320: ⚠️ Potential issue | 🟠 Major
Add SAN extension to the generated server certificate.
The C++ fixture generates a CN-only `localhost` certificate, while `python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py` includes an explicit SAN extension `subjectAltName=DNS:localhost,IP:127.0.0.1`. Modern OpenSSL respects SAN over CN; CN-only certs rely on fallback behavior that varies across versions and configurations. Align the C++ test by adding the same SAN extension via an `-extfile` parameter to the `openssl x509` call.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 312 - 320, The server certificate generation in the test creates a CN-only cert (server_key, server_csr, server_crt) which lacks a subjectAltName and can fail hostname validation; update the openssl x509 invocation in the code that calls run(...) (the block creating server_crt using ca_crt and ca_key) to include an -extfile that defines subjectAltName=DNS:localhost,IP:127.0.0.1 (create a temporary extfile or inline one before the run call) so the produced certificate includes the SAN extension matching the Python test.
cpp/src/grpc/cuopt_remote_service.proto-314-331 (1)
314-331: ⚠️ Potential issue | 🟠 Major
Make incumbent pagination bounded by default.
Each `Incumbent` carries the full assignment vector, so `max_count <= 0 => no limit` can easily exceed `max_message_bytes` or force the server to build a very large response in memory. This is also the public client default today, so callers will hit it unless they remember to paginate manually. Require a positive page size here, and clamp or reject oversized pages server-side.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote_service.proto` around lines 314 - 331, The Incumbent pagination currently allows max_count <= 0 to mean "no limit", which can produce huge responses; change the handling so IncumbentRequest.max_count must be positive and server-side code enforces a safe maximum: update the service logic that processes IncumbentRequest to (a) treat non-positive max_count as invalid (return an error) or replace it with a configurable DEFAULT_PAGE_SIZE, and (b) clamp any requested max_count to a SERVER_MAX_PAGE_SIZE before building the IncumbentResponse.incumbents list; ensure these checks live where requests are decoded/handled (the RPC handler that reads IncumbentRequest) and return a clear error or truncated page when the client asks for an oversized page.
cpp/src/grpc/server/grpc_server_types.hpp-257-259 (1)
257-259: ⚠️ Potential issue | 🟠 Major
Namespace the shared-memory object names per server instance.
These fixed names prevent two servers on the same host from coexisting cleanly: the second instance will either attach to the first instance's queues or remove them during cleanup. Include something instance-specific here (for example port/PID/UUID) and keep it in server state for symmetric teardown.
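A sketch of the suggested namespacing, under the assumption that port plus PID is unique enough per host (`make_shm_names` is illustrative, not the actual server code):

```cpp
#include <cassert>
#include <string>

// Per-instance shared-memory names derived from port and PID. The server
// would keep this struct in its state so teardown unlinks exactly the
// names it created, and pass the same suffix to spawned workers so they
// attach to the matching segments.
struct ShmNames {
  std::string job_queue;
  std::string result_queue;
  std::string control;
};

inline ShmNames make_shm_names(int port, int pid)
{
  const std::string suffix = "_" + std::to_string(port) + "_" + std::to_string(pid);
  return ShmNames{"/cuopt_job_queue" + suffix,
                  "/cuopt_result_queue" + suffix,
                  "/cuopt_control" + suffix};
}
```

Two servers on the same host then get disjoint POSIX SHM names, so neither `shm_unlink` nor recreation in one instance can touch the other's queues.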
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 257 - 259, The shared-memory names SHM_JOB_QUEUE, SHM_RESULT_QUEUE and SHM_CONTROL are hard-coded and must be namespaced per server instance; change them to be constructed with an instance-specific identifier (e.g., port, PID or generated UUID) and store that identifier on the server object so teardown uses the exact same names. Update the places that reference SHM_JOB_QUEUE/SHM_RESULT_QUEUE/SHM_CONTROL to build names using the stored instance id (for example suffix or prefix) during creation, attach and cleanup so two servers on the same host do not collide.
cpp/src/grpc/server/grpc_server_types.hpp-177-184 (1)
177-184: ⚠️ Potential issue | 🟠 Major
Initialize every pipe FD to `-1`.
If `worker_pipes` is resized/default-constructed before every field is assigned, these members are indeterminate. Any cleanup or partial-failure path that conditionally closes them can hit unrelated descriptors.
Suggested fix
```diff
 struct WorkerPipes {
-  int to_worker_fd;
-  int from_worker_fd;
-  int worker_read_fd;
-  int worker_write_fd;
-  int incumbent_from_worker_fd;
-  int worker_incumbent_write_fd;
+  int to_worker_fd              = -1;
+  int from_worker_fd            = -1;
+  int worker_read_fd            = -1;
+  int worker_write_fd           = -1;
+  int incumbent_from_worker_fd  = -1;
+  int worker_incumbent_write_fd = -1;
 };
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 177 - 184, The struct WorkerPipes currently leaves its file-descriptor members uninitialized; add explicit initialization of every FD to -1 to avoid closing unrelated descriptors during cleanup—either add in-class member initializers for to_worker_fd, from_worker_fd, worker_read_fd, worker_write_fd, incumbent_from_worker_fd, worker_incumbent_write_fd = -1 or provide a default constructor on WorkerPipes that sets each of those members to -1; update any construction sites if necessary to rely on the default-initialized values.
cpp/src/grpc/cuopt_remote.proto-251-271 (1)
251-271: 🛠️ Refactor suggestion | 🟠 Major
Add `*_UNSPECIFIED = 0` to both `ArrayFieldId` and `ResultFieldId` enums, then shift all real field IDs up by 1.
In proto3, an omitted enum field deserializes to `0`. With the current numbering, a missing or malformed `field_id` is indistinguishable from `FIELD_A_VALUES` and `RESULT_PRIMAL_SOLUTION` respectively, allowing malformed chunk requests to be routed to real arrays instead of being rejected. This must be fixed before the wire format ships.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/cuopt_remote.proto` around lines 251 - 271, Add a zero-valued sentinel to both enums and renumber existing members: introduce ARRAY_FIELD_ID_UNSPECIFIED = 0 in the ArrayFieldId enum and RESULT_FIELD_ID_UNSPECIFIED = 0 in the ResultFieldId enum, then increment every existing enum value by 1 (e.g., FIELD_A_VALUES becomes 1, FIELD_A_INDICES becomes 2, etc., and RESULT_PRIMAL_SOLUTION shifts from 0 to 1). Update any code that constructs, parses, or switches on ArrayFieldId or ResultFieldId (search for uses of ArrayFieldId and ResultFieldId, and for literal enum values like FIELD_A_VALUES or RESULT_PRIMAL_SOLUTION) to use the new numeric values or compare against the new _UNSPECIFIED sentinel to reject missing/zero field_id cases. Ensure generated code and any RPC/serialization tests are regenerated/updated to reflect the shifted enum values.
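The shape of the fix, sketched with the two field names quoted above (the remaining enum members, elided here, would shift the same way):

```proto
enum ArrayFieldId {
  // 0 is reserved as an explicit "missing" sentinel: proto3 decodes an
  // omitted enum field as 0, so requests without a field_id are rejected
  // instead of being routed to a real array.
  ARRAY_FIELD_ID_UNSPECIFIED = 0;
  FIELD_A_VALUES = 1;
  // ... remaining fields shifted up by 1
}

enum ResultFieldId {
  RESULT_FIELD_ID_UNSPECIFIED = 0;
  RESULT_PRIMAL_SOLUTION = 1;
  // ... remaining fields shifted up by 1
}
```

Handlers can then reject any request whose decoded field id equals the `*_UNSPECIFIED` sentinel.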
🟡 Minor comments (3)
cpp/src/grpc/server/grpc_incumbent_proto.hpp-38-45 (1)
38-45: ⚠️ Potential issue | 🟡 Minor
Add `INT_MAX` guard before protobuf parse to prevent size truncation.
`ParseFromArray()` takes an `int` length parameter. Without a bounds check, an oversized `size_t` input truncates silently instead of rejecting malformed IPC data cleanly. Add the guard:
```cpp
if (size > static_cast<size_t>(std::numeric_limits<int>::max())) { return false; }
```
This mirrors the `INT_MAX` check in `build_incumbent_proto()` and prevents undefined behavior from unsanitized input.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_incumbent_proto.hpp` around lines 38 - 45, parse_incumbent_proto currently calls cuopt::remote::Incumbent::ParseFromArray with size cast to int, which can truncate large size_t values; add a guard at the start of parse_incumbent_proto to return false when size > static_cast<size_t>(std::numeric_limits<int>::max()) so the function rejects oversized inputs instead of silently truncating before calling ParseFromArray, mirroring the INT_MAX check used in build_incumbent_proto.
cpp/src/grpc/server/grpc_server_main.cpp-115-118 (1)
115-118: ⚠️ Potential issue | 🟡 Minor
`--verbose` is parsed but never affects `config.verbose`.
`config.verbose` is derived only from `--quiet`, so `--verbose` is effectively a no-op. Either drop the flag or include it in the final verbosity calculation so the CLI matches what it advertises.
Also applies to: 156-156
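One possible precedence rule, sketched as a pure function (names are illustrative; the real fix would fold this into the `config.verbose` assignment): `--quiet` wins over `--verbose`, and `--verbose` raises verbosity above the default.

```cpp
#include <cassert>

// Illustrative only: combine the two parsed flags into one verbosity value.
inline bool compute_verbose(bool quiet_flag, bool verbose_flag, bool default_verbose)
{
  if (quiet_flag) return false;   // --quiet always silences
  if (verbose_flag) return true;  // --verbose enables output when not quiet
  return default_verbose;         // neither flag given: keep the default
}
```

With this rule, `--quiet --verbose` stays quiet, which keeps the CLI's behavior predictable when both flags are supplied.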
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_server_main.cpp` around lines 115 - 118, The --verbose flag is parsed by program.add_argument("-v", "--verbose") but never used when computing config.verbose (which is only set from --quiet); update the final verbosity calculation to consider both flags (e.g., set config.verbose = parsed_verbose && !parsed_quiet or apply precedence rules), or remove the add_argument call if you intend not to support verbose; adjust the code locations that compute config.verbose (refer to the config.verbose assignment and the program.add_argument("-q","--quiet")/program.add_argument("-v","--verbose") usages) so the CLI behavior matches the advertised flags.
cpp/src/grpc/server/grpc_pipe_serialization.hpp-241-249 (1)
241-249: ⚠️ Potential issue | 🟡 Minor
Missing return value check for `SerializeToArray`.
`SerializeToArray` returns `false` on failure, but the return value is not checked. If serialization fails, an uninitialized blob is returned.
🐛 Proposed fix
```diff
 inline std::vector<uint8_t> serialize_submit_request_to_pipe(
   const cuopt::remote::SubmitJobRequest& request)
 {
   size_t byte_size = request.ByteSizeLong();
   if (byte_size == 0 || byte_size > static_cast<size_t>(std::numeric_limits<int>::max()))
     return {};
   std::vector<uint8_t> blob(byte_size);
-  request.SerializeToArray(blob.data(), static_cast<int>(byte_size));
+  if (!request.SerializeToArray(blob.data(), static_cast<int>(byte_size))) return {};
   return blob;
 }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_pipe_serialization.hpp` around lines 241 - 249, The function serialize_submit_request_to_pipe does not check the boolean return value of request.SerializeToArray, so a failed serialization can return an uninitialized blob; update serialize_submit_request_to_pipe to capture the bool result of request.SerializeToArray(blob.data(), static_cast<int>(byte_size)) and if it returns false, return an empty vector (or otherwise handle the error consistently with surrounding code) instead of returning the blob; ensure you still validate byte_size as before and only return the populated blob when SerializeToArray succeeds.
🧹 Nitpick comments (8)
conda/recipes/libcuopt/recipe.yaml (1)
205-209: Smoke-test the new server executable.
`package_contents` only proves the file was installed. A trivial launch check like `cuopt_grpc_server --help` would catch missing shared libraries or bad RPATHs for the newly shipped binary.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@conda/recipes/libcuopt/recipe.yaml` around lines 205 - 209, Add a smoke-test that actually runs the shipped server binary rather than just listing it: update the recipe to include a test/run_test (or test/commands) entry that executes "cuopt_grpc_server --help" (or an equivalent non-destructive invocation) after install, so missing shared libs or bad RPATHs are caught; reference the existing package_contents entry and the cuopt_grpc_server binary when adding this test.
cpp/CMakeLists.txt (1)
275-301: Use generator expressions to reference protoc and the gRPC plugin targets.
The current code reads `IMPORTED_LOCATION` properties directly from `protobuf::protoc`, which is fragile across different package configurations. The official protobuf documentation recommends using the generator expression `$<TARGET_FILE:protobuf::protoc>` instead. Similarly, the gRPC plugin target should be `gRPC::grpc_cpp_plugin` (the officially exported namespaced target from gRPCConfig.cmake), not the plain `grpc_cpp_plugin`. This ensures consistency across config-package installs and in-tree builds.
Specifically:
- Replace the `get_target_property` calls for protoc with `$<TARGET_FILE:protobuf::protoc>`
- Update the plugin check to use `TARGET gRPC::grpc_cpp_plugin` and `$<TARGET_FILE:gRPC::grpc_cpp_plugin>`
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/CMakeLists.txt` around lines 275 - 301, The current CMake logic reads IMPORTED_LOCATION from protobuf::protoc and checks for a plain grpc_cpp_plugin target; replace those fragile property reads with generator expressions and the proper gRPC target: when TARGET protobuf::protoc is true set _PROTOBUF_PROTOC to the generator expression $<TARGET_FILE:protobuf::protoc> (instead of using get_target_property(IMPORTED_LOCATION*)), and when checking the gRPC plugin use TARGET gRPC::grpc_cpp_plugin and set _GRPC_CPP_PLUGIN_EXECUTABLE to $<TARGET_FILE:gRPC::grpc_cpp_plugin>; keep the fallback find_program/ find_package branches and the fatal-error checks unchanged.
cpp/tests/linear_programming/grpc/CMakeLists.txt (1)
119-124: Consider adding a test timeout for the integration test.
Integration tests that spawn external processes (the gRPC server) can hang indefinitely if the server fails to start or respond. Consider adding a timeout property to prevent CI pipeline hangs.
💡 Proposed enhancement
```diff
 add_test(
   NAME GRPC_INTEGRATION_TEST
   COMMAND ${CMAKE_COMMAND} -E env
           "CUOPT_GRPC_SERVER_PATH=$<TARGET_FILE:cuopt_grpc_server>"
           $<TARGET_FILE:GRPC_INTEGRATION_TEST>
 )
+
+set_tests_properties(GRPC_INTEGRATION_TEST PROPERTIES TIMEOUT 300)
```
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/CMakeLists.txt` around lines 119 - 124, The GRPC_INTEGRATION_TEST add_test currently can hang indefinitely; modify the CMake test registration to set a timeout by adding a test property for GRPC_INTEGRATION_TEST (or using set_tests_properties on the test name) and assign a TIMEOUT value (seconds) appropriate for startup and test execution (e.g., 60 or configurable via a variable). Locate the add_test registering GRPC_INTEGRATION_TEST and follow it with set_tests_properties(GRPC_INTEGRATION_TEST PROPERTIES TIMEOUT <seconds>) so the spawned cuopt_grpc_server process is forcibly killed after the timeout.
GRPC_SERVER_ARCHITECTURE.md (1)
50-56: Consider documenting shared memory name collision risk for multi-instance deployments.
The shared memory segment names (`/cuopt_job_queue`, `/cuopt_result_queue`, `/cuopt_control`) are fixed. Running multiple server instances on the same host would cause collisions. Consider adding a note about this limitation or documenting how to use `--port` to differentiate instances if the segments are port-specific.
Verify each finding against the current code and only fix it if needed. In `@GRPC_SERVER_ARCHITECTURE.md` around lines 50 - 56, Document that the shared memory segment names `/cuopt_job_queue`, `/cuopt_result_queue`, and `/cuopt_control` are fixed and will collide when multiple server instances run on the same host; add a brief note in the "Shared Memory Segments" section explaining the risk and recommend solutions (namespace segments per-instance using a configurable prefix such as PID, port, or --port value, or allow a CLI/config option to set custom segment names) and show an example naming convention to avoid collisions.cpp/tests/linear_programming/grpc/grpc_client_test.cpp (1)
848-860: Assert the reconstructed primal vector here too.
This test only checks the objective copied from the chunked header. A regression in `RESULT_PRIMAL_SOLUTION` stitching would still pass. Assert the returned primal values are `{1.5, 2.5}` so the chunked download path exercises actual numerical reconstruction.
As per coding guidelines, "`**/*test*.{cpp,cu,py}`: Write tests validating numerical correctness of optimization results (not just 'runs without error')."
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_client_test.cpp` around lines 848 - 860, The test mocks a GetResultChunk returning two primal values but never asserts the reconstructed primal vector; update the test after the call that verifies the objective to also fetch the returned primal solution and assert its elements equal {1.5, 2.5} (i.e., add numeric equality checks for the reconstructed primal vector produced by the code path that handles cuopt::remote::RESULT_PRIMAL_SOLUTION). Apply the same additional assertions at the second mocked-response site referenced (around lines 871-876) so both chunked-download cases validate numerical reconstruction.cpp/tests/linear_programming/grpc/grpc_integration_test.cpp (1)
959-984: This test never proves `log_callback` was invoked.
`received_logs` is populated, but the assertions only check solve success. If streaming regresses and no callback fires, this test still passes. Assert at least one callback invocation or a known log line so the test covers the behavior in its name.
Verify each finding against the current code and only fix it if needed. In `@cpp/tests/linear_programming/grpc/grpc_integration_test.cpp` around lines 959 - 984, The test SolveMIPWithLogCallback currently never asserts that config.log_callback was invoked; modify the test to assert that received_logs contains at least one entry (or contains a known log substring) after client->solve_mip returns. Use the same log_mutex to safely read received_logs (e.g., lock_guard on log_mutex) and then EXPECT_FALSE(received_logs.empty()) or EXPECT_TRUE(any entry contains "presolve"/other known text) so the test fails if streaming/callbacks regress.cpp/src/grpc/server/grpc_service_impl.cpp (1)
437-438: Hardcoded element size assumes all result arrays are doubles.
The element size is hardcoded to `sizeof(double)`, which works for the current LP/MIP solution arrays but could break if result arrays with different element types are added.
Consider fetching the element size from the array descriptor in `ChunkedResultHeader` that was set via `add_result_array_descriptor()` in `populate_chunked_result_header_*` for consistency.
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_service_impl.cpp` around lines 437 - 438, The code currently hardcodes elem_size = sizeof(double) which assumes all result arrays are doubles; instead query the element size from the ChunkedResultHeader's array descriptor (the descriptor added via add_result_array_descriptor() in populate_chunked_result_header_*), compute elem_size from that descriptor and then derive array_size = total_bytes / elem_size so different element types are handled correctly; update the logic that sets elem_size and array_size (replace the hardcoded sizeof(double) usage) and ensure you handle any missing descriptor/error cases.cpp/src/grpc/server/grpc_job_management.cpp (1)
250-260: Redundant double-check of slot validity.
The slot validation at lines 257-260 duplicates the check at lines 251-252. While the intent appears to be guarding against ABA races, both checks use the same conditions without an intervening operation that could cause the state to change.
If defensive re-validation is desired, consider adding a comment explaining the race being guarded against.
♻️ Simplified version (if redundancy is unintentional)
```diff
   for (size_t i = 0; i < MAX_JOBS; ++i) {
-    if (!job_queue[i].ready.load(std::memory_order_acquire)) continue;
-    if (strcmp(job_queue[i].job_id, job_id.c_str()) != 0) continue;
-
-    // Re-validate the slot: the job_id could have changed between the
-    // initial check and now if the slot was recycled. Load ready with
-    // acquire so we see all writes that published it.
-    if (!job_queue[i].ready.load(std::memory_order_acquire) ||
-        strcmp(job_queue[i].job_id, job_id.c_str()) != 0) {
+    // Load ready with acquire to see all writes that published the slot.
+    if (!job_queue[i].ready.load(std::memory_order_acquire)) continue;
+    if (strcmp(job_queue[i].job_id, job_id.c_str()) != 0) {
       continue;
     }
```
Verify each finding against the current code and only fix it if needed. In `@cpp/src/grpc/server/grpc_job_management.cpp` around lines 250 - 260, The loop in grpc_job_management.cpp redundantly re-checks the same slot readiness and job_id twice (the two uses of job_queue[i].ready.load(std::memory_order_acquire) and strcmp(job_queue[i].job_id, job_id.c_str())), which either should be removed or documented as an ABA defense; update the code by either deleting the second identical validation (inside the comment block) to avoid duplicate checks, or if you intend to defend against ABA/race windows keep the second check but replace it with a clear comment explaining the exact race being guarded and prefer a robust mechanism (e.g., a sequence/version atomic on job_queue entries) instead of silent duplicate loads—refer to job_queue, MAX_JOBS, job_id, ready.load(std::memory_order_acquire), and strcmp when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e26b5406-c826-4a7f-abc9-dcc595a81d46
📒 Files selected for processing (56)
- GRPC_INTERFACE.md
- GRPC_QUICK_START.md
- GRPC_SERVER_ARCHITECTURE.md
- build.sh
- ci/build_wheel_libcuopt.sh
- ci/utils/install_protobuf_grpc.sh
- conda/environments/all_cuda-129_arch-aarch64.yaml
- conda/environments/all_cuda-129_arch-x86_64.yaml
- conda/environments/all_cuda-131_arch-aarch64.yaml
- conda/environments/all_cuda-131_arch-x86_64.yaml
- conda/recipes/libcuopt/recipe.yaml
- cpp/CMakeLists.txt
- cpp/include/cuopt/linear_programming/solve_remote.hpp
- cpp/src/grpc/client/grpc_client.cpp
- cpp/src/grpc/client/grpc_client.hpp
- cpp/src/grpc/client/solve_remote.cpp
- cpp/src/grpc/cuopt_remote.proto
- cpp/src/grpc/cuopt_remote_service.proto
- cpp/src/grpc/grpc_problem_mapper.cpp
- cpp/src/grpc/grpc_problem_mapper.hpp
- cpp/src/grpc/grpc_service_mapper.cpp
- cpp/src/grpc/grpc_service_mapper.hpp
- cpp/src/grpc/grpc_settings_mapper.cpp
- cpp/src/grpc/grpc_settings_mapper.hpp
- cpp/src/grpc/grpc_solution_mapper.cpp
- cpp/src/grpc/grpc_solution_mapper.hpp
- cpp/src/grpc/server/grpc_field_element_size.hpp
- cpp/src/grpc/server/grpc_incumbent_proto.hpp
- cpp/src/grpc/server/grpc_job_management.cpp
- cpp/src/grpc/server/grpc_pipe_io.cpp
- cpp/src/grpc/server/grpc_pipe_serialization.hpp
- cpp/src/grpc/server/grpc_server_logger.cpp
- cpp/src/grpc/server/grpc_server_logger.hpp
- cpp/src/grpc/server/grpc_server_main.cpp
- cpp/src/grpc/server/grpc_server_threads.cpp
- cpp/src/grpc/server/grpc_server_types.hpp
- cpp/src/grpc/server/grpc_service_impl.cpp
- cpp/src/grpc/server/grpc_worker.cpp
- cpp/src/grpc/server/grpc_worker_infra.cpp
- cpp/src/mip_heuristics/solve.cu
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/cpu_optimization_problem.cpp
- cpp/src/pdlp/optimization_problem.cu
- cpp/src/pdlp/solve.cu
- cpp/src/pdlp/solve_remote.cu
- cpp/tests/linear_programming/CMakeLists.txt
- cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
- cpp/tests/linear_programming/grpc/CMakeLists.txt
- cpp/tests/linear_programming/grpc/grpc_client_test.cpp
- cpp/tests/linear_programming/grpc/grpc_client_test_helper.hpp
- cpp/tests/linear_programming/grpc/grpc_integration_test.cpp
- cpp/tests/linear_programming/grpc/grpc_pipe_serialization_test.cpp
- cpp/tests/linear_programming/grpc/grpc_test_log_capture.hpp
- dependencies.yaml
- python/cuopt/cuopt/tests/linear_programming/test_cpu_only_execution.py
- python/libcuopt/CMakeLists.txt
💤 Files with no reviewable changes (4)
- cpp/src/pdlp/cpu_optimization_problem.cpp
- cpp/src/pdlp/CMakeLists.txt
- cpp/src/pdlp/optimization_problem.cu
- cpp/src/pdlp/solve_remote.cu
```yaml
- openssl
- c-ares
- libgrpc
- libprotobuf
- libabseil
- re2
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
set -euo pipefail
echo "== top-level shared host requirements =="
sed -n '80,100p' conda/recipes/libcuopt/recipe.yaml
echo
echo "== libcuopt output host/run requirements =="
sed -n '165,194p' conda/recipes/libcuopt/recipe.yaml
echo
echo "== wheel build note for the same binary =="
sed -n '20,28p' ci/build_wheel_libcuopt.sh
```

Repository: NVIDIA/cuopt

Length of output: 1745

🏁 Script executed:

```bash
#!/bin/bash
# Check what cache.build.script does
rg -A 30 "cache\.build\.script" conda/recipes/libcuopt/recipe.yaml | head -60
# Also check the complete libcuopt output section to understand build flow
echo "=== Full libcuopt output section ==="
sed -n '150,210p' conda/recipes/libcuopt/recipe.yaml
```

Repository: NVIDIA/cuopt

Length of output: 1699

🏁 Script executed:

```bash
# Find and display the cache.build.script section
sed -n '1,150p' conda/recipes/libcuopt/recipe.yaml | grep -A 50 "cache\.build\.script"
```

Repository: NVIDIA/cuopt

Length of output: 38

🏁 Script executed:

```bash
# Search for cache.build.script anywhere in the file
rg "cache\.build\.script" conda/recipes/libcuopt/recipe.yaml -A 40
# Also let's see the structure around lines 1-80 to understand the recipe better
echo "=== Top section of recipe ==="
head -80 conda/recipes/libcuopt/recipe.yaml
```

Repository: NVIDIA/cuopt

Length of output: 3975

🏁 Script executed:

```bash
# Check if build.sh actually builds libcuopt/cuopt_grpc_server
echo "=== Checking build.sh for libcuopt build and cuopt_grpc_server ==="
grep -n "libcuopt\|cuopt_grpc_server" build.sh | head -20
# Also check CMakeLists.txt to see if cuopt_grpc_server depends on libuuid
echo
echo "=== Checking CMakeLists.txt for cuopt_grpc_server and libuuid ==="
find . -name "CMakeLists.txt" -exec grep -l "cuopt_grpc_server\|libuuid" {} \; 2>/dev/null
# Let's look at the actual CMakeLists in cpp directory
echo
echo "=== cpp/CMakeLists.txt search ==="
grep -n "cuopt_grpc_server\|libuuid\|uuid" cpp/CMakeLists.txt 2>/dev/null || echo "Not found in cpp/CMakeLists.txt"
# Check cpp/build directory or subdirectories
find cpp -name "CMakeLists.txt" -type f 2>/dev/null | head -5
```

Repository: NVIDIA/cuopt

Length of output: 2611

🏁 Script executed:

```bash
# Verify the full top-level host requirements section to confirm libuuid is missing
echo "=== Complete top-level host requirements ==="
sed -n '73,105p' conda/recipes/libcuopt/recipe.yaml | grep -E "^\s+- " | sort
# Double-check if libuuid appears anywhere in top-level requirements
echo
echo "=== Checking for libuuid in top-level section ==="
sed -n '1,145p' conda/recipes/libcuopt/recipe.yaml | grep -i "libuuid"
echo "Exit code: $?"
```

Repository: NVIDIA/cuopt

Length of output: 729
Add libuuid to the shared cache host requirements.
cache.build.script builds libcuopt (line 23-30 in build.sh), which includes the cuopt_grpc_server target. The CMakeLists.txt explicitly requires uuid at link time (find_library(UUID_LIBRARY uuid REQUIRED) at line 749 in cpp/CMakeLists.txt). Since output-specific libcuopt requirements apply only after the shared cache build completes, the conda build environment will fail at link time for cuopt_grpc_server without libuuid in the top-level host requirements.
Also applies to: 174-190
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@conda/recipes/libcuopt/recipe.yaml` around lines 94 - 99, The shared-cache
host requirements for libcuopt are missing libuuid, which breaks linking of the
cuopt_grpc_server target during cache.build.script (build.sh lines that build
libcuopt); update the recipe.yaml host requirements list(s) that include
openssl, c-ares, libgrpc, libprotobuf, libabseil, re2 to also include libuuid so
the system-provided UUID library satisfies find_library(UUID_LIBRARY uuid
REQUIRED) in cpp/CMakeLists.txt and prevents link-time failures (also apply the
same addition to the other host requirements block referenced in the file).
```proto
package cuopt.remote;

// Import the existing message definitions
import "cuopt_remote.proto";
```
Fix the Buf import path.
buf resolves imports from the module root, so import "cuopt_remote.proto"; does not find cpp/src/grpc/cuopt_remote.proto. Update this import once the protos are moved/aligned with their package path.
🧰 Tools
🪛 Buf (1.66.1)
[error] 9-9: import "cuopt_remote.proto": file does not exist
(COMPILE)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/cuopt_remote_service.proto` at line 9, Update the import in
cuopt_remote_service.proto so Buf can resolve it from the module root: replace
the current import "cuopt_remote.proto" with the module-root-relative path that
matches where cuopt_remote.proto will live (i.e., the package-aligned path after
you move/align protos); modify the import statement in
cuopt_remote_service.proto (the existing import "cuopt_remote.proto") to the
correct module-root-relative path.
```proto
syntax = "proto3";

package cuopt.remote;
```
Move the proto files under a package-matching path.
buf lint will fail here because package cuopt.remote is not under a cuopt/remote/ directory. Either move this file (and cuopt_remote_service.proto) under a module-root-relative cuopt/remote/ path or rename the package/imports consistently.
🧰 Tools
🪛 Buf (1.66.1)
[error] 6-6: Files with package "cuopt.remote" must be within a directory "cuopt/remote" relative to root but were in directory "cpp/src/grpc".
(PACKAGE_DIRECTORY_MATCH)
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/cuopt_remote.proto` at line 6, The proto package declaration
"package cuopt.remote" in cuopt_remote.proto (and the matching
cuopt_remote_service.proto) must live in a directory path that matches the
package; either move both files into a module-root-relative cuopt/remote/
directory so their filesystem path matches the package, or update the package
line and any imports to a package name that matches their current directory;
adjust imports in both cuopt_remote.proto and cuopt_remote_service.proto to
remain consistent after the move/rename so buf lint passes.
```cpp
// proto3 defaults numeric fields to 0; treat negative iteration_limit as "unset"
// so the server keeps the library default (typically max()).
if (pb_settings.iteration_limit() >= 0) {
  const auto limit = pb_settings.iteration_limit();
  settings.iteration_limit = (limit > static_cast<int64_t>(std::numeric_limits<i_t>::max()))
                               ? std::numeric_limits<i_t>::max()
                               : static_cast<i_t>(limit);
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
#!/bin/bash
# Check how clients set iteration_limit when using server defaults
rg -n "iteration_limit" --type cpp -C3 cpp/src/grpc/client/
```

Repository: NVIDIA/cuopt

Length of output: 38

🏁 Script executed:

```bash
# Check repository structure and find client-related code
find . -type d -name "*client*" | head -20
```

Repository: NVIDIA/cuopt

Length of output: 196

🏁 Script executed:

```bash
# Look for proto files to understand the iteration_limit schema
fd -e proto | head -20
```

Repository: NVIDIA/cuopt

Length of output: 127

🏁 Script executed:

```bash
# Examine the grpc_settings_mapper.cpp file more broadly to understand context
cat -n cpp/src/grpc/grpc_settings_mapper.cpp | head -100
```

Repository: NVIDIA/cuopt

Length of output: 5006

🏁 Script executed:

```bash
# Search for iteration_limit usage patterns across the codebase
rg -n "iteration_limit" --type cpp -C2 | head -100
```

Repository: NVIDIA/cuopt

Length of output: 8440

🏁 Script executed:

```bash
# Check the proto schema definition for iteration_limit
cat cpp/src/grpc/cuopt_remote.proto | grep -A 10 -B 5 "iteration_limit"
```

Repository: NVIDIA/cuopt

Length of output: 634

🏁 Script executed:

```bash
# Look at the test case mentioned in search results
sed -n '77,150p' cpp/tests/linear_programming/c_api_tests/c_api_tests.cpp
```

Repository: NVIDIA/cuopt

Length of output: 3092

🏁 Script executed:

```bash
# Check if there are any gRPC-specific tests or documentation about client behavior
fd -e "test" -e "spec" | xargs grep -l "iteration_limit" 2>/dev/null | head -10
```

Repository: NVIDIA/cuopt

Length of output: 38

🏁 Script executed:

```bash
# Search for any existing bug reports or TODOs about this iteration_limit issue
rg -n "iteration_limit.*0" --type cpp -C2 cpp/src/grpc/
```

Repository: NVIDIA/cuopt

Length of output: 614

🏁 Script executed:

```bash
# Check if there are integration tests that test the from_proto behavior with 0 or unset values
fd -e "test.cpp" -e "test.hpp" | xargs grep -l "from_proto\|PDLPSolverSettings" 2>/dev/null | head -5
```

Repository: NVIDIA/cuopt

Length of output: 38

🏁 Script executed:

```bash
# Look at the complete from_proto function to see the full context
sed -n '120,180p' cpp/src/grpc/grpc_settings_mapper.cpp
```

Repository: NVIDIA/cuopt

Length of output: 3450

🏁 Script executed:

```bash
# Check if there are any other similar numeric field handling patterns to confirm consistency
sed -n '150,180p' cpp/src/grpc/grpc_settings_mapper.cpp
```

Repository: NVIDIA/cuopt

Length of output: 1745
Fix asymmetry in iteration_limit proto3 handling to prevent unintended zero-iteration behavior.
Lines 141–146 accept iteration_limit >= 0, which treats proto3's default value of 0 (when the field is omitted by the client) as a valid setting. This causes immediate solver termination (0 iterations) rather than using the library default (max()).
The proto schema explicitly documents the -1 sentinel for "unset" and warns that proto3 defaults to 0 when fields are omitted. The asymmetry is clear: to_proto() correctly converts max() → -1, but from_proto() incorrectly accepts 0 as explicitly set.
Fix: Either change the condition to if (pb_settings.iteration_limit() > 0) to treat 0 and negative values as unset, or validate against a valid range with a fallback default (matching the pattern used for presolver and pdlp_precision in lines 154–159 and 172–178).
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/grpc_settings_mapper.cpp` around lines 139 - 146, The from-proto
handling for iteration_limit incorrectly treats proto3's default 0 as an
explicit setting; update the logic in grpc_settings_mapper.cpp that reads
pb_settings.iteration_limit() so it only accepts positive values (e.g., change
the guard to pb_settings.iteration_limit() > 0) or otherwise validate the range
and fall back to library default (std::numeric_limits<i_t>::max()) like the
presolver/pdlp_precision pattern; ensure settings.iteration_limit is set to
max() when the proto value is 0 or negative, and keep the existing cap that maps
values above numeric_limits<i_t>::max() down to that max.
```cpp
bool read_from_pipe(int fd, void* data, size_t size, int timeout_ms)
{
  uint8_t* ptr     = static_cast<uint8_t*>(data);
  size_t remaining = size;

  // Poll once to enforce timeout before the first read. After data starts
  // flowing, blocking read() is sufficient — if the writer dies the pipe
  // closes and read() returns 0 (EOF). Avoids ~10k extra poll() syscalls
  // per bulk transfer.
  struct pollfd pfd = {fd, POLLIN, 0};
  int pr;
  do {
    pr = poll(&pfd, 1, timeout_ms);
  } while (pr < 0 && errno == EINTR);
  if (pr < 0) {
    SERVER_LOG_ERROR("[Server] poll() failed on pipe: %s", strerror(errno));
    return false;
  }
  if (pr == 0) {
    SERVER_LOG_ERROR("[Server] Timeout waiting for pipe data (waited %dms)", timeout_ms);
    return false;
  }
  if (pfd.revents & (POLLERR | POLLHUP | POLLNVAL)) {
    SERVER_LOG_ERROR("[Server] Pipe error/hangup detected");
    return false;
  }

  while (remaining > 0) {
    ssize_t nread = ::read(fd, ptr, remaining);
    if (nread > 0) {
      ptr += nread;
      remaining -= nread;
      continue;
    }
    if (nread == 0) {
      SERVER_LOG_ERROR("[Server] Pipe EOF (writer closed)");
      return false;
    }
    if (errno == EINTR) continue;
    SERVER_LOG_ERROR("[Server] Pipe read error: %s", strerror(errno));
    return false;
  }
  return true;
}
```
🧩 Analysis chain
🌐 Web query:
On POSIX pipes, can poll() return POLLIN and POLLHUP together while unread data remains, and does a single initial poll() followed by blocking read() enforce a timeout for the entire exact-length read?
💡 Result:
Yes, on POSIX pipes, poll can return both POLLIN and POLLHUP together while unread data remains in the pipe. This occurs when data has been written to the pipe and the write end is closed by the last writer: POLLIN indicates data is available to read without blocking, and POLLHUP indicates the pipe has been closed by the last writer. These events are explicitly not mutually exclusive per POSIX specification. After reading all available data, a subsequent read will return 0 (EOF). No, a single initial poll with timeout followed by a blocking read does not enforce a timeout for the entire exact-length read. The poll timeout only applies to waiting for the pipe to become readable (i.e., data arrives or EOF condition). Once poll returns (successfully or timed out), the subsequent blocking read has no timeout: if data is available but less than requested, read will block indefinitely waiting for more data (if write end still open) or until EOF (if closed). To timeout an exact-length read, repeated non-blocking reads or additional polls after partial reads are needed.
Citations:
- 1: https://www.unix.com/man-page/posix/3p/poll?os=posix§ion=3p&query=poll
- 2: https://pubs.opengroup.org/onlinepubs/9799919799/functions/poll.html
- 3: https://pubs.opengroup.org/onlinepubs/9799919799.2024edition/functions/poll.html
- 4: https://man7.org/linux/man-pages/man2/ppoll.2.html
- 5: https://man7.org/linux/man-pages/man2/poll.2.html
- 6: https://stackoverflow.com/questions/43553004/read-after-poll-returns-pollin-and-pollerr
- 7: https://stackoverflow.com/questions/25147181/pollhup-vs-pollnval-or-what-is-pollhup
- 8: https://www.man7.org/linux/man-pages/man7/pipe.7.html
Poll inside the read loop with timeout enforcement; POLLIN and POLLHUP can occur together while data remains buffered.
After the initial poll returns, subsequent blocking read() calls are no longer bounded by timeout_ms, which can cause indefinite hangs on partial frames. Additionally, treating any POLLHUP as fatal immediately drops valid buffered data—on POSIX pipes, POLLIN and POLLHUP occur together when the writer closes after writing. This violates timeout guarantees and can wedge server threads. Implement polling in a loop with a shrinking deadline, only treating HUP as fatal after draining all readable bytes.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_pipe_io.cpp` around lines 33 - 75, In
read_from_pipe, replace the one-time initial poll with a looped poll+read that
enforces a shrinking deadline so the total wait never exceeds timeout_ms:
compute a deadline (now + timeout_ms), then in the while(remaining>0) loop call
poll with timeout = max(0, deadline - now) and only call read when poll reports
POLLIN (or POLLIN|POLLHUP); do not treat POLLHUP as fatal if POLLIN is also set
— continue reading until read() returns 0 (EOF) or remaining == 0; treat POLLHUP
as fatal only when poll reports POLLHUP without POLLIN or when read() returns 0
after draining buffered data; handle EINTR for poll/read and log
timeout/poll/read errors with context. Ensure all references are applied to the
read_from_pipe function and its local pfd/pr/remaining variables.
```cpp
// Chunked download session state (raw arrays from worker)
struct ChunkedDownloadState {
  uint32_t problem_category = cuopt::remote::LP;
  std::chrono::steady_clock::time_point created;
  cuopt::remote::ChunkedResultHeader result_header;
  std::map<int32_t, std::vector<uint8_t>> raw_arrays;  // ResultFieldId -> raw bytes
};

// Per-array allocation cap for chunked uploads (4 GiB).
static constexpr int64_t kMaxChunkedArrayBytes = 4LL * 1024 * 1024 * 1024;

// Maximum concurrent chunked upload + download sessions (global across all clients).
static constexpr size_t kMaxChunkedSessions = 16;

// Stale session timeout: sessions with no activity for this long are reaped.
static constexpr int kSessionTimeoutSeconds = 300;

struct ChunkedUploadState {
  uint32_t problem_category = cuopt::remote::LP;
  cuopt::remote::ChunkedProblemHeader header;
  struct FieldMeta {
    int64_t total_elements = 0;
    int64_t element_size   = 0;
    int64_t received_bytes = 0;
  };
  std::map<int32_t, FieldMeta> field_meta;
  std::vector<cuopt::remote::ArrayChunk> chunks;
  int64_t total_chunks = 0;
  int64_t total_bytes  = 0;
  std::chrono::steady_clock::time_point last_activity;
};
```
Bound chunked sessions by total bytes, not just array count.
`ChunkedUploadState::chunks` and `ChunkedDownloadState::raw_arrays` retain full payloads in RAM. With a 4 GiB per-array cap and 16 concurrent sessions, one client can pin tens of GiB before a worker consumes the data or before a download finishes. Please enforce per-upload and global byte budgets, and reject or spill once `total_bytes` crosses them. As per coding guidelines, "Validate input sanitization to prevent buffer overflows and resource exhaustion attacks; avoid unsafe deserialization of problem files" and "Verify correct problem size checks before expensive GPU/CPU operations; prevent resource exhaustion on oversized problems".
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_server_types.hpp` around lines 186 - 216, The
upload/download state currently holds whole payloads in
ChunkedUploadState::chunks and ChunkedDownloadState::raw_arrays allowing a
client to pin many GiB; add and enforce byte budgets: introduce per-session
limit (e.g. kMaxChunkedSessionBytes) and a global in-process limit (e.g.
kGlobalMaxChunkedBytes) and check them whenever you append to
ChunkedUploadState::chunks or increment ChunkedUploadState::total_bytes, and
whenever you insert into ChunkedDownloadState::raw_arrays or set its
result_header sizes; if adding the bytes would exceed the per-session or global
budget, reject the operation with an appropriate error/GRPC status (or spill to
disk) and do not retain the payload, and update any session bookkeeping (e.g.
increment/decrement global allocated bytes on accept/release) so that session
cleanup and ChunkedUploadState::total_bytes reflect true memory usage.
```cpp
pid_t spawn_worker(int worker_id, bool is_replacement)
{
  std::lock_guard<std::mutex> lock(worker_pipes_mutex);

  if (is_replacement) { close_worker_pipes_server(worker_id); }

  if (!create_worker_pipes(worker_id)) {
    SERVER_LOG_ERROR("[Server] Failed to create pipes for %s%d",
                     is_replacement ? "replacement worker " : "worker ",
                     worker_id);
    return -1;
  }

  pid_t pid = fork();
  if (pid < 0) {
    SERVER_LOG_ERROR("[Server] Failed to fork %s%d",
                     is_replacement ? "replacement worker " : "worker ",
                     worker_id);
    close_all_worker_pipes(worker_pipes[worker_id]);
    return -1;
  } else if (pid == 0) {
    // Child: close all fds belonging to other workers.
    for (int j = 0; j < static_cast<int>(worker_pipes.size()); ++j) {
      if (j != worker_id) { close_all_worker_pipes(worker_pipes[j]); }
    }
    // Close the server-side ends of this worker's pipes (child uses the other ends).
    close_and_reset(worker_pipes[worker_id].to_worker_fd);
    close_and_reset(worker_pipes[worker_id].from_worker_fd);
    close_and_reset(worker_pipes[worker_id].incumbent_from_worker_fd);
    worker_process(worker_id);
    _exit(0);
  }

  close_worker_pipes_child_ends(worker_id);
  return pid;
}
```
🧩 Analysis chain
🏁 Script executed:

```bash
# Find files related to worker spawning and threading
fd -t f "grpc_worker|grpc_server" cpp/src/grpc/server/ | head -20
```

Repository: NVIDIA/cuopt

Length of output: 346

🏁 Script executed:

```bash
# Look for spawn_single_worker usage
rg "spawn_single_worker" --type cpp --type hpp -B 2 -A 5
```

Repository: NVIDIA/cuopt

Length of output: 83

🏁 Script executed:

```bash
# Check worker_monitor_thread definition
rg "worker_monitor_thread" --type cpp --type hpp -B 2 -A 10
```

Repository: NVIDIA/cuopt

Length of output: 83

🏁 Script executed:

```bash
# Check worker_pids usage pattern
rg "worker_pids" --type cpp --type hpp -B 2 -A 3
```

Repository: NVIDIA/cuopt

Length of output: 83

🏁 Script executed:

```bash
# Check pending_chunked_data cleanup/handling
rg "pending_chunked_data" --type cpp --type hpp -B 2 -A 3
```

Repository: NVIDIA/cuopt

Length of output: 83

🏁 Script executed:

```bash
# Examine the grpc_server_main.cpp to understand thread startup
fd -t f "grpc_server_main" cpp/src/grpc/server/ -exec cat {} \;
```

Repository: NVIDIA/cuopt

Length of output: 202

🏁 Script executed:

```bash
# Find spawn_single_worker callers - use correct file types
rg "spawn_single_worker" cpp/src/grpc/server/ -B 2 -A 5
```

Repository: NVIDIA/cuopt

Length of output: 2209

🏁 Script executed:

```bash
# Find worker_monitor_thread
rg "worker_monitor_thread" cpp/src/grpc/server/ -B 2 -A 15
```

Repository: NVIDIA/cuopt

Length of output: 4272

🏁 Script executed:

```bash
# Check worker_pids usage
rg "worker_pids" cpp/src/grpc/server/ -B 2 -A 3
```

Repository: NVIDIA/cuopt

Length of output: 3840

🏁 Script executed:

```bash
# Check pending_chunked_data
rg "pending_chunked_data" cpp/src/grpc/server/ -B 2 -A 3
```

Repository: NVIDIA/cuopt

Length of output: 3024

🏁 Script executed:

```bash
# Read grpc_server_main.cpp
cat cpp/src/grpc/server/grpc_server_main.cpp
```

Repository: NVIDIA/cuopt

Length of output: 12863

🏁 Script executed:

```bash
# Read grpc_server_threads.cpp
cat cpp/src/grpc/server/grpc_server_threads.cpp
```

Repository: NVIDIA/cuopt

Length of output: 15164

🏁 Script executed:

```bash
# Find mark_worker_jobs_failed implementation to check cleanup
rg "mark_worker_jobs_failed" cpp/src/grpc/server/ -A 30
```

Repository: NVIDIA/cuopt

Length of output: 5919

🏁 Script executed:

```bash
# Examine spawn_workers to verify indexing issue with failed spawns
rg "void spawn_workers" cpp/src/grpc/server/ -A 15
```

Repository: NVIDIA/cuopt

Length of output: 2427

🏁 Script executed:

```bash
# Check if worker_pids is pre-sized or if initial spawn failure leaves gaps
rg "worker_pids.*resize|worker_pids.*reserve|worker_pids.*size()" cpp/src/grpc/server/ -B 2 -A 2
```

Repository: NVIDIA/cuopt

Length of output: 419

🏁 Script executed:

```bash
# Look at full mark_worker_jobs_failed to confirm pending_chunked_data handling
rg "mark_worker_jobs_failed" cpp/src/grpc/server/grpc_worker_infra.cpp -A 60
```

Repository: NVIDIA/cuopt

Length of output: 2714

🏁 Script executed:

```bash
# Check if is_chunked job can leave pending_chunked_data entries orphaned
rg "is_chunked" cpp/src/grpc/server/ -B 2 -A 2
```

Repository: NVIDIA/cuopt

Length of output: 3860
Forking from the live multithreaded server creates unsafe process state.
spawn_single_worker() is called from worker_monitor_thread() after the gRPC server and background threads (result_retrieval_thread, incumbent_retrieval_thread, session_reaper_thread, plus gRPC service threads) are already running. Line 115 in spawn_worker() calls fork() while the parent process is multithreaded, which is hazardous under POSIX: only the forking thread survives in the child, so the child inherits every mutex, allocator lock, and thread-local resource in whatever state the other threads left them, and those locks are never released in the child. Although the child calls _exit(0) immediately to avoid destructor issues, the child still risks deadlock or undefined behavior if any thread other than the monitor thread held worker_pipes_mutex or was inside malloc/free at the moment of the fork.
Replace the fork-based respawning with posix_spawn (which launches a fresh process image and avoids inheriting the multithreaded parent's lock state), or move worker supervision into a separate single-threaded parent process.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker_infra.cpp` around lines 103 - 137,
spawn_worker currently calls fork() from a multithreaded process (called by
worker_monitor_thread) which is unsafe; replace the fork-based respawn with
posix_spawn to avoid inheriting thread/allocator state. Update spawn_worker to
construct posix_spawn_file_actions that (1) dup2 the worker-side pipe fds into
the expected child descriptors and close the server-side ends, (2) close all
other worker pipe fds via file actions, and (3) set up any required env/args to
invoke the same worker entry (the code path currently run by worker_process).
Use posix_spawnp/posix_spawn to launch the worker binary and return the child
pid, and remove the fork/child branch and direct calls to worker_process; keep
logic that closes child ends in the parent (close_worker_pipes_child_ends) and
preserve error logging around pipe creation and spawn failures while continuing
to honor is_replacement and the worker_pipes_mutex.
static void store_simple_result(const std::string& job_id,
                                int worker_id,
                                ResultStatus status,
                                const char* error_message)
{
  for (size_t i = 0; i < MAX_RESULTS; ++i) {
    if (result_queue[i].ready.load(std::memory_order_acquire)) continue;
    bool expected = false;
    if (!result_queue[i].claimed.compare_exchange_strong(
          expected, true, std::memory_order_acq_rel)) {
      continue;
    }
    if (result_queue[i].ready.load(std::memory_order_acquire)) {
      result_queue[i].claimed.store(false, std::memory_order_release);
      continue;
    }
    copy_cstr(result_queue[i].job_id, job_id);
    result_queue[i].status = status;
    result_queue[i].data_size = 0;
    result_queue[i].worker_index.store(worker_id, std::memory_order_relaxed);
    copy_cstr(result_queue[i].error_message, error_message);
    result_queue[i].error_message[sizeof(result_queue[i].error_message) - 1] = '\0';
    result_queue[i].retrieved.store(false, std::memory_order_relaxed);
    result_queue[i].ready.store(true, std::memory_order_release);
    result_queue[i].claimed.store(false, std::memory_order_release);
    break;
  }
Don't drop completion metadata when result_queue is full.
Both store_simple_result() and publish_result() just fall out of the loop when every slot is busy. The job slot is still released afterward, so the server loses the only completion record and the client can wait forever for a result that was already produced. This needs retry/backoff or an explicit failure path before reset_job_slot() runs.
Also applies to: 408-441
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 154 - 180,
store_simple_result (and similarly publish_result) currently drops completion
metadata if all result_queue slots are busy; detect the “queue full” condition
after the loop and do not simply return — instead implement a retry loop with
short backoff (or a bounded number of attempts) that retries claiming a
result_queue slot, and if still unsuccessful before reset_job_slot() is called,
write an explicit failure record into the job slot (e.g., set status to a
special ERROR/QUEUE_FULL and copy an explanatory error_message) so the client is
notified; update store_simple_result, publish_result, and call sites around
reset_job_slot to either wait for the retry to finish or to ensure the explicit
failure record is stored before releasing the job slot.
static DeserializedJob read_problem_from_pipe(int worker_id, const JobQueueEntry& job)
{
  DeserializedJob dj;

  int read_fd          = worker_pipes[worker_id].worker_read_fd;
  bool is_chunked_job  = job.is_chunked.load();

  auto pipe_recv_t0 = std::chrono::steady_clock::now();

  if (is_chunked_job) {
    // Chunked path: the server wrote a ChunkedProblemHeader followed by
    // a set of raw typed arrays (constraint matrix, bounds, etc.).
    // This avoids a single giant protobuf allocation for large problems.
    cuopt::remote::ChunkedProblemHeader chunked_header;
    std::map<int32_t, std::vector<uint8_t>> arrays;
    if (!read_chunked_request_from_pipe(read_fd, chunked_header, arrays)) { return dj; }

    if (config.verbose) {
      int64_t total_bytes = 0;
      for (const auto& [fid, data] : arrays) {
        total_bytes += data.size();
      }
      log_pipe_throughput("pipe_job_recv", total_bytes, pipe_recv_t0);
      SERVER_LOG_INFO(
        "[Worker] IPC path: CHUNKED (%zu arrays, %ld bytes)", arrays.size(), total_bytes);
    }
    if (chunked_header.has_lp_settings()) {
      map_proto_to_pdlp_settings(chunked_header.lp_settings(), dj.lp_settings);
    }
    if (chunked_header.has_mip_settings()) {
      map_proto_to_mip_settings(chunked_header.mip_settings(), dj.mip_settings);
    }
    dj.enable_incumbents = chunked_header.enable_incumbents();
    map_chunked_arrays_to_problem(chunked_header, arrays, dj.problem);
  } else {
    // Unary path: the entire SubmitJobRequest was serialized as a single
    // protobuf blob. Simpler but copies more memory for large problems.
    std::vector<uint8_t> request_data;
    if (!recv_job_data_pipe(read_fd, job.data_size, request_data)) { return dj; }

    if (config.verbose) {
      log_pipe_throughput("pipe_job_recv", static_cast<int64_t>(request_data.size()), pipe_recv_t0);
    }
    cuopt::remote::SubmitJobRequest submit_request;
    if (!submit_request.ParseFromArray(request_data.data(),
                                       static_cast<int>(request_data.size())) ||
        (!submit_request.has_lp_request() && !submit_request.has_mip_request())) {
      return dj;
    }
    if (submit_request.has_lp_request()) {
      const auto& req = submit_request.lp_request();
      SERVER_LOG_INFO("[Worker] IPC path: UNARY LP (%zu bytes)", request_data.size());
      map_proto_to_problem(req.problem(), dj.problem);
      map_proto_to_pdlp_settings(req.settings(), dj.lp_settings);
    } else {
      const auto& req = submit_request.mip_request();
      SERVER_LOG_INFO("[Worker] IPC path: UNARY MIP (%zu bytes)", request_data.size());
      map_proto_to_problem(req.problem(), dj.problem);
      map_proto_to_mip_settings(req.settings(), dj.mip_settings);
      dj.enable_incumbents = req.has_enable_incumbents() ? req.enable_incumbents() : true;
    }
  }

  dj.success = true;
  return dj;
🧩 Analysis chain

🏁 Script executed:
#!/bin/bash
fd 'grpc_*mapper*.cpp' cpp/src | xargs rg -n "map_chunked_arrays_to_problem|map_proto_to_problem|map_proto_to_pdlp_settings|map_proto_to_mip_settings|throw |cuopt_expects"

Length of output: 44417

🏁 Script executed:
# Get context around the read_problem_from_pipe function and its call sites
sed -n '190,280p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 4612

🏁 Script executed:
# Also check lines 514-520 as mentioned in the review
sed -n '500,535p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 1813

🏁 Script executed:
# Search for try-catch blocks around the read_problem_from_pipe call
rg -B 10 -A 10 'read_problem_from_pipe' cpp/src/grpc/server/grpc_worker.cpp

Length of output: 1589

🏁 Script executed:
# Check the mapper implementations to see what exceptions they can throw
sed -n '144,250p' cpp/src/grpc/grpc_problem_mapper.cpp | head -100

Length of output: 5200

🏁 Script executed:
# Look at the complete grpc_worker.cpp worker loop to see full exception handling context
sed -n '470,540p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 3133

🏁 Script executed:
# Check if there's a try-catch wrapping the entire worker loop
rg -B 5 -A 15 'while.*job_queue.*ready' cpp/src/grpc/server/grpc_worker.cpp

Length of output: 38

🏁 Script executed:
# Check what happens around lines 514-520 in the actual file
wc -l cpp/src/grpc/server/grpc_worker.cpp

Length of output: 95

🏁 Script executed:
# Read around line 514-520 to see the second location mentioned
sed -n '500,530p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 1536

🏁 Script executed:
# Get a broader view of the worker_process function to see all exception handling
grep -n "try\|catch" cpp/src/grpc/server/grpc_worker.cpp

Length of output: 630

🏁 Script executed:
# Look at the try-catch blocks to see what they protect
sed -n '283,335p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 3265

🏁 Script executed:
# Check the second try-catch block
sed -n '346,395p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 2911

🏁 Script executed:
# Examine the complete flow from read_problem_from_pipe call to understand control flow
sed -n '490,550p' cpp/src/grpc/server/grpc_worker.cpp | cat -n

Length of output: 2696
Add exception handling around mapper calls in read_problem_from_pipe() to prevent unhandled exceptions from crashing the worker.
The mapping helper functions (map_proto_to_problem(), map_chunked_arrays_to_problem(), map_proto_to_pdlp_settings(), map_proto_to_mip_settings()) throw exceptions on invalid input (e.g., std::runtime_error for unknown variable types, std::invalid_argument for invalid settings). These exceptions are not caught in read_problem_from_pipe(), allowing a malformed request to crash the worker process before store_simple_result() and reset_job_slot() execute, leaving the job slot in a claimed state and preventing proper error reporting.
Wrap the mapper calls in a try-catch block (lines 233–239 for chunked path, lines 258–264 for unary path) and return success=false on exception, similar to how I/O errors are handled. This ensures all code paths properly signal failure and allow cleanup to proceed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 206 - 270,
read_problem_from_pipe currently calls map_proto_to_problem,
map_chunked_arrays_to_problem, map_proto_to_pdlp_settings and
map_proto_to_mip_settings without protection; wrap the mapping logic in both the
chunked and unary paths in a try-catch that catches std::exception (or (...) if
preferred), log the exception (e.g., SERVER_LOG_ERROR with the what() message),
set dj.success = false and return dj so the caller can clean up; ensure the try
covers the calls that populate dj (including enable_incumbents assignments) so
any thrown exception results in the same failure return path.
result_slot = static_cast<int>(i);
ResultQueueEntry& result = result_queue[i];
copy_cstr(result.job_id, job_id);
result.status    = sr.success ? RESULT_SUCCESS : RESULT_ERROR;
result.data_size = sr.success ? std::max<uint64_t>(result_total_bytes, 1) : 0;
result.worker_index.store(worker_id, std::memory_order_relaxed);
if (!sr.success) { copy_cstr(result.error_message, sr.error_message); }
result.retrieved.store(false, std::memory_order_relaxed);
result.ready.store(true, std::memory_order_release);
result.claimed.store(false, std::memory_order_release);
Publish ready=true only after the pipe payload is committed.
The shared-memory entry is marked RESULT_SUCCESS / ready before write_result_to_pipe() finishes. A reader can observe a ready result and start consuming the pipe, then block indefinitely if the later pipe write fails or only partially succeeds. Move the publish step after a successful pipe write, or add a distinct "payload ready" state.
As per coding guidelines, "Ensure race conditions are absent in multi-threaded server implementations; verify proper synchronization of shared state."
Also applies to: 443-464
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/grpc/server/grpc_worker.cpp` around lines 421 - 430, The code marks
result.ready and RESULT_SUCCESS before the pipe payload is committed; fix by
deferring publication until write_result_to_pipe() completes successfully: call
write_result_to_pipe(...) first, check its return, then set
result.data_size/result.status (sr.success ? RESULT_SUCCESS : RESULT_ERROR) and
copy sr.error_message as needed, and only after a successful pipe write store
result.ready.store(true, std::memory_order_release); on write failure set
RESULT_ERROR and a proper error_message and publish ready only when the payload
is guaranteed committed (or add a separate payload_ready flag and publish that
after write_result_to_pipe()); update any related stores (result.worker_index,
result.retrieved, result.claimed) to maintain memory-ordering consistent with
publishing ready.
testing dep changes, clone of #939