Nemotron OCR SDG Pipeline#1899
Conversation
- Updated the `transformers` dependency from `<=4.55.2` to `==4.57.0` in `pyproject.toml` and `uv.lock` to ensure compatibility with the Cosmos Embed imports. - Added a new `Gemini3Pro` model class in `gemini.py` utilizing the NVIDIA Inference API. - Introduced `DescriptionOutputStage` and `DescriptionValidatorStage` for processing and validating image descriptions, respectively. - Enhanced `VLMProcessingStage` to improve GPU resource handling and added a `num_workers` parameter to `DescriptionStage` for better scalability. This commit enhances the model's capabilities and ensures that dependencies are up-to-date for optimal performance. Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Removes description-specific stages (description*.py, description pipeline tutorials) that belong on aot/omni_description. Adds OCR result inspection/review scripts and shared design docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… pipeline tutorial Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…Data and OCRDenseWord classes Signed-off-by: Ao Tang <aot@nvidia.com>
--metrics-dir wires Ray metrics into the running Prometheus/Grafana instance. --run-name sets SLURM_JOB_NAME so Xenna labels the run on the ray_pipeline_input_tasks metric for human-readable identification in Grafana. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… timing - OCRConversationData.to_dict(): call conversation.to_dict() explicitly instead of relying on dataclasses.asdict(), which bypasses the custom ConversationSample serialization and drops the "t" media-type field from image fragments. - RayClient: move Prometheus service-discovery registration to after Ray is started and responsive; add _wait_for_ray_service_discovery_file() so the SD file exists before Prometheus is told to watch it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove internal-only I/O stages from io.py (InputFormat, ImageReaderStage, ImageFolderReaderStage, TarImageReaderStage, JsonlTarImageReaderStage, OcrJsonlReaderStage, JsonlPipelineOutputReaderStage, TarImageReader, ParquetImageReaderStage, ParquetImageReader). These classes are only used by the internal ocr_pipeline.py and will live on aot/omni_sdg_internal. Public io.py now exports: HFDatasetImageReaderStage, SkipProcessedStage, ResultWriterStage, merge_output_shards, ImageWriterStage, and the FileReader helpers (load_image_from_task, TarFileReader, etc.). Add tests/stages/test_hf_dataset_image_reader.py covering HFDatasetImageReaderStage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ImageWriterStage is not used by hf_ocr_pipeline.py and only makes sense with the internal JSONL+tar pipeline; moved to aot/omni_sdg_internal. TarFileReader, ParquetFileReader, _file_readers dispatcher, and the deprecated _parse_tar_slice_path wrapper are removed. In the public HF pipeline images are always regular JPEG files on disk, so load_image_from_task is simplified to a single RegularFileReader call. io.py: 870 → 631 lines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_url SUPPORTED_IMAGE_EXTENSIONS was only used by the internal reader classes removed in the previous commit. FileReader.read_image_url() was never called in this branch — drop it and its now-unused base64 import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng in nvinference_client.py Signed-off-by: Ao Tang <aot@nvidia.com>
| image_path = self.image_dir / f"{image_id}.jpg" | ||
| if not image_path.exists(): |
There was a problem hiding this comment.
image_id used as filename without path sanitization
f"{image_id}.jpg" is used directly as a filename. If the id_column value contains a / (e.g. "en/doc_001"), Path(image_dir) / "en/doc_001.jpg" silently creates a subdirectory en/ instead of a flat file. The subdirectory won't be created by self.image_dir.mkdir(parents=True, exist_ok=True) above (that only creates image_dir itself), so the subsequent pil_image.save(image_path) will fail with FileNotFoundError.
Consider sanitizing:
safe_id = image_id.replace("/", "_").replace("\\", "_")
image_path = self.image_dir / f"{safe_id}.jpg"| pass | ||
|
|
||
|
|
||
| class NVInferenceModel(VLMModel): |
There was a problem hiding this comment.
For discussion: Well, we can let it be "nvinference", or generic "OpenAI-API-compatible" or something? Basically we're not specifically targeting nvinference, but rather any openai compatible api. Although internally we would of course focus on nvinference.
There was a problem hiding this comment.
I think we should just leave it for nvinference for now. promoting NV inference infra isn't a bad idea
| content.append({"type": "text", "text": prompt}) | ||
| return content | ||
|
|
||
| def generate( |
There was a problem hiding this comment.
Meaning, we only have static batching for now, no dynamic batching?
There was a problem hiding this comment.
thats true. Do we expect perf gain for inference API from using dynamic batching ?
|
|
||
|
|
||
| @dataclass(kw_only=True) | ||
| class OCRDenseWord: |
There was a problem hiding this comment.
I guess we should rename. It's not necessarily a word, but can also be a line (if using the line prediction mode).
| DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro" | ||
|
|
||
|
|
||
| class Gemini3Pro(NVInferenceModel): |
There was a problem hiding this comment.
Actually, was thinking if we should set our just released omni model as default?
Signed-off-by: Ao Tang <aot@nvidia.com>
Updated test cases across multiple files to replace instances of OCRDenseWord with OCRDenseItem, reflecting the new class structure. Adjusted related helper functions and assertions accordingly. Additionally, modified the OCR pipeline to utilize Nemotron-Nano-Omni for bbox scoring instead of Gemini, including updates to model IDs and parameters. Signed-off-by: Ao Tang <aot@nvidia.com>
| try: | ||
| responses = self.model.generate(prompts, images if self.multimodal else None, self.inference_config) | ||
|
|
||
| for idx, response in zip(valid_indices, responses, strict=True): | ||
| self._handle_response_one(tasks, idx, response) | ||
|
|
||
| logger.info(f"{self.name}: processed batch of {len(valid_indices)} items") | ||
|
|
||
| except Exception as e: # noqa: BLE001 | ||
| logger.error(f"{self.name}: batch processing error: {e}") | ||
| for task in tasks: | ||
| task.data.error = f"{self.name}: {e}" | ||
| task.data.is_valid = False |
There was a problem hiding this comment.
strict=True exception cascades through outer handler, overwriting successfully-processed results
zip(valid_indices, responses, strict=True) is inside the outer try/except Exception. When a custom VLMModel returns a list whose length differs from len(valid_indices), zip raises ValueError — but only after all items in the shorter sequence have already been yielded and _handle_response_one has been called for each. The outer except then iterates for task in tasks and sets is_valid = False on every task, including the ones whose responses were fully handled before the error. Those already-updated task states are silently overwritten with a generic failure.
The NVInferenceModel.generate implementation currently never triggers this (its inner except Exception ensures exactly len(prompts) results), but any subclass that raises or returns a shorter list will corrupt legitimate results rather than just surfacing the contract violation.
Signed-off-by: Ao Tang <aot@nvidia.com>
|
|
||
| from nemo_curator.models.omni.base import NVInferenceModel, NVInferenceModelConfig | ||
|
|
||
| DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning" |
There was a problem hiding this comment.
Likely wrong model ID — double
nvidia/ prefix
DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning" has the org name duplicated. NVIDIA NIM API model slugs follow the <org>/<model-name> convention (e.g. nvidia/llama-3.1-70b-instruct, nvidia/nemotron-4-340b-reward), so the expected value would be "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning". If the current string is wrong, every OCRScoringQAStage API call will fail — stream_chat_completion_text will raise, NVInferenceModel.generate will catch it silently and return "", and handle_response will mark every image invalid with "empty response from model". The same double-prefix is hardcoded independently in ocr_scoring_qa.py:169 and tutorials/synthetic/omni/ocr_pipeline.py:54,254 — it would also be cleaner to reference DEFAULT_MODEL_ID there rather than duplicating the string literal.
| return OpenAI(base_url=base_url, api_key=api_key) | ||
|
|
||
|
|
||
| def stream_chat_completion_text( # noqa: PLR0913 |
There was a problem hiding this comment.
Is there a reason to not reuse / extend our OpenAiClietn / AsyncOpenAiClient
| raise ValueError(msg) | ||
|
|
||
|
|
||
| class Model(ABC): |
There was a problem hiding this comment.
I'm really wary of new generic base classes... I believe we already have https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/models/base.py
Can we extend those? What does this give?
Similarly what does ModelConfig base class give rather than just having kwargs? vLLM accepts multiple kwargs these just being three of them...
Is there a reason?
| def _maybe_set_cuda_device(self) -> None: | ||
| """Set the current CUDA device for this stage, if configured.""" | ||
| if self.cuda_devices is None or len(self.cuda_devices) == 0: | ||
| return | ||
| import torch | ||
|
|
||
| if not torch.cuda.is_available(): | ||
| msg = "CUDA is not available" | ||
| raise RuntimeError(msg) | ||
| torch.cuda.set_device(self.cuda_devices[0]) | ||
| os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(f"{d}" for d in self.cuda_devices) |
There was a problem hiding this comment.
Why are we doing this instead of letting Xenna / Ray Data do it?
| {{ | ||
| "ocr_mode": "word" or "line", | ||
| "text": [ | ||
| {{ | ||
| "idx": <integer matching input idx>, |
There was a problem hiding this comment.
Greedy DOTALL regex silently eats the entire JSON response when reasoning contains
{
_JSON_OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL) uses a greedy .* that extends from the first { in the string to the last }. For a reasoning model like Nemotron-Nano-Omni, the thinking/scratchpad before the final answer almost always contains JSON examples or structured text with braces (e.g. {"bbox_match": 10}). When that happens, the single greedy finditer match spans from that first { all the way to the closing } of the actual response JSON — producing a non-parseable fragment that fails json.loads. Because the greedy match consumed the entire tail of the string, finditer yields no further matches, and _parse_json_object returns None, marking the image invalid with "could not parse JSON".
Signed-off-by: Ao Tang <aot@nvidia.com>
| def __init__( # noqa: PLR0913 | ||
| self, | ||
| model_id: str = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning", | ||
| temperature: float = 1.0, | ||
| max_tokens: int = 16384, | ||
| min_bbox_match: int = 5, | ||
| max_text_errors: int = 0, | ||
| fail_on_missing_text: bool = False, | ||
| dense_dump_prob: float = 0.05, | ||
| batch_size: int | None = None, | ||
| priority_mode: bool = False, | ||
| **kwargs: Any, # noqa: ANN401 | ||
| ) -> None: | ||
| """Initialise the combined scoring + QA stage. | ||
|
|
||
| Args: | ||
| model_id: NVIDIA Inference API model to use as verifier. | ||
| temperature: Sampling temperature. | ||
| max_tokens: Max tokens for the verifier response. Nemotron | ||
| reasoning consumes most of this budget before emitting the | ||
| final JSON; 16k leaves enough headroom for both. | ||
| min_bbox_match: Minimum ``bbox_match`` score for a valid bbox. | ||
| Nemotron's near-trinary scoring (0/5/10) makes thresholds | ||
| 1-5 essentially equivalent; 5 keeps a quality gate without | ||
| over-pruning short text. | ||
| max_text_errors: Maximum ``text_errors`` count for a valid bbox. | ||
| fail_on_missing_text: If ``True``, mark the whole image invalid | ||
| when the verifier reports missing text. Defaults to | ||
| ``False`` — missing text only disables the dense-dump QA | ||
| turn. | ||
| dense_dump_prob: Probability (0-1) of generating a single-turn | ||
| dense dump conversation instead of multi-turn QA, for images | ||
| where OCR is provably complete (no missing text). Tuned | ||
| low because the verifier tends to under-report missing | ||
| text, so "provably complete" fires more often than the | ||
| underlying coverage warrants. | ||
| batch_size: Override the default batch size of 16. | ||
| priority_mode: Use priority API queue (lower latency, higher cost). | ||
| """ | ||
| self._scoring_model_id = model_id | ||
| self.min_bbox_match = min_bbox_match | ||
| self.max_text_errors = max_text_errors | ||
| self.fail_on_missing_text = fail_on_missing_text | ||
| self.dense_dump_prob = dense_dump_prob | ||
| super().__init__( | ||
| model=NVInferenceModel( | ||
| model_id=model_id, | ||
| max_tokens=max_tokens, | ||
| temperature=temperature, | ||
| top_p=1.0, | ||
| priority_mode=priority_mode, | ||
| ), | ||
| batch_size=batch_size or self.batch_size, | ||
| **kwargs, | ||
| ) |
There was a problem hiding this comment.
Broken
**kwargs forwarding crashes construction for any extra argument
OCRScoringQAStage.__init__ collects **kwargs and passes them verbatim to super().__init__(), but ModelProcessingStage.__init__ only accepts model and batch_size — no **kwargs. Any caller who passes an extra keyword argument will get an immediate TypeError at construction time. The **kwargs: Any signature advertises extensibility that is silently broken.
Signed-off-by: Ao Tang <aot@nvidia.com>
| # GIT_LFS_SKIP_SMUDGE=1: clone source only, model weights are mounted at runtime | ||
| # TORCH_CUDA_ARCH_LIST: compile for SM 8.0/8.6/8.9/9.0 to cover A100, A10, RTX Ada, H100 | ||
| RUN uv pip install hatchling ninja editables && \ | ||
| GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \ |
There was a problem hiding this comment.
The
git clone fetches the HEAD of the default branch with no pinned commit or tag. Every Docker build will silently pick up whatever code is currently in the HuggingFace repo at build time, making images non-reproducible and creating a supply-chain risk. Add --depth 1 (avoids downloading history) and check out a specific tag or commit so the build is locked to a known-good version of the extension source.
| GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \ | |
| GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 --branch <TAG_OR_COMMIT> https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \ |
Description
Adds the Nemotron OCR SDG pipeline — a multimodal synthetic data generation pipeline that converts images into structured OCR + QA conversation data for vision-language model training.
Pipeline stages
Key components
Tests
63 new unit tests across `tests/tasks/`, `tests/models/`, and `tests/stages/synthetic/omni/` — all CPU-only, no GPU required.
Checklist