Nemotron OCR SDG Pipeline by suiyoubi · Pull Request #1899 · NVIDIA-NeMo/Curator

suiyoubi · 2026-04-30T15:49:47Z

Description

Adds the Nemotron OCR SDG pipeline — a multimodal synthetic data generation pipeline that converts images into structured OCR + QA conversation data for vision-language model training.

Pipeline stages

Stage	Model	Output
`NemotronOCRV2Stage`	NemotronOCR-v2	Dense word-level OCR with bounding boxes
`OCRScoringQAStage`	Gemini 3 Pro (NVIDIA Inference API)	Scoring, validation, and missing-region detection
`OCRConversationalizeStage`	—	11 output format variants → `ConversationSample`
`OCRDenseQAStage`	—	6 QA types (bbox↔text, point↔text, dense dump)

Key components

`nemo_curator/models/omni/` — `NVInferenceModel` base class for NVIDIA Inference API-backed VLMs; `Gemini3Pro` concrete implementation
`nemo_curator/models/client/nvinference_client.py` — thin streaming client helpers (`get_nvinference_api_key`, `create_openai_client`, `stream_chat_completion_text`)
`nemo_curator/stages/synthetic/omni/base.py` — `VLMProcessingStage` and `ModelProcessingStage` base classes with batched inference, per-prompt error isolation, and setup/teardown lifecycle
`nemo_curator/stages/synthetic/omni/io.py` — `HFDatasetImageReader`, `TarImageReader`, `ParquetReader`, `SkipProcessedStage`, `ResultWriterStage`
`nemo_curator/tasks/ocr.py` — `OCRDenseWord` and `OCRData` task data classes
`docker/Dockerfile` — installs `nemotron-ocr-v2` from source (no-build-isolation, CUDA arch list for A100/A10/RTX Ada/H100)
`tutorials/synthetic/omni/hf_ocr_pipeline.py` — end-to-end example using HuggingFace datasets

Tests

63 new unit tests across `tests/tasks/`, `tests/models/`, and `tests/stages/synthetic/omni/` — all CPU-only, no GPU required.

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Ao Tang <aot@nvidia.com>

- Updated the `transformers` dependency from `<=4.55.2` to `==4.57.0` in `pyproject.toml` and `uv.lock` to ensure compatibility with the Cosmos Embed imports. - Added a new `Gemini3Pro` model class in `gemini.py` utilizing the NVIDIA Inference API. - Introduced `DescriptionOutputStage` and `DescriptionValidatorStage` for processing and validating image descriptions, respectively. - Enhanced `VLMProcessingStage` to improve GPU resource handling and added a `num_workers` parameter to `DescriptionStage` for better scalability. This commit enhances the model's capabilities and ensures that dependencies are up-to-date for optimal performance. Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

… aot/omni_sdg

Signed-off-by: Ao Tang <aot@nvidia.com>

Removes description-specific stages (description*.py, description pipeline tutorials) that belong on aot/omni_description. Adds OCR result inspection/review scripts and shared design docs. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… pipeline tutorial Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

…Data and OCRDenseWord classes Signed-off-by: Ao Tang <aot@nvidia.com>

--metrics-dir wires Ray metrics into the running Prometheus/Grafana instance. --run-name sets SLURM_JOB_NAME so Xenna labels the run on the ray_pipeline_input_tasks metric for human-readable identification in Grafana. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… timing - OCRConversationData.to_dict(): call conversation.to_dict() explicitly instead of relying on dataclasses.asdict(), which bypasses the custom ConversationSample serialization and drops the "t" media-type field from image fragments. - RayClient: move Prometheus service-discovery registration to after Ray is started and responsive; add _wait_for_ray_service_discovery_file() so the SD file exists before Prometheus is told to watch it. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Remove internal-only I/O stages from io.py (InputFormat, ImageReaderStage, ImageFolderReaderStage, TarImageReaderStage, JsonlTarImageReaderStage, OcrJsonlReaderStage, JsonlPipelineOutputReaderStage, TarImageReader, ParquetImageReaderStage, ParquetImageReader). These classes are only used by the internal ocr_pipeline.py and will live on aot/omni_sdg_internal. Public io.py now exports: HFDatasetImageReaderStage, SkipProcessedStage, ResultWriterStage, merge_output_shards, ImageWriterStage, and the FileReader helpers (load_image_from_task, TarFileReader, etc.). Add tests/stages/test_hf_dataset_image_reader.py covering HFDatasetImageReaderStage. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

ImageWriterStage is not used by hf_ocr_pipeline.py and only makes sense with the internal JSONL+tar pipeline; moved to aot/omni_sdg_internal. TarFileReader, ParquetFileReader, _file_readers dispatcher, and the deprecated _parse_tar_slice_path wrapper are removed. In the public HF pipeline images are always regular JPEG files on disk, so load_image_from_task is simplified to a single RegularFileReader call. io.py: 870 → 631 lines. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…e_url SUPPORTED_IMAGE_EXTENSIONS was only used by the internal reader classes removed in the previous commit. FileReader.read_image_url() was never called in this branch — drop it and its now-unused base64 import. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…ng in nvinference_client.py Signed-off-by: Ao Tang <aot@nvidia.com>

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps · 2026-04-30T20:46:02Z

+            image_path = self.image_dir / f"{image_id}.jpg"
+            if not image_path.exists():


image_id used as filename without path sanitization

f"{image_id}.jpg" is used directly as a filename. If the id_column value contains a / (e.g. "en/doc_001"), Path(image_dir) / "en/doc_001.jpg" silently creates a subdirectory en/ instead of a flat file. The subdirectory won't be created by self.image_dir.mkdir(parents=True, exist_ok=True) above (that only creates image_dir itself), so the subsequent pil_image.save(image_path) will fail with FileNotFoundError.

Consider sanitizing:

safe_id = image_id.replace("/", "_").replace("\\", "_") image_path = self.image_dir / f"{safe_id}.jpg"

voegtlel · 2026-05-04T09:41:32Z

+        pass
+
+
+class NVInferenceModel(VLMModel):


For discussion: Well, we can let it be "nvinference", or generic "OpenAI-API-compatible" or something? Basically we're not specifically targeting nvinference, but rather any openai compatible api. Although internally we would of course focus on nvinference.

I think we should just leave it for nvinference for now. promoting NV inference infra isn't a bad idea

voegtlel · 2026-05-04T09:55:19Z

+        content.append({"type": "text", "text": prompt})
+        return content
+
+    def generate(


Meaning, we only have static batching for now, no dynamic batching?

thats true. Do we expect perf gain for inference API from using dynamic batching ?

voegtlel · 2026-05-04T09:58:52Z

+
+
+@dataclass(kw_only=True)
+class OCRDenseWord:


I guess we should rename. It's not necessarily a word, but can also be a line (if using the line prediction mode).

voegtlel · 2026-05-04T10:01:19Z

+DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro"
+
+
+class Gemini3Pro(NVInferenceModel):


Actually, was thinking if we should set our just released omni model as default?

Signed-off-by: Ao Tang <aot@nvidia.com>

Updated test cases across multiple files to replace instances of OCRDenseWord with OCRDenseItem, reflecting the new class structure. Adjusted related helper functions and assertions accordingly. Additionally, modified the OCR pipeline to utilize Nemotron-Nano-Omni for bbox scoring instead of Gemini, including updates to model IDs and parameters. Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps · 2026-05-12T14:10:13Z

+        try:
+            responses = self.model.generate(prompts, images if self.multimodal else None, self.inference_config)
+
+            for idx, response in zip(valid_indices, responses, strict=True):
+                self._handle_response_one(tasks, idx, response)
+
+            logger.info(f"{self.name}: processed batch of {len(valid_indices)} items")
+
+        except Exception as e:  # noqa: BLE001
+            logger.error(f"{self.name}: batch processing error: {e}")
+            for task in tasks:
+                task.data.error = f"{self.name}: {e}"
+                task.data.is_valid = False


strict=True exception cascades through outer handler, overwriting successfully-processed results

zip(valid_indices, responses, strict=True) is inside the outer try/except Exception. When a custom VLMModel returns a list whose length differs from len(valid_indices), zip raises ValueError — but only after all items in the shorter sequence have already been yielded and _handle_response_one has been called for each. The outer except then iterates for task in tasks and sets is_valid = False on every task, including the ones whose responses were fully handled before the error. Those already-updated task states are silently overwritten with a generic failure.

The NVInferenceModel.generate implementation currently never triggers this (its inner except Exception ensures exactly len(prompts) results), but any subclass that raises or returns a shorter list will corrupt legitimate results rather than just surfacing the contract violation.

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps · 2026-05-12T16:29:10Z

+
+from nemo_curator.models.omni.base import NVInferenceModel, NVInferenceModelConfig
+
+DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"


Likely wrong model ID — double nvidia/ prefix

DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning" has the org name duplicated. NVIDIA NIM API model slugs follow the <org>/<model-name> convention (e.g. nvidia/llama-3.1-70b-instruct, nvidia/nemotron-4-340b-reward), so the expected value would be "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning". If the current string is wrong, every OCRScoringQAStage API call will fail — stream_chat_completion_text will raise, NVInferenceModel.generate will catch it silently and return "", and handle_response will mark every image invalid with "empty response from model". The same double-prefix is hardcoded independently in ocr_scoring_qa.py:169 and tutorials/synthetic/omni/ocr_pipeline.py:54,254 — it would also be cleaner to reference DEFAULT_MODEL_ID there rather than duplicating the string literal.

praateekmahajan · 2026-05-12T18:34:14Z

+    return OpenAI(base_url=base_url, api_key=api_key)
+
+
+def stream_chat_completion_text(  # noqa: PLR0913


Is there a reason to not reuse / extend our OpenAiClietn / AsyncOpenAiClient

Curator/nemo_curator/models/client/openai_client.py

Line 86 in 02dfb3f

class AsyncOpenAIClient(AsyncLLMClient):

praateekmahajan · 2026-05-12T18:35:33Z

+            raise ValueError(msg)
+
+
+class Model(ABC):


I'm really wary of new generic base classes... I believe we already have https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/models/base.py

Can we extend those? What does this give?
Similarly what does ModelConfig base class give rather than just having kwargs? vLLM accepts multiple kwargs these just being three of them...

Is there a reason?

praateekmahajan · 2026-05-12T18:36:12Z

+    def _maybe_set_cuda_device(self) -> None:
+        """Set the current CUDA device for this stage, if configured."""
+        if self.cuda_devices is None or len(self.cuda_devices) == 0:
+            return
+        import torch
+
+        if not torch.cuda.is_available():
+            msg = "CUDA is not available"
+            raise RuntimeError(msg)
+        torch.cuda.set_device(self.cuda_devices[0])
+        os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(f"{d}" for d in self.cuda_devices)


Why are we doing this instead of letting Xenna / Ray Data do it?

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps · 2026-05-14T13:45:30Z

+{{
+  "ocr_mode": "word" or "line",
+  "text": [
+    {{
+      "idx": <integer matching input idx>,


Greedy DOTALL regex silently eats the entire JSON response when reasoning contains {

_JSON_OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL) uses a greedy .* that extends from the first { in the string to the last }. For a reasoning model like Nemotron-Nano-Omni, the thinking/scratchpad before the final answer almost always contains JSON examples or structured text with braces (e.g. {"bbox_match": 10}). When that happens, the single greedy finditer match spans from that first { all the way to the closing } of the actual response JSON — producing a non-parseable fragment that fails json.loads. Because the greedy match consumed the entire tail of the string, finditer yields no further matches, and _parse_json_object returns None, marking the image invalid with "could not parse JSON".

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps · 2026-05-14T14:02:20Z

+    def __init__(  # noqa: PLR0913
+        self,
+        model_id: str = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
+        temperature: float = 1.0,
+        max_tokens: int = 16384,
+        min_bbox_match: int = 5,
+        max_text_errors: int = 0,
+        fail_on_missing_text: bool = False,
+        dense_dump_prob: float = 0.05,
+        batch_size: int | None = None,
+        priority_mode: bool = False,
+        **kwargs: Any,  # noqa: ANN401
+    ) -> None:
+        """Initialise the combined scoring + QA stage.
+
+        Args:
+            model_id: NVIDIA Inference API model to use as verifier.
+            temperature: Sampling temperature.
+            max_tokens: Max tokens for the verifier response.  Nemotron
+                reasoning consumes most of this budget before emitting the
+                final JSON; 16k leaves enough headroom for both.
+            min_bbox_match: Minimum ``bbox_match`` score for a valid bbox.
+                Nemotron's near-trinary scoring (0/5/10) makes thresholds
+                1-5 essentially equivalent; 5 keeps a quality gate without
+                over-pruning short text.
+            max_text_errors: Maximum ``text_errors`` count for a valid bbox.
+            fail_on_missing_text: If ``True``, mark the whole image invalid
+                when the verifier reports missing text.  Defaults to
+                ``False`` — missing text only disables the dense-dump QA
+                turn.
+            dense_dump_prob: Probability (0-1) of generating a single-turn
+                dense dump conversation instead of multi-turn QA, for images
+                where OCR is provably complete (no missing text).  Tuned
+                low because the verifier tends to under-report missing
+                text, so "provably complete" fires more often than the
+                underlying coverage warrants.
+            batch_size: Override the default batch size of 16.
+            priority_mode: Use priority API queue (lower latency, higher cost).
+        """
+        self._scoring_model_id = model_id
+        self.min_bbox_match = min_bbox_match
+        self.max_text_errors = max_text_errors
+        self.fail_on_missing_text = fail_on_missing_text
+        self.dense_dump_prob = dense_dump_prob
+        super().__init__(
+            model=NVInferenceModel(
+                model_id=model_id,
+                max_tokens=max_tokens,
+                temperature=temperature,
+                top_p=1.0,
+                priority_mode=priority_mode,
+            ),
+            batch_size=batch_size or self.batch_size,
+            **kwargs,
+        )


Broken **kwargs forwarding crashes construction for any extra argument

OCRScoringQAStage.__init__ collects **kwargs and passes them verbatim to super().__init__(), but ModelProcessingStage.__init__ only accepts model and batch_size — no **kwargs. Any caller who passes an extra keyword argument will get an immediate TypeError at construction time. The **kwargs: Any signature advertises extensibility that is silently broken.

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps · 2026-05-14T14:58:17Z

+# GIT_LFS_SKIP_SMUDGE=1: clone source only, model weights are mounted at runtime
+# TORCH_CUDA_ARCH_LIST: compile for SM 8.0/8.6/8.9/9.0 to cover A100, A10, RTX Ada, H100
+RUN uv pip install hatchling ninja editables && \
+    GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \


The git clone fetches the HEAD of the default branch with no pinned commit or tag. Every Docker build will silently pick up whatever code is currently in the HuggingFace repo at build time, making images non-reproducible and creating a supply-chain risk. Add --depth 1 (avoids downloading history) and check out a specific tag or commit so the build is locked to a known-good version of the extension source.

Suggested change

GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \

GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 --branch <TAG_OR_COMMIT> https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \

suiyoubi and others added 30 commits March 4, 2026 10:40

omni sdg

b6bf8c1

Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor model initialization and enhance pipeline execution

2b9f157

Signed-off-by: Ao Tang <aot@nvidia.com>

Merge branch 'main' of github.com:NVIDIA-NeMo/Curator into aot/omni_sdg

21f417f

add metrics support

ec46518

Signed-off-by: Ao Tang <aot@nvidia.com>

Add support for per-stage pip specifications and virtual environments

5271c7c

Signed-off-by: Ao Tang <aot@nvidia.com>

Merge branch 'aot/runtime_env' of github.com:NVIDIA-NeMo/Curator into…

e4df9af

… aot/omni_sdg

Add test

fa9f9a5

Signed-off-by: Ao Tang <aot@nvidia.com>

Merge branch 'main' of github.com:NVIDIA-NeMo/Curator into aot/omni_sdg

2d34d90

Signed-off-by: Ao Tang <aot@nvidia.com>

Add ocr nemotron v2 pipeline

df58099

Signed-off-by: Ao Tang <aot@nvidia.com>

Enhance OCR pipeline with new scoring and verification stages

0cb4be7

Signed-off-by: Ao Tang <aot@nvidia.com>

Add OCR conversationalization and output formatting for dense OCR data

f41918c

Signed-off-by: Ao Tang <aot@nvidia.com>

Add HFDatasetImageReaderStage for HuggingFace datasets and create OCR…

cc19ce9

… pipeline tutorial Signed-off-by: Ao Tang <aot@nvidia.com>

Remove per-stage runtime environment design document and example script

94ec830

Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor OCR pipeline and result inspection scripts

3b9a139

Signed-off-by: Ao Tang <aot@nvidia.com>

remove

effe3aa

Signed-off-by: Ao Tang <aot@nvidia.com>

remove rtx

965a87c

Signed-off-by: Ao Tang <aot@nvidia.com>

remove generate_stream

71d5c02

Signed-off-by: Ao Tang <aot@nvidia.com>

refactor

3330d71

Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor ParquetImageReader

2a634d5

Signed-off-by: Ao Tang <aot@nvidia.com>

remove

37a710f

Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor JSON serialization in ResultWriterStage and update ImageTask…

344157a

…Data and OCRDenseWord classes Signed-off-by: Ao Tang <aot@nvidia.com>

Refactor RayClient metrics service discovery and improve error handli…

f5a2326

…ng in nvinference_client.py Signed-off-by: Ao Tang <aot@nvidia.com>

init file

bb74e10

Signed-off-by: Ao Tang <aot@nvidia.com>

copy-pr-bot Bot temporarily deployed to nemo-ci April 30, 2026 20:40 Inactive

greptile-apps Bot reviewed Apr 30, 2026

View reviewed changes

copy-pr-bot Bot temporarily deployed to nemo-ci April 30, 2026 20:51 Inactive

voegtlel suggested changes May 4, 2026

View reviewed changes

suiyoubi added 2 commits May 12, 2026 07:00

change from gemini verification model to nemotron omni

6586c54

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

Add response length validation to ModelProcessingStage

72c8336

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed May 12, 2026

View reviewed changes

praateekmahajan reviewed May 12, 2026

View reviewed changes

refactor

924c8b5

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

remove nvinferencec_client

d9bf271

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

secrect in test

cd1224a

Signed-off-by: Ao Tang <aot@nvidia.com>

greptile-apps Bot reviewed May 14, 2026

View reviewed changes

		image_path = self.image_dir / f"{image_id}.jpg"
		if not image_path.exists():

		DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro"


		class Gemini3Pro(NVInferenceModel):


		from nemo_curator.models.omni.base import NVInferenceModel, NVInferenceModelConfig

		DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"

		return OpenAI(base_url=base_url, api_key=api_key)


		def stream_chat_completion_text( # noqa: PLR0913

	GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \
	GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 --branch <TAG_OR_COMMIT> https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \

Conversation

suiyoubi commented Apr 30, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Description

Pipeline stages

Key components

Tests

Checklist

Uh oh!

greptile-apps Bot Apr 30, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 12, 2026

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

greptile-apps Bot May 14, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

suiyoubi commented Apr 30, 2026 •

edited

Loading