Skip to content

Nemotron OCR SDG Pipeline#1899

Open
suiyoubi wants to merge 49 commits into
mainfrom
aot/omni_sdg
Open

Nemotron OCR SDG Pipeline#1899
suiyoubi wants to merge 49 commits into
mainfrom
aot/omni_sdg

Conversation

@suiyoubi
Copy link
Copy Markdown
Contributor

@suiyoubi suiyoubi commented Apr 30, 2026

Description

Adds the Nemotron OCR SDG pipeline — a multimodal synthetic data generation pipeline that converts images into structured OCR + QA conversation data for vision-language model training.

Pipeline stages

Stage Model Output
`NemotronOCRV2Stage` NemotronOCR-v2 Dense word-level OCR with bounding boxes
`OCRScoringQAStage` Gemini 3 Pro (NVIDIA Inference API) Scoring, validation, and missing-region detection
`OCRConversationalizeStage` 11 output format variants → `ConversationSample`
`OCRDenseQAStage` 6 QA types (bbox↔text, point↔text, dense dump)

Key components

  • `nemo_curator/models/omni/` — `NVInferenceModel` base class for NVIDIA Inference API-backed VLMs; `Gemini3Pro` concrete implementation
  • `nemo_curator/models/client/nvinference_client.py` — thin streaming client helpers (`get_nvinference_api_key`, `create_openai_client`, `stream_chat_completion_text`)
  • `nemo_curator/stages/synthetic/omni/base.py` — `VLMProcessingStage` and `ModelProcessingStage` base classes with batched inference, per-prompt error isolation, and setup/teardown lifecycle
  • `nemo_curator/stages/synthetic/omni/io.py` — `HFDatasetImageReader`, `TarImageReader`, `ParquetReader`, `SkipProcessedStage`, `ResultWriterStage`
  • `nemo_curator/tasks/ocr.py` — `OCRDenseWord` and `OCRData` task data classes
  • `docker/Dockerfile` — installs `nemotron-ocr-v2` from source (no-build-isolation, CUDA arch list for A100/A10/RTX Ada/H100)
  • `tutorials/synthetic/omni/hf_ocr_pipeline.py` — end-to-end example using HuggingFace datasets

Tests

63 new unit tests across `tests/tasks/`, `tests/models/`, and `tests/stages/synthetic/omni/` — all CPU-only, no GPU required.

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

suiyoubi and others added 30 commits March 4, 2026 10:40
Signed-off-by: Ao Tang <aot@nvidia.com>
- Updated the `transformers` dependency from `<=4.55.2` to `==4.57.0` in `pyproject.toml` and `uv.lock` to ensure compatibility with the Cosmos Embed imports.
- Added a new `Gemini3Pro` model class in `gemini.py` utilizing the NVIDIA Inference API.
- Introduced `DescriptionOutputStage` and `DescriptionValidatorStage` for processing and validating image descriptions, respectively.
- Enhanced `VLMProcessingStage` to improve GPU resource handling and added a `num_workers` parameter to `DescriptionStage` for better scalability.

This commit enhances the model's capabilities and ensures that dependencies are up-to-date for optimal performance.

Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Removes description-specific stages (description*.py, description
pipeline tutorials) that belong on aot/omni_description.
Adds OCR result inspection/review scripts and shared design docs.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… pipeline tutorial

Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
…Data and OCRDenseWord classes

Signed-off-by: Ao Tang <aot@nvidia.com>
--metrics-dir wires Ray metrics into the running Prometheus/Grafana instance.
--run-name sets SLURM_JOB_NAME so Xenna labels the run on the
ray_pipeline_input_tasks metric for human-readable identification in Grafana.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… timing

- OCRConversationData.to_dict(): call conversation.to_dict() explicitly
  instead of relying on dataclasses.asdict(), which bypasses the custom
  ConversationSample serialization and drops the "t" media-type field from
  image fragments.
- RayClient: move Prometheus service-discovery registration to after Ray
  is started and responsive; add _wait_for_ray_service_discovery_file()
  so the SD file exists before Prometheus is told to watch it.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
Remove internal-only I/O stages from io.py (InputFormat, ImageReaderStage,
ImageFolderReaderStage, TarImageReaderStage, JsonlTarImageReaderStage,
OcrJsonlReaderStage, JsonlPipelineOutputReaderStage, TarImageReader,
ParquetImageReaderStage, ParquetImageReader).  These classes are only
used by the internal ocr_pipeline.py and will live on aot/omni_sdg_internal.

Public io.py now exports: HFDatasetImageReaderStage, SkipProcessedStage,
ResultWriterStage, merge_output_shards, ImageWriterStage, and the
FileReader helpers (load_image_from_task, TarFileReader, etc.).

Add tests/stages/test_hf_dataset_image_reader.py covering HFDatasetImageReaderStage.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
ImageWriterStage is not used by hf_ocr_pipeline.py and only makes sense
with the internal JSONL+tar pipeline; moved to aot/omni_sdg_internal.

TarFileReader, ParquetFileReader, _file_readers dispatcher, and the
deprecated _parse_tar_slice_path wrapper are removed.  In the public HF
pipeline images are always regular JPEG files on disk, so load_image_from_task
is simplified to a single RegularFileReader call.

io.py: 870 → 631 lines.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…e_url

SUPPORTED_IMAGE_EXTENSIONS was only used by the internal reader classes
removed in the previous commit.  FileReader.read_image_url() was never
called in this branch — drop it and its now-unused base64 import.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
…ng in nvinference_client.py

Signed-off-by: Ao Tang <aot@nvidia.com>
Signed-off-by: Ao Tang <aot@nvidia.com>
Comment on lines +203 to +204
image_path = self.image_dir / f"{image_id}.jpg"
if not image_path.exists():
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 image_id used as filename without path sanitization

f"{image_id}.jpg" is used directly as a filename. If the id_column value contains a / (e.g. "en/doc_001"), Path(image_dir) / "en/doc_001.jpg" silently creates a subdirectory en/ instead of a flat file. The subdirectory won't be created by self.image_dir.mkdir(parents=True, exist_ok=True) above (that only creates image_dir itself), so the subsequent pil_image.save(image_path) will fail with FileNotFoundError.

Consider sanitizing:

safe_id = image_id.replace("/", "_").replace("\\", "_")
image_path = self.image_dir / f"{safe_id}.jpg"

Comment thread nemo_curator/models/omni/base.py Outdated
pass


class NVInferenceModel(VLMModel):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For discussion: Well, we can let it be "nvinference", or generic "OpenAI-API-compatible" or something? Basically we're not specifically targeting nvinference, but rather any openai compatible api. Although internally we would of course focus on nvinference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should just leave it for nvinference for now. promoting NV inference infra isn't a bad idea

Comment thread nemo_curator/models/omni/base.py Outdated
content.append({"type": "text", "text": prompt})
return content

def generate(
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Meaning, we only have static batching for now, no dynamic batching?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thats true. Do we expect perf gain for inference API from using dynamic batching ?

Comment thread nemo_curator/tasks/ocr.py Outdated


@dataclass(kw_only=True)
class OCRDenseWord:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we should rename. It's not necessarily a word, but can also be a line (if using the line prediction mode).

Comment thread nemo_curator/models/omni/gemini.py Outdated
DEFAULT_MODEL_ID = "gcp/google/gemini-3-pro"


class Gemini3Pro(NVInferenceModel):
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually, was thinking if we should set our just released omni model as default?

suiyoubi added 2 commits May 12, 2026 07:00
Signed-off-by: Ao Tang <aot@nvidia.com>
Updated test cases across multiple files to replace instances of OCRDenseWord with OCRDenseItem, reflecting the new class structure. Adjusted related helper functions and assertions accordingly. Additionally, modified the OCR pipeline to utilize Nemotron-Nano-Omni for bbox scoring instead of Gemini, including updates to model IDs and parameters.

Signed-off-by: Ao Tang <aot@nvidia.com>
Comment on lines +271 to +283
try:
responses = self.model.generate(prompts, images if self.multimodal else None, self.inference_config)

for idx, response in zip(valid_indices, responses, strict=True):
self._handle_response_one(tasks, idx, response)

logger.info(f"{self.name}: processed batch of {len(valid_indices)} items")

except Exception as e: # noqa: BLE001
logger.error(f"{self.name}: batch processing error: {e}")
for task in tasks:
task.data.error = f"{self.name}: {e}"
task.data.is_valid = False
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 strict=True exception cascades through outer handler, overwriting successfully-processed results

zip(valid_indices, responses, strict=True) is inside the outer try/except Exception. When a custom VLMModel returns a list whose length differs from len(valid_indices), zip raises ValueError — but only after all items in the shorter sequence have already been yielded and _handle_response_one has been called for each. The outer except then iterates for task in tasks and sets is_valid = False on every task, including the ones whose responses were fully handled before the error. Those already-updated task states are silently overwritten with a generic failure.

The NVInferenceModel.generate implementation currently never triggers this (its inner except Exception ensures exactly len(prompts) results), but any subclass that raises or returns a shorter list will corrupt legitimate results rather than just surfacing the contract violation.

Signed-off-by: Ao Tang <aot@nvidia.com>

from nemo_curator.models.omni.base import NVInferenceModel, NVInferenceModelConfig

DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning"
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Likely wrong model ID — double nvidia/ prefix

DEFAULT_MODEL_ID = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning" has the org name duplicated. NVIDIA NIM API model slugs follow the <org>/<model-name> convention (e.g. nvidia/llama-3.1-70b-instruct, nvidia/nemotron-4-340b-reward), so the expected value would be "nvidia/nemotron-3-nano-omni-30b-a3b-reasoning". If the current string is wrong, every OCRScoringQAStage API call will fail — stream_chat_completion_text will raise, NVInferenceModel.generate will catch it silently and return "", and handle_response will mark every image invalid with "empty response from model". The same double-prefix is hardcoded independently in ocr_scoring_qa.py:169 and tutorials/synthetic/omni/ocr_pipeline.py:54,254 — it would also be cleaner to reference DEFAULT_MODEL_ID there rather than duplicating the string literal.

return OpenAI(base_url=base_url, api_key=api_key)


def stream_chat_completion_text( # noqa: PLR0913
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason to not reuse / extend our OpenAiClietn / AsyncOpenAiClient

class AsyncOpenAIClient(AsyncLLMClient):

Comment thread nemo_curator/models/omni/base.py Outdated
raise ValueError(msg)


class Model(ABC):
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm really wary of new generic base classes... I believe we already have https://github.com/NVIDIA-NeMo/Curator/blob/main/nemo_curator/models/base.py

Can we extend those? What does this give?
Similarly what does ModelConfig base class give rather than just having kwargs? vLLM accepts multiple kwargs these just being three of them...

Is there a reason?

Comment on lines +58 to +68
def _maybe_set_cuda_device(self) -> None:
"""Set the current CUDA device for this stage, if configured."""
if self.cuda_devices is None or len(self.cuda_devices) == 0:
return
import torch

if not torch.cuda.is_available():
msg = "CUDA is not available"
raise RuntimeError(msg)
torch.cuda.set_device(self.cuda_devices[0])
os.environ["CUDA_VISIBLE_DEVICES"] = ",".join(f"{d}" for d in self.cuda_devices)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why are we doing this instead of letting Xenna / Ray Data do it?

Signed-off-by: Ao Tang <aot@nvidia.com>
Comment on lines +62 to +66
{{
"ocr_mode": "word" or "line",
"text": [
{{
"idx": <integer matching input idx>,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Greedy DOTALL regex silently eats the entire JSON response when reasoning contains {

_JSON_OBJECT_RE = re.compile(r"\{.*\}", re.DOTALL) uses a greedy .* that extends from the first { in the string to the last }. For a reasoning model like Nemotron-Nano-Omni, the thinking/scratchpad before the final answer almost always contains JSON examples or structured text with braces (e.g. {"bbox_match": 10}). When that happens, the single greedy finditer match spans from that first { all the way to the closing } of the actual response JSON — producing a non-parseable fragment that fails json.loads. Because the greedy match consumed the entire tail of the string, finditer yields no further matches, and _parse_json_object returns None, marking the image invalid with "could not parse JSON".

Signed-off-by: Ao Tang <aot@nvidia.com>
Comment on lines +164 to +218
def __init__( # noqa: PLR0913
self,
model_id: str = "nvidia/nvidia/nemotron-3-nano-omni-30b-a3b-reasoning",
temperature: float = 1.0,
max_tokens: int = 16384,
min_bbox_match: int = 5,
max_text_errors: int = 0,
fail_on_missing_text: bool = False,
dense_dump_prob: float = 0.05,
batch_size: int | None = None,
priority_mode: bool = False,
**kwargs: Any, # noqa: ANN401
) -> None:
"""Initialise the combined scoring + QA stage.

Args:
model_id: NVIDIA Inference API model to use as verifier.
temperature: Sampling temperature.
max_tokens: Max tokens for the verifier response. Nemotron
reasoning consumes most of this budget before emitting the
final JSON; 16k leaves enough headroom for both.
min_bbox_match: Minimum ``bbox_match`` score for a valid bbox.
Nemotron's near-trinary scoring (0/5/10) makes thresholds
1-5 essentially equivalent; 5 keeps a quality gate without
over-pruning short text.
max_text_errors: Maximum ``text_errors`` count for a valid bbox.
fail_on_missing_text: If ``True``, mark the whole image invalid
when the verifier reports missing text. Defaults to
``False`` — missing text only disables the dense-dump QA
turn.
dense_dump_prob: Probability (0-1) of generating a single-turn
dense dump conversation instead of multi-turn QA, for images
where OCR is provably complete (no missing text). Tuned
low because the verifier tends to under-report missing
text, so "provably complete" fires more often than the
underlying coverage warrants.
batch_size: Override the default batch size of 16.
priority_mode: Use priority API queue (lower latency, higher cost).
"""
self._scoring_model_id = model_id
self.min_bbox_match = min_bbox_match
self.max_text_errors = max_text_errors
self.fail_on_missing_text = fail_on_missing_text
self.dense_dump_prob = dense_dump_prob
super().__init__(
model=NVInferenceModel(
model_id=model_id,
max_tokens=max_tokens,
temperature=temperature,
top_p=1.0,
priority_mode=priority_mode,
),
batch_size=batch_size or self.batch_size,
**kwargs,
)
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Broken **kwargs forwarding crashes construction for any extra argument

OCRScoringQAStage.__init__ collects **kwargs and passes them verbatim to super().__init__(), but ModelProcessingStage.__init__ only accepts model and batch_size — no **kwargs. Any caller who passes an extra keyword argument will get an immediate TypeError at construction time. The **kwargs: Any signature advertises extensibility that is silently broken.

Signed-off-by: Ao Tang <aot@nvidia.com>
Comment thread docker/Dockerfile
# GIT_LFS_SKIP_SMUDGE=1: clone source only, model weights are mounted at runtime
# TORCH_CUDA_ARCH_LIST: compile for SM 8.0/8.6/8.9/9.0 to cover A100, A10, RTX Ada, H100
RUN uv pip install hatchling ninja editables && \
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security The git clone fetches the HEAD of the default branch with no pinned commit or tag. Every Docker build will silently pick up whatever code is currently in the HuggingFace repo at build time, making images non-reproducible and creating a supply-chain risk. Add --depth 1 (avoids downloading history) and check out a specific tag or commit so the build is locked to a known-good version of the extension source.

Suggested change
GIT_LFS_SKIP_SMUDGE=1 git clone https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \
GIT_LFS_SKIP_SMUDGE=1 git clone --depth 1 --branch <TAG_OR_COMMIT> https://huggingface.co/nvidia/nemotron-ocr-v2 /tmp/nemotron-ocr-src && \

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants