Chatterbox tts by Ssofja · Pull Request #1976 · NVIDIA-NeMo/Curator

Ssofja · 2026-05-13T13:13:15Z

Description

Add a ChatterboxTTS-based speech synthesis stage (ChatterboxTTSStage) to the NeMo Curator audio pipeline for generating multi-speaker conversation audio from text.

New stage:

ChatterboxTTSStage — Synthesises conversation-turn audio using Chatterbox TTS. Supports both the English-only model (ChatterboxTTS) and the multilingual model (ChatterboxMultilingualTTS, 23 languages). Speaker voices are automatically assigned from a reference audio dataset and stay consistent within each conversation.

Key features:

Two reference audio layouts: wavs/<dialog>/<speaker>.wav (with optional RTTM silence stripping) and MLS <spk>/<book>/<seg>.flac (auto-concatenated to target duration).
Per-conversation random exaggeration range for voice style variation.
RMS-based audio normalisation with clipping protection.
Deterministic filenames (MD5-based) enabling idempotent re-runs — existing output files are reused without re-generation.
Graceful failure handling: TTS errors produce a silence placeholder instead of crashing the pipeline.
Requires 1 GPU (Resources(gpus=1)).
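
The RMS-based normalisation with clipping protection listed above could look roughly like the following. This is a minimal sketch, not the PR's `_normalize_audio`: the function name `normalize_rms`, the `peak_limit` parameter, and the reading of the level as dBFS are assumptions; the -20.0 default mirrors the `normalize_level` argument in the stage signature.

```python
import numpy as np


def normalize_rms(audio, target_dbfs=-20.0, peak_limit=0.99):
    """Scale audio to a target RMS level, then cap the peak to avoid clipping.

    Hypothetical helper for illustration; names and defaults are assumptions.
    """
    audio = np.asarray(audio, dtype=np.float32)
    if audio.size == 0:
        return audio
    rms = float(np.sqrt(np.mean(np.square(audio))))
    if rms < 1e-8:
        # Effectively silent input: scaling would only amplify noise.
        return audio
    target_rms = 10.0 ** (target_dbfs / 20.0)
    scaled = audio * (target_rms / rms)
    peak = float(np.max(np.abs(scaled)))
    if peak > peak_limit:
        # Clipping protection: shrink the whole signal so the peak fits.
        scaled = scaled * (peak_limit / peak)
    return scaled
```

With this approach a quiet but steady signal is raised to the target level, while a signal dominated by a single spike ends up below the target RMS because the peak cap takes priority.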

New files:

nemo_curator/stages/audio/tts/__init__.py
nemo_curator/stages/audio/tts/chatterbox_tts.py
tests/stages/audio/tts/__init__.py
tests/stages/audio/tts/test_chatterbox_tts.py (55 tests)

Usage

from nemo_curator.stages.audio import ChatterboxTTSStage

# English TTS with wavs/ reference layout
stage = ChatterboxTTSStage(
    output_audio_dir="/data/tts_output",
    reference_voices_dataset="/data/reference_voices",
    cfg_weight=0.5,
    exaggeration=0.5,
    temperature=0.8,
)

# Multilingual TTS (e.g. Russian) with MLS reference layout
stage_ru = ChatterboxTTSStage(
    output_audio_dir="/data/tts_output_ru",
    reference_voices_dataset="/data/mls_russian",
    language="ru",
    exaggeration=[0.3, 0.7],  # random per conversation
    max_reference_duration=30.0,
)

# Process conversation turns
from nemo_curator.tasks import AudioTask

tasks = [
    AudioTask(
        data={"utterance": "Hello, how are you?", "speaker": "Alice", "conversation_id": "conv001"},
        task_id="t1",
        dataset_name="my_dataset",
    ),
    AudioTask(
        data={"utterance": "I'm doing well, thanks!", "speaker": "Bob", "conversation_id": "conv001"},
        task_id="t2",
        dataset_name="my_dataset",
    ),
]

results = stage.process_batch(tasks)
# Each result.data now contains "audio_filepath", "duration", and "reference_voice"

Supported languages (multilingual mode):
ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh

Checklist

I am familiar with the Contributing Guide.
New or Existing tests cover these changes.
The documentation is up to date with these changes.

Signed-off-by: Ssofja <sofiakostandian@gmail.com>

copy-pr-bot · 2026-05-13T13:13:19Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps · 2026-05-13T13:17:55Z

Greptile Summary

This PR adds ChatterboxTTSStage, a new audio pipeline stage that synthesises multi-speaker conversation audio using the Chatterbox TTS and ChatterboxMultilingualTTS models. Reference voices are selected from a local audio dataset (wavs or MLS layout), optionally stripped of silence via RTTM files, and kept consistent per speaker within a conversation.

Core generation logic (chatterbox_tts.py): lazy model loading, two reference-audio layouts, per-conversation exaggeration ranges, RMS normalisation, and MD5-keyed deterministic filenames for idempotent re-runs.
Test suite (test_chatterbox_tts.py): 55 tests covering construction, reference discovery, RTTM processing, MLS concatenation, normalisation, batch processing, and lifecycle hooks — all using a fake sine-wave model so no GPU is required.

Confidence Score: 2/5

Not ready to merge — four correctness bugs in the core generation path need to be fixed before the stage can be trusted in production.

The implementation has four independent logic bugs, all in the main file. The language code is validated lowercase but stored and forwarded to the model with its original casing, which would break multilingual inference for any caller who passes an uppercase code. The reference_voice metadata field always contains a meaningless temp-directory name for RTTM-processed or MLS references rather than the actual speaker/dialog ID. RTTM-processed temp files use only os.path.basename as the key, so two dialogs containing a speaker file of the same name silently overwrite each other's temp reference, swapping voices without any warning. Finally, conversation IDs are truncated to 12 characters in the output filename, so structured IDs with a shared prefix cause the second conversation to reuse cached audio generated with a different reference voice while reporting the new (incorrect) reference ID in metadata.

nemo_curator/stages/audio/tts/chatterbox_tts.py requires the most attention; the test file should also be updated to assert the correctness (not just consistency) of reference_voice values and to cover the truncated-ID collision scenario.

Important Files Changed

Filename	Overview
nemo_curator/stages/audio/tts/chatterbox_tts.py	New ChatterboxTTSStage implementation; contains four logic bugs: language code case not normalized before storage, reference_voice metadata uses temp-dir name instead of speaker ID, RTTM temp-file basename collision overwrites reference audio, and conversation_id truncation causes cross-conversation filename collisions.
nemo_curator/stages/audio/tts/__init__.py	Simple package init exporting ChatterboxTTSStage; no issues.
nemo_curator/stages/audio/__init__.py	Adds ChatterboxTTSStage import and __all__ entry; correct and consistent with existing style.
tests/stages/audio/tts/test_chatterbox_tts.py	55 tests covering construction, model loading, reference discovery, RTTM processing, speaker assignment, normalization, and batch processing; tests check consistency of reference_voice but not correctness of its value, so the metadata bug goes undetected.

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Stage as ChatterboxTTSStage
    participant Ref as Reference Resolver
    participant TmpDir as Temp Dir
    participant Model as ChatterboxTTS/MTL
    participant FS as Output FS

    Caller->>Stage: process_batch(tasks)
    Stage->>Stage: _ensure_ready()
    Stage->>Model: from_pretrained(device)
    Stage->>Stage: _load_reference_audio_files()

    loop per AudioTask
        Stage->>Ref: _assign_reference(speaker, conv_id)
        Ref->>TmpDir: write RTTM-stripped WAV or MLS concat WAV
        Ref-->>Stage: ref_path
        Stage->>Stage: _output_filename(conv_id, speaker, text)
        alt file already exists
            Stage->>FS: sf.read(audio_path)
        else
            Stage->>Model: "generate(text, audio_prompt_path=ref_path, ...)"
            Model-->>Stage: wav tensor
            Stage->>Stage: _normalize_audio(wav)
            Stage->>FS: sf.write(audio_path, audio_data, sr)
        end
        Stage-->>Caller: AudioTask with audio_filepath, duration, reference_voice
    end

Reviews (1): Last reviewed commit: "added chatterbox tts stage"

greptile-apps · 2026-05-13T13:17:59Z

+        if language is not None and language.lower() not in SUPPORTED_LANGUAGES:
+            raise ValueError(
+                f"Unsupported language '{language}'. "
+                f"Supported: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
+            )


The language code is validated with .lower() but stored as-is and later passed directly to the model as language_id. If a caller passes "RU" or "FR", it clears the SUPPORTED_LANGUAGES check (because "ru" is in the set), but the raw uppercase string is forwarded to ChatterboxMultilingualTTS.generate. The Chatterbox API expects lowercase ISO 639-1 codes, so inference would either fail or silently produce wrong-language output.

Suggested change

         if language is not None and language.lower() not in SUPPORTED_LANGUAGES:
             raise ValueError(
                 f"Unsupported language '{language}'. "
                 f"Supported: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
             )
+        if language is not None:
+            language = language.lower()
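
The point of the comment above is that validation and storage should use the same normalised value. A standalone sketch of that idea follows; `normalize_language` is a hypothetical helper (the PR does this inline in the constructor), and the language set is copied from the list in the PR description.

```python
from typing import Optional

# Codes taken from the "Supported languages" list in the PR description.
SUPPORTED_LANGUAGES = frozenset({
    "ar", "da", "de", "el", "en", "es", "fi", "fr", "he", "hi", "it", "ja",
    "ko", "ms", "nl", "no", "pl", "pt", "ru", "sv", "sw", "tr", "zh",
})


def normalize_language(language: Optional[str]) -> Optional[str]:
    """Return the lowercase code that should be stored, or None for English-only mode."""
    if language is None:
        return None
    code = language.lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language '{language}'. "
            f"Supported: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
        )
    # Store the same value that was validated, not the caller's original casing.
    return code
```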

greptile-apps · 2026-05-13T13:18:00Z

+            out_path = os.path.join(
+                self.temp_dir, os.path.basename(audio_filepath)
+            )


Both _get_reference_audio_wavs and _get_reference_audio_mls write their RTTM-processed / concatenated output into self.temp_dir using os.path.basename(audio_filepath). When different dialogs contain a speaker file with the same name (e.g. dialog001/Alice.wav and dialog002/Alice.wav), the second write silently overwrites the first temp file. Any speaker already assigned temp_dir/Alice.wav then synthesises audio with the wrong voice without any warning.

Suggested change

-            out_path = os.path.join(
-                self.temp_dir, os.path.basename(audio_filepath)
-            )
+            unique_name = hashlib.md5(audio_filepath.encode()).hexdigest()[:8] + "_" + os.path.basename(audio_filepath)
+            out_path = os.path.join(self.temp_dir, unique_name)
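
The collision described in this comment is easy to reproduce outside the stage. The sketch below compares the basename-only key with a path-hashed key; the two helper functions and the example paths are hypothetical and exist only for illustration.

```python
import hashlib
import os


def temp_name_basename(audio_filepath: str) -> str:
    """Key used in the PR as submitted: only the file's basename."""
    return os.path.basename(audio_filepath)


def temp_name_hashed(audio_filepath: str) -> str:
    """Key in the style of the suggested change: hash of the full path plus basename."""
    prefix = hashlib.md5(audio_filepath.encode()).hexdigest()[:8]
    return f"{prefix}_{os.path.basename(audio_filepath)}"


ref_a = "wavs/dialog001/Alice.wav"
ref_b = "wavs/dialog002/Alice.wav"

# Both references map to the same temp file name, so the second write wins.
print(temp_name_basename(ref_a) == temp_name_basename(ref_b))
# Prefixing a hash of the full source path keeps the two references apart.
print(temp_name_hashed(ref_a) == temp_name_hashed(ref_b))
```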

greptile-apps · 2026-05-13T13:18:01Z

+    @staticmethod
+    def _output_filename(conversation_id: str, speaker: str, text: str) -> str:
+        """Deterministic filename: ``{conv_id_short}_{speaker}_{text_hash}.wav``."""
+        conv_short = conversation_id[:12] if len(conversation_id) > 12 else conversation_id
+        text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
+        return f"{conv_short}_{speaker}_{text_hash}.wav"


Truncating the conversation ID to 12 characters means two conversations whose IDs share a 12-character prefix (common with structured IDs such as session1_conv001 / session1_conv002) generate the same filename for the same speaker and text. On a subsequent run the cached file from the first conversation is reused for the second even though a different reference voice may have been assigned, producing a silent audio/metadata mismatch.

Suggested change

     @staticmethod
     def _output_filename(conversation_id: str, speaker: str, text: str) -> str:
-        """Deterministic filename: ``{conv_id_short}_{speaker}_{text_hash}.wav``."""
-        conv_short = conversation_id[:12] if len(conversation_id) > 12 else conversation_id
+        """Deterministic filename: ``{conv_id_hash}_{speaker}_{text_hash}.wav``."""
+        conv_hash = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()[:12]
         text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
-        return f"{conv_short}_{speaker}_{text_hash}.wav"
+        return f"{conv_hash}_{speaker}_{text_hash}.wav"
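
To make the truncation problem concrete, the sketch below runs both filename schemes on the two conversation IDs mentioned in the comment. The function names `filename_truncated` and `filename_hashed` are illustrative stand-ins for the two versions of `_output_filename`.

```python
import hashlib


def filename_truncated(conversation_id: str, speaker: str, text: str) -> str:
    """Scheme from the PR as submitted: first 12 characters of the conversation ID."""
    conv_short = conversation_id[:12]
    text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
    return f"{conv_short}_{speaker}_{text_hash}.wav"


def filename_hashed(conversation_id: str, speaker: str, text: str) -> str:
    """Scheme from the suggested change: hash of the full conversation ID."""
    conv_hash = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()[:12]
    text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
    return f"{conv_hash}_{speaker}_{text_hash}.wav"


text = "Hello, how are you?"
first = filename_truncated("session1_conv001", "Alice", text)
second = filename_truncated("session1_conv002", "Alice", text)
# Both IDs share the prefix "session1_con", so the names are identical
# and the second conversation would pick up the first one's cached file.
print(first == second)
```

Hashing the full ID removes the shared-prefix collision, at the cost of filenames that no longer show the conversation ID in readable form.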

greptile-apps · 2026-05-13T13:18:02Z

+            reference_wav = self._assign_reference(speaker, conversation_id)
+
+            filename = self._output_filename(conversation_id, speaker, text)
+            audio_path = os.path.join(self.output_audio_dir, filename)
+
+            if os.path.exists(audio_path):
+                audio_data, _ = sf.read(audio_path)
+            else:
+                audio_data = self._generate_turn_audio(
+                    text, reference_wav, conversation_id
+                )
+                sf.write(audio_path, audio_data, self.sample_rate)
+
+            duration = len(audio_data) / self.sample_rate
+
+            out_data = dict(data)
+            out_data["audio_filepath"] = audio_path
+            out_data["duration"] = duration
+            out_data["reference_voice"] = Path(reference_wav).parent.name


Path(reference_wav).parent.name returns the temp-directory name (e.g. chatterbox_ref_abc123) whenever the reference has been RTTM-processed or comes from the MLS layout, because both code paths write to self.temp_dir/<filename>. Only the raw wavs path (no RTTM) has a meaningful parent (the dialog ID). The emitted reference_voice value should be the MLS speaker ID or the dialog/speaker tag, not an ephemeral temp-dir name.

Suggested change

-            reference_wav = self._assign_reference(speaker, conversation_id)
+            reference_wav, ref_id = self._assign_reference(speaker, conversation_id)

             filename = self._output_filename(conversation_id, speaker, text)
             audio_path = os.path.join(self.output_audio_dir, filename)

             if os.path.exists(audio_path):
                 audio_data, _ = sf.read(audio_path)
             else:
                 audio_data = self._generate_turn_audio(
                     text, reference_wav, conversation_id
                 )
                 sf.write(audio_path, audio_data, self.sample_rate)

             duration = len(audio_data) / self.sample_rate

             out_data = dict(data)
             out_data["audio_filepath"] = audio_path
             out_data["duration"] = duration
-            out_data["reference_voice"] = Path(reference_wav).parent.name
+            out_data["reference_voice"] = ref_id
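
The suggested change assumes `_assign_reference` can return a stable identifier alongside the path. One way to derive such an identifier is from the original source path, before any RTTM stripping or MLS concatenation moves the audio into the temp directory. The sketch below is an assumption about how that could be done; `reference_id_from_source`, the `layout` argument, and the example paths are hypothetical.

```python
from pathlib import PurePosixPath


def reference_id_from_source(source_path: str, layout: str) -> str:
    """Derive a stable reference ID from the original (unprocessed) audio path."""
    path = PurePosixPath(source_path)
    if layout == "wavs":
        # wavs/<dialog>/<speaker>.wav -> "<dialog>/<speaker>"
        return f"{path.parent.name}/{path.stem}"
    if layout == "mls":
        # <spk>/<book>/<seg>.flac -> "<spk>"
        return path.parent.parent.name
    raise ValueError(f"Unknown layout: {layout!r}")


raw_ref = PurePosixPath("/data/reference_voices/wavs/dialog001/Alice.wav")
tmp_ref = PurePosixPath("/tmp/chatterbox_ref_abc123/Alice.wav")

print(raw_ref.parent.name)  # dialog001
print(tmp_ref.parent.name)  # chatterbox_ref_abc123
```

The two `print` calls show why `Path(reference_wav).parent.name` is only meaningful for unprocessed files in the wavs layout.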

sarahyurick · 2026-05-13T18:19:48Z

+        top_p: float = 1.0,
+        normalize_audio: bool = True,
+        normalize_level: float = -20.0,
+        **kwargs,


Can this be removed?

sarahyurick · 2026-05-13T18:20:55Z

+        if self.language:
+            os.environ["TRANSFORMERS_ATTN_IMPLEMENTATION"] = "eager"
+            try:
+                import chatterbox.models.t3.llama_configs as _llama_cfgs
+                for _cfg_dict in _llama_cfgs.LLAMA_CONFIGS.values():
+                    _cfg_dict["attn_implementation"] = "eager"
+            except (ImportError, AttributeError):
+                pass
+
+            from chatterbox.mtl_tts import ChatterboxMultilingualTTS
+            self.model = ChatterboxMultilingualTTS.from_pretrained(device=self.device)
+            logger.info(f"Loaded ChatterboxMultilingualTTS (language={self.language})")
+        else:
+            from chatterbox.tts import ChatterboxTTS
+            self.model = ChatterboxTTS.from_pretrained(device=self.device)
+            logger.info("Loaded ChatterboxTTS (English)")


Can these imports be at the top of the script? Same with the environment variable?

sarahyurick · 2026-05-13T18:21:47Z

+        text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
+        return f"{conv_short}_{speaker}_{text_hash}.wav"
+
+    def _ensure_ready(self) -> None:


Let's remove this.

sarahyurick · 2026-05-13T18:22:07Z

+        if not tasks:
+            return []
+
+        self._ensure_ready()


Remove. We should never call setup in process_batch/process.

sarahyurick · 2026-05-13T18:22:22Z

+            return []
+
+        self._ensure_ready()
+        os.makedirs(self.output_audio_dir, exist_ok=True)


Should this be in setup?
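
Taken together, the last three comments ask for one-time work (model loading, output directory creation) to live in a setup hook rather than in `process_batch`. A rough sketch of that split follows, using a standalone class that does not inherit from any NeMo Curator base; `SketchTTSStage`, its `model_loader` argument, and the zero-argument `setup()` signature are assumptions, and the real stage hook may take additional parameters.

```python
import os


class SketchTTSStage:
    """Hypothetical stage showing setup/process separation; not the PR's class."""

    def __init__(self, output_audio_dir, model_loader):
        self.output_audio_dir = output_audio_dir
        self.model_loader = model_loader  # callable that returns a model
        self.model = None

    def setup(self):
        # One-time initialisation: create the output directory and load the model.
        os.makedirs(self.output_audio_dir, exist_ok=True)
        self.model = self.model_loader()

    def process_batch(self, tasks):
        if not tasks:
            return []
        # No lazy initialisation here: processing assumes setup() already ran.
        if self.model is None:
            raise RuntimeError("setup() must be called before process_batch()")
        return [self.model(task) for task in tasks]
```

Raising when the model is missing keeps a forgotten `setup()` call visible instead of hiding it behind an implicit load inside the processing path.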

Ssofja added 2 commits May 13, 2026 17:05

added chatterbox tts for english and multilanguage audio generation

e490108

Signed-off-by: Ssofja <sofiakostandian@gmail.com>

added chatterbox tts stage

253e633

Signed-off-by: Ssofja <sofiakostandian@gmail.com>

Ssofja requested a review from a team as a code owner May 13, 2026 13:13

Ssofja requested review from sarahyurick and removed request for a team May 13, 2026 13:13

greptile-apps Bot reviewed May 13, 2026

sarahyurick reviewed May 13, 2026

Ssofja wants to merge 2 commits into NVIDIA-NeMo:main from Ssofja:chatterbox_tts

2 participants
