Chatterbox tts#1976

Open
Ssofja wants to merge 2 commits into
NVIDIA-NeMo:mainfrom
Ssofja:chatterbox_tts

@Ssofja Ssofja commented May 13, 2026

Description

Add a ChatterboxTTS-based speech synthesis stage (ChatterboxTTSStage) to the NeMo Curator audio pipeline for generating multi-speaker conversation audio from text.

New stage:

  • ChatterboxTTSStage — Synthesises conversation-turn audio using Chatterbox TTS. Supports both the English-only model (ChatterboxTTS) and the multilingual model (ChatterboxMultilingualTTS, 23 languages). Speaker voices are automatically assigned from a reference audio dataset and stay consistent within each conversation.

Key features:

  • Two reference audio layouts: wavs/<dialog>/<speaker>.wav (with optional RTTM silence stripping) and MLS <spk>/<book>/<seg>.flac (auto-concatenated to target duration).
  • Per-conversation random exaggeration range for voice style variation.
  • RMS-based audio normalisation with clipping protection.
  • Deterministic filenames (MD5-based) enabling idempotent re-runs — existing output files are reused without re-generation.
  • Graceful failure handling: TTS errors produce a silence placeholder instead of crashing the pipeline.
  • Requires 1 GPU (Resources(gpus=1)).
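As a rough illustration of the "RMS-based audio normalisation with clipping protection" item above, the idea can be sketched in plain Python. This is not the stage's code: the function name `normalize_rms`, the `peak_limit` parameter, and the use of plain float lists instead of arrays are assumptions made for this example. The -20.0 default mirrors the `normalize_level` value in the stage's signature.

```python
import math
from typing import List, Sequence


def normalize_rms(
    samples: Sequence[float],
    target_dbfs: float = -20.0,  # target RMS level in dB relative to full scale
    peak_limit: float = 0.99,    # hypothetical ceiling used to avoid clipping
) -> List[float]:
    """Scale samples to a target RMS level, then cap the gain so no peak clips."""
    if not samples:
        return []
    rms = math.sqrt(sum(s * s for s in samples) / len(samples))
    if rms < 1e-8:
        # Effectively silence: scaling would only amplify noise.
        return list(samples)
    gain = (10.0 ** (target_dbfs / 20.0)) / rms
    peak = max(abs(s) for s in samples) * gain
    if peak > peak_limit:
        # Clipping protection: reduce the gain so the peak lands on the limit.
        gain *= peak_limit / peak
    return [s * gain for s in samples]
```

With this approach a steady tone ends up at the requested RMS level, while a very spiky signal is held at the peak limit and therefore ends up quieter than the target.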

New files:

  • nemo_curator/stages/audio/tts/__init__.py
  • nemo_curator/stages/audio/tts/chatterbox_tts.py
  • tests/stages/audio/tts/__init__.py
  • tests/stages/audio/tts/test_chatterbox_tts.py (55 tests)

Usage

from nemo_curator.stages.audio import ChatterboxTTSStage

# English TTS with wavs/ reference layout
stage = ChatterboxTTSStage(
    output_audio_dir="/data/tts_output",
    reference_voices_dataset="/data/reference_voices",
    cfg_weight=0.5,
    exaggeration=0.5,
    temperature=0.8,
)

# Multilingual TTS (e.g. Russian) with MLS reference layout
stage_ru = ChatterboxTTSStage(
    output_audio_dir="/data/tts_output_ru",
    reference_voices_dataset="/data/mls_russian",
    language="ru",
    exaggeration=[0.3, 0.7],  # random per conversation
    max_reference_duration=30.0,
)

# Process conversation turns
from nemo_curator.tasks import AudioTask

tasks = [
    AudioTask(
        data={"utterance": "Hello, how are you?", "speaker": "Alice", "conversation_id": "conv001"},
        task_id="t1",
        dataset_name="my_dataset",
    ),
    AudioTask(
        data={"utterance": "I'm doing well, thanks!", "speaker": "Bob", "conversation_id": "conv001"},
        task_id="t2",
        dataset_name="my_dataset",
    ),
]

results = stage.process_batch(tasks)
# Each result.data now contains "audio_filepath", "duration", and "reference_voice"

Supported languages (multilingual mode):
ar, da, de, el, en, es, fi, fr, he, hi, it, ja, ko, ms, nl, no, pl, pt, ru, sv, sw, tr, zh

Checklist

  • I am familiar with the Contributing Guide.
  • New or Existing tests cover these changes.
  • The documentation is up to date with these changes.

Ssofja added 2 commits May 13, 2026 17:05
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
Signed-off-by: Ssofja <sofiakostandian@gmail.com>
@Ssofja Ssofja requested a review from a team as a code owner May 13, 2026 13:13
@Ssofja Ssofja requested review from sarahyurick and removed request for a team May 13, 2026 13:13

copy-pr-bot Bot commented May 13, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

greptile-apps Bot commented May 13, 2026

Greptile Summary

This PR adds ChatterboxTTSStage, a new audio pipeline stage that synthesises multi-speaker conversation audio using the Chatterbox TTS and ChatterboxMultilingualTTS models. Reference voices are selected from a local audio dataset (wavs or MLS layout), optionally stripped of silence via RTTM files, and kept consistent per speaker within a conversation.

  • Core generation logic (chatterbox_tts.py): lazy model loading, two reference-audio layouts, per-conversation exaggeration ranges, RMS normalisation, and MD5-keyed deterministic filenames for idempotent re-runs.
  • Test suite (test_chatterbox_tts.py): 55 tests covering construction, reference discovery, RTTM processing, MLS concatenation, normalisation, batch processing, and lifecycle hooks — all using a fake sine-wave model so no GPU is required.

Confidence Score: 2/5

Not ready to merge — four correctness bugs in the core generation path need to be fixed before the stage can be trusted in production.

The implementation has four independent logic bugs, all in the main file:

  • The language code is validated lowercase but stored and forwarded to the model with its original casing, which would break multilingual inference for any caller who passes an uppercase code.
  • The reference_voice metadata field always contains a meaningless temp-directory name for RTTM-processed or MLS references rather than the actual speaker/dialog ID.
  • RTTM-processed temp files use only os.path.basename as the key, so two dialogs containing a speaker file of the same name silently overwrite each other's temp reference, swapping voices without any warning.
  • Conversation IDs are truncated to 12 characters in the output filename, so structured IDs with a shared prefix cause the second conversation to reuse cached audio generated with a different reference voice while reporting the new (incorrect) reference ID in metadata.

nemo_curator/stages/audio/tts/chatterbox_tts.py requires the most attention; the test file should also be updated to assert the correctness (not just consistency) of reference_voice values and to cover the truncated-ID collision scenario.

Important Files Changed

| Filename | Overview |
| --- | --- |
| nemo_curator/stages/audio/tts/chatterbox_tts.py | New ChatterboxTTSStage implementation; contains four logic bugs: language code case not normalized before storage, reference_voice metadata uses temp-dir name instead of speaker ID, RTTM temp-file basename collision overwrites reference audio, and conversation_id truncation causes cross-conversation filename collisions. |
| nemo_curator/stages/audio/tts/__init__.py | Simple package init exporting ChatterboxTTSStage; no issues. |
| nemo_curator/stages/audio/__init__.py | Adds ChatterboxTTSStage import and __all__ entry; correct and consistent with existing style. |
| tests/stages/audio/tts/test_chatterbox_tts.py | 55 tests covering construction, model loading, reference discovery, RTTM processing, speaker assignment, normalization, and batch processing; tests check consistency of reference_voice but not correctness of its value, so the metadata bug goes undetected. |

Sequence Diagram

sequenceDiagram
    participant Caller
    participant Stage as ChatterboxTTSStage
    participant Ref as Reference Resolver
    participant TmpDir as Temp Dir
    participant Model as ChatterboxTTS/MTL
    participant FS as Output FS

    Caller->>Stage: process_batch(tasks)
    Stage->>Stage: _ensure_ready()
    Stage->>Model: from_pretrained(device)
    Stage->>Stage: _load_reference_audio_files()

    loop per AudioTask
        Stage->>Ref: _assign_reference(speaker, conv_id)
        Ref->>TmpDir: write RTTM-stripped WAV or MLS concat WAV
        Ref-->>Stage: ref_path
        Stage->>Stage: _output_filename(conv_id, speaker, text)
        alt file already exists
            Stage->>FS: sf.read(audio_path)
        else
            Stage->>Model: "generate(text, audio_prompt_path=ref_path, ...)"
            Model-->>Stage: wav tensor
            Stage->>Stage: _normalize_audio(wav)
            Stage->>FS: sf.write(audio_path, audio_data, sr)
        end
        Stage-->>Caller: AudioTask with audio_filepath, duration, reference_voice
    end

Reviews (1): Last reviewed commit: "added chatterbox tts stage"

Comment on lines +118 to +122
if language is not None and language.lower() not in SUPPORTED_LANGUAGES:
    raise ValueError(
        f"Unsupported language '{language}'. "
        f"Supported: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
    )

P1 The language code is validated with .lower() but stored as-is and later passed directly to the model as language_id. If a caller passes "RU" or "FR", it clears the SUPPORTED_LANGUAGES check (because "ru" is in the set), but the raw uppercase string is forwarded to ChatterboxMultilingualTTS.generate. The Chatterbox API expects lowercase ISO 639-1 codes, so inference would either fail or silently produce wrong-language output.

Suggested change
  if language is not None and language.lower() not in SUPPORTED_LANGUAGES:
      raise ValueError(
          f"Unsupported language '{language}'. "
          f"Supported: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
      )
+ if language is not None:
+     language = language.lower()
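A standalone sketch of the normalise-then-validate pattern this comment asks for follows. The helper name `normalize_language` is hypothetical; the language set is copied from the "Supported languages" list in the PR description. It only illustrates that the value stored and forwarded should be the same lowercase value that was validated.

```python
from typing import Optional

# The 23 codes listed in the PR description for multilingual mode.
SUPPORTED_LANGUAGES = frozenset({
    "ar", "da", "de", "el", "en", "es", "fi", "fr", "he", "hi", "it", "ja",
    "ko", "ms", "nl", "no", "pl", "pt", "ru", "sv", "sw", "tr", "zh",
})


def normalize_language(language: Optional[str]) -> Optional[str]:
    """Return the lowercase code to store, or None for English-only mode."""
    if language is None:
        return None
    code = language.lower()
    if code not in SUPPORTED_LANGUAGES:
        raise ValueError(
            f"Unsupported language '{language}'. "
            f"Supported: {', '.join(sorted(SUPPORTED_LANGUAGES))}"
        )
    # Return the normalised value so callers never see the original casing.
    return code
```

Returning the normalised code, rather than validating one value and storing another, removes the gap between what was checked and what reaches the model.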

Comment on lines +265 to +267
out_path = os.path.join(
    self.temp_dir, os.path.basename(audio_filepath)
)

P1 Both _get_reference_audio_wavs and _get_reference_audio_mls write their RTTM-processed / concatenated output into self.temp_dir using os.path.basename(audio_filepath). When different dialogs contain a speaker file with the same name (e.g. dialog001/Alice.wav and dialog002/Alice.wav), the second write silently overwrites the first temp file. Any speaker already assigned temp_dir/Alice.wav then synthesises audio with the wrong voice without any warning.

Suggested change
- out_path = os.path.join(
-     self.temp_dir, os.path.basename(audio_filepath)
- )
+ unique_name = hashlib.md5(audio_filepath.encode()).hexdigest()[:8] + "_" + os.path.basename(audio_filepath)
+ out_path = os.path.join(self.temp_dir, unique_name)
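The collision and the suggested fix can be reproduced without any audio. The snippet below is a minimal sketch: `temp_reference_name` is a hypothetical helper that mirrors the suggested hash prefix, and the two paths are made-up examples in the wavs/<dialog>/<speaker>.wav layout.

```python
import hashlib
import os


def temp_reference_name(audio_filepath: str) -> str:
    """Prefix the basename with a short hash of the full source path."""
    digest = hashlib.md5(audio_filepath.encode("utf-8")).hexdigest()[:8]
    return f"{digest}_{os.path.basename(audio_filepath)}"


# Illustrative paths only: same speaker file name in two different dialogs.
first = "/data/reference_voices/wavs/dialog001/Alice.wav"
second = "/data/reference_voices/wavs/dialog002/Alice.wav"

# Basename alone collides, so the second temp file would overwrite the first.
basenames_collide = os.path.basename(first) == os.path.basename(second)

# With the path hash included, each source file gets its own temp name.
names_differ = temp_reference_name(first) != temp_reference_name(second)
```

Because the hash depends only on the source path, the temp name stays stable across runs.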

Comment on lines +417 to +422
@staticmethod
def _output_filename(conversation_id: str, speaker: str, text: str) -> str:
    """Deterministic filename: ``{conv_id_short}_{speaker}_{text_hash}.wav``."""
    conv_short = conversation_id[:12] if len(conversation_id) > 12 else conversation_id
    text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
    return f"{conv_short}_{speaker}_{text_hash}.wav"

P1 Truncating the conversation ID to 12 characters means two conversations whose IDs share a 12-character prefix (common with structured IDs such as session1_conv001 / session1_conv002) generate the same filename for the same speaker and text. On a subsequent run the cached file from the first conversation is reused for the second even though a different reference voice may have been assigned, producing a silent audio/metadata mismatch.

Suggested change
  @staticmethod
  def _output_filename(conversation_id: str, speaker: str, text: str) -> str:
-     """Deterministic filename: ``{conv_id_short}_{speaker}_{text_hash}.wav``."""
-     conv_short = conversation_id[:12] if len(conversation_id) > 12 else conversation_id
+     """Deterministic filename: ``{conv_id_hash}_{speaker}_{text_hash}.wav``."""
+     conv_hash = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()[:12]
      text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
-     return f"{conv_short}_{speaker}_{text_hash}.wav"
+     return f"{conv_hash}_{speaker}_{text_hash}.wav"
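To make the collision concrete, the two naming schemes can be compared side by side using the example IDs from the comment. The function names `filename_truncated` and `filename_hashed` are labels chosen for this sketch; the bodies follow the original and suggested code respectively.

```python
import hashlib


def filename_truncated(conversation_id: str, speaker: str, text: str) -> str:
    """Original scheme: keep only the first 12 characters of the conversation ID."""
    conv_short = conversation_id[:12]
    text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
    return f"{conv_short}_{speaker}_{text_hash}.wav"


def filename_hashed(conversation_id: str, speaker: str, text: str) -> str:
    """Suggested scheme: hash the full conversation ID before shortening."""
    conv_hash = hashlib.md5(conversation_id.encode("utf-8")).hexdigest()[:12]
    text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
    return f"{conv_hash}_{speaker}_{text_hash}.wav"


text = "Hello, how are you?"
# Both IDs start with the same 12 characters, "session1_con".
old_a = filename_truncated("session1_conv001", "Alice", text)
old_b = filename_truncated("session1_conv002", "Alice", text)
new_a = filename_hashed("session1_conv001", "Alice", text)
new_b = filename_hashed("session1_conv002", "Alice", text)
```

Here `old_a` and `old_b` are the same string, while `new_a` and `new_b` differ. A 12-hex-character prefix can still collide in principle, but no longer systematically for IDs that share a prefix.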

Comment on lines +460 to +478
reference_wav = self._assign_reference(speaker, conversation_id)

filename = self._output_filename(conversation_id, speaker, text)
audio_path = os.path.join(self.output_audio_dir, filename)

if os.path.exists(audio_path):
    audio_data, _ = sf.read(audio_path)
else:
    audio_data = self._generate_turn_audio(
        text, reference_wav, conversation_id
    )
    sf.write(audio_path, audio_data, self.sample_rate)

duration = len(audio_data) / self.sample_rate

out_data = dict(data)
out_data["audio_filepath"] = audio_path
out_data["duration"] = duration
out_data["reference_voice"] = Path(reference_wav).parent.name

P1 Path(reference_wav).parent.name returns the temp-directory name (e.g. chatterbox_ref_abc123) whenever the reference has been RTTM-processed or comes from the MLS layout, because both code paths write to self.temp_dir/<filename>. Only the raw wavs path (no RTTM) has a meaningful parent (the dialog ID). The emitted reference_voice value should be the MLS speaker ID or the dialog/speaker tag, not an ephemeral temp-dir name.

Suggested change
- reference_wav = self._assign_reference(speaker, conversation_id)
+ reference_wav, ref_id = self._assign_reference(speaker, conversation_id)
  filename = self._output_filename(conversation_id, speaker, text)
  audio_path = os.path.join(self.output_audio_dir, filename)
  if os.path.exists(audio_path):
      audio_data, _ = sf.read(audio_path)
  else:
      audio_data = self._generate_turn_audio(
          text, reference_wav, conversation_id
      )
      sf.write(audio_path, audio_data, self.sample_rate)
  duration = len(audio_data) / self.sample_rate
  out_data = dict(data)
  out_data["audio_filepath"] = audio_path
  out_data["duration"] = duration
- out_data["reference_voice"] = Path(reference_wav).parent.name
+ out_data["reference_voice"] = ref_id
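One way the `ref_id` in this suggestion might be derived is from the original source path, before any temp file is written. The sketch below is an assumption, not the PR's implementation: `reference_id`, its `layout` argument, and the example paths are all hypothetical, and the ID formats simply follow the two layouts named in the description.

```python
from pathlib import PurePosixPath


def reference_id(source_path: str, layout: str) -> str:
    """Derive a stable voice ID from the source file, not from the temp copy."""
    path = PurePosixPath(source_path)
    if layout == "mls":
        # <spk>/<book>/<seg>.flac: the speaker directory is two levels up.
        return path.parent.parent.name
    if layout == "wavs":
        # wavs/<dialog>/<speaker>.wav: combine dialog and speaker tags.
        return f"{path.parent.name}/{path.stem}"
    raise ValueError(f"Unknown layout: {layout!r}")


# What the current code reports once a reference has been copied to a temp dir:
temp_copy = PurePosixPath("/tmp/chatterbox_ref_abc123/Alice.wav")
reported_today = temp_copy.parent.name  # the temp-dir name, not a voice ID

# What the source path would give instead (illustrative paths):
wavs_id = reference_id("/data/reference_voices/wavs/dialog001/Alice.wav", "wavs")
mls_id = reference_id("/data/mls_russian/1234/5678/1234_5678_000001.flac", "mls")
```

Computing the ID next to the path and returning both, as the suggested tuple does, keeps the metadata independent of where the processed audio happens to be stored.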

top_p: float = 1.0,
normalize_audio: bool = True,
normalize_level: float = -20.0,
**kwargs,

Can this be removed?

Comment on lines +175 to +190
if self.language:
    os.environ["TRANSFORMERS_ATTN_IMPLEMENTATION"] = "eager"
    try:
        import chatterbox.models.t3.llama_configs as _llama_cfgs
        for _cfg_dict in _llama_cfgs.LLAMA_CONFIGS.values():
            _cfg_dict["attn_implementation"] = "eager"
    except (ImportError, AttributeError):
        pass

    from chatterbox.mtl_tts import ChatterboxMultilingualTTS
    self.model = ChatterboxMultilingualTTS.from_pretrained(device=self.device)
    logger.info(f"Loaded ChatterboxMultilingualTTS (language={self.language})")
else:
    from chatterbox.tts import ChatterboxTTS
    self.model = ChatterboxTTS.from_pretrained(device=self.device)
    logger.info("Loaded ChatterboxTTS (English)")

Can these imports be at the top of the script? Same with the environment variable?

    text_hash = hashlib.md5(text.encode("utf-8")).hexdigest()[:10]
    return f"{conv_short}_{speaker}_{text_hash}.wav"

def _ensure_ready(self) -> None:

Let's remove this.

if not tasks:
    return []

self._ensure_ready()

Remove. We should never call setup in process_batch/process.

    return []

self._ensure_ready()
os.makedirs(self.output_audio_dir, exist_ok=True)

Should this be in setup?
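The separation the last three comments point at can be sketched with a toy class. This is not the NeMo Curator base-class API: `SketchTTSStage`, the injected `model_loader`, and the exact hook signatures are assumptions for illustration. One-time work goes in `setup()`, and `process_batch()` only consumes what `setup()` prepared.

```python
import os
from typing import Callable, List, Optional


class SketchTTSStage:
    """Toy stage: one-time work in setup(), per-batch work in process_batch()."""

    def __init__(self, output_audio_dir: str, model_loader: Callable[[], Callable[[str], str]]):
        self.output_audio_dir = output_audio_dir
        self._model_loader = model_loader  # hypothetical stand-in for from_pretrained
        self.model: Optional[Callable[[str], str]] = None

    def setup(self) -> None:
        # Runs once per worker: create the output directory and load the model.
        os.makedirs(self.output_audio_dir, exist_ok=True)
        self.model = self._model_loader()

    def process_batch(self, tasks: List[str]) -> List[str]:
        if not tasks:
            return []
        if self.model is None:
            # Fail loudly instead of lazily calling setup from the hot path.
            raise RuntimeError("setup() must be called before process_batch()")
        return [self.model(task) for task in tasks]
```

Under this layout neither a lazy `_ensure_ready()` helper nor an `os.makedirs` call is needed inside `process_batch`.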
