Commit ace7c47

Merge pull request #1850 from yuekaizhang/cosy3_pr
Support Cosyvoice3 TRT-LLM Inference
2 parents 04bcadc + 914454e commit ace7c47

27 files changed

Lines changed: 3000 additions & 133 deletions

cosyvoice/dataset/processor.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -181,7 +181,7 @@ def compute_fbank(data,
 def compute_whisper_fbank(data, num_frames=-1, mode='train'):
-    """ Extract whisper fbank
+    """ Extract whisper fbank

     Args:
         data: Iterable[{key, wav, label, sample_rate}]
```
(The removed and added docstring lines appear to differ only in whitespace.)

cosyvoice/vllm/cosyvoice2.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -99,7 +99,7 @@ def compute_logits(
     sampling_metadata: Optional[SamplingMetadata] = None,
 ) -> Optional[torch.Tensor]:
     if VLLM_V1_ENGINE_ONLY:
-        logits = self.logits_processor(self.lm_head, hidden_states,
+        logits = self.logits_processor(self.lm_head, hidden_states,
                                        self.lm_head.bias)
     else:
         logits = self.logits_processor(self.lm_head, hidden_states,
```
(The removed and added lines appear to differ only in whitespace.)

example.py
Lines changed: 1 addition & 1 deletion

```diff
@@ -99,7 +99,7 @@ def cosyvoice3_example():
     # 歴史的世界においては、過去は単に過ぎ去ったものではない、プラトンのいう如く非有が有である。 -> レキシ テキ セカイ ニ オイ テ ワ、カコ ワ タンニ スギサッ タ モノ デ ワ ナイ、プラトン ノ イウ ゴトク ヒ ユー ガ ユー デ アル。
     for i, j in enumerate(cosyvoice.inference_cross_lingual('You are a helpful assistant.<|endofprompt|>レキシ テキ セカイ ニ オイ テ ワ、カコ ワ タンニ スギサッ タ モノ デ ワ ナイ、プラトン ノ イウ ゴトク ヒ ユー ガ ユー デ アル。',
                                                             './asset/zero_shot_prompt.wav', stream=False)):
-        torchaudio.save('japanese_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)
+        torchaudio.save('japanese_{}.wav'.format(i), j['tts_speech'], cosyvoice.sample_rate)


 def main():
```
(The Japanese comment gives the example sentence and its katakana reading, roughly: "In the historical world, the past is not simply something gone by; as Plato says, non-being is being." The removed and added lines appear to differ only in whitespace.)

runtime/triton_trtllm/README.DIT.md renamed to runtime/triton_trtllm/README.Cosyvoice2.DiT.md
Lines changed: 1 addition & 1 deletion

````diff
@@ -8,7 +8,7 @@ This document describes how to accelerate CosyVoice with a DiT-based Token2Wav m
 Launch the service directly with Docker Compose:
 ```sh
-docker compose -f docker-compose.dit.yml up
+docker compose -f docker-compose.cosyvoice2.dit.yml up
 ```
````
Lines changed: 146 additions & 0 deletions

## Accelerating CosyVoice with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

### Quick Start

Launch the service directly with Docker Compose:
```sh
docker compose -f docker-compose.cosyvoice2.unet.yml up
```

### Build the Docker Image

To build the image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```
### Understanding `run.sh`

The `run.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:
```sh
bash run.sh <start_stage> <stop_stage> [service_type]
```
- `<start_stage>`: The stage to start from (0-6).
- `<stop_stage>`: The stage to stop after (0-6).

**Stages:**

- **Stage 0**: Downloads the `CosyVoice2-0.5B` model from HuggingFace.
- **Stage 1**: Converts the HuggingFace checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- **Stage 2**: Creates the Triton model repository and configures the model files. The configuration is adjusted based on whether `Decoupled=True` (streaming) or `Decoupled=False` (offline) will be used.
- **Stage 3**: Launches the Triton Inference Server.
- **Stage 4**: Runs the single-utterance HTTP client for testing.
- **Stage 5**: Runs the gRPC benchmark client.
- **Stage 6**: Runs the offline inference benchmark test.
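The start/stop arguments simply bound which numbered stages execute. A minimal Python sketch of that selection logic (`run.sh` itself is a bash script; this only illustrates the gating, with stage bodies omitted):

```python
def stages_to_run(start_stage: int, stop_stage: int, all_stages=range(0, 7)):
    """Return the stages that `bash run.sh <start> <stop>` would execute."""
    return [s for s in all_stages if start_stage <= s <= stop_stage]

# `bash run.sh 0 3` prepares the models and launches the server:
print(stages_to_run(0, 3))  # [0, 1, 2, 3]
# `bash run.sh 4 4` runs only the single-utterance HTTP client:
print(stages_to_run(4, 4))  # [4]
```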
### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# This command runs stages 0, 1, 2, and 3
bash run.sh 0 3
```
> [!TIP]
> Both streaming and offline (non-streaming) TTS modes are supported. For streaming TTS, set `Decoupled=True`. For offline TTS, set `Decoupled=False`. You need to rerun stage 2 if you switch between modes.
### Single-Utterance HTTP Client

Sends a single HTTP inference request. This is intended for testing the offline TTS mode (`Decoupled=False`):
```sh
bash run.sh 4 4
```
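Under the hood, a client like this talks to Triton's KServe-v2 HTTP endpoint (`POST /v2/models/<model_name>/infer`). The sketch below only assembles such a request body; the tensor name `target_text` is a placeholder for illustration, so check the model's `config.pbtxt` for the real input names:

```python
import json

def build_triton_infer_body(text: str, input_name: str = "target_text") -> str:
    """Build a KServe-v2 inference request with one BYTES tensor holding the text.

    `input_name` is a placeholder, not necessarily what this repository's
    models expect -- consult config.pbtxt before use.
    """
    body = {
        "inputs": [
            {"name": input_name, "shape": [1, 1], "datatype": "BYTES", "data": [text]}
        ]
    }
    return json.dumps(body)

body = build_triton_infer_body("Hello world")
# POST this JSON to http://localhost:8000/v2/models/<model_name>/infer
print(body)
```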
### Benchmark in Client-Server Mode

To benchmark the running Triton server, pass `streaming` or `offline` as the third argument:
```sh
bash run.sh 5 5 # [streaming|offline]

# You can also customize parameters such as the number of tasks and the dataset split:
# python3 client_grpc.py --num-tasks 2 --huggingface-dataset yuekai/seed_tts_cosy2 --split-name test_zh --mode [streaming|offline]
```
> [!TIP]
> It is recommended to run the benchmark multiple times to get stable results after the initial server warm-up.
### Benchmark in Offline Inference Mode

To benchmark the offline inference mode, run the commands below:
```sh
# Install FlashCosyVoice for token2wav batching:
# git clone https://github.com/yuekaizhang/FlashCosyVoice.git /workspace/FlashCosyVoice -b trt
# cd /workspace/FlashCosyVoice
# pip install -e .
# cd -
# wget https://huggingface.co/yuekai/cosyvoice2_flow_onnx/resolve/main/flow.decoder.estimator.fp32.dynamic_batch.onnx -O $model_scope_model_local_dir/flow.decoder.estimator.fp32.dynamic_batch.onnx

bash run.sh 6 6

# You can also switch to the HuggingFace backend by setting backend=hf
```
### Benchmark Results

The following results were obtained by decoding on a single L20 GPU with 26 prompt audio/target text pairs from the [yuekai/seed_tts](https://huggingface.co/datasets/yuekai/seed_tts) dataset (approximately 170 seconds of audio):

**Client-Server Mode: Streaming TTS (First Chunk Latency)**

| Mode | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|---|---|---|---|---|
| Streaming, use_spk2info_cache=False | 1 | 220.43 | 218.07 | 0.1237 |
| Streaming, use_spk2info_cache=False | 2 | 476.97 | 369.25 | 0.1022 |
| Streaming, use_spk2info_cache=False | 4 | 1107.34 | 1243.75 | 0.0922 |
| Streaming, use_spk2info_cache=True | 1 | 189.88 | 184.81 | 0.1155 |
| Streaming, use_spk2info_cache=True | 2 | 323.04 | 316.83 | 0.0905 |
| Streaming, use_spk2info_cache=True | 4 | 977.68 | 903.68 | 0.0733 |

> If your service only needs a fixed speaker, you can set `use_spk2info_cache=True` in `run.sh`. To add more speakers, refer to the instructions [here](https://github.com/qi-hua/async_cosyvoice?tab=readme-ov-file#9-spk2info-%E8%AF%B4%E6%98%8E).

**Client-Server Mode: Offline TTS (Full Sentence Latency)**

| Mode | Note | Concurrency | Avg Latency (ms) | P50 Latency (ms) | RTF |
|---|---|---|---|---|---|
| Offline, Decoupled=False, use_spk2info_cache=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 1 | 758.04 | 615.79 | 0.0891 |
| Offline, Decoupled=False, use_spk2info_cache=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 2 | 1025.93 | 901.68 | 0.0657 |
| Offline, Decoupled=False, use_spk2info_cache=False | [Commit](https://github.com/yuekaizhang/CosyVoice/commit/b44f12110224cb11c03aee4084b1597e7b9331cb) | 4 | 1914.13 | 1783.58 | 0.0610 |

**Offline Inference Mode: HuggingFace LLM vs. TensorRT-LLM**

| Backend | Batch Size | llm_time_seconds | total_time_seconds | RTF |
|---------|------------|------------------|--------------------|--------|
| HF | 1 | 39.26 | 44.31 | 0.2494 |
| HF | 2 | 30.54 | 35.62 | 0.2064 |
| HF | 4 | 18.63 | 23.90 | 0.1421 |
| HF | 8 | 11.22 | 16.45 | 0.0947 |
| HF | 16 | 8.42 | 13.78 | 0.0821 |
| TRTLLM | 1 | 12.46 | 17.31 | 0.0987 |
| TRTLLM | 2 | 7.64 | 12.65 | 0.0739 |
| TRTLLM | 4 | 4.89 | 9.38 | 0.0539 |
| TRTLLM | 8 | 2.92 | 7.23 | 0.0418 |
| TRTLLM | 16 | 2.01 | 6.63 | 0.0386 |
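RTF (real-time factor) in these tables is processing time divided by the duration of the synthesized audio, so lower is better and RTF < 1 means faster than real time. A small sketch of the computation (the 177.7 s audio duration below is an assumed illustrative value, consistent with the roughly 170 s dataset figure above):

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """RTF = time spent processing / duration of audio produced."""
    return processing_seconds / audio_seconds

# e.g. 44.31 s of total processing for ~177.7 s of synthesized audio:
print(round(real_time_factor(44.31, 177.7), 4))  # 0.2494
```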
### OpenAI-Compatible Server

To launch an OpenAI-compatible API service, run the following commands:
```sh
git clone https://github.com/yuekaizhang/Triton-OpenAI-Speech.git
cd Triton-OpenAI-Speech
pip install -r requirements.txt

# After the Triton service is running, start the FastAPI bridge:
python3 tts_server.py --url http://localhost:8000 --ref_audios_dir ./ref_audios/ --port 10086 --default_sample_rate 24000

# Test the service with curl:
bash test/test_cosyvoice.sh
```
> [!NOTE]
> Currently, only the offline TTS mode is compatible with the OpenAI-compatible server.
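A minimal client-side sketch for the bridge above. The field names (`model`, `input`, `voice`, `response_format`) follow the OpenAI `/v1/audio/speech` convention and are assumptions about this server, so verify them against `tts_server.py`; the sketch only builds the JSON body rather than sending it:

```python
import json

def build_speech_request(text: str, voice: str, model: str = "cosyvoice") -> str:
    """Build an OpenAI-style /v1/audio/speech request body.

    Field names follow the OpenAI speech API convention; the names actually
    accepted by tts_server.py may differ -- check before use.
    """
    payload = {
        "model": model,
        "input": text,
        "voice": voice,
        "response_format": "wav",
    }
    return json.dumps(payload)

body = build_speech_request("Hello from CosyVoice.", voice="default")
print(body)
# POST this body to http://localhost:10086/v1/audio/speech with
# Content-Type: application/json (e.g. via requests.post or curl).
```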
### Acknowledgements

This work originates from the NVIDIA CISI project. For more multimodal resources, please see [mair-hub](https://github.com/nvidia-china-sae/mair-hub).
Lines changed: 87 additions & 0 deletions

## Accelerating CosyVoice3 with NVIDIA Triton Inference Server and TensorRT-LLM

Contributed by Yuekai Zhang (NVIDIA).

### Quick Start

Launch the service directly with Docker Compose:
```sh
docker compose -f docker-compose.cosyvoice3.yml up
```

### Build the Docker Image

To build the image from scratch:
```sh
docker build . -f Dockerfile.server -t soar97/triton-cosyvoice:25.06
```

### Run a Docker Container
```sh
your_mount_dir=/mnt:/mnt
docker run -it --name "cosyvoice-server" --gpus all --net host -v $your_mount_dir --shm-size=2g soar97/triton-cosyvoice:25.06
```

### Understanding `run_cosyvoice3.sh`

The `run_cosyvoice3.sh` script orchestrates the entire workflow through numbered stages.

You can run a subset of stages with:
```sh
bash run_cosyvoice3.sh <start_stage> <stop_stage>
```
- `<start_stage>`: The stage to start from (-1 to 5).
- `<stop_stage>`: The stage to stop after (-1 to 5).
**Stages:**

- **Stage -1**: Clones the `CosyVoice` repository.
- **Stage 0**: Downloads the `Fun-CosyVoice3-0.5B-2512` model and its HuggingFace LLM checkpoint.
- **Stage 1**: Converts the HuggingFace LLM checkpoint to the TensorRT-LLM format and builds the TensorRT engines.
- **Stage 2**: Creates the Triton model repository, including configurations for `cosyvoice3`, `token2wav`, `vocoder`, `audio_tokenizer`, and `speaker_embedding`.
- **Stage 3**: Launches the Triton Inference Server for the Token2Wav module and uses `trtllm-serve` to deploy the CosyVoice3 LLM.
- **Stage 4**: Runs the gRPC benchmark client for performance testing.
- **Stage 5**: Runs the offline TTS inference benchmark test.

### Export Models and Launch Server

Inside the Docker container, prepare the models and start the Triton server by running stages 0-3:
```sh
# This command runs stages 0, 1, 2, and 3
bash run_cosyvoice3.sh 0 3
```
### Benchmark in Client-Server Mode

To benchmark the running Triton server, run stage 4:
```sh
bash run_cosyvoice3.sh 4 4

# You can customize parameters such as the number of tasks inside the script.
```
The following results were obtained by decoding on a single L20 GPU.

#### Streaming TTS (Concurrent Tasks = 4)

**First Chunk Latency**

| Concurrent Tasks | Average (ms) | 50th Percentile (ms) | 90th Percentile (ms) | 95th Percentile (ms) | 99th Percentile (ms) |
| ---------------- | ------------ | -------------------- | -------------------- | -------------------- | -------------------- |
| 4 | 750.42 | 740.31 | 941.05 | 977.55 | 1002.37 |
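The percentile columns are order statistics over per-request first-chunk latencies. A sketch of how such a summary can be computed from raw measurements (the sample latencies below are made up for illustration):

```python
import statistics

def latency_summary(latencies_ms):
    """Average plus P50/P90/P95/P99 over a list of per-request latencies (ms)."""
    # quantiles(n=100) returns the 99 cut points P1..P99.
    pcts = statistics.quantiles(latencies_ms, n=100, method="inclusive")
    return {
        "avg": statistics.fmean(latencies_ms),
        "p50": pcts[49],
        "p90": pcts[89],
        "p95": pcts[94],
        "p99": pcts[98],
    }

# Illustrative (made-up) measurements:
print(latency_summary([700, 720, 740, 760, 900, 950, 980, 1000]))
```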
### Benchmark in Offline Inference Mode

To benchmark the offline inference mode, run stage 5:
```sh
bash run_cosyvoice3.sh 5 5
```

#### Offline TTS (CosyVoice3 0.5B LLM + Token2Wav with TensorRT)

| Backend | LLM Batch Size | llm_time (s) | token2wav_time (s) | pipeline_time (s) | RTF |
|---------|----------------|--------------|--------------------|-------------------|--------|
| TRTLLM | 1 | 13.21 | 5.72 | 19.48 | 0.1091 |
| TRTLLM | 2 | 8.46 | 6.02 | 14.91 | 0.0822 |
| TRTLLM | 4 | 5.07 | 5.95 | 11.43 | 0.0630 |
| TRTLLM | 8 | 2.98 | 6.11 | 9.53 | 0.0562 |
| TRTLLM | 16 | 2.12 | 6.27 | 8.83 | 0.0501 |
