
Commit f6dcac6

more tts pipeline example (#42484)

* more tts pipeline example
* remove duplicate code
* group example by models
* nit

1 parent ce53cc0 · commit f6dcac6

File tree

1 file changed: +43 −8 lines changed


docs/source/en/tasks/text-to-speech.md

Lines changed: 43 additions & 8 deletions
@@ -22,16 +22,14 @@ Text-to-speech (TTS) is the task of creating natural-sounding speech from text,
 languages and for multiple speakers. Several text-to-speech models are currently available in 🤗 Transformers, such as [Dia](../model_doc/dia), [CSM](../model_doc/csm),
 [Bark](../model_doc/bark), [MMS](../model_doc/mms), [VITS](../model_doc/vits) and [SpeechT5](../model_doc/speecht5).
 
-You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`). Some models, like Dia,
-can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music.
-Here's an example of how you would use the `"text-to-speech"` pipeline with Dia:
+You can easily generate audio using the `"text-to-audio"` pipeline (or its alias - `"text-to-speech"`).
+Here's an example of how you would use the `"text-to-speech"` pipeline with [CSM](https://huggingface.co/sesame/csm-1b):
 
-```py
+```python
 >>> from transformers import pipeline
 
->>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
->>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
->>> output = pipe(text)
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+>>> output = pipe("Hello from Sesame.")
 ```
 
 Here's a code snippet you can use to listen to the resulting audio in a notebook:
@@ -41,7 +39,44 @@ Here's a code snippet you can use to listen to the resulting audio in a notebook
 >>> Audio(output["audio"], rate=output["sampling_rate"])
 ```
 
-For more examples on what Bark and other pretrained TTS models can do, refer to our
+By default, CSM uses a random voice. You can do voice cloning by providing a reference audio as part of a chat template dictionary:
+
+```python
+>>> import soundfile as sf
+>>> import torch
+>>> from datasets import Audio, load_dataset
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-audio", model="sesame/csm-1b")
+
+>>> ds = load_dataset("hf-internal-testing/dailytalk-dummy", split="train")
+>>> ds = ds.cast_column("audio", Audio(sampling_rate=24000))
+>>> conversation = [
+...     {
+...         "role": "0",
+...         "content": [
+...             {"type": "text", "text": "What are you working on?"},
+...             {"type": "audio", "path": ds[0]["audio"]["array"]},
+...         ],
+...     },
+...     {"role": "0", "content": [{"type": "text", "text": "How much money can you spend?"}]},
+... ]
+>>> output = pipe(conversation)
+```
+
+Some models, like [Dia](https://huggingface.co/nari-labs/Dia-1.6B-0626), can also be conditioned to generate non-verbal communications such as laughing, sighing and crying, or even add music. Below is such an example:
+
+```python
+>>> from transformers import pipeline
+
+>>> pipe = pipeline("text-to-speech", model="nari-labs/Dia-1.6B-0626")
+>>> text = "[S1] (clears throat) Hello! How are you? [S2] I'm good, thanks! How about you?"
+>>> output = pipe(text)
+```
+
+Note that Dia also accepts speaker tags such as [S1] and [S2] to generate a conversation between unique voices.
+
+For more examples on what CSM and other pretrained TTS models can do, refer to our
 [Audio course](https://huggingface.co/learn/audio-course/chapter6/pre-trained_models).
 
 If you are looking to fine-tune a TTS model, the only text-to-speech models currently available in 🤗 Transformers
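The docs shown in the diff play the result with `IPython.display.Audio`, which only works in a notebook. Outside one, the pipeline's output dict (an `"audio"` waveform plus its `"sampling_rate"`) can be written to a WAV file instead. Below is a minimal sketch using only the Python standard library; the sine-wave waveform and the `speech.wav` filename are stand-ins of my own, not part of the docs — the real pipeline output has the same dict shape, assuming a 1-D float waveform in [-1, 1]:

```python
import math
import struct
import wave

# Stand-in for `output = pipe(...)`: a 0.5 s, 440 Hz tone shaped like the
# dict the "text-to-audio" pipeline returns (assumption for illustration).
sampling_rate = 24000
waveform = [
    0.3 * math.sin(2 * math.pi * 440 * t / sampling_rate)
    for t in range(int(sampling_rate * 0.5))
]
output = {"audio": waveform, "sampling_rate": sampling_rate}

# Clamp the float samples to [-1, 1], scale to 16-bit PCM, and pack them
# as little-endian signed shorts.
pcm = struct.pack(
    "<%dh" % len(output["audio"]),
    *(int(max(-1.0, min(1.0, s)) * 32767) for s in output["audio"]),
)

# Write a mono, 16-bit WAV at the pipeline's sampling rate.
with wave.open("speech.wav", "wb") as f:
    f.setnchannels(1)   # mono
    f.setsampwidth(2)   # 2 bytes per sample = 16-bit
    f.setframerate(output["sampling_rate"])
    f.writeframes(pcm)
```

With a real model, you would replace the stand-in dict with `output = pipe("Hello from Sesame.")` and keep the saving code unchanged.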
