Gemini TTS: The Essentials

Last updated: April 22, 2026

Introduction

Covers Gemini 3.1 Flash TTS (current) and Gemini 2.5 TTS (previous generation)

Scenario offers two generations of Google's Gemini text-to-speech models. Gemini 3.1 Flash TTS is the current recommended version for all new projects. Gemini 2.5 TTS remains available for existing workflows.

What Gemini 3.1 Flash TTS Does

Gemini 3.1 Flash TTS converts a text script into expressive speech. You provide a script, choose a voice, and the model returns a mono MP3 at 24 kHz. What sets it apart is the audio tag system: by embedding tags like [determination], [whispers], or [enthusiasm] directly in the text, you control how specific lines are delivered without extra parameters or prompts. The direction lives in the script itself.

For dialogue scenes with two characters, Multi-Speaker Config maps two speaker names to separate voices and handles turn-taking from the labeled script automatically.

Parameters

Parameter	Required	Default	Description
Text	Yes		Script to synthesize. Maximum 5,000 characters. Audio tags can appear anywhere in the text.
Voice	No	Puck	Voice preset for single-speaker output. Ignored when Multi-Speaker Config is active. See the full list of 30 voices below.
Language	No	en-US	BCP-47 language code for the target language. Auto-detected if left empty, but setting it explicitly improves accuracy. 24 locales supported.
Multi-Speaker Config	No		Maps up to 2 speaker names to voice presets. Overrides the single Voice setting. Speaker names in the text must match the labels in this config exactly.
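
The constraints in the table can be checked before a job is submitted. The following is a minimal sketch, not an official schema: the TTSRequest class, its field names, and the validate function are hypothetical names chosen to mirror the parameter table.

```python
from dataclasses import dataclass, field

MAX_TEXT_CHARS = 5000  # limit stated in the parameter table
MAX_SPEAKERS = 2       # Multi-Speaker Config maps up to 2 speaker names


@dataclass
class TTSRequest:
    """Hypothetical container mirroring the parameter table (not an official schema)."""
    text: str
    voice: str = "Puck"
    language: str = "en-US"
    multi_speaker: dict[str, str] = field(default_factory=dict)  # speaker name -> voice


def validate(req: TTSRequest) -> list[str]:
    """Return a list of problems; an empty list means the request passes these checks."""
    problems = []
    if not req.text.strip():
        problems.append("Text is required.")
    if len(req.text) > MAX_TEXT_CHARS:
        problems.append(
            f"Text is {len(req.text)} characters; the limit is {MAX_TEXT_CHARS}."
        )
    if len(req.multi_speaker) > MAX_SPEAKERS:
        problems.append(f"Multi-Speaker Config allows at most {MAX_SPEAKERS} speakers.")
    for name in req.multi_speaker:
        # Speaker names in the text must match the config labels exactly.
        if f"{name}:" not in req.text:
            problems.append(f"Speaker '{name}' never appears as a label in the text.")
    return problems
```

Running validate on a draft request before submission catches over-long scripts and mismatched speaker labels without spending a generation.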

Audio Tags

Audio tags appear in square brackets within the script and control delivery without altering the words themselves. Place them at the start of a sentence for consistent application:

[determination] We are not leaving without the artifact.
[whispers] The vault door was already open.
[enthusiasm] This is going to change everything.
[slow] Read each step carefully before you begin.
[laughs] I cannot believe that actually worked.
[soft] The people using the tools. That always brings it back to earth.

Mid-sentence placement is supported but may produce uneven transitions in complex phrasing. The model supports over 200 audio tags covering emotions, interjections, pacing, and performance notes.
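
Because mid-sentence tags can produce uneven transitions, it may help to flag them before generating. The sketch below is a simple local check, assuming a tag counts as sentence-opening when it starts the script or follows sentence-ending punctuation, a line break, or a speaker-label colon. The function names are illustrative and the heuristic is not part of the model.

```python
import re

# Anything in square brackets is treated as an audio tag.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_tags(script: str) -> list[str]:
    """Return every audio tag name in the script, in order of appearance."""
    return TAG_PATTERN.findall(script)


def mid_sentence_tags(script: str) -> list[str]:
    """Return tags that do not open a sentence, a line, or a speaker's turn."""
    flagged = []
    for match in TAG_PATTERN.finditer(script):
        before = script[:match.start()].rstrip(" \t")
        # Empty prefix means the tag opens the script; otherwise look at
        # the last non-space character in front of the tag.
        if before and before[-1] not in ".!?\n:":
            flagged.append(match.group(1))
    return flagged
```

For example, in "It was [slow] very quiet." the [slow] tag would be flagged, while "[whispers] The vault door was already open." would pass.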

Multi-Speaker Dialogue

To generate a dialogue between two characters, label each line with a speaker name and configure Multi-Speaker Config to map each name to a voice:

Scout: Someone followed us from the market. Do not look back.
Commander: [determination] How many?
Scout: Two. Maybe three. We take the alley on the left.
Commander: [low] I will handle the one on the right. You get clear.
Scout: Together or not at all.

In this example, Multi-Speaker Config would be set to: Scout mapped to Kore, Commander mapped to Fenrir.

Known limitation: Multi-Speaker Config manages turn-taking and audio tag delivery correctly, but voice differentiation between speakers is not reliably applied. Both speakers may render with the same underlying timbre regardless of the voice assignments. Use this feature primarily to structure labeled dialogue scripts, not to produce distinct character voices in a single job. For distinct voices per character, generate each speaker's lines as a separate single-speaker job and combine the audio files.
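
The workaround above (one single-speaker job per line, combined afterwards) can be prepared with a small script. This is a sketch under assumptions: the job dictionary layout and the split_by_speaker name are hypothetical, and submitting the jobs and joining the resulting audio files are left to whatever tooling you already use.

```python
def split_by_speaker(script: str, voices: dict[str, str]) -> list[dict]:
    """Turn a labeled dialogue into ordered single-speaker jobs.

    `voices` maps each speaker label to a voice preset, for example
    {"Scout": "Kore", "Commander": "Fenrir"}.
    """
    jobs = []
    for line_no, line in enumerate(script.splitlines(), start=1):
        if not line.strip():
            continue  # skip blank lines between turns
        speaker, sep, text = line.partition(":")
        speaker = speaker.strip()
        if not sep or speaker not in voices:
            raise ValueError(f"Line {line_no}: no known speaker label in {line!r}")
        jobs.append(
            {
                "order": len(jobs),  # keeps the original turn order for reassembly
                "speaker": speaker,
                "voice": voices[speaker],
                "text": text.strip(),
            }
        )
    return jobs
```

Each job carries its position in the dialogue, so the generated clips can be concatenated in the original turn order.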

Supported Voices

30 voices are available, named after astronomical bodies. The default voice is Puck.

All 30 voices
Achernar	Achird	Algenib	Algieba	Alnilam
Aoede	Autonoe	Callirrhoe	Charon	Despina
Enceladus	Erinome	Fenrir	Gacrux	Iapetus
Kore	Laomedeia	Leda	Orus	Pulcherrima
Puck (default)	Rasalgethi	Sadachbia	Sadaltager	Schedar
Sulafat	Umbriel	Vindemiatrix	Zephyr	Zubenelgenubi

Test several voices with your script before batching. Register and timbre vary significantly across voices. For dramatic characters, try Fenrir or Puck. For narration, try Charon or Orus. For clear, neutral delivery, try Kore or Zephyr.

Supported Languages

Language	BCP-47 code	Language	BCP-47 code
Arabic (Egypt)	ar-EG	Marathi	mr-IN
Bengali (Bangladesh)	bn-BD	Dutch	nl-NL
German	de-DE	Polish	pl-PL
English (US)	en-US	Portuguese (Brazil)	pt-BR
English (UK)	en-GB	Romanian	ro-RO
Spanish (US)	es-US	Russian	ru-RU
Spanish (Spain)	es-ES	Tamil	ta-IN
French (France)	fr-FR	Telugu	te-IN
Hindi	hi-IN	Thai	th-TH
Indonesian	id-ID	Turkish	tr-TR
Italian	it-IT	Ukrainian	uk-UA
Japanese	ja-JP	Vietnamese	vi-VN
Korean	ko-KR

Always set the language explicitly. Auto-detection works for common languages but can produce inconsistent pronunciation on regional variants and mixed-language scripts.
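
One way to make sure a language is always set explicitly is to normalize and check the code before sending it. The sketch below copies the codes from the table above. The resolve_language helper is a hypothetical name, and the case normalization is a convenience of this example rather than documented model behavior.

```python
# Codes copied from the Supported Languages table.
SUPPORTED_LOCALES = {
    "ar-EG", "bn-BD", "de-DE", "en-US", "en-GB", "es-US", "es-ES",
    "fr-FR", "hi-IN", "id-ID", "it-IT", "ja-JP", "ko-KR", "mr-IN",
    "nl-NL", "pl-PL", "pt-BR", "ro-RO", "ru-RU", "ta-IN", "te-IN",
    "th-TH", "tr-TR", "uk-UA", "vi-VN",
}


def resolve_language(code: str) -> str:
    """Normalize a code such as 'en_us' to 'en-US' and verify it is in the table."""
    parts = code.strip().replace("_", "-").split("-")
    if len(parts) != 2:
        raise ValueError(f"Expected a language-region code, got {code!r}")
    normalized = f"{parts[0].lower()}-{parts[1].upper()}"
    if normalized not in SUPPORTED_LOCALES:
        raise ValueError(f"{normalized} is not in the supported language table")
    return normalized
```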

Use Cases

Game dialogue: Generate NPC lines, quest narration, and cinematic audio at scale. Use audio tags to vary delivery across tense, calm, or comedic moments. Keep the same voice preset per character across all scenes for consistency.
Marketing and advertising: Produce ad scripts and product demos in multiple languages without re-recording. Use a fixed voice to build a consistent brand sound across all markets.
E-learning: Convert course content into narrated audio for 24 language markets. Use [slow] tags in technical sections to improve comprehension.
Audiobooks and publishing: Process long manuscripts in chunks through a Loop node. Split at natural sentence boundaries and maintain the same voice preset across all chunks for a cohesive final output.

Tips for Better Results

Set the language explicitly. Auto-detection works for common languages but can drift on regional variants, non-Latin scripts, and mixed-language text.
Place audio tags at sentence boundaries. Opening a sentence with a tag applies it consistently to the full sentence. Mid-sentence placement is supported but can produce uneven transitions.
Test voice and tag combinations on a short clip before batching. Some voice and tag pairings produce unexpected results. A two-sentence test is cheap and avoids regenerating a full script.
For distinct character voices, generate each character separately. Multi-Speaker Config handles turn-taking but does not reliably produce different timbres. Generate each speaker as a single-speaker job and combine the files.
Split long scripts at natural pauses. Each job accepts at most 5,000 characters, so longer scripts must be divided. Use a Loop node and a Split Text node to process full manuscripts automatically.
Use the same voice preset for the same character across all sessions. The model is not stateful, but consistent voice selection produces consistent output across batches.
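
For anyone preparing scripts outside a workflow, the splitting tip above can be approximated in a few lines. This is a minimal sketch, assuming sentences end with ".", "!", or "?" followed by whitespace. It is not how the Split Text node works internally, and chunk_script is an illustrative name.

```python
import re


def chunk_script(text: str, limit: int = 5000) -> list[str]:
    """Group whole sentences into chunks of at most `limit` characters."""
    # Split after sentence-ending punctuation that is followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if len(sentence) > limit:
            raise ValueError("A single sentence exceeds the character limit.")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate  # sentence still fits in the current chunk
        else:
            chunks.append(current)  # close the chunk at a sentence boundary
            current = sentence
    if current:
        chunks.append(current)
    return chunks
```

Each chunk can then be submitted as its own job with the same voice preset, which keeps the final audio cohesive.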

Known Limitations

Two-speaker maximum per job. Multi-Speaker Config supports up to 2 speakers. For scenes with more characters, generate each pair separately.
Multi-Speaker Config does not reliably differentiate voice timbres. Both speakers may render with the same underlying voice regardless of the assigned presets. Use separate single-speaker jobs for distinct character voices.
No numeric speed or pitch controls. Pacing and tone are controlled exclusively through audio tags. There are no speed multiplier or pitch shift parameters.
Voice-language pairing inconsistencies. Not all voice and language combinations produce natural-sounding output. Some combinations may default to a different voice or produce accented output. Test before committing to a voice for a localized project.
Text input only. No voice cloning, reference audio upload, or voice transfer. Output is always one of the 30 preset voices.
No SSML support. The model uses its own proprietary audio tag system. Standard SSML tags are not recognized.
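
Scripts written for other TTS systems may still contain SSML markup, which this model does not recognize. The sketch below flags common SSML elements so they can be rewritten as audio tags by hand. The element list is a small subset chosen for illustration, and find_ssml is a hypothetical helper.

```python
import re

# A few common SSML element names; extend the list as needed.
SSML_TAG = re.compile(
    r"</?\s*(speak|break|prosody|emphasis|say-as|phoneme|sub|voice|p|s)\b[^>]*>",
    re.IGNORECASE,
)


def find_ssml(script: str) -> list[str]:
    """Return every SSML-looking tag found in the script, in order."""
    return [match.group(0) for match in SSML_TAG.finditer(script)]
```

An empty result means none of the listed elements were found. It does not guarantee that the script is free of other markup.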

Migrating from Gemini 2.5 TTS

Gemini 3.1 Flash TTS uses the same voice names, audio tag syntax, and Multi-Speaker Config format as Gemini 2.5 TTS. No changes to existing scripts or prompts are needed. To migrate, select Gemini 3.1 Flash TTS as the model in your workflow or API call and keep all other parameters as-is.

Gemini 2.5 Pro TTS offered higher audio fidelity in exchange for longer processing time. Gemini 3.1 Flash TTS matches or exceeds that output quality at faster speeds. There is no direct 3.1 Pro equivalent; 3.1 Flash is the recommended single option for all production use.