Gemini TTS: The Essentials
Last updated: April 22, 2026

Introduction
Covers Gemini 3.1 Flash TTS (current) and Gemini 2.5 TTS (previous generation)
Scenario offers two generations of Google's Gemini text-to-speech models. Gemini 3.1 Flash TTS is the current recommended version for all new projects. Gemini 2.5 TTS remains available for existing workflows.
What Gemini 3.1 Flash TTS Does
Gemini 3.1 Flash TTS converts a text script into expressive speech. You provide a script, choose a voice, and the model returns a mono MP3 at 24 kHz. What sets it apart is the audio tag system: by embedding tags like [determination], [whispers], or [enthusiasm] directly in the text, you control how specific lines are delivered without extra parameters or prompts. The direction lives in the script itself.
For dialogue scenes with two characters, Multi-Speaker Config maps two speaker names to separate voices and handles turn-taking from the labeled script automatically.
Parameters
Parameter | Required | Default | Description |
Text | Yes | None | Script to synthesize. Maximum 5,000 characters. Audio tags can appear anywhere in the text. |
Voice | No | Puck | Voice preset for single-speaker output. Ignored when Multi-Speaker Config is active. See the full list of 30 voices below. |
Language | No | en-US | BCP-47 language code for the target language. Auto-detected if left empty, but setting it explicitly improves accuracy. 24 locales supported. |
Multi-Speaker Config | No | None | Maps up to 2 speaker names to voice presets. Overrides the single Voice setting. Speaker names in the text must match the labels in this config exactly. |
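The constraints in the table can be checked before a job is submitted. The following is a minimal sketch, not the platform's API: the function name `build_tts_request` and the dictionary keys (`text`, `voice`, `language`, `multiSpeakerConfig`) are hypothetical placeholders chosen for illustration. Only the limits and defaults come from the table.

```python
from typing import Dict, Optional

MAX_TEXT_CHARS = 5000  # maximum script length from the Parameters table
MAX_SPEAKERS = 2       # Multi-Speaker Config maps up to 2 speaker names


def build_tts_request(
    text: str,
    voice: str = "Puck",
    language: str = "en-US",
    speakers: Optional[Dict[str, str]] = None,
) -> dict:
    """Validate the inputs and return a request dict (key names are hypothetical)."""
    if not text:
        raise ValueError("Text is required")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(
            f"Text is {len(text)} characters; the limit is {MAX_TEXT_CHARS}"
        )
    request = {"text": text, "language": language}
    if speakers:
        if len(speakers) > MAX_SPEAKERS:
            raise ValueError(f"At most {MAX_SPEAKERS} speakers are supported")
        # Multi-Speaker Config overrides the single Voice setting,
        # so the voice is left out of the request entirely.
        request["multiSpeakerConfig"] = dict(speakers)
    else:
        request["voice"] = voice
    return request
```

Calling `build_tts_request("[whispers] The vault door was already open.")` yields a single-speaker request with the default voice and language. Passing `speakers={"Scout": "Kore", "Commander": "Fenrir"}` drops the `voice` key, mirroring the override rule in the table.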
Audio Tags
Audio tags appear in square brackets within the script and control delivery without altering the words themselves. Place them at the start of a sentence for consistent application:
[determination] We are not leaving without the artifact.
[whispers] The vault door was already open.
[enthusiasm] This is going to change everything.
[slow] Read each step carefully before you begin.
[laughs] I cannot believe that actually worked.
[soft] The people using the tools. That always brings it back to earth.
Mid-sentence placement is supported but may produce uneven transitions in complex phrasing. The model supports over 200 audio tags covering emotions, interjections, pacing, and performance notes.
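A script can be scanned for mid-sentence tags before it is submitted. The sketch below is an illustrative helper, not part of any Gemini or Scenario tooling. It assumes a tag "opens a sentence" when the text before it on the same line is empty or ends in `.`, `!`, `?`, or `:` (the colon allows for speaker labels), which is a simplification.

```python
import re
from typing import List

# Matches a bracketed audio tag such as [whispers] or [slow].
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def mid_sentence_tags(script: str) -> List[str]:
    """Return the tags that do not sit at the start of a sentence."""
    flagged = []
    for line in script.splitlines():
        for match in TAG_PATTERN.finditer(line):
            # Ignore earlier tags so stacked tags like "[slow] [soft]" both pass.
            before = TAG_PATTERN.sub("", line[: match.start()]).rstrip()
            if before and before[-1] not in ".!?:":
                flagged.append(match.group(1))
    return flagged
```

A flagged tag is not an error, since mid-sentence placement is supported. The list is only a prompt to listen to those lines more carefully in a test render.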
Multi-Speaker Dialogue
To generate a dialogue between two characters, label each line with a speaker name and configure Multi-Speaker Config to map each name to a voice:
Scout: Someone followed us from the market. Do not look back.
Commander: [determination] How many?
Scout: Two. Maybe three. We take the alley on the left.
Commander: [low] I will handle the one on the right. You get clear.
Scout: Together or not at all.
In this example, Multi-Speaker Config would be set to: Scout mapped to Kore, Commander mapped to Fenrir.
Known limitation: Multi-Speaker Config manages turn-taking and audio tag delivery correctly, but voice differentiation between speakers is not reliably applied. Both speakers may render with the same underlying timbre regardless of the voice assignments. Use this feature primarily to structure labeled dialogue scripts, not to produce distinct character voices in a single job. For distinct voices per character, generate each speaker's lines as a separate single-speaker job and combine the audio files.
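The workaround described above, one single-speaker job per line with the files combined afterwards, starts with parsing the labeled script. This is a rough sketch under stated assumptions: the helper name `parse_dialogue` is hypothetical, every non-empty line is assumed to follow the `Name: text` pattern, and submitting jobs and concatenating audio are left out because they depend on your tooling.

```python
import re
from typing import Dict, List, Tuple

# A speaker label is everything before the first colon on the line.
LINE_PATTERN = re.compile(r"^([^:\[\]]+):\s*(.+)$")


def parse_dialogue(
    script: str, voices: Dict[str, str]
) -> List[Tuple[int, str, str, str]]:
    """Return (turn_index, speaker, voice, line_text) for each labeled line."""
    turns = []
    for raw in script.splitlines():
        raw = raw.strip()
        if not raw:
            continue
        match = LINE_PATTERN.match(raw)
        if not match:
            raise ValueError(f"Unlabeled line: {raw!r}")
        speaker, text = match.group(1).strip(), match.group(2)
        # Speaker names must match the configured labels exactly.
        if speaker not in voices:
            raise ValueError(f"Speaker {speaker!r} has no voice mapping")
        turns.append((len(turns), speaker, voices[speaker], text))
    return turns
```

Each tuple can then be sent as its own single-speaker job with the listed voice. The turn index preserves the original order, so the resulting audio files can be joined in sequence.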
Supported Voices
30 voices are available, named after astronomical bodies. The default voice is Puck.
All 30 voices:
Achernar | Achird | Algenib | Algieba | Alnilam |
Aoede | Autonoe | Callirrhoe | Charon | Despina |
Enceladus | Erinome | Fenrir | Gacrux | Iapetus |
Kore | Laomedeia | Leda | Orus | Pulcherrima |
Puck (default) | Rasalgethi | Sadachbia | Sadaltager | Schedar |
Sulafat | Umbriel | Vindemiatrix | Zephyr | Zubenelgenubi |
Test several voices with your script before batching. Register and timbre vary significantly across voices. For dramatic characters, try Fenrir or Puck. For narration, try Charon or Orus. For clear, neutral delivery, try Kore or Zephyr.
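To compare voices on the same short clip, a small list of test jobs can be generated up front. The sketch below is illustrative only: `voice_test_jobs` and the job dictionary keys are hypothetical names. The voice names are copied from the table above.

```python
from typing import Dict, Iterable, List

# Voice presets copied from the voice table.
VOICES = [
    "Achernar", "Achird", "Algenib", "Algieba", "Alnilam",
    "Aoede", "Autonoe", "Callirrhoe", "Charon", "Despina",
    "Enceladus", "Erinome", "Fenrir", "Gacrux", "Iapetus",
    "Kore", "Laomedeia", "Leda", "Orus", "Pulcherrima",
    "Puck", "Rasalgethi", "Sadachbia", "Sadaltager", "Schedar",
    "Sulafat", "Umbriel", "Vindemiatrix", "Zephyr", "Zubenelgenubi",
]


def voice_test_jobs(sample: str, candidates: Iterable[str]) -> List[Dict[str, str]]:
    """Build one short test job per candidate voice, rejecting unknown names."""
    candidates = list(candidates)
    unknown = [v for v in candidates if v not in VOICES]
    if unknown:
        raise ValueError(f"Unknown voice(s): {', '.join(unknown)}")
    return [{"voice": v, "text": sample} for v in candidates]
```

For a narration project, `voice_test_jobs("[slow] Read each step carefully before you begin.", ["Charon", "Orus"])` produces two jobs that differ only in voice, which keeps the comparison fair.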
Supported Languages
Language | BCP-47 code | Language | BCP-47 code |
Arabic (Egypt) | ar-EG | Marathi | mr-IN |
Bengali (Bangladesh) | bn-BD | Dutch | nl-NL |
German | de-DE | Polish | pl-PL |
English (US) | en-US | Portuguese (Brazil) | pt-BR |
English (UK) | en-GB | Romanian | ro-RO |
Spanish (US) | es-US | Russian | ru-RU |
Spanish (Spain) | es-ES | Tamil | ta-IN |
French (France) | fr-FR | Telugu | te-IN |
Hindi | hi-IN | Thai | th-TH |
Indonesian | id-ID | Turkish | tr-TR |
Italian | it-IT | Ukrainian | uk-UA |
Japanese | ja-JP | Vietnamese | vi-VN |
Korean | ko-KR | | |
Always set the language explicitly. Auto-detection works for common languages but can produce inconsistent pronunciation on regional variants and mixed-language scripts.
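Because the language should always be set explicitly, it helps to normalize and validate the code before use. This is a minimal sketch: the helper name `normalize_locale` is hypothetical, and the set simply mirrors the codes listed in the table above.

```python
# Locale codes copied from the Supported Languages table.
SUPPORTED_LOCALES = {
    "ar-EG", "bn-BD", "de-DE", "en-US", "en-GB", "es-US", "es-ES",
    "fr-FR", "hi-IN", "id-ID", "it-IT", "ja-JP", "ko-KR", "mr-IN",
    "nl-NL", "pl-PL", "pt-BR", "ro-RO", "ru-RU", "ta-IN", "te-IN",
    "th-TH", "tr-TR", "uk-UA", "vi-VN",
}


def normalize_locale(code: str) -> str:
    """Return the code in language-REGION form, or raise if it is not listed."""
    parts = code.strip().replace("_", "-").split("-")
    if len(parts) != 2:
        raise ValueError(f"Expected a language-REGION code, got {code!r}")
    normalized = f"{parts[0].lower()}-{parts[1].upper()}"
    if normalized not in SUPPORTED_LOCALES:
        raise ValueError(f"Locale {normalized!r} is not in the supported list")
    return normalized
```

With this check, `normalize_locale("pt_br")` returns `"pt-BR"`, while an unlisted code such as `"en-AU"` raises an error instead of silently falling back to auto-detection.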
Use Cases
Game dialogue: Generate NPC lines, quest narration, and cinematic audio at scale. Use audio tags to vary delivery across tense, calm, or comedic moments. Keep the same voice preset per character across all scenes for consistency.
Marketing and advertising: Produce ad scripts and product demos in multiple languages without re-recording. Use a fixed voice to build a consistent brand sound across all markets.
E-learning: Convert course content into narrated audio for 24 language markets. Use [slow] tags in technical sections to improve comprehension.
Audiobooks and publishing: Process long manuscripts in chunks through a Loop node. Split at natural sentence boundaries and maintain the same voice preset across all chunks for a cohesive final output.
Tips for Better Results
Set the language explicitly. Auto-detection works for standard English but can drift on regional accents and non-Latin scripts.
Place audio tags at sentence boundaries. Opening a sentence with a tag applies it consistently to the full sentence. Mid-sentence placement is supported but can produce uneven transitions.
Test voice and tag combinations on a short clip before batching. Some voice and tag pairings produce unexpected results. A 2-sentence test costs almost nothing and avoids expensive re-generation on a full script.
For distinct character voices, generate each character separately. Multi-Speaker Config handles turn-taking but does not reliably produce different timbres. Generate each speaker as a single-speaker job and combine the files.
Split long scripts at natural pauses. Each job accepts at most 5,000 characters, so longer texts must be divided. Use a Loop node and a Split Text node to process full manuscripts automatically.
Use the same voice preset for the same character across all sessions. The model is not stateful, but consistent voice selection produces consistent output across batches.
Known Limitations
Two-speaker maximum per job. Multi-Speaker Config supports up to 2 speakers. For scenes with more characters, generate each pair separately.
Multi-Speaker Config does not reliably differentiate voice timbres. Both speakers may render with the same underlying voice regardless of the assigned presets. Use separate single-speaker jobs for distinct character voices.
No numeric speed or pitch controls. Pacing and tone are controlled exclusively through audio tags. There are no speed multiplier or pitch shift parameters.
Voice-language pairing inconsistencies. Not all voice and language combinations produce natural-sounding output. Some combinations may default to a different voice or produce accented output. Test before committing to a voice for a localized project.
Text input only. No voice cloning, reference audio upload, or voice transfer. Output is always one of the 30 preset voices.
No SSML support. The model uses its own proprietary audio tag system. Standard SSML tags are not recognized.
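Scripts migrated from other TTS systems may still contain SSML markup, which the model does not recognize. A quick check like the one below can flag leftovers before submission. It is a sketch that covers only a handful of common SSML element names, and the helper name `find_ssml_tags` is hypothetical.

```python
import re
from typing import List

# A few common SSML element names; this list is deliberately incomplete.
SSML_PATTERN = re.compile(
    r"</?\s*(speak|break|prosody|emphasis|say-as|sub|phoneme|voice)\b[^>]*>",
    re.IGNORECASE,
)


def find_ssml_tags(script: str) -> List[str]:
    """Return the sorted, de-duplicated SSML element names found in the script."""
    return sorted({m.group(1).lower() for m in SSML_PATTERN.finditer(script)})
```

An empty result means none of the listed elements were found. Any hits should be rewritten by hand, for example replacing a `<prosody>` rate change with a pacing tag such as [slow].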
Migrating from Gemini 2.5 TTS
Gemini 3.1 Flash TTS uses the same voice names, audio tag syntax, and Multi-Speaker Config format as Gemini 2.5 TTS. No changes to existing scripts or prompts are needed. To migrate, select Gemini 3.1 Flash TTS as the model in your workflow or API call and keep all other parameters as-is.
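In code, the migration amounts to changing one field. The sketch below assumes a job is represented as a dictionary with a `model` key. That structure and the model identifier strings are hypothetical placeholders, so substitute the identifiers your workflow or API actually uses.

```python
def migrate_job(job: dict, new_model: str = "gemini-3.1-flash-tts") -> dict:
    """Return a copy of the job with only the model field changed.

    The default identifier is a placeholder, not a confirmed model ID.
    """
    return {**job, "model": new_model}
```

The original job is left untouched, and the voice, text, language, and speaker settings carry over unchanged.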
Gemini 2.5 Pro TTS offered higher audio fidelity in exchange for longer processing time. Gemini 3.1 Flash TTS matches or exceeds that output quality at faster speeds. There is no direct 3.1 Pro equivalent; 3.1 Flash is the recommended single option for all production use.