Gemini TTS: The Essentials

Last updated: April 22, 2026


Introduction

Covers Gemini 3.1 Flash TTS (current) and Gemini 2.5 TTS (previous generation)

Scenario offers two generations of Google's Gemini text-to-speech models. Gemini 3.1 Flash TTS is the current recommended version for all new projects. Gemini 2.5 TTS remains available for existing workflows.


What Gemini 3.1 Flash TTS Does

Gemini 3.1 Flash TTS converts a text script into expressive speech. You provide a script, choose a voice, and the model returns a mono MP3 at 24 kHz. What sets it apart is the audio tag system: by embedding tags like [determination], [whispers], or [enthusiasm] directly in the text, you control how specific lines are delivered without extra parameters or prompts. The direction lives in the script itself.

For dialogue scenes with two characters, Multi-Speaker Config maps two speaker names to separate voices and handles turn-taking from the labeled script automatically.


Parameters

| Parameter | Required | Default | Description |
| --- | --- | --- | --- |
| Text | Yes | | Script to synthesize. Maximum 5,000 characters. Audio tags can appear anywhere in the text. |
| Voice | No | Puck | Voice preset for single-speaker output. Ignored when Multi-Speaker Config is active. See the full list of 30 voices below. |
| Language | No | en-US | BCP-47 language code for the target language. Auto-detected if left empty, but setting it explicitly improves accuracy. 24 locales supported. |
| Multi-Speaker Config | No | | Maps up to 2 speaker names to voice presets. Overrides the single Voice setting. Speaker names in the text must match the labels in this config exactly. |
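The parameter rules above can be checked before a job is submitted. The following is a minimal sketch only: the function name `build_tts_request` and the dictionary keys (`text`, `voice`, `language`, `multi_speaker_config`) are hypothetical and are not taken from any documented API. Only the limits and defaults come from the table.

```python
MAX_TEXT_CHARS = 5000   # from the Text parameter description
MAX_SPEAKERS = 2        # from the Multi-Speaker Config description


def build_tts_request(text, voice="Puck", language="en-US", multi_speaker=None):
    """Assemble a request dict (hypothetical shape) after validating the inputs."""
    if not text:
        raise ValueError("Text is required.")
    if len(text) > MAX_TEXT_CHARS:
        raise ValueError(f"Text is {len(text)} characters; the limit is {MAX_TEXT_CHARS}.")

    request = {"text": text, "language": language}
    if multi_speaker:
        if len(multi_speaker) > MAX_SPEAKERS:
            raise ValueError(f"At most {MAX_SPEAKERS} speakers are supported per job.")
        # Multi-Speaker Config overrides the single Voice setting, so voice is omitted.
        request["multi_speaker_config"] = dict(multi_speaker)
    else:
        request["voice"] = voice
    return request
```

Calling `build_tts_request("Hello")` yields a request with the default voice and language, while passing `multi_speaker={"Scout": "Kore", "Commander": "Fenrir"}` drops the single-voice field.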


Audio Tags

Audio tags appear in square brackets within the script and control delivery without altering the words themselves. Place them at the start of a sentence for consistent application:

[determination] We are not leaving without the artifact.
[whispers] The vault door was already open.
[enthusiasm] This is going to change everything.
[slow] Read each step carefully before you begin.
[laughs] I cannot believe that actually worked.
[soft] The people using the tools. That always brings it back to earth.

Mid-sentence placement is supported but may produce uneven transitions in complex phrasing. The model supports over 200 audio tags covering emotions, interjections, pacing, and performance notes.
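Because mid-sentence tags may produce uneven transitions, it can help to flag them before generating audio. The sketch below is a heuristic under stated assumptions: it treats a tag as sentence-opening when it sits at the start of the script, at the start of a line, after sentence-ending punctuation, or after a speaker label ending in a colon. The helper name `find_mid_sentence_tags` is illustrative.

```python
import re

# Anything in square brackets is treated as an audio tag.
TAG_PATTERN = re.compile(r"\[([^\[\]]+)\]")


def find_mid_sentence_tags(script):
    """Return the names of tags that do not open a sentence or a labeled line."""
    flagged = []
    for match in TAG_PATTERN.finditer(script):
        # Text before the tag, ignoring spaces and tabs directly in front of it.
        before = script[:match.start()].rstrip(" \t")
        opens_sentence = before == "" or before[-1] in "\n.!?:"
        if not opens_sentence:
            flagged.append(match.group(1))
    return flagged


script = "[whispers] The vault door was already open. It was [slow] very quiet."
print(find_mid_sentence_tags(script))
```

The check does not validate tag names, since the full list of supported tags is not reproduced here.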


Multi-Speaker Dialogue

To generate a dialogue between two characters, label each line with a speaker name and configure Multi-Speaker Config to map each name to a voice:

Scout: Someone followed us from the market. Do not look back.
Commander: [determination] How many?
Scout: Two. Maybe three. We take the alley on the left.
Commander: [low] I will handle the one on the right. You get clear.
Scout: Together or not at all.

In this example, Multi-Speaker Config would be set to: Scout mapped to Kore, Commander mapped to Fenrir.

Known limitation: Multi-Speaker Config manages turn-taking and audio tag delivery correctly, but voice differentiation between speakers is not reliably applied. Both speakers may render with the same underlying timbre regardless of the voice assignments. Use this feature primarily to structure labeled dialogue scripts, not to produce distinct character voices in a single job. For distinct voices per character, generate each speaker's lines as a separate single-speaker job and combine the audio files.
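The workaround of one single-speaker job per line can be prepared by splitting the labeled script first. This is a minimal sketch: `split_dialogue` is a hypothetical helper, it assumes every non-empty line has the form `Speaker: text`, and it only prepares the turns. Submitting the jobs and combining the audio files is left out.

```python
def split_dialogue(script, voices):
    """Return ordered (speaker, voice, line) tuples from a labeled script.

    Speaker labels must match the keys of `voices` exactly, mirroring the
    exact-match rule for Multi-Speaker Config.
    """
    turns = []
    for raw_line in script.splitlines():
        line = raw_line.strip()
        if not line:
            continue
        speaker, separator, text = line.partition(":")
        if not separator:
            raise ValueError(f"Line has no speaker label: {line!r}")
        if speaker not in voices:
            raise ValueError(f"Speaker {speaker!r} has no voice assigned.")
        turns.append((speaker, voices[speaker], text.strip()))
    return turns


dialogue = """Scout: Someone followed us from the market. Do not look back.
Commander: [determination] How many?
Scout: Two. Maybe three. We take the alley on the left."""

for speaker, voice, line in split_dialogue(dialogue, {"Scout": "Kore", "Commander": "Fenrir"}):
    print(f"{voice}: {line}")
```

Keeping the turns in script order makes it straightforward to concatenate the resulting audio files in the same sequence.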


Supported Voices

30 voices are available, named after astronomical bodies. The default voice is Puck.

All 30 voices

Achernar, Achird, Algenib, Algieba, Alnilam, Aoede,
Autonoe, Callirrhoe, Charon, Despina, Enceladus, Erinome,
Fenrir, Gacrux, Iapetus, Kore, Laomedeia, Leda,
Orus, Pulcherrima, Puck (default), Rasalgethi, Sadachbia, Sadaltager,
Schedar, Sulafat, Umbriel, Vindemiatrix, Zephyr, Zubenelgenubi

Test several voices with your script before batching. Register and timbre vary significantly across voices. For dramatic characters, try Fenrir or Puck. For narration, try Charon or Orus. For clear, neutral delivery, try Kore or Zephyr.


Supported Languages

| Language | BCP-47 code | Language | BCP-47 code |
| --- | --- | --- | --- |
| Arabic (Egypt) | ar-EG | Marathi | mr-IN |
| Bengali (Bangladesh) | bn-BD | Dutch | nl-NL |
| German | de-DE | Polish | pl-PL |
| English (US) | en-US | Portuguese (Brazil) | pt-BR |
| English (UK) | en-GB | Romanian | ro-RO |
| Spanish (US) | es-US | Russian | ru-RU |
| Spanish (Spain) | es-ES | Tamil | ta-IN |
| French (France) | fr-FR | Telugu | te-IN |
| Hindi | hi-IN | Thai | th-TH |
| Indonesian | id-ID | Turkish | tr-TR |
| Italian | it-IT | Ukrainian | uk-UA |
| Japanese | ja-JP | Vietnamese | vi-VN |
| Korean | ko-KR | | |

Always set the language explicitly. Auto-detection works for common languages but can produce inconsistent pronunciation on regional variants and mixed-language scripts.
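One way to enforce an explicit language is to validate the code against the table before submitting a job. The sketch below copies the codes from the table above. The helper name `resolve_language` is hypothetical, and the case-insensitive matching is a convenience based on BCP-47 tags being case-insensitive, not a documented behavior of the model.

```python
# Codes copied from the Supported Languages table.
SUPPORTED_LOCALES = [
    "ar-EG", "bn-BD", "de-DE", "en-US", "en-GB", "es-US", "es-ES",
    "fr-FR", "hi-IN", "id-ID", "it-IT", "ja-JP", "ko-KR", "mr-IN",
    "nl-NL", "pl-PL", "pt-BR", "ro-RO", "ru-RU", "ta-IN", "te-IN",
    "th-TH", "tr-TR", "uk-UA", "vi-VN",
]

# Lowercased lookup so that "pt-br" resolves to the canonical "pt-BR".
_CANONICAL = {code.lower(): code for code in SUPPORTED_LOCALES}


def resolve_language(code):
    """Return the canonical form of a listed code, or raise ValueError."""
    if not code:
        raise ValueError("Set the language explicitly instead of relying on auto-detection.")
    try:
        return _CANONICAL[code.strip().lower()]
    except KeyError:
        raise ValueError(f"Language code {code!r} is not in the supported list.") from None


print(resolve_language("pt-br"))
```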


Use Cases

  • Game dialogue: Generate NPC lines, quest narration, and cinematic audio at scale. Use audio tags to vary delivery across tense, calm, or comedic moments. Keep the same voice preset per character across all scenes for consistency.

  • Marketing and advertising: Produce ad scripts and product demos in multiple languages without re-recording. Use a fixed voice to build a consistent brand sound across all markets.

  • E-learning: Convert course content into narrated audio for 24 language markets. Use [slow] tags in technical sections to improve comprehension.

  • Audiobooks and publishing: Process long manuscripts in chunks through a Loop node. Split at natural sentence boundaries and maintain the same voice preset across all chunks for a cohesive final output.


Tips for Better Results

  1. Set the language explicitly. Auto-detection works for standard English but can drift on regional accents and non-Latin scripts.

  2. Place audio tags at sentence boundaries. Opening a sentence with a tag applies it consistently to the full sentence. Mid-sentence placement is supported but can produce uneven transitions.

  3. Test voice and tag combinations on a short clip before batching. Some voice and tag pairings produce unexpected results. A 2-sentence test costs almost nothing and avoids expensive re-generation on a full script.

  4. For distinct character voices, generate each character separately. Multi-Speaker Config handles turn-taking but does not reliably produce different timbres. Generate each speaker as a single-speaker job and combine the files.

  5. Split long scripts at natural pauses. The 5,000-character limit is generous but finite. Use a Loop node and a Split Text node to process full manuscripts automatically.

  6. Use the same voice preset for the same character across all sessions. The model is not stateful, but consistent voice selection produces consistent output across batches.
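For scripts prepared outside the workflow editor, the splitting described in tip 5 can be approximated in a few lines. This is a sketch, not the behavior of the Split Text node: `chunk_script` is a hypothetical helper that breaks text after `.`, `!`, or `?` followed by whitespace and packs whole sentences greedily into chunks that stay within the character limit.

```python
import re


def chunk_script(text, limit=5000):
    """Split text into chunks of at most `limit` characters at sentence boundaries."""
    # A sentence boundary is assumed to be ., !, or ? followed by whitespace.
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks = []
    current = ""
    for sentence in sentences:
        if len(sentence) > limit:
            raise ValueError("A single sentence exceeds the character limit.")
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= limit:
            current = candidate
        else:
            # The next sentence would overflow, so close the current chunk.
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(chunk_script("One. Two! Three?", limit=10))
# → ['One. Two!', 'Three?']
```

A tag that opens a sentence, such as `[slow] Read each step carefully.`, stays attached to that sentence because the split happens only after sentence-ending punctuation.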


Known Limitations

  • Two-speaker maximum per job. Multi-Speaker Config supports up to 2 speakers. For scenes with more characters, generate each pair separately.

  • Multi-Speaker Config does not reliably differentiate voice timbres. Both speakers may render with the same underlying voice regardless of the assigned presets. Use separate single-speaker jobs for distinct character voices.

  • No numeric speed or pitch controls. Pacing and tone are controlled exclusively through audio tags. There are no speed multiplier or pitch shift parameters.

  • Voice-language pairing inconsistencies. Not all voice and language combinations produce natural-sounding output. Some combinations may default to a different voice or produce accented output. Test before committing to a voice for a localized project.

  • Text input only. No voice cloning, reference audio upload, or voice transfer. Output is always one of the 30 preset voices.

  • No SSML support. The model uses its own proprietary audio tag system. Standard SSML tags are not recognized.
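Scripts migrated from an SSML-based engine may still contain markup that the model will not interpret. A small pre-flight check can catch it, sketched here under assumptions: `find_ssml_tags` is a hypothetical helper and the pattern covers only a handful of common SSML elements, so it is not a complete SSML parser.

```python
import re

# A partial list of common SSML element names; extend as needed.
SSML_TAG = re.compile(
    r"</?\s*(?:speak|break|prosody|emphasis|say-as|phoneme|sub|voice|p|s)\b[^>]*>",
    re.IGNORECASE,
)


def find_ssml_tags(script):
    """Return every SSML-like tag found in the script, in order of appearance."""
    return [match.group(0) for match in SSML_TAG.finditer(script)]


leftovers = find_ssml_tags('<speak>Hello <break time="1s"/> world</speak>')
print(leftovers)
# → ['<speak>', '<break time="1s"/>', '</speak>']
```

Square-bracket audio tags such as `[whispers]` are not matched, so a clean script returns an empty list.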


Migrating from Gemini 2.5 TTS

Gemini 3.1 Flash TTS uses the same voice names, audio tag syntax, and Multi-Speaker Config format as Gemini 2.5 TTS. No changes to existing scripts or prompts are needed. To migrate, select Gemini 3.1 Flash TTS as the model in your workflow or API call and keep all other parameters as-is.

Gemini 2.5 Pro TTS offered higher audio fidelity in exchange for longer processing time. Gemini 3.1 Flash TTS matches or exceeds that output quality at faster speeds. There is no direct 3.1 Pro equivalent; 3.1 Flash is the recommended single option for all production use.