ElevenLabs Text-to-Speech Models – The Essentials

Scenario’s text-to-speech tools let you create professional-quality voiceovers from text. Among the available models, three ElevenLabs models stand out: ElevenLabs v3, Multilingual v2 and Turbo 2.5. This guide explains each model’s unique strengths, how to choose between them, and tips for getting the best results.

Multilingual v2 supports 29 languages with rich emotional expression and natural prosody. Turbo 2.5 supports 32 languages (adds Vietnamese, Hungarian, Norwegian) with 3x faster generation for non-English languages and 25% faster English generation.


1. Overview

ElevenLabs v3 (alpha)

Eleven v3 is a research‑preview model that produces natural, life‑like speech with a high emotional range. Unlike v2 and Turbo, it is not designed for real‑time applications and is better suited to long‑form narration and expressive dialogue. The model supports more than 70 languages. Because it is in alpha, quality can vary; we recommend experimenting with longer prompts (≥250 characters) and multiple generations.


ElevenLabs Multilingual v2

Multilingual v2 is ElevenLabs’ most lifelike and emotionally rich production model. It delivers consistent voice quality and natural prosody across 29 languages, making it ideal for audiobooks, film dubbing, podcasts and other projects where emotional fidelity matters. V2 prioritizes quality over speed, resulting in higher latency and cost compared with Turbo. You can generate up to 10 000 characters per request.

Supported languages (29): English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese (Mandarin), Arabic, Hindi, Indonesian, Filipino, Dutch, Polish, Czech, Slovak, Ukrainian, Croatian, Romanian, Bulgarian, Greek, Finnish, Danish, Swedish, Turkish, Malay, Tamil.


ElevenLabs Turbo 2.5

Turbo 2.5 balances quality with low latency. It supports 32 languages, adding Vietnamese, Hungarian and Norwegian to the v2 language set. Generation speed is roughly three times faster than v2 for non‑English languages and 25 % faster for English, and the model is about 50 % cheaper per character. This makes Turbo ideal for real‑time conversational agents, interactive games and high‑volume projects. Turbo supports up to 40 000 characters per call and allows manual language enforcement via two‑letter ISO 639‑1 codes.


2. Model selection

2.1 Choosing ElevenLabs v3

  • Storytelling and character dialogue – Use v3 when you need highly expressive performances for multi‑speaker conversations, audiobooks or dramatic scenes. The model’s emotional range and contextual understanding provide realism that v2 and Turbo cannot match.

  • Non‑real‑time projects – v3 is not optimized for real‑time; its higher latency and 3 000‑character limit favour offline workflows where you can generate several takes and choose the best.

  • Multilingual content – With support for 70+ languages, v3 is a good choice when you need expressive narration in less‑common languages.


2.2 Choosing Multilingual v2

  • High‑fidelity narration – Select v2 when natural prosody and emotional nuance are paramount, such as for audiobooks, podcasts, voiceovers and educational content.

  • Stable quality – V2 maintains consistent voice personality across long passages and multiple languages.

  • Language coverage – Use v2 when your content is in one of its 29 supported languages and you require emotional richness.

  • Content length – v2’s 10 000‑character limit per call supports longer audio segments than v3.


2.3 Choosing Turbo 2.5

  • Real‑time interaction – Turbo’s low latency and cost make it suitable for chatbots, games and other interactive applications.

  • Cost efficiency – Its per‑character price is roughly half that of v2, making it economical for large volumes of speech.

  • Language flexibility – Turbo supports 32 languages and allows manual language selection via ISO codes.

  • Longest content – With a 40 000‑character limit, Turbo can generate extended scripts in a single call.


3. Key differences

| Model | Latency & use | Emotional range & quality | Languages | Character limit |
|---|---|---|---|---|
| ElevenLabs v3 (alpha) | Not real‑time; suited to offline projects | Highest emotional range and contextual understanding | 70+ languages | 3 000 characters |
| Multilingual v2 | Higher latency and cost; prioritizes quality | Lifelike speech with rich emotional expression | 29 languages | 10 000 characters |
| Turbo 2.5 | Low latency and 50 % cheaper per character | Balanced quality; less emotional nuance than v2 | 32 languages | 40 000 characters |
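
The selection criteria from sections 2 and 3 can be sketched as a small helper. This is only an illustration of the decision logic described in this guide: the function name, the dictionary keys and the model labels are hypothetical and are not official model identifiers or part of any Scenario or ElevenLabs API.

```python
# Per-request character limits as stated in this guide.
# The keys are illustrative labels, not official model identifiers.
MAX_CHARS = {
    "v3": 3_000,
    "multilingual_v2": 10_000,
    "turbo_2_5": 40_000,
}


def suggest_model(real_time: bool, max_expressiveness: bool, script_chars: int):
    """Return (model_label, fits_in_one_call) for a script of script_chars characters."""
    if real_time:
        # Turbo 2.5 is the only model in this guide described as low latency.
        choice = "turbo_2_5"
    elif max_expressiveness:
        # v3 has the widest emotional range but is not meant for real-time use.
        choice = "v3"
    else:
        # Multilingual v2 is the consistent, production-quality default.
        choice = "multilingual_v2"
    return choice, script_chars <= MAX_CHARS[choice]


if __name__ == "__main__":
    print(suggest_model(real_time=False, max_expressiveness=True, script_chars=5_000))
```

When the second value is False, the script exceeds the chosen model's limit and needs to be split into several requests.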


4. Interface controls & workflow

4.1 Text input

In the Scenario interface, type or paste your script in the text field. ElevenLabs models automatically detect the language and can handle multilingual content within a single generation. Use proper punctuation and capitalization to guide rhythm and emphasis; ellipses (…) add pauses and capitalization signals emphasis.


4.2 Voice selection

All three models share the same voice library. Choose a voice that matches the desired delivery; neutral voices tend to be more stable across languages. For v3, voice selection is especially critical because the model responds strongly to voice characteristics.

The voice library includes Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily and Bill. Each voice works with all three models.

  • Aria: female, expressive, social, engaging.

  • Roger: male, confident, social, persuasive.

  • Sarah: female, expressive, social, energetic.

  • Laura: female, upbeat, social, lively.

  • Charlie: male, natural, conversational, relaxed.

  • George: male, warm, narration, trustworthy.

  • Callum: male, intense, character, dramatic.

  • River: non-binary, confident, social, modern.

  • Liam: male, articulate, narration, clear.

  • Charlotte: female, seductive, character, playful.

  • Alice: female, confident, news, formal.

  • Matilda: female, friendly, narration, calm.

  • Will: male, natural, narration, steady.

  • Jessica: female, expressive, conversational, youthful.

  • Eric: male, friendly, conversational, approachable.

  • Chris: male, casual, conversational, easygoing.

  • Brian: male, deep, narration, serious.

  • Daniel: male, authoritative, news, commanding.

  • Lily: female, warm, narration, gentle.

  • Bill: male, trustworthy, narration, classic.


4.3 Generation parameters (v2 & Turbo)

Multilingual v2 and Turbo 2.5 provide the following controls:

  1. Stability (0-1, default 0.5)
    Controls consistency and predictability of speech. Higher values produce more stable, consistent output. Lower values allow more variation and expressiveness.

  2. Similarity Boost (0-1, default 0.5)
    Enhances similarity to the selected voice characteristics. Higher values make the output more closely match the chosen voice profile.

  3. Style Exaggeration (0-1, default 0)
    Controls emotional intensity and expressiveness. Higher values increase dramatic emphasis and emotional range. More effective with Multilingual v2.

  4. Speed (0.7-1.2, default 1)
    Adjusts speech rate. Values below 1.0 slow down speech, above 1.0 speed it up. Extreme values may affect quality.

  5. Timestamps Toggle
    When enabled, returns timestamps for each word in the generated speech, useful for synchronization applications.
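
If you keep settings in a script or config file alongside your projects, the ranges and defaults listed above can be captured in a small validated structure. This is a minimal sketch: the class and field names are hypothetical and simply mirror the control labels in the Scenario interface, not a documented API schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class VoiceSettings:
    """Generation parameters for Multilingual v2 and Turbo 2.5 (names are illustrative)."""

    stability: float = 0.5
    similarity_boost: float = 0.5
    style_exaggeration: float = 0.0
    speed: float = 1.0
    timestamps: bool = False

    def __post_init__(self) -> None:
        # Ranges as listed in section 4.3 of this guide.
        ranges = {
            "stability": (0.0, 1.0),
            "similarity_boost": (0.0, 1.0),
            "style_exaggeration": (0.0, 1.0),
            "speed": (0.7, 1.2),
        }
        for name, (low, high) in ranges.items():
            value = getattr(self, name)
            if not low <= value <= high:
                raise ValueError(f"{name}={value} is outside {low}-{high}")


if __name__ == "__main__":
    print(VoiceSettings(stability=0.7, speed=0.9))
```

Rejecting out-of-range values early makes it easier to reuse the same settings across related content.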


4.4 Advanced Features

Advanced features include Previous Text/Next Text fields for chaining long scripts and a Language Code parameter to enforce a specific language.

  1. Previous Text / Next Text
    Allows chaining multiple text segments for longer content generation while maintaining voice consistency.

  2. Language Code
    Manually specify language using ISO 639-1 codes to enforce specific language pronunciation when automatic detection isn't sufficient.
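
For scripts longer than a model's character limit, one way to prepare chained segments is to split at sentence boundaries and pair each segment with its neighbours. The sketch below only prepares the data; the function names and the dictionary keys (previous_text, next_text, language_code) are assumptions that mirror the field labels above and are not a documented request format.

```python
import re
from typing import Optional


def split_script(text: str, max_chars: int) -> list[str]:
    """Split text at sentence boundaries so each chunk stays within max_chars.

    A single sentence longer than max_chars is kept whole rather than cut mid-sentence.
    """
    sentences = re.split(r"(?<=[.!?])\s+", text.strip())
    chunks: list[str] = []
    current = ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_chars or not current:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


def build_segments(chunks: list[str], language_code: Optional[str] = None) -> list[dict]:
    """Attach the neighbouring chunks as previous/next context for each segment."""
    if language_code is not None and not re.fullmatch(r"[a-z]{2}", language_code):
        raise ValueError("language_code should be a two-letter ISO 639-1 code, e.g. 'en'")
    segments = []
    for i, chunk in enumerate(chunks):
        segment = {
            "text": chunk,
            "previous_text": chunks[i - 1] if i > 0 else None,
            "next_text": chunks[i + 1] if i + 1 < len(chunks) else None,
        }
        if language_code is not None:
            segment["language_code"] = language_code
        segments.append(segment)
    return segments


if __name__ == "__main__":
    for seg in build_segments(split_script("One. Two. Three.", 9), "en"):
        print(seg)
```

Each segment's text can then be pasted into the text field, with its neighbours in the Previous Text and Next Text fields.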


5. Best practices (all ElevenLabs models)

  1. Text Formatting

    Use proper punctuation for natural pacing. Periods create pauses, commas add brief breaks, and exclamation points increase energy. Break long paragraphs into shorter sentences for better flow.

  2. Voice Consistency

    Use the same voice and similar parameter settings across related content. Enable Timestamps when you need precise synchronization with other media.

  3. Parameter Tuning

    Start with default settings and adjust based on results. Raise Stability for consistent content, raise Style Exaggeration for dramatic effect, and adjust Speed to set the pacing.

  4. Model Selection

    Use Multilingual v2 for final production content where quality matters. Use Turbo 2.5 for prototyping, real-time applications, or when speed is the priority.

6. Best practices (ElevenLabs v3)

ElevenLabs v3 introduces unique settings and tags for fine‑grained emotional control:

  • Longer prompts – Prompts shorter than ~250 characters may yield inconsistent output; longer prompts improve stability.

  • Stability modes – v3 offers Creative, Natural and Robust modes. Creative provides expressive output but may hallucinate; Natural balances expressiveness and accuracy; Robust is highly stable but less responsive. Use Creative or Natural when employing audio tags.

  • Audio tags – Use tags such as [laughs], [whispers], [sarcastic], [curious], [excited] or sound effects like [gunshot], [applause], [clapping] to control emotion and add effects. Some tags may work better with certain voices; test combinations to find what works.

  • Punctuation and capitalization – Ellipses create pauses; capitalization adds emphasis; proper punctuation improves natural rhythm.
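
The v3 guidelines above can be turned into a quick pre-flight check before you generate. This is a rough sketch under two assumptions that the guide does not specify: that the 250-character guideline counts the whole prompt including tags, and that audio tags are lowercase words in square brackets. The function name and return keys are hypothetical.

```python
import re

MIN_V3_CHARS = 250  # guideline from this guide, not a hard limit
TAG_PATTERN = re.compile(r"\[([a-z][a-z ]*)\]")


def review_v3_prompt(prompt: str) -> dict:
    """Report the prompt length and the audio tags found in a v3 prompt."""
    return {
        "length": len(prompt),
        "long_enough": len(prompt) >= MIN_V3_CHARS,
        "tags": TAG_PATTERN.findall(prompt),
    }


if __name__ == "__main__":
    print(review_v3_prompt("[whispers] I never thought it would end like this. [laughs]"))
```

A short prompt is not an error; the check simply flags cases where output may be less consistent and more takes may be needed.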


7. Practical prompt examples

  • Audiobook narration (v2) – “Chapter One: The discovery. Sarah walked through the ancient library, her footsteps echoing in the silence…” Use high Stability and Similarity Boost values, and moderate Style Exaggeration (0.1–0.3) for subtle emotion.

  • Emotional dialogue (v2/v3) – “I can’t believe you’re leaving! After everything we’ve been through together, how can you just walk away like this means nothing?” Add tags like [crying] or [angry] in v3; increase Style Exaggeration in v2.

  • Educational content (v2) – “Today we’ll explore the fascinating world of quantum physics. Don’t worry if it seems complex at first – we’ll break it down step by step.” Choose a calm voice and set Stability high.

  • Conversational AI (Turbo) – “Hi there! How can I help you today? I’m here to answer your questions and assist with whatever you need.” Lower Stability and increase Speed for snappier responses; set Language Code to ensure the desired language.

  • Multilingual prompts – “Welcome to our international conference. Bienvenue à notre conférence internationale. Bienvenidos a nuestra conferencia internacional.” Use v2 or Turbo with automatic language detection, or specify a Language Code for each segment.

8. Optimization Settings by Use Case

8.1 Audiobook/Podcast (Multilingual v2)

For audiobook and podcast production, choose expressive voices like Aria or Sarah that can convey narrative emotion effectively. Set Stability between 0.6-0.8 to ensure consistent narration throughout longer content, while using a Similarity Boost of 0.7 to maintain strong voice consistency across chapters or episodes. Apply moderate Style Exaggeration of 0.3-0.5 for natural expression that engages listeners without overwhelming the content. Keep Speed between 0.9-1.0 to create a comfortable listening pace that allows for proper comprehension and enjoyment.

8.2 Conversational AI (Turbo 2.5)

Conversational AI applications work best with natural-sounding voices like Charlie or Laura that feel approachable and friendly. Use Stability settings of 0.4-0.6 to allow slight variation that makes conversations feel more human and less robotic. Set Similarity Boost to 0.5 for balanced consistency that maintains character while allowing natural speech variation. Keep Style Exaggeration minimal at 0-0.2 to maintain a neutral, professional tone appropriate for most conversational contexts. Speed should be set to 1.0-1.1 to create a responsive feel that matches natural conversation pacing.

8.3 Dramatic Content (Multilingual v2)

Dramatic content requires expressive voices like George or Jessica that can handle emotional range and character depth. Lower Stability to 0.3-0.5 to allow for emotional variation that brings characters to life and supports dramatic storytelling. Use Similarity Boost of 0.6 to maintain character consistency while allowing for emotional expression. Increase Style Exaggeration to 0.6-0.8 for dramatic emphasis that enhances the emotional impact of the content. Reduce Speed to 0.8-0.9 for dramatic pacing that gives weight to important moments and allows emotional beats to resonate.

8.4 Educational Content (Multilingual v2 and Turbo 2.5)

Educational content benefits from clear, articulate voices like Brian or Alice that prioritize comprehension and clarity. Set Stability to 0.7 for consistent delivery that helps students focus on the content rather than vocal variations. Use Similarity Boost of 0.6 for reliability that ensures consistent voice characteristics across lessons or modules. Apply moderate Style Exaggeration of 0.2-0.4 to create engaging but clear speech that maintains student interest without distracting from the educational material. Set Speed to 0.9 to optimize for comprehension, giving students time to process complex information while maintaining engagement.
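
The recommended ranges in this section can be kept as presets and reduced to a single starting value per parameter. This is a sketch only: the preset keys and the helper are hypothetical, and taking the midpoint of each range is a convenience assumption rather than a recommendation from ElevenLabs or Scenario.

```python
# (low, high) ranges taken from the use-case recommendations in this guide.
PRESETS = {
    "audiobook_podcast": {
        "stability": (0.6, 0.8), "similarity_boost": (0.7, 0.7),
        "style_exaggeration": (0.3, 0.5), "speed": (0.9, 1.0),
    },
    "conversational_ai": {
        "stability": (0.4, 0.6), "similarity_boost": (0.5, 0.5),
        "style_exaggeration": (0.0, 0.2), "speed": (1.0, 1.1),
    },
    "dramatic": {
        "stability": (0.3, 0.5), "similarity_boost": (0.6, 0.6),
        "style_exaggeration": (0.6, 0.8), "speed": (0.8, 0.9),
    },
    "educational": {
        "stability": (0.7, 0.7), "similarity_boost": (0.6, 0.6),
        "style_exaggeration": (0.2, 0.4), "speed": (0.9, 0.9),
    },
}


def starting_point(use_case: str) -> dict:
    """Return the midpoint of each recommended range as a first setting to try."""
    return {
        name: round((low + high) / 2, 2)
        for name, (low, high) in PRESETS[use_case].items()
    }


if __name__ == "__main__":
    print(starting_point("dramatic"))
```

Treat the result as a first take and adjust each value within its range after listening to the output.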

9. Creative Applications

  1. Audio Content Creation

    Create audiobook narrations, podcast intros, video voiceovers, and educational content. Use Previous Text/Next Text features for longer content while maintaining voice consistency.

  2. Interactive Applications

    Build conversational AI systems, voice assistants, interactive games, and real-time communication tools. Turbo 2.5's speed makes it ideal for responsive applications.

  3. Multilingual Projects

    Develop content for global audiences using automatic language detection or manual language specification. Multilingual v2 and Turbo 2.5 maintain voice characteristics across different languages.


10. Troubleshooting

  1. Quality Issues

    If speech sounds robotic, lower Stability and increase Style Exaggeration. If pronunciation is incorrect, try different punctuation or manual Language Code specification for Turbo 2.5.

  2. Speed vs Quality Balance

    For real-time applications needing better quality, try Turbo 2.5 with higher Stability settings. For high-quality content needing faster generation, use Multilingual v2 with optimized parameters.

  3. Voice Consistency

    Use identical parameter settings and the same voice selection across related content. The Similarity Boost setting helps maintain consistent voice characteristics.


Conclusion

Scenario’s ElevenLabs portfolio features Eleven v3, Multilingual v2 and Turbo 2.5, offering a spectrum from high expressiveness to real‑time efficiency. By understanding each model’s strengths, selecting suitable voices, tuning generation parameters and crafting well‑structured prompts, you can produce professional‑quality audio tailored to your use case.
