
Scenario’s text-to-speech tools let you create professional-quality voiceovers from text. Among the available models, three ElevenLabs models stand out: ElevenLabs v3, Multilingual v2 and Turbo 2.5. This guide explains each model’s unique strengths, how to choose between them, and tips for getting the best results.
1. Overview
ElevenLabs v3 (alpha)
Eleven v3 is a research‑preview model that produces natural, life‑like speech with a high emotional range. Unlike v2 and Turbo, it is not designed for real‑time applications and is better suited to long‑form narration and expressive dialogue. The model supports more than 70 languages. Because it is in alpha, quality can vary; we recommend experimenting with longer prompts (≥250 characters) and multiple generations.
ElevenLabs Multilingual v2
Multilingual v2 is ElevenLabs’ most lifelike and emotionally rich production model. It delivers consistent voice quality and natural prosody across 29 languages, making it ideal for audiobooks, film dubbing, podcasts and other projects where emotional fidelity matters. V2 prioritizes quality over speed, resulting in higher latency and cost compared with Turbo. You can generate up to 10 000 characters per request.
Supported languages: English, Spanish, French, German, Italian, Portuguese, Russian, Japanese, Korean, Chinese (Mandarin), Arabic, Hindi, Dutch, Polish, Czech, Slovak, Ukrainian, Croatian, Romanian, Bulgarian, Greek, Finnish, Danish, Swedish, Norwegian, Hungarian, Turkish, Hebrew, Malay, Tamil.
ElevenLabs Turbo 2.5
Turbo 2.5 balances quality with low latency. It supports 32 languages, adding Vietnamese, Hungarian and Norwegian to the v2 language set. Generation speed is roughly three times faster than v2 for non‑English languages and 25% faster for English, and the model is about 50% cheaper per character. This makes Turbo ideal for real‑time conversational agents, interactive games and high‑volume projects. Turbo supports up to 40 000 characters per call and allows manual language enforcement via two‑letter ISO 639‑1 codes.
2. Model selection
2.1 Choosing ElevenLabs v3
Storytelling and character dialogue – Use v3 when you need highly expressive performances for multi‑speaker conversations, audiobooks or dramatic scenes. The model’s emotional range and contextual understanding provide realism that v2 and Turbo cannot match.
Non‑real‑time projects – v3 is not optimized for real‑time; its higher latency and 3 000‑character limit favour offline workflows where you can generate several takes and choose the best.
Multilingual content – With support for 70+ languages, v3 is a good choice when you need expressive narration in less‑common languages.
2.2 Choosing Multilingual v2
High‑fidelity narration – Select v2 when natural prosody and emotional nuance are paramount, such as for audiobooks, podcasts, voiceovers and educational content.
Stable quality – V2 maintains consistent voice personality across long passages and multiple languages.
Language coverage – Use v2 when your content is in one of its 29 supported languages and you require emotional richness.
Content length – v2’s 10 000‑character limit per call supports longer audio segments than v3.
2.3 Choosing Turbo 2.5
Real‑time interaction – Turbo’s low latency and cost make it suitable for chatbots, games and other interactive applications.
Cost efficiency – Its per‑character price is roughly half that of v2, making it economical for large volumes of speech.
Language flexibility – Turbo supports 32 languages and allows manual language selection via ISO codes.
Longest content – With a 40 000‑character limit, Turbo can generate extended scripts in a single call.
3. Key differences
| Model | Latency & use | Emotional range & quality | Languages | Character limit |
|---|---|---|---|---|
| ElevenLabs v3 (alpha) | Not real‑time; suited to offline projects | Highest emotional range and contextual understanding | 70+ languages | 3 000 characters |
| Multilingual v2 | Higher latency and cost; prioritizes quality | Lifelike speech with rich emotional expression | 29 languages | 10 000 characters |
| Turbo 2.5 | Low latency and 50% cheaper per character | Balanced quality; less emotional nuance than v2 | 32 languages | 40 000 characters |
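If you drive these models through an API rather than the Scenario interface, the comparison above boils down to picking a model identifier and respecting its character limit. Below is a minimal sketch, assuming the public ElevenLabs model IDs (`eleven_v3`, `eleven_multilingual_v2`, `eleven_turbo_v2_5`) and the limits quoted in the table; Scenario's own endpoints and naming may differ.

```python
# Hypothetical helper: choose a model ID and validate script length against
# the character limits from the table above. The model IDs follow the public
# ElevenLabs naming convention and are assumptions, not Scenario-specific values.
MODELS = {
    "eleven_v3": {"max_chars": 3_000, "real_time": False},                # highest expressiveness (alpha)
    "eleven_multilingual_v2": {"max_chars": 10_000, "real_time": False},  # richest production quality
    "eleven_turbo_v2_5": {"max_chars": 40_000, "real_time": True},        # lowest latency and cost
}

def pick_model(script: str, need_real_time: bool = False, need_max_expressiveness: bool = False) -> str:
    """Return a model ID based on the guidance in sections 2 and 3."""
    if need_real_time:
        model = "eleven_turbo_v2_5"
    elif need_max_expressiveness:
        model = "eleven_v3"
    else:
        model = "eleven_multilingual_v2"
    limit = MODELS[model]["max_chars"]
    if len(script) > limit:
        raise ValueError(f"{model} accepts at most {limit} characters per request")
    return model
```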
4. Interface controls & workflow
4.1 Text input
In the Scenario interface, type or paste your script in the text field. ElevenLabs models automatically detect the language and can handle multilingual content within a single generation. Use proper punctuation and capitalization to guide rhythm and emphasis: ellipses (…) add pauses, and capitalization signals emphasis.
4.2 Voice selection
All three models share the same voice library. Choose a voice that matches the desired delivery; neutral voices tend to be more stable across languages. For v3, voice selection is especially critical because the model responds strongly to voice characteristics.
ElevenLabs' voice library includes Aria, Roger, Sarah, Laura, Charlie, George, Callum, River, Liam, Charlotte, Alice, Matilda, Will, Jessica, Eric, Chris, Brian, Daniel, Lily, and Bill. Each voice works with all three models.
Aria: female, expressive, social, engaging.
Roger: male, confident, social, persuasive.
Sarah: female, expressive, social, energetic.
Laura: female, upbeat, social, lively.
Charlie: male, natural, conversational, relaxed.
George: male, warm, narration, trustworthy.
Callum: male, intense, character, dramatic.
River: non-binary, confident, social, modern.
Liam: male, articulate, narration, clear.
Charlotte: female, seductive, character, playful.
Alice: female, confident, news, formal.
Matilda: female, friendly, narration, calm.
Will: male, natural, narration, steady.
Jessica: female, expressive, conversational, youthful.
Eric: male, friendly, conversational, approachable.
Chris: male, casual, conversational, easy going.
Brian: male, deep, narration, serious.
Daniel: male, authoritative, news, commanding.
Lily: female, warm, narration, gentle.
Bill: male, trustworthy, narration, classic.
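All of these voices are picked from a dropdown in the Scenario UI. If you work against the ElevenLabs API directly, the same library is exposed through the voices endpoint, so a name such as "Sarah" can be resolved to the voice ID that text-to-speech requests expect. A brief sketch, assuming the public `GET /v1/voices` endpoint and the `requests` library; how Scenario surfaces voice IDs may differ.

```python
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder

# Fetch the shared voice library and build a name -> voice_id lookup.
resp = requests.get(
    "https://api.elevenlabs.io/v1/voices",
    headers={"xi-api-key": API_KEY},
)
resp.raise_for_status()

voices = {v["name"]: v["voice_id"] for v in resp.json()["voices"]}
sarah_id = voices["Sarah"]  # use this ID in text-to-speech requests
print(sarah_id)
```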
4.3 Generation parameters (v2 & Turbo)
Multilingual v2 and Turbo 2.5 provide the following controls:
Stability (0-1, default 0.5)
Controls the consistency and predictability of speech. Higher values produce more stable, consistent output; lower values allow more variation and expressiveness.
Similarity Boost (0-1, default 0.5)
Enhances similarity to the selected voice characteristics. Higher values make the output match the chosen voice profile more closely.
Style Exaggeration (0-1, default 0)
Controls emotional intensity and expressiveness. Higher values increase dramatic emphasis and emotional range. More effective with Multilingual v2.
Speed (0.7-1.2, default 1)
Adjusts speech rate. Values below 1.0 slow speech down; values above 1.0 speed it up. Extreme values may affect quality.
Timestamps Toggle
When enabled, returns timestamps for each word in the generated speech, useful for synchronization applications.
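Scenario exposes these controls as sliders and toggles. If you were calling the underlying ElevenLabs text-to-speech API directly, they map roughly onto the `voice_settings` object of the request body, as in the sketch below. The endpoint and field names follow the public ElevenLabs API rather than anything Scenario-specific, and the `speed` field in particular is an assumption, since not every API version accepts it.

```python
import requests

API_KEY = "YOUR_ELEVENLABS_API_KEY"  # placeholder
VOICE_ID = "YOUR_VOICE_ID"           # e.g. the ID resolved for "Sarah" above

payload = {
    "text": "Chapter One: The discovery. Sarah walked through the ancient library...",
    "model_id": "eleven_multilingual_v2",
    "voice_settings": {
        "stability": 0.7,         # 0-1: higher = more consistent, predictable delivery
        "similarity_boost": 0.8,  # 0-1: higher = closer to the selected voice profile
        "style": 0.2,             # 0-1: style exaggeration; most effective with Multilingual v2
        "speed": 1.0,             # 0.7-1.2: assumed field, check your API version
    },
}

resp = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY, "Content-Type": "application/json"},
    json=payload,
)
resp.raise_for_status()

with open("narration.mp3", "wb") as f:
    f.write(resp.content)  # the endpoint returns raw audio bytes (MP3 by default)
```

The Timestamps toggle corresponds to a separate with-timestamps variant of the same endpoint, which returns base64-encoded audio plus alignment data instead of raw bytes.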
4.4 Advanced Features
Advanced features include Previous Text/Next Text fields for chaining long scripts and a Language Code parameter to enforce a specific language.
Previous Text / Next Text
Allows chaining multiple text segments for longer content generation while maintaining voice consistency.
Language Code
Manually specify the language using ISO 639-1 codes to enforce specific pronunciation when automatic detection isn't sufficient.
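In API terms, these two features are optional fields in the same request body shown earlier. A small sketch, assuming the public ElevenLabs field names `previous_text`, `next_text` and `language_code`; Scenario's UI fills these in for you.

```python
# Sketch: generate the second segment of a longer French script with Turbo 2.5,
# passing the first segment as context and forcing French pronunciation.
segment_1 = "Bienvenue à notre conférence internationale."
segment_2 = "Nous allons commencer dans quelques instants."

payload = {
    "text": segment_2,
    "model_id": "eleven_turbo_v2_5",
    "previous_text": segment_1,  # context from the segment generated just before
    "next_text": "",             # optional context for the segment that follows
    "language_code": "fr",       # two-letter ISO 639-1 code (Turbo 2.5 only)
}
# Send `payload` to the same text-to-speech endpoint as in the previous sketch.
```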
5. Best practices (all ElevenLabs models)
Text Formatting
Use proper punctuation for natural pacing. Periods create pauses, commas add brief breaks, and exclamation points increase energy. Break long paragraphs into shorter sentences for better flow.
Voice Consistency
Use the same voice and similar parameter settings across related content. Enable Timestamps when you need precise synchronization with other media.
Parameter Tuning
Start with default settings and adjust based on results. Higher Stability for consistent content, higher Style Exaggeration for dramatic effect, adjusted Speed for pacing preferences.
Model Selection
Use Multilingual v2 for final production content where quality matters. Use Turbo 2.5 for prototyping, real-time applications, or when speed is the priority.
6. Best practices (ElevenLabs v3)
ElevenLabs v3 introduces unique settings and tags for fine‑grained emotional control:
Longer prompts – Prompts shorter than ~250 characters may yield inconsistent output; longer prompts improve stability.
Stability modes – v3 offers Creative, Natural and Robust modes. Creative provides expressive output but may hallucinate; Natural balances expressiveness and accuracy; Robust is highly stable but less responsive. Use Creative or Natural when employing audio tags.
Audio tags – Use tags such as [laughs], [whispers], [sarcastic], [curious] and [excited], or sound effects like [gunshot], [applause] and [clapping], to control emotion and add effects. Some tags may work better with certain voices; test combinations to find what works.
Punctuation and capitalization – Ellipses create pauses, capitalization adds emphasis, and proper punctuation improves natural rhythm.
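Because audio tags live inside the text itself, a v3 generation is just a normal request with tags embedded in the script. A minimal sketch, assuming `eleven_v3` as the model ID for the alpha model and the same endpoint as above:

```python
# Sketch: an expressive v3 request with inline audio tags.
# Real prompts should be longer (roughly 250+ characters) for stable results.
v3_payload = {
    "text": (
        "[whispers] I found something in the old archives... "
        "[excited] You won't BELIEVE what it says! [laughs]"
    ),
    "model_id": "eleven_v3",  # assumed ID for the alpha model
}
```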
7. Practical prompt examples
Audiobook narration (v2) – “Chapter One: The discovery. Sarah walked through the ancient library, her footsteps echoing in the silence….” Use high Stability and Similarity Boost values and moderate Style Exaggeration (0.1–0.3) for subtle emotion.
Emotional dialogue (v2/v3) – “I can’t believe you’re leaving! After everything we’ve been through together, how can you just walk away like this means nothing?” Add tags like [crying] or [angry] in v3; increase Style Exaggeration in v2.
Educational content (v2) – “Today we’ll explore the fascinating world of quantum physics. Don’t worry if it seems complex at first – we’ll break it down step by step.” Choose a calm voice and set Stability high.
Conversational AI (Turbo) – “Hi there! How can I help you today? I’m here to answer your questions and assist with whatever you need.” Lower Stability and increase Speed for snappier responses; set a Language Code to ensure the desired language.
Multilingual prompts – “Welcome to our international conference. Bienvenue à notre conférence internationale. Bienvenidos a nuestra conferencia internacional.” Use v2 or Turbo with automatic language detection, or specify a Language Code for each segment.
8. Optimization Settings by Use Case
8.1 Audiobook/Podcast (Multilingual v2)
High Stability and Similarity Boost with moderate Style Exaggeration (0.1–0.3) for subtle emotion; keep Speed at the default.
8.2 Conversational AI (Turbo 2.5)
Lower Stability and a slightly higher Speed for snappier responses; set a Language Code when automatic detection isn't sufficient.
8.3 Dramatic Content (Multilingual v2)
Higher Style Exaggeration for emotional intensity, with moderate Stability to keep delivery controlled.
8.4 Educational Content (Both Models)
High Stability and a calm voice; use Multilingual v2 for final production quality and Turbo 2.5 when speed is the priority.
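If you want to reuse these starting points programmatically, they can be kept as plain parameter presets. The numeric values below simply restate the qualitative guidance above and are illustrative assumptions, not official defaults.

```python
# Illustrative presets restating the guidance in this section; tune per project.
PRESETS = {
    "audiobook_podcast": {"model_id": "eleven_multilingual_v2",
                          "stability": 0.8, "similarity_boost": 0.8, "style": 0.2, "speed": 1.0},
    "conversational_ai": {"model_id": "eleven_turbo_v2_5",
                          "stability": 0.4, "similarity_boost": 0.5, "style": 0.0, "speed": 1.1},
    "dramatic_content":  {"model_id": "eleven_multilingual_v2",
                          "stability": 0.5, "similarity_boost": 0.7, "style": 0.6, "speed": 1.0},
    "educational":       {"model_id": "eleven_multilingual_v2",
                          "stability": 0.8, "similarity_boost": 0.7, "style": 0.1, "speed": 1.0},
}
```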
9. Creative Applications
Audio Content Creation
Create audiobook narrations, podcast intros, video voiceovers, and educational content. Use Previous Text/Next Text features for longer content while maintaining voice consistency.
Interactive Applications
Build conversational AI systems, voice assistants, interactive games, and real-time communication tools. Turbo 2.5's speed makes it ideal for responsive applications.
Multilingual Projects
Develop content for global audiences using automatic language detection or manual language specification. Both models maintain voice characteristics across different languages.
10. Troubleshooting
Quality Issues
If speech sounds robotic, lower Stability and increase Style Exaggeration. If pronunciation is incorrect, try different punctuation or manual Language Code specification for Turbo 2.5.
Speed vs Quality Balance
For real-time applications needing better quality, try Turbo 2.5 with higher Stability settings. For high-quality content needing faster generation, use Multilingual v2 with optimized parameters.
Voice Consistency
Use identical parameter settings and the same voice selection across related content. The Similarity Boost setting helps maintain consistent voice characteristics.
Conclusion
Scenario’s ElevenLabs portfolio features Eleven v3, Multilingual v2 and Turbo 2.5, offering a spectrum from high expressiveness to real‑time efficiency. By understanding each model’s strengths, selecting suitable voices, tuning generation parameters and crafting well‑structured prompts, you can produce professional‑quality audio tailored to your use case.