Introduction to Audio Generation on Scenario

Last updated: April 20, 2026

asset_Tr2u8mbzbkJS6rGCEvA9oojJ_A sleek, futuristic banner for an article on 'Introduction to Audio Generation on Scenario'. The design should be dark, high-tech, and professional, drawing style inspiration from the.png

Scenario provides a full suite of audio generation models spanning four categories: music, text-to-speech, sound effects, and video-to-audio. This article is a quick-reference guide to help you find the right model for your use case. Each model listed below links to a dedicated article with full parameter documentation, tips, and examples.


Music Generation

Music models generate original audio tracks from text descriptions. They range from quick 30-second clips to full structured songs up to 5 minutes in length.

ModelID

Output length

Key differentiator

MiniMax Music 2.6

model_minimax-music-2-6

2:30 to 4:45 min

Full songs with vocals. Instrumental mode, auto-lyrics, and 14+ structure tags. Highest-quality output.

MiniMax Music 2.5

model_minimax-music-2-5

2:30 to 4:45 min

Full songs. Lyrics required. Same structure tag system as 2.6. Use if manual lyrics control is preferred.

ElevenLabs Music Advanced

model_elevenlabs-music-advanced

Up to 3 min

Section-by-section composition with per-section styles and narrative lines. Best for structured cinematic or game music.

ElevenLabs Music

model_elevenlabs-music

3 sec to 3 min

Single prompt, fast output. Reliable instrumental mode. Opus 192kbps max quality.

Google Lyria 3 Pro

model_lyria-3-pro

Up to ~2 min, MP3 or WAV

Full songs from text or image references. Verse/chorus structure tags. SynthID watermarked.

Google Lyria 3 Clip

model_lyria-3-clip

30 seconds, MP3

Quick 30-second music clips. Accepts image references. SynthID watermarked.

Beatoven Music Generation

model_beatoven-music-generation

5 to 150 seconds

Royalty-free instrumental music. Negative prompt, refinement, and creativity sliders.

Meta MusicGen

model_meta-musicgen

Up to 30 seconds

Melody-conditioned generation. Can continue or transform a reference audio clip.

Google Lyria 2

model_lyria-2

30 seconds, 48kHz stereo

Legacy model. Instrumental only. Only Lyria model with negativePrompt support.

Quick decision guide: For full songs with vocals, use MiniMax Music 2.6. For structured cinematic compositions, use ElevenLabs Music Advanced. For quick instrumental clips, use ElevenLabs Music or Lyria 3 Clip. For royalty-free background music with quality controls, use Beatoven.


Text-to-Speech

TTS models convert written text into spoken audio. They vary in language support, voice selection, emotional range, and latency.

ModelID

Voices / Languages

Key differentiator

Gemini 3.1 Flash TTS

model_google-gemini-3-1-flash-tts

30 voices, 24 languages

Inline audio tags for emotion control ([determination], [whispers]). Multi-speaker dialogue. Up to 5,000 characters.

MiniMax Speech 2.8 HD

model_minimax-speech-2-8-hd

17 voices, 40+ languages

Highest quality TTS. Pause syntax, interjections, 10 emotions. Broadcast-ready output.

MiniMax Speech 2.8 Turbo

model_minimax-speech-2-8-turbo

17 voices, 40+ languages

Same as HD, optimized for speed and cost. Best for real-time and high-volume pipelines.

ElevenLabs 3

model_elevenlabs-tts-v3

Custom voices, 70+ languages

Most expressive ElevenLabs model. Alpha release, non-real-time. 3,000 character limit.

ElevenLabs Multilingual 2

model_elevenlabs-multilingual-v2

Custom voices, 29 languages

Emotionally rich speech. Production-ready. 10,000 character limit.

ElevenLabs Turbo 2.5

model_elevenlabs-turbo-v2-5

Custom voices, 32 languages

Low-latency ElevenLabs. Cost-effective for real-time applications.

xAI Grok TTS

model_xai-grok-tts

Multiple voices, multilingual

Expressive delivery with speech tags like [pause] and whisper mode. Codec and language options.

Gemini 2.5 Pro TTS

model_google-gemini-2-5-pro-tts

Multiple voices

Previous Gemini TTS generation. Migrate to Gemini 3.1 Flash for new projects.

Gemini 2.5 Flash TTS

model_google-gemini-2-5-flash-tts

Multiple voices

Previous Gemini TTS generation. Migrate to Gemini 3.1 Flash for new projects.

Tada 3B TTS

model_tada-3b-text-to-speech

Reference audio cloning

Voice cloning from a short audio reference. 3B parameter model, higher quality than 1B.

Tada 1B TTS

model_tada-1b-text-to-speech

Reference audio cloning

Voice cloning from a short audio reference. Faster, lighter than 3B.

Lux TTS

model_lux-tts

Reference audio cloning

High-fidelity voice cloning at 48kHz from a reference audio clip.

Quick decision guide: For general speech with emotion control, use Gemini 3.1 Flash TTS. For broadcast-quality narration in 40+ languages, use MiniMax Speech 2.8 HD. For voice cloning from a reference clip, use Lux TTS or Tada 3B. For real-time applications, use MiniMax 2.8 Turbo or ElevenLabs Turbo 2.5.


Sound Effects

SFX models generate short audio clips from text descriptions, without requiring a video input.

ModelID

Max duration

Key differentiator

ElevenLabs Sound Effects 2

model_elevenlabs-sound-effects-v2

30 seconds

Precise sound effects with guidance control and loop toggle. Commercially licensed.

Beatoven Sound Effect

model_beatoven-sound-effect

35 seconds

Royalty-free SFX with refinement and creativity sliders. Negative prompt support.

MM Audio 2 Text-To-Audio

model_mm-audio-2-t2a

30 seconds

Academic model (MIT license). cfgStrength control. Pairs with MM Audio 2 for video-to-audio workflows.


Video-to-Audio

Video-to-audio models analyze a silent video and generate synchronized audio that matches the on-screen action. They combine visual analysis with a text prompt to produce temporally aligned Foley and ambient sound.

ModelID

Max duration

Key differentiator

MM Audio 2

model_mm-audio-2

30 seconds

More precise temporal alignment than MM Audio. Default 8s output, up to 30s.

MM Audio

model_mm-audio

30 seconds

Original MM Audio model. Default 30s output. MIT license.


Choosing the Right Model

I need to...

Start with

Generate a full song with vocals and arrangement

MiniMax Music 2.6

Compose music section-by-section for a game or film

ElevenLabs Music Advanced

Generate a quick instrumental background track

ElevenLabs Music or Beatoven Music Generation

Generate a 30-second music clip

Google Lyria 3 Clip

Narrate a script in multiple languages

MiniMax Speech 2.8 HD or Gemini 3.1 Flash TTS

Clone a specific voice from a recording

Lux TTS or Tada 3B TTS

Build a real-time voice assistant

MiniMax Speech 2.8 Turbo or ElevenLabs Turbo 2.5

Generate sound effects for a game

ElevenLabs Sound Effects 2 or Beatoven Sound Effect

Add synchronized audio to a silent video

MM Audio 2


Lip Sync Animation: Bringing Characters to Life

And finally, lip sync animation, found in the “Video generation” tools. Just open the main Video generation interface, and select from the dedicated Lip Sync models to get started.

You can choose from a wide variety of specialized models, ranging from high-fidelity avatars like Omni Human 1.5 and Kling AI Avatar 2, to specialized tools for stylized motion or fast clips. Each model offers a different look, feel, or level of control to suit your specific project needs.

All lip-sync models on Scenario are covered in a dedicated section of the Knowledge Base, available here: https://help.scenario.com/en/c/lip-sync. This section includes both high-level overviews and in-depth documentation for specific models.

AI lip sync technology allows the synchronization of a character’s or person’s lip movements with an audio track or written text converted into speech. Depending on the model, you can upload an image or video together with audio, or simply provide text to generate both the voice and the synchronized lip movement.

Available Models:

  • Omni Human 1.5: (ByteDance) Generates digital humans from a single image and audio file, creating high-quality avatars with natural gestures.

  • Creatify Aurora: A specialized model by Creatify designed for creating dynamic, lifelike character animations.

  • Sync Lipsync React-1: Tailored for reactive video generation, syncing lip movements perfectly for reaction-style content.

  • Kling AI Avatar 2 (Pro): The advanced professional version of the Kling avatar model, offering higher fidelity and enhanced control.

  • Veed Fabric Lipsync 1.0: This model provides seamless lip synchronization integrated with Veed's video fabric technology.

  • Sync Lipsync 2 (Pro): A professional-grade “zero-shot” model that aligns video with new audio while preserving the original speaking style with high precision.

  • Pixverse Lipsync: Part of a broader video generation ecosystem, offering lip sync combined with visual effects and cinematic quality.

  • Creatify Lipsync: A tool for creating looped lip sync animations, with fast generation for short, lightweight videos.

  • Sync Lipsync 2: The standard “zero-shot” model that aligns any video with new audio, featuring multi-speaker support.

  • Kling Lipsync: Focused on short, fast lip sync generation from uploaded video and audio or text, with adjustable voice speed and type.


A Complete Solution for Creative Audio Workflows

No matter what you're creating, voiceovers, music, sound effects, or animated dialogue, Scenario’s audio tools are built for speed and flexibility, providing a complete solution for creative audio workflows. From expressive speech and cinematic sound to stylized performance and lip sync animation, everything works together to give you full control from the first prompt to the final frame.