MM Audio - The Essentials

1. Overview

Scenario’s audio generation capabilities include MM Audio, an AI model for video-to-audio synthesis. This guide explains what MM Audio does well, how to use it effectively, and how to get the best results.

MM Audio adds synthesized audio to silent videos. The model analyzes your video content and generates audio that matches what happens on screen, producing a usable soundtrack in minutes. It can also be used for text-to-audio and experimental image-to-audio generation.

MM Audio is a state-of-the-art AI model developed by researchers from the University of Illinois Urbana-Champaign, Sony AI, and Sony Group Corporation. It was accepted as a CVPR 2025 paper and has gained significant attention for its ability to generate high-quality, synchronized audio from video and text inputs.

2. When to Use MM Audio

Primary Use Case: MM Audio is designed specifically for generating synchronized audio for silent videos. Because it analyzes visual cues, motion, and environmental context, it is well suited to adding realistic soundscapes to your video content.
Content Enhancement: Use MM Audio when you have video content that needs audio enhancement, whether it's adding ambient sounds, sound effects, or environmental audio that matches the visual scene.
High-Quality Results: The model excels at producing high-fidelity audio that is both semantically aligned and temporally synchronized with the video, ensuring natural-sounding results.

With MM Audio, consider the following use cases:

Silent Video Enhancement: Transform silent videos into immersive experiences with contextually appropriate audio
Sound Design: Create professional sound effects and ambient audio for film, gaming, or content creation
Environmental Audio: Generate realistic environmental sounds that match the visual setting of your videos
Creative Projects: Add atmospheric audio to artistic or experimental video content

3. Key Differences

Feature	MM Audio in Scenario	Other AI Audio Models
Primary Function	Video-to-Audio Synthesis	Primarily Text-to-Audio or Text-to-Music
Input Requirements	Video file + optional text prompt	Text prompts only
Synchronization	High-precision temporal synchronization with video frames	No video synchronization capability
Training	Multimodal joint training on audio-visual data	Typically trained on single-modality datasets (text-audio)
Use Case	Adding realistic audio to silent videos	Generating audio from text descriptions

4. Interface Controls & Workflow

4.1 Prompt Input

In the Scenario interface, enter your text prompt to describe the desired audio. The prompt field includes helpful features:

Text Description: Describe your scene or subject, and let prompt tools generate, complete, or translate it
Image Upload: You can upload an image to turn it into a prompt using the "Upload an image" feature
Prompt Tools: Use the available tools (indicated by icons) to enhance your prompt

Use clear and descriptive keywords to specify the desired sounds. For example, instead of just "Water," use "Gentle waves lapping against shore."

4.2 Video Input

Choose your video source using one of two methods:

Select from Library: Choose from your existing video library in Scenario
Drag & Drop File: Upload a new video file by dragging and dropping it into the interface, or click "import it" to browse and select a file

4.3 Negative Prompt

The negative prompt allows you to exclude unwanted elements from the generated audio. Use this field to specify the sounds or audio qualities you want to avoid. For example, enter "music, speech, voices" to focus on environmental sounds only.

4.4 Additional Settings

Fine-tune your generation with these advanced parameters:

Duration (default: 8): Controls the length of the generated audio in seconds. Adjust the slider to set your desired duration. If duration is not provided, the duration of the video will be used.
Steps (default: 25): The number of generation steps. Higher values can lead to more detailed audio but will take longer to generate.
Guidance (default: 4.5): Controls how closely the model follows your prompt. Higher values will adhere more strictly to the prompt, while lower values allow for more creative interpretation.
Seed: The random seed for generation. Leave blank for random results, or enter a specific number for reproducible outputs.
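
If you keep track of settings outside the interface (for example, in notes or a script), the four parameters above can be sketched as a small Python record. This is a minimal sketch, not Scenario's API: the class name MMAudioSettings, its field names, and the validation rules are assumptions made for illustration. Only the default values (8, 25, 4.5, and a blank seed) come from this guide.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass(frozen=True)
class MMAudioSettings:
    """Hypothetical container for the four settings described above."""

    duration: Optional[float] = 8.0  # seconds; None means "use the video's length"
    steps: int = 25                  # number of generation steps
    guidance: float = 4.5            # how closely the model follows the prompt
    seed: Optional[int] = None       # None means a random seed

    def __post_init__(self) -> None:
        # These bounds are assumptions for illustration, not documented limits.
        if self.duration is not None and self.duration <= 0:
            raise ValueError("duration must be positive")
        if self.steps < 1:
            raise ValueError("steps must be at least 1")
        if self.guidance < 0:
            raise ValueError("guidance must be non-negative")

    def to_dict(self) -> dict:
        """Return only the values that are set, dropping any that are None."""
        return {key: value for key, value in asdict(self).items() if value is not None}


defaults = MMAudioSettings()
reproducible = MMAudioSettings(guidance=6.0, seed=1234)
print(defaults.to_dict())
print(reproducible.to_dict())
```

Recording the seed alongside the other values is what makes a good result repeatable later.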

4.5 Step-by-step guide

Select Video: Choose your video either from your library or upload a new file
Write Prompt: Provide a descriptive text prompt to guide the audio generation
Set Negative Prompt: Specify any sounds or elements you want to exclude
Adjust Additional Settings: Fine-tune Duration, Steps, Guidance, and Seed as needed
Generate: Start the audio synthesis process
Review and Refine: Review the generated audio and adjust your settings for optimal results
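
The steps above for the video-to-audio workflow can be sketched as a function that collects the same inputs in the same order. This is an illustrative sketch only: build_generation_request, the dictionary keys, and the file name beach_clip.mp4 are hypothetical and do not describe a real Scenario endpoint or payload format.

```python
from typing import Optional


def build_generation_request(
    video: str,
    prompt: str = "",
    negative_prompt: str = "",
    duration: Optional[float] = None,
    steps: int = 25,
    guidance: float = 4.5,
    seed: Optional[int] = None,
) -> dict:
    """Assemble a plain dict that mirrors the workflow steps (hypothetical keys)."""
    # Step 1: a video source is required for this workflow.
    if not video:
        raise ValueError("a video source is required")
    request = {"video": video, "steps": steps, "guidance": guidance}

    # Steps 2 and 3: the prompt and negative prompt are optional text fields.
    if prompt.strip():
        request["prompt"] = prompt.strip()
    if negative_prompt.strip():
        request["negative_prompt"] = negative_prompt.strip()

    # Step 4: leave duration and seed out when unset, so the video's length
    # and a random seed are used instead.
    if duration is not None:
        request["duration"] = duration
    if seed is not None:
        request["seed"] = seed
    return request


request = build_generation_request(
    video="beach_clip.mp4",
    prompt="Gentle waves lapping against shore",
    negative_prompt="music, speech, voices",
    seed=42,
)
print(request)
```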

5. Best Practices

Use Descriptive Prompts: The more descriptive your prompt, the better the results. Use specific keywords and phrases to describe the sounds you want to hear. Take advantage of the prompt tools to enhance your descriptions.
Leverage Image-to-Prompt: If you're struggling to describe your desired audio, try uploading an image that represents the scene or mood you want to capture. The system can convert visual elements into audio prompts.
Strategic Negative Prompting: Use the negative prompt effectively to exclude unwanted elements. Common exclusions include "music, speech, voices" for environmental sounds, or "harsh, loud, distorted" for gentler audio.
Optimize Duration Settings: While the default 8-second duration works well, adjust based on your needs. Shorter durations (5–6 seconds) work well for quick sound effects, while longer durations (10–12 seconds) are better for ambient soundscapes.
Fine-tune Generation Parameters:
- Use higher Guidance values (5–7) for precise prompt adherence, lower values (3–4) for creative interpretation
- Set a specific Seed value when you want to reproduce successful results
Video Selection Strategy: Choose videos from your library that clearly show the action or environment you want to sonify. The visual clarity directly impacts audio quality.
Iterate and Refine: Audio generation is an iterative process. Use the same seed with different prompts to compare results, or vary the guidance and steps to find the perfect balance for your project.
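
One way to organize the iteration described above is to plan a small grid of runs that share a seed, so that differences between results come from the prompt and guidance only. The sketch below just builds that plan as a list of dictionaries. The function build_sweep, the dictionary keys, and the second example prompt are assumptions for illustration, and the code does not call Scenario.

```python
from itertools import product


def build_sweep(prompts, guidance_values, seed, negative_prompt="music, speech, voices"):
    """Pair every prompt with every guidance value under one fixed seed."""
    return [
        {
            "prompt": prompt,
            "guidance": guidance,
            "seed": seed,
            "negative_prompt": negative_prompt,
        }
        for prompt, guidance in product(prompts, guidance_values)
    ]


prompts = [
    "Gentle waves lapping against shore",
    "Waves crashing on rocks, distant seagulls",  # hypothetical second prompt
]
guidance_values = [3.5, 4.5, 6.0]  # lower, default, and higher guidance

runs = build_sweep(prompts, guidance_values, seed=1234)
for index, run in enumerate(runs, start=1):
    print(f"Run {index}: guidance={run['guidance']} prompt={run['prompt']!r}")
```

Enter each planned combination in the interface, then compare the outputs side by side to see which change made the difference.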
