Train a Voice Clone (IVC)
Last updated: May 18, 2026
Voice training on Scenario lets you generate audio in the voice of a reference speaker. Today the platform offers Instant Voice Cloning (IVC): a fast clone built from a short audio sample, suitable for experimentation and prototyping. A higher-fidelity Professional Voice Clone is on the roadmap and will appear in the picker when available.
This article covers what IVC is, what it's good for, and how to produce a sample that yields a usable clone.

What IVC is, and what it isn't
IVC produces a voice clone from a single short audio sample. It's optimized for speed and iteration, not production-grade voice work.
| Use IVC for | Don't use IVC for |
|---|---|
| Quick experimentation and prototyping | Final voice-over for shipped content |
| Demos, internal previews, mockups | Long-form narration where consistency matters |
| Testing voice direction before committing to a pro recording | Commercial use cases requiring high fidelity or robustness |
| Generating placeholder voice for animatics or game prototypes | Cases where the clone must match the reference under varied emotional ranges or pacing |
For production voice work, wait for Professional Voice Clone (status: coming soon).
Selecting voice training in the picker
In Train > New Model, scroll to the Voice Training section and select Instant Voice Cloning (IVC). Professional Voice Clone will be available there once it ships.
Audio sample requirements
IVC works from a single short audio sample. Quality of the sample is the primary driver of clone quality.
| Requirement | Detail |
|---|---|
| Duration | Under about 30 seconds. Longer samples don't improve IVC results. |
| Format | Standard audio formats (WAV preferred, MP3 acceptable). |
| Speaker | One speaker only. No background voices or overlapping audio. |
| Background | Quiet environment. Minimal room reverb, no music, no street noise. |
| Recording quality | Use a decent microphone if possible. Phone recordings work; lossy compression artifacts hurt the clone. |
| Content | Natural conversational speech, varied prosody. Avoid monotone reading. |
| Language | Match the language(s) you'll generate in. |
A clean sample of natural speech outperforms a sample with background music every time.
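The duration requirement above can be checked before upload. The following is a minimal sketch using Python's standard `wave` module, which reads uncompressed PCM WAV files only. The function name `check_sample` is hypothetical, the 10 and 30 second bounds come from this article, and the 16 kHz sample-rate floor is an assumption used as a rough flag for phone-call-quality audio, not a documented Scenario limit. It cannot detect background noise, music, or extra speakers; those still need a listen.

```python
import sys
import wave

# Bounds taken from this article (about 10 s minimum, about 30 s maximum).
MIN_SECONDS = 10.0
MAX_SECONDS = 30.0
# Assumption: rates below 16 kHz usually indicate phone-call-quality audio.
MIN_SAMPLE_RATE = 16000


def check_sample(path: str) -> list[str]:
    """Return warnings for a PCM WAV file; an empty list means none were found."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        duration = wav.getnframes() / rate

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(f"too short: {duration:.1f}s, aim for 20 to 30 seconds")
    if duration > MAX_SECONDS:
        warnings.append(f"too long: {duration:.1f}s, trim to 20 to 30 seconds")
    if rate < MIN_SAMPLE_RATE:
        warnings.append(f"low sample rate: {rate} Hz, possibly phone-call audio")
    return warnings


if __name__ == "__main__":
    # Usage: python check_sample.py sample1.wav sample2.wav
    for sample_path in sys.argv[1:]:
        problems = check_sample(sample_path)
        print(sample_path, "OK" if not problems else "; ".join(problems))
```

An empty result only means the file passed these mechanical checks; sample quality is still judged by ear.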
Capturing a good sample
If you control the recording:
Record in a quiet room with soft furnishings (carpet, curtains): these dampen reverb.
Position the microphone 6 to 12 inches (about 15 to 30 cm) from the speaker, slightly off-axis to reduce plosives.
Have the speaker talk naturally (describing their day, telling a short story) rather than read a flat scripted sentence.
Aim for clear consonants and varied intonation. Monotone delivery produces a flat clone.
Record more than you need; trim to the cleanest 20 to 30 seconds.
If you're sourcing existing audio:
Look for clean studio interviews, podcast segments without music, or audiobook excerpts.
Avoid clips with background music, multiple speakers, phone-call audio, or heavy compression.
Trim to a contiguous 20 to 30 second segment of the speaker only.
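Trimming to a contiguous segment, as described above, can be done in any audio editor. As a rough sketch, it can also be scripted with Python's standard `wave` module (PCM WAV input only). The function name `trim_wav` and its parameters are hypothetical; choosing the start time still means listening for the cleanest stretch.

```python
import wave


def trim_wav(src: str, dst: str, start_s: float, duration_s: float = 30.0) -> float:
    """Copy one contiguous segment of src into dst; return its length in seconds."""
    if start_s < 0 or duration_s <= 0:
        raise ValueError("start_s must be >= 0 and duration_s must be > 0")

    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        start_frame = int(start_s * rate)
        if start_frame >= reader.getnframes():
            raise ValueError("start_s is past the end of the recording")
        reader.setpos(start_frame)
        # readframes returns fewer frames if the file ends early.
        frames = reader.readframes(int(duration_s * rate))

    with wave.open(dst, "wb") as writer:
        # Same channels, sample width, and rate as the source; the frame
        # count in the header is corrected when the file is closed.
        writer.setparams(params)
        writer.writeframes(frames)

    bytes_per_frame = params.sampwidth * params.nchannels
    return len(frames) / bytes_per_frame / rate
```

For example, `trim_wav("interview.wav", "sample.wav", start_s=42.0, duration_s=25.0)` would write a 25 second clip starting at 0:42, assuming the source file is at least that long.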
Training and using the clone
Upload your audio sample.
Confirm and start the clone. IVC training is fast (minutes, not hours).
Generate test outputs across a few prompts:
Short conversational lines
Longer paragraphs
Emotive deliveries (excited, somber, neutral)
Iterate on the sample if needed (see Common pitfalls below).

Evaluating the clone
Identity: does the generated voice sound like the reference speaker on a neutral phrase?
Stability: does it stay consistent across longer outputs, or drift mid-sentence?
Range: can it handle different emotional tones, or does it collapse to one mode?
Artifacts: any audible glitches, robotic patches, or unnatural pacing?
IVC is optimized for short, casual outputs. Expect quality to degrade on longer or more demanding lines: that's the trade-off you accept for speed.
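One way to keep track of the four checks across several test outputs is a simple tally. The sketch below is illustrative only: the `weakest_criteria` helper and the dictionary layout are hypothetical, and the pass/fail judgments are still made by listening.

```python
# The four checks from this section, in the order listed.
CRITERIA = ("identity", "stability", "range", "artifacts")


def weakest_criteria(ratings: list[dict[str, bool]]) -> list[str]:
    """Return failing criteria, most failures first; True in a rating means pass."""
    failures = {name: sum(1 for r in ratings if not r[name]) for name in CRITERIA}
    failing = [name for name in CRITERIA if failures[name] > 0]
    # sorted() is stable, so ties keep the order of CRITERIA.
    return sorted(failing, key=lambda name: failures[name], reverse=True)


# One entry per test output, filled in by ear.
ratings = [
    {"identity": True, "stability": True, "range": True, "artifacts": True},
    {"identity": True, "stability": False, "range": False, "artifacts": True},
    {"identity": True, "stability": False, "range": True, "artifacts": True},
]
print(weakest_criteria(ratings))  # ['stability', 'range']
```

A criterion that fails mainly on longer or emotive lines reflects the trade-off described above; one that fails on short neutral lines points back to the sample.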
Common pitfalls
Background noise in the sample. The clone learns the noise as part of the voice. Re-record in a quieter environment.
Music or other voices. Same problem, worse: the clone may produce hybrid output. Use clean speaker-only audio.
Heavy compression. Phone-call quality or aggressive MP3 compression bakes artifacts into the clone.
Monotone delivery in the sample. Limits the clone's expressive range. Capture varied intonation.
Sample too short. Under about 10 seconds is too thin for IVC to lock onto identity. Aim for 20 to 30 seconds.
Multiple speakers in the sample. IVC can only learn one. Trim to a single-speaker segment.
When to wait for Professional Voice Clone
If you need any of these, wait for the Pro option rather than shipping with IVC:
Long-form narration (audiobooks, documentaries, full episodes)
Voice consistency across multi-hour content
Wide emotional range or character acting
Commercial deliverables where minor artifacts aren't acceptable
Phoneme-perfect pronunciation across technical or domain-specific vocabulary