Train a Voice Clone (IVC)

Last updated: June 5, 2026

Voice training on Scenario lets you generate audio in the voice of a reference speaker. Today the platform offers Instant Voice Cloning (IVC): a fast clone built from a short audio sample, suitable for experimentation and prototyping. A higher-fidelity Professional Voice Clone is on the roadmap and will appear in the picker when available.

This article covers what IVC is, what it's good for, and how to produce a sample that yields a usable clone.

What IVC is, and what it isn't

IVC produces a voice clone from a single short audio sample. It's optimized for speed and iteration, not production-grade voice work.

Use IVC for	Don't use IVC for
Quick experimentation and prototyping	Final voice-over for shipped content
Demos, internal previews, mockups	Long-form narration where consistency matters
Testing voice direction before committing to a pro recording	Commercial use cases requiring high fidelity or robustness
Generating placeholder voice for animatics or game prototypes	Cases where the clone must match the reference under varied emotional ranges or pacing

For production voice work, wait for Professional Voice Clone (status: coming soon).

Selecting voice training in the picker

In Train > New Model, scroll to the Voice Training section and select Instant Voice Cloning (IVC). Professional Voice Clone will be available there once it ships.

Audio sample requirements

IVC builds the clone from a short audio sample that meets the requirements below. Sample quality is the primary driver of clone quality.

Requirement	Detail
Duration	Under about 30 seconds. Longer samples don't improve IVC results.
Format	Standard audio formats (WAV preferred, MP3 acceptable).
Speaker	One speaker only. No background voices or overlapping audio.
Background	Quiet environment. Minimal room reverb, no music, no street noise.
Recording quality	Use a decent microphone if possible. Recordings made on a phone work; phone-call audio and heavily compressed files do not, because compression artifacts carry over into the clone.
Content	Natural conversational speech, varied prosody. Avoid monotone reading.
Language	Match the language(s) you'll generate in.

A clean sample of natural speech outperforms a sample with background music every time.
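
The duration guidance above can be checked before upload. The following is a minimal sketch, assuming the sample is an uncompressed PCM WAV file (Python's standard `wave` module does not read MP3). The function name, the report fields, and the exact thresholds are illustrative assumptions drawn from this article, not part of Scenario's tooling.

```python
import wave
from dataclasses import dataclass

# Thresholds taken from this article; treat them as rough guidance, not hard limits.
MIN_SECONDS = 10.0  # below this the sample is likely too thin (see Common pitfalls)
MAX_SECONDS = 30.0  # longer samples don't improve IVC results


@dataclass
class SampleReport:
    duration_s: float
    channels: int
    sample_rate_hz: int
    warnings: list


def check_wav_sample(path: str) -> SampleReport:
    """Read a PCM WAV header and flag durations outside the suggested range."""
    with wave.open(path, "rb") as wf:
        frames = wf.getnframes()
        rate = wf.getframerate()
        channels = wf.getnchannels()

    duration = frames / rate if rate else 0.0
    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(f"sample is {duration:.1f}s; under about {MIN_SECONDS:.0f}s is too short")
    elif duration > MAX_SECONDS:
        warnings.append(f"sample is {duration:.1f}s; consider trimming to about {MAX_SECONDS:.0f}s")
    return SampleReport(duration, channels, rate, warnings)
```

A check like this only covers what the file header reveals. Whether the sample contains one speaker, natural prosody, and a quiet background still has to be judged by ear.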

Capturing a good sample

If you control the recording:

Record in a quiet room with soft furnishings (carpet, curtains): these dampen reverb.
Position the microphone 6 to 12 inches (about 15 to 30 cm) from the speaker, slightly off-axis to reduce plosives.
Have the speaker talk naturally (describing their day, telling a short story) rather than read a flat scripted sentence.
Aim for clear consonants and varied intonation. Monotone delivery produces a flat clone.
Record more than you need; trim to the cleanest 20 to 30 seconds.

If you're sourcing existing audio:

Look for clean studio interviews, podcast segments without music, or audiobook excerpts.
Avoid clips with background music, multiple speakers, phone-call audio, or heavy compression.
Trim to a contiguous 20 to 30 second segment of the speaker only.
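
Trimming to a contiguous segment can be done in any audio editor. For batch work, here is a minimal sketch using only Python's standard `wave` module, assuming uncompressed PCM WAV input. The function name `trim_wav` and its parameters are hypothetical, and choosing where the cleanest window starts is left to the listener.

```python
import wave


def trim_wav(src: str, dst: str, start_s: float, length_s: float = 25.0) -> float:
    """Copy one contiguous window of src into dst and return the seconds written."""
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        frame_bytes = reader.getsampwidth() * reader.getnchannels()
        # Clamp the start position so it always falls inside the file.
        start_frame = max(0, min(int(start_s * rate), reader.getnframes()))
        reader.setpos(start_frame)
        # readframes returns fewer frames if the window runs past the end.
        frames = reader.readframes(int(length_s * rate))

    with wave.open(dst, "wb") as writer:
        writer.setparams(params)  # the frame count in the header is corrected on close
        writer.writeframes(frames)

    return (len(frames) // frame_bytes) / rate
```

For example, `trim_wav("interview.wav", "sample.wav", start_s=42.0)` would write a 25-second window starting at 0:42, with both file names being placeholders.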

Training and using the clone

1. Upload your audio sample.
2. Confirm and start the clone. IVC training is fast (minutes, not hours).
3. Generate test outputs across a few prompts:
   - Short conversational lines
   - Longer paragraphs
   - Emotive deliveries (excited, somber, neutral)
4. Iterate on the sample if needed (see Common pitfalls below).

Evaluating the clone

Identity: does the generated voice sound like the reference speaker on a neutral phrase?
Stability: does it stay consistent across longer outputs, or drift mid-sentence?
Range: can it handle different emotional tones, or does it collapse to one mode?
Artifacts: any audible glitches, robotic patches, or unnatural pacing?

IVC is optimized for short, casual outputs. Expect quality to degrade on longer or more demanding lines: that's the trade-off you accept for speed.
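
When comparing several samples or clones, it can help to record the four checks above in a consistent form. The following is a small sketch of one way to do that. The 1 to 5 scale, the class name, and the method names are assumptions for illustration; the scores themselves still come from listening.

```python
from dataclasses import dataclass, field

# The four criteria come from the checklist above; the 1-5 scale is an assumption.
CRITERIA = ("identity", "stability", "range", "artifacts")


@dataclass
class CloneReview:
    prompt_label: str
    scores: dict = field(default_factory=dict)

    def rate(self, criterion: str, score: int) -> None:
        """Record a listener score from 1 (poor) to 5 (good) for one criterion."""
        if criterion not in CRITERIA:
            raise ValueError(f"unknown criterion: {criterion}")
        if not 1 <= score <= 5:
            raise ValueError("score must be between 1 and 5")
        self.scores[criterion] = score

    def missing(self) -> list:
        """Criteria that have not been rated yet."""
        return [c for c in CRITERIA if c not in self.scores]

    def weakest(self) -> list:
        """Criteria sharing the lowest recorded score."""
        if not self.scores:
            return []
        low = min(self.scores.values())
        return [c for c in CRITERIA if self.scores.get(c) == low]
```

Keeping one review per test prompt (short line, long paragraph, emotive delivery) makes it easier to see whether a weakness is tied to the sample or to the kind of output requested.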

Common pitfalls

Background noise in the sample. The clone learns the noise as part of the voice. Re-record in a quieter environment.
Music or other voices. Same problem, worse: the clone may produce hybrid output. Use clean speaker-only audio.
Heavy compression. Phone-call quality or aggressive MP3 compression bakes artifacts into the clone.
Monotone delivery in the sample. Limits the clone's expressive range. Capture varied intonation.
Sample too short. Under about 10 seconds is too thin for IVC to lock onto identity. Aim for 20 to 30 seconds.
Multiple speakers in the sample. IVC can only learn one. Trim to a single-speaker segment.
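
Background noise is easier to catch before training than after. The sketch below estimates a rough noise floor by measuring the RMS level of short windows and comparing the quietest windows (pauses between words) with the loudest ones (speech). It assumes 16-bit PCM WAV input; the window length and the percentile choices are illustrative assumptions, and this article does not define a numeric threshold for "quiet enough".

```python
import array
import math
import sys
import wave


def window_rms_dbfs(path: str, window_s: float = 0.05) -> list:
    """Return the RMS level of each window, in dBFS, for a 16-bit PCM WAV file."""
    with wave.open(path, "rb") as wf:
        if wf.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        rate, channels = wf.getframerate(), wf.getnchannels()
        samples = array.array("h", wf.readframes(wf.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV sample data is little-endian

    step = max(1, int(rate * window_s)) * channels  # interleaved samples per window
    levels = []
    for i in range(0, len(samples) - step + 1, step):
        chunk = samples[i:i + step]
        rms = math.sqrt(sum(s * s for s in chunk) / step)
        # Clamp to 1 so digital silence maps to a finite level instead of -inf.
        levels.append(20 * math.log10(max(rms, 1.0) / 32768.0))
    return levels


def noise_floor_and_speech(levels: list) -> tuple:
    """Approximate noise floor (10th percentile) and speech level (90th percentile)."""
    if not levels:
        raise ValueError("no complete windows to analyze")
    ordered = sorted(levels)
    low = ordered[int(0.10 * (len(ordered) - 1))]
    high = ordered[int(0.90 * (len(ordered) - 1))]
    return low, high
```

A small gap between the two values suggests the pauses are nearly as loud as the speech, which usually points to noise, music, or room reverb in the recording. Treat the numbers as a prompt to listen again rather than as a pass or fail test.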

When to wait for Professional Voice Clone

If you need any of these, wait for the Pro option rather than shipping with IVC:

Long-form narration (audiobooks, documentaries, full episodes)
Voice consistency across multi-hour content
Wide emotional range or character acting
Commercial deliverables where minor artifacts aren't acceptable
Phoneme-perfect pronunciation across technical or domain-specific vocabulary