Train a Voice Clone (IVC)

Last updated: May 18, 2026


Voice training on Scenario lets you generate audio in the voice of a reference speaker. Today the platform offers Instant Voice Cloning (IVC): a fast clone built from a short audio sample, suitable for experimentation and prototyping. A higher-fidelity Professional Voice Clone is on the roadmap and will appear in the picker when available.

This article covers what IVC is, what it's good for, and how to produce a sample that yields a usable clone.


What IVC is, and what it isn't

IVC produces a voice clone from a single short audio sample. It's optimized for speed and iteration, not production-grade voice work.

Use IVC for:

  • Quick experimentation and prototyping

  • Demos, internal previews, mockups

  • Testing voice direction before committing to a pro recording

  • Generating placeholder voice for animatics or game prototypes

Don't use IVC for:

  • Final voice-over for shipped content

  • Long-form narration where consistency matters

  • Commercial use cases requiring high fidelity or robustness

  • Cases where the clone must match the reference under varied emotional ranges or pacing

For production voice work, wait for Professional Voice Clone (status: coming soon).


Selecting voice training in the picker

In Train > New Model, scroll to the Voice Training section and select Instant Voice Cloning (IVC). Professional Voice Clone will be available there once it ships.

Audio sample requirements

IVC works from a single short audio sample. Quality of the sample is the primary driver of clone quality.

  • Duration: Under about 30 seconds. Longer samples don't improve IVC results.

  • Format: Standard audio formats (WAV preferred, MP3 acceptable).

  • Speaker: One speaker only. No background voices or overlapping audio.

  • Background: Quiet environment. Minimal room reverb, no music, no street noise.

  • Recording quality: Use a decent microphone if possible. Phone recordings work; lossy compression artifacts hurt the clone.

  • Content: Natural conversational speech, varied prosody. Avoid monotone reading.

  • Language: Match the language(s) you'll generate in.

A clean sample of natural speech outperforms a sample with background music every time.
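The duration guidance can be checked before upload. The following is a minimal sketch, not part of Scenario's tooling: the function name `check_sample` and the two constants are hypothetical, and the 10 and 30 second thresholds are simply the figures quoted in this article. It uses Python's standard `wave` module, so it only reads uncompressed PCM WAV files, and it cannot judge background noise, reverb, or the number of speakers.

```python
import wave

# Thresholds taken from this article; adjust them if the guidance changes.
MIN_SECONDS = 10.0  # below this the sample is too thin (see Common pitfalls)
MAX_SECONDS = 30.0  # longer samples don't improve IVC results


def check_sample(path: str) -> list[str]:
    """Return human-readable duration warnings for a WAV file (empty if none)."""
    with wave.open(path, "rb") as wav:
        frames = wav.getnframes()
        rate = wav.getframerate()
    duration = frames / rate

    warnings = []
    if duration < MIN_SECONDS:
        warnings.append(f"too short: {duration:.1f}s (under {MIN_SECONDS:.0f}s)")
    elif duration > MAX_SECONDS:
        warnings.append(f"longer than needed: {duration:.1f}s (over {MAX_SECONDS:.0f}s)")
    return warnings
```

An empty list means the length is in range; it says nothing about how clean the recording is, which still needs a listen.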


Capturing a good sample

If you control the recording:

  • Record in a quiet room with soft furnishings (carpet, curtains): these dampen reverb.

  • Position the microphone 6 to 12 inches (about 15 to 30 cm) from the speaker, slightly off-axis to reduce plosives.

  • Have the speaker talk naturally (describing their day, telling a short story) rather than read a flat scripted sentence.

  • Aim for clear consonants and varied intonation. Monotone delivery produces a flat clone.

  • Record more than you need; trim to the cleanest 20 to 30 seconds.

If you're sourcing existing audio:

  • Look for clean studio interviews, podcast segments without music, or audiobook excerpts.

  • Avoid clips with background music, multiple speakers, phone-call audio, or heavy compression.

  • Trim to a contiguous 20 to 30 second segment of the speaker only.
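Trimming a longer recording to one contiguous segment can be done in any audio editor. For scripted workflows, here is a minimal sketch using Python's standard `wave` module. The helper name `trim_wav` and its parameters are hypothetical, it handles uncompressed PCM WAV only, and choosing where the cleanest segment starts is still a listening task.

```python
import wave


def trim_wav(src: str, dst: str, start_s: float, length_s: float) -> float:
    """Copy length_s seconds of src, starting at start_s, into dst.

    Returns the duration actually written, which is shorter than length_s
    if the source ends first.
    """
    with wave.open(src, "rb") as reader:
        params = reader.getparams()
        rate = reader.getframerate()
        # Clamp the start so a too-late offset yields an empty clip, not an error.
        start_frame = min(int(start_s * rate), reader.getnframes())
        reader.setpos(start_frame)
        frames = reader.readframes(int(length_s * rate))

    with wave.open(dst, "wb") as writer:
        # Reuse the source's channel count, sample width and sample rate;
        # the frame count in the header is corrected when the data is written.
        writer.setparams(params)
        writer.writeframes(frames)

    bytes_per_frame = params.sampwidth * params.nchannels
    return len(frames) / bytes_per_frame / rate
```

For example, `trim_wav("interview.wav", "sample.wav", start_s=42.0, length_s=25.0)` would write a 25-second clip starting 42 seconds in; both file names are placeholders.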


Training and using the clone

  1. Upload your audio sample.

  2. Confirm and start the clone. IVC training is fast (minutes, not hours).

  3. Generate test outputs across a few prompts:

    • Short conversational lines

    • Longer paragraphs

    • Emotive deliveries (excited, somber, neutral)

  4. Iterate on the sample if needed (see Common pitfalls below).


Evaluating the clone

  • Identity: does the generated voice sound like the reference speaker on a neutral phrase?

  • Stability: does it stay consistent across longer outputs, or drift mid-sentence?

  • Range: can it handle different emotional tones, or does it collapse to one mode?

  • Artifacts: any audible glitches, robotic patches, or unnatural pacing?

IVC is optimized for short, casual outputs. Expect quality to degrade on longer or more demanding lines: that's the trade-off you accept for speed.
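When comparing several samples or clones, it can help to record the four checks above per test output instead of relying on memory. The following is a minimal sketch of one way to do that. The names `Review`, `CRITERIA`, and `failing_criteria` are hypothetical, and the pass/fail judgments come from a person listening, not from any automated measure.

```python
from dataclasses import dataclass

# The four checks from the list above. True means the check passed
# (for "artifacts": none were audible).
CRITERIA = ("identity", "stability", "range", "artifacts")


@dataclass
class Review:
    prompt_kind: str         # e.g. "short", "long", "emotive"
    passed: dict[str, bool]  # one entry per criterion


def failing_criteria(reviews: list[Review]) -> dict[str, list[str]]:
    """Map each criterion to the prompt kinds where it failed."""
    failures: dict[str, list[str]] = {name: [] for name in CRITERIA}
    for review in reviews:
        for name in CRITERIA:
            # A missing entry counts as a failure so nothing is skipped silently.
            if not review.passed.get(name, False):
                failures[name].append(review.prompt_kind)
    return {name: kinds for name, kinds in failures.items() if kinds}


reviews = [
    Review("short", {"identity": True, "stability": True, "range": True, "artifacts": True}),
    Review("long", {"identity": True, "stability": False, "range": True, "artifacts": False}),
]
print(failing_criteria(reviews))  # {'stability': ['long'], 'artifacts': ['long']}
```

Failures that show up only on long or emotive prompts match the trade-off described above; an identity failure on short prompts more likely points back to the sample itself.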


Common pitfalls

  • Background noise in the sample. The clone learns the noise as part of the voice. Re-record in a quieter environment.

  • Music or other voices. Same problem, worse: the clone may produce hybrid output. Use clean speaker-only audio.

  • Heavy compression. Phone-call quality or aggressive MP3 compression bakes artifacts into the clone.

  • Monotone delivery in the sample. Limits the clone's expressive range. Capture varied intonation.

  • Sample too short. Under about 10 seconds is too thin for IVC to lock onto identity. Aim for 20 to 30 seconds.

  • Multiple speakers in the sample. IVC can only learn one. Trim to a single-speaker segment.


When to wait for Professional Voice Clone

If you need any of these, wait for the Pro option rather than shipping with IVC:

  • Long-form narration (audiobooks, documentaries, full episodes)

  • Voice consistency across multi-hour content

  • Wide emotional range or character acting

  • Commercial deliverables where minor artifacts aren't acceptable

  • Phoneme-perfect pronunciation across technical or domain-specific vocabulary