Lux TTS: 48 kHz Professional Voice Cloning

Last updated: April 9, 2026

Lux TTS sets a new quality bar for voice cloning on Scenario - delivering 48 kHz professional-grade speech from a reference audio clip and a text input, with precise control over how faithfully the model reproduces the original voice.

Introduction

Lux TTS is a voice cloning model built around one core promise: audio quality. While many TTS models optimize for speed, language coverage, or voice variety, Lux TTS focuses on fidelity - producing speech at 48 kHz that captures the true character of a voice and renders it with the kind of clarity and warmth expected in professional audio production.

The model uses flow-matching acoustic generation, a technique that progressively refines audio output through iterative inference steps, resulting in speech that is clean, natural, and free of the robotic artifacts common in lower-quality TTS systems. The voice is encoded from the first few seconds of the reference clip (see Max Reference Length below) - meaning even a short, well-recorded sample can be enough to clone the voice accurately for the text you provide.

Inside Scenario, Lux TTS integrates directly with the platform's asset system. Any audio asset already in your project - whether recorded externally, generated by ElevenLabs, or produced by another Scenario model - can serve as the reference voice.

Parameters and Settings

Prompt: Required. The text to synthesize, up to 10,000 characters. Write naturally with clear punctuation to guide pacing and delivery.
Reference Audio: Required. Any audio file or Scenario asset ID containing the voice to clone. A clean recording of 5 to 15 seconds works best.
Guidance Scale: Controls how strictly the model adheres to the reference voice. Default is 3, range is 0 to 10.
Inference Steps: Controls the number of flow-matching steps used in acoustic generation. Default is 4, maximum is 16. For final production, 8 or 12 is recommended.
Max Reference Length: Sets how many seconds of the reference audio are used for voice encoding. Default is 5 seconds, maximum is 15 seconds.
Seed: Optional. A number used to make outputs reproducible, so a result you like can be regenerated with the same inputs and settings.
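The ranges and defaults listed above can be checked before a generation is submitted. The following is a minimal sketch, not Scenario's API: the class name LuxTTSSettings, the field names, and the payload keys are hypothetical, and the lower bounds (at least 1 inference step, a reference length above 0 seconds) are assumptions, since only the maximums are documented.

```python
from dataclasses import dataclass
from typing import Any, Dict, Optional

MAX_PROMPT_CHARS = 10_000  # documented prompt limit


@dataclass(frozen=True)
class LuxTTSSettings:
    """Hypothetical container for the documented Lux TTS parameters."""

    prompt: str
    reference_audio: str  # file path or Scenario asset ID
    guidance_scale: float = 3.0        # documented default, range 0 to 10
    inference_steps: int = 4           # documented default, maximum 16
    max_reference_length: float = 5.0  # seconds; documented default, maximum 15
    seed: Optional[int] = None         # optional, for reproducible outputs

    def __post_init__(self) -> None:
        if not self.prompt or len(self.prompt) > MAX_PROMPT_CHARS:
            raise ValueError("prompt must be 1 to 10,000 characters")
        if not self.reference_audio:
            raise ValueError("reference_audio is required")
        if not 0 <= self.guidance_scale <= 10:
            raise ValueError("guidance_scale must be between 0 and 10")
        # Lower bound of 1 is an assumption; only the maximum is documented.
        if not 1 <= self.inference_steps <= 16:
            raise ValueError("inference_steps must be between 1 and 16")
        # Lower bound above 0 is an assumption; only the maximum is documented.
        if not 0 < self.max_reference_length <= 15:
            raise ValueError("max_reference_length must be in (0, 15] seconds")

    def to_payload(self) -> Dict[str, Any]:
        """Build a plain dict; the key names here are illustrative only."""
        payload: Dict[str, Any] = {
            "prompt": self.prompt,
            "reference_audio": self.reference_audio,
            "guidance_scale": self.guidance_scale,
            "inference_steps": self.inference_steps,
            "max_reference_length": self.max_reference_length,
        }
        if self.seed is not None:
            payload["seed"] = self.seed
        return payload


if __name__ == "__main__":
    settings = LuxTTSSettings(
        prompt="Welcome back. Your next mission starts at dawn.",
        reference_audio="my_reference_clip.wav",
        inference_steps=8,
    )
    print(settings.to_payload())
```

Validating locally like this catches out-of-range values, such as 17 inference steps, before any generation is attempted.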

Recommended Workflow

Lux TTS rewards careful input preparation. For the reference audio, prioritize clean recordings with minimal ambient noise. For the generation settings, start with Guidance Scale at 4 to 5 (slightly above the default of 3) and Inference Steps at 8 (double the default of 4) for a strong quality baseline. Once you find the right combination, lock it in with a Seed value.
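One way to search for the right combination is to hold the seed fixed and vary only Guidance Scale and Inference Steps, so that differences between takes come from the settings rather than from randomness. The sketch below only builds the list of candidate settings; the function name build_sweep, the dict keys, and the seed value 1234 are hypothetical, and submitting each candidate to Scenario is left out.

```python
from itertools import product
from typing import Dict, List, Sequence, Union

Number = Union[int, float]


def build_sweep(
    guidance_values: Sequence[float],
    step_values: Sequence[int],
    seed: int,
) -> List[Dict[str, Number]]:
    """Return every guidance/steps combination, all sharing one seed."""
    return [
        {"guidance_scale": guidance, "inference_steps": steps, "seed": seed}
        for guidance, steps in product(guidance_values, step_values)
    ]


if __name__ == "__main__":
    # Guidance 4 to 5 and steps 8 or 12 follow the recommendations in this article.
    candidates = build_sweep([4.0, 4.5, 5.0], [8, 12], seed=1234)
    for candidate in candidates:
        print(candidate)
```

With three guidance values and two step counts this yields six candidates to compare by ear.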

Audio Quality

Lux TTS outputs at 48 kHz, the professional broadcast standard used in film, television, and high-end game audio production. This means the generated speech is ready for direct use in professional post-production workflows without resampling or quality loss. The difference is most noticeable on high-quality headphones and studio monitors.
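To confirm that a downloaded file is still at 48 kHz before it enters a post-production session, the sample rate can be read from the file header. This is a minimal sketch using Python's standard wave module; it assumes the output has been saved as an uncompressed PCM WAV file (other formats need a different reader), and the file name lux_output.wav is a placeholder.

```python
import wave

EXPECTED_RATE_HZ = 48_000  # the output rate stated in this article


def wav_sample_rate(path: str) -> int:
    """Read the sample rate (in Hz) from a PCM WAV file header."""
    with wave.open(path, "rb") as wav_file:
        return wav_file.getframerate()


def is_48k(path: str) -> bool:
    """Return True when the file's sample rate is exactly 48 kHz."""
    return wav_sample_rate(path) == EXPECTED_RATE_HZ


if __name__ == "__main__":
    path = "lux_output.wav"  # placeholder file name
    try:
        rate = wav_sample_rate(path)
    except (FileNotFoundError, wave.Error) as exc:
        print(f"Could not read {path}: {exc}")
    else:
        status = "OK" if rate == EXPECTED_RATE_HZ else "needs resampling"
        print(f"{path}: {rate} Hz ({status})")
```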