The Gemini Omni Family: The Essentials

Last updated: July 8, 2026

Gemini Omni is Google's breakthrough any-input video family on Scenario. Three sibling models cover the full generation and edit surface: create a clip from text or an image, restyle an existing clip in plain English, or drop a specific subject into a fresh scene while keeping its identity locked. All three produce 720p video with native audio (dialogue, ambience, and effects) baked in a single pass.

This article covers all three siblings with worked examples for each: Gemini Omni (text-to-video and image-to-video), Gemini Omni Flash Edit (video-to-video restyle), and Gemini Omni Flash Reference to Video (subject-consistent image-to-video with 1 to 7 references).

Which Model Should I Use?

Model	Input	Best for
Gemini Omni	Text prompt, an optional first-frame image, and up to 7 reference images	Free-form scenes, one-shot ads with baked voice-over, animating a single hero image, multi-shot cinematic sequences, keeping specific subjects consistent
Gemini Omni Flash Edit	An existing video plus an edit instruction	Restyle season, palette, weather, genre, or a specific object without changing motion. Native audio is regenerated to match
Gemini Omni Flash Reference to Video	1 to 7 reference images, plus an optional prompt	Keep a character, product, or place identical across a new scene. Multi-character scenes and material-transfer effects

All three pair naturally with Nano Banana 2 Lite upstream to produce the source image or reference. The pipeline stays fully in-house on Scenario.

Gemini Omni: Text and Image to Video

The base model turns a text prompt (with optional first-frame image) into a 720p clip of 3 to 10 seconds, widescreen or vertical, with native audio in the same pass. Dialogue, ambience, score, and sound effects arrive together, so you skip the separate voice-over and sound-design steps.

It now also accepts up to 7 reference images. Add them to keep specific subjects (a character, a product, or a place) consistent in the generated clip, alongside or instead of a first-frame image.

How to Use Gemini Omni

Open the model page and write a prompt that describes the scene, the motion, and the sound. Two prompting habits that make a difference:

Write the audio, not just the visuals. Put dialogue in quotes, name ambient sounds, and cue the score. "A confident warm-toned male voice-over says: 'Feel time.'" performs better than describing sound abstractly.
Name the moment, not the setup. "The ranger raises her hand for silence. The tracker slowly raises his rifle" beats "two rangers alert in the woods".

For complex sequences, script the prompt like a shot list, with one beat every 2 seconds. Gemini Omni can pack a lot of narrative into 10 seconds when the prompt directs the camera and sound for each beat.
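One way to keep a shot-list prompt honest about timing is to assemble it from per-beat directions. The sketch below is only an illustration: the helper name build_shot_list_prompt, the (camera, action, sound) tuple layout, and the "0-2s:" timestamp style are assumptions made for this example, not a format the model requires.

```python
def build_shot_list_prompt(beats, duration=10, beat_length=2):
    """Join per-beat directions into one timestamped prompt.

    beats: list of (camera, action, sound) tuples, one per beat.
    The helper and its output format are hypothetical.
    """
    max_beats = -(-duration // beat_length)  # ceiling division
    if len(beats) > max_beats:
        raise ValueError(
            f"{len(beats)} beats do not fit in {duration}s "
            f"at {beat_length}s per beat"
        )
    lines = []
    for i, (camera, action, sound) in enumerate(beats):
        start = i * beat_length
        end = min(start + beat_length, duration)
        lines.append(f"{start}-{end}s: {camera}. {action}. Sound: {sound}.")
    return "\n".join(lines)


prompt = build_shot_list_prompt([
    ("Wide handheld shot", "The ranger raises her hand for silence",
     "wind through pines"),
    ("Close-up", "The tracker slowly raises his rifle",
     "a single twig snap"),
])
print(prompt)
```

Paste the resulting text into the prompt field. The length check simply stops a script from asking for more beats than the clip can hold.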

Parameters

Prompt

Scene, action, mood, and audio. Include quoted dialogue for any spoken line. Optional if you provide a first-frame image.

First Frame

Optional image to animate. The clip opens from that exact frame. Best paired with an image produced by Nano Banana 2 Lite or GPT Image 2 upstream.

Duration

3 to 10 seconds, default 8. Push to 10 for beats that need a setup, a turn, and a landing. Keep to 3 to 5 for punchy social loops.

Aspect Ratio

16:9 for widescreen, 9:16 for vertical social. 1:1, 4:5, and 21:9 are not supported natively.

Reference Images

Optional, up to 7. Images of the subjects you want to appear in the clip (a character, a product, or a location), used to hold identity across the shot. Combine with a first-frame image when you want both a locked opening frame and consistent subjects.
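The limits described in this Parameters section can be checked before a job is submitted. This is a minimal sketch under stated assumptions: OmniRequest, its field names, and validate are hypothetical and do not mirror any Scenario API. Only the numeric limits and the two aspect ratios come from this article.

```python
from dataclasses import dataclass, field
from typing import List, Optional

ALLOWED_ASPECT_RATIOS = ("16:9", "9:16")
MAX_REFERENCE_IMAGES = 7


@dataclass
class OmniRequest:
    """Hypothetical container for the Gemini Omni parameters."""
    prompt: Optional[str] = None
    first_frame: Optional[str] = None  # path or URL of the image to animate
    duration: int = 8                  # seconds, default from the article
    aspect_ratio: str = "16:9"
    reference_images: List[str] = field(default_factory=list)


def validate(req: OmniRequest) -> List[str]:
    """Return a list of problems; an empty list means the request fits."""
    errors = []
    # The prompt is optional only when a first-frame image is supplied.
    if not req.prompt and not req.first_frame:
        errors.append("provide a prompt or a first-frame image")
    if not 3 <= req.duration <= 10:
        errors.append("duration must be between 3 and 10 seconds")
    if req.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
        errors.append("aspect ratio must be 16:9 or 9:16")
    if len(req.reference_images) > MAX_REFERENCE_IMAGES:
        errors.append("at most 7 reference images are accepted")
    return errors


print(validate(OmniRequest(prompt="A dragon rises", duration=12)))
```

Returning every problem at once, rather than raising on the first, makes it easier to fix a request in one pass.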

Examples: Gemini Omni

American thriller: undercover operatives in a North African market

Two operatives in linen shirts move through spice stalls, exchanging quiet English dialogue about a package. Vendors call out prices, kettles clink, distant call to prayer. Handheld cinema, warm afternoon light.

Settings: Text-to-video, 16:9, 10 seconds. Watch: Open on Scenario

Transparent smartwatch product spot

A fully transparent glass smartwatch rotates against matte black. Interior mechanisms and cyan UI pulse through the case. Warm male voice-over: "Introducing the watch you can see right through. Feel time." Subtle synth pad, mechanical whir.

Settings: Text-to-video, 16:9, 8 seconds. Watch: Open on Scenario

Fantasy dragon boss reveal

The colossal red-scaled ancient dragon rises further from a mountain fissure, wings unfurling to their full span with a leathery snap that echoes down the valley. It arches its long neck, inhales deeply with a resonant sucking sound, then unleashes a roaring wave of orange flame directly at the camera. Knights scream and scatter across the cliffside. Full symphonic orchestral score with brass and choir, dragon roar overlapping thunder, chunks of stone tumble past the frame.

Settings: Image-to-video from a Nano Banana 2 Lite dragon key art, 16:9, 10 seconds. Watch: Open on Scenario

Gemini Omni Flash Edit: Restyle a Video

Feed a clip and describe the change in plain English. The camera path, timing, and motion stay intact. Native audio is regenerated to match the new look.

How to Use Gemini Omni Flash Edit

Open the model page, upload the video, and write the edit instruction. You can now also attach up to 5 reference images to steer the new subject or look. There are no duration or aspect controls: the output inherits both from the source.

Four instruction shapes that work well:

Object swap. "Replace the red car with a matte black vintage motorcycle. Keep the exact drift motion, twilight lighting, and camera path."
Season, weather, or time-of-day. "Change the season to a heavy snowstorm." "Move the entire scene to late night with neon lighting."
Full style transfer. "Restyle as hand-painted Studio Ghibli animation, watercolor backgrounds, cel-shaded characters."
Film grade or medium. "Convert the look to 1970s Kodachrome: warm palette, visible grain, halation glow."

Always end the prompt with a preservation clause: "Keep the exact motion, timing, and camera path unchanged."
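If edit instructions are generated by a script, the preservation clause can be appended automatically. The helper below is a hypothetical sketch. build_edit_instruction is not part of any Scenario tooling, and it only assembles the text of the instruction. The limit of 5 reference images is the one stated above for Flash Edit.

```python
PRESERVATION_CLAUSE = (
    "Keep the exact motion, timing, and camera path unchanged."
)
MAX_EDIT_REFERENCES = 5


def build_edit_instruction(change, reference_images=()):
    """Return the edit text, ending with the preservation clause.

    Hypothetical helper: it formats the instruction and checks the
    reference-image count, nothing more.
    """
    if len(reference_images) > MAX_EDIT_REFERENCES:
        raise ValueError("Flash Edit accepts at most 5 reference images")
    change = change.strip()
    if not change:
        raise ValueError("the edit instruction is empty")
    if change.endswith(PRESERVATION_CLAUSE):
        return change  # clause already present, leave the text alone
    if change[-1] not in ".!?":
        change += "."
    return f"{change} {PRESERVATION_CLAUSE}"


print(build_edit_instruction("Change the season to a heavy snowstorm"))
```

Calling the helper on text that already ends with the clause returns it unchanged, so it is safe to apply to every instruction.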