The Gemini Omni Family: The Essentials

Last updated: July 2, 2026

Gemini Omni is Google's breakthrough any-input video family on Scenario. Three sibling models cover the full generation and edit surface: create a clip from text or an image, restyle an existing clip in plain English, or drop a specific subject into a fresh scene while keeping its identity locked. All three produce 720p video with native audio (dialogue, ambience, and effects) generated in the same pass.

This article covers all three siblings with worked examples for each: Gemini Omni (text-to-video and image-to-video), Gemini Omni Flash Edit (video-to-video restyle), and Gemini Omni Flash Reference to Video (subject-consistent image-to-video with 1 to 3 references).


Which Model Should I Use?

Model | Input | Best for
Gemini Omni | Text prompt, or text plus one first-frame image | Free-form scenes, one-shot ads with baked voice-over, animating a single hero image, multi-shot cinematic sequences.
Gemini Omni Flash Edit | An existing video plus an edit instruction | Restyle season, palette, weather, genre, or a specific object without changing motion. Native audio is regenerated to match.
Gemini Omni Flash Reference to Video | 1 to 3 reference images, plus an optional prompt | Keep a character, product, or place identical across a new scene. Multi-character scenes and material-transfer effects.

All three pair naturally with Nano Banana 2 Lite upstream to produce the source image or reference. The pipeline stays fully in-house on Scenario.
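
The table above boils down to one question: what input do you already have? Here is a minimal Python sketch of that decision. The function name and arguments are hypothetical, and the returned strings are the model names used in this article, not confirmed API identifiers.

```python
# Model names as used in this article (not confirmed API identifiers).
GEMINI_OMNI = "Gemini Omni"
FLASH_EDIT = "Gemini Omni Flash Edit"
FLASH_R2V = "Gemini Omni Flash Reference to Video"


def pick_model(has_source_video: bool = False, reference_images: int = 0) -> str:
    """Return the sibling that matches the inputs you have on hand."""
    if has_source_video and reference_images:
        # Assumption: Edit takes a video and R2V takes images, so mixing
        # the two is treated as an error in this sketch.
        raise ValueError("choose a source video (Edit) or reference images (R2V), not both")
    if has_source_video:
        return FLASH_EDIT
    if reference_images:
        if not 1 <= reference_images <= 3:
            raise ValueError("Reference to Video takes 1 to 3 reference images")
        return FLASH_R2V
    # Text only, or text plus one first-frame image.
    return GEMINI_OMNI
```

A single image can be either a first frame (Gemini Omni) or an identity reference (Reference to Video). The sketch leaves that choice to the caller through reference_images.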


Gemini Omni: Text and Image to Video

The base model turns a text prompt (with optional first-frame image) into a 720p clip of 3 to 10 seconds, widescreen or vertical, with native audio in the same pass. Dialogue, ambience, score, and sound effects arrive together, so you skip the separate voice-over and sound-design steps.

How to Use Gemini Omni

Open the model page and write a prompt that describes the scene, the motion, and the sound. Two prompting habits that make a difference:

  1. Write the audio, not just the visuals. Put dialogue in quotes, name ambient sounds, and cue the score. "A confident warm-toned male voice-over says: 'Feel time.'" performs better than describing sound abstractly.

  2. Name the moment, not the setup. "The ranger raises her hand for silence. The tracker slowly raises his rifle" beats "two rangers alert in the woods".

For complex sequences, script the prompt like a shot list with a beat every 2 seconds. Gemini Omni packs a lot of narrative into 10 seconds when the prompt directs the camera and sound per beat.
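
If you assemble prompts programmatically, the shot-list habit can be sketched as a small Python helper that time-stamps each beat. The helper name and the sample beats are illustrative assumptions. The output is plain prompt text to paste into the Prompt field, not an API call.

```python
def build_shot_list(beats: list[str], duration: int = 10) -> str:
    """Format beats as evenly spaced, time-stamped lines of one prompt."""
    if not 3 <= duration <= 10:
        raise ValueError("duration must be 3 to 10 seconds")
    if not beats:
        raise ValueError("at least one beat is required")
    step = duration / len(beats)
    lines = []
    for i, beat in enumerate(beats):
        start, end = i * step, (i + 1) * step
        # ":g" drops trailing zeros, so 2.0 prints as "2".
        lines.append(f"{start:g} to {end:g}s: {beat}")
    return "\n".join(lines)


# Illustrative beats: each one names the camera, the moment, and the sound.
beats = [
    "Wide shot of a frosted pine forest at dawn, wind through the trees.",
    "The camera slowly pushes in on the ranger. She raises her hand for silence.",
    "Cut to the tracker. He slowly raises his rifle, breath fogging.",
    "A distant wolf howl. Both turn toward the sound.",
    "Whip pan to the treeline, low strings swell.",
]
print(build_shot_list(beats, duration=10))
```

Five beats over 10 seconds gives the 2-second rhythm described above; fewer beats simply produce longer windows.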

Parameters

Prompt

Scene, action, mood, and audio. Include quoted dialogue for any spoken line. Optional if you provide a first-frame image.

First Frame

Optional image to animate. The clip opens from that exact frame. Best paired with an image produced by Nano Banana 2 Lite or GPT Image 2 upstream.

Duration

3 to 10 seconds, default 8. Push to 10 for beats that need setup, turn, and land. Keep to 3 to 5 for punchy social loops.

Aspect Ratio

16:9 for widescreen, 9:16 for vertical social. There is no native 1:1, 4:5, or 21:9 output.
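
The constraints above can be checked before you queue a generation. This is a minimal sketch, assuming a hypothetical OmniSettings container: the field names are illustrative and do not come from a documented Scenario API, and whole-second durations are an assumption.

```python
from dataclasses import dataclass
from typing import Optional

ALLOWED_ASPECT_RATIOS = ("16:9", "9:16")


@dataclass
class OmniSettings:
    """Hypothetical container for the four Gemini Omni parameters."""
    aspect_ratio: str                   # no default is documented, so it is required here
    prompt: Optional[str] = None        # optional when a first frame is given
    first_frame: Optional[str] = None   # assumed to be an asset ID or file path
    duration: int = 8                   # documented default

    def validate(self) -> None:
        """Raise ValueError if the settings break a documented limit."""
        if not self.prompt and not self.first_frame:
            raise ValueError("provide a prompt, a first-frame image, or both")
        if not 3 <= self.duration <= 10:
            raise ValueError("duration must be 3 to 10 seconds")
        if self.aspect_ratio not in ALLOWED_ASPECT_RATIOS:
            raise ValueError("aspect ratio must be 16:9 or 9:16")


# A vertical social loop, kept short.
settings = OmniSettings(aspect_ratio="9:16", prompt="A barista pours latte art.", duration=4)
settings.validate()
```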

Examples: Gemini Omni

American thriller: undercover operatives in a North African market

Two operatives in linen shirts move through spice stalls, exchanging quiet English dialogue about a package. Vendors call out prices, kettles clink, distant call to prayer. Handheld cinema, warm afternoon light.

Settings: Text-to-video, 16:9, 10 seconds. Watch: Open on Scenario

Transparent smartwatch product spot

A fully transparent glass smartwatch rotates against matte black. Interior mechanisms and cyan UI pulse through the case. Warm male voice-over: "Introducing the watch you can see right through. Feel time." Subtle synth pad, mechanical whir.

Settings: Text-to-video, 16:9, 8 seconds. Watch: Open on Scenario

Fantasy dragon boss reveal

The colossal red-scaled ancient dragon rises further from a mountain fissure, wings unfurling to their full span with a leathery snap that echoes down the valley. It arches its long neck, inhales deeply with a resonant sucking sound, then unleashes a roaring wave of orange flame directly at the camera. Knights scream and scatter across the cliffside. Full symphonic orchestral score with brass and choir, dragon roar overlapping thunder, chunks of stone tumble past the frame.

Settings: Image-to-video from a Nano Banana 2 Lite dragon key art, 16:9, 10 seconds. Watch: Open on Scenario


Gemini Omni Flash Edit: Restyle a Video

Feed a clip and describe the change in plain English. The camera path, timing, and motion stay intact. Native audio is regenerated to match the new look.

How to Use Gemini Omni Flash Edit

Open the model page, upload the video, and write the edit instruction. There are no duration or aspect controls: the output inherits both from the source.

Four instruction shapes that work well:

  1. Object swap. "Replace the red car with a matte black vintage motorcycle. Keep the exact drift motion, twilight lighting, and camera path."

  2. Season, weather, or time-of-day. "Change the season to a heavy snowstorm." "Move the entire scene to late night with neon lighting."

  3. Full style transfer. "Restyle as hand-painted Studio Ghibli animation, watercolor backgrounds, cel-shaded characters."

  4. Film grade or medium. "Convert the look to 1970s Kodachrome: warm palette, visible grain, halation glow."

Always end the prompt with a preservation clause: "Keep the exact motion, timing, and camera path unchanged."
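
If edit prompts are built in a script, the preservation habit is easy to automate. A minimal sketch follows. The helper name is hypothetical, and the check for an existing clause (a final sentence starting with "Keep") is a deliberately crude assumption.

```python
PRESERVATION_CLAUSE = "Keep the exact motion, timing, and camera path unchanged."


def with_preservation_clause(instruction: str) -> str:
    """Append the preservation clause unless the prompt already ends with one."""
    text = instruction.strip()
    if not text:
        raise ValueError("the edit instruction is empty")
    # Crude check (assumption): a final sentence that starts with "Keep"
    # is treated as an existing preservation clause.
    last_sentence = text.rstrip(".").rsplit(". ", 1)[-1]
    if last_sentence.lower().startswith("keep "):
        return text
    if text[-1] not in ".!?":
        text += "."
    return f"{text} {PRESERVATION_CLAUSE}"


print(with_preservation_clause("Change the season to a heavy snowstorm."))
```

A prompt that already names what to keep, such as the object-swap example above, passes through unchanged.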

Parameters

Prompt

The change you want, in plain English. Long, specific edit prompts outperform vague ones.

Input Video

Any Scenario video asset works, including outputs from Seedance, Veo, Kling, and other Omni Flash siblings. Longer sources cost more and take longer.

Examples: Gemini Omni Flash Edit

Astronaut restyled as pen-and-ink noir comic

Prompt: "Restyle as a pen-and-ink noir comic panel: high-contrast pure black and white, thick expressive linework, halftone shading and Ben-Day dots for mid-tones, dramatic hatching for shadows, no colors at all. Keep the exact floating motion and space station corridor geometry identical."

Before: Original astronaut in space station. After: Noir comic astronaut

Full-scene voxel restyle: pub as chunky voxel art

Prompt: "Convert the entire pub scene into a 3D voxel-art aesthetic: characters, table, tankards, hearth flames, stone walls rendered as pixelated blocks retaining colors. Keep motion, laughter, table-slapping, and camera path identical."

Before: Photoreal pub. After: Voxel pub

Sci-fi cockpit to Studio Ghibli watercolor

Prompt: "Restyle as a Studio Ghibli hand-painted watercolor animation: cel-shaded characters with clean expressive outlines, painterly cloud and space textures, saturated but soft palette, hand-drawn light effects and dust motes, whimsical warm interior tones. Keep the exact motion, character, and cockpit geometry identical."

Before: Sci-fi cockpit video. After: Ghibli watercolor cockpit


Gemini Omni Flash Reference to Video: Subject-Consistent Video

Upload 1 to 3 reference images of the subject you want in the video, optionally describe the new scene, and the model renders a 720p clip with native audio where that subject holds identity from first to last frame.

How to Use Gemini Omni Flash Reference to Video (R2V)

Three reference patterns worth knowing:

  1. Single hero reference. One clean portrait or product shot locks identity in a single new scene. Fastest option.

  2. Multiple angles of the same subject. Two to three shots of the same character from different angles reduce drift when the new scene needs a different camera.

  3. Multiple distinct subjects. Character A plus character B (or subject plus material). The model places both in the same scene, or applies one to the other for material-transfer effects.

Parameters

Prompt

Optional. Describes the scene, action, and audio. Even without a prompt, the reference subjects appear in a generated context.

Reference Images

1 to 3 required. Order matters: reference them in the prompt as "the first image", "the second image", or by content.

Duration

3 to 10 seconds, default 8.

Aspect Ratio

16:9 or 9:16.
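
Because reference order matters, it can help to generate the "first image" and "second image" phrasing from the upload order. A minimal Python sketch, assuming a hypothetical helper and illustrative labels:

```python
ORDINALS = ("first", "second", "third")


def describe_references(labels: list[str]) -> list[str]:
    """Map ordered reference labels to phrases like 'the ranger from the first image'."""
    if not 1 <= len(labels) <= 3:
        raise ValueError("Reference to Video takes 1 to 3 reference images")
    return [f"{label} from the {ORDINALS[i]} image" for i, label in enumerate(labels)]


# Labels listed in the same order as the uploaded references.
ranger, tracker = describe_references(["the ranger", "the tracker"])
prompt = (
    f"Wide tracking shot at dawn: {ranger} and {tracker} ride horses side by side, "
    "hooves crunching frost, wind through pines."
)
print(prompt)
```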

Examples: Gemini Omni Flash R2V

Multi-image material transfer: rose becomes crystal

References: 2 refs, Rose subject and Crystal material.

The rose petals transform into translucent quartz facets, rainbow light dispersion, dew droplets freeze into diamonds.

Watch: Open on Scenario

Facial locking with 3 refs: architect interview

References: 3 refs of the same character from three angles: front, three-quarter, and profile.

Close-up interview shot, she speaks directly to camera, gentle key light, room-tone ambience.

Watch: Open on Scenario

Multi-character: ranger and tracker riding horses at dawn

References: 2 refs, Ranger and Tracker.

Wide tracking shot at dawn, hooves crunching frost, wind through pines, distant wolf howl.

Watch: Open on Scenario


Cross-Model Pipeline

The three siblings compose. A typical high-impact workflow chains them together with Nano Banana 2 Lite upstream:

  1. Nano Banana 2 Lite generates a hero image (character portrait, product hero, or scene still) from text or refs.

  2. Gemini Omni (base) animates that image into a 720p clip with native audio, using the image as the first frame.

  3. Gemini Omni Flash Edit produces stylistic or seasonal variants of that clip without regenerating the performance.

  4. Gemini Omni Flash R2V places the same character in additional scenes using the original hero image as a reference.

Result: a full multi-scene story with a locked-in cast, native audio throughout, generated end-to-end on Scenario in under ten minutes.
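
The workflow above is not a straight chain: step 4 goes back to the hero image from step 1 rather than the clip from step 3. A minimal Python sketch of that dependency structure follows. The artifact names are illustrative assumptions, and nothing here calls Scenario.

```python
from typing import NamedTuple


class Step(NamedTuple):
    model: str
    consumes: tuple[str, ...]  # artifacts this step needs
    produces: str              # artifact this step creates


PIPELINE = [
    Step("Nano Banana 2 Lite", ("text prompt",), "hero image"),
    Step("Gemini Omni", ("hero image",), "base clip"),
    Step("Gemini Omni Flash Edit", ("base clip",), "restyled clip"),
    Step("Gemini Omni Flash R2V", ("hero image",), "new-scene clip"),
]


def check_pipeline(steps, initial=("text prompt",)):
    """Return every artifact available at the end, or raise if a step runs too early."""
    available = set(initial)
    for step in steps:
        missing = [name for name in step.consumes if name not in available]
        if missing:
            raise ValueError(f"{step.model} is missing {missing}")
        available.add(step.produces)
    return available


print(sorted(check_pipeline(PIPELINE)))
```

Keeping the hero image as a named artifact makes it clear that both the animation step and the R2V step depend on it, so it is worth getting right before anything else runs.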


Tips for Better Results

  1. Script the audio in the prompt. Dialogue in quotes, ambient sounds by name, mood cues for score. Omni Flash sings when the audio is scripted and drifts when left to implication.

  2. For hero shots, feed a controlled first frame. Text-only can drift in composition. Producing the opening image in Nano Banana 2 Lite or GPT Image 2 first, then handing it to Omni, gives you a locked launch point.

  3. For edits, always end with a preservation clause. "Keep the exact motion, timing, and camera path unchanged." Otherwise the model may reinterpret the beat.

  4. For R2V, one clean reference beats three noisy ones. A single sharp portrait with the subject clearly visible locks identity better than three references with occlusion or motion blur.

  5. Describe camera moves as verbs, not presets. "The camera slowly pushes in" beats "35mm lens, shallow depth of field". Preset language often bakes into the frame as visible text or misfires as style.

  6. Multi-shot sequences work if you script them beat by beat. Break a 10-second clip into five 2-second beats in the prompt (0 to 2s, 2 to 4s, and so on) and describe what changes each time. Match-cuts and whip pans are honored when explicitly asked.

  7. Non-English dialogue may vary in accent. For non-English lines, name the language and accent explicitly in the prompt. If the accent still drifts, generate the line with a separate TTS model and add it in post, since none of the siblings accept audio input.


Known Limitations

  • 720p ceiling and 10-second maximum. Output tops out at 720p and 10 seconds per clip. Upscale downstream if you need higher resolution; concatenate multiple clips for longer beats.

  • Only 16:9 and 9:16. There is no native 1:1, 4:5, or 21:9 output.

  • Native audio is one pass. You cannot separately request an instrumental-only or dialogue-only track. If you need clean stems, add audio in post.

  • Edit inherits duration and aspect from the source. No override.

  • Multi-turn editing is not supported. Each Edit run is one-shot. To iterate, re-run with a revised prompt on the original source.

  • Audio input is not accepted. None of the siblings take audio references. The launch demos that used audio as a driver are not exposed on Scenario yet.

  • Video as motion reference is not accepted by R2V. R2V only accepts image references (1 to 3), not video.
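
Because each clip is limited to between 3 and 10 seconds, a longer sequence has to be planned as several clips and concatenated afterwards. A minimal sketch of the arithmetic in Python, assuming whole-second durations and a hypothetical helper name:

```python
MIN_CLIP = 3   # seconds, documented minimum
MAX_CLIP = 10  # seconds, documented maximum


def plan_clips(total_seconds: int) -> list[int]:
    """Split a target runtime into the fewest clips, each 3 to 10 seconds long."""
    if total_seconds < MIN_CLIP:
        raise ValueError("target is shorter than the minimum clip length")
    count = -(-total_seconds // MAX_CLIP)  # ceiling division
    base, extra = divmod(total_seconds, count)
    # Spread the remainder so clip lengths differ by at most one second.
    return [base + 1 if i < extra else base for i in range(count)]


print(plan_clips(25))  # [9, 8, 8]
```

Splitting evenly avoids a short leftover clip at the end, which keeps every segment long enough for a full beat.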


Use Cases

  • Advertising: product spots with native voice-over, brand vignettes, seasonal ad variants of an approved cut, market-specific restyles, hero social clips.

  • Games: in-game cinematics, character reveal trailers, marketing shorts with dialogue, episodic character content, restyled trailer variants (photoreal to anime, day to night).

  • Film and animation: pre-vis with sound, animatics with scratch dialogue, mood exploration on live-action plates, style testing on hero shots.

  • Marketing: repurposing existing clips for new campaigns without reshooting, spokesperson consistency across market variants, mascot in multiple contexts.

  • Education: narrated micro-lessons, historical reconstructions with ambient sound, recurring on-screen host across a lesson series.

  • Social media: vertical hero clips with dialogue and ambience baked in, one-pass content workflow.