Happy Horse by Alibaba Taotian Lab - The Essentials
Last updated: April 27, 2026

Happy Horse is a family of three video generation and editing models from Alibaba Taotian Lab, available on Scenario. Each model serves a distinct purpose: generating video from text or images, animating specific characters from reference photos, or transforming existing video into a new style or setting. This article covers all three models, their prompt conventions, and practical tips drawn from testing.
Overview
Happy Horse 1.0 is built on a unified 15-billion-parameter Transfusion architecture that handles video natively. All three models share the same core engine but expose different input surfaces: text and image prompts for generation, character reference images for animation, and source video for editing. The family ranks highly on independent benchmarks, scoring 1,357 Elo on Artificial Analysis, above comparable models in the same tier.
All three models output MP4 video at 720P or 1080P. Duration ranges from 3 to 15 seconds. Pricing is based on duration: 70 CU per second of output for T2V and R2V, and approximately 141 CU per second of source video for Video Editing. Always set duration and resolution explicitly; the default is 5 seconds at 1080P.
The Three Models
Model | Input | Best for | Cost (1080P) |
|---|---|---|---|
T2V / I2V | Text prompt + optional first frame image | Cinematic scenes, nature, environments, lifestyle | 70 CU per second (700 CU for 10s) |
R2V | 1 to 9 character reference images + animation prompt | Character animation, game trailers, animated storytelling | 70 CU per second (700 CU for 10s) |
Video Edit | Source video + style or scene prompt | Style transfer, scene transformation, world replacement | ~141 CU per second of source (1,126 CU for 8s source) |
Model 1: Happy Horse T2V / I2V
Happy Horse T2V generates video from a text prompt. When you add an image using the image parameter, the model uses that image as the first frame and animates forward from it (I2V mode). Both modes share the same parameter set.

Parameters
Parameter | Default | Options | Description |
|---|---|---|---|
prompt | Required | Text string | Scene description. Describe what is in the scene, the atmosphere, and the visual style. Keep it under 200 words. Do not write camera directions as instructions. |
image | None | Asset ID | Optional first frame for I2V mode. The model animates the scene starting from this image. Use character renders, concept art, or architectural images as starting points. |
aspectRatio | 16:9 | 16:9, 9:16, 1:1, 4:3, 3:4 | Output video aspect ratio. Use 9:16 for social/vertical, 16:9 for cinematic, 1:1 for social square. |
resolution | 1080P | 720P, 1080P | Output resolution. 1080P is recommended for all production use. |
duration | 5 | 3 to 15 (seconds) | Output video length in seconds. Always set this explicitly. Use 10 to 15 seconds for cinematic work. |
seed | Random | Any integer | Set for reproducibility. Same seed with same prompt produces similar output. |
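For orientation, here is a minimal sketch of submitting a T2V job over HTTP. The base URL, route, and auth header are illustrative assumptions, not the documented Scenario API; only the payload fields come from the parameter table above.

```python
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: bearer-token auth
BASE_URL = "https://api.example.com/v1"  # hypothetical base URL, not the real endpoint

payload = {
    "prompt": (
        "Luxury treehouse spiraling around a giant redwood, glass balconies "
        "with tropical ferns, warm golden interior lights, morning mist, "
        "golden hour, cinematic, photorealistic"
    ),
    "aspectRatio": "16:9",
    "resolution": "1080P",
    "duration": 10,   # always set explicitly; the default is 5 seconds
    "seed": 42,       # fix the seed for reproducibility
}

resp = requests.post(
    f"{BASE_URL}/happy-horse/t2v",       # hypothetical route
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # job reference to poll; processing takes roughly 3 to 8 minutes
```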
How to Write T2V Prompts
Happy Horse T2V reads your prompt as a scene description, not a set of camera instructions. The model infers camera movement from the visual context you describe. Writing explicit camera directions as commands (such as "the camera begins at ground level and rises") can cause the model to fail with a processing error. Instead, describe what the scene looks like and the motion will be inferred.
Prompt structure:
[Subject + action], [environment], [lighting], [atmosphere], [visual style]
Works well:
Luxury treehouse spiraling around a giant redwood, glass balconies
with tropical ferns, warm golden interior lights, morning mist
drifting through pine forest, small waterfall in background,
golden hour, cinematic, photorealistic
Causes processing errors (too long, camera as instruction):
A breathtaking multi-story luxury treehouse spiraling around an ancient
giant redwood, circular glass balconies wrapped in lush ferns and tropical
plants, warm golden interior lights glowing through floor-to-ceiling windows,
a slow cinematic drone shot begins at ground level in misty forest and rises
steadily upward along the trunk, revealing each level one by one...
The key difference: the working version describes the scene. The failing version gives the model a shooting script. Keep prompts to 2 to 4 short sentences or a compact list of descriptors. The model handles camera movement, pacing, and framing on its own.
I2V Mode: Using a First Frame
In I2V mode, the model uses your uploaded image as the opening frame and generates the remainder of the video from that starting point. The image anchors the visual style, character, and environment. The prompt then describes how the scene should evolve.
I2V is especially useful when you have high-quality concept art, a character render, or an architectural visualization that you want to set in motion. The output preserves the visual fidelity of the source image better than a pure text description can.
I2V prompt structure:
[What the character or subject does], [how the environment responds], [atmosphere]
Example I2V prompts:
The character advances slowly toward camera, fireflies and mist
swirling around her, enchanted forest, soft magical light
The building slowly reveals as fog lifts, birds take flight
from the balconies, golden morning light fills the scene
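Switching to I2V is a one-field change under the same assumptions as the T2V sketch above: add the image parameter with the asset ID of your first frame (the ID below is a placeholder).

```python
# Same request as the T2V sketch, with an uploaded first frame added.
payload = {
    "prompt": (
        "The character advances slowly toward camera, fireflies and mist "
        "swirling around her, enchanted forest, soft magical light"
    ),
    "image": "asset_first_frame",  # placeholder asset ID; anchors style and subject
    "aspectRatio": "16:9",
    "resolution": "1080P",
    "duration": 10,
}
```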
T2V Best Practices
Set duration to 10 or 15 seconds. The default is 5 seconds. For any cinematic or production use, set duration explicitly to 10 or 15. Pricing is 70 CU per second, so 10s costs 700 CU and 15s costs 1,050 CU at 1080P.
Describe the scene, not the shot. Never write "camera rises from ground level" or "slow tracking shot begins." Instead write "misty forest at dawn, sunlight breaking through redwood canopy." The model infers the shot from the scene.
Keep prompts short. Two to four compact sentences work better than a long paragraph. Overly long prompts cause processing failures.
Use I2V for visual fidelity. When you have a specific visual reference (character art, architectural image, scene render), use I2V to anchor the output to that image. T2V from text alone will generate its own interpretation.
Vary aspect ratios deliberately. Use 9:16 for social-first content, 16:9 for cinematic, 1:1 for Instagram. The model handles each aspect ratio well.
Model 2: Happy Horse R2V (Reference to Video)
Happy Horse R2V animates specific characters from reference images. You provide between 1 and 9 character images, and the model generates a video in which each referenced character performs the action you describe in the prompt. This makes R2V the right choice for game character animation, brand mascot animation, and multi-character storytelling where character consistency across frames is the priority.

Parameters
Parameter | Default | Options | Description |
|---|---|---|---|
prompt | Required | Text string | Animation instructions for each character, referenced as character1 through character9. Each character should get its own sentence describing what it does. |
referenceImages | Required | 1 to 9 asset IDs | The character reference images. Each image corresponds to a character number in order: the first image is character1, the second is character2, and so on. |
aspectRatio | 16:9 | 16:9, 9:16, 1:1, 4:3, 3:4 | Output video aspect ratio. |
resolution | 1080P | 720P, 1080P | Output resolution. |
duration | 5 | 3 to 15 (seconds) | Output video length in seconds. Set to 10 or 15 for meaningful character animation sequences. |
seed | Random | Any integer | Set for reproducibility. |
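Structurally, an R2V payload differs from T2V only in the referenceImages array, whose order defines the character numbering. A hedged sketch (asset IDs are placeholders; the request mechanics follow the earlier T2V example):

```python
# referenceImages[0] is character1, referenceImages[1] is character2, and so on.
payload = {
    "referenceImages": ["asset_knight", "asset_mage"],
    "prompt": (
        "character1 walks forward confidently, swinging arms naturally with each step. "
        "character2 crouches low into a combat-ready stance, fists raised."
    ),
    "aspectRatio": "16:9",
    "resolution": "1080P",
    "duration": 15,   # give each character time to complete its movement
}
```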
How to Write R2V Prompts
R2V prompts do not follow a single fixed format. Each character is referenced by its position number (character1 through character9, matching the order of the referenceImages array). Beyond that, the prompt can take several forms depending on what you are trying to achieve: individual actions, a shared scene, per-character scenes, or a mixed approach. The key is that every description is motion-first and body-specific, telling the model exactly what each character is physically doing.
Format 1: Individual actions (showcase reel)
Each character gets its own sentence describing a distinct movement. Best for animation demos, character ability showcases, and game trailers where you want to see each character perform independently.
character1 walks forward confidently, swinging arms naturally with
each step, looking straight ahead in a casual strut.
character2 crouches low into a combat-ready stance, fists raised,
shifting weight from side to side as if preparing to fight.
character3 stretches both arms wide open to the sides and lets out
a powerful roar, head tilted back with mouth open.
character4 stomps forward with heavy footsteps, shoulders rolling,
chest puffed out in a slow and intimidating march.
character5 spins around quickly on one foot, arms extended outward,
then lands in a firm two-footed stance facing the camera.
character6 draws a weapon from the hip, steps back into a defensive
pose, eyes locked forward with focused intensity.
character7 leaps into the air with arms spread wide, cape or clothing
flowing, landing gracefully in a heroic pose.
character8 waves both arms overhead enthusiastically, bouncing slightly
on the heels, with an energetic and joyful movement.
character9 tips the hat with one hand, winks, and leans back slightly
with a relaxed and charming cowboy swagger.
Format 2: Shared scene (all characters together)
All characters appear in the same environment and interact within a single narrative moment. Best for group shots, party scenes, ensemble reveals, and moments where the relationship between characters matters.
character1, character2, and character3 stand together on a battlefield
at dawn, side by side. character1 raises a sword toward the horizon.
character2 crosses arms and surveys the enemy line with calm authority.
character3 kneels and plants a banner into the ground behind them.
character1 and character2 face each other across a wooden dojo.
They bow simultaneously, then character1 lunges forward with a kick
while character2 sidesteps and counters with a sweeping arm block.
Format 3: Per-character scenes (different contexts)
Each character appears in its own distinct environment or situation within the same video. Best for character introduction sequences, brand ensemble reveals, and trailers where you want to show each character in their natural world.
character1 rides a horse across a windswept highland meadow at full
gallop, cloak trailing in the wind.
character2 sits alone at a cluttered workshop desk, adjusting a small
mechanical device with focused precision, candlelight flickering.
character3 stands waist-deep in a rushing river, arms raised, calling
out to someone on the far bank.
Format 4: Mixed — shared scene with individual moments
Characters share a scene but each gets a specific role or action within it. This is the most expressive format for storytelling and cinematic use.
character1 and character2 walk through a neon-lit marketplace at night.
character1 stops at a food stall and points with excitement at the menu.
character2 keeps walking, scanning the crowd with a cautious expression,
hand resting near a concealed weapon.
Regardless of format, keep descriptions physical and specific. Avoid abstract qualities like "acts heroically" or "looks threatening." Write what the body is doing: where the arms are, what the legs are doing, what the face is doing, what direction the character is moving.
What Makes a Good R2V Reference Image
The quality of the reference image directly determines how well the model can animate the character. These guidelines produce the best results:
Full body visible. The model needs to see the complete character from head to feet. Portrait crops, face close-ups, or images where the lower body is cut off produce poor animation because the model cannot infer the character's proportions and limb structure.
Neutral or T-pose stance. Characters standing in a neutral upright position, arms slightly away from the body, give the model a clear baseline to animate from. Characters already mid-action in the reference image can produce inconsistent results.
Plain or neutral background. White, gray, or simple backgrounds help the model isolate the character. Busy backgrounds can bleed into the animation.
Clear, clean art style. High-contrast, well-defined character designs animate more reliably than blurry, heavily stylized, or painterly images. 3D character renders, clean concept art, and flat illustration styles all work well.
Minimum 400 pixels on the shorter side, maximum 10 MB per image. Images below this threshold may not provide enough detail for the model to work with.
Reference images that work well include 3D game character renders, Pixar-style CG characters, clean anime illustrations, and comic book or flat-design characters with strong silhouettes. Portrait photography, face-only close-ups, and heavily painterly references work poorly.
Multi-Character R2V
When using multiple characters, each image in the referenceImages array maps directly to its character number. The first image is character1, the second is character2, and so on. You do not need to use all 9 slots. Two to four characters is often more controllable than nine.
Each character animates somewhat independently based on its instruction. The model does not guarantee that characters will interact with each other or appear in the same frame simultaneously in the way a scripted scene would. Think of R2V as generating per-character animation clips rather than a multi-actor scene.
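When assembling multi-character jobs programmatically, it helps to keep each image paired with its instruction in one place so the characterN numbering can never drift out of sync with the array order. A minimal illustrative helper (the function name and structure are my own, not part of the API):

```python
def build_r2v_payload(characters, duration=10, aspect_ratio="16:9"):
    """Build an R2V payload from (asset_id, action) pairs.

    Prompt numbering follows array position: the first pair becomes
    character1, the second character2, and so on.
    """
    if not 1 <= len(characters) <= 9:
        raise ValueError("R2V accepts 1 to 9 reference images")
    prompt = " ".join(
        f"character{i} {action}" for i, (_, action) in enumerate(characters, start=1)
    )
    return {
        "referenceImages": [asset_id for asset_id, _ in characters],
        "prompt": prompt,
        "aspectRatio": aspect_ratio,
        "resolution": "1080P",
        "duration": duration,
    }

payload = build_r2v_payload([
    ("asset_knight", "raises a sword toward the horizon."),
    ("asset_mage", "kneels and plants a banner into the ground."),
])
```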
R2V Best Practices
Use full-body images, not portrait crops. This is the single most important factor. If the reference only shows a face or upper body, the animation will be limited and often poor.
Describe specific movements, not qualities. Write "raises right arm above head and brings it down in a chopping motion" rather than "attacks aggressively." The model animates what you describe physically, not conceptually.
Set duration to at least 10 seconds. Short clips of 3 to 5 seconds give each character very little time to complete its animation. At 10 to 15 seconds, movements have time to develop and look intentional.
Do not reuse the same reference images across jobs. Reusing one set of character references across multiple jobs makes every output look like it comes from the same set; use a different set per job to keep results varied.
Use characters from different visual styles. Mixing a realistic human, an anime character, a cartoon animal, and a sci-fi robot in the same job produces more interesting multi-character animation than using four characters from the same art style.
Test with one character before batching. Run a single-character job first to verify the reference image works and the animation matches the prompt. Then scale to multi-character jobs.
Model 3: Happy Horse Video Edit
Happy Horse Video Edit takes an existing video as input and transforms it according to a text prompt. The source video provides the motion, timing, and structure of the scene. The prompt tells the model what visual world, style, or atmosphere to apply to that motion. The result is a new video that follows the movement of the original but looks completely different.

Parameters
Parameter | Default | Options | Description |
|---|---|---|---|
video | Required | Asset ID | The source video to edit. Must be a Scenario asset ID. Sources longer than 15 seconds are truncated to 15 seconds. |
prompt | Required | Text string | The visual transformation to apply. Describe the target style, world, or atmosphere. You can reference uploaded images using @Image1 through @Image5 if referenceImages are provided. |
referenceImages | None | Up to 5 asset IDs | Optional style or character reference images. Reference them in the prompt as @Image1, @Image2, etc. to guide the visual output toward a specific look. |
resolution | 1080P | 720P, 1080P | Output resolution. The output matches the duration of the source video. |
audioSetting | auto | auto, origin | Audio handling. "origin" keeps the original audio from the source video. "auto" lets the model decide. |
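A sketch of a Video Edit payload under the same hypothetical-endpoint assumptions as the earlier examples, showing the @Image1 style anchor and audio handling (asset IDs are placeholders):

```python
payload = {
    "video": "asset_source_clip",             # truncated at 15 seconds if longer
    "prompt": (
        "Classic 1940s film noir in the style of @Image1, black and white, "
        "high contrast shadows, venetian blind light patterns"
    ),
    "referenceImages": ["asset_noir_still"],  # referenced as @Image1 in the prompt
    "resolution": "1080P",
    "audioSetting": "origin",                 # keep the source audio track
}
```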
How to Write Video Edit Prompts
Video Edit prompts describe the visual transformation to apply. The source video handles all motion. Your prompt does not need to describe what is happening in the scene. It only needs to describe what the output should look like visually.
Prompt structure:
[Target world or style], [key visual elements], [lighting and color palette], [atmosphere]
Works well:
Neon-soaked cyberpunk city, holographic graffiti on rain-slicked
walls, electric blue and purple color palette, flying vehicles overhead
Classic 1940s film noir, black and white, high contrast shadows,
venetian blind light patterns, detective thriller atmosphere
Lush high fantasy world, floating islands in background, ancient ruins
with glowing runes, purple and gold magical sky
The model preserves the motion from the source video and applies the described visual world to it. A character walking in the source video will still walk in the output, but through a cyberpunk alley instead of wherever they were in the original.
Video Edit Cost
Video Edit is priced by the length of the source video rather than a flat per-job rate. The cost is approximately 141 CU per second of source video: an 8-second source costs 1,126 CU (confirmed) and a 5-second source about 705 CU. This is roughly double the per-second cost of T2V and R2V. Choose shorter source videos when cost is a concern.
Video Edit Best Practices
Source video quality determines output quality. A well-lit, clear, high-quality source produces better edits than a dark, blurry, or compressed source. The model cannot add detail that is not present in the motion data.
Use source videos with clear, readable motion. Simple, steady motion (a character walking, a camera pan across a scene) transforms more reliably than fast cuts, heavy motion blur, or chaotic handheld footage.
Describe the target world, not the action. The action comes from the source. Your prompt only needs to define the visual style, setting, and color palette.
Use audioSetting origin to preserve source audio. If the source video has meaningful audio (music, dialogue, ambient sound), set audioSetting to "origin" to keep it. The default "auto" may replace or alter audio.
Keep source videos under 15 seconds. Sources longer than 15 seconds are truncated. Plan your source clips accordingly.
Use referenceImages for style anchoring. If you want the output to match a specific visual style or color scheme, upload a style reference image and use @Image1 in the prompt to link it. This helps the model match a particular look more precisely.
Avoid firing more than 3 Video Edit jobs simultaneously. Video Edit is computationally expensive and subject to rate limits. Fire jobs in small batches of 2 to 3 at a time.
Shared Parameters and Pricing
All costs below are for 1080P output. 720P pricing may differ. T2V and R2V charge by output duration. Video Edit charges by source video duration (output matches source length). R2V cost is the same regardless of how many character references are provided.
Duration | T2V cost (1080P) | R2V cost (1080P, any number of characters) | Video Edit cost (1080P, source duration) |
|---|---|---|---|
3 seconds | 210 CU | 210 CU | 423 CU |
5 seconds | 350 CU | 350 CU | 705 CU |
8 seconds | 560 CU | 560 CU | 1,126 CU (confirmed) |
10 seconds | 700 CU | 700 CU | 1,410 CU |
15 seconds | 1,050 CU | 1,050 CU | 2,115 CU |
The rate is 70 CU per second for T2V and R2V, and approximately 141 CU per second for Video Edit (the confirmed 8-second figure of 1,126 CU comes in slightly under that rate).
All three models default to 5 seconds and 1080P. Always set duration explicitly. For production output, use 10 to 15 seconds. Processing time ranges from 3 to 8 minutes depending on server load and duration.
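The table reduces to simple arithmetic. A minimal sketch, treating Video Edit as approximately 141 CU per second:

```python
def estimate_cost_cu(model: str, seconds: int) -> int:
    """Estimate 1080P cost in CU. 'seconds' is output duration for
    t2v/r2v and source duration for video-edit."""
    if not 3 <= seconds <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    rates = {"t2v": 70, "r2v": 70, "video-edit": 141}  # CU/s; video-edit is approximate
    return rates[model] * seconds

print(estimate_cost_cu("t2v", 10))        # 700
print(estimate_cost_cu("video-edit", 8))  # 1128 (observed billing: 1,126)
```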
Rate Limits
Happy Horse enforces strict rate limits across all three models. Firing more than 10 jobs in a short window triggers a rate limit error that pauses new job submission for approximately 15 to 20 minutes. Video Edit is particularly sensitive. Keep concurrent job counts to 10 or fewer per burst, and avoid mixing large T2V, R2V, and Video Edit batches in the same window.
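One way to stay under these limits is to submit in small bursts with a pause between them. The sketch below is a pattern, not an official client: submit_job stands in for any single-job wrapper like the request snippets above, and the defaults reflect the guidance here (bursts well under 10 jobs, 2 to 3 for Video Edit).

```python
import time

def submit_in_bursts(payloads, submit_job, burst_size=3, pause_seconds=60):
    """Submit jobs in small bursts to avoid the platform-wide rate limit.

    submit_job is any callable that fires one job and returns a job reference.
    """
    job_ids = []
    for i in range(0, len(payloads), burst_size):
        for payload in payloads[i : i + burst_size]:
            job_ids.append(submit_job(payload))
        if i + burst_size < len(payloads):
            time.sleep(pause_seconds)  # breathing room between bursts
    return job_ids
```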
Known Limitations
Prompt length limit on T2V. Prompts over approximately 200 words cause a "could not process" error. Keep prompts short and descriptive. Remove camera direction language.
R2V requires full-body reference images. Portrait crops and face-only images produce poor animation. Full body on a plain background is required for reliable results.
Video Edit costs scale with source length. An 8-second source costs over 1,100 CU. Plan source video length carefully before firing.
Characters do not reliably interact in R2V multi-character jobs. The model animates each character from its instruction but does not guarantee coordinated multi-character scenes. Use multi-character prompts as animation previews, not scripted scenes.
Rate limits are strict and platform-wide. A burst of 15 to 20 simultaneous jobs will trigger a 15 to 20 minute cooldown across all Happy Horse models.
Video Edit source truncated at 15 seconds. Longer sources are cut. There is no warning before the job starts.
No last-frame control in I2V mode. The image parameter anchors the first frame and the prompt drives what happens next; there is no separate first-frame vs. last-frame control.
No LoRA training or fine-tuning is available. Happy Horse models are third-party models on Scenario. Custom training is not supported.
Use Cases
Game trailers and character showcases (R2V): Upload full-body renders of your game characters and animate them with movement descriptions. Generate idle animations, combat stances, and emotes from a single reference image, and use the clips directly in trailers.
Cinematic scene generation (T2V): Generate epic establishing shots, environmental transitions, and cinematic b-roll from text descriptions. Use for storyboarding, pre-visualization, and social content.
Animated character storytelling (R2V + T2V): Combine R2V for character animation with T2V for environmental scenes to produce short animated sequences. Animate a mascot, a brand character, or a game hero with a specific set of movements and expressions.
Style transfer and world replacement (Video Edit): Take existing footage (a product shot, a fashion walk, a lifestyle clip) and transport it to a different visual world without re-shooting. Apply film noir, cyberpunk, fantasy, or watercolor styles to real or AI-generated source video.
Architecture and environment visualization (T2V + I2V): Animate architectural renders, interior design concepts, and landscape images. Use I2V to start from a high-quality visualization and bring it to life with camera motion and environmental animation such as moving water, shifting light, and drifting mist.