Happy Horse by Alibaba Taotian Lab - The Essentials
Last updated: April 27, 2026

Happy Horse is a family of three video generation and editing models from Alibaba Taotian Lab, available on Scenario. Each model serves a distinct purpose: generating video from text or images, animating specific characters from reference photos, or transforming existing video into a new style or setting. This article covers all three models, their prompt conventions, and practical tips drawn from testing.
Overview
Happy Horse 1.0 is built on a unified 15-billion-parameter Transfusion architecture that handles video natively. All three models share the same core engine but expose different input surfaces: text and image prompts for generation, character reference images for animation, and source video for editing. The family ranks highly on independent benchmarks, scoring 1,357 Elo on Artificial Analysis, above comparable models in the same tier.
All three models output MP4 video at 720P or 1080P. Duration ranges from 3 to 15 seconds. Pricing is based on duration: 70 CU per second of output for T2V and R2V, and approximately 141 CU per second of source video for Video Editing. Always set duration and resolution explicitly; the default is 5 seconds at 1080P.
The Three Models
Model | Input | Best for | Cost (1080P) |
|---|---|---|---|
T2V / I2V | Text prompt + optional first frame image | Cinematic scenes, nature, environments, lifestyle | 70 CU per second (700 CU for 10s) |
R2V | 1 to 9 character reference images + animation prompt | Character animation, game trailers, animated storytelling | 70 CU per second (700 CU for 10s) |
Video Edit | Source video + style or scene prompt | Style transfer, scene transformation, world replacement | ~141 CU per second of source (1,126 CU for 8s source) |
Model 1: Happy Horse T2V / I2V
Happy Horse T2V generates video from a text prompt. When you add an image using the image parameter, the model uses that image as the first frame and animates forward from it (I2V mode). Both modes share the same parameter set.

Parameters
Parameter | Default | Options | Description |
|---|---|---|---|
prompt | Required | Text string | Scene description. Describe what is in the scene, the atmosphere, and the visual style. Keep it under 200 words. Do not write camera directions as instructions. |
image | None | Asset ID | Optional first frame for I2V mode. The model animates the scene starting from this image. Use character renders, concept art, or architectural images as starting points. |
aspectRatio | 16:9 | 16:9, 9:16, 1:1, 4:3, 3:4 | Output video aspect ratio. Use 9:16 for social/vertical, 16:9 for cinematic, 1:1 for social square. |
resolution | 1080P | 720P, 1080P | Output resolution. 1080P is recommended for all production use. |
duration | 5 | 3 to 15 (seconds) | Output video length in seconds. Always set this explicitly. Use 10 to 15 seconds for cinematic work. |
seed | Random | Any integer | Set for reproducibility. Same seed with same prompt produces similar output. |
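For orientation, here is a minimal sketch of submitting a T2V job over HTTP. The base URL, route, and auth header are illustrative assumptions, not the documented Scenario API; only the payload fields come from the parameter table above.

```python
import requests

API_KEY = "YOUR_API_KEY"                 # assumption: bearer-token auth
BASE_URL = "https://api.example.com/v1"  # hypothetical base URL, not the real endpoint

payload = {
    "prompt": (
        "Luxury treehouse spiraling around a giant redwood, glass balconies "
        "with tropical ferns, warm golden interior lights, morning mist, "
        "golden hour, cinematic, photorealistic"
    ),
    "aspectRatio": "16:9",
    "resolution": "1080P",
    "duration": 10,   # always set explicitly; the default is 5 seconds
    "seed": 42,       # fix the seed for reproducibility
}

resp = requests.post(
    f"{BASE_URL}/happy-horse/t2v",       # hypothetical route
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
    timeout=60,
)
resp.raise_for_status()
print(resp.json())  # job reference to poll; processing takes roughly 3 to 8 minutes
```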
How to Write T2V Prompts
Happy Horse T2V reads your prompt as a scene description, not a set of camera instructions. The model infers camera movement from the visual context you describe. Writing explicit camera directions as commands (such as "the camera begins at ground level and rises") can cause the model to fail with a processing error. Instead, describe what the scene looks like and the motion will be inferred.
Prompt structure:
[Subject + action], [environment], [lighting], [atmosphere], [visual style]
Works well:
Luxury treehouse spiraling around a giant redwood, glass balconies
with tropical ferns, warm golden interior lights, morning mist
drifting through pine forest, small waterfall in background,
golden hour, cinematic, photorealistic
Causes processing errors (too long, camera as instruction):
A breathtaking multi-story luxury treehouse spiraling around an ancient
giant redwood, circular glass balconies wrapped in lush ferns and tropical
plants, warm golden interior lights glowing through floor-to-ceiling windows,
a slow cinematic drone shot begins at ground level in misty forest and rises
steadily upward along the trunk, revealing each level one by one...
The key difference: the working version describes the scene. The failing version gives the model a shooting script. Keep prompts to 2 to 4 short sentences or a compact list of descriptors. The model handles camera movement, pacing, and framing on its own.
I2V Mode: Using a First Frame
In I2V mode, the model uses your uploaded image as the opening frame and generates the remainder of the video from that starting point. The image anchors the visual style, character, and environment. The prompt then describes how the scene should evolve.
I2V is especially useful when you have high-quality concept art, a character render, or an architectural visualization that you want to set in motion. The output preserves the visual fidelity of the source image better than a pure text description can.
I2V prompt structure:
[What the character or subject does], [how the environment responds], [atmosphere]
Example I2V prompts:
The character advances slowly toward camera, fireflies and mist
swirling around her, enchanted forest, soft magical light
The building slowly reveals as fog lifts, birds take flight
from the balconies, golden morning light fills the scene
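Switching to I2V is a one-field change under the same assumptions as the T2V sketch above: add the image parameter with the asset ID of your first frame (the ID below is a placeholder).

```python
# Same request as the T2V sketch, with an uploaded first frame added.
payload = {
    "prompt": (
        "The character advances slowly toward camera, fireflies and mist "
        "swirling around her, enchanted forest, soft magical light"
    ),
    "image": "asset_first_frame",  # placeholder asset ID; anchors style and subject
    "aspectRatio": "16:9",
    "resolution": "1080P",
    "duration": 10,
}
```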
T2V Best Practices
Set duration to 10 or 15 seconds. The default is 5 seconds. For any cinematic or production use, set duration explicitly to 10 or 15. Pricing is 70 CU per second, so 10s costs 700 CU and 15s costs 1,050 CU at 1080P.
Describe the scene, not the shot. Never write "camera rises from ground level" or "slow tracking shot begins." Instead write "misty forest at dawn, sunlight breaking through redwood canopy." The model infers the shot from the scene.
Keep prompts short. Two to four compact sentences work better than a long paragraph. Overly long prompts cause processing failures.
Use I2V for visual fidelity. When you have a specific visual reference (character art, architectural image, scene render), use I2V to anchor the output to that image. T2V from text alone will generate its own interpretation.
Vary aspect ratios deliberately. Use 9:16 for social-first content, 16:9 for cinematic, 1:1 for Instagram. The model handles each aspect ratio well.
Model 2: Happy Horse R2V (Reference to Video)
Happy Horse R2V animates specific characters from reference images. You provide between 1 and 9 character images, and the model generates a video in which each referenced character performs the action you describe in the prompt. This makes R2V the right choice for game character animation, brand mascot animation, and multi-character storytelling where character consistency across frames is the priority.

Parameters
Parameter | Default | Options | Description |
|---|---|---|---|
prompt | Required | Text string | Animation instructions for each character, referenced as character1 through character9. Each character should get its own sentence describing what it does. |
referenceImages | Required | 1 to 9 asset IDs | The character reference images. Each image corresponds to a character number in order: the first image is character1, the second is character2, and so on. |
aspectRatio | 16:9 | 16:9, 9:16, 1:1, 4:3, 3:4 | Output video aspect ratio. |
resolution | 1080P | 720P, 1080P | Output resolution. |
duration | 5 | 3 to 15 (seconds) | Output video length in seconds. Set to 10 or 15 for meaningful character animation sequences. |
seed | Random | Any integer | Set for reproducibility. |
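Structurally, an R2V payload differs from T2V only in the referenceImages array, whose order defines the character numbering. A hedged sketch (asset IDs are placeholders; the request mechanics follow the earlier T2V example):

```python
# referenceImages[0] is character1, referenceImages[1] is character2, and so on.
payload = {
    "referenceImages": ["asset_knight", "asset_mage"],
    "prompt": (
        "character1 walks forward confidently, swinging arms naturally with each step. "
        "character2 crouches low into a combat-ready stance, fists raised."
    ),
    "aspectRatio": "16:9",
    "resolution": "1080P",
    "duration": 15,   # give each character time to complete its movement
}
```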
How to Write R2V Prompts
R2V prompts do not follow a single fixed format. Each character is referenced by its position number (character1 through character9, matching the order of the referenceImages array). Beyond that, the prompt can take several forms depending on what you are trying to achieve: individual actions, a shared scene, per-character scenes, or a mixed approach. The key is that every description is motion-first and body-specific, telling the model exactly what each character is physically doing.
Format 1: Individual actions (showcase reel)
Each character gets its own sentence describing a distinct movement. Best for animation demos, character ability showcases, and game trailers where you want to see each character perform independently.
character1 walks forward confidently, swinging arms naturally with
each step, looking straight ahead in a casual strut.
character2 crouches low into a combat-ready stance, fists raised,
shifting weight from side to side as if preparing to fight.
character3 stretches both arms wide open to the sides and lets out
a powerful roar, head tilted back with mouth open.
character4 stomps forward with heavy footsteps, shoulders rolling,
chest puffed out in a slow and intimidating march.
character5 spins around quickly on one foot, arms extended outward,
then lands in a firm two-footed stance facing the camera.
character6 draws a weapon from the hip, steps back into a defensive
pose, eyes locked forward with focused intensity.
character7 leaps into the air with arms spread wide, cape or clothing
flowing, landing gracefully in a heroic pose.
character8 waves both arms overhead enthusiastically, bouncing slightly
on the heels, with an energetic and joyful movement.
character9 tips the hat with one hand, winks, and leans back slightly
with a relaxed and charming cowboy swagger.
Format 2: Shared scene (all characters together)
All characters appear in the same environment and interact within a single narrative moment. Best for group shots, party scenes, ensemble reveals, and moments where the relationship between characters matters.
character1, character2, and character3 stand together on a battlefield
at dawn, side by side. character1 raises a sword toward the horizon.
character2 crosses arms and surveys the enemy line with calm authority.
character3 kneels and plants a banner into the ground behind them.
character1 and character2 face each other across a wooden dojo.
They bow simultaneously, then character1 lunges forward with a kick
while character2 sidesteps and counters with a sweeping arm block.
Format 3: Per-character scenes (different contexts)
Each character appears in its own distinct environment or situation within the same video. Best for character introduction sequences, brand ensemble reveals, and trailers where you want to show each character in their natural world.
character1 rides a horse across a windswept highland meadow at full
gallop, cloak trailing in the wind.
character2 sits alone at a cluttered workshop desk, adjusting a small
mechanical device with focused precision, candlelight flickering.
character3 stands waist-deep in a rushing river, arms raised, calling
out to someone on the far bank.
Format 4: Mixed — shared scene with individual moments
Characters share a scene but each gets a specific role or action within it. This is the most expressive format for storytelling and cinematic use.
character1 and character2 walk through a neon-lit marketplace at night.
character1 stops at a food stall and points with excitement at the menu.
character2 keeps walking, scanning the crowd with a cautious expression,
hand resting near a concealed weapon.
Regardless of format, keep descriptions physical and specific. Avoid abstract qualities like "acts heroically" or "looks threatening." Write what the body is doing: where the arms are, what the legs are doing, what the face is doing, what direction the character is moving.
What Makes a Good R2V Reference Image
The quality of the reference image directly determines how well the model can animate the character. These guidelines produce the best results:
Full body visible. The model needs to see the complete character from head to feet. Portrait crops, face close-ups, or images where the lower body is cut off produce poor animation because the model cannot infer the character's proportions and limb structure.
Neutral or T-pose stance. Characters standing in a neutral upright position, arms slightly away from the body, give the model a clear baseline to animate from. Characters already mid-action in the reference image can produce inconsistent results.
Plain or neutral background. White, gray, or simple backgrounds help the model isolate the character. Busy backgrounds can bleed into the animation.
Clear, clean art style. High-contrast, well-defined character designs animate more reliably than blurry, heavily stylized, or painterly images. 3D character renders, clean concept art, and flat illustration styles all work well.
Minimum 400 pixels on the shorter side, maximum 10 MB per image. Images below this threshold may not provide enough detail for the model to work with.
Reference images that work well include 3D game character renders, Pixar-style CG characters, clean anime illustrations, and comic book or flat-design characters with strong silhouettes. Portrait photography, face-only close-ups, and heavily painterly references work poorly.
Multi-Character R2V
When using multiple characters, each image in the referenceImages array maps directly to its character number. The first image is character1, the second is character2, and so on. You do not need to use all 9 slots. Two to four characters is often more controllable than nine.
Each character animates somewhat independently based on its instruction. The model does not guarantee that characters will interact with each other or appear in the same frame simultaneously in the way a scripted scene would. Think of R2V as generating per-character animation clips rather than a multi-actor scene.
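When assembling multi-character jobs programmatically, it helps to keep each image paired with its instruction in one place so the characterN numbering can never drift out of sync with the array order. A minimal illustrative helper (the function name and structure are my own, not part of the API):

```python
def build_r2v_payload(characters, duration=10, aspect_ratio="16:9"):
    """Build an R2V payload from (asset_id, action) pairs.

    Prompt numbering follows array position: the first pair becomes
    character1, the second character2, and so on.
    """
    if not 1 <= len(characters) <= 9:
        raise ValueError("R2V accepts 1 to 9 reference images")
    prompt = " ".join(
        f"character{i} {action}" for i, (_, action) in enumerate(characters, start=1)
    )
    return {
        "referenceImages": [asset_id for asset_id, _ in characters],
        "prompt": prompt,
        "aspectRatio": aspect_ratio,
        "resolution": "1080P",
        "duration": duration,
    }

payload = build_r2v_payload([
    ("asset_knight", "raises a sword toward the horizon."),
    ("asset_mage", "kneels and plants a banner into the ground."),
])
```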
R2V Best Practices
Use full-body images, not portrait crops. This is the single most important factor. If the reference only shows a face or upper body, the animation will be limited and often poor.
Describe specific movements, not qualities. Write "raises right arm above head and brings it down in a chopping motion" rather than "attacks aggressively." The model animates what you describe physically, not conceptually.
Set duration to at least 10 seconds. Short clips of 3 to 5 seconds give each character very little time to complete its animation. At 10 to 15 seconds, movements have time to develop and look intentional.
Do not reuse the same reference images across jobs. Reusing one set of character references across multiple jobs makes every output look like it comes from the same set; use a different set per job to keep results varied.
Use characters from different visual styles. Mixing a realistic human, an anime character, a cartoon animal, and a sci-fi robot in the same job produces more interesting multi-character animation than using four characters from the same art style.
Test with one character before batching. Run a single-character job first to verify the reference image works and the animation matches the prompt. Then scale to multi-character jobs.
Model 3: Happy Horse Video Edit
Happy Horse Video Edit takes an existing video as input and transforms it according to a text prompt. The source video provides the motion, timing, and structure of the scene. The prompt tells the model what visual world, style, or atmosphere to apply to that motion. The result is a new video that follows the movement of the original but looks completely different.

Parameters
Parameter | Default | Options | Description |
|---|---|---|---|
video | Required | Asset ID | The source video to edit. Must be a Scenario asset ID. Sources longer than 15 seconds are truncated to 15 seconds. |
prompt | Required | Text string | The visual transformation to apply. Describe the target style, world, or atmosphere. You can reference uploaded images using @Image1 through @Image5 if referenceImages are provided. |
referenceImages | None | Up to 5 asset IDs | Optional style or character reference images. Reference them in the prompt as @Image1, @Image2, etc. to guide the visual output toward a specific look. |
resolution | 1080P | 720P, 1080P | Output resolution. The output matches the duration of the source video. |
audioSetting | auto | auto, origin | Audio handling. "origin" keeps the original audio from the source video. "auto" lets the model decide. |
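A sketch of a Video Edit payload under the same hypothetical-endpoint assumptions as the earlier examples, showing the @Image1 style anchor and audio handling (asset IDs are placeholders):

```python
payload = {
    "video": "asset_source_clip",             # truncated at 15 seconds if longer
    "prompt": (
        "Classic 1940s film noir in the style of @Image1, black and white, "
        "high contrast shadows, venetian blind light patterns"
    ),
    "referenceImages": ["asset_noir_still"],  # referenced as @Image1 in the prompt
    "resolution": "1080P",
    "audioSetting": "origin",                 # keep the source audio track
}
```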
How to Write Video Edit Prompts
Video Edit prompts describe the visual transformation to apply. The source video handles all motion. Your prompt does not need to describe what is happening in the scene. It only needs to describe what the output should look like visually.
Prompt structure:
[Target world or style], [key visual elements], [lighting and color palette], [atmosphere]
Works well:
Neon-soaked cyberpunk city, holographic graffiti on rain-slicked
walls, electric blue and purple color palette, flying vehicles overhead
Classic 1940s film noir, black and white, high contrast shadows,
venetian blind light patterns, detective thriller atmosphere
Lush high fantasy world, floating islands in background, ancient ruins
with glowing runes, purple and gold magical sky
The model preserves the motion from the source video and applies the described visual world to it. A character walking in the source video will still walk in the output, but through a cyberpunk alley instead of wherever they were in the original.
Video Edit Cost
Video Edit is priced by the length of the source video rather than a flat per-job rate. The cost is approximately 141 CU per second of source video: an 8-second source costs 1,126 CU (confirmed) and a 5-second source about 705 CU. This is roughly double the per-second cost of T2V and R2V. Choose shorter source videos when cost is a concern.
Video Edit Best Practices
Source video quality determines output quality. A well-lit, clear, high-quality source produces better edits than a dark, blurry, or compressed source. The model cannot add detail that is not present in the motion data.
Use source videos with clear, readable motion. Simple, steady motion (a character walking, a camera pan across a scene) transforms more reliably than fast cuts, heavy motion blur, or chaotic handheld footage.
Describe the target world, not the action. The action comes from the source. Your prompt only needs to define the visual style, setting, and color palette.
Use audioSetting origin to preserve source audio. If the source video has meaningful audio (music, dialogue, ambient sound), set audioSetting to "origin" to keep it. The default "auto" may replace or alter audio.
Keep source videos under 15 seconds. Sources longer than 15 seconds are truncated. Plan your source clips accordingly.
Use referenceImages for style anchoring. If you want the output to match a specific visual style or color scheme, upload a style reference image and use @Image1 in the prompt to link it. This helps the model match a particular look more precisely.
Avoid firing more than 3 Video Edit jobs simultaneously. Video Edit is computationally expensive and subject to rate limits. Fire jobs in small batches of 2 to 3 at a time.
Shared Parameters and Pricing
All costs below are for 1080P output. 720P pricing may differ. T2V and R2V charge by output duration. Video Edit charges by source video duration (output matches source length). R2V cost is the same regardless of how many character references are provided.
Duration | T2V cost (1080P) | R2V cost (1080P, any number of characters) | Video Edit cost (1080P, source duration) |
|---|---|---|---|
3 seconds | 210 CU | 210 CU | 423 CU |
5 seconds | 350 CU | 350 CU | 705 CU |
8 seconds | 560 CU | 560 CU | 1,126 CU (confirmed) |
10 seconds | 700 CU | 700 CU | 1,410 CU |
15 seconds | 1,050 CU | 1,050 CU | 2,115 CU |
The rate is 70 CU per second for T2V and R2V, and approximately 141 CU per second for Video Edit (the confirmed 8-second figure of 1,126 CU comes in slightly under that rate).
All three models default to 5 seconds and 1080P. Always set duration explicitly. For production output, use 10 to 15 seconds. Processing time ranges from 3 to 8 minutes depending on server load and duration.
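The table reduces to simple arithmetic. A minimal sketch, treating Video Edit as approximately 141 CU per second:

```python
def estimate_cost_cu(model: str, seconds: int) -> int:
    """Estimate 1080P cost in CU. 'seconds' is output duration for
    t2v/r2v and source duration for video-edit."""
    if not 3 <= seconds <= 15:
        raise ValueError("duration must be between 3 and 15 seconds")
    rates = {"t2v": 70, "r2v": 70, "video-edit": 141}  # CU/s; video-edit is approximate
    return rates[model] * seconds

print(estimate_cost_cu("t2v", 10))        # 700
print(estimate_cost_cu("video-edit", 8))  # 1128 (observed billing: 1,126)
```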
Rate Limits
Happy Horse enforces strict rate limits across all three models. Firing more than 10 jobs in a short window triggers a rate limit error that pauses new job submission for approximately 15 to 20 minutes. Video Edit is particularly sensitive. Keep concurrent job counts to 10 or fewer per burst, and avoid mixing large T2V, R2V, and Video Edit batches in the same window.
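One way to stay under these limits is to submit in small bursts with a pause between them. The sketch below is a pattern, not an official client: submit_job stands in for any single-job wrapper like the request snippets above, and the defaults reflect the guidance here (bursts well under 10 jobs, 2 to 3 for Video Edit).

```python
import time

def submit_in_bursts(payloads, submit_job, burst_size=3, pause_seconds=60):
    """Submit jobs in small bursts to avoid the platform-wide rate limit.

    submit_job is any callable that fires one job and returns a job reference.
    """
    job_ids = []
    for i in range(0, len(payloads), burst_size):
        for payload in payloads[i : i + burst_size]:
            job_ids.append(submit_job(payload))
        if i + burst_size < len(payloads):
            time.sleep(pause_seconds)  # breathing room between bursts
    return job_ids
```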
Known Limitations
Prompt length limit on T2V. Prompts over approximately 200 words cause a "could not process" error. Keep prompts short and descriptive. Remove camera direction language.
R2V requires full-body reference images. Portrait crops and face-only images produce poor animation. Full body on a plain background is required for reliable results.
Video Edit costs scale with source length. An 8-second source costs over 1,100 CU. Plan source video length carefully before firing.
Characters do not reliably interact in R2V multi-character jobs. The model animates each character from its instruction but does not guarantee coordinated multi-character scenes. Use multi-character prompts as animation previews, not scripted scenes.
Rate limits are strict and platform-wide. A burst of 15 to 20 simultaneous jobs will trigger a 15 to 20 minute cooldown across all Happy Horse models.
Video Edit source truncated at 15 seconds. Longer sources are cut. There is no warning before the job starts.
No last-frame control in I2V mode. The image parameter anchors the first frame and the prompt drives what happens next; there is no separate first-frame vs. last-frame control.
No LoRA training or fine-tuning is available. Happy Horse models are third-party models on Scenario. Custom training is not supported.
Use Cases
Game trailers and character showcases (R2V): Upload full-body renders of your game characters and animate them with movement descriptions. Generate idle animations, combat stances, and emotes from a single reference image, and use the clips directly in trailers.
Cinematic scene generation (T2V): Generate epic establishing shots, environmental transitions, and cinematic b-roll from text descriptions. Use for storyboarding, pre-visualization, and social content.
Animated character storytelling (R2V + T2V): Combine R2V for character animation with T2V for environmental scenes to produce short animated sequences. Animate a mascot, a brand character, or a game hero with a specific set of movements and expressions.
Style transfer and world replacement (Video Edit): Take existing footage (a product shot, a fashion walk, a lifestyle clip) and transport it to a different visual world without re-shooting. Apply film noir, cyberpunk, fantasy, or watercolor styles to real or AI-generated source video.
Architecture and environment visualization (T2V + I2V): Animate architectural renders, interior design concepts, and landscape images. Use I2V to start from a high-quality visualization and bring it to life with camera motion and environmental animation such as moving water, shifting light, and drifting mist.