Spatial Prompting with Veo 3 | AI Video Generation with Visual Annotations

Introduction

Spatial prompting changes how creators direct AI video models like Veo 3. Instead of relying solely on complex text descriptions, you draw, annotate, and write directly on your first frame, turning a static image into a visual storyboard that the model can interpret and animate.

This comprehensive guide explores the emerging field of spatial prompting, providing detailed workflows, community-discovered techniques, and practical applications for creating professional-quality videos using visual direction methods. Whether you're a filmmaker, content creator, or AI enthusiast, spatial prompting offers unprecedented control over video generation through intuitive visual interfaces.

The technique has gained significant attention from the AI community, with Google Labs calling it the "COOLEST trick in Flow" [1]. Early adopters across social media platforms have demonstrated capabilities ranging from simple motion paths to complex sequential animations, all controlled through visual annotations rather than lengthy text prompts.

What is Spatial Prompting?

Spatial prompting is an innovative approach to video generation that enables creators to provide visual instructions directly on image frames rather than relying exclusively on text descriptions. This technique leverages the advanced visual understanding capabilities of modern AI models like Veo 3 to interpret drawings, annotations, and written instructions placed directly onto images.

At its core, spatial prompting transforms the traditional text-to-video paradigm into a more intuitive visual-to-video workflow. Instead of crafting complex written descriptions to achieve specific camera movements, object transformations, or scene transitions, creators can simply draw arrows, highlight regions, write instructions, or sketch desired changes directly onto their reference images.

The technology works by analyzing both the visual content of the base image and the spatial annotations overlaid on top of it. Veo 3 can recognize and interpret various types of visual cues, including:

Visual Annotations: Arrows indicating motion paths, circles highlighting focal points, lines showing camera movements, and sketched elements representing desired changes or additions to the scene.
Text Instructions: Written commands placed directly on the image that provide specific directions for animation, transformation, or sequencing. These instructions are typically designed to be removed during the first frame of video playback.
Regional Modifications: Painted or highlighted areas that indicate where specific changes should occur, such as lighting effects, object transformations, or environmental modifications.
Sequential Commands: Multi-step instructions that guide the AI through complex animations or scene transitions in a predetermined order.

The power of spatial prompting lies in its ability to bridge the gap between human visual thinking and AI interpretation. Rather than translating visual ideas into words and then back into visuals, creators can communicate directly in the visual language that both humans and AI can understand intuitively.

This approach has proven particularly effective for scenarios that are difficult to describe in text, such as precise camera movements, complex object interactions, or nuanced timing relationships between multiple elements in a scene. The technique has been embraced by the creative community as a more natural and efficient way to direct AI video generation, leading to more predictable and controllable results.

How Spatial Prompting Works

The technical foundation of spatial prompting relies on Veo 3's advanced multimodal understanding capabilities, which allow the model to simultaneously process visual content, spatial relationships, and textual instructions within a single input frame. This sophisticated interpretation system enables the AI to understand not just what objects are present in an image, but also how they should move, transform, or interact based on visual cues.

The Interpretation Process

When you upload an annotated image to Veo 3, the model performs several layers of analysis:

Visual Content Recognition: The AI first identifies all objects, characters, environments, and spatial relationships within the base image, creating a comprehensive understanding of the scene's composition and context.
Annotation Analysis: Next, the model analyzes any visual annotations, interpreting the meaning behind drawn elements such as arrows, circles, lines, or sketched additions. The AI can distinguish between intentional annotations and original image content, understanding that these overlays represent instructions rather than permanent visual elements.
Text Instruction Processing: Any text written directly on the image is read and incorporated into the generation process. The model can understand complex instructions like "immediately delete instructions in white on the first frame and execute in order" [2], treating these as meta-commands that guide the animation process.
Spatial Relationship Mapping: The AI maps the relationships between annotations and image elements, understanding which instructions apply to which objects or regions of the scene.
Temporal Sequencing: For multi-step instructions, the model creates a temporal sequence of actions, determining the order and timing of various transformations or movements.

Execution and Animation

Once the interpretation phase is complete, Veo 3 generates video content that follows the spatial instructions while maintaining visual coherence and realistic motion. The model typically removes annotation elements during the first few frames of the video, creating a clean animated sequence that appears to naturally follow the intended directions.

The AI demonstrates remarkable sophistication in understanding context and intent. For example, when presented with an arrow drawn from a door to a tree with the label "walk here," Veo 3 can infer that a character should move along that path, even if the character isn't explicitly visible in the annotation [3].

Integration with Traditional Prompting

Spatial prompting works most effectively when combined with traditional text prompts that provide context, style, and atmospheric details. While the spatial annotations handle specific movements and transformations, accompanying text can describe lighting conditions, emotional tone, camera style, or environmental effects that enhance the overall production value.

This hybrid approach allows creators to leverage the precision of visual direction while maintaining the descriptive power of natural language, resulting in more sophisticated and controllable video generation outcomes.

Getting Started with Spatial Prompting in Scenario

Implementing spatial prompting in Scenario is straightforward and integrates seamlessly with the platform's existing video generation workflow. This section provides step-by-step instructions for creating your first spatially-prompted video using Veo 3.

Step 1: Prepare Your Base Image

Start with a high-quality image that will serve as your first frame. This can be a photograph you've taken, an AI-generated image from Scenario's image generation tools, a frame extracted from existing video content, or a digital artwork or illustration.

The image should clearly show the elements you want to animate and provide sufficient visual context for your intended transformations.

Step 2: Access Veo 3 in Scenario

Navigate to Scenario's video generation interface and select Veo 3 from the available model options. The spatial prompting capabilities are native to Veo 3 and don't require any special activation or settings.

Step 3: Create Your Spatial Annotations

Using Scenario's built-in sketch tool or an external image editor, add your visual instructions to the base image:

For Motion Paths: Draw arrows or lines indicating the direction and path of movement for objects or characters in your scene.
For Focal Points: Use circles or highlights to mark areas that should receive special attention or undergo specific transformations.
For Text Instructions: Write clear, concise commands directly on the image using contrasting colors that will be easily readable by the AI.
For Regional Effects: Paint or outline areas where specific visual effects, lighting changes, or transformations should occur.

Upload your annotated image as the first frame in Scenario's video generation interface. The platform will automatically detect that you're using spatial prompting techniques and optimize the processing accordingly.

Step 4: Craft Your Supporting Prompt

While your spatial annotations handle specific movements and transformations, include a supporting text prompt that provides:

Overall scene context and atmosphere
Style preferences (cinematic, documentary, animated, etc.)
Technical specifications (camera quality, lighting conditions)
Duration and pacing preferences

A typical supporting prompt might read: "Cinematic quality, smooth camera movement, natural lighting, 8-second duration. Immediately delete instructions on the first frame and execute in order."
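If you generate many videos, it can help to assemble the supporting prompt from its parts so the deletion command is never forgotten. The following is a minimal sketch in Python, not a Scenario or Veo 3 API: the function name build_supporting_prompt and its parameters are hypothetical, and the output is just a string you paste into the prompt field.

```python
# Deletion meta-command quoted from this guide.
CLEANUP_COMMAND = "Immediately delete instructions on the first frame and execute in order."


def build_supporting_prompt(style, technical, pacing, cleanup=CLEANUP_COMMAND):
    """Join style, technical, and pacing phrases, then append the cleanup command.

    All parameter names are illustrative; nothing here talks to a real service.
    """
    phrases = [style, *technical, pacing]
    phrases = [p.strip() for p in phrases if p and p.strip()]
    if not phrases:
        # Nothing descriptive was supplied, so only the cleanup command remains.
        return cleanup
    description = ", ".join(phrases)
    # Capitalize the first character so the result reads as a sentence.
    description = description[0].upper() + description[1:]
    return f"{description}. {cleanup}"


prompt = build_supporting_prompt(
    style="cinematic quality",
    technical=["smooth camera movement", "natural lighting"],
    pacing="8-second duration",
)
print(prompt)
```

With these inputs the function reproduces the sample prompt shown above, which makes it easy to vary one ingredient at a time between runs.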

Step 5: Generate and Review

Initiate the video generation process and monitor the results. Veo 3 will interpret your spatial annotations and create a video that follows your visual directions while maintaining high production quality and realistic motion.

Step 6: Iterate and Refine

Review the generated video and note how well the AI interpreted your spatial instructions. Use this feedback to refine your annotation techniques for future projects, adjusting the clarity, positioning, and specificity of your visual directions as needed.

Core Techniques and Methods

The spatial prompting community has developed several proven techniques for achieving specific types of video effects and animations. These methods have been tested and refined through extensive experimentation, providing reliable approaches for common video generation scenarios.

a. Sequential Command Execution

One of the most powerful spatial prompting techniques involves writing step-by-step instructions directly on the image frame. This method allows for complex, multi-stage animations that unfold in a predetermined sequence.

Implementation: Write numbered or ordered instructions directly on your image using a contrasting color (typically white text on dark backgrounds or dark text on light backgrounds). Include the meta-command "immediately delete instructions on the first frame and execute in order" in your supporting prompt.

Example Application: Justine Moore demonstrated this technique with remarkable results, describing it as "the coolest emergent capability I've seen in a video model" [2]. Her example showed a series of text instructions overlaid on an indoor scene, with Veo 3 successfully executing each command in sequence while removing the text annotations from the final video.

Best Practices:

Keep individual instructions concise and specific
Use clear, readable fonts and high contrast colors
Number or order instructions when sequence matters
Test with simple sequences before attempting complex multi-step animations
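The practices above (numbering steps and keeping each one concise) can be checked before you write anything on the image. This is a small sketch with hypothetical helper names; the six-word budget is an arbitrary assumption for illustration, not a documented Veo 3 limit.

```python
def number_instructions(steps):
    """Prefix each non-empty step with its 1-based position, e.g. '1. Lamp turns on'."""
    cleaned = [step.strip() for step in steps if step.strip()]
    return [f"{index}. {text}" for index, text in enumerate(cleaned, start=1)]


def too_long(steps, max_words=6):
    """Return the steps whose word count exceeds max_words (an assumed budget)."""
    return [step for step in steps if len(step.split()) > max_words]


steps = [
    "Lamp turns on",
    "Cat jumps onto the sofa",
    "Camera slowly zooms in on the cat",
]

# Lines to write on the image, in order.
for line in number_instructions(steps):
    print(line)

# Steps that may be worth shortening before annotating.
print("Consider shortening:", too_long(steps))
```

The numbered lines are what you would write onto the frame; the second helper only flags candidates for trimming and leaves the decision to you.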

b. Motion Path Drawing

Drawing arrows, lines, or paths directly on images provides precise control over object and camera movement, eliminating the guesswork often associated with text-based motion descriptions.

Implementation: Use drawing tools to create clear directional indicators on your image. Arrows work well for character movement, curved lines for camera pans, and dotted lines for more subtle or gradual movements.

Example Application: Rory Flynn's diagram-to-video examples showcase sophisticated motion path techniques, including complex camera movements around mountain roads and precise object tracking [4]. His work demonstrates how detailed motion diagrams can be translated into cinematic video sequences with professional-quality camera work.

Advanced Techniques:

Combine multiple arrows for complex movement patterns
Use different colors to indicate different types of motion
Vary line thickness to suggest speed or intensity
Include timing indicators along paths for pacing control
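Motion paths are usually drawn by hand, but they can also be generated when you want repeatable annotations. The sketch below builds a transparent SVG overlay containing one arrow along a list of points, using only the Python standard library. The function name, the default red color, and the idea of compositing the SVG over the base image in an image editor before upload are all assumptions for illustration.

```python
import xml.etree.ElementTree as ET

SVG_NS = "http://www.w3.org/2000/svg"


def arrow_overlay_svg(width, height, points, color="#ff0000", stroke_width=8):
    """Return SVG markup for a polyline through `points` ending in an arrowhead."""
    if len(points) < 2:
        raise ValueError("a motion path needs at least two points")

    svg = ET.Element("svg", {
        "xmlns": SVG_NS,
        "width": str(width),
        "height": str(height),
        "viewBox": f"0 0 {width} {height}",
    })

    # Arrowhead definition; orient="auto" rotates it to follow the last segment.
    defs = ET.SubElement(svg, "defs")
    marker = ET.SubElement(defs, "marker", {
        "id": "head",
        "markerWidth": "6",
        "markerHeight": "6",
        "refX": "5",
        "refY": "3",
        "orient": "auto",
    })
    ET.SubElement(marker, "path", {"d": "M0,0 L6,3 L0,6 Z", "fill": color})

    # The path itself: an unfilled polyline with the arrowhead at its end.
    ET.SubElement(svg, "polyline", {
        "points": " ".join(f"{x},{y}" for x, y in points),
        "fill": "none",
        "stroke": color,
        "stroke-width": str(stroke_width),
        "marker-end": "url(#head)",
    })
    return ET.tostring(svg, encoding="unicode")


svg_text = arrow_overlay_svg(1280, 720, [(120, 400), (300, 380), (520, 240)])
print(svg_text)
```

Save the printed markup as an .svg file sized to match your first frame, place it over the image, and export the flattened result as the annotated frame.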

c. Regional Transformation Marking

Highlighting or painting specific regions of an image allows for targeted effects and transformations without affecting the entire scene.

Implementation: Use selection tools, brushes, or highlighting to mark areas where specific changes should occur. Combine with text labels to specify the type of transformation desired.

Example Application: Kris Kashtanova's comprehensive guide demonstrates regional marking techniques, showing how painted areas can indicate where lighting effects, object transformations, or environmental changes should occur [5]. The technique works with both rough sketching and precise selections.

Practical Applications:

Lighting and shadow effects
Weather and atmospheric changes
Object appearance modifications
Background transformations
Special effects placement

d. Motion Brush Effects

This technique involves drawing the actual path that objects should follow, creating a visual roadmap for AI-generated movement.

Implementation: Draw the complete path that a moving object should follow, often accompanied by instructions to remove the path indicator after the first frame.

Example Application: Rory Flynn's Roomba vacuum example perfectly illustrates this technique, where an arrow path was drawn to show the exact route the vacuum should follow while cleaning [4]. The AI successfully animated the Roomba following the drawn path while making dirt disappear along the route.

Optimization Tips:

Make paths clear and unambiguous
Consider the physics of the movement
Account for obstacles and environmental constraints
Test with simple paths before attempting complex routes

e. Diagram-to-Video Conversion

Complex technical diagrams or storyboards can be converted directly into animated sequences, making this technique particularly valuable for technical visualization and educational content.

Implementation: Create detailed diagrams showing camera positions, subject movements, and scene transitions. Use professional storyboard conventions or technical drawing standards for best results.

Example Application: Rory Flynn's car chase sequence demonstrates sophisticated diagram-to-video conversion, where a detailed shot diagram with camera positions and movement indicators was successfully translated into a dynamic action sequence [4].

Professional Applications:

Film pre-visualization
Technical documentation
Educational animations
Product demonstrations
Architectural walkthroughs

Community Examples and Case Studies

The spatial prompting community has produced numerous innovative examples that demonstrate the versatility and power of visual direction techniques. These real-world applications provide valuable insights into effective implementation strategies and creative possibilities.

Case Study 1: Google Labs Discovery

Google Labs' announcement of spatial prompting marked the technique's official recognition as a breakthrough capability. Their demonstration emphasized the paradigm shift from text-based to visual-based prompting.

The Discovery: Google Labs described spatial prompting as the "COOLEST trick in Flow," highlighting how creators can "just draw it" instead of "wordsmithing the perfect prompt" [1].

Implementation Details: The official example showed a person in a blue jacket against a graffiti wall, with visual annotations guiding scene transformations. The technique demonstrated how simple doodles and brief descriptions could replace complex text prompts.

Case Study 2: Sequential Animation Mastery

Justine Moore's sequential command demonstration showcased the sophisticated temporal control possible with spatial prompting.

Challenge: Creating complex, multi-step animations that unfold in a specific sequence without losing visual coherence.
Solution: Moore placed white text instructions directly on a dark indoor scene, using the prompt "immediately delete instructions in white on the first frame and execute in order" [2].
Results: The AI successfully executed each written command in sequence while seamlessly removing the text annotations, creating a professional-quality animated sequence.
Impact: This example proved that spatial prompting could handle complex temporal relationships and multi-stage transformations, opening possibilities for narrative video creation.

Case Study 3: Technical Diagram Animation

Rory Flynn's diagram-to-video work represents some of the most sophisticated spatial prompting applications, demonstrating professional-level technical visualization.

Project Scope: Flynn created three distinct examples: subject action with camera motion, motion brush effects, and complex sequencing [4].

Technical Implementation:

Car Chase Sequence: A detailed shot diagram with red annotations showing camera positions and movement paths was converted into a dynamic action sequence with multiple camera angles and precise timing.
Motion Brush Effect: A Roomba vacuum path was drawn with arrows, and the AI successfully animated the device following the exact route while creating realistic cleaning effects.
Sandwich Tornado: A complex product advertisement sequence involving ingredient transformation and special effects.

Professional Applications: Flynn's work demonstrates how spatial prompting can replace traditional pre-visualization tools in professional video production, offering faster iteration and more intuitive direction methods.

Case Study 4: Comprehensive Technique Guide

Kris Kashtanova's systematic exploration of spatial prompting techniques provided the community with a comprehensive understanding of the technology's capabilities and limitations.

Methodology: Kashtanova tested various annotation types, including painted regions, text labels, and visual cues, documenting what Veo 3 could recognize and interpret [5].

Key Findings:

The AI recognizes meaning behind painted regions, even with rough sketching
Colors and shapes from visual annotations are incorporated into the generation process
Text painted directly onto frames becomes part of the prompt interpretation
Extensive modifications may cause the AI to create entirely new scenes rather than transforming existing ones

Practical Guidelines: The research established best practices for maintaining scene continuity while achieving desired transformations, emphasizing focused and purposeful spatial edits.

Case Study 5: Creative Applications

Various community members have explored artistic and creative applications of spatial prompting, pushing the boundaries of what's possible with visual direction.

Artistic Transformations: Creators have used spatial prompting for surreal and artistic effects, such as morphing objects, impossible physics, and dreamlike sequences.
Character Animation: Multiple examples demonstrate precise character movement control, facial expression changes, and complex interaction sequences.
Environmental Effects: Weather changes, lighting transformations, and atmospheric effects have been successfully controlled through spatial annotations.
Style Transfers: Some creators have used spatial prompting to guide style changes within videos, creating smooth transitions between different visual aesthetics.

Lessons Learned from Community Experimentation

The collective experimentation has revealed several important principles:

Clarity Over Complexity: Simple, clear annotations consistently produce better results than overly detailed or cluttered visual instructions.
Context Preservation: Successful spatial prompting maintains the original scene's context while adding desired transformations, rather than completely replacing the visual content.
Iterative Refinement: The most successful practitioners treat spatial prompting as an iterative process, refining their annotation techniques based on the results each generation produces.
Hybrid Approaches: Combining spatial annotations with well-crafted text prompts produces superior results compared to using either technique in isolation.

Best Practices and Pro Tips

Effective spatial prompting requires understanding both the technical capabilities of Veo 3 and the practical considerations that lead to successful video generation. These best practices have been developed through extensive community experimentation and represent proven approaches for achieving consistent, high-quality results.

Annotation Design Principles

Use High Contrast Colors: Ensure your annotations are clearly visible against the background image. White text on dark backgrounds and dark text on light backgrounds provide optimal readability for AI interpretation.
Keep Instructions Concise: Brief, specific instructions work better than lengthy descriptions. The AI processes spatial annotations more effectively when they contain focused, actionable directions.
Maintain Visual Hierarchy: When using multiple annotations, establish a clear visual hierarchy through size, color, or positioning to help the AI understand the relative importance of different instructions.
Consider Annotation Placement: Position text and visual cues in areas that won't interfere with important scene elements. The AI needs to distinguish between annotations and original content.
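"High contrast" can be made measurable. One option is to borrow the contrast ratio defined in the WCAG 2 accessibility guidelines, which was designed for human readers; whether Veo 3 benefits from the same measure is an assumption, not something this guide's sources confirm. The sketch below picks white or black text for a given background color sampled from the area where you plan to write.

```python
def _linear(channel):
    """Convert an 8-bit sRGB channel to linear light (WCAG 2 formula)."""
    c = channel / 255
    return c / 12.92 if c <= 0.03928 else ((c + 0.055) / 1.055) ** 2.4


def relative_luminance(rgb):
    """Relative luminance of an (R, G, B) color with 0-255 channels."""
    r, g, b = (_linear(v) for v in rgb)
    return 0.2126 * r + 0.7152 * g + 0.0722 * b


def contrast_ratio(color_a, color_b):
    """WCAG contrast ratio, from 1 (identical) to 21 (black on white)."""
    lum_a, lum_b = relative_luminance(color_a), relative_luminance(color_b)
    lighter, darker = max(lum_a, lum_b), min(lum_a, lum_b)
    return (lighter + 0.05) / (darker + 0.05)


def pick_text_color(background):
    """Return white or black, whichever contrasts more with the background."""
    white, black = (255, 255, 255), (0, 0, 0)
    if contrast_ratio(white, background) >= contrast_ratio(black, background):
        return white
    return black


dark_wall = (20, 20, 30)
bright_sky = (240, 240, 230)
print(pick_text_color(dark_wall), round(contrast_ratio((255, 255, 255), dark_wall), 1))
print(pick_text_color(bright_sky), round(contrast_ratio((0, 0, 0), bright_sky), 1))
```

This matches the rule of thumb above: dark regions get white text and light regions get dark text, with the ratio giving you a number to compare candidate placements.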

Prompt Structure Optimization

Include Deletion Commands: Always include instructions for removing annotations, such as "immediately delete instructions on the first frame and execute in order" or "remove annotations after first frame."

Combine Spatial and Textual Elements: Use spatial annotations for specific movements and transformations while relying on text prompts for atmosphere, style, and technical specifications.

Specify Execution Order: For multi-step sequences, clearly indicate the order of operations through numbering, sequencing words, or temporal indicators.

Provide Context: Include enough contextual information in your text prompt to help the AI understand the overall scene and intended outcome.

Technical Considerations

Image Quality Matters: Start with high-resolution, well-lit images that provide clear visual information. Poor source images can limit the effectiveness of spatial prompting.
Account for Model Limitations: Understand that extensive modifications may cause the AI to create entirely new scenes rather than transforming existing ones. Keep modifications focused and purposeful.
Monitor Processing Time: Complex spatial annotations may require longer processing times. Plan accordingly for projects with tight deadlines.

Common Pitfalls to Avoid

Over-Annotation: Avoid cluttering images with too many annotations. The AI can become confused when presented with conflicting or overwhelming visual instructions.
Ambiguous Directions: Ensure annotations are clear and unambiguous. Vague or unclear instructions often lead to unexpected results.
Ignoring Scene Physics: Consider the physical plausibility of your requested movements and transformations. The AI works best when spatial instructions align with realistic motion patterns.
Neglecting Continuity: Maintain visual and narrative continuity throughout your spatial instructions to avoid jarring transitions or inconsistent results.

Advanced Optimization Techniques

Layer Annotations Strategically: When using multiple types of annotations, layer them in order of importance, with the most critical instructions being most prominent.
Use Color Coding: Develop a consistent color coding system for different types of instructions (red for camera movement, blue for object transformation, etc.).
Incorporate Timing Indicators: Add temporal cues to your annotations to help the AI understand pacing and timing relationships.
Test Annotation Visibility: Ensure all annotations are clearly visible at the resolution and quality settings you plan to use for generation.
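A color-coding system only helps if it is applied consistently. As a rough sketch, you could keep the palette in one place and check a planned annotation list against it before drawing. The red and blue entries follow the example above; the yellow timing entry, the dictionary layout, and the function name are hypothetical.

```python
COLOR_CODE = {
    "camera": "#ff0000",     # red for camera movement (from the example above)
    "transform": "#0000ff",  # blue for object transformation (from the example above)
    "timing": "#ffff00",     # yellow for timing cues (hypothetical addition)
}


def off_palette(annotations):
    """Return annotations whose color does not match the palette for their kind.

    Annotations with an unknown kind are also returned, since the palette
    has no color to compare them against.
    """
    return [
        a for a in annotations
        if COLOR_CODE.get(a["kind"]) != a["color"].lower()
    ]


planned = [
    {"kind": "camera", "color": "#FF0000", "label": "pan left"},
    {"kind": "transform", "color": "#00ff00", "label": "lamp turns on"},
]

for annotation in off_palette(planned):
    print("Check color for:", annotation["label"])
```

Here the second annotation is flagged because a transformation is drawn in green rather than the palette's blue.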

Troubleshooting Common Issues

If spatial prompting doesn’t behave as expected, it’s usually due to annotation visibility, prompt clarity, or input quality.

If the AI ignores your annotations, make sure they’re clearly visible—use high-contrast colors, large and simple shapes, and avoid blending them into busy parts of the image. Always include a prompt instruction like “Remove annotations after frame 1” to help the model understand they’re temporary.
If annotations remain visible in the output, double-check that your deletion command is present and clearly worded. Stick to legible fonts and avoid placing annotations over key scene elements.
If motion doesn't follow your drawn paths, simplify the instructions. Arrows and curved lines work better than complex sketches. Reinforce motion intent with a short line in the text prompt, and break long movements into simpler segments if needed.
If your scene changes too much or feels overhauled, reduce the scope of your edits. Focus on one transformation at a time and use painted or highlighted regions to localize changes instead of asking the model to reimagine the entire frame.
If sequential instructions play out of order, label each clearly using numbers (e.g. 1, 2, 3) and spread them out across the frame. Terms like “then” or “finally” help guide the model’s sense of timing.
If pacing feels off, include descriptive timing cues in your prompt, such as “slow zoom,” “hold for 3 seconds,” or “quick transition.” These help shape how the sequence unfolds.
If results vary too much between runs, standardize your annotation style and reuse effective combinations of prompt, model, and image. For quality concerns, start with high-resolution images, use minimal but purposeful annotations, and switch to Quality mode when finalizing your video.
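The advice above to break long movements into simpler segments can be applied mechanically when a path is stored as a list of points. The sketch below splits a path into shorter pieces that share their end points, so each piece can be drawn as its own arrow or used in a separate generation. The function name and the three-point default are assumptions for illustration.

```python
def split_path(points, max_points=3):
    """Split a path into consecutive segments of at most max_points points.

    Neighboring segments share one point, so the motion stays continuous.
    """
    if max_points < 2:
        raise ValueError("each segment needs at least two points")
    segments = []
    start = 0
    last = len(points) - 1
    while start < last:
        end = min(start + max_points - 1, last)
        segments.append(points[start:end + 1])
        start = end  # the next segment begins where this one ended
    return segments


path = [(0, 0), (1, 0), (2, 1), (3, 1), (4, 2)]
for number, segment in enumerate(split_path(path), start=1):
    print(number, segment)
```

Each resulting segment is a short, simple path, which is the kind of instruction the troubleshooting notes describe as more reliable than one long, complex sketch.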

Future Applications and Possibilities

Spatial prompting represents just the beginning of visual-based AI direction, with numerous emerging applications and potential developments on the horizon. As the technology matures and becomes more widely adopted, we can expect to see significant expansions in both capability and accessibility.

Emerging Professional Applications

Film and Television Pre-visualization: Spatial prompting is already showing promise as a replacement for traditional storyboarding and pre-visualization tools. Directors and cinematographers can quickly test camera movements, scene transitions, and visual effects without the time and expense of traditional pre-production methods.

Advertising and Marketing: The ability to rapidly prototype video concepts through spatial prompting offers significant advantages for advertising agencies and marketing teams. Complex product demonstrations, brand storytelling, and campaign concepts can be visualized and refined before committing to full production.

Educational Content Creation: Spatial prompting's intuitive visual interface makes it particularly valuable for educational applications, where complex concepts can be animated and explained through visual demonstration rather than lengthy text descriptions.

Technical Documentation: Engineering, architecture, and scientific fields can leverage spatial prompting for creating clear, animated explanations of complex processes, mechanisms, and spatial relationships.

Technological Developments

Enhanced Recognition Capabilities: Future versions of spatial prompting technology will likely offer improved recognition of subtle visual cues, more sophisticated understanding of artistic styles, and better interpretation of complex spatial relationships.

Real-time Collaboration: Development of collaborative spatial prompting tools could enable teams to work together on video concepts in real-time, with multiple contributors adding annotations and refinements to shared projects.

Integration with 3D Workflows: As spatial prompting technology evolves, integration with 3D modeling and animation software could create seamless workflows between traditional 3D production and AI-generated video content.

Mobile and Accessibility Applications: Simplified spatial prompting interfaces designed for mobile devices and accessibility needs could democratize video creation, making professional-quality video generation available to broader audiences.

Creative and Artistic Evolution

Interactive Storytelling: Spatial prompting could enable new forms of interactive narrative where viewers can influence story direction through visual input, creating personalized video experiences.

Artistic Collaboration: The technology opens possibilities for new forms of human-AI artistic collaboration, where artists provide visual direction and AI handles technical execution, leading to novel creative expressions.

Educational Democratization: As spatial prompting becomes more accessible, it could democratize video education, allowing teachers and trainers without technical video skills to create engaging visual content.

Cultural Documentation: The intuitive nature of spatial prompting makes it valuable for cultural preservation projects, where community members can easily direct the creation of videos documenting traditions, stories, and practices.

Industry Integration

Gaming and Interactive Media: Spatial prompting techniques could be integrated into game development workflows, allowing designers to quickly prototype cutscenes, character animations, and environmental effects.

Virtual and Augmented Reality: As VR and AR technologies mature, spatial prompting could provide intuitive methods for directing immersive experiences and spatial narratives.

Social Media Evolution: Platform integration of spatial prompting tools could transform social media content creation, enabling users to create sophisticated video content with minimal technical expertise.

Enterprise Applications: Business applications could include training video creation, process documentation, and internal communication tools that leverage spatial prompting for clear, engaging content.

The future of spatial prompting lies not just in technological advancement, but in its potential to fundamentally change how humans communicate visual ideas to machines, creating more intuitive, accessible, and powerful tools for creative expression and practical communication.
