
Image prompts and video prompts look similar on the surface. Both describe what you want. Both use adjectives. Both benefit from specificity.

But video prompts have a dimension that image prompts do not: time. A frame is frozen. A video is a directed sequence of frames where the camera has a position, an intention, and a movement. Where the subject has a trajectory. Where duration, rhythm, and motion pacing determine whether the output is cinematic or static.

Most operators who come from image generation write video prompts the same way they write image prompts — describe the scene, add quality adjectives, submit. The output looks flat. Not because the platform is bad, but because the prompt has no temporal structure.

The diagram above breaks the prompt into five visual components; some practitioners split style out as its own component. This track folds those five into four working layers: Scene Setup absorbs Style/Mood (the world and how it's lit), and Duration lives in the Technical Specification layer. The four layers below are the framework you'll be tested on — map the diagram's five chips onto them as you read.

The Four Layers of a Video Prompt

Every competent video prompt has four layers. Each layer adds a dimension of control. Together they close the gap between what you imagined and what the model generates.

Layer 1: Scene

The scene layer describes the world the camera is looking at. This is the component shared with image prompts — environment, lighting, time of day, color palette, weather, texture.

But in video prompts, the scene is not static. It has a mood that the camera will move through. Describe it with that movement in mind:

"Rain-soaked rooftop at dusk, neon signs reflecting in puddles, fog rolling in"
"Sunlit forest clearing, golden hour, dust particles visible in light beams"
"Brutalist concrete interior, single overhead fluorescent, clinical and cold"

Precision here prevents the model from inventing a generic version of your intent. The more specific the scene description, the less the model fills gaps with defaults.

Layer 2: Subject Motion

Subject motion describes what is happening in the frame — what moves, how it moves, and at what speed. This is the action layer. Without it, the camera might zoom in on a static subject, which reads as image generation with slight motion artifacts rather than intentional video.

"A lone figure walks slowly toward camera"
"Leaves fall in slow motion from a dead tree"
"A fighter steps into the frame from the left and raises both hands"

Tempo matters. "Slowly," "rapidly," "gradually" are not decoration — they define the pacing of the generated clip. A slow subject movement paired with a slow camera movement creates a meditative quality. Rapid subject action paired with a tracking shot creates kinetic energy.

Layer 3: Camera Movement

Camera movement is where most operators leave the most quality on the table. The camera is a character in every video. It has a perspective and an intention. The model needs to know both.

The core vocabulary:

Establishing shot — wide, often static or slow pan, used to set location and context before moving closer. "Wide establishing shot of downtown Chicago at night, slow pan left."

Dolly in / Dolly out — the camera physically moves toward or away from the subject. Dolly in creates intimacy and builds tension. Dolly out creates revelation and release. "Slow dolly in toward the subject's face as they speak."

Pan left / Pan right — horizontal rotation of the camera on a fixed axis. Reveals what's beside the subject without changing position. "Slow pan left revealing the full skyline."

Tilt up / Tilt down — vertical rotation of the camera on a fixed axis. Looking up creates scale and awe. Looking down creates surveillance or vulnerability. "Camera tilts up slowly to reveal the full height of the structure."

Tracking shot — the camera moves alongside the subject, maintaining relative distance. Creates following motion. "Tracking shot following the runner through the crowd."

Aerial / Crane — overhead or descending perspective. Creates scale, reveals geography, conveys godlike observation. "Aerial view descending slowly over the city grid."

Static shot — the camera does not move. The world moves around it. This is not the absence of a camera decision — it is a deliberate one. "Static wide shot, nothing moves except the flag in the wind."

Layer 4: Technical Specification

The technical layer closes the prompt with output requirements: duration in seconds, aspect ratio, and any format-specific instructions.

Maximum duration varies by platform. Veo tops out at 8 seconds, Runway at 10, and Sora at 25 (Pro tier). Specify duration explicitly; otherwise the model picks a default that may not match your downstream use case.

Aspect ratio matters for distribution:

16:9 — YouTube, landscape web
9:16 — Instagram Reels, TikTok, YouTube Shorts
1:1 — Instagram feed, versatile square
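
One way to keep the technical layer consistent is to generate it rather than type it. The sketch below is a minimal illustration, not a platform API: the function name `technical_suffix`, the dictionary keys, and the channel labels are all hypothetical, and the duration limits are simply the figures quoted above, which may change as platforms update.

```python
# Maximum clip lengths in seconds, copied from the figures quoted in this lesson.
# Assumption: treat these as a snapshot, not as authoritative platform limits.
MAX_DURATION_S = {"veo": 8, "runway": 10, "sora": 25}

# Aspect ratio per distribution channel, from the list above.
# The channel keys are hypothetical labels chosen for this sketch.
ASPECT_RATIO = {
    "youtube": "16:9",
    "web": "16:9",
    "reels": "9:16",
    "tiktok": "9:16",
    "shorts": "9:16",
    "instagram_feed": "1:1",
}


def technical_suffix(platform: str, channel: str, seconds: int) -> str:
    """Return the closing technical clause for a prompt."""
    limit = MAX_DURATION_S[platform]
    if not 1 <= seconds <= limit:
        raise ValueError(f"{platform} supports 1 to {limit} seconds, got {seconds}")
    return f"{seconds} seconds, {ASPECT_RATIO[channel]}"


print(technical_suffix("veo", "youtube", 8))  # 8 seconds, 16:9
```

Rejecting an out-of-range duration before submission is a design choice: it surfaces the mismatch while it is still free to fix.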

"8 seconds, 16:9" at the end of your prompt takes three seconds to add and prevents a generation at the wrong aspect ratio that costs you another cycle.

Complete Prompt Examples

Weak prompt:

"Futuristic city at night with neon lights, cinematic"

Strong prompt:

"Slow dolly in toward a lone figure standing at the edge of a rain-soaked rooftop, neon signs blurring in the background, fog rolling in from the harbor. The figure turns slightly toward camera as the dolly closes distance. Cinematic, 8 seconds, 16:9."

The second prompt specifies camera movement (slow dolly in), subject motion (turns slightly toward camera), scene (rain-soaked rooftop, neon, fog), and technical spec (8s, 16:9). The first prompt offers only a thin scene description and leaves camera movement, subject motion, and technical spec to model defaults.
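
The four-layer check can also be made mechanical. The following is a minimal sketch, assuming a hypothetical `VideoPrompt` container invented for this illustration; it is not part of any platform SDK. It reports which layers are empty and assembles the filled layers into one prompt string, leading with the camera as the strong example does.

```python
from dataclasses import dataclass


@dataclass
class VideoPrompt:
    """Hypothetical container for the four layers described in this lesson."""
    scene: str = ""
    subject_motion: str = ""
    camera_movement: str = ""
    technical: str = ""

    def missing_layers(self) -> list[str]:
        # Field order follows the class definition above.
        return [name for name, text in vars(self).items() if not text.strip()]

    def render(self) -> str:
        missing = self.missing_layers()
        if missing:
            raise ValueError("empty layers: " + ", ".join(missing))
        # Lead with the camera and close with the technical spec.
        parts = (self.camera_movement, self.scene, self.subject_motion, self.technical)
        return " ".join(p.strip().rstrip(".") + "." for p in parts)


weak = VideoPrompt(scene="Futuristic city at night with neon lights, cinematic")
print(weak.missing_layers())  # ['subject_motion', 'camera_movement', 'technical']

strong = VideoPrompt(
    scene="Rain-soaked rooftop, neon signs blurring in the background, fog rolling in from the harbor",
    subject_motion="The figure turns slightly toward camera as the dolly closes distance",
    camera_movement="Slow dolly in toward a lone figure at the edge of the rooftop",
    technical="Cinematic, 8 seconds, 16:9",
)
print(strong.render())
```

The layer order inside `render` is an assumption of this sketch; the lesson does not prescribe one, and some platforms may respond better to a different ordering.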

How Video Prompts Differ from Image Prompts

The clearest way to internalize the difference: an image prompt describes a painting. A video prompt describes a scene in a film.

In a film scene, the director has made decisions about where the camera starts, where it ends, what the actor does during the shot, and how long the shot runs before the cut. These are not implied — they are specified in the shot list.

Your video prompt is your shot list. Write it as a director, not as a graphic designer.

Platform-Specific Adjustments

Different platforms respond to camera language differently based on training data and fine-tuning:

Veo responds well to detailed environment descriptions paired with simple camera movements. Use long descriptive chains for the scene and one clean camera verb.
Runway responds to concise camera directives and performs strongly on image-to-video (I2V), where you supply a reference frame and so need less scene description.
HeyGen requires no camera language — it is a fixed avatar shot. The prompting is all script and tone, not cinematography.
Kling responds to motion-forward descriptions: focus the prompt on what is moving and how. Its handling of camera language is less developed.
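
The adjustments above could be encoded as a small lookup that tells you which layers to write, and in what order of emphasis, for each platform. This is only one reading of the guidance, sketched under assumptions: the function `layers_for`, the layer names, and the decision to drop the scene layer entirely when a reference frame is supplied are all choices made for this illustration.

```python
def layers_for(platform: str, has_reference_frame: bool = False) -> list[str]:
    """Return the prompt layers to write for a platform, in emphasis order."""
    if platform == "heygen":
        # Fixed avatar shot: no cinematography layers, only script and tone.
        return ["script", "tone"]
    if platform == "kling":
        # Motion-forward: lead with what is moving and how.
        return ["subject_motion", "scene", "camera_movement", "technical"]
    if platform == "runway":
        layers = ["camera_movement", "subject_motion", "technical"]
        if not has_reference_frame:
            # Simplification: with a reference frame the scene layer is omitted
            # here, although a short scene note may still help in practice.
            layers.insert(0, "scene")
        return layers
    if platform == "veo":
        # Detailed environment first, then a simple camera move.
        return ["scene", "subject_motion", "camera_movement", "technical"]
    raise ValueError(f"no guidance recorded for platform: {platform}")


print(layers_for("runway", has_reference_frame=True))
# ['camera_movement', 'subject_motion', 'technical']
```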

Lesson Drill

Take one video topic you plan to generate. Write the prompt in four layers:

Scene: environment, lighting, time of day
Subject motion: what moves and how
Camera movement: shot type and direction
Technical: duration and aspect ratio

Compare the output to a version of the same topic where you omit camera language. Document the quality delta. That gap is what camera language is worth.
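
To keep the comparison fair, the two drill prompts should differ only in the camera layer. A minimal sketch of generating that pair follows; the helper name `drill_pair` and the example layer text are hypothetical, and submitting the prompts to a platform is left out.

```python
def _join(*parts: str) -> str:
    # Normalize each layer to end with exactly one period, then join.
    return " ".join(p.strip().rstrip(".") + "." for p in parts)


def drill_pair(scene: str, subject_motion: str, camera_movement: str, technical: str) -> dict[str, str]:
    """Return the full prompt and a control that omits only the camera layer."""
    return {
        "with_camera": _join(camera_movement, scene, subject_motion, technical),
        "without_camera": _join(scene, subject_motion, technical),
    }


pair = drill_pair(
    scene="Sunlit forest clearing, golden hour, dust particles visible in light beams",
    subject_motion="Leaves fall in slow motion from a dead tree",
    camera_movement="Camera tilts up slowly along the trunk",
    technical="8 seconds, 16:9",
)
for label, prompt in pair.items():
    print(f"{label}: {prompt}")
```

Because everything except the camera clause is identical, any difference you document between the two outputs is more plausibly attributable to camera language than to wording drift.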

Video Prompting — Scene, Motion, Cinematography