Google Veo — API Integration and Production Use
Veo 2 generates 1080p video with industry-leading motion coherence — and it ships with a full API. Here's how to integrate it into production pipelines, handle the async job lifecycle, and avoid the cost traps.
Most AI video platforms generate content that looks like AI video. Veo 2 generates content that looks like it was shot on a camera.
That distinction matters for production pipelines. B-roll that reads as AI-generated breaks the credibility of adjacent real footage. Environmental establishing shots that have the flat, slightly-off quality of first-generation AI video pull the viewer out of the content. Veo 2 closed that gap. In specific use cases — cityscapes, nature, physical world environments — the output reaches the threshold where professionals can't reliably distinguish it from stock footage.
The other thing Veo 2 has that most of its competitors don't: a complete, production-grade API via Vertex AI.
What Veo 2 Does Well
Motion coherence is Veo 2's primary differentiator. Physics-based motion — water flowing, crowds moving, vehicles driving, fabric in wind — behaves correctly across the 8-second clip. Competing platforms often produce clips where motion looks correct for 2-3 seconds and then drifts into artifacts. Veo 2 holds coherent motion through the full generation.
Environmental realism is the second strength. Nature footage, urban environments, lighting conditions — all reach production quality. The training-data composition shows: real-world physical environments are where Veo 2 returns its highest quality.
Resolution clears the production bar at 1080p. Consumer AI video tools often top out at 720p or lower. Production integration requires 1080p to mix cleanly with real camera footage.
API depth covers the full async job lifecycle — submit, poll, retrieve, download. There is no manual step required. A pipeline can generate, validate, store, and deliver Veo output without human intervention.
The Async Job Lifecycle
Video generation is not a synchronous API call. You do not submit a request and receive a video in the response. You submit a request, receive a job ID, and then poll that job ID until the generation completes.
This matters architecturally. A naive implementation that blocks on the generation call will hang your pipeline for up to two minutes per clip. A production implementation handles this correctly:
```python
import asyncio

import httpx


async def generate_video(prompt: str, client: httpx.AsyncClient) -> str:
    # Submit the generation job. The client is assumed to already carry an
    # Authorization: Bearer header; replace {project} with your GCP project ID.
    response = await client.post(
        "https://us-central1-aiplatform.googleapis.com/v1/projects/{project}"
        "/locations/us-central1/publishers/google/models/veo-2:generateVideo",
        json={"prompt": prompt, "model": "veo-2", "duration": 8},
    )
    job_id = response.json()["name"]

    # Poll at a fixed 5-second interval until the job reports done
    for attempt in range(30):  # max 150 seconds
        await asyncio.sleep(5)
        status = await client.get(
            f"https://us-central1-aiplatform.googleapis.com/v1/{job_id}"
        )
        if status.json().get("done", False):
            return status.json()["response"]["videoUri"]

    raise TimeoutError(f"Veo job {job_id} did not complete in time")
```
Three rules for the polling loop:
- Poll no more than once every 5 seconds. More frequent polling wastes API quota.
- Set a timeout ceiling. Jobs that exceed 3 minutes are either failed or hung — don't block indefinitely.
- Never block the main thread. Run generation jobs as async tasks that resolve when the video is ready.
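The third rule in practice: run each generation as its own asyncio task, so the event loop stays free while jobs sleep between polls. A minimal sketch — the `generate_video` stub here stands in for the real submit-and-poll coroutine and does not call Vertex AI:

```python
import asyncio


async def generate_video(prompt: str) -> str:
    # Stand-in for the real submit-and-poll coroutine; simulates a job
    # that resolves after a short delay instead of hitting the API.
    await asyncio.sleep(0.01)
    return f"gs://veo-output/{abs(hash(prompt)) % 10000}.mp4"


async def generate_batch(prompts: list[str]) -> list[str]:
    # One task per clip: the batch completes in roughly the time of the
    # slowest job, not the sum of all jobs.
    tasks = [asyncio.create_task(generate_video(p)) for p in prompts]
    return await asyncio.gather(*tasks)


uris = asyncio.run(
    generate_batch(["fog over a mountain lake", "city street at dusk"])
)
```

The same structure holds with the real coroutine swapped in; only the body of `generate_video` changes.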
The Veo MCP Pattern
Knox's ecosystem uses a Veo MCP server (~/Documents/Dev/veo-mcp/) that wraps the Vertex AI API and exposes generation, status polling, and retrieval as MCP tools. This pattern is worth replicating:
- Separation of concerns: the MCP server handles auth, polling, retry, and storage. The calling agent specifies only the prompt and desired output.
- Cost tracking per call: every generation is logged with duration, cost, and job ID for budget visibility.
- Webhook or callback support: rather than blocking on poll, the MCP server can notify callers when the job resolves.
This pattern lets any agent in the ecosystem request Veo generation without implementing the async lifecycle itself. One MCP server, many callers.
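A minimal sketch of that separation of concerns, with the Vertex AI call stubbed out. `VeoGateway` and `GenerationRecord` are illustrative names, not the actual MCP server's API:

```python
from dataclasses import dataclass, field

PRICE_PER_SECOND = 0.35  # assumed Veo 2 rate, from the billing section


@dataclass
class GenerationRecord:
    job_id: str
    prompt: str
    duration_s: int
    cost_usd: float


@dataclass
class VeoGateway:
    # Single point of entry, in the spirit of the MCP server: callers pass
    # a prompt; auth, polling, retry, and cost logging live here.
    log: list = field(default_factory=list)

    def generate(self, prompt: str, duration_s: int = 8) -> GenerationRecord:
        # Real version would submit to Vertex AI and poll; stubbed here.
        job_id = f"job-{len(self.log) + 1}"
        record = GenerationRecord(
            job_id, prompt, duration_s,
            round(duration_s * PRICE_PER_SECOND, 2),
        )
        self.log.append(record)  # every call is logged for budget visibility
        return record


gw = VeoGateway()
rec = gw.generate("dense morning fog over a mountain lake")
total_spend = sum(r.cost_usd for r in gw.log)
```

The point of the shape: callers never see job IDs, tokens, or poll intervals, and the spend ledger accumulates in one place.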
Access and Authentication
Veo 2 access comes through two paths:
Google AI Studio — the consumer-facing interface for prototyping. Useful for testing prompts and validating output quality. Not production-grade for pipeline integration.
Vertex AI — the enterprise API endpoint for production integration. Requires a Google Cloud project, enabled Vertex AI APIs, and a service account with the appropriate IAM roles. Authentication uses Google Cloud OAuth 2.0 service account credentials.
```python
from google.auth import default
from google.auth.transport.requests import Request

credentials, project = default(scopes=["https://www.googleapis.com/auth/cloud-platform"])
credentials.refresh(Request())
auth_token = credentials.token
```
The billing model is per-second of generated video, not per call. An 8-second generation at $0.35/second costs $2.80. At scale — 10 clips per day, 30 days — that's $840/month. Budget discipline matters here.
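The arithmetic as a small helper, assuming the $0.35/second rate quoted above:

```python
def monthly_cost(clip_seconds: int, clips_per_day: int,
                 days: int = 30, rate_per_second: float = 0.35) -> float:
    # Per-second billing: cost scales with generated footage, not call count.
    return clip_seconds * rate_per_second * clips_per_day * days


projected = monthly_cost(8, 10)  # 8 s/clip x 10 clips/day x 30 days, approx $840
```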
Best Prompts for Veo 2
Veo 2 responds best to prompts that front-load the environment and lighting description, followed by a simple, clean camera direction. Complex compound sentences with multiple subjects degrade coherence.
High-performing pattern:
"[Environment, lighting, time of day]. [Single subject or motion element]. [Camera movement]. Cinematic, [duration]s, [aspect ratio]."
Example:
"Dense morning fog over a mountain lake, golden hour light beginning to break through the treeline. A single wooden dock extends into the mist. Slow pan left, gradually revealing the full length of the lake. Cinematic, 8 seconds, 16:9."
What Veo 2 handles well:
- Physical world environments with weather and lighting variation
- Slow, deliberate camera movements (dolly, pan, tilt)
- Wide establishing shots with environmental texture
What Veo 2 handles less well:
- Precise character action (specific facial expressions, hand gestures)
- Rapid cuts or multi-scene narratives in a single 8-second clip
- Text in the frame (legibility is unreliable)
Production Integration Checklist
Before you wire Veo 2 into a production pipeline:
- Vertex AI API enabled in your Google Cloud project
- Service account with the Vertex AI User role
- Budget alert configured in Cloud Console, set at 120% of expected monthly spend
- Async job handler with polling, timeout, and retry logic
- Quality validation after each generation: resolution check, duration check, file size sanity
- Storage pipeline — Veo returns a temporary URI. Download and store to your own R2/S3 immediately; the URI expires.
- Cost logging — record duration generated and cost per job_id to a database
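The quality-validation item can be a pure function over clip metadata pulled from the downloaded file. Field names and thresholds here are illustrative, not the Veo response schema:

```python
from dataclasses import dataclass


@dataclass
class ClipMetadata:
    width: int
    height: int
    duration_s: float
    size_bytes: int


def validate_clip(meta: ClipMetadata,
                  expected_duration_s: float = 8.0) -> list[str]:
    # Returns a list of failures; an empty list means the clip passes.
    failures = []
    if (meta.width, meta.height) != (1920, 1080):
        failures.append(f"resolution {meta.width}x{meta.height}, expected 1920x1080")
    if abs(meta.duration_s - expected_duration_s) > 0.5:
        failures.append(f"duration {meta.duration_s}s, expected ~{expected_duration_s}s")
    if not (500_000 < meta.size_bytes < 500_000_000):
        failures.append(f"file size {meta.size_bytes} bytes outside sane range")
    return failures


ok = validate_clip(ClipMetadata(1920, 1080, 8.0, 12_000_000))
bad = validate_clip(ClipMetadata(1280, 720, 3.0, 1_000))
```

Run this between download and storage: a clip that fails validation should never reach the delivery bucket.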
Lesson 106 Drill
Set up a Vertex AI project and generate your first Veo 2 clip via the API (not the UI). The test:
- Submit a generation job and capture the job_id
- Implement a polling loop that checks every 5 seconds
- Download the video when the job completes
- Log: job_id, prompt, duration, cost, generation time in seconds
The goal is not a great video — it's a working async pipeline. The prompt can be simple. The architecture matters.