ASK KNOX
LESSON 91

Multimodal Inputs — Images, Audio, Video, and Documents

Gemini's native multimodality is its most distinctive capability. Send images, audio, video, and PDFs through one API. This lesson covers exactly how each input type works, the code patterns, and the real-world use cases that justify the architecture.

10 min read · Building with Gemini

Most AI models are built around text and treat other modalities as bolt-on converters. Gemini treats all modalities as first-class inputs processed through the same unified architecture. Understanding how this works in practice — not just in theory — is what enables you to build the data pipelines that were previously impossible or required substantial specialized infrastructure.

This lesson covers the mechanics of every input type: how to send it, what Gemini can do with it, and what the real-world use cases look like.

Image Inputs

Images can be sent to Gemini in two ways: as base64-encoded inline data, or as a file uploaded through the File API and referenced in the request.

Method 1: Base64 inline (recommended for private images)

import google.generativeai as genai
import base64

genai.configure(api_key="YOUR_API_KEY")  # or set the GOOGLE_API_KEY env var

with open("invoice.jpg", "rb") as f:
    image_data = base64.standard_b64encode(f.read()).decode("utf-8")

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "image/jpeg",
            "data": image_data
        }
    },
    "Extract the vendor name, invoice number, date, and total amount. Return as JSON."
])

Method 2: File API reference (for larger images, or reuse across requests)

# Upload once, then pass the returned file object in any request
image_file = genai.upload_file("chart.png", mime_type="image/png")

response = model.generate_content([
    image_file,
    "Describe the trend shown in this chart and identify the peak value."
])

Note that the Gemini API does not fetch arbitrary public image URLs for you; to reference an image by URI, upload it through the File API first.

Practical use cases for image inputs: invoice and receipt extraction, chart and diagram interpretation, product defect detection, screenshot analysis for UI testing, document digitization from scanned images, and visual quality control workflows.

The critical point: Gemini reasons over the actual image, not a description of it. Ask it to read a handwritten note, compare two product photos, or identify the anomalous bar in a bar chart — it processes visual information directly.

Audio Inputs

Audio input enables automatic transcription, speaker diarization, sentiment analysis, language identification, and reasoning over spoken content.

import google.generativeai as genai

# Upload audio file via File API (for files > a few MB)
audio_file = genai.upload_file("meeting_recording.mp3", mime_type="audio/mpeg")

model = genai.GenerativeModel("gemini-2.0-flash")
response = model.generate_content([
    audio_file,
    """
    Transcribe this meeting recording. Then:
    1. Identify all action items with their assigned owners
    2. List any decisions made
    3. Summarize the key discussion points in three bullet points
    """
])

Supported audio formats include MP3, WAV, FLAC, AAC, OGG, and AIFF. The File API handles files up to 2GB. For shorter audio clips under a few megabytes, you can use base64 inline encoding instead of the File API.
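The inline-versus-File-API decision can be encoded as a small helper. The threshold below is an assumption (inline payloads travel in the request body, so anything beyond a few MB is safer via the File API), and the function name is illustrative:

```python
import os

# Assumed cutoff: inline data rides inside the request itself, so keep it small.
INLINE_LIMIT_BYTES = 4 * 1024 * 1024

def choose_transport(path: str, limit: int = INLINE_LIMIT_BYTES) -> str:
    """Return 'inline' for small files and 'file_api' for everything else."""
    return "inline" if os.path.getsize(path) <= limit else "file_api"
```

A dispatcher can then call genai.upload_file for "file_api" and build an inline_data part for "inline" without duplicating the size check at every call site.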

Practical use cases: meeting transcription and summarization, customer call analysis, podcast content extraction, voice memo processing, language identification and translation pipelines, and any workflow where spoken content needs to become structured data.

Video Inputs

Video is the most computationally intensive input type, and Gemini handles it through the File API, which processes the video and makes it available for reasoning.

import time
import google.generativeai as genai

model = genai.GenerativeModel("gemini-2.0-flash")

# Upload video
video_file = genai.upload_file("product_demo.mp4", mime_type="video/mp4")

# Wait for processing to complete
while video_file.state.name == "PROCESSING":
    time.sleep(5)
    video_file = genai.get_file(video_file.name)

if video_file.state.name == "FAILED":
    raise ValueError("Video processing failed")

# Now query the processed video
response = model.generate_content([
    video_file,
    "Summarize what happens in this product demo. Identify each feature shown and the timestamp when it appears."
])

Gemini samples video frames (by default about one per second) and processes the audio track simultaneously, enabling temporal reasoning — it can answer questions about what happens at specific timestamps, identify changes over time, and correlate visual events with audio commentary.
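Timestamp questions work best when the prompt uses the MM:SS format. A tiny hypothetical helper for scoping questions to a window of the video (both function names are illustrative):

```python
def to_timestamp(seconds: int) -> str:
    """Format a second offset as MM:SS for use in video prompts."""
    return f"{seconds // 60:02d}:{seconds % 60:02d}"

def temporal_prompt(question: str, start_s: int, end_s: int) -> str:
    """Scope a question to a time window, e.g. 'Between 01:15 and 02:30, ...'."""
    return f"Between {to_timestamp(start_s)} and {to_timestamp(end_s)}, {question}"
```

The resulting string goes into generate_content alongside the processed video file.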

File API details: uploaded files persist for 48 hours by default. If you need to process the same video multiple times across sessions, upload once and store the file URI for reuse within the 48-hour window.
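The 48-hour window can be managed with a small cache that remembers when each file was uploaded and re-uploads only after expiry. This is a sketch, not SDK functionality: the upload function and clock are injected so the logic is testable, and the TTL constant mirrors the default lifetime described above.

```python
import time

FILE_TTL_SECONDS = 48 * 60 * 60  # File API default lifetime

class FileUriCache:
    """Remember uploaded-file URIs and re-upload after they expire."""

    def __init__(self, upload_fn, ttl: float = FILE_TTL_SECONDS, clock=time.time):
        self._upload = upload_fn      # e.g. lambda p: genai.upload_file(p).uri
        self._ttl = ttl
        self._clock = clock
        self._entries: dict[str, tuple[str, float]] = {}

    def get(self, path: str) -> str:
        entry = self._entries.get(path)
        now = self._clock()
        if entry is None or now - entry[1] >= self._ttl:
            # Missing or expired: upload (again) and record the time
            self._entries[path] = (self._upload(path), now)
        return self._entries[path][0]
```

In practice you would also persist the entries to disk so the cache survives process restarts within the 48-hour window.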

Practical use cases: video content summarization, surveillance clip analysis, training video indexing, tutorial extraction (identify steps shown at each timestamp), marketing video analysis, and sports footage breakdown.

Document and PDF Inputs

PDF and document processing works through either inline data or the File API, with Gemini reasoning over both the text content and visual elements (charts, tables, diagrams) simultaneously.

# Process a PDF — Gemini handles both text and visual content
with open("financial_report.pdf", "rb") as f:
    pdf_data = base64.standard_b64encode(f.read()).decode("utf-8")

response = model.generate_content([
    {
        "inline_data": {
            "mime_type": "application/pdf",
            "data": pdf_data
        }
    },
    "Extract all financial figures from this report. For each chart, describe the trend and identify the key data points."
])

This is a meaningful distinction from text extraction pipelines: Gemini processes the PDF as a multimodal document, not just as text. It can reason about the visual layout, interpret embedded charts, and understand the relationship between text and diagrams without requiring a separate OCR or chart parsing step.

Combining Modalities

The real power of native multimodality is combining input types in a single call. Send an audio recording of a meeting along with a PDF of the presentation being discussed, and ask Gemini to cross-reference the speaker's comments against the slide content. Send a product image alongside a returns policy document and ask whether the return is eligible.

audio_file = genai.upload_file("meeting_recording.mp3", mime_type="audio/mpeg")
pdf_file = genai.upload_file("slides.pdf", mime_type="application/pdf")

response = model.generate_content([
    audio_file,        # meeting recording
    pdf_file,          # slide deck from the meeting
    "Identify which slides were discussed during the meeting and what was said about each one."
])

This cross-modal reasoning — something no text-only model can do — is the capability that enables pipelines that were previously impossible without significant specialized infrastructure.

Limits and Gotchas

  • File API files expire after 48 hours. Store URIs but re-upload when needed for long-lived pipelines.
  • Video processing takes time. The polling pattern (check state.name == "PROCESSING") is not optional — you must wait before querying.
  • Token counting includes multimodal content. Video costs on the order of 300 tokens per second of footage (roughly 258 tokens per sampled frame at 1 fps, plus about 32 tokens per second of audio), so a one-minute clip runs to roughly 18,000 tokens. Plan your context window budget accordingly.
  • Not all models support all modalities. Flash and Pro support all input types. Nano (on-device) supports text and images only.
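Video token budgeting can be sketched as a rough estimator. The per-frame (~258 tokens at 1 fps) and audio (~32 tokens/sec) figures are approximations drawn from Gemini's published token accounting and may change between model versions:

```python
TOKENS_PER_FRAME = 258        # approximate cost of one sampled frame
FRAMES_PER_SECOND = 1         # default sampling rate
AUDIO_TOKENS_PER_SECOND = 32  # approximate cost of the audio track

def estimate_video_tokens(duration_s: int, include_audio: bool = True) -> int:
    """Rough token budget for a video of the given duration in seconds."""
    tokens = duration_s * FRAMES_PER_SECOND * TOKENS_PER_FRAME
    if include_audio:
        tokens += duration_s * AUDIO_TOKENS_PER_SECOND
    return tokens
```

Comparing the estimate against the model's context window before upload is cheaper than discovering the overflow after a long File API processing wait.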

Lesson 91 Drill

Pick one real task from your current workflows that involves non-text input — an image, audio clip, video, or document:

  1. Identify which input method applies: base64 inline or File API.
  2. Write the Python code to send that input to Gemini 2.0 Flash with a specific extraction or analysis prompt.
  3. Run it. Evaluate the output quality.
  4. Document what surprised you — both limitations and unexpected capabilities.

Bottom Line

Gemini's native multimodality is not a checkbox feature — it is a fundamentally different capability architecture that enables task categories that text-only models cannot handle. Images via inline data or the File API, audio via File API or inline, video via File API with a processing wait, PDFs combining text and visual reasoning. Each modality follows the same API pattern with minor variations. Master the mechanics here and you have the foundation for building real multimodal data pipelines.