Gemini in Production — MCP Servers and API Integration
Getting Gemini working in a notebook is different from running it reliably in production. Rate limits, retry logic, streaming, cost tracking, and the MCP server pattern — this lesson covers the operational layer that makes Gemini pipelines stable at scale.
Getting Gemini working in a Jupyter notebook takes 10 minutes. Getting it working reliably in production takes understanding the operational layer that most tutorials skip: handling rate limits gracefully, tracking costs, streaming responses efficiently, and integrating with your existing tool ecosystem.
This lesson covers that layer: retry patterns, streaming integration, cost tracking, monitoring setup, and the MCP server pattern that makes Gemini capabilities available to Claude Code and other agent systems.
Rate Limits and Quota Management
Gemini API rate limits operate on two axes: requests per minute (RPM) and tokens per minute (TPM). Both matter for production systems.
Rate limit errors return HTTP 429. The correct response is exponential backoff with jitter — not immediate retry, not fixed-interval retry.
```python
import time
import random

import google.generativeai as genai
from google.api_core.exceptions import ResourceExhausted

def generate_with_retry(
    model: genai.GenerativeModel,
    prompt: str,
    max_retries: int = 3,
) -> str:
    """Generate content with exponential backoff on rate limit errors."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except ResourceExhausted:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait_time:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait_time)
    raise RuntimeError("Max retries exceeded")
```
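The same backoff shape generalizes beyond `ResourceExhausted`. A minimal, provider-agnostic sketch (`retry_with_backoff` and its parameters are illustrative helpers, not part of any SDK):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    retryable: tuple[type[Exception], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter spreads out retries from concurrent workers
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

Passing the exception tuple explicitly keeps the helper honest: only errors you know to be transient, such as 429s, should be retried.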
For high-throughput pipelines, implement a token bucket or semaphore to stay under rate limits proactively rather than hitting them and backing off reactively:
```python
import asyncio

class GeminiRateLimiter:
    def __init__(self, requests_per_minute: int = 1800):  # 90% of limit
        # Bound concurrent in-flight requests and pace each one
        self.semaphore = asyncio.Semaphore(requests_per_minute // 60)
        self.interval = 60 / requests_per_minute

    async def generate(self, model, prompt: str) -> str:
        async with self.semaphore:
            response = await model.generate_content_async(prompt)
            await asyncio.sleep(self.interval)
            return response.text
```
Streaming Responses
For user-facing applications, streaming is the difference between a response that appears to take 5 seconds and one that starts showing content in under 1 second. The API supports streaming natively.
```python
# Non-streaming — wait for the entire response
response = model.generate_content("Write a detailed analysis of context windows.")
print(response.text)

# Streaming — yield chunks as they arrive
for chunk in model.generate_content("Write a detailed analysis of context windows.", stream=True):
    if chunk.text:
        print(chunk.text, end="", flush=True)
print()  # newline at end

# Async streaming for production web applications
async def stream_response(prompt: str):
    async for chunk in await model.generate_content_async(prompt, stream=True):
        if chunk.text:
            yield chunk.text
```
Streaming does not change the total token cost or the total response time — it changes the time to first token, which significantly improves perceived responsiveness for end users.
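One way to see the time-to-first-token effect is to instrument any chunk iterator. A small sketch (`measure_stream` is a hypothetical helper, shown here against a simulated stream rather than a live API call):

```python
import time
from typing import Iterable

def measure_stream(chunks: Iterable[str]) -> tuple[float, float, str]:
    """Return (seconds to first chunk, total seconds, assembled text) for any chunk iterator."""
    start = time.time()
    first_at = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            first_at = time.time() - start
        parts.append(chunk)
    return first_at, time.time() - start, "".join(parts)

def fake_stream():
    """Simulate a streamed response: first chunk arrives quickly, the rest trickle in."""
    yield "Context windows "
    time.sleep(0.05)
    yield "determine how much a model can attend to."

ttft, total, text = measure_stream(fake_stream())
```

The same helper works on a real streaming call if you feed it the `chunk.text` values; the gap between `ttft` and `total` is what the user stops noticing.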
Cost Tracking and Token Monitoring
Every Gemini API response includes token usage in response.usage_metadata. Track it from day one.
```python
response = model.generate_content(prompt)

# Token usage is always available in the response
usage = response.usage_metadata
input_tokens = usage.prompt_token_count
output_tokens = usage.candidates_token_count
cached_tokens = usage.cached_content_token_count  # 0 if no caching

# Calculate cost (Flash pricing as of early 2026)
INPUT_COST_PER_M = 0.075     # $0.075 per million input tokens
OUTPUT_COST_PER_M = 0.30     # $0.30 per million output tokens
CACHED_COST_PER_M = 0.01875  # $0.01875 per million cached tokens

billable_input = input_tokens - cached_tokens
cost = (
    (billable_input / 1_000_000) * INPUT_COST_PER_M
    + (cached_tokens / 1_000_000) * CACHED_COST_PER_M
    + (output_tokens / 1_000_000) * OUTPUT_COST_PER_M
)

print(f"Input tokens: {input_tokens} (cached: {cached_tokens})")
print(f"Output tokens: {output_tokens}")
print(f"Estimated cost: ${cost:.6f}")
```
Build cost tracking into every production call and aggregate it daily. Unexpected cost spikes are often the first signal of a prompt engineering regression or an unintended context size increase.
Monitoring Setup
A minimal production monitoring setup for Gemini:
```python
import time
from dataclasses import dataclass
from typing import Optional

import google.generativeai as genai

@dataclass
class GeminiCallMetrics:
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    cost_usd: float
    error: Optional[str] = None

class MonitoredGeminiClient:
    # Flash pricing per million tokens (as of early 2026)
    INPUT_COST_PER_M = 0.075
    OUTPUT_COST_PER_M = 0.30
    CACHED_COST_PER_M = 0.01875

    def __init__(self, model_name: str, alert_on_cost_usd: float = 0.10):
        self.model = genai.GenerativeModel(model_name)
        self.model_name = model_name
        self.alert_threshold = alert_on_cost_usd
        self.total_cost = 0.0

    def _calculate_cost(self, usage) -> float:
        billable_input = usage.prompt_token_count - usage.cached_content_token_count
        return (
            (billable_input / 1_000_000) * self.INPUT_COST_PER_M
            + (usage.cached_content_token_count / 1_000_000) * self.CACHED_COST_PER_M
            + (usage.candidates_token_count / 1_000_000) * self.OUTPUT_COST_PER_M
        )

    def generate(self, prompt: str) -> tuple[str, GeminiCallMetrics]:
        start = time.time()
        try:
            response = self.model.generate_content(prompt)
            latency = (time.time() - start) * 1000
            usage = response.usage_metadata
            cost = self._calculate_cost(usage)
            self.total_cost += cost
            if cost > self.alert_threshold:
                print(f"COST ALERT: Single call cost ${cost:.4f} exceeds threshold")
            metrics = GeminiCallMetrics(
                model=self.model_name,
                input_tokens=usage.prompt_token_count,
                output_tokens=usage.candidates_token_count,
                cached_tokens=usage.cached_content_token_count,
                latency_ms=latency,
                cost_usd=cost,
            )
            return response.text, metrics
        except Exception as e:
            latency = (time.time() - start) * 1000
            # Record the failed call before re-raising so error rates show up in dashboards
            metrics = GeminiCallMetrics(
                model=self.model_name,
                input_tokens=0, output_tokens=0, cached_tokens=0,
                latency_ms=latency, cost_usd=0.0,
                error=str(e),
            )
            print(f"ERROR after {latency:.0f}ms: {metrics.error}")
            raise
```
The MCP Server Pattern
Model Context Protocol (MCP) is the standard for exposing AI tool capabilities to agent systems. Wrapping Gemini in an MCP server makes its capabilities — text generation, image analysis, multimodal processing — available as tools to Claude Code, OpenClaw instances, and any other MCP-compatible agent.
The mcp-image MCP server used in production systems (like the Knox ecosystem) is exactly this pattern: Gemini's Imagen API wrapped as an MCP tool, callable by Claude Code without requiring Claude to handle the API integration directly.
```python
# Minimal MCP server wrapping Gemini
import google.generativeai as genai
import mcp.types as types
from mcp.server import Server

server = Server("gemini-mcp")

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="gemini_generate",
            description="Generate text using Gemini 2.0 Flash",
            inputSchema={
                "type": "object",
                "properties": {
                    "prompt": {"type": "string", "description": "The prompt to send to Gemini"},
                    "model": {"type": "string", "default": "gemini-2.0-flash"}
                },
                "required": ["prompt"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "gemini_generate":
        model = genai.GenerativeModel(arguments.get("model", "gemini-2.0-flash"))
        response = model.generate_content(arguments["prompt"])
        return [types.TextContent(type="text", text=response.text)]
    raise ValueError(f"Unknown tool: {name}")
```
This pattern enables routing: Claude Code handles reasoning and orchestration, Gemini handles multimodal processing and image generation, each doing what it does best — coordinated through MCP without direct integration between the two systems.
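To make such a server callable from a client, it is registered in the client's MCP configuration. The exact file and keys vary by client; a typical shape looks like the following (the server filename and paths are hypothetical):

```json
{
  "mcpServers": {
    "gemini-mcp": {
      "command": "python",
      "args": ["gemini_mcp_server.py"],
      "env": { "GOOGLE_API_KEY": "<your-key>" }
    }
  }
}
```

Keeping the API key in the server's environment, rather than in the agent's context, is part of the point: the orchestrating agent never touches Gemini credentials.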
Lesson 94 Drill
Build a minimal production-grade Gemini client:
- Implement retry logic with exponential backoff for 429 errors.
- Add cost tracking that prints the estimated cost after each call.
- Enable streaming and verify that output appears progressively.
- Run 10 calls against your client and calculate the total session cost.
Bottom Line
Production Gemini integration requires the operational layer that tutorials skip: retry logic, streaming, cost tracking, and monitoring. None of these are optional if you want a system that behaves predictably at scale. The MCP server pattern extends this further — it makes Gemini's capabilities available as composable tools to any agent system, enabling provider-agnostic orchestration that adapts as the AI landscape evolves.