Gemini in Production — MCP Servers and API Integration
Getting Gemini working in a notebook is different from running it reliably in production. Rate limits, retry logic, streaming, cost tracking, and the MCP server pattern — this lesson covers the operational layer that makes Gemini pipelines stable at scale.
Getting Gemini working in a Jupyter notebook takes 10 minutes. Getting it working reliably in production takes understanding the operational layer that most tutorials skip: handling rate limits gracefully, tracking costs, streaming responses efficiently, and integrating with your existing tool ecosystem.
This lesson covers that layer: retry patterns, streaming integration, cost tracking, monitoring setup, and the MCP server pattern that makes Gemini capabilities available to Claude Code and other agent systems.
Rate Limits and Quota Management
Gemini API rate limits operate on two axes: requests per minute (RPM) and tokens per minute (TPM). Both matter for production systems.
Rate limit errors return HTTP 429. The correct response is exponential backoff with jitter — not immediate retry, not fixed-interval retry.
```python
import time
import random

import google.generativeai as genai
from google.api_core.exceptions import ResourceExhausted

def generate_with_retry(
    model: genai.GenerativeModel,
    prompt: str,
    max_retries: int = 3,
) -> str:
    """Generate content with exponential backoff on rate limit errors."""
    for attempt in range(max_retries):
        try:
            response = model.generate_content(prompt)
            return response.text
        except ResourceExhausted:
            if attempt == max_retries - 1:
                raise
            # Exponential backoff with jitter
            wait_time = (2 ** attempt) + random.uniform(0, 1)
            print(f"Rate limited. Retrying in {wait_time:.1f}s (attempt {attempt + 1}/{max_retries})")
            time.sleep(wait_time)
    raise RuntimeError("Max retries exceeded")
```
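The same backoff shape generalizes beyond `ResourceExhausted`. A minimal, provider-agnostic sketch (`retry_with_backoff` and its parameters are illustrative helpers, not part of any SDK):

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")

def retry_with_backoff(
    fn: Callable[[], T],
    retryable: tuple[type[Exception], ...],
    max_retries: int = 3,
    base_delay: float = 1.0,
) -> T:
    """Call fn, retrying on the given exception types with exponential backoff plus jitter."""
    for attempt in range(max_retries):
        try:
            return fn()
        except retryable:
            if attempt == max_retries - 1:
                raise
            # Delay doubles each attempt; jitter spreads out retries from concurrent workers
            time.sleep(base_delay * (2 ** attempt) + random.uniform(0, base_delay))
    raise RuntimeError("unreachable")
```

Passing the exception tuple explicitly keeps the helper honest: only errors you know to be transient, such as 429s, should be retried.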
For high-throughput pipelines, implement a token bucket or semaphore to stay under rate limits proactively rather than hitting them and backing off reactively:
```python
import asyncio

class GeminiRateLimiter:
    def __init__(self, requests_per_minute: int = 1800):  # 90% of limit
        # Bound concurrent in-flight requests and pace each one
        self.semaphore = asyncio.Semaphore(requests_per_minute // 60)
        self.interval = 60 / requests_per_minute

    async def generate(self, model, prompt: str) -> str:
        async with self.semaphore:
            response = await model.generate_content_async(prompt)
            await asyncio.sleep(self.interval)
            return response.text
```
Streaming Responses
For user-facing applications, streaming is the difference between a response that appears to take 5 seconds and one that starts showing content in under 1 second. The API supports streaming natively.
```python
# Non-streaming — wait for the entire response
response = model.generate_content("Write a detailed analysis of context windows.")
print(response.text)

# Streaming — yield chunks as they arrive
for chunk in model.generate_content("Write a detailed analysis of context windows.", stream=True):
    if chunk.text:
        print(chunk.text, end="", flush=True)
print()  # newline at end

# Async streaming for production web applications
async def stream_response(prompt: str):
    async for chunk in await model.generate_content_async(prompt, stream=True):
        if chunk.text:
            yield chunk.text
```
Streaming does not change the total token cost or the total response time — it changes the time to first token, which significantly improves perceived responsiveness for end users.
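One way to see the time-to-first-token effect is to instrument any chunk iterator. A small sketch (`measure_stream` is a hypothetical helper, shown here against a simulated stream rather than a live API call):

```python
import time
from typing import Iterable

def measure_stream(chunks: Iterable[str]) -> tuple[float, float, str]:
    """Return (seconds to first chunk, total seconds, assembled text) for any chunk iterator."""
    start = time.time()
    first_at = None
    parts = []
    for chunk in chunks:
        if first_at is None:
            first_at = time.time() - start
        parts.append(chunk)
    return first_at, time.time() - start, "".join(parts)

def fake_stream():
    """Simulate a streamed response: first chunk arrives quickly, the rest trickle in."""
    yield "Context windows "
    time.sleep(0.05)
    yield "determine how much a model can attend to."

ttft, total, text = measure_stream(fake_stream())
```

The same helper works on a real streaming call if you feed it the `chunk.text` values; the gap between `ttft` and `total` is what the user stops noticing.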
Cost Tracking and Token Monitoring
Every Gemini API response includes token usage in response.usage_metadata. Track it from day one.
```python
response = model.generate_content(prompt)

# Token usage is always available in the response
usage = response.usage_metadata
input_tokens = usage.prompt_token_count
output_tokens = usage.candidates_token_count
cached_tokens = usage.cached_content_token_count  # 0 if no caching

# Calculate cost (Flash pricing as of early 2026)
INPUT_COST_PER_M = 0.075     # $0.075 per million input tokens
OUTPUT_COST_PER_M = 0.30     # $0.30 per million output tokens
CACHED_COST_PER_M = 0.01875  # $0.01875 per million cached tokens

billable_input = input_tokens - cached_tokens
cost = (
    (billable_input / 1_000_000) * INPUT_COST_PER_M
    + (cached_tokens / 1_000_000) * CACHED_COST_PER_M
    + (output_tokens / 1_000_000) * OUTPUT_COST_PER_M
)

print(f"Input tokens: {input_tokens} (cached: {cached_tokens})")
print(f"Output tokens: {output_tokens}")
print(f"Estimated cost: ${cost:.6f}")
```
Build cost tracking into every production call and aggregate it daily. Unexpected cost spikes are often the first signal of a prompt engineering regression or an unintended context size increase.
Monitoring Setup
A minimal production monitoring setup for Gemini:
```python
import time
from dataclasses import dataclass
from typing import Optional

import google.generativeai as genai

@dataclass
class GeminiCallMetrics:
    model: str
    input_tokens: int
    output_tokens: int
    cached_tokens: int
    latency_ms: float
    cost_usd: float
    error: Optional[str] = None

class MonitoredGeminiClient:
    # Flash pricing per million tokens (as of early 2026)
    INPUT_COST_PER_M = 0.075
    OUTPUT_COST_PER_M = 0.30
    CACHED_COST_PER_M = 0.01875

    def __init__(self, model_name: str, alert_on_cost_usd: float = 0.10):
        self.model = genai.GenerativeModel(model_name)
        self.model_name = model_name
        self.alert_threshold = alert_on_cost_usd
        self.total_cost = 0.0

    def _calculate_cost(self, usage) -> float:
        billable_input = usage.prompt_token_count - usage.cached_content_token_count
        return (
            (billable_input / 1_000_000) * self.INPUT_COST_PER_M
            + (usage.cached_content_token_count / 1_000_000) * self.CACHED_COST_PER_M
            + (usage.candidates_token_count / 1_000_000) * self.OUTPUT_COST_PER_M
        )

    def generate(self, prompt: str) -> tuple[str, GeminiCallMetrics]:
        start = time.time()
        try:
            response = self.model.generate_content(prompt)
            latency = (time.time() - start) * 1000
            usage = response.usage_metadata
            cost = self._calculate_cost(usage)
            self.total_cost += cost
            if cost > self.alert_threshold:
                print(f"COST ALERT: Single call cost ${cost:.4f} exceeds threshold")
            metrics = GeminiCallMetrics(
                model=self.model_name,
                input_tokens=usage.prompt_token_count,
                output_tokens=usage.candidates_token_count,
                cached_tokens=usage.cached_content_token_count,
                latency_ms=latency,
                cost_usd=cost,
            )
            return response.text, metrics
        except Exception as e:
            latency = (time.time() - start) * 1000
            # Record the failed call before re-raising so error rates show up in dashboards
            metrics = GeminiCallMetrics(
                model=self.model_name,
                input_tokens=0, output_tokens=0, cached_tokens=0,
                latency_ms=latency, cost_usd=0.0,
                error=str(e),
            )
            print(f"ERROR after {latency:.0f}ms: {metrics.error}")
            raise
```
The MCP Server Pattern
Model Context Protocol (MCP) is the standard for exposing AI tool capabilities to agent systems. Wrapping Gemini in an MCP server makes its capabilities — text generation, image analysis, multimodal processing — available as tools to Claude Code, OpenClaw instances, and any other MCP-compatible agent.
The mcp-image MCP server used in production systems (like the Knox ecosystem) is exactly this pattern: Gemini's Imagen API wrapped as an MCP tool, callable by Claude Code without requiring Claude to handle the API integration directly.
```python
# Minimal MCP server wrapping Gemini
import google.generativeai as genai
import mcp.types as types
from mcp.server import Server

server = Server("gemini-mcp")

@server.list_tools()
async def list_tools() -> list[types.Tool]:
    return [
        types.Tool(
            name="gemini_generate",
            description="Generate text using Gemini 2.0 Flash",
            inputSchema={
                "type": "object",
                "properties": {
                    "prompt": {"type": "string", "description": "The prompt to send to Gemini"},
                    "model": {"type": "string", "default": "gemini-2.0-flash"}
                },
                "required": ["prompt"]
            }
        )
    ]

@server.call_tool()
async def call_tool(name: str, arguments: dict) -> list[types.TextContent]:
    if name == "gemini_generate":
        model = genai.GenerativeModel(arguments.get("model", "gemini-2.0-flash"))
        response = model.generate_content(arguments["prompt"])
        return [types.TextContent(type="text", text=response.text)]
    raise ValueError(f"Unknown tool: {name}")
```
This pattern enables routing: Claude Code handles reasoning and orchestration, Gemini handles multimodal processing and image generation, each doing what it does best — coordinated through MCP without direct integration between the two systems.
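To make such a server callable from a client, it is registered in the client's MCP configuration. The exact file and keys vary by client; a typical shape looks like the following (the server filename and paths are hypothetical):

```json
{
  "mcpServers": {
    "gemini-mcp": {
      "command": "python",
      "args": ["gemini_mcp_server.py"],
      "env": { "GOOGLE_API_KEY": "<your-key>" }
    }
  }
}
```

Keeping the API key in the server's environment, rather than in the agent's context, is part of the point: the orchestrating agent never touches Gemini credentials.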
Lesson 94 Drill
Build a minimal production-grade Gemini client:
- Implement retry logic with exponential backoff for 429 errors.
- Add cost tracking that prints the estimated cost after each call.
- Enable streaming and verify that output appears progressively.
- Run 10 calls against your client and calculate the total session cost.
Bottom Line
Production Gemini integration requires the operational layer that tutorials skip: retry logic, streaming, cost tracking, and monitoring. None of these are optional if you want a system that behaves predictably at scale. The MCP server pattern extends this further — it makes Gemini's capabilities available as composable tools to any agent system, enabling provider-agnostic orchestration that adapts as the AI landscape evolves.