Building a Production OpenAI Integration
A working prototype is not a production system. Retry logic, rate limit handling, cost tracking, streaming, and a proper error taxonomy are what separate a demo from a system you can stake a product on.
The prototype calls the API. The production system calls it reliably, handles every failure mode, tracks cost, and does not take your application down when OpenAI has an incident.
This lesson builds the production-grade wrapper — the patterns that every serious OpenAI integration needs before going live.
The Error Taxonomy
Before writing retry logic, you need to understand which errors are retryable and which are fatal.
- **429 Rate Limit — Retryable.** You have exceeded your tokens-per-minute (TPM) or requests-per-minute (RPM) limit. The response includes a `Retry-After` header indicating how long to wait. Always use exponential backoff with jitter, not fixed-interval retry.
- **500 / 503 Server Errors — Retryable.** Transient OpenAI infrastructure issues. Retry up to 3 times with backoff before giving up. Alert if the error rate exceeds 1% over a rolling window.
- **400 Context Length — Retryable after modification.** Your messages array is too long for the model's context window. Truncate the message history and retry. This is not a transient error, so do not retry without modifying the request.
- **400 Invalid Request — Fatal.** Your request is malformed: wrong parameter types, invalid schema, unsupported combination. Do not retry. Fix the code.
- **401 Authentication — Fatal.** Invalid or expired API key. Alert immediately. Do not retry.
- **403 Permission — Fatal.** Your account does not have access to the requested model or feature. Do not retry.
- **Content Policy / Refusal — Not an error.** The model declined to answer due to content policy. This is a valid model response, not a retryable error. Handle it in application logic, not retry logic.
Retry Logic with Exponential Backoff
```python
import time
import random

from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI()

def call_openai_with_retry(messages: list, model: str = "gpt-4o", max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
            )
            return response.choices[0].message.content
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            # Honor Retry-After when present; otherwise exponential backoff with jitter
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code in (500, 503) and attempt < max_attempts - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                # 4xx errors are fatal; do not retry
                raise
```
Cost Tracking
Every API call should log its token usage and estimated cost. This is the only way to catch cost spikes before they become surprises on your monthly bill.
```python
import logging

logger = logging.getLogger(__name__)

# Pricing in USD per 1M tokens, as of early 2026 (verify current prices)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "o1": {"input": 15.00, "output": 60.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def calculate_cost(model: str, usage) -> float:
    # Unknown models fall back to gpt-4o rates; consider raising instead
    rates = PRICING.get(model, PRICING["gpt-4o"])
    input_cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
    output_cost = (usage.completion_tokens / 1_000_000) * rates["output"]
    return input_cost + output_cost

def call_openai_tracked(messages: list, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(model=model, messages=messages)
    cost = calculate_cost(model, response.usage)
    # Log to your observability stack
    logger.info("openai_call", extra={
        "model": model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "cost_usd": cost,
        "finish_reason": response.choices[0].finish_reason,
    })
    return {
        "content": response.choices[0].message.content,
        "cost": cost,
        "usage": response.usage,
    }
```
Rate Limit Management
OpenAI rate limits have two dimensions: requests per minute (RPM) and tokens per minute (TPM). Hitting either triggers a 429.
Strategies for managing at scale:
Request queuing. Instead of calling OpenAI synchronously from user requests, push calls to a queue (Redis, SQS, BullMQ) and process with a worker pool. The pool can respect rate limits without blocking user-facing operations.
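The queuing pattern can be sketched in-process with an asyncio queue and a worker pool; a real deployment would back this with Redis or SQS and put the OpenAI call (with retry) where the placeholder is. All names here are illustrative:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list, rpm_limit: int):
    # Spacing that keeps a single worker under its share of the RPM budget
    min_interval = 60 / rpm_limit
    while True:
        job = await queue.get()
        if job is None:  # sentinel: shut this worker down
            queue.task_done()
            break
        # In production this is where the OpenAI call (with retry) happens
        results.append(f"{name} processed {job}")
        queue.task_done()
        await asyncio.sleep(min_interval)

async def run_pool(jobs, num_workers: int = 3, rpm_limit: int = 60):
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(f"w{i}", queue, results, rpm_limit))
               for i in range(num_workers)]
    for job in jobs:
        queue.put_nowait(job)
    for _ in workers:
        queue.put_nowait(None)  # one sentinel per worker
    await asyncio.gather(*workers)
    return results
```

The user-facing request handler only enqueues; the pool absorbs bursts and paces itself against the rate limit.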
Token estimation before sending. Use tiktoken to count tokens before sending a request. If the request would push you over your TPM budget, delay or queue it.
```python
import tiktoken

def count_tokens(messages: list, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for models tiktoken does not know about yet
        enc = tiktoken.get_encoding("o200k_base")
    total = 0
    for message in messages:
        total += len(enc.encode(message["content"])) + 4  # per-message role overhead
    return total + 2  # reply priming
```
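`count_tokens` pairs naturally with a sliding-window budget that gates requests before they are sent. A minimal sketch with an injectable clock so it can be tested; the `TokenBudget` class is illustrative, not part of any SDK:

```python
import time
from collections import deque

class TokenBudget:
    """Tracks tokens sent in the last 60 seconds against a TPM limit."""
    def __init__(self, tpm_limit: int, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.window: deque = deque()  # (timestamp, tokens) pairs

    def try_acquire(self, tokens: int) -> bool:
        now = self.clock()
        # Drop entries that have aged out of the 60-second window
        while self.window and now - self.window[0][0] >= 60:
            self.window.popleft()
        used = sum(t for _, t in self.window)
        if used + tokens > self.tpm_limit:
            return False  # caller should delay or queue the request
        self.window.append((now, tokens))
        return True
```

Before dispatching, call `try_acquire(count_tokens(messages))`; a `False` means the request goes back on the queue instead of burning a 429.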
Tier upgrades. OpenAI increases rate limits based on spend history. As usage grows, request higher tier access through the platform dashboard.
Streaming for Real-Time UX
For any user-facing interface, implement streaming:
```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_response(messages: list):
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=800,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
In a FastAPI backend:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list

@app.post("/chat")
async def chat(request: ChatRequest):
    return StreamingResponse(
        stream_response(request.messages),
        media_type="text/event-stream"
    )
```
On the client (JavaScript), consume the stream with fetch and a ReadableStream reader; EventSource only supports GET requests, so it does not fit a POST endpoint like this one.
Streaming does not change cost — you are still billed for the same tokens. It changes perception. Users see output immediately. The experience of waiting for a complete response versus watching tokens appear is night-and-day different.
Context Length Management
When conversation history grows long enough to approach the context limit, you need a truncation strategy. Two options:
Sliding window: Keep the system message plus the last N messages. Discard everything else.
```python
def trim_history(messages: list, max_tokens: int = 90_000, model: str = "gpt-4o") -> list:
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    while count_tokens(system + conversation, model) > max_tokens:
        if len(conversation) <= 2:
            break
        conversation = conversation[2:]  # drop the oldest user+assistant pair
    return system + conversation
```
Summarization: Periodically summarize old conversation history into a single system message update. More expensive but preserves more semantic context.
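A sketch of the compaction step, with the summarizer injected as a callable so the logic is testable without an API key; in production `summarize_fn` would call a cheap model such as gpt-4o-mini. The `compact_history` name is illustrative:

```python
from typing import Callable

def compact_history(messages: list, summarize_fn: Callable[[list], str],
                    keep_recent: int = 6) -> list:
    """Replace old conversation turns with a single summary message."""
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    if len(conversation) <= keep_recent:
        return messages  # nothing old enough to compact
    old, recent = conversation[:-keep_recent], conversation[-keep_recent:]
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize_fn(old)}",
    }
    return system + [summary_msg] + recent
```

Run it whenever the history crosses a token threshold; unlike the sliding window, the summary keeps decisions and facts from early turns in play.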
The Production Wrapper Pattern
Assemble everything into a single wrapper class:
```python
class OpenAIClient:
    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def complete(self, messages: list) -> dict:
        trimmed = trim_history(messages)
        return call_openai_tracked(trimmed, self.model)

    def complete_with_retry(self, messages: list) -> str:
        return call_openai_with_retry(trim_history(messages), self.model)

    def stream(self, messages: list):
        return stream_response(trim_history(messages))
```
One class. Retry logic, cost tracking, context management, and streaming — all hidden behind a clean interface. Business logic never touches the OpenAI SDK directly.
Observability Checklist
Before shipping an OpenAI integration to production, verify:
- Every API call logs: model, tokens, cost, latency, finish_reason
- Retry attempts are logged with error type and attempt number
- Rate limit hits trigger an alert, not just a retry
- Monthly spend alert is configured in the OpenAI dashboard
- Context length exceeded errors have a retry path with truncation
- Content policy refusals are handled gracefully, not treated as errors
Bottom Line
Production OpenAI integrations are not about the API call — they are about what happens around the API call. Retry logic with exponential backoff handles transient failures. Cost tracking surfaces spend before it surprises you. Context management prevents calls from failing as conversations grow. Streaming makes user experience feel instant.
Build these into a wrapper from day one. They are not optional for production — they are the difference between a demo and a system.
This completes the Building with ChatGPT track. You have covered the full OpenAI ecosystem: model routing, API fundamentals, GPT model lineup, function calling, structured outputs, Custom GPTs, the Assistants API, and production patterns. You have the architecture foundation to build any OpenAI-powered application.