Building a Production OpenAI Integration
A working prototype is not a production system. Retry logic, rate limit handling, cost tracking, streaming, and a proper error taxonomy are what separate a demo from a system you can stake a product on.
The prototype calls the API. The production system calls it reliably, handles every failure mode, tracks cost, and does not take your application down when OpenAI has an incident.
This lesson builds the production-grade wrapper — the patterns that every serious OpenAI integration needs before going live.
The Error Taxonomy
Before writing retry logic, you need to understand which errors are retryable and which are fatal.
- **429 Rate Limit — Retryable.** You have exceeded your tokens-per-minute (TPM) or requests-per-minute (RPM) limit. The response includes a `Retry-After` header indicating how long to wait. Always use exponential backoff with jitter, not fixed-interval retry.
- **500 / 503 Server Errors — Retryable.** Transient OpenAI infrastructure issues. Retry up to 3 times with backoff before giving up. Alert if the error rate exceeds 1% over a rolling window.
- **400 Context Length — Retryable after modification.** Your messages array is too long for the model's context window. Truncate the message history and retry. This is not a transient error, so do not retry without modifying the request.
- **400 Invalid Request — Fatal.** Your request is malformed: wrong parameter types, invalid schema, unsupported combination. Do not retry. Fix the code.
- **401 Authentication — Fatal.** Invalid or expired API key. Alert immediately. Do not retry.
- **403 Permission — Fatal.** Your account does not have access to the requested model or feature. Do not retry.
- **Content Policy / Refusal — Not an error.** The model declined to answer due to content policy. This is a valid model response, not a retryable error. Handle it in application logic, not retry logic.
Retry Logic with Exponential Backoff
```python
import time
import random

from openai import OpenAI, RateLimitError, APIStatusError

client = OpenAI()

def call_openai_with_retry(messages: list, model: str = "gpt-4o", max_attempts: int = 3) -> str:
    for attempt in range(max_attempts):
        try:
            response = client.chat.completions.create(
                model=model,
                messages=messages,
                max_tokens=1000,
            )
            return response.choices[0].message.content
        except RateLimitError as e:
            if attempt == max_attempts - 1:
                raise
            # Honor Retry-After when present; otherwise exponential backoff with jitter
            retry_after = e.response.headers.get("retry-after")
            wait = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
            time.sleep(wait)
        except APIStatusError as e:
            if e.status_code in (500, 503) and attempt < max_attempts - 1:
                wait = (2 ** attempt) + random.uniform(0, 1)
                time.sleep(wait)
            else:
                # 4xx errors are fatal; do not retry
                raise
```
Cost Tracking
Every API call should log its token usage and estimated cost. This is the only way to catch cost spikes before they become surprises on your monthly bill.
```python
import logging

logger = logging.getLogger(__name__)

# Pricing in USD per 1M tokens, as of early 2026 (verify current prices)
PRICING = {
    "gpt-4o": {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
    "o1": {"input": 15.00, "output": 60.00},
    "o3-mini": {"input": 1.10, "output": 4.40},
}

def calculate_cost(model: str, usage) -> float:
    # Unknown models fall back to gpt-4o rates; consider raising instead
    rates = PRICING.get(model, PRICING["gpt-4o"])
    input_cost = (usage.prompt_tokens / 1_000_000) * rates["input"]
    output_cost = (usage.completion_tokens / 1_000_000) * rates["output"]
    return input_cost + output_cost

def call_openai_tracked(messages: list, model: str = "gpt-4o") -> dict:
    response = client.chat.completions.create(model=model, messages=messages)
    cost = calculate_cost(model, response.usage)
    # Log to your observability stack
    logger.info("openai_call", extra={
        "model": model,
        "prompt_tokens": response.usage.prompt_tokens,
        "completion_tokens": response.usage.completion_tokens,
        "cost_usd": cost,
        "finish_reason": response.choices[0].finish_reason,
    })
    return {
        "content": response.choices[0].message.content,
        "cost": cost,
        "usage": response.usage,
    }
```
Rate Limit Management
OpenAI rate limits have two dimensions: requests per minute (RPM) and tokens per minute (TPM). Hitting either triggers a 429.
Strategies for managing at scale:
Request queuing. Instead of calling OpenAI synchronously from user requests, push calls to a queue (Redis, SQS, BullMQ) and process with a worker pool. The pool can respect rate limits without blocking user-facing operations.
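The queuing pattern can be sketched in-process with an asyncio queue and a worker pool; a real deployment would back this with Redis or SQS and put the OpenAI call (with retry) where the placeholder is. All names here are illustrative:

```python
import asyncio

async def worker(name: str, queue: asyncio.Queue, results: list, rpm_limit: int):
    # Spacing that keeps a single worker under its share of the RPM budget
    min_interval = 60 / rpm_limit
    while True:
        job = await queue.get()
        if job is None:  # sentinel: shut this worker down
            queue.task_done()
            break
        # In production this is where the OpenAI call (with retry) happens
        results.append(f"{name} processed {job}")
        queue.task_done()
        await asyncio.sleep(min_interval)

async def run_pool(jobs, num_workers: int = 3, rpm_limit: int = 60):
    queue: asyncio.Queue = asyncio.Queue()
    results: list = []
    workers = [asyncio.create_task(worker(f"w{i}", queue, results, rpm_limit))
               for i in range(num_workers)]
    for job in jobs:
        queue.put_nowait(job)
    for _ in workers:
        queue.put_nowait(None)  # one sentinel per worker
    await asyncio.gather(*workers)
    return results
```

The user-facing request handler only enqueues; the pool absorbs bursts and paces itself against the rate limit.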
Token estimation before sending. Use tiktoken to count tokens before sending a request. If the request would push you over your TPM budget, delay or queue it.
```python
import tiktoken

def count_tokens(messages: list, model: str = "gpt-4o") -> int:
    try:
        enc = tiktoken.encoding_for_model(model)
    except KeyError:
        # Fallback for models tiktoken does not know about yet
        enc = tiktoken.get_encoding("o200k_base")
    total = 0
    for message in messages:
        total += len(enc.encode(message["content"])) + 4  # per-message role overhead
    return total + 2  # reply priming
```
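`count_tokens` pairs naturally with a sliding-window budget that gates requests before they are sent. A minimal sketch with an injectable clock so it can be tested; the `TokenBudget` class is illustrative, not part of any SDK:

```python
import time
from collections import deque

class TokenBudget:
    """Tracks tokens sent in the last 60 seconds against a TPM limit."""
    def __init__(self, tpm_limit: int, clock=time.monotonic):
        self.tpm_limit = tpm_limit
        self.clock = clock
        self.window: deque = deque()  # (timestamp, tokens) pairs

    def try_acquire(self, tokens: int) -> bool:
        now = self.clock()
        # Drop entries that have aged out of the 60-second window
        while self.window and now - self.window[0][0] >= 60:
            self.window.popleft()
        used = sum(t for _, t in self.window)
        if used + tokens > self.tpm_limit:
            return False  # caller should delay or queue the request
        self.window.append((now, tokens))
        return True
```

Before dispatching, call `try_acquire(count_tokens(messages))`; a `False` means the request goes back on the queue instead of burning a 429.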
Tier upgrades. OpenAI increases rate limits based on spend history. As usage grows, request higher tier access through the platform dashboard.
Streaming for Real-Time UX
For any user-facing interface, implement streaming:
```python
from openai import AsyncOpenAI

async_client = AsyncOpenAI()

async def stream_response(messages: list):
    stream = await async_client.chat.completions.create(
        model="gpt-4o",
        messages=messages,
        max_tokens=800,
        stream=True,
    )
    async for chunk in stream:
        if chunk.choices and chunk.choices[0].delta.content:
            yield chunk.choices[0].delta.content
```
In a FastAPI backend:
```python
from fastapi import FastAPI
from fastapi.responses import StreamingResponse
from pydantic import BaseModel

app = FastAPI()

class ChatRequest(BaseModel):
    messages: list

@app.post("/chat")
async def chat(request: ChatRequest):
    return StreamingResponse(
        stream_response(request.messages),
        media_type="text/event-stream"
    )
```
On the client (JavaScript), consume the stream with fetch and a ReadableStream reader; EventSource only supports GET requests, so it does not fit a POST endpoint like this one.
Streaming does not change cost — you are still billed for the same tokens. It changes perception. Users see output immediately. The experience of waiting for a complete response versus watching tokens appear is night-and-day different.
Context Length Management
When conversation history grows long enough to approach the context limit, you need a truncation strategy. Two options:
Sliding window: Keep the system message plus the last N messages. Discard everything else.
```python
def trim_history(messages: list, max_tokens: int = 90_000, model: str = "gpt-4o") -> list:
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    while count_tokens(system + conversation, model) > max_tokens:
        if len(conversation) <= 2:
            break
        conversation = conversation[2:]  # drop the oldest user+assistant pair
    return system + conversation
```
Summarization: Periodically summarize old conversation history into a single system message update. More expensive but preserves more semantic context.
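A sketch of the compaction step, with the summarizer injected as a callable so the logic is testable without an API key; in production `summarize_fn` would call a cheap model such as gpt-4o-mini. The `compact_history` name is illustrative:

```python
from typing import Callable

def compact_history(messages: list, summarize_fn: Callable[[list], str],
                    keep_recent: int = 6) -> list:
    """Replace old conversation turns with a single summary message."""
    system = [m for m in messages if m["role"] == "system"]
    conversation = [m for m in messages if m["role"] != "system"]
    if len(conversation) <= keep_recent:
        return messages  # nothing old enough to compact
    old, recent = conversation[:-keep_recent], conversation[-keep_recent:]
    summary_msg = {
        "role": "system",
        "content": f"Summary of earlier conversation: {summarize_fn(old)}",
    }
    return system + [summary_msg] + recent
```

Run it whenever the history crosses a token threshold; unlike the sliding window, the summary keeps decisions and facts from early turns in play.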
The Production Wrapper Pattern
Assemble everything into a single wrapper class:
```python
class OpenAIClient:
    def __init__(self, model: str = "gpt-4o"):
        self.client = OpenAI()
        self.model = model

    def complete(self, messages: list) -> dict:
        trimmed = trim_history(messages)
        return call_openai_tracked(trimmed, self.model)

    def complete_with_retry(self, messages: list) -> str:
        return call_openai_with_retry(trim_history(messages), self.model)

    def stream(self, messages: list):
        return stream_response(trim_history(messages))
```
One class. Retry logic, cost tracking, context management, and streaming — all hidden behind a clean interface. Business logic never touches the OpenAI SDK directly.
Observability Checklist
Before shipping an OpenAI integration to production, verify:
- Every API call logs: model, tokens, cost, latency, finish_reason
- Retry attempts are logged with error type and attempt number
- Rate limit hits trigger an alert, not just a retry
- Monthly spend alert is configured in the OpenAI dashboard
- Context length exceeded errors have a retry path with truncation
- Content policy refusals are handled gracefully, not treated as errors
Bottom Line
Production OpenAI integrations are not about the API call — they are about what happens around the API call. Retry logic with exponential backoff handles transient failures. Cost tracking surfaces spend before it surprises you. Context management prevents calls from failing as conversations grow. Streaming makes user experience feel instant.
Build these into a wrapper from day one. They are not optional for production — they are the difference between a demo and a system.
This completes the Building with ChatGPT track. You have covered the full OpenAI ecosystem: model routing, API fundamentals, GPT model lineup, function calling, structured outputs, Custom GPTs, the Assistants API, and production patterns. You have the architecture foundation to build any OpenAI-powered application.