ASK KNOX
beta
LESSON 86

The Assistants API — Threads, Files, and Persistent Conversations

The Assistants API gives you persistent threads, built-in file search over uploaded documents, and a code interpreter — without managing conversation history yourself. It is the right layer for document Q&A and multi-turn task completion.


The chat completions API is stateless. You manage history. You manage context. You handle truncation when the context fills up. For simple use cases, this is fine. For document Q&A, multi-turn task execution, and file-aware assistants, the bookkeeping becomes the application.

The Assistants API abstracts all of that. OpenAI stores the conversation, manages the context window, handles file indexing, and provides built-in tools for RAG and code execution.

[Figure: Assistants API lifecycle]

The Four Core Concepts

Assistant — a configured AI entity with a name, instructions (system prompt), model selection, and tool definitions. Create an Assistant once and reuse it across many conversations. Equivalent to a custom GPT configuration but fully API-controlled.

Thread — a persistent conversation container. Messages are stored in the Thread on OpenAI's servers. You do not need to manage history arrays — you just add messages and run the Thread. OpenAI handles context window management automatically, truncating old messages intelligently when the thread grows long.

Message — a single turn added to a Thread. Messages have roles (user, assistant) and can include file attachments.

Run — the execution of an Assistant against a Thread. Creating a Run triggers inference. You poll the Run's status until it stops: either a terminal state (completed, failed, expired, cancelled) or requires_action, which pauses the Run until you submit tool outputs.

Setup: Creating an Assistant

from openai import OpenAI

client = OpenAI()

assistant = client.beta.assistants.create(
    name="Document Analyst",
    instructions=(
        "You analyze uploaded documents and answer questions about their content. "
        "Always cite the specific section of the document that supports your answer. "
        "If the document does not contain the answer, say so explicitly."
    ),
    model="gpt-4o",
    tools=[{"type": "file_search"}]
)

assistant_id = assistant.id  # Store this — reuse the assistant, don't recreate

Create the Assistant once. Store the ID. Every user conversation creates a new Thread but reuses the same Assistant.
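One way to follow that advice is to cache the ID on disk so restarts reuse the same Assistant. A minimal sketch — the helper name and file path are illustrative, not part of the SDK:

```python
import os

def get_or_create_assistant(client, id_path="assistant_id.txt"):
    """Return a stored Assistant ID if one exists; otherwise create
    the Assistant once and persist its ID for future runs."""
    if os.path.exists(id_path):
        with open(id_path) as f:
            return f.read().strip()
    assistant = client.beta.assistants.create(
        name="Document Analyst",
        instructions="You analyze uploaded documents and answer questions.",
        model="gpt-4o",
        tools=[{"type": "file_search"}],
    )
    with open(id_path, "w") as f:
        f.write(assistant.id)
    return assistant.id
```

In a multi-instance deployment you would store the ID in a database or config service instead of a local file, but the shape is the same: create once, look up thereafter.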

File Search: Document Q&A

Upload files to a vector store, attach it to the Assistant, and the model retrieves relevant passages automatically:

# Create a vector store
vector_store = client.beta.vector_stores.create(name="Company Docs")

# Upload files to the vector store
with open("q4_report.pdf", "rb") as f:
    client.beta.vector_stores.file_batches.upload_and_poll(
        vector_store_id=vector_store.id,
        files=[f]
    )

# Attach vector store to the assistant
client.beta.assistants.update(
    assistant.id,
    tool_resources={"file_search": {"vector_store_ids": [vector_store.id]}}
)

Once attached, every Run against this Assistant has automatic access to the vector store. When a user asks a question, the model retrieves the relevant passages and grounds its answer in the document content.
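The retrieved passages surface as citation annotations on the assistant's reply. A sketch for pulling them out, assuming the Assistants v2 message schema (text content parts carrying file_citation annotations) — adjust attribute names if your SDK version differs:

```python
def extract_citations(message):
    """Collect file_search citation annotations from an assistant message."""
    citations = []
    for part in message.content:
        if part.type != "text":
            continue
        for ann in part.text.annotations:
            if ann.type == "file_citation":
                citations.append({
                    "marker": ann.text,  # the inline placeholder in the reply text
                    "file_id": ann.file_citation.file_id,
                })
    return citations
```

Mapping file_id back to a filename (via client.files.retrieve) lets you render human-readable source references in your UI.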

Running a Conversation

# Create a thread for this user session
thread = client.beta.threads.create()

# Add the user's message
client.beta.threads.messages.create(
    thread_id=thread.id,
    role="user",
    content="What were the key risk factors mentioned in the Q4 report?"
)

# Run the assistant against the thread
run = client.beta.threads.runs.create_and_poll(
    thread_id=thread.id,
    assistant_id=assistant.id
)

# Get the response when complete
if run.status == "completed":
    messages = client.beta.threads.messages.list(thread_id=thread.id)
    latest = messages.data[0]  # most recent message
    print(latest.content[0].text.value)

create_and_poll is a blocking SDK helper that waits for the Run to complete. For production use with a web server, run the poll loop asynchronously rather than blocking the request thread.
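A manual poll loop makes the helper's behavior explicit and is the shape you would adapt for background workers. A sketch, assuming any stop state (terminal or requires_action) should end the loop; the interval and timeout values are arbitrary:

```python
import time

# States at which polling should stop: terminal states plus the
# requires_action pause, which needs tool outputs before resuming.
STOP_STATES = {"completed", "failed", "expired", "cancelled", "requires_action"}

def wait_for_run(client, thread_id, run_id, interval=1.0, timeout=120.0):
    """Poll a Run until it stops or the timeout elapses."""
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        run = client.beta.threads.runs.retrieve(
            thread_id=thread_id, run_id=run_id
        )
        if run.status in STOP_STATES:
            return run
        time.sleep(interval)
    raise TimeoutError(f"Run {run_id} did not finish within {timeout}s")
```

In an async web server you would use an asyncio variant of the same loop (or the SDK's async client) so a slow Run never blocks a request worker.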

Handling requires_action (Function Calling)

If you include function tools on your Assistant and the model decides to call one, the Run pauses at requires_action:

import json  # needed to parse tool arguments and serialize results

if run.status == "requires_action":
    tool_outputs = []
    for tool_call in run.required_action.submit_tool_outputs.tool_calls:
        result = dispatch_function(
            tool_call.function.name,
            json.loads(tool_call.function.arguments)
        )
        tool_outputs.append({
            "tool_call_id": tool_call.id,
            "output": json.dumps(result)
        })

    # Resume the run with tool results
    run = client.beta.threads.runs.submit_tool_outputs_and_poll(
        thread_id=thread.id,
        run_id=run.id,
        tool_outputs=tool_outputs
    )

Code Interpreter

The code_interpreter tool gives the model a Python sandbox. It can write and execute code, produce charts, manipulate files, and perform calculations — all without you provisioning any compute.

assistant = client.beta.assistants.create(
    name="Data Analyst",
    instructions="Analyze the uploaded CSV and answer questions about the data. Generate charts when helpful.",
    model="gpt-4o",
    tools=[{"type": "code_interpreter"}]
)

Attach a CSV file to the Thread message and ask the model to analyze it. It will write pandas code, execute it, and return both the code and the results. Charts are returned as file references in the response.
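A sketch of that attachment flow, wrapped in a hypothetical helper (the helper name is made up; the files.create purpose and message attachments shape follow the Assistants v2 API):

```python
def ask_about_csv(client, thread_id, csv_path, question):
    """Upload a CSV and attach it to a user message so code_interpreter
    can read it during the next Run."""
    with open(csv_path, "rb") as f:
        uploaded = client.files.create(file=f, purpose="assistants")
    return client.beta.threads.messages.create(
        thread_id=thread_id,
        role="user",
        content=question,
        attachments=[{
            "file_id": uploaded.id,
            "tools": [{"type": "code_interpreter"}],
        }],
    )
```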

Assistants vs Chat Completions

Dimension        | Chat Completions    | Assistants API
State management | You manage history  | OpenAI manages threads
File RAG         | You build it        | Built-in file_search
Code execution   | You provision       | Built-in code_interpreter
Cost             | Per token only      | Per token + tool costs
Control          | Full                | Limited

Use Assistants when the built-in tools match your needs and you want to avoid the engineering overhead. Use chat completions when you need full control over every aspect of the interaction.

Cost Model

Assistants API costs include:

  • Model tokens (same rates as chat completions)
  • file_search vector storage: $0.10/GB/day (first 1 GB free)
  • code_interpreter: $0.03/session when used
  • Retrieval API calls: included in file_search cost

For most document Q&A use cases, the storage cost is negligible. The dominant cost remains model tokens.
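A back-of-the-envelope helper for the tool costs above — rates copied from the list; model token costs are billed separately and not modeled here:

```python
STORAGE_RATE_PER_GB_DAY = 0.10  # file_search vector storage
SESSION_RATE = 0.03             # code_interpreter, per session

def estimate_tool_cost(storage_gb, days, sessions):
    """Rough monthly tool-cost estimate for an Assistants deployment."""
    return storage_gb * days * STORAGE_RATE_PER_GB_DAY + sessions * SESSION_RATE

# Example: 0.5 GB of documents stored for 30 days plus 100
# code_interpreter sessions costs about $4.50 in tool fees.
```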

Bottom Line

The Assistants API trades control for built-in state management, file RAG, and code execution. Use it for document Q&A, multi-turn task completion, and file-aware conversations where managing the stateful machinery yourself would be the application. Use chat completions when control, cost optimization, or custom integrations are the priority.

The final lesson covers production patterns — the retry logic, rate limiting, cost tracking, and error handling that separates a working prototype from a system that runs reliably at scale.