Temperature, Tokens, and Model Parameters
Temperature, max tokens, and top-p are precision instruments — not default knobs to ignore. Knowing when to tune them versus when to fix the prompt separates operators who understand the stack from those who don't.
Model parameters are the layer underneath the prompts. Most operators either ignore them entirely or reach for them first when output quality fails — both mistakes.
The right mental model is: fix the prompt first, and treat parameters as precision instruments for the cases where prompt changes alone cannot get you where you need to go. Temperature, max tokens, and top-p are not magic dials. They have specific, predictable effects, and tuning them without understanding those effects produces unpredictable results.
Temperature
Temperature controls the randomness of the model's token selection. At each position in the output, the model has a probability distribution over all possible next tokens. Temperature scales that distribution.
Temperature = 0.0 collapses the distribution to the highest-probability token at each step. The output is as deterministic as the model can be: the same input will almost always produce the same output, though minor variation can still occur from inference-level nondeterminism. This is the correct setting for tasks where correctness matters more than variety: code generation, data extraction, structured output, classification.
Temperature = 0.7–1.0 preserves a moderate spread of probabilities, enabling the model to sometimes select tokens that are plausible but not the most probable. This produces more varied, natural-sounding prose. It is the working range for most content generation, analysis, and general-purpose tasks.
Temperature above 1.0 significantly increases randomness. Low-probability tokens appear more frequently. This can produce surprising, creative output — and it can produce incoherent output. Use sparingly and only for brainstorming or ideation, never for production outputs where quality consistency matters.
The practical question: what does your task require — consistency or variety? Code needs consistency. Marketing copy benefits from some variety. The answer tells you where on the spectrum to set temperature.
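Mechanically, "scaling the distribution" means dividing the model's logits by the temperature before applying softmax. A minimal sketch of that mechanism — the logits here are invented, and real inference stacks do this in optimized kernels:

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by temperature, then softmax into probabilities.

    Temperature near 0 concentrates mass on the top token;
    temperature above 1 flattens the distribution."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
for t in (0.2, 0.7, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

At 0.2 the top token takes nearly all the mass; at 1.5 the alternatives become serious contenders. That spread is exactly the consistency-versus-variety trade-off described above.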
Max Tokens
Max tokens is the hard limit on output length. The model stops generating after it hits this ceiling, regardless of whether the task is complete. Setting it wrong produces two failure modes:
Too low: The output is truncated mid-sentence, mid-section, or mid-JSON object. This is one of the more frustrating and common failures in AI pipelines — the output looks fine until you notice it was silently cut off.
Too high with a verbose model: The model pads to fill available space. Some models, given a high token limit, will add unnecessary caveats, summaries of what they just said, and filler prose. Setting a tight but adequate limit produces cleaner, denser output.
Estimation guide for setting max tokens:
- One paragraph: 150–250 tokens
- Short article section: 400–600 tokens
- Full article: 1,500–3,000 tokens
- Complete structured JSON (moderate complexity): 500–2,000 tokens depending on schema
Run a few test completions at generous token limits, measure actual output length, then set max tokens at 120–130% of your measured typical length.
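That measurement step is easy to script. A sketch, assuming you have already collected output lengths in tokens from a few test completions; the 1.25 buffer is the midpoint of the 120–130% guidance:

```python
def max_tokens_budget(measured_lengths, buffer=1.25):
    """Set max_tokens from measured test-run output lengths.

    Uses the median as the "typical" length, times a safety
    buffer, per the 120-130% guidance above."""
    lengths = sorted(measured_lengths)
    n = len(lengths)
    median = (lengths[n // 2] if n % 2 else
              (lengths[n // 2 - 1] + lengths[n // 2]) / 2)
    return int(median * buffer)

# Five test completions for a "short article section" task:
samples = [430, 455, 470, 510, 540]
print(max_tokens_budget(samples))  # median 470 * 1.25 -> 587
```

If the budget it returns regularly lands near your model's hard output cap, that is a signal to restructure the task, not raise the limit.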
Top-p (Nucleus Sampling)
Top-p controls which tokens are eligible for selection at each step. A top-p of 0.9 means: consider only the smallest set of tokens whose probabilities sum to 90% of the total probability mass. Tokens in the low-probability tail beyond that cutoff are excluded from selection.
Reducing top-p makes outputs more conservative — fewer candidate tokens, more predictable selections. Increasing top-p expands the candidate pool.
The interaction with temperature: both parameters affect output variance, but through different mechanisms. Temperature scales the whole probability distribution; top-p truncates the tail.
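The tail-truncation mechanism can be sketched in a few lines; the probabilities here are invented, and real samplers operate over the full vocabulary:

```python
def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative
    probability reaches top_p; the tail is excluded."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    return kept  # indices of tokens still eligible for sampling

probs = [0.55, 0.30, 0.10, 0.04, 0.01]  # hypothetical next-token probabilities
print(nucleus_filter(probs, 0.9))  # [0, 1, 2] -- tail tokens 3 and 4 pruned
```

Note the contrast with the temperature sketch earlier: temperature reshapes every probability, while top-p only decides which candidates survive, leaving the survivors' relative weights intact.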
The practical rule: Do not tune both simultaneously. Start with temperature. If temperature adjustments get you close but not quite where you need to be, then consider top-p. Adjusting both at once makes it impossible to attribute cause to effect.
Most production deployments set top-p to 0.9 or leave it at the API default (often 1.0, meaning no truncation) and never touch it again. Temperature is the primary lever.
When to Tune vs. When to Fix the Prompt
This is the most important judgment call in parameter work.
Fix the prompt when:
- Output is wrong because the model is missing information (add context)
- Output is in the wrong format (tighten format specification)
- Output is hallucinated (ground it in injected facts)
- Output is drifting from the task (tighten the task definition)
These are prompt failures, not parameter failures. No temperature setting will fix a prompt that lacks context. No max_tokens adjustment will fix a format break.
Tune parameters when:
- Output format is correct and content is right, but the model is too predictable/repetitive — increase temperature slightly
- Output is being cut off and content is complete — increase max tokens
- Output is confidently wrong in the same direction repeatedly — reduce temperature (you may be in a high-probability bad-answer attractor)
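The two lists above condense into a symptom-to-layer lookup. A sketch, with illustrative symptom labels rather than any standard taxonomy:

```python
# Symptom -> (layer to fix, action), condensing the lists above.
DIAGNOSIS = {
    "missing_information": ("prompt", "add context"),
    "wrong_format":        ("prompt", "tighten format specification"),
    "hallucination":       ("prompt", "ground in injected facts"),
    "task_drift":          ("prompt", "tighten task definition"),
    "too_repetitive":      ("parameters", "increase temperature slightly"),
    "truncated_output":    ("parameters", "increase max_tokens"),
    "same_wrong_answer":   ("parameters", "reduce temperature"),
}

def diagnose(symptom):
    """Map an observed failure to the layer that actually fixes it."""
    layer, action = DIAGNOSIS[symptom]
    return f"fix the {layer}: {action}"

print(diagnose("truncated_output"))
```

The point of the table is the asymmetry: four of the seven symptoms route to the prompt, and none of the prompt failures have a parameter remedy.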
Model Selection as a Parameter
Temperature, top-p, and max tokens are intra-model parameters. Model selection is the meta-parameter — which model handles the task at all.
Model selection by task type:
- Complex reasoning, code, nuanced analysis: Claude Sonnet, Claude Opus, GPT-4o
- Speed-critical synthesis, summarization: Gemini Flash, Claude Haiku
- Real-time search, live data: Grok, Perplexity
- Cost-sensitive, high-volume classification: Gemini Flash, Claude Haiku
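The task-type mapping above can be expressed as a simple routing table. A sketch that uses the model names from the list as placeholders, not fixed recommendations:

```python
# Task type -> candidate models, mirroring the list above.
# Model names are examples from the text, not endorsements.
MODEL_ROUTES = {
    "complex_reasoning":   ["Claude Sonnet", "Claude Opus", "GPT-4o"],
    "fast_synthesis":      ["Gemini Flash", "Claude Haiku"],
    "live_data":           ["Grok", "Perplexity"],
    "bulk_classification": ["Gemini Flash", "Claude Haiku"],
}

def pick_model(task_type):
    """Return the first candidate model for a task type."""
    candidates = MODEL_ROUTES.get(task_type)
    if candidates is None:
        raise ValueError(f"unknown task type: {task_type}")
    return candidates[0]

print(pick_model("bulk_classification"))  # Gemini Flash
```

Even a table this crude forces the routing decision to be explicit instead of defaulting every task to whichever model is already wired in.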
Running a complex reasoning task through a fast cheap model because it is cheaper produces expensive failures — bad output that cascades downstream. Running a simple classification task through a frontier reasoning model burns money without improving quality.
Right model for right task is the highest-leverage parameter decision you make — not temperature.
Lesson 54 Drill
For any AI task you run regularly:
- What temperature are you running? Is that actually right for the task type?
- What is your max_tokens setting? Have you ever seen truncation? Is your buffer adequate?
- Are you running the right model for the task complexity and cost profile?
Change one parameter, document the effect. Then return to the prompt and see if a prompt improvement outperforms the parameter change. In most cases, it will.
Bottom Line
Temperature sets variance — low for deterministic tasks, moderate for production, high for ideation only. Max tokens is a hard truncation limit — set it with buffer. Top-p is a tail-pruning mechanism — leave it at 0.9 unless temperature adjustments alone are insufficient. Model selection is the most impactful parameter and the most neglected. Fix the prompt first. Tune parameters when the prompt is right and the variance or length profile is still wrong. The next lesson covers the infrastructure that makes all of this sustainable at scale: the prompt library.