Temperature, Tokens, and Model Parameters
Temperature, max tokens, and top-p are precision instruments — not default knobs to ignore. Knowing when to tune them versus when to fix the prompt separates operators who understand the stack from those who don't.
Model parameters are the layer underneath the prompts. Most operators either ignore them entirely or reach for them first when output quality fails — both mistakes.
The right mental model is: fix the prompt first, and treat parameters as precision instruments for the cases where prompt changes alone cannot get you where you need to go. Temperature, max tokens, and top-p are not magic dials. They have specific, predictable effects, and tuning them without understanding those effects produces unpredictable results.
Temperature
Temperature controls the randomness of the model's token selection. At each position in the output, the model has a probability distribution over all possible next tokens. Temperature scales that distribution.
Temperature = 0.0 collapses the distribution to the highest-probability token at each step. The output is as deterministic as the model can be: the same input will almost always produce the same output, though minor variation can still occur from inference-level nondeterminism. This is the correct setting for tasks where correctness matters more than variety: code generation, data extraction, structured output, classification.
Temperature = 0.7–1.0 preserves a moderate spread of probabilities, enabling the model to sometimes select tokens that are plausible but not the most probable. This produces more varied, natural-sounding prose. It is the working range for most content generation, analysis, and general-purpose tasks.
Temperature above 1.0 significantly increases randomness. Low-probability tokens appear more frequently. This can produce surprising, creative output — and it can produce incoherent output. Use sparingly and only for brainstorming or ideation, never for production outputs where quality consistency matters.
The practical question: what does your task require — consistency or variety? Code needs consistency. Marketing copy benefits from some variety. The answer tells you where on the spectrum to set temperature.
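Mechanically, "scaling the distribution" means dividing the model's logits by the temperature before applying softmax. A minimal sketch of that mechanism — the logits here are invented, and real inference stacks do this in optimized kernels:

```python
import math

def apply_temperature(logits, temperature):
    """Scale logits by temperature, then softmax into probabilities.

    Temperature near 0 concentrates mass on the top token;
    temperature above 1 flattens the distribution."""
    if temperature == 0:
        # Greedy decoding: all probability mass on the argmax token.
        probs = [0.0] * len(logits)
        probs[logits.index(max(logits))] = 1.0
        return probs
    scaled = [l / temperature for l in logits]
    m = max(scaled)  # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.5]  # hypothetical next-token logits
for t in (0.2, 0.7, 1.5):
    print(t, [round(p, 3) for p in apply_temperature(logits, t)])
```

At 0.2 the top token takes nearly all the mass; at 1.5 the alternatives become serious contenders. That spread is exactly the consistency-versus-variety trade-off described above.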
Max Tokens
Max tokens is the hard limit on output length. The model stops generating after it hits this ceiling, regardless of whether the task is complete. Setting it wrong produces two failure modes:
Too low: The output is truncated mid-sentence, mid-section, or mid-JSON object. This is one of the more frustrating and common failures in AI pipelines — the output looks fine until you notice it was silently cut off.
Too high with a verbose model: The model pads to fill available space. Some models, given a high token limit, will add unnecessary caveats, summaries of what they just said, and filler prose. Setting a tight but adequate limit produces cleaner, denser output.
Estimation guide for setting max tokens:
- One paragraph: 150–250 tokens
- Short article section: 400–600 tokens
- Full article: 1,500–3,000 tokens
- Complete structured JSON (moderate complexity): 500–2,000 tokens depending on schema
Run a few test completions at generous token limits, measure actual output length, then set max tokens at 120–130% of your measured typical length.
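That measurement step is easy to script. A sketch, assuming you have already collected output lengths in tokens from a few test completions; the 1.25 buffer is the midpoint of the 120–130% guidance:

```python
def max_tokens_budget(measured_lengths, buffer=1.25):
    """Set max_tokens from measured test-run output lengths.

    Uses the median as the "typical" length, times a safety
    buffer, per the 120-130% guidance above."""
    lengths = sorted(measured_lengths)
    n = len(lengths)
    median = (lengths[n // 2] if n % 2 else
              (lengths[n // 2 - 1] + lengths[n // 2]) / 2)
    return int(median * buffer)

# Five test completions for a "short article section" task:
samples = [430, 455, 470, 510, 540]
print(max_tokens_budget(samples))  # median 470 * 1.25 -> 587
```

If the budget it returns regularly lands near your model's hard output cap, that is a signal to restructure the task, not raise the limit.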
Top-p (Nucleus Sampling)
Top-p controls which tokens are eligible for selection at each step. A top-p of 0.9 means: consider only the smallest set of tokens whose probabilities sum to 90% of the total probability mass. Tokens in the low-probability tail beyond that cutoff are excluded from selection.
Reducing top-p makes outputs more conservative — fewer candidate tokens, more predictable selections. Increasing top-p expands the candidate pool.
The interaction with temperature: both parameters affect output variance, but through different mechanisms. Temperature scales the whole probability distribution; top-p truncates the tail.
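The tail-truncation mechanism can be sketched in a few lines; the probabilities here are invented, and real samplers operate over the full vocabulary:

```python
def nucleus_filter(probs, top_p=0.9):
    """Keep the smallest set of tokens whose cumulative
    probability reaches top_p; the tail is excluded."""
    ranked = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    kept, cum = [], 0.0
    for i, p in ranked:
        kept.append(i)
        cum += p
        if cum >= top_p:
            break
    return kept  # indices of tokens still eligible for sampling

probs = [0.55, 0.30, 0.10, 0.04, 0.01]  # hypothetical next-token probabilities
print(nucleus_filter(probs, 0.9))  # [0, 1, 2] -- tail tokens 3 and 4 pruned
```

Note the contrast with the temperature sketch earlier: temperature reshapes every probability, while top-p only decides which candidates survive, leaving the survivors' relative weights intact.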
The practical rule: Do not tune both simultaneously. Start with temperature. If temperature adjustments get you close but not quite where you need to be, then consider top-p. Adjusting both at once makes it impossible to attribute cause to effect.
Most production deployments set top-p to 0.9 or leave it at the API default (often 1.0, meaning no truncation) and never touch it again. Temperature is the primary lever.
When to Tune vs. When to Fix the Prompt
This is the most important judgment call in parameter work.
Fix the prompt when:
- Output is wrong because the model is missing information (add context)
- Output is in the wrong format (tighten format specification)
- Output is hallucinated (ground it in injected facts)
- Output is drifting from the task (tighten the task definition)
These are prompt failures, not parameter failures. No temperature setting will fix a prompt that lacks context. No max_tokens adjustment will fix a format break.
Tune parameters when:
- Output format is correct and content is right, but the model is too predictable/repetitive — increase temperature slightly
- Output is being cut off and content is complete — increase max tokens
- Output is confidently wrong in the same direction repeatedly — reduce temperature (you may be in a high-probability bad-answer attractor)
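The two lists above condense into a symptom-to-layer lookup. A sketch, with illustrative symptom labels rather than any standard taxonomy:

```python
# Symptom -> (layer to fix, action), condensing the lists above.
DIAGNOSIS = {
    "missing_information": ("prompt", "add context"),
    "wrong_format":        ("prompt", "tighten format specification"),
    "hallucination":       ("prompt", "ground in injected facts"),
    "task_drift":          ("prompt", "tighten task definition"),
    "too_repetitive":      ("parameters", "increase temperature slightly"),
    "truncated_output":    ("parameters", "increase max_tokens"),
    "same_wrong_answer":   ("parameters", "reduce temperature"),
}

def diagnose(symptom):
    """Map an observed failure to the layer that actually fixes it."""
    layer, action = DIAGNOSIS[symptom]
    return f"fix the {layer}: {action}"

print(diagnose("truncated_output"))
```

The point of the table is the asymmetry: four of the seven symptoms route to the prompt, and none of the prompt failures have a parameter remedy.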
Model Selection as a Parameter
Temperature, top-p, and max tokens are intra-model parameters. Model selection is the meta-parameter — which model handles the task at all.
Model selection by task type:
- Complex reasoning, code, nuanced analysis: Claude Sonnet, Claude Opus, GPT-4o
- Speed-critical synthesis, summarization: Gemini Flash, Claude Haiku
- Real-time search, live data: Grok, Perplexity
- Cost-sensitive, high-volume classification: Gemini Flash, Claude Haiku
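The task-type mapping above can be expressed as a simple routing table. A sketch that uses the model names from the list as placeholders, not fixed recommendations:

```python
# Task type -> candidate models, mirroring the list above.
# Model names are examples from the text, not endorsements.
MODEL_ROUTES = {
    "complex_reasoning":   ["Claude Sonnet", "Claude Opus", "GPT-4o"],
    "fast_synthesis":      ["Gemini Flash", "Claude Haiku"],
    "live_data":           ["Grok", "Perplexity"],
    "bulk_classification": ["Gemini Flash", "Claude Haiku"],
}

def pick_model(task_type):
    """Return the first candidate model for a task type."""
    candidates = MODEL_ROUTES.get(task_type)
    if candidates is None:
        raise ValueError(f"unknown task type: {task_type}")
    return candidates[0]

print(pick_model("bulk_classification"))  # Gemini Flash
```

Even a table this crude forces the routing decision to be explicit instead of defaulting every task to whichever model is already wired in.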
Running a complex reasoning task through a fast cheap model because it is cheaper produces expensive failures — bad output that cascades downstream. Running a simple classification task through a frontier reasoning model burns money without improving quality.
Right model for right task is the highest-leverage parameter decision you make — not temperature.
Lesson 54 Drill
For any AI task you run regularly:
- What temperature are you running? Is that actually right for the task type?
- What is your max_tokens setting? Have you ever seen truncation? Is your buffer adequate?
- Are you running the right model for the task complexity and cost profile?
Change one parameter, document the effect. Then return to the prompt and see if a prompt improvement outperforms the parameter change. In most cases, it will.
Bottom Line
Temperature sets variance — low for deterministic tasks, moderate for production, high for ideation only. Max tokens is a hard truncation limit — set it with buffer. Top-p is a tail-pruning mechanism — leave it at 0.9 unless temperature adjustments alone are insufficient. Model selection is the most impactful parameter and the most neglected. Fix the prompt first. Tune parameters when the prompt is right and the variance or length profile is still wrong. The next lesson covers the infrastructure that makes all of this sustainable at scale: the prompt library.