Gemini Model Routing — Flash, Pro, Ultra
Model routing is not a technical detail; it is an economics and quality decision. Flash, Pro, Ultra, and the experimental Thinking mode each occupy a distinct niche. Know the niche, route correctly, control costs.
Model selection is one of the highest-leverage decisions in an AI pipeline, and most developers get it wrong in one of two directions: they default to the most powerful model available because it feels safer, or they never leave the default model they first got working.
Both approaches leave money on the table — either in wasted compute cost or degraded output quality. Intelligent routing is the discipline of matching task complexity to model capability, and it is a skill, not a setting.
Gemini 2.0 Flash — The Production Workhorse
Flash is the default model for the vast majority of production Gemini workloads. It is optimized for speed and cost, and within that optimization it remains genuinely capable — not a degraded experience, but a different capability profile.
Flash is the correct choice when:
- You are processing high volumes of requests (thousands per hour or more)
- Response latency is user-facing and must be minimized
- The task is classification, summarization, extraction from structured inputs, or short-form generation
- Cost is a primary constraint alongside quality
Flash model IDs:
- gemini-2.0-flash — current stable production model
- gemini-1.5-flash — previous generation, still available
- gemini-2.0-flash-lite — smallest, fastest, lowest cost for ultra-high-volume workloads
The mistake with Flash is underestimating it. Developers often step up to Pro unnecessarily because Flash's first output was imperfect — before iterating on the prompt. A well-constructed prompt to Flash frequently matches or exceeds a mediocre prompt to Pro. Exhaust prompt engineering before escalating models.
Gemini 2.0 Pro — The Reasoning Workhorse
Pro exists at the inflection point where Flash's capabilities become genuinely insufficient for the task. It is not a luxury option — it is the appropriate tool when reasoning depth, code quality, or analytical nuance matters more than throughput.
Pro is the correct choice when:
- The task requires multi-step logical reasoning across complex inputs
- You are generating substantial code that must be correct on first output
- You are doing nuanced document analysis where subtle distinctions matter
- You need to extract structured data from ambiguous or inconsistently formatted inputs
- The task involves understanding relationships across long contexts
Pro model IDs:
- gemini-2.0-pro-exp — current experimental Pro (Google releases updates here first)
- gemini-1.5-pro — previous generation stable Pro
Pro costs roughly 17x more per token than Flash ($1.25 versus $0.075 per million input tokens; see the pricing table below). That cost is justified when Pro's reasoning depth prevents multiple retry cycles or downstream data quality failures. It is not justified when Flash with a better prompt would have worked.
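That tradeoff can be made concrete with a little arithmetic. The sketch below uses the per-million-token rates from the pricing table later in this lesson; the failure probabilities and the downstream cost of a bad output are illustrative assumptions, not measurements.

```python
# Per-million-token rates from the pricing table in this lesson.
FLASH = {"input": 0.075, "output": 0.30}
PRO = {"input": 1.25, "output": 5.00}

def call_cost(prices, input_tokens, output_tokens):
    """Dollar cost of a single call at the given per-1M-token rates."""
    return (input_tokens * prices["input"]
            + output_tokens * prices["output"]) / 1_000_000

def expected_task_cost(prices, input_tokens, output_tokens,
                       attempts, p_bad_output, bad_output_cost):
    """Token spend across attempts, plus the expected dollar cost of a
    wrong output slipping downstream (cleanup, re-runs, bad data)."""
    return (attempts * call_cost(prices, input_tokens, output_tokens)
            + p_bad_output * bad_output_cost)

# Illustrative: a 10k-in / 2k-out task where a bad output costs $1 to
# clean up downstream. Flash at an assumed 5% failure rate loses to Pro
# at an assumed 1%, even though Pro's tokens cost ~17x more.
flash_total = expected_task_cost(FLASH, 10_000, 2_000, 1, 0.05, 1.00)
pro_total = expected_task_cost(PRO, 10_000, 2_000, 1, 0.01, 1.00)
```

On raw token price Flash wins by a wide margin; the comparison only flips when the downstream cost of a wrong output is large relative to token spend, which is exactly the condition described above.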
Gemini Ultra — The Frontier Reserve
Ultra occupies the top of the capability spectrum. It is reserved for tasks where the highest reasoning quality available is the only relevant variable — academic research, complex scientific analysis, frontier benchmark tasks, and any scenario where the cost of a wrong output is substantially higher than the cost of Ultra's token pricing.
Ultra is the correct choice when:
- Pro output is measurably insufficient for the task and you have verified this
- The task is at the frontier of AI capability (complex mathematical reasoning, multi-domain synthesis at expert level)
- Cost is not the primary constraint
Most production pipelines never need to call Ultra. If you find yourself defaulting to Ultra across your workload, it is a signal to audit your routing logic — you are likely paying frontier-tier prices for tasks that Pro or Flash would handle adequately.
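The audit suggested above can start as something very simple: measure each model's share of your call log and flag Ultra overuse. This is a sketch; the 5% threshold is an illustrative assumption, not a Google guideline.

```python
from collections import Counter

def audit_routing(call_log, ultra_threshold=0.05):
    """Return each model's share of the call log and a flag that is True
    when Ultra's share exceeds the threshold -- a signal to re-audit the
    routing logic rather than keep paying frontier-tier prices."""
    counts = Counter(call_log)
    total = len(call_log)
    shares = {model: n / total for model, n in counts.items()}
    flag = shares.get("ultra", 0.0) > ultra_threshold
    return shares, flag
```

Run it over a day of routing decisions; a persistent flag means tasks that Pro or Flash could handle are being routed upward.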
Gemini 2.0 Flash Thinking — The Experimental Mode
Flash Thinking adds extended reasoning to the Flash model — the equivalent of chain-of-thought processing, but executed within the model rather than externalized in the prompt. The model works through problems step by step before producing its final answer.
Flash Thinking characteristics:
- Significantly slower than standard Flash (reasoning takes time)
- Substantially better on multi-step logical problems
- Experimental — not recommended for critical production workflows yet
- Model ID: gemini-2.0-flash-thinking-exp
Use Flash Thinking in development to evaluate whether extended reasoning improves output quality for a specific task class. If it does, and the latency is acceptable, it can replace Pro calls at Flash pricing for certain reasoning tasks.
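One way to run that development-time evaluation is to reduce it to an explicit decision rule. The function below is a sketch: the quality scores and latency budget are assumed to come from your own eval harness, and the tolerance value is a placeholder.

```python
def can_replace_pro(thinking_quality, pro_quality,
                    thinking_latency_s, latency_budget_s,
                    quality_tolerance=0.02):
    """Flash Thinking can stand in for Pro on a task class when its
    measured quality is within tolerance of Pro's AND its (slower)
    latency still fits the task's budget."""
    quality_ok = thinking_quality >= pro_quality - quality_tolerance
    latency_ok = thinking_latency_s <= latency_budget_s
    return quality_ok and latency_ok
```

Evaluate per task class, not globally: Flash Thinking may match Pro on multi-step logic while falling short elsewhere.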
The Routing Decision Framework
A practical routing decision process for any new task:
1. Start with Flash. Write the best prompt you can. Evaluate output quality.
2. Iterate on the prompt. If Flash output is insufficient, improve the prompt before escalating the model. Most quality failures are prompt failures.
3. Escalate to Pro. If Flash with optimized prompting still produces insufficient output, switch to Pro and re-evaluate.
4. Reserve Ultra. If Pro is insufficient and the task genuinely requires frontier capability, escalate to Ultra. If this happens frequently, re-examine your task decomposition — you may be asking one model to do a job that should be split into multiple smaller tasks.
The routing decision is also dynamic. Tasks that required Pro six months ago may run adequately on Flash today as the models improve. Revisit routing decisions quarterly. The cost savings from routing correctly compound over time.
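The escalation process above can be sketched as a ladder. `evaluate` stands in for whatever quality check you actually run (an eval set, a rubric, human review); scoring prompts numerically against a quality bar is an assumption for illustration, and the model IDs are the ones listed earlier in this lesson.

```python
# Cheapest tier first; only escalate after prompt iteration fails.
ESCALATION_LADDER = ["gemini-2.0-flash", "gemini-2.0-pro-exp", "ultra"]

def route_task(evaluate, prompt_variants, quality_bar=0.9):
    """Try every prompt variant at each tier (prompt iteration) before
    escalating to the next model. Returns (model, prompt) for the first
    combination that clears the bar, or None if nothing does -- a hint
    that the task may need decomposition, not a bigger model."""
    for model in ESCALATION_LADDER:
        for prompt in prompt_variants:
            if evaluate(model, prompt) >= quality_bar:
                return model, prompt
    return None
```

Because Flash and its prompt variants are exhausted first, the ladder never pays Pro prices for a task a better Flash prompt would have solved.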
Token Pricing Reference
| Model | Input (per 1M tokens) | Output (per 1M tokens) |
|---|---|---|
| Flash | $0.075 | $0.30 |
| Flash (cached) | $0.01875 | $0.30 |
| Pro | $1.25 | $5.00 |
| Ultra | Contact Google | Contact Google |
The cached input pricing for Flash is particularly significant for production workloads with repeated large context (system prompts, document bases). Cache aggressively and the effective cost drops to a fraction of the standard rate.
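As a rough sketch of that effect, the blended input rate can be computed from the two rates in the table; the 90% cache-hit share used below is an illustrative assumption.

```python
# Flash input rates from the pricing table, $ per 1M input tokens.
FLASH_INPUT = 0.075
FLASH_INPUT_CACHED = 0.01875

def effective_input_rate(cached_fraction):
    """Blended per-1M-token input rate given the share of input tokens
    served from cache (e.g. a large shared system prompt)."""
    return (cached_fraction * FLASH_INPUT_CACHED
            + (1 - cached_fraction) * FLASH_INPUT)
```

At a 90% hit rate the blended rate is about $0.024 per million input tokens, roughly a third of the standard rate, which is why caching a large shared context pays off so quickly at volume.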
Lesson 90 Drill
Take three tasks from your current AI workflows:
- Classify each task: Flash, Pro, or Ultra. Write down your reasoning.
- For any task you currently route to Pro, ask: have you actually tested it on Flash with an optimized prompt? If not, do it.
- Calculate the monthly cost difference between your current routing and an optimized routing where Flash handles everything it is capable of. That delta is your routing inefficiency tax.
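The cost-delta step of the drill is simple arithmetic. The sketch below uses the Flash and Pro rates from the pricing table; the call volume and token sizes are invented for illustration.

```python
# Per-million-token rates from the pricing table in this lesson.
FLASH = {"input": 0.075, "output": 0.30}
PRO = {"input": 1.25, "output": 5.00}

def monthly_cost(prices, calls, input_tokens, output_tokens):
    """Monthly dollar spend for a task at a given call volume."""
    return calls * (input_tokens * prices["input"]
                    + output_tokens * prices["output"]) / 1_000_000

# Illustrative: 500k calls/month at 2k tokens in, 500 out, currently
# routed to Pro but Flash-capable with a better prompt.
current = monthly_cost(PRO, 500_000, 2_000, 500)
optimized = monthly_cost(FLASH, 500_000, 2_000, 500)
routing_tax = current - optimized
```

With these assumed volumes the tax is $2,350 per month for a single task, which is why the drill is worth running on real numbers.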
Bottom Line
Flash is not a budget Pro. Pro is not a weak Ultra. Each model occupies a specific capability niche, and routing correctly is a discipline that pays dividends at scale. Default to Flash, step up to Pro when reasoning depth requires it, reserve Ultra for frontier-difficulty tasks. Revisit routing decisions as models improve. The operators with the best cost-quality ratios are the ones who treat model routing as a continuous optimization problem, not a one-time configuration choice.