Most organisations begin with the wrong question: which AI model is the best?
The right question is: best for what?
That sounds like a semantic distinction. It is a fundamental difference in approach. Start with “which model is the best” and you end up running a frontier model on every task — and a bill that climbs fast for work a much cheaper, faster model would have handled without effort. Start with “best for which task” and you build a model strategy that scales.
Model quality is the first pillar of AI as an accelerator. Not because the model decides everything — the other two pillars matter just as much — but because a badly fitted model undermines the value of every investment above it.
The misunderstanding about frontier models
There is a quiet assumption in enterprise AI right now: when in doubt, pick the biggest model. GPT-5, Claude Opus, Gemini Ultra. Maximum capability, minimum second-guessing. It feels safe.
But the assumption is economically wrong — and architecturally wrong too.
Frontier models are designed for the hardest work: complex reasoning, creative synthesis, multi-step planning across long contexts, nuanced decisions that require weighing multiple perspectives. They are slow, expensive per token, and variable in behaviour on simple, deterministic tasks. That variability — a frontier model being a little creative on a task that demands exact extraction — leads to more retries, higher costs and longer chains in production systems.
The reality: 80% of typical enterprise workloads do not require frontier-scale reasoning. Summarising, classifying, extracting, rerouting, formatting — these are tasks that smaller, specialised models execute with equal or better accuracy, at a fraction of the cost and with lower latency.
A concrete example: a small model trained specifically on customer-service classification can beat GPT-5 on support-ticket categorisation, and run a hundred times faster. Not because it is smarter. Because it fits the task.
Benchmark ≠ business performance
Here is a second, expensive misunderstanding.
Organisations pick models based on benchmark scores: MMLU, HumanEval, LMSYS Chatbot Arena. Those benchmarks are designed for scientific comparison under controlled conditions — engineered prompts, optimised context, carefully curated test questions. They measure what is possible at maximum effort. They do not measure what is reliable in production under realistic conditions.
Production works differently. Prompts are in natural language, without handcrafted examples. Queries are ambiguous. Latency counts. Error rates count. The model that excels on a benchmark can behave significantly differently in production when the prompt structure deviates from the benchmark format.
What you actually measure in production: latency, cost per request, hallucination rate, tool-call success rate, task completion rate, escalation rate. Those are the metrics that determine business value. None of them appear on a leaderboard.
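As a rough sketch of how those metrics could be computed from your own request logs, the Python below aggregates one record per request. The `RequestLog` fields and the `summarise` function are hypothetical names invented for this illustration. In particular, the `hallucinated` flag assumes you already have some review step or automated check that labels each response.

```python
import math
from dataclasses import dataclass
from statistics import median


@dataclass
class RequestLog:
    """One production request. All field names are assumptions for this sketch."""
    latency_s: float      # wall-clock time for the request
    cost_usd: float       # what the request cost
    tool_calls: int       # tool calls attempted
    tool_calls_ok: int    # tool calls that succeeded
    completed: bool       # did the task finish successfully?
    escalated: bool       # was it handed to a human?
    hallucinated: bool    # flagged by a reviewer or automated check


def summarise(logs: list[RequestLog]) -> dict[str, float]:
    """Aggregate the production metrics named in the text."""
    if not logs:
        raise ValueError("need at least one request log")
    n = len(logs)
    latencies = sorted(r.latency_s for r in logs)
    # Nearest-rank 95th percentile: the value at position ceil(0.95 * n).
    p95 = latencies[max(0, math.ceil(0.95 * n) - 1)]
    total_tool_calls = sum(r.tool_calls for r in logs)
    ok_tool_calls = sum(r.tool_calls_ok for r in logs)
    return {
        "p50_latency_s": median(latencies),
        "p95_latency_s": p95,
        "cost_per_request_usd": sum(r.cost_usd for r in logs) / n,
        "hallucination_rate": sum(r.hallucinated for r in logs) / n,
        # Undefined (NaN) when no tool calls were made at all.
        "tool_call_success_rate": (
            ok_tool_calls / total_tool_calls if total_tool_calls else float("nan")
        ),
        "task_completion_rate": sum(r.completed for r in logs) / n,
        "escalation_rate": sum(r.escalated for r in logs) / n,
    }
```

Running the same summary per model and per task type gives you a comparison grounded in your own traffic instead of a leaderboard.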
The conclusion is blunt: evaluate models on your use cases, with your data, under your production conditions. A high Chatbot Arena score is a starting point for exploration, not a selection criterion.
Model routing: the professional approach
If “one model for everything” is the wrong approach, what is the right one?
The approach that has become standard in 2026 among organisations using AI seriously is called model routing: dynamically assigning each query or task to the model best suited for it — based on complexity, required accuracy, latency requirements and cost.
The principle is simple: not every request deserves the same reasoning weight. Summarising an email calls for a different model than analysing contract risk. Classifying intent calls for a different model than generating a strategic proposal. Build routing into your architecture and you get the best of both worlds: speed and low cost for simple tasks, capacity and depth for complex ones.
In practice it looks like this: you design a decision tree — or, more sophisticated, a lightweight classifier — that evaluates incoming tasks on complexity and type, and forwards them to the right model. Simple classification and extraction go to a fast, cheap model. Multi-step reasoning goes to a frontier model. Domain-specific work goes to a fine-tuned model trained on your data.
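The decision-tree variant described above might look something like this minimal Python sketch. The tier names, the `Task` fields and the rule order are all assumptions made for illustration; a real router would derive these signals from the request itself, or replace the rules with a lightweight classifier.

```python
from dataclasses import dataclass
from enum import Enum


class Tier(Enum):
    """Hypothetical model tiers; the values are placeholders, not real model IDs."""
    SMALL = "small-fast-model"
    FINE_TUNED = "domain-fine-tuned-model"
    FRONTIER = "frontier-model"


@dataclass
class Task:
    kind: str                      # e.g. "classify", "extract", "plan"
    domain_specific: bool = False  # needs knowledge from your own data?
    reasoning_steps: int = 1       # rough estimate of task complexity


# Task kinds assumed to be simple enough for a small model.
SIMPLE_KINDS = {"classify", "extract", "summarise", "format"}


def route(task: Task) -> Tier:
    """A three-branch decision tree mirroring the rules in the text."""
    if task.domain_specific:
        return Tier.FINE_TUNED          # domain work goes to the fine-tuned model
    if task.kind in SIMPLE_KINDS and task.reasoning_steps <= 1:
        return Tier.SMALL               # simple classification and extraction
    return Tier.FRONTIER                # multi-step reasoning and everything else


print(route(Task(kind="classify")).value)
print(route(Task(kind="plan", reasoning_steps=5)).value)
```

Defaulting to the frontier tier when no rule matches is a deliberate choice in this sketch: an unrecognised task costs more but is less likely to fail silently.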
The result: cost reduction without quality loss, lower end-to-end latency, and a system that scales without costs rising in proportion.
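To make the cost effect tangible, here is a back-of-the-envelope calculation in Python. The per-request prices are invented placeholders, not real vendor pricing, and the 80% share is simply the figure used earlier in this post. The function name `blended_cost_per_request` is likewise hypothetical.

```python
def blended_cost_per_request(
    simple_share: float, small_cost: float, frontier_cost: float
) -> float:
    """Average cost per request when simple tasks go to the small model."""
    if not 0.0 <= simple_share <= 1.0:
        raise ValueError("simple_share must be between 0 and 1")
    return simple_share * small_cost + (1.0 - simple_share) * frontier_cost


# Hypothetical prices per request, chosen only to illustrate the arithmetic.
SMALL_COST = 0.001
FRONTIER_COST = 0.02

routed = blended_cost_per_request(0.8, SMALL_COST, FRONTIER_COST)
saving = 1.0 - routed / FRONTIER_COST
print(f"routed: {routed:.4f} per request, saving vs frontier-only: {saving:.0%}")
```

With these assumed numbers the routed average comes to about 0.0048 per request, roughly 76% below sending everything to the frontier model. The real saving depends entirely on your own task mix and prices, and only holds if the small model's quality on the simple tasks has been verified.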
The four dimensions of model quality
Quality is not one thing. When evaluating a model for a specific task, four dimensions matter:
Accuracy on task type. How well does the model perform on the specific combination of input and output your use case requires? Reasoning, extraction, generation and classification each have their own performance characteristics.
Cost and latency. Not an afterthought but an architectural constraint. In agentic systems where multiple LLM calls chain together, latency accumulates quickly. Two seconds per step becomes twenty seconds across a ten-step chain. For an interactive workflow, that is rarely acceptable.
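The arithmetic is easy to check, and easy to extend to a mixed chain. In the Python sketch below, the 2-second step comes from the example above, while the 0.3-second figure for a smaller model and the 7-to-3 split are assumptions made purely for illustration.

```python
def chain_latency_s(step_latencies_s: list[float]) -> float:
    """End-to-end latency of a sequential chain: the steps simply add up."""
    return sum(step_latencies_s)


# Ten sequential steps at two seconds each, as in the text.
all_frontier = chain_latency_s([2.0] * 10)

# Hypothetical mix: seven simple steps on a 0.3 s model, three on the 2 s model.
mixed = chain_latency_s([0.3] * 7 + [2.0] * 3)

print(f"all frontier: {all_frontier:.1f} s, mixed: {mixed:.1f} s")
```

This only models steps that run strictly one after another. Steps that can run in parallel, and retries after failed steps, would change the total in opposite directions.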
Controllability. Can you fine-tune the model on your domain? Does it run on-premises when privacy or compliance demand it? How stable is its behaviour across version updates?
Vendor independence. Is the choice reversible? Can you switch when a better alternative appears? Proprietary fine-tuning pipelines and closed APIs make switching harder than it needs to be.
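One common way to keep the choice reversible is to have your application code depend on a small interface of your own rather than on a vendor SDK. The Python sketch below shows the idea with `typing.Protocol`. `TextModel`, `EchoModel` and `categorise_ticket` are hypothetical names, and `EchoModel` is a stand-in that calls no real API.

```python
from typing import Protocol


class TextModel(Protocol):
    """The only surface the rest of the application is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in for a vendor adapter; a real one would wrap that vendor's SDK."""

    def __init__(self, name: str) -> None:
        self.name = name

    def complete(self, prompt: str) -> str:
        # Canned reply so the sketch runs without network access.
        return f"[{self.name}] {prompt}"


def categorise_ticket(model: TextModel, ticket: str) -> str:
    """Business logic that knows nothing about which vendor sits behind it."""
    return model.complete(f"Categorise this support ticket: {ticket}")


# Switching vendors means swapping the adapter, not rewriting the caller.
print(categorise_ticket(EchoModel("vendor-a"), "My invoice is wrong"))
print(categorise_ticket(EchoModel("vendor-b"), "My invoice is wrong"))
```

An interface like this does not remove every switching cost (prompts and fine-tuned weights often remain model-specific), but it confines the change to one adapter.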
What this means for your organisation
Model quality as a strategic choice does not begin with a model — it begins with a task analysis. Which AI tasks do you want to run? Which require deep reasoning? Which require speed and volume? Which require domain-specific accuracy that justifies fine-tuning?
Then: which model performs best on each class of task, measured under your production conditions? Not on a benchmark. Not in a demo. In an evaluation that reflects your reality.
Then: how do you build in routing, so that every task reaches the right model automatically, without a human in between?
This is not a one-off decision. The landscape moves fast: new models, better pricing, higher quality on specific domains. The organisations that build a robust model strategy are also the ones that can evolve with that landscape without rebuilding their architecture.
The model is the starting point. The harness — the layer that connects the model to your processes — decides whether that start produces results.
What follows
In the next post: Pillar 2 — Harness Strength. The orchestration layer decides whether your AI agent does work or merely simulates it.