Most people pick an AI model the way they pick anything else: they compare the price tag. Model A is one price per million tokens, Model B is half that, so Model B is the cheap one. A study from researchers at Stanford, Berkeley, and Microsoft says that instinct is wrong often enough to be dangerous: in 32% of head-to-head comparisons, the model with the lower listed price actually ran up the higher total bill, sometimes by as much as 28 times.
Their headline example: Gemini 3 Flash is listed as 80% cheaper than GPT-5.4. Across the tasks they tested, it cost 38% more.
The researchers call this the price reversal phenomenon, and it matters for any business choosing a model by its sticker price, which is to say almost everyone.
Why the sticker price lies
A price per token looks like a unit price, the way fuel is priced per litre. The problem is that you do not control how many "litres" a task burns, and modern reasoning models burn wildly different amounts for the same job. Two things drive the gap:
- Thinking tokens. Reasoning models generate a hidden internal monologue before they answer. You pay for those tokens even though you never see them. On the very same query, the study found one model using 900% more thinking tokens than another, which is ten times as many. A model with a low headline price that thinks ten times as hard is not actually cheap.
- Agent turns. When a model runs as an agent, it goes back and forth with tools and its environment. One model took 10 times as many interaction turns as another to finish the same task. Every turn is more tokens, and the cheap-looking model can quietly take the long way around.
So the real cost is the listed price multiplied by how much the model actually does, and that second number swings far more than the first.
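To make that multiplication concrete, here is a minimal sketch in Python. It assumes a single blended price per million tokens (real price lists usually split input and output), and every figure in it is hypothetical, chosen only to mirror the shape of the headline example rather than taken from the study.

```python
def task_cost(price_per_million: float, tokens_used: int) -> float:
    """Dollar cost of one task: listed price times tokens actually consumed."""
    return price_per_million * tokens_used / 1_000_000


def break_even_token_ratio(discount: float) -> float:
    """How many times more tokens a discounted model can use before it costs more.

    `discount` is the fraction off the other model's per-token price
    (0.80 means 80% cheaper).
    """
    if not 0 <= discount < 1:
        raise ValueError("discount must be at least 0 and below 1")
    return 1 / (1 - discount)


# Hypothetical prices and token counts, for illustration only.
pricey = task_cost(price_per_million=10.00, tokens_used=2_000)
# 80% cheaper per token, but uses 6.9 times as many tokens on the same task.
bargain = task_cost(price_per_million=2.00, tokens_used=13_800)

print(f"pricey model:  ${pricey:.4f} per task")
print(f"bargain model: ${bargain:.4f} per task ({bargain / pricey:.2f}x the cost)")
print(f"break-even token ratio at 80% off: {break_even_token_ratio(0.80):.1f}x")
```

Under these assumptions, a model that is 80% cheaper per token stops being cheaper once it uses more than five times the tokens. At 6.9 times the tokens it costs about 38% more, which is the kind of gap the headline example describes.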
It is worse than that: you cannot reliably predict it
You might hope to measure each model once and settle the question. The study closes that door too. Run the same query through the same model more than once and the thinking-token count varies by as much as 9.7 times. The cost is not a fixed figure; it is a distribution with a wide spread, what the authors call an "irreducible noise floor." There is no clever formula that turns a list price into the bill you will actually get.
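If you want to see that spread on your own prompts, one rough approach is to repeat the same query several times and compare the smallest and largest token counts. The sketch below assumes you have already collected those counts from your provider's usage data. The function name `spread_summary` and the sample numbers are hypothetical, not the study's method or data.

```python
import statistics


def spread_summary(token_counts: list[int]) -> dict[str, float]:
    """Summarise run-to-run variation for repeated runs of one query on one model."""
    if not token_counts or min(token_counts) <= 0:
        raise ValueError("need at least one run, and every count must be positive")
    lowest, highest = min(token_counts), max(token_counts)
    return {
        "min": lowest,
        "median": statistics.median(token_counts),
        "max": highest,
        # How many times larger the heaviest run was than the lightest.
        "max_over_min": highest / lowest,
    }


# Hypothetical thinking-token counts from five runs of the same prompt.
runs = [1_000, 1_400, 2_300, 3_100, 9_700]
summary = spread_summary(runs)

for name, value in summary.items():
    print(f"{name}: {value}")
```

A single run from this sample could land anywhere between the lightest and the heaviest result, which is why one measurement per model settles very little.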
What this means for a small business
The takeaway is not "reasoning models are a trap." They are often worth every token. The takeaway is narrower and more useful:
The published price per token tells you almost nothing about what a model will cost you on your work.
The cost ranking depends on your tasks, your prompts, and how your agents are built. The cheapest model for a math benchmark can be the most expensive for your support inbox. The only honest comparison is to run your own workload through each candidate and watch the actual bill.
That sounds like a lot of work, and done with a stack of separate provider accounts and spreadsheets, it is. This is exactly where the tooling matters.
What to actually do
- Never procure on list price alone. Treat the per-token price as one input, not the decision.
- Test on your real tasks. Run the jobs you actually have (support replies, document summaries, the agent you are about to deploy), not a generic benchmark.
- Measure the full bill, not the average. Because cost is a distribution, look at the spread and the worst case, not just a tidy mean.
- Re-check when anything changes. A new prompt, a new agent design, or a model update can flip the ranking. Yesterday's cheapest is not guaranteed to be today's.
- Prefer tools that show actual per-request cost. If you cannot see what a task cost, you cannot do any of the above.
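As a rough illustration of the checklist above, the sketch below compares two candidates on the mean, the 95th percentile, and the worst case of their per-request costs. It assumes you can export a list of per-request costs in dollars for each model. The model names, the figures, and the helper names are all hypothetical, and the nearest-rank percentile is just one simple choice among several.

```python
import math
import statistics


def percentile(values: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest value with at least pct% of values at or below it."""
    if not values:
        raise ValueError("need at least one value")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def cost_profile(costs: list[float]) -> dict[str, float]:
    """Mean, 95th percentile, and worst case for one model's per-request costs."""
    return {
        "mean": statistics.mean(costs),
        "p95": percentile(costs, 95),
        "worst": max(costs),
    }


def rank_models(costs_by_model: dict[str, list[float]], key: str) -> list[str]:
    """Model names ordered from cheapest to most expensive on the chosen statistic."""
    profiles = {name: cost_profile(costs) for name, costs in costs_by_model.items()}
    return sorted(profiles, key=lambda name: profiles[name][key])


# Hypothetical per-request costs in dollars from running the same ten jobs on each model.
costs_by_model = {
    "model-x": [0.02] * 9 + [0.20],  # usually cheap, occasionally very expensive
    "model-y": [0.05] * 10,          # steady
}

print("cheapest by mean:", rank_models(costs_by_model, "mean"))
print("cheapest by p95: ", rank_models(costs_by_model, "p95"))
```

In this made-up sample the two rankings disagree: the model with the lower average is the more expensive one at the 95th percentile. Rerunning the same comparison after a prompt change or a model update is a cheap way to catch a ranking that has flipped.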
This is the cost-side companion to picking the right model in the first place (see right-sizing models for agentic workflows) and to keeping team-wide usage in check (see controlling your team's AI spending).
Where Crewdle fits, and where it does not
Let us be precise about this, because the study rewards precision. Crewdle does not make any model cheaper. The price of a token is set by the provider, and no platform changes that.
What Crewdle gives you is the thing the study says you actually need: the visibility to measure real cost on your own tasks and use cases.
- Run the same job on different models. In Crewdle Connect you choose the model behind each agent, so you can point the same task at Claude, GPT, or Gemini and compare what they actually cost you, not what their price pages claim.
- Pay for what is done, not what is listed. Crewdle is pay-as-you-go: you are billed on actual usage, including the thinking tokens and the extra turns, so the real cost is the cost you see.
- See it per agent and per task. Crewdle Admin shows where the spend actually goes, which is the per-request monitoring the researchers call for, on one screen instead of across a dozen invoices.
In other words, Crewdle does not win the price-per-token race for you. It lets you stop guessing and find out which model is genuinely cheapest for your use case, which, as the study shows, is the only number that matters.
The takeaway
A lower price tag is not a lower bill. Independent researchers measured 8 frontier models across 12 tasks and found the cheaper-listed model lost on cost roughly a third of the time, occasionally by as much as 28 times, with no reliable way to predict it from the price. For a small business the lesson is freeing, not scary: stop agonizing over price pages and measure what your own work actually costs. Then pick the tools that let you see it.
Start for free and measure the real cost of your AI on the tasks that matter to you.