Agentic workflows are transforming how enterprises build with AI. Instead of a single prompt-response exchange, modern systems orchestrate dozens — sometimes hundreds — of LLM calls per user session: planning, tool selection, data extraction, summarization, validation, and reflection.
This is powerful. It is also extraordinarily expensive when done wrong.
The default approach — routing every call through a frontier model like Claude Opus 4.6 or GPT-5.4 Pro — can burn through budgets at a staggering rate. Not because frontier models are overpriced, but because most of those calls do not need frontier-level intelligence. A JSON parser does not need a PhD. A field validator does not need a philosopher.
This article presents the Model Right-Sizing Framework: a structured methodology for matching model capability to task complexity at every step of an agentic workflow. The economic impact is not marginal — it is the difference between a viable product and one that cannot scale.
The Multiplication Problem
In a traditional single-call architecture, model cost is simple: one request in, one response out. If you pay 5 USD per million input tokens and 25 USD per million output tokens (Claude Opus 4.6 pricing), a typical exchange of 2,000 input tokens and 2,000 output tokens costs roughly 0.06 USD.
Agentic workflows change the calculus entirely. A single user session might involve:
- 3–5 planning steps (decomposing the task into sub-goals)
- 5–15 tool calls (searching, retrieving, computing)
- 3–10 extraction/parsing steps (structuring raw data into usable formats)
- 2–5 validation passes (checking outputs against constraints)
- 1–3 reflection loops (evaluating whether the answer is sufficient)
A moderate agent workflow of 20 LLM calls per session, at the same Opus-level pricing, costs roughly 1.20 USD per session. At 10,000 daily sessions, that is 12,000 USD per day — or 4.4 million USD per year.
The question is not whether agentic AI works. It does. The question is whether every one of those 20 calls needs a 5/25 USD-per-MTok model.
The answer, almost universally, is no.
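The multiplication effect is easy to sketch in code. A minimal cost model in Python, using the Opus-level rates quoted above (the per-call token counts are illustrative assumptions):

```python
# Minimal cost model: every call priced at frontier (Opus-level) rates.
# The token counts per call are illustrative assumptions.
INPUT_USD_PER_MTOK = 5.00
OUTPUT_USD_PER_MTOK = 25.00

def call_cost(tok_in: int, tok_out: int) -> float:
    """Cost of one LLM call in USD."""
    return (tok_in * INPUT_USD_PER_MTOK + tok_out * OUTPUT_USD_PER_MTOK) / 1_000_000

per_call = call_cost(2_000, 2_000)     # 0.06 USD per exchange
per_session = 20 * per_call            # 20 calls per session -> 1.20 USD
per_year = per_session * 10_000 * 365  # 10,000 sessions/day -> 4,380,000 USD
```

Nothing in this model is exotic: the annual figure is just one per-call price multiplied through calls, sessions, and days, which is exactly why per-call savings compound so aggressively.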
The Model Tier Landscape (March 2026)
To understand right-sizing, we need to understand what is available. The current model landscape falls into four distinct tiers:
Tier 1: Frontier Reasoning
These models excel at complex multi-step reasoning, nuanced judgment, and novel problem-solving.
| Model | Input (USD) | Output (USD) | Strength |
|---|---|---|---|
| Claude Opus 4.6 | 5.00/MTok | 25.00/MTok | Deep reasoning, complex analysis |
| GPT-5.4 Pro | 30.00/MTok | 180.00/MTok | Maximum capability |
Tier 2: Balanced Performance
Strong general-purpose models that handle most production tasks well.
| Model | Input (USD) | Output (USD) | Strength |
|---|---|---|---|
| Claude Sonnet 4.6 | 3.00/MTok | 15.00/MTok | Coding, analysis, tool use |
| GPT-5.4 | 2.50/MTok | 15.00/MTok | Multi-step reasoning, planning |
| Gemini 3.1 Pro | 2.00/MTok | 12.00/MTok | Long-context reasoning, 1M window |
| Magistral Medium | 2.00/MTok | 5.00/MTok | Domain-specific reasoning, multilingual |
| Qwen 3 Max | 1.20/MTok | 6.00/MTok | Strong reasoning, open-weight ecosystem |
Tier 3: Cost-Optimized
Fast, affordable models ideal for well-defined, narrower tasks.
| Model | Input (USD) | Output (USD) | Strength |
|---|---|---|---|
| Claude Haiku 4.5 | 1.00/MTok | 5.00/MTok | Classification, extraction, routing |
| Mistral Large 3 | 0.50/MTok | 1.50/MTok | General purpose, European provider |
| DeepSeek R1 | 0.45/MTok | 2.15/MTok | Reasoning at budget pricing |
| Qwen 3.5 Plus | 0.40/MTok | 2.40/MTok | Balanced MoE, open-weight |
| Magistral Small | 0.50/MTok | 1.50/MTok | Lightweight reasoning, transparent |
| Gemini 3 Flash | 0.50/MTok | 3.00/MTok | Fast inference, multimodal |
| GPT-5 mini | 0.25/MTok | 2.00/MTok | Structured output, validation |
| Gemini 3.1 Flash-Lite | 0.25/MTok | 1.50/MTok | Lightweight tasks, multimodal |
| Llama 4 Maverick | 0.15/MTok | 0.60/MTok | Open-source, self-hosted option |
Tier 4: Ultra-Efficient
Minimal-cost models for the simplest operations.
| Model | Input (USD) | Output (USD) | Strength |
|---|---|---|---|
| Qwen 3.5 Flash | 0.10/MTok | 0.40/MTok | Fast MoE, open-weight |
| Mistral Small 3.2 | 0.06/MTok | 0.18/MTok | Fast, self-hosted option |
| GPT-5 nano | 0.05/MTok | 0.40/MTok | Simple classification, formatting |
| Llama 4 Scout | 0.08/MTok | 0.30/MTok | Parsing, entity extraction |
The price spread from Tier 1 to Tier 4 is enormous. Claude Opus 4.6 output tokens cost 83x more than Llama 4 Scout output tokens. Even within the same provider family, Claude Opus 4.6 costs 5x more than Haiku 4.5.
The Right-Sizing Framework
The framework consists of four steps: Decompose, Classify, Assign, and Measure (DCAM).
Step 1: Decompose the Workflow
Break your agentic workflow into its atomic LLM calls. Each call should be categorized by its function:
- Routing — Deciding which tool or sub-agent to invoke
- Extraction — Pulling structured data from unstructured input
- Transformation — Converting data between formats (JSON, SQL, summaries)
- Validation — Checking outputs against schemas or business rules
- Reasoning — Drawing conclusions, making judgments, synthesizing information
- Generation — Producing user-facing natural language output
- Reflection — Evaluating quality and deciding whether to retry
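One lightweight way to make the decomposition concrete is to represent each atomic call as data. A sketch; the `WorkflowStep` shape and the example step names are illustrative, not part of any framework API:

```python
from dataclasses import dataclass
from enum import Enum

class CallCategory(Enum):
    ROUTING = "routing"
    EXTRACTION = "extraction"
    TRANSFORMATION = "transformation"
    VALIDATION = "validation"
    REASONING = "reasoning"
    GENERATION = "generation"
    REFLECTION = "reflection"

@dataclass
class WorkflowStep:
    name: str                 # e.g. "clause_extraction"
    category: CallCategory    # function of this call in the pipeline
    calls_per_session: int    # how many times it fires per session

# A decomposed workflow is just a list of steps to classify and price.
workflow = [
    WorkflowStep("route_to_tool", CallCategory.ROUTING, 5),
    WorkflowStep("parse_results", CallCategory.EXTRACTION, 8),
    WorkflowStep("final_answer", CallCategory.GENERATION, 1),
]
total_calls = sum(s.calls_per_session for s in workflow)  # 14
```

Once the workflow exists as data rather than as implicit prompt chains, classification and tier assignment become table lookups rather than tribal knowledge.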
Step 2: Classify Task Complexity
Each task type maps to a complexity level:
| Complexity | Description | Examples |
|---|---|---|
| Low | Deterministic or near-deterministic; the output format is fixed and the reasoning is minimal | JSON parsing, field validation, intent classification, keyword extraction |
| Medium | Requires understanding context and applying learned patterns, but the task is well-bounded | Summarization, entity extraction from documents, SQL generation, code completion |
| High | Requires multi-step reasoning, judgment under ambiguity, or creative synthesis | Research analysis, strategic planning, complex code architecture, nuanced customer responses |
| Critical | High-complexity tasks where quality failures carry outsized business, legal, or safety consequences | Contract risk assessment, financial recommendations, safety-sensitive decisions |
Step 3: Assign Model Tiers
The assignment rule is straightforward:
| Task Complexity | Recommended Tier | Rationale |
|---|---|---|
| Low | Tier 4 (Ultra-Efficient) | These tasks need pattern matching, not reasoning. Overspending here is pure waste. |
| Medium | Tier 3 (Cost-Optimized) | Haiku-class models handle summarization, extraction, and structured generation reliably. |
| High | Tier 2 (Balanced) | Sonnet-class models cover the vast majority of complex production tasks. |
| Critical | Tier 1 (Frontier) | Reserve frontier models for tasks where quality failures have outsized consequences. |
The key insight — supported by NVIDIA's research on small language models for agentic AI — is that most agent calls are Low or Medium complexity. In a typical 20-call agent session, only 2–4 calls genuinely require Tier 1 or Tier 2 reasoning. The rest are routing, parsing, validation, and extraction.
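The assignment rule can be encoded as a pair of lookup tables. A sketch; the default category-to-complexity mapping below is an illustrative starting point a team would tune, not a fixed prescription:

```python
# Sketch of the DCAM assignment rule (Step 3). The category-to-complexity
# defaults are illustrative starting points, meant to be tuned with data.
CATEGORY_COMPLEXITY = {
    "routing": "low",
    "extraction": "low",
    "transformation": "medium",
    "validation": "low",
    "reasoning": "high",
    "generation": "medium",
    "reflection": "medium",
}
COMPLEXITY_TIER = {"low": 4, "medium": 3, "high": 2, "critical": 1}

def assign_tier(category: str, critical: bool = False) -> int:
    """Map a call category to a model tier; criticality escalates to Tier 1."""
    if critical:
        return COMPLEXITY_TIER["critical"]
    return COMPLEXITY_TIER[CATEGORY_COMPLEXITY[category]]

assign_tier("routing")                   # Tier 4: ultra-efficient
assign_tier("reasoning")                 # Tier 2: balanced
assign_tier("reasoning", critical=True)  # Tier 1: frontier
```

The explicit `critical` override matters: criticality is a property of the business context, not of the task category, so it should never be inferred silently.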
Step 4: Measure and Iterate
Deploy with your initial tier assignments, then measure:
- Quality metrics per step — Is the lower-tier model achieving acceptable accuracy for each task?
- Cost per session — What is the blended cost across all tiers?
- Latency per step — Smaller models are often faster, which improves user experience
- Failure rate — Are any steps failing disproportionately with the assigned model?
Adjust tier assignments based on data. Some tasks you assumed were Medium may work fine at Low. Some you classified as Low may need Medium. The framework is iterative, not prescriptive.
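A minimal sketch of the measurement loop, assuming production logs are available as records with a step name, success flag, cost, and latency (the record shape is an assumption, not a standard):

```python
from collections import defaultdict

def summarize(records):
    """Aggregate per-step metrics from call logs.

    records: iterable of dicts with keys step, ok (bool), cost_usd, latency_s.
    """
    stats = defaultdict(lambda: {"calls": 0, "failures": 0, "cost": 0.0, "latency": 0.0})
    for r in records:
        s = stats[r["step"]]
        s["calls"] += 1
        s["failures"] += 0 if r["ok"] else 1
        s["cost"] += r["cost_usd"]
        s["latency"] += r["latency_s"]
    return {
        step: {
            "failure_rate": s["failures"] / s["calls"],
            "avg_cost": s["cost"] / s["calls"],
            "avg_latency": s["latency"] / s["calls"],
        }
        for step, s in stats.items()
    }

logs = [
    {"step": "validate", "ok": True,  "cost_usd": 0.0025, "latency_s": 0.4},
    {"step": "validate", "ok": False, "cost_usd": 0.0025, "latency_s": 0.5},
]
report = summarize(logs)  # a 0.5 failure_rate flags a candidate for a tier upgrade
```

A high failure rate on a cheap step is the signal to move that step up a tier; a near-zero failure rate on an expensive step is the signal to try moving it down.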
The Economics: A Worked Example
Consider a document analysis agent that processes uploaded contracts. The workflow has 18 LLM calls per document:
Before Right-Sizing (All Opus 4.6)
| Step | Calls | Avg Tokens | Cost/Call | Total |
|---|---|---|---|---|
| Document chunking | 1 | 2,000 in / 500 out | 0.0225 | 0.0225 |
| Section classification | 6 | 1,000 in / 100 out | 0.0075 | 0.0450 |
| Clause extraction | 4 | 1,500 in / 300 out | 0.0150 | 0.0600 |
| Risk assessment | 3 | 2,000 in / 800 out | 0.0300 | 0.0900 |
| Summary generation | 2 | 3,000 in / 1,000 out | 0.0400 | 0.0800 |
| Quality validation | 2 | 1,500 in / 200 out | 0.0125 | 0.0250 |
| Total per document | 18 | | | 0.3225 USD |
At 1,000 documents per day: 322.50 USD/day — 117,712 USD/year
After Right-Sizing (Tiered)
| Step | Tier | Model | Calls | Cost/Call | Total |
|---|---|---|---|---|---|
| Document chunking | 4 | GPT-5 nano | 1 | 0.0003 | 0.0003 |
| Section classification | 4 | GPT-5 nano | 6 | 0.0001 | 0.0006 |
| Clause extraction | 3 | Haiku 4.5 | 4 | 0.0030 | 0.0120 |
| Risk assessment | 1 | Opus 4.6 | 3 | 0.0300 | 0.0900 |
| Summary generation | 2 | Sonnet 4.6 | 2 | 0.0240 | 0.0480 |
| Quality validation | 3 | Haiku 4.5 | 2 | 0.0025 | 0.0050 |
| Total per document | | | 18 | | 0.1559 USD |
At 1,000 documents per day: 155.90 USD/day — 56,904 USD/year
Annual savings: 60,808 USD (52%)
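The tables above can be recomputed directly from per-MTok pricing. A sketch; the short model keys are local labels for the models in the tables, not provider API identifiers:

```python
# Recompute the worked example from per-MTok pricing.
PRICE = {  # (input USD/MTok, output USD/MTok)
    "opus":   (5.00, 25.00),
    "sonnet": (3.00, 15.00),
    "haiku":  (1.00, 5.00),
    "nano":   (0.05, 0.40),
}

def step_cost(model: str, calls: int, tok_in: int, tok_out: int) -> float:
    pin, pout = PRICE[model]
    return calls * (tok_in * pin + tok_out * pout) / 1_000_000

# (calls, input tokens, output tokens, before-model, after-model)
steps = [
    (1, 2000, 500,  "opus", "nano"),    # document chunking
    (6, 1000, 100,  "opus", "nano"),    # section classification
    (4, 1500, 300,  "opus", "haiku"),   # clause extraction
    (3, 2000, 800,  "opus", "opus"),    # risk assessment stays frontier
    (2, 3000, 1000, "opus", "sonnet"),  # summary generation
    (2, 1500, 200,  "opus", "haiku"),   # quality validation
]
before = sum(step_cost(b, c, i, o) for c, i, o, b, _ in steps)  # 0.3225
after = sum(step_cost(a, c, i, o) for c, i, o, _, a in steps)   # ~0.156
```

The recomputed after-cost (0.15584 USD) differs from the table's 0.1559 only because the table rounds per call; the savings ratio lands at roughly 52% either way.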
And this is a conservative example. When Tier 4 models handle the majority of calls — as they do in many real workflows — savings of 60–90% are achievable. Model-routing studies report that sending roughly 90% of requests to budget models, with only the remaining 10% escalated to frontier models, can cut costs by 86% without meaningful quality degradation.
The Carbon Connection
Cost savings from right-sizing are not purely financial. In our companion article, How to Measure the Carbon Footprint of Your LLM API Spend, we established a framework for converting token expenditure into gCO₂e estimates using datacenter PUE, grid emission factors, and energy-per-token ranges.
The connection is direct: cheaper tokens from smaller models also consume less energy per inference. NVIDIA's research demonstrates that running a 1–3B parameter model is 10–30x cheaper in compute (FLOPs, energy, GPU-hours) than running a 70–175B parameter model. This means that right-sizing your model selection does not just reduce your API bill — it proportionally reduces the carbon footprint of your AI operations.
For an organization processing 1,000 documents daily:
- All-Opus configuration: ~117K USD/year in API costs, with proportionally higher energy consumption from frontier-class GPU utilization
- Right-sized configuration: ~57K USD/year, with the majority of inference running on smaller, more energy-efficient models
The environmental and financial incentives are perfectly aligned. Every dollar saved through right-sizing represents real energy not consumed and real emissions not produced.
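The cost-to-carbon conversion described in the companion article can be sketched in a few lines. Every constant below is an assumed placeholder chosen for illustration, not a measured value; the point is the structure of the calculation:

```python
# Hedged sketch of a token-to-carbon estimate. All constants are assumptions.
ENERGY_PER_1K_TOKENS_WH = 0.3  # assumed inference energy, Wh per 1,000 tokens
PUE = 1.2                      # assumed datacenter power usage effectiveness
GRID_GCO2E_PER_KWH = 400.0     # assumed grid emission factor

def tokens_to_gco2e(tokens: int) -> float:
    """Rough gCO2e estimate for a token volume under the assumptions above."""
    kwh = tokens / 1_000 * ENERGY_PER_1K_TOKENS_WH / 1_000 * PUE
    return kwh * GRID_GCO2E_PER_KWH
```

Under these placeholder constants, one million tokens works out to roughly 144 gCO₂e. Because smaller models draw less energy per token, lowering the energy-per-token term is exactly what right-sizing does.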
Implementation Patterns
Pattern 1: Static Routing
The simplest approach. Hard-code which model handles which step at development time.
Pros: No routing overhead, deterministic behavior, easy to debug
Cons: Cannot adapt to edge cases, requires manual tuning
Best for: Well-understood workflows with stable task profiles.
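Static routing is often nothing more than a lookup table checked into the codebase. A sketch; the model identifier strings are illustrative labels, not exact provider API IDs:

```python
# Static routing: the model for each step is fixed at development time.
# Model name strings are illustrative, not exact provider API identifiers.
STEP_MODEL = {
    "document_chunking": "gpt-5-nano",
    "section_classification": "gpt-5-nano",
    "clause_extraction": "claude-haiku-4.5",
    "risk_assessment": "claude-opus-4.6",
    "summary_generation": "claude-sonnet-4.6",
    "quality_validation": "claude-haiku-4.5",
}

def model_for(step: str) -> str:
    # A KeyError on an unknown step is deliberate: every step must be
    # routed explicitly, never silently defaulted to a frontier model.
    return STEP_MODEL[step]
```

The deliberate KeyError is the debugging advantage in code form: routing decisions are visible in one place and fail loudly when a new step is added without a tier decision.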
Pattern 2: Confidence-Based Escalation
Start every call at the cheapest viable tier. If the model's output confidence falls below a threshold, escalate to the next tier.
Pros: Automatically optimizes cost-quality tradeoff, handles edge cases
Cons: Requires reliable confidence signals, adds latency for escalated calls
Best for: High-volume systems where even small per-call savings compound.
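A sketch of the escalation loop, assuming a `run_model` callable that returns an output plus a confidence score in [0, 1]. Both the ladder names and the confidence signal are placeholders for your provider integration:

```python
# Confidence-based escalation: try the cheapest tier first, escalate while
# the model's confidence signal stays below threshold.
LADDER = ["tier4-model", "tier3-model", "tier2-model", "tier1-model"]

def run_with_escalation(task, run_model, threshold=0.8):
    """run_model(model, task) -> (output, confidence in [0, 1])."""
    for model in LADDER:
        output, confidence = run_model(model, task)
        if confidence >= threshold:
            return output, model
    # Frontier answer is final even if its confidence is below threshold.
    return output, model
```

The threshold is the tuning knob: raise it and more traffic escalates (higher quality, higher cost); lower it and more traffic resolves at the cheap tiers.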
Pattern 3: Router Model
Use a dedicated lightweight model (Tier 4) as a classifier that examines each incoming task and routes it to the appropriate tier.
Pros: Adaptive, learns from patterns, centralizes routing logic
Cons: Adds one extra LLM call per task, router errors cascade
Best for: Complex multi-agent systems with diverse task types.
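A sketch of the router pattern, assuming a `classify` callable that wraps a cheap Tier 4 call and returns a complexity label (the names and the fallback choice are illustrative):

```python
# Router-model pattern: a lightweight classifier picks the tier per task.
# `classify` stands in for a cheap Tier 4 LLM call returning a label.
TIER_FOR_LABEL = {"low": 4, "medium": 3, "high": 2, "critical": 1}

def route(task: str, classify) -> int:
    """classify(task) -> 'low' | 'medium' | 'high' | 'critical'."""
    label = classify(task)
    # Fail safe, not cheap: unknown labels fall back to a capable tier
    # so that router errors degrade cost, never quality.
    return TIER_FOR_LABEL.get(label, 2)
```

The fallback direction is the key design choice: when the router misfires, it is usually better to overspend on one call than to hand a hard task to a Tier 4 model.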
Pattern 4: Offline Analysis and Batch Optimization
Log all agent calls in production, then periodically analyze which calls could be downgraded to cheaper tiers without quality loss. Apply the Batch API (50% discount on most providers) for non-time-sensitive operations.
Pros: Data-driven, no runtime complexity, captures long-tail optimizations
Cons: Requires logging infrastructure, changes are delayed
Best for: Mature systems optimizing for maximum cost reduction.
Common Mistakes
Mistake 1: Defaulting to Frontier for Everything
This is the most expensive mistake and the most common one. Engineering teams often prototype with the most capable model and never revisit the decision. What works in a prototype becomes a 4M USD/year production cost.
Mistake 2: Optimizing Only for Token Price
Token price is one variable. Latency, throughput, and quality-per-dollar matter just as much. A model that costs 2x more but completes tasks in half the time may be the right choice for latency-sensitive steps.
Mistake 3: Ignoring the Batch API
For any step that is not user-facing and not time-sensitive — background validation, batch extraction, quality audits — the Batch API cuts costs by 50%. Combined with right-sizing, this creates compounding savings.
Mistake 4: Treating All Agent Steps Equally
A routing decision and a risk assessment are fundamentally different tasks. The framework exists precisely to make this distinction explicit rather than implicit.
The Framework Checklist
Use this checklist when designing or auditing any agentic workflow:
- Map every LLM call in your agent pipeline (most teams undercount by 30–50%)
- Classify each call as Low, Medium, High, or Critical complexity
- Assign initial tier using the DCAM matrix
- Deploy and measure quality, cost, and latency per step
- Iterate monthly — model capabilities improve; pricing drops; your tier assignments should evolve
- Calculate your carbon impact using cost-normalized emission factors (see our carbon accounting framework)
- Report both metrics — CFOs care about dollars, boards increasingly care about emissions
Conclusion
The model right-sizing framework is not about choosing worse models. It is about choosing appropriate models. NVIDIA's research confirms what production experience already teaches: small language models are not just cheaper alternatives — they are often better suited for the narrow, well-defined tasks that make up the majority of agentic workflows.
The enterprises that will win in the agentic era are not the ones spending the most on AI. They are the ones spending the most intelligently — matching model capability to task complexity at every step, measuring the results, and iterating relentlessly.
The math is clear. A 20-step agentic workflow where 15 steps use Tier 3–4 models and 5 steps use Tier 1–2 models will cost a fraction of one that runs everything through Opus. And that fraction? It scales linearly with every additional user session, every additional agent, every additional workflow.
Right-size your models. Your budget — and the planet — will thank you.