
Agentic workflows are transforming how enterprises build with AI. Instead of a single prompt-response exchange, modern systems orchestrate dozens — sometimes hundreds — of LLM calls per user session: planning, tool selection, data extraction, summarization, validation, and reflection.

This is powerful. It is also extraordinarily expensive when done wrong.

The default approach — routing every call through a frontier model like Claude Opus 4.6 or GPT-5 — can burn through budgets at a staggering rate. Not because frontier models are overpriced, but because most of those calls do not need frontier-level intelligence. A JSON parser does not need a PhD. A field validator does not need a philosopher.

This article presents the Model Right-Sizing Framework: a structured methodology for matching model capability to task complexity at every step of an agentic workflow. The economic impact is not marginal — it is the difference between a viable product and one that cannot scale.


The Multiplication Problem

In a traditional single-call architecture, model cost is simple: one request in, one response out. At 5 USD per million input tokens and 25 USD per million output tokens (Claude Opus 4.6 pricing), a typical exchange of 2,000 input and 2,000 output tokens costs roughly 0.06 USD.

Agentic workflows change the calculus entirely. A single user session might involve:

  • 3–5 planning steps (decomposing the task into sub-goals)
  • 5–15 tool calls (searching, retrieving, computing)
  • 3–10 extraction/parsing steps (structuring raw data into usable formats)
  • 2–5 validation passes (checking outputs against constraints)
  • 1–3 reflection loops (evaluating whether the answer is sufficient)

A moderate agent workflow of 20 LLM calls, at the same Opus-level pricing, costs roughly 1.20 USD per session. At 10,000 daily sessions, that is 12,000 USD per day, or 4.4 million USD per year.
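The multiplication can be sketched in a few lines of Python, using the illustrative pricing and volumes from this article (not universal constants):

```python
# Illustrative Opus-level pricing from this article (USD per million tokens).
OPUS_INPUT = 5.00
OPUS_OUTPUT = 25.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single LLM call at Opus-level pricing."""
    return (input_tokens * OPUS_INPUT + output_tokens * OPUS_OUTPUT) / 1_000_000

per_call = call_cost(2_000, 2_000)   # 2,000 tokens in, 2,000 out: 0.06 USD
per_session = 20 * per_call          # 20 calls per session: 1.20 USD
per_day = 10_000 * per_session       # 10,000 sessions/day: 12,000 USD
per_year = 365 * per_day             # ~4.38M USD/year
```

The point of writing it out is that every factor is a multiplier: calls per session, sessions per day, days per year. Per-call savings compound across all three.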

The question is not whether agentic AI works. It does. The question is whether every one of those 20 calls needs a 5/25 USD-per-MTok model.

The answer, almost universally, is no.


The Model Tier Landscape (March 2026)

To understand right-sizing, we need to understand what is available. The current model landscape falls into four distinct tiers:

Tier 1: Frontier Reasoning

These models excel at complex multi-step reasoning, nuanced judgment, and novel problem-solving.

Model            | Input (USD) | Output (USD) | Strength
Claude Opus 4.6  | 5.00/MTok   | 25.00/MTok   | Deep reasoning, complex analysis
GPT-5.4 Pro      | 30.00/MTok  | 180.00/MTok  | Maximum capability

Tier 2: Balanced Performance

Strong general-purpose models that handle most production tasks well.

Model             | Input (USD) | Output (USD) | Strength
Claude Sonnet 4.6 | 3.00/MTok   | 15.00/MTok   | Coding, analysis, tool use
GPT-5.4           | 2.50/MTok   | 15.00/MTok   | Multi-step reasoning, planning
Gemini 3.1 Pro    | 2.00/MTok   | 12.00/MTok   | Long-context reasoning, 1M window
Magistral Medium  | 2.00/MTok   | 5.00/MTok    | Domain-specific reasoning, multilingual
Qwen 3 Max        | 1.20/MTok   | 6.00/MTok    | Strong reasoning, open-weight ecosystem

Tier 3: Cost-Optimized

Fast, affordable models ideal for well-defined, narrower tasks.

Model                 | Input (USD) | Output (USD) | Strength
Claude Haiku 4.5      | 1.00/MTok   | 5.00/MTok    | Classification, extraction, routing
Mistral Large 3       | 0.50/MTok   | 1.50/MTok    | General purpose, European provider
DeepSeek R1           | 0.45/MTok   | 2.15/MTok    | Reasoning at budget pricing
Qwen 3.5 Plus         | 0.40/MTok   | 2.40/MTok    | Balanced MoE, open-weight
Magistral Small       | 0.50/MTok   | 1.50/MTok    | Lightweight reasoning, transparent
Gemini 3 Flash        | 0.50/MTok   | 3.00/MTok    | Fast inference, multimodal
GPT-5 mini            | 0.25/MTok   | 2.00/MTok    | Structured output, validation
Gemini 3.1 Flash-Lite | 0.25/MTok   | 1.50/MTok    | Lightweight tasks, multimodal
Llama 4 Maverick      | 0.15/MTok   | 0.60/MTok    | Open-source, self-hosted option

Tier 4: Ultra-Efficient

Minimal-cost models for the simplest operations.

Model             | Input (USD) | Output (USD) | Strength
Qwen 3.5 Flash    | 0.10/MTok   | 0.40/MTok    | Fast MoE, open-weight
Mistral Small 3.2 | 0.06/MTok   | 0.18/MTok    | Fast, self-hosted option
GPT-5 nano        | 0.05/MTok   | 0.40/MTok    | Simple classification, formatting
Llama 4 Scout     | 0.08/MTok   | 0.30/MTok    | Parsing, entity extraction

The price spread from Tier 1 to Tier 4 is enormous. Claude Opus 4.6 output tokens cost roughly 83x as much as Llama 4 Scout output tokens. Even within the same provider family, Opus 4.6 costs 5x as much as Haiku 4.5.


The Right-Sizing Framework

The framework consists of four steps: Decompose, Classify, Assign, and Measure (DCAM).

Step 1: Decompose the Workflow

Break your agentic workflow into its atomic LLM calls. Each call should be categorized by its function:

  • Routing — Deciding which tool or sub-agent to invoke
  • Extraction — Pulling structured data from unstructured input
  • Transformation — Converting data between formats (JSON, SQL, summaries)
  • Validation — Checking outputs against schemas or business rules
  • Reasoning — Drawing conclusions, making judgments, synthesizing information
  • Generation — Producing user-facing natural language output
  • Reflection — Evaluating quality and deciding whether to retry
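One way to make the decomposition concrete is to tag every call in the pipeline with its function. A minimal sketch (step names are hypothetical, for illustration only):

```python
from enum import Enum

class CallType(Enum):
    """The seven call functions from Step 1 of the framework."""
    ROUTING = "routing"
    EXTRACTION = "extraction"
    TRANSFORMATION = "transformation"
    VALIDATION = "validation"
    REASONING = "reasoning"
    GENERATION = "generation"
    REFLECTION = "reflection"

# A decomposed workflow is just an ordered list of (step name, call type).
contract_agent = [
    ("choose_tool",     CallType.ROUTING),
    ("pull_clauses",    CallType.EXTRACTION),
    ("clauses_to_json", CallType.TRANSFORMATION),
    ("check_schema",    CallType.VALIDATION),
    ("assess_risk",     CallType.REASONING),
    ("write_summary",   CallType.GENERATION),
    ("review_answer",   CallType.REFLECTION),
]
```

Once every call carries a tag like this, the later classification and assignment steps become mechanical rather than ad hoc.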

Step 2: Classify Task Complexity

Each task type maps to a complexity level:

Complexity | Description | Examples
Low    | Deterministic or near-deterministic; the output format is fixed and the reasoning is minimal | JSON parsing, field validation, intent classification, keyword extraction
Medium | Requires understanding context and applying learned patterns, but the task is well-bounded   | Summarization, entity extraction from documents, SQL generation, code completion
High   | Requires multi-step reasoning, judgment under ambiguity, or creative synthesis               | Research analysis, strategic planning, complex code architecture, nuanced customer responses

Step 3: Assign Model Tiers

The assignment rule is straightforward:

Task Complexity | Recommended Tier         | Rationale
Low      | Tier 4 (Ultra-Efficient) | These tasks need pattern matching, not reasoning. Overspending here is pure waste.
Medium   | Tier 3 (Cost-Optimized)  | Haiku-class models handle summarization, extraction, and structured generation reliably.
High     | Tier 2 (Balanced)        | Sonnet-class models cover the vast majority of complex production tasks.
Critical | Tier 1 (Frontier)        | Reserve frontier models for tasks where quality failures have outsized consequences.
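In code, the assignment matrix reduces to a lookup table. The example models below are taken from this article's tier tables; swap in whatever your stack actually uses:

```python
# The DCAM assignment matrix as a lookup: complexity -> (tier, example model).
TIER_BY_COMPLEXITY = {
    "low":      (4, "gpt-5-nano"),
    "medium":   (3, "claude-haiku-4.5"),
    "high":     (2, "claude-sonnet-4.6"),
    "critical": (1, "claude-opus-4.6"),
}

def assign_model(complexity: str) -> str:
    """Return the model to use for a task of the given complexity."""
    tier, model = TIER_BY_COMPLEXITY[complexity]
    return model
```

Keeping the mapping in one place, rather than scattered across call sites, is what makes the monthly iteration in Step 4 a one-line change.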

The key insight — supported by NVIDIA's research on small language models for agentic AI — is that most agent calls are Low or Medium complexity. In a typical 20-call agent session, only 2–4 calls genuinely require Tier 1 or Tier 2 reasoning. The rest are routing, parsing, validation, and extraction.

Step 4: Measure and Iterate

Deploy with your initial tier assignments, then measure:

  • Quality metrics per step — Is the lower-tier model achieving acceptable accuracy for each task?
  • Cost per session — What is the blended cost across all tiers?
  • Latency per step — Smaller models are often faster, which improves user experience
  • Failure rate — Are any steps failing disproportionately with the assigned model?

Adjust tier assignments based on data. Some tasks you assumed were Medium may work fine at Low. Some you classified as Low may need Medium. The framework is iterative, not prescriptive.
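A minimal sketch of the per-step measurement, assuming call logs of the shape below (the records and field names are illustrative; in production they would come from your tracing pipeline):

```python
from collections import defaultdict
from statistics import mean

# Example call logs; one record per LLM call.
logs = [
    {"step": "clause_extraction", "cost": 0.003, "latency_ms": 420,  "ok": True},
    {"step": "clause_extraction", "cost": 0.003, "latency_ms": 510,  "ok": False},
    {"step": "risk_assessment",   "cost": 0.030, "latency_ms": 2100, "ok": True},
]

# Group records by workflow step.
by_step = defaultdict(list)
for entry in logs:
    by_step[entry["step"]].append(entry)

# Aggregate the three signals the framework asks for: cost, latency, failures.
summary = {
    step: {
        "total_cost": sum(e["cost"] for e in entries),
        "avg_latency_ms": mean(e["latency_ms"] for e in entries),
        "failure_rate": 1 - mean(e["ok"] for e in entries),
    }
    for step, entries in by_step.items()
}
```

A step whose failure rate spikes after a tier downgrade is your signal to move it back up; a step with a flat failure rate at the cheaper tier is money already saved.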


The Economics: A Worked Example

Consider a document analysis agent that processes uploaded contracts. The workflow has 18 LLM calls per document:

Before Right-Sizing (All Opus 4.6)

Step                   | Calls | Avg Tokens           | Cost/Call | Total
Document chunking      | 1     | 2,000 in / 500 out   | 0.0225    | 0.0225
Section classification | 6     | 1,000 in / 100 out   | 0.0075    | 0.0450
Clause extraction      | 4     | 1,500 in / 300 out   | 0.0150    | 0.0600
Risk assessment        | 3     | 2,000 in / 800 out   | 0.0300    | 0.0900
Summary generation     | 2     | 3,000 in / 1,000 out | 0.0400    | 0.0800
Quality validation     | 2     | 1,500 in / 200 out   | 0.0125    | 0.0250
Total per document     | 18    |                      |           | 0.3225 USD

At 1,000 documents per day: 322.50 USD/day — 117,712 USD/year

After Right-Sizing (Tiered)

Step                   | Tier | Model      | Calls | Cost/Call | Total
Document chunking      | 4    | GPT-5 nano | 1     | 0.0003    | 0.0003
Section classification | 4    | GPT-5 nano | 6     | 0.0001    | 0.0006
Clause extraction      | 3    | Haiku 4.5  | 4     | 0.0030    | 0.0120
Risk assessment        | 1    | Opus 4.6   | 3     | 0.0300    | 0.0900
Summary generation     | 2    | Sonnet 4.6 | 2     | 0.0240    | 0.0480
Quality validation     | 3    | Haiku 4.5  | 2     | 0.0025    | 0.0050
Total per document     |      |            | 18    |           | 0.1559 USD

At 1,000 documents per day: 155.90 USD/day — 56,904 USD/year

Annual savings: 60,808 USD (52%)
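The worked example can be reproduced directly from the per-MTok prices in the tier tables (row-level figures above are rounded to four decimals, so the totals differ in the last digit):

```python
# Prices from the tier tables: model -> (input USD/MTok, output USD/MTok).
PRICES = {
    "gpt-5-nano": (0.05, 0.40),
    "haiku-4.5":  (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.6":   (5.00, 25.00),
}

# The 18-call contract workflow: (calls, tokens in, tokens out, assigned model).
STEPS = [
    (1, 2_000, 500,   "gpt-5-nano"),  # document chunking
    (6, 1_000, 100,   "gpt-5-nano"),  # section classification
    (4, 1_500, 300,   "haiku-4.5"),   # clause extraction
    (3, 2_000, 800,   "opus-4.6"),    # risk assessment
    (2, 3_000, 1_000, "sonnet-4.6"),  # summary generation
    (2, 1_500, 200,   "haiku-4.5"),   # quality validation
]

def cost(calls, t_in, t_out, model):
    """Total USD for `calls` invocations of `model` at the given token counts."""
    p_in, p_out = PRICES[model]
    return calls * (t_in * p_in + t_out * p_out) / 1_000_000

tiered   = sum(cost(c, ti, to, m) for c, ti, to, m in STEPS)
all_opus = sum(cost(c, ti, to, "opus-4.6") for c, ti, to, _ in STEPS)
savings  = 1 - tiered / all_opus   # roughly 0.52
```

Running the same numbers with different tier assignments is a useful sanity check before deploying: the spreadsheet and the router should agree.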

And this is a conservative example. When Tier 4 models handle the majority of calls — as they do in many real workflows — savings of 60–90% are achievable. Research from model routing studies shows that intelligent routing of 90% of requests to budget models with only 10% going to frontier models can reduce costs by 86% without meaningful quality degradation.


The Carbon Connection

Cost savings from right-sizing are not purely financial. In our companion article, How to Measure the Carbon Footprint of Your LLM API Spend, we established a framework for converting token expenditure into gCO₂e estimates using datacenter PUE, grid emission factors, and energy-per-token ranges.

The connection is direct: cheaper tokens from smaller models also consume less energy per inference. NVIDIA's research demonstrates that running a 1–3B parameter model is 10–30x cheaper in compute (FLOPs, energy, GPU-hours) than running a 70–175B parameter model. This means that right-sizing your model selection does not just reduce your API bill — it proportionally reduces the carbon footprint of your AI operations.

For an organization processing 1,000 documents daily:

  • All-Opus configuration: ~117K USD/year in API costs, with proportionally higher energy consumption from frontier-class GPU utilization
  • Right-sized configuration: ~57K USD/year, with the majority of inference running on smaller, more energy-efficient models

The environmental and financial incentives are perfectly aligned. Every dollar saved through right-sizing represents real energy not consumed and real emissions not produced.


Implementation Patterns

Pattern 1: Static Routing

The simplest approach. Hard-code which model handles which step at development time.

Pros: No routing overhead, deterministic behavior, easy to debug
Cons: Cannot adapt to edge cases, requires manual tuning

Best for: Well-understood workflows with stable task profiles.
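A static route is nothing more than a fixed table consulted at call time. A minimal sketch, with illustrative step and model identifiers:

```python
# Static routing: the model for each step is fixed at development time.
ROUTES = {
    "chunk_document":   "gpt-5-nano",
    "classify_section": "gpt-5-nano",
    "extract_clauses":  "claude-haiku-4.5",
    "assess_risk":      "claude-opus-4.6",
}

def model_for(step: str) -> str:
    # Unknown steps raise KeyError: fail loudly rather than silently
    # defaulting every unmapped call to a frontier model.
    return ROUTES[step]
```

The deliberate lack of a fallback is a design choice: a new step added without a routing decision should break a test, not quietly inherit the most expensive model.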

Pattern 2: Confidence-Based Escalation

Start every call at the cheapest viable tier. If the model's output confidence falls below a threshold, escalate to the next tier.

Pros: Automatically optimizes cost-quality tradeoff, handles edge cases
Cons: Requires reliable confidence signals, adds latency for escalated calls

Best for: High-volume systems where even small per-call savings compound.
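The escalation loop itself is simple; the hard part in practice is the confidence signal. In this sketch, `call_model` and its `confidence` field are placeholders for your provider integration (e.g. a logprob-derived score or a self-rating), not a real API:

```python
# Confidence-based escalation: try the cheapest viable tier first and
# escalate only when the model reports low confidence.
ESCALATION_LADDER = ["gpt-5-nano", "claude-haiku-4.5", "claude-sonnet-4.6"]
THRESHOLD = 0.8

def answer(task, call_model):
    """Return (text, model used). `call_model(model, task)` is assumed to
    return a dict like {"text": ..., "confidence": float in [0, 1]}."""
    result = None
    for model in ESCALATION_LADDER:
        result = call_model(model, task)
        if result["confidence"] >= THRESHOLD:
            return result["text"], model
    # Every tier was unsure: accept the top tier's answer anyway.
    return result["text"], ESCALATION_LADDER[-1]
```

Note the worst case pays for every rung of the ladder, which is why this pattern wins only when most calls resolve at the bottom tier.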

Pattern 3: Router Model

Use a dedicated lightweight model (Tier 4) as a classifier that examines each incoming task and routes it to the appropriate tier.

Pros: Adaptive, learns from patterns, centralizes routing logic
Cons: Adds one extra LLM call per task, router errors cascade

Best for: Complex multi-agent systems with diverse task types.
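The dispatch side of this pattern is a small function wrapped around the classifier. Here `classify` stands in for a real call to a Tier 4 model (an assumption, not a specific API):

```python
# Router-model pattern: one cheap classification call picks the tier
# for the real call.
ROUTE_TABLE = {
    "low":      "gpt-5-nano",
    "medium":   "claude-haiku-4.5",
    "high":     "claude-sonnet-4.6",
    "critical": "claude-opus-4.6",
}

def route(task, classify):
    label = classify(task)  # one extra lightweight LLM call per task
    # Router errors cascade, so unrecognized labels fall back to a
    # safe middle tier rather than the cheapest one.
    return ROUTE_TABLE.get(label, "claude-sonnet-4.6")
```

The fallback choice matters: defaulting a misclassified task to Tier 2 costs a little extra; defaulting it to Tier 4 risks a quality failure.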

Pattern 4: Offline Analysis and Batch Optimization

Log all agent calls in production, then periodically analyze which calls could be downgraded to cheaper tiers without quality loss. Apply the Batch API (50% discount on most providers) for non-time-sensitive operations.

Pros: Data-driven, no runtime complexity, captures long-tail optimizations
Cons: Requires logging infrastructure, changes are delayed

Best for: Mature systems optimizing for maximum cost reduction.


Common Mistakes

Mistake 1: Defaulting to Frontier for Everything

This is the most expensive mistake and the most common one. Engineering teams often prototype with the most capable model and never revisit the decision. What works in a prototype becomes a 4M USD/year production cost.

Mistake 2: Optimizing Only for Token Price

Token price is one variable. Latency, throughput, and quality-per-dollar matter just as much. A model that costs 2x more but completes tasks in half the time may be the right choice for latency-sensitive steps.

Mistake 3: Ignoring the Batch API

For any step that is not user-facing and not time-sensitive — background validation, batch extraction, quality audits — the Batch API cuts costs by 50%. Combined with right-sizing, this creates compounding savings.

Mistake 4: Treating All Agent Steps Equally

A routing decision and a risk assessment are fundamentally different tasks. The framework exists precisely to make this distinction explicit rather than implicit.


The Framework Checklist

Use this checklist when designing or auditing any agentic workflow:

  1. Map every LLM call in your agent pipeline (most teams undercount by 30–50%)
  2. Classify each call as Low, Medium, High, or Critical complexity
  3. Assign initial tier using the DCAM matrix
  4. Deploy and measure quality, cost, and latency per step
  5. Iterate monthly — model capabilities improve; pricing drops; your tier assignments should evolve
  6. Calculate your carbon impact using cost-normalized emission factors (see our carbon accounting framework)
  7. Report both metrics — CFOs care about dollars, boards increasingly care about emissions

Conclusion

The model right-sizing framework is not about choosing worse models. It is about choosing appropriate models. NVIDIA's research confirms what production experience already teaches: small language models are not just cheaper alternatives — they are often better suited for the narrow, well-defined tasks that make up the majority of agentic workflows.

The enterprises that will win in the agentic era are not the ones spending the most on AI. They are the ones spending the most intelligently — matching model capability to task complexity at every step, measuring the results, and iterating relentlessly.

The math is clear. A 20-step agentic workflow where 15 steps use Tier 3–4 models and 5 steps use Tier 1–2 models will cost a fraction of one that runs everything through Opus. And that fraction? It scales linearly with every additional user session, every additional agent, every additional workflow.

Right-size your models. Your budget — and the planet — will thank you.