
Agentic workflows are transforming how enterprises build with AI. Instead of a single prompt-response exchange, modern systems orchestrate dozens — sometimes hundreds — of LLM calls per user session: planning, tool selection, data extraction, summarization, validation, and reflection.

This is powerful. It is also extraordinarily expensive when done wrong.

The default approach — routing every call through a frontier model like Claude Opus 4.6 or GPT-5 — can burn through budgets at a staggering rate. Not because frontier models are overpriced, but because most of those calls do not need frontier-level intelligence. A JSON parser does not need a PhD. A field validator does not need a philosopher.

This article presents the Model Right-Sizing Framework: a structured methodology for matching model capability to task complexity at every step of an agentic workflow. The economic impact is not marginal — it is the difference between a viable product and one that cannot scale.


The Multiplication Problem

In a traditional single-call architecture, model cost is simple: one request in, one response out. At 5 USD per million input tokens and 25 USD per million output tokens (Claude Opus 4.6 pricing), a typical exchange of 2,000 input and 2,000 output tokens costs roughly 0.06 USD.

Agentic workflows change the calculus entirely. A single user session might involve:

  • 3–5 planning steps (decomposing the task into sub-goals)
  • 5–15 tool calls (searching, retrieving, computing)
  • 3–10 extraction/parsing steps (structuring raw data into usable formats)
  • 2–5 validation passes (checking outputs against constraints)
  • 1–3 reflection loops (evaluating whether the answer is sufficient)

A moderate agent workflow of 20 LLM calls, at the same Opus-level pricing, costs roughly 1.20 USD per session. At 10,000 daily sessions, that is 12,000 USD per day, or 4.4 million USD per year.
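The multiplication can be sketched in a few lines of Python, using the illustrative pricing and volumes from this article (not universal constants):

```python
# Illustrative Opus-level pricing from this article (USD per million tokens).
OPUS_INPUT = 5.00
OPUS_OUTPUT = 25.00

def call_cost(input_tokens: int, output_tokens: int) -> float:
    """Cost in USD of a single LLM call at Opus-level pricing."""
    return (input_tokens * OPUS_INPUT + output_tokens * OPUS_OUTPUT) / 1_000_000

per_call = call_cost(2_000, 2_000)   # 2,000 tokens in, 2,000 out: 0.06 USD
per_session = 20 * per_call          # 20 calls per session: 1.20 USD
per_day = 10_000 * per_session       # 10,000 sessions/day: 12,000 USD
per_year = 365 * per_day             # ~4.38M USD/year
```

The point of writing it out is that every factor is a multiplier: calls per session, sessions per day, days per year. Per-call savings compound across all three.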

The question is not whether agentic AI works. It does. The question is whether every one of those 20 calls needs a 5/25 USD-per-MTok model.

The answer, almost universally, is no.


The Model Tier Landscape (March 2026)

To understand right-sizing, we need to understand what is available. The current model landscape falls into four distinct tiers:

Tier 1: Frontier Reasoning

These models excel at complex multi-step reasoning, nuanced judgment, and novel problem-solving.

Model            | Input (USD) | Output (USD) | Strength
Claude Opus 4.6  | 5.00/MTok   | 25.00/MTok   | Deep reasoning, complex analysis
GPT-5.4 Pro      | 30.00/MTok  | 180.00/MTok  | Maximum capability

Tier 2: Balanced Performance

Strong general-purpose models that handle most production tasks well.

Model             | Input (USD) | Output (USD) | Strength
Claude Sonnet 4.6 | 3.00/MTok   | 15.00/MTok   | Coding, analysis, tool use
GPT-5.4           | 2.50/MTok   | 15.00/MTok   | Multi-step reasoning, planning
Gemini 3.1 Pro    | 2.00/MTok   | 12.00/MTok   | Long-context reasoning, 1M window
Magistral Medium  | 2.00/MTok   | 5.00/MTok    | Domain-specific reasoning, multilingual
Qwen 3 Max        | 1.20/MTok   | 6.00/MTok    | Strong reasoning, open-weight ecosystem

Tier 3: Cost-Optimized

Fast, affordable models ideal for well-defined, narrower tasks.

Model                 | Input (USD) | Output (USD) | Strength
Claude Haiku 4.5      | 1.00/MTok   | 5.00/MTok    | Classification, extraction, routing
Mistral Large 3       | 0.50/MTok   | 1.50/MTok    | General purpose, European provider
DeepSeek R1           | 0.45/MTok   | 2.15/MTok    | Reasoning at budget pricing
Qwen 3.5 Plus         | 0.40/MTok   | 2.40/MTok    | Balanced MoE, open-weight
Magistral Small       | 0.50/MTok   | 1.50/MTok    | Lightweight reasoning, transparent
Gemini 3 Flash        | 0.50/MTok   | 3.00/MTok    | Fast inference, multimodal
GPT-5 mini            | 0.25/MTok   | 2.00/MTok    | Structured output, validation
Gemini 3.1 Flash-Lite | 0.25/MTok   | 1.50/MTok    | Lightweight tasks, multimodal
Llama 4 Maverick      | 0.15/MTok   | 0.60/MTok    | Open-source, self-hosted option

Tier 4: Ultra-Efficient

Minimal-cost models for the simplest operations.

Model             | Input (USD) | Output (USD) | Strength
Qwen 3.5 Flash    | 0.10/MTok   | 0.40/MTok    | Fast MoE, open-weight
Mistral Small 3.2 | 0.06/MTok   | 0.18/MTok    | Fast, self-hosted option
GPT-5 nano        | 0.05/MTok   | 0.40/MTok    | Simple classification, formatting
Llama 4 Scout     | 0.08/MTok   | 0.30/MTok    | Parsing, entity extraction

The price spread from Tier 1 to Tier 4 is enormous. Claude Opus 4.6 output tokens cost roughly 83x as much as Llama 4 Scout output tokens. Even within the same provider family, Opus 4.6 costs 5x as much as Haiku 4.5.


The Right-Sizing Framework

The framework consists of four steps: Decompose, Classify, Assign, and Measure (DCAM).

Step 1: Decompose the Workflow

Break your agentic workflow into its atomic LLM calls. Each call should be categorized by its function:

  • Routing — Deciding which tool or sub-agent to invoke
  • Extraction — Pulling structured data from unstructured input
  • Transformation — Converting data between formats (JSON, SQL, summaries)
  • Validation — Checking outputs against schemas or business rules
  • Reasoning — Drawing conclusions, making judgments, synthesizing information
  • Generation — Producing user-facing natural language output
  • Reflection — Evaluating quality and deciding whether to retry
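One way to make the decomposition concrete is to tag every call in the pipeline with its function. A minimal sketch (step names are hypothetical, for illustration only):

```python
from enum import Enum

class CallType(Enum):
    """The seven call functions from Step 1 of the framework."""
    ROUTING = "routing"
    EXTRACTION = "extraction"
    TRANSFORMATION = "transformation"
    VALIDATION = "validation"
    REASONING = "reasoning"
    GENERATION = "generation"
    REFLECTION = "reflection"

# A decomposed workflow is just an ordered list of (step name, call type).
contract_agent = [
    ("choose_tool",     CallType.ROUTING),
    ("pull_clauses",    CallType.EXTRACTION),
    ("clauses_to_json", CallType.TRANSFORMATION),
    ("check_schema",    CallType.VALIDATION),
    ("assess_risk",     CallType.REASONING),
    ("write_summary",   CallType.GENERATION),
    ("review_answer",   CallType.REFLECTION),
]
```

Once every call carries a tag like this, the later classification and assignment steps become mechanical rather than ad hoc.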

Step 2: Classify Task Complexity

Each task type maps to a complexity level:

Complexity | Description | Examples
Low    | Deterministic or near-deterministic; the output format is fixed and the reasoning is minimal | JSON parsing, field validation, intent classification, keyword extraction
Medium | Requires understanding context and applying learned patterns, but the task is well-bounded   | Summarization, entity extraction from documents, SQL generation, code completion
High   | Requires multi-step reasoning, judgment under ambiguity, or creative synthesis               | Research analysis, strategic planning, complex code architecture, nuanced customer responses

Step 3: Assign Model Tiers

The assignment rule is straightforward:

Task Complexity | Recommended Tier         | Rationale
Low      | Tier 4 (Ultra-Efficient) | These tasks need pattern matching, not reasoning. Overspending here is pure waste.
Medium   | Tier 3 (Cost-Optimized)  | Haiku-class models handle summarization, extraction, and structured generation reliably.
High     | Tier 2 (Balanced)        | Sonnet-class models cover the vast majority of complex production tasks.
Critical | Tier 1 (Frontier)        | Reserve frontier models for tasks where quality failures have outsized consequences.
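In code, the assignment matrix reduces to a lookup table. The example models below are taken from this article's tier tables; swap in whatever your stack actually uses:

```python
# The DCAM assignment matrix as a lookup: complexity -> (tier, example model).
TIER_BY_COMPLEXITY = {
    "low":      (4, "gpt-5-nano"),
    "medium":   (3, "claude-haiku-4.5"),
    "high":     (2, "claude-sonnet-4.6"),
    "critical": (1, "claude-opus-4.6"),
}

def assign_model(complexity: str) -> str:
    """Return the model to use for a task of the given complexity."""
    tier, model = TIER_BY_COMPLEXITY[complexity]
    return model
```

Keeping the mapping in one place, rather than scattered across call sites, is what makes the monthly iteration in Step 4 a one-line change.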

The key insight — supported by NVIDIA's research on small language models for agentic AI — is that most agent calls are Low or Medium complexity. In a typical 20-call agent session, only 2–4 calls genuinely require Tier 1 or Tier 2 reasoning. The rest are routing, parsing, validation, and extraction.

Step 4: Measure and Iterate

Deploy with your initial tier assignments, then measure:

  • Quality metrics per step — Is the lower-tier model achieving acceptable accuracy for each task?
  • Cost per session — What is the blended cost across all tiers?
  • Latency per step — Smaller models are often faster, which improves user experience
  • Failure rate — Are any steps failing disproportionately with the assigned model?

Adjust tier assignments based on data. Some tasks you assumed were Medium may work fine at Low. Some you classified as Low may need Medium. The framework is iterative, not prescriptive.
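A minimal sketch of the per-step measurement, assuming call logs of the shape below (the records and field names are illustrative; in production they would come from your tracing pipeline):

```python
from collections import defaultdict
from statistics import mean

# Example call logs; one record per LLM call.
logs = [
    {"step": "clause_extraction", "cost": 0.003, "latency_ms": 420,  "ok": True},
    {"step": "clause_extraction", "cost": 0.003, "latency_ms": 510,  "ok": False},
    {"step": "risk_assessment",   "cost": 0.030, "latency_ms": 2100, "ok": True},
]

# Group records by workflow step.
by_step = defaultdict(list)
for entry in logs:
    by_step[entry["step"]].append(entry)

# Aggregate the three signals the framework asks for: cost, latency, failures.
summary = {
    step: {
        "total_cost": sum(e["cost"] for e in entries),
        "avg_latency_ms": mean(e["latency_ms"] for e in entries),
        "failure_rate": 1 - mean(e["ok"] for e in entries),
    }
    for step, entries in by_step.items()
}
```

A step whose failure rate spikes after a tier downgrade is your signal to move it back up; a step with a flat failure rate at the cheaper tier is money already saved.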


The Economics: A Worked Example

Consider a document analysis agent that processes uploaded contracts. The workflow has 18 LLM calls per document:

Before Right-Sizing (All Opus 4.6)

Step                   | Calls | Avg Tokens           | Cost/Call | Total
Document chunking      | 1     | 2,000 in / 500 out   | 0.0225    | 0.0225
Section classification | 6     | 1,000 in / 100 out   | 0.0075    | 0.0450
Clause extraction      | 4     | 1,500 in / 300 out   | 0.0150    | 0.0600
Risk assessment        | 3     | 2,000 in / 800 out   | 0.0300    | 0.0900
Summary generation     | 2     | 3,000 in / 1,000 out | 0.0400    | 0.0800
Quality validation     | 2     | 1,500 in / 200 out   | 0.0125    | 0.0250
Total per document     | 18    |                      |           | 0.3225 USD

At 1,000 documents per day: 322.50 USD/day — 117,712 USD/year

After Right-Sizing (Tiered)

Step                   | Tier | Model      | Calls | Cost/Call | Total
Document chunking      | 4    | GPT-5 nano | 1     | 0.0003    | 0.0003
Section classification | 4    | GPT-5 nano | 6     | 0.0001    | 0.0006
Clause extraction      | 3    | Haiku 4.5  | 4     | 0.0030    | 0.0120
Risk assessment        | 1    | Opus 4.6   | 3     | 0.0300    | 0.0900
Summary generation     | 2    | Sonnet 4.6 | 2     | 0.0240    | 0.0480
Quality validation     | 3    | Haiku 4.5  | 2     | 0.0025    | 0.0050
Total per document     |      |            | 18    |           | 0.1559 USD

At 1,000 documents per day: 155.90 USD/day — 56,904 USD/year

Annual savings: 60,808 USD (52%)
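The worked example can be reproduced directly from the per-MTok prices in the tier tables (row-level figures above are rounded to four decimals, so the totals differ in the last digit):

```python
# Prices from the tier tables: model -> (input USD/MTok, output USD/MTok).
PRICES = {
    "gpt-5-nano": (0.05, 0.40),
    "haiku-4.5":  (1.00, 5.00),
    "sonnet-4.6": (3.00, 15.00),
    "opus-4.6":   (5.00, 25.00),
}

# The 18-call contract workflow: (calls, tokens in, tokens out, assigned model).
STEPS = [
    (1, 2_000, 500,   "gpt-5-nano"),  # document chunking
    (6, 1_000, 100,   "gpt-5-nano"),  # section classification
    (4, 1_500, 300,   "haiku-4.5"),   # clause extraction
    (3, 2_000, 800,   "opus-4.6"),    # risk assessment
    (2, 3_000, 1_000, "sonnet-4.6"),  # summary generation
    (2, 1_500, 200,   "haiku-4.5"),   # quality validation
]

def cost(calls, t_in, t_out, model):
    """Total USD for `calls` invocations of `model` at the given token counts."""
    p_in, p_out = PRICES[model]
    return calls * (t_in * p_in + t_out * p_out) / 1_000_000

tiered   = sum(cost(c, ti, to, m) for c, ti, to, m in STEPS)
all_opus = sum(cost(c, ti, to, "opus-4.6") for c, ti, to, _ in STEPS)
savings  = 1 - tiered / all_opus   # roughly 0.52
```

Running the same numbers with different tier assignments is a useful sanity check before deploying: the spreadsheet and the router should agree.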

And this is a conservative example. When Tier 4 models handle the majority of calls — as they do in many real workflows — savings of 60–90% are achievable. Research from model routing studies shows that intelligent routing of 90% of requests to budget models with only 10% going to frontier models can reduce costs by 86% without meaningful quality degradation.


The Carbon Connection

Cost savings from right-sizing are not purely financial. In our companion article, How to Measure the Carbon Footprint of Your LLM API Spend, we established a framework for converting token expenditure into gCO₂e estimates using datacenter PUE, grid emission factors, and energy-per-token ranges.

The connection is direct: cheaper tokens from smaller models also consume less energy per inference. NVIDIA's research demonstrates that running a 1–3B parameter model is 10–30x cheaper in compute (FLOPs, energy, GPU-hours) than running a 70–175B parameter model. This means that right-sizing your model selection does not just reduce your API bill — it proportionally reduces the carbon footprint of your AI operations.

For an organization processing 1,000 documents daily:

  • All-Opus configuration: ~117K USD/year in API costs, with proportionally higher energy consumption from frontier-class GPU utilization
  • Right-sized configuration: ~57K USD/year, with the majority of inference running on smaller, more energy-efficient models

The environmental and financial incentives are perfectly aligned. Every dollar saved through right-sizing represents real energy not consumed and real emissions not produced.


Implementation Patterns

Pattern 1: Static Routing

The simplest approach. Hard-code which model handles which step at development time.

Pros: No routing overhead, deterministic behavior, easy to debug
Cons: Cannot adapt to edge cases, requires manual tuning

Best for: Well-understood workflows with stable task profiles.
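A static route is nothing more than a fixed table consulted at call time. A minimal sketch, with illustrative step and model identifiers:

```python
# Static routing: the model for each step is fixed at development time.
ROUTES = {
    "chunk_document":   "gpt-5-nano",
    "classify_section": "gpt-5-nano",
    "extract_clauses":  "claude-haiku-4.5",
    "assess_risk":      "claude-opus-4.6",
}

def model_for(step: str) -> str:
    # Unknown steps raise KeyError: fail loudly rather than silently
    # defaulting every unmapped call to a frontier model.
    return ROUTES[step]
```

The deliberate lack of a fallback is a design choice: a new step added without a routing decision should break a test, not quietly inherit the most expensive model.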

Pattern 2: Confidence-Based Escalation

Start every call at the cheapest viable tier. If the model's output confidence falls below a threshold, escalate to the next tier.

Pros: Automatically optimizes cost-quality tradeoff, handles edge cases
Cons: Requires reliable confidence signals, adds latency for escalated calls

Best for: High-volume systems where even small per-call savings compound.
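The escalation loop itself is simple; the hard part in practice is the confidence signal. In this sketch, `call_model` and its `confidence` field are placeholders for your provider integration (e.g. a logprob-derived score or a self-rating), not a real API:

```python
# Confidence-based escalation: try the cheapest viable tier first and
# escalate only when the model reports low confidence.
ESCALATION_LADDER = ["gpt-5-nano", "claude-haiku-4.5", "claude-sonnet-4.6"]
THRESHOLD = 0.8

def answer(task, call_model):
    """Return (text, model used). `call_model(model, task)` is assumed to
    return a dict like {"text": ..., "confidence": float in [0, 1]}."""
    result = None
    for model in ESCALATION_LADDER:
        result = call_model(model, task)
        if result["confidence"] >= THRESHOLD:
            return result["text"], model
    # Every tier was unsure: accept the top tier's answer anyway.
    return result["text"], ESCALATION_LADDER[-1]
```

Note the worst case pays for every rung of the ladder, which is why this pattern wins only when most calls resolve at the bottom tier.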

Pattern 3: Router Model

Use a dedicated lightweight model (Tier 4) as a classifier that examines each incoming task and routes it to the appropriate tier.

Pros: Adaptive, learns from patterns, centralizes routing logic
Cons: Adds one extra LLM call per task, router errors cascade

Best for: Complex multi-agent systems with diverse task types.
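The dispatch side of this pattern is a small function wrapped around the classifier. Here `classify` stands in for a real call to a Tier 4 model (an assumption, not a specific API):

```python
# Router-model pattern: one cheap classification call picks the tier
# for the real call.
ROUTE_TABLE = {
    "low":      "gpt-5-nano",
    "medium":   "claude-haiku-4.5",
    "high":     "claude-sonnet-4.6",
    "critical": "claude-opus-4.6",
}

def route(task, classify):
    label = classify(task)  # one extra lightweight LLM call per task
    # Router errors cascade, so unrecognized labels fall back to a
    # safe middle tier rather than the cheapest one.
    return ROUTE_TABLE.get(label, "claude-sonnet-4.6")
```

The fallback choice matters: defaulting a misclassified task to Tier 2 costs a little extra; defaulting it to Tier 4 risks a quality failure.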

Pattern 4: Offline Analysis and Batch Optimization

Log all agent calls in production, then periodically analyze which calls could be downgraded to cheaper tiers without quality loss. Apply the Batch API (50% discount on most providers) for non-time-sensitive operations.

Pros: Data-driven, no runtime complexity, captures long-tail optimizations
Cons: Requires logging infrastructure, changes are delayed

Best for: Mature systems optimizing for maximum cost reduction.


Common Mistakes

Mistake 1: Defaulting to Frontier for Everything

This is the most expensive mistake and the most common one. Engineering teams often prototype with the most capable model and never revisit the decision. What works in a prototype becomes a 4M USD/year production cost.

Mistake 2: Optimizing Only for Token Price

Token price is one variable. Latency, throughput, and quality-per-dollar matter just as much. A model that costs 2x more but completes tasks in half the time may be the right choice for latency-sensitive steps.

Mistake 3: Ignoring the Batch API

For any step that is not user-facing and not time-sensitive — background validation, batch extraction, quality audits — the Batch API cuts costs by 50%. Combined with right-sizing, this creates compounding savings.

Mistake 4: Treating All Agent Steps Equally

A routing decision and a risk assessment are fundamentally different tasks. The framework exists precisely to make this distinction explicit rather than implicit.


The Framework Checklist

Use this checklist when designing or auditing any agentic workflow:

  1. Map every LLM call in your agent pipeline (most teams undercount by 30–50%)
  2. Classify each call as Low, Medium, High, or Critical complexity
  3. Assign initial tier using the DCAM matrix
  4. Deploy and measure quality, cost, and latency per step
  5. Iterate monthly — model capabilities improve; pricing drops; your tier assignments should evolve
  6. Calculate your carbon impact using cost-normalized emission factors (see our carbon accounting framework)
  7. Report both metrics — CFOs care about dollars, boards increasingly care about emissions

Conclusion

The model right-sizing framework is not about choosing worse models. It is about choosing appropriate models. NVIDIA's research confirms what production experience already teaches: small language models are not just cheaper alternatives — they are often better suited for the narrow, well-defined tasks that make up the majority of agentic workflows.

The enterprises that will win in the agentic era are not the ones spending the most on AI. They are the ones spending the most intelligently — matching model capability to task complexity at every step, measuring the results, and iterating relentlessly.

The math is clear. A 20-step agentic workflow where 15 steps use Tier 3–4 models and 5 steps use Tier 1–2 models will cost a fraction of one that runs everything through Opus. And that fraction? It scales linearly with every additional user session, every additional agent, every additional workflow.

Right-size your models. Your budget — and the planet — will thank you.