Organizations increasingly use large language models (LLMs) through token-metered APIs. Finance teams can measure dollars precisely, but sustainability teams face a harder question: how much CO₂ does each dollar of AI spend actually produce?
Naïve approaches assume dollars map directly to energy, but API pricing includes margin, product strategy, and overhead. Physical emissions depend on utilization (batching), datacenter efficiency, and the electricity mix at the time and location of computation.
This paper proposes a cost-normalized accounting framework that estimates gCO₂e per USD of LLM inference spend using location-based grid emission factors, explicit datacenter efficiency assumptions (PUE), and empirically grounded ranges for energy intensity per token (J/token).
Scope, Boundaries, and Reporting Conventions
Operational Boundary Options
We define three nested boundaries:
- Boundary A: Operational electricity only. Includes GPU/CPU/memory/network electricity consumption attributable to inference, multiplied by PUE and grid intensity.
- Boundary B: Boundary A + embodied hardware (allocated). Adds manufacturing/transport/end-of-life emissions allocated to the user’s share of compute.
- Boundary C: Boundary B + training amortization (allocated). Adds a per-token allocation of upstream training emissions.
Boundary choice is a reporting decision and must be disclosed. In many contexts, Boundary A is the only one feasible without provider disclosures; B and C can be estimated parametrically.
Location-Based vs Market-Based Electricity Factors
We recommend location-based factors for “no-greenwashing” physical emissions estimates. Market-based instruments (RECs/PPAs) may be disclosed separately, but they do not change the electricity physically consumed at the time of inference.
Related Work and Data Sources
Three evidence streams support inference carbon estimation:
- Grid emission factors (regional electricity carbon intensity), e.g., U.S. EPA eGRID subregion rates.
- Datacenter efficiency (PUE), typically published by hyperscalers (e.g., AWS global PUE).
- Inference energy intensity (J/token or kWh/million tokens), measured or inferred from throughput and power under different batch sizes and serving regimes.
For embodied hardware allocation, practical methodologies exist in the cloud carbon accounting ecosystem and provider/third-party frameworks for fleet-scale hardware embodied emissions.
The Framework: From Tokens and Dollars to CO₂e
Core Variables
- $T_{in}$, $T_{out}$ = input and output tokens for a workload
- $P_{in}$, $P_{out}$ = price per token for input and output (USD/token)
- $\tau$ = tokens per dollar for an observed workload (tokens/USD)
- $j$ = energy intensity per token (J/token)
- $\text{PUE}$ = power usage effectiveness (dimensionless)
- $CI$ = location-based grid carbon intensity (kgCO₂e/kWh)
Conversion constant:
$$1 \text{ kWh} = 3.6 \times 10^6 \text{ J}$$
Effective Price Per Token
LLM APIs often bill input and output differently. Define the effective price per token for a workload:
$$p_{\text{eff}} = \frac{T_{in} \cdot P_{in} + T_{out} \cdot P_{out}}{T_{in} + T_{out}}$$
Then tokens per dollar:
$$\tau = \frac{1}{p_{\text{eff}}}$$
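The two definitions above can be sketched in a few lines; the token counts and per-token prices below are illustrative placeholders, not any provider's actual schedule:

```python
# Effective price per token (blended over input/output) and tokens per dollar.

def effective_price_per_token(t_in: int, t_out: int,
                              p_in: float, p_out: float) -> float:
    """Blended USD/token for a workload with t_in input and t_out output tokens."""
    return (t_in * p_in + t_out * p_out) / (t_in + t_out)

def tokens_per_dollar(p_eff: float) -> float:
    """tau = 1 / p_eff, in tokens/USD."""
    return 1.0 / p_eff

# Illustrative workload: 800k input tokens at $2/M, 200k output tokens at $8/M.
p_eff = effective_price_per_token(800_000, 200_000, 2e-6, 8e-6)  # 3.2e-6 USD/token
tau = tokens_per_dollar(p_eff)                                   # 312,500 tokens/USD
```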
Operational Emissions Per Token (Boundary A)
Operational emissions per token are:
$$e_{\text{tok}} = \frac{j}{3.6 \times 10^6} \cdot \text{PUE} \cdot CI \quad [\text{kgCO}_2\text{e/token}]$$
Operational Emissions Per Dollar (Boundary A)
Cost-normalized operational emissions are:
$$e_{\text{USD}} = \tau \cdot \frac{j}{3.6 \times 10^6} \cdot \text{PUE} \cdot CI \quad [\text{kgCO}_2\text{e/USD}]$$
Or in grams per dollar:
$$\text{gCO}_2\text{e/USD} = 1000 \cdot e_{\text{USD}} \quad [\text{gCO}_2\text{e/USD}]$$
This is the central result: gCO₂e/USD depends linearly on tokens per dollar, joules per token, PUE, and grid intensity — not on dollars alone.
This means two services (or two deployments) with the same spend can have drastically different emissions: pricing does not encode utilization, and utilization drives J/token. Dollars normalize across models only if (i) your effective price per token is stable and (ii) the backend energy per token is comparable — both often false in practice.
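The central result reduces to one small function of the four drivers. In the sketch below, the example values for $\tau$ and $j$ are illustrative assumptions, not measurements:

```python
# Boundary A cost-normalized emissions: gCO2e/USD from the four drivers.
J_PER_KWH = 3.6e6  # conversion constant: 1 kWh = 3.6e6 J

def gco2e_per_usd(tau: float, j: float, pue: float, ci: float) -> float:
    """tau: tokens/USD, j: J/token, pue: dimensionless, ci: kgCO2e/kWh."""
    e_usd = tau * (j / J_PER_KWH) * pue * ci  # kgCO2e/USD
    return 1000.0 * e_usd                     # gCO2e/USD

# Illustrative: 312,500 tokens/USD, 20 J/token, PUE 1.15, ERCT grid (~0.333).
x = gco2e_per_usd(312_500, 20.0, 1.15, 0.333)  # ~665 gCO2e/USD
```

Note that doubling $\tau$ (a cheaper model) doubles gCO₂e/USD at fixed $j$, which is exactly why dollars alone are a poor proxy.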
Parameterization for U.S. Regions, PUE, and J/token
Location-Based Grid Intensity for Likely U.S. Regions
Using EPA eGRID 2023 subregion CO₂ emission rates, we can derive representative kgCO₂e/kWh:
- ERCT (Texas/ERCOT): ~0.333 kgCO₂e/kWh
- SRVC (Virginia/Carolinas region): ~0.269 kgCO₂e/kWh
Exact subregion choice should match your best estimate of serving region; absent that, treat region as a sensitivity axis.
PUE (Datacenter Overhead)
A reasonable first assumption is a hyperscaler-class PUE. AWS reports a global PUE of 1.15 for 2024. If the provider or region differs, treat PUE as a range (e.g., 1.10–1.30) and disclose it.
Energy Intensity Per Token (J/token)
Energy per token varies strongly with batching/utilization. Research estimates span roughly 9–72 J/token across batch sizes and power assumptions, equivalent to about 2.5–20 kWh per million tokens.
Given limited user visibility into provider batching, we recommend using a range for $j$ rather than a point estimate.
Sensitivity Analysis: Effects of Batch Size, Grid Region, and Pricing
Purpose
Because API users cannot observe internal serving utilization (batching, queueing, hardware selection), a single J/token is rarely defensible. Instead, we quantify how gCO₂e/USD varies under plausible serving regimes and U.S. grid regions.
Scenario Design
We vary:
- Utilization regime via $j$ using batching-linked ranges:
- Efficient / well-batched: $j \approx$ 9–19 J/token
- Moderate: $j \approx$ 14–42 J/token
- Poorly batched: $j \approx$ 36–72 J/token
- Grid region using eGRID subregions:
- SRVC (≈0.269 kg/kWh)
- ERCT (≈0.333 kg/kWh)
- PUE fixed at 1.15 as a baseline hyperscaler assumption.
- Tokens per dollar $\tau$: computed from official price schedules and your workload’s input/output mix.
Reporting Outputs
We recommend publishing three factors:
- Low: (efficient $j$, lower $CI$)
- Mid: (moderate $j$, mid $CI$)
- High: (poor $j$, higher $CI$)
with full parameter disclosure.
Interpretation
This sensitivity structure quantifies the earlier point that pricing does not encode utilization while utilization drives $j$: gCO₂e/USD can vary by more than 10× across plausible batching/utilization regimes even when grid region and PUE are held constant.
Extending Beyond Operational Electricity: Embodied Hardware and Training
Embodied Hardware Allocation (Boundary B)
Embodied emissions represent manufacturing/transport/end-of-life impacts. In cloud settings, a common approach is to allocate embodied emissions proportionally to usage share:
$$e_{\text{emb,tok}} = \frac{E_{\text{emb,total}}}{H_{\text{life}} \cdot u \cdot r_{\text{tok/hr}}}$$
Where:
- $E_{\text{emb,total}}$ = embodied emissions for the relevant server/GPU system (kgCO₂e)
- $H_{\text{life}}$ = lifetime hours (e.g., 3 years $\approx$ 26,280 h)
- $u$ = average utilization fraction attributable to serving
- $r_{\text{tok/hr}}$ = tokens produced per hour at that utilization
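A minimal sketch of the allocation, with every input a hypothetical assumption rather than a measured value:

```python
# Parametric embodied-emissions allocation per token (Boundary B).

def embodied_per_token(e_emb_total: float, life_hours: float,
                       utilization: float, tokens_per_hour: float) -> float:
    """kgCO2e/token allocated from system embodied emissions over its lifetime output."""
    return e_emb_total / (life_hours * utilization * tokens_per_hour)

# Hypothetical: 1,500 kgCO2e server system, 3-year life (26,280 h),
# 60% serving utilization, 5M tokens/hour.
e_emb = embodied_per_token(1500.0, 26_280, 0.6, 5_000_000)  # ~1.9e-8 kgCO2e/token
```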
Provider methodologies increasingly attempt to measure embodied carbon at scale, but API customers usually lack asset-level data. We recommend Boundary B as optional unless your cloud provider exposes embodied allocations.
Training Amortization (Boundary C)
Training emissions can be allocated across a model’s lifetime inference output:
$$e_{\text{train,tok}} = \frac{E_{\text{train,total}}}{T_{\text{lifetime}}}$$
Where $E_{\text{train,total}}$ = total training emissions for the model (kgCO₂e) and $T_{\text{lifetime}}$ = total tokens served over the model’s commercial lifetime.
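Since the disclosed inputs rarely exist, Boundary C is best run as a named scenario. Both figures below are hypothetical, chosen only to show the magnitude of the per-token allocation:

```python
# Scenario-based training amortization (Boundary C); both inputs are assumptions.
E_TRAIN_KG = 500_000.0  # assumed total training emissions: 500 tCO2e
T_LIFETIME = 1e12       # assumed tokens served over the model's commercial lifetime
e_train_tok = E_TRAIN_KG / T_LIFETIME  # 5e-7 kgCO2e/token, i.e. 0.5 mg/token
```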
For frontier proprietary models, these figures are rarely disclosed, so Boundary C is typically scenario-based. Emerging efforts toward standardized environmental disclosures for AI highlight the need for comparable lifecycle reporting.
Practical Implementation
Minimum Viable Method (Boundary A)
- Measure $\tau$ (tokens/USD) from billing logs for your workload (preferred). If not available, compute $\tau$ from official prices and your observed input/output mix.
- Select grid region(s) and derive location-based $CI$. If the serving region is uncertain, use a low/high region pair (e.g., SRVC vs ERCT).
- Select PUE (baseline 1.15; range if desired).
- Select J/token range from empirical sources. Use at least a low/mid/high $j$.
- Compute:
$$\frac{\text{gCO}_2\text{e}}{\text{USD}} = 1000 \cdot \tau \cdot \left(\frac{j}{3.6 \times 10^6}\right) \cdot \text{PUE} \cdot CI$$
Multi-Model Pipelines
For pipelines invoking multiple models (e.g., GPT-4.1, GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro), compute per-call emissions and sum:
$$\text{CO}_2\text{e}_{\text{pipeline}} = \sum_{m \in \mathcal{M}} \left( (T_{\text{in},m} + T_{\text{out},m}) \cdot e_{\text{tok},m} \right)$$
If you only know dollars by model, compute:
$$\text{CO}_2\text{e}_{\text{pipeline}} = \sum_{m \in \mathcal{M}} \left( S_m \cdot e_{\text{USD},m} \right)$$
Where $e_{\text{USD},m}$ uses that model’s effective $\tau_m$ (from prices or observed logs).
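The dollars-by-model rollup is a straightforward weighted sum. Model names and per-model factors below are illustrative placeholders, not published values:

```python
# Multi-model pipeline rollup: sum spend_m * e_usd_m across models.

def pipeline_co2e(spend_usd: dict[str, float],
                  e_usd: dict[str, float]) -> float:
    """Total kgCO2e for a pipeline, given USD spend and kgCO2e/USD per model."""
    return sum(s * e_usd[m] for m, s in spend_usd.items())

spend = {"model_a": 120.0, "model_b": 45.0}    # USD by model (illustrative)
factors = {"model_a": 0.40, "model_b": 0.65}   # kgCO2e/USD (hypothetical)
total = pipeline_co2e(spend, factors)          # 120*0.40 + 45*0.65 = 77.25 kg
```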
Limitations and Recommended Disclosures
Key Uncertainties
- Serving utilization / batching: dominant driver of J/token.
- Exact region and time-varying grid: location-based factors can vary seasonally and hourly; eGRID is an annual average.
- Model-specific hardware choices: different providers may use different accelerators and serving stacks, affecting J/token.
- Training and embodied allocations: often not observable for proprietary frontier models.
Disclosure Checklist
When reporting gCO₂e/USD, disclose:
- Boundary (A/B/C)
- CI source and subregion
- PUE assumption and source
- $j$ source and range
- $\tau$ derivation (observed billing vs price schedule)
Conclusion
This framework provides a practical, auditable method to compute emissions per dollar of LLM token spend using location-based grid factors. The key insight is that gCO₂e/USD is not a constant: it scales with (i) tokens per dollar (pricing and workload mix), (ii) joules per token (utilization/batching), (iii) PUE, and (iv) regional grid intensity.
Because utilization dominates uncertainty, the most scientifically defensible approach for API customers is to publish a range (low/mid/high) rather than a single value. Optional modules extend the framework to embodied hardware and training amortization when disclosures or credible estimates exist.
Standardized environmental reporting for AI would materially improve accuracy by reducing uncertainty in $j$, region, and lifecycle allocations.