Organizations increasingly use large language models (LLMs) through token-metered APIs. Finance teams can measure dollars precisely, but sustainability teams face a harder question: how much CO₂ does each dollar of AI spend actually produce?
Naïve approaches assume dollars map directly to energy, but API pricing includes margin, product strategy, and overhead. Physical emissions depend on utilization (batching), datacenter efficiency, and the electricity mix at the time and location of computation.
This paper proposes a cost-normalized accounting framework that estimates gCO₂e per USD of LLM inference spend using location-based grid emission factors, explicit datacenter efficiency assumptions (PUE), and empirically grounded ranges for energy intensity per token (J/token).
Scope, Boundaries, and Reporting Conventions
Operational Boundary Options
We define three nested boundaries:
- Boundary A: Operational electricity only. Includes GPU/CPU/memory/network electricity consumption attributable to inference, multiplied by PUE and grid intensity.
- Boundary B: Boundary A + embodied hardware (allocated). Adds manufacturing/transport/end-of-life emissions allocated to the user’s share of compute.
- Boundary C: Boundary B + training amortization (allocated). Adds a per-token allocation of upstream training emissions.
Boundary choice is a reporting decision and must be disclosed. In many contexts, Boundary A is the only one feasible without provider disclosures; B and C can be estimated parametrically.
Location-Based vs Market-Based Electricity Factors
We recommend location-based factors for “no-greenwashing” physical emissions estimates. Market-based instruments (RECs/PPAs) may be disclosed separately, but they do not change the electricity physically consumed at the time of inference.
Related Work and Data Sources
Three evidence streams support inference carbon estimation:
- Grid emission factors (regional electricity carbon intensity), e.g., U.S. EPA eGRID subregion rates.
- Datacenter efficiency (PUE), typically published by hyperscalers (e.g., AWS global PUE).
- Inference energy intensity (J/token or kWh/million tokens), measured or inferred from throughput and power under different batch sizes and serving regimes.
For embodied hardware allocation, practical methodologies exist in the cloud carbon accounting ecosystem and provider/third-party frameworks for fleet-scale hardware embodied emissions.
The Framework: From Tokens and Dollars to CO₂e
Core Variables
- $T_{in}$, $T_{out}$ = input and output tokens for a workload
- $P_{in}$, $P_{out}$ = price per token for input and output (USD/token)
- $\tau$ = tokens per dollar for an observed workload (tokens/USD)
- $j$ = energy intensity per token (J/token)
- $\text{PUE}$ = power usage effectiveness (dimensionless)
- $CI$ = location-based grid carbon intensity (kgCO₂e/kWh)
Conversion constant:
$$1 \text{ kWh} = 3.6 \times 10^6 \text{ J}$$
Effective Price Per Token
LLM APIs often bill input and output differently. Define the effective price per token for a workload:
$$p_{\text{eff}} = \frac{T_{in} \cdot P_{in} + T_{out} \cdot P_{out}}{T_{in} + T_{out}}$$
Then tokens per dollar:
$$\tau = \frac{1}{p_{\text{eff}}}$$
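The two definitions above can be sketched in a few lines; the token counts and per-token prices below are illustrative placeholders, not any provider's actual schedule:

```python
# Effective price per token (blended over input/output) and tokens per dollar.

def effective_price_per_token(t_in: int, t_out: int,
                              p_in: float, p_out: float) -> float:
    """Blended USD/token for a workload with t_in input and t_out output tokens."""
    return (t_in * p_in + t_out * p_out) / (t_in + t_out)

def tokens_per_dollar(p_eff: float) -> float:
    """tau = 1 / p_eff, in tokens/USD."""
    return 1.0 / p_eff

# Illustrative workload: 800k input tokens at $2/M, 200k output tokens at $8/M.
p_eff = effective_price_per_token(800_000, 200_000, 2e-6, 8e-6)  # 3.2e-6 USD/token
tau = tokens_per_dollar(p_eff)                                   # 312,500 tokens/USD
```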
Operational Emissions Per Token (Boundary A)
Operational emissions per token are:
$$e_{\text{tok}} = \frac{j}{3.6 \times 10^6} \cdot \text{PUE} \cdot CI \quad [\text{kgCO}_2\text{e/token}]$$
Operational Emissions Per Dollar (Boundary A)
Cost-normalized operational emissions are:
$$e_{\text{USD}} = \tau \cdot \frac{j}{3.6 \times 10^6} \cdot \text{PUE} \cdot CI \quad [\text{kgCO}_2\text{e/USD}]$$
Or in grams per dollar:
$$\text{gCO}_2\text{e/USD} = 1000 \cdot e_{\text{USD}} \quad [\text{gCO}_2\text{e/USD}]$$
This is the central result: gCO₂e/USD depends linearly on tokens per dollar, joules per token, PUE, and grid intensity — not on dollars alone.
This means two services (or two deployments) with the same spend can have drastically different emissions: pricing does not encode utilization, and utilization drives J/token. Dollars normalize across models only if (i) your effective price per token is stable and (ii) the backend energy per token is comparable — both often false in practice.
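The central result reduces to one small function of the four drivers. In the sketch below, the example values for $\tau$ and $j$ are illustrative assumptions, not measurements:

```python
# Boundary A cost-normalized emissions: gCO2e/USD from the four drivers.
J_PER_KWH = 3.6e6  # conversion constant: 1 kWh = 3.6e6 J

def gco2e_per_usd(tau: float, j: float, pue: float, ci: float) -> float:
    """tau: tokens/USD, j: J/token, pue: dimensionless, ci: kgCO2e/kWh."""
    e_usd = tau * (j / J_PER_KWH) * pue * ci  # kgCO2e/USD
    return 1000.0 * e_usd                     # gCO2e/USD

# Illustrative: 312,500 tokens/USD, 20 J/token, PUE 1.15, ERCT grid (~0.333).
x = gco2e_per_usd(312_500, 20.0, 1.15, 0.333)  # ~665 gCO2e/USD
```

Note that doubling $\tau$ (a cheaper model) doubles gCO₂e/USD at fixed $j$, which is exactly why dollars alone are a poor proxy.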
Parameterization for U.S. Regions, PUE, and J/token
Location-Based Grid Intensity for Likely U.S. Regions
Using EPA eGRID 2023 subregion CO₂ emission rates, we can derive representative kgCO₂e/kWh:
- ERCT (Texas/ERCOT): ~0.333 kgCO₂e/kWh
- SRVC (Virginia/Carolinas region): ~0.269 kgCO₂e/kWh
Exact subregion choice should match your best estimate of serving region; absent that, treat region as a sensitivity axis.
PUE (Datacenter Overhead)
A reasonable first assumption is a hyperscaler-class PUE. AWS reports a global PUE of 1.15 for 2024. If the provider or region differs, treat PUE as a range (e.g., 1.10–1.30) and disclose it.
Energy Intensity Per Token (J/token)
Energy per token varies strongly with batching/utilization. Research estimates span roughly 9–72 J/token across batch sizes and power assumptions, equivalent to about 2.5–20 kWh per million tokens.
Given limited user visibility into provider batching, we recommend using a range for $j$ rather than a point estimate.
Sensitivity Analysis: Effects of Batch Size, Grid Region, and Pricing
Purpose
Because API users cannot observe internal serving utilization (batching, queueing, hardware selection), a single J/token is rarely defensible. Instead, we quantify how gCO₂e/USD varies under plausible serving regimes and U.S. grid regions.
Scenario Design
We vary:
- Utilization regime via $j$ using batching-linked ranges:
- Efficient / well-batched: $j \approx$ 9–19 J/token
- Moderate: $j \approx$ 14–42 J/token
- Poorly batched: $j \approx$ 36–72 J/token
- Grid region using eGRID subregions:
- SRVC (≈0.269 kg/kWh)
- ERCT (≈0.333 kg/kWh)
- PUE fixed at 1.15 as a baseline hyperscaler assumption.
- Tokens per dollar $\tau$: computed from official price schedules and your workload’s input/output mix.
Reporting Outputs
We recommend publishing three factors:
- Low: (efficient $j$, lower $CI$)
- Mid: (moderate $j$, mid $CI$)
- High: (poor $j$, higher $CI$)
with full parameter disclosure.
Interpretation
This sensitivity structure quantifies the earlier point that pricing does not encode utilization while utilization drives $j$: gCO₂e/USD can vary by more than 10× across plausible batching/utilization regimes even when grid region and PUE are held constant.
Extending Beyond Operational Electricity: Embodied Hardware and Training
Embodied Hardware Allocation (Boundary B)
Embodied emissions represent manufacturing/transport/end-of-life impacts. In cloud settings, a common approach is to allocate embodied emissions proportionally to usage share:
$$e_{\text{emb,tok}} = \frac{E_{\text{emb,total}}}{H_{\text{life}} \cdot u \cdot r_{\text{tok/hr}}}$$
Where:
- $E_{\text{emb,total}}$ = embodied emissions for the relevant server/GPU system (kgCO₂e)
- $H_{\text{life}}$ = lifetime hours (e.g., 3 years $\approx$ 26,280 h)
- $u$ = average utilization fraction attributable to serving
- $r_{\text{tok/hr}}$ = tokens produced per hour at that utilization
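A minimal sketch of the allocation, with every input a hypothetical assumption rather than a measured value:

```python
# Parametric embodied-emissions allocation per token (Boundary B).

def embodied_per_token(e_emb_total: float, life_hours: float,
                       utilization: float, tokens_per_hour: float) -> float:
    """kgCO2e/token allocated from system embodied emissions over its lifetime output."""
    return e_emb_total / (life_hours * utilization * tokens_per_hour)

# Hypothetical: 1,500 kgCO2e server system, 3-year life (26,280 h),
# 60% serving utilization, 5M tokens/hour.
e_emb = embodied_per_token(1500.0, 26_280, 0.6, 5_000_000)  # ~1.9e-8 kgCO2e/token
```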
Provider methodologies increasingly attempt to measure embodied carbon at scale, but API customers usually lack asset-level data. We recommend Boundary B as optional unless your cloud provider exposes embodied allocations.
Training Amortization (Boundary C)
Training emissions can be allocated across a model’s lifetime inference output:
$$e_{\text{train,tok}} = \frac{E_{\text{train,total}}}{T_{\text{lifetime}}}$$
Where $E_{\text{train,total}}$ = total training emissions for the model (kgCO₂e) and $T_{\text{lifetime}}$ = total tokens served over the model’s commercial lifetime.
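Since the disclosed inputs rarely exist, Boundary C is best run as a named scenario. Both figures below are hypothetical, chosen only to show the magnitude of the per-token allocation:

```python
# Scenario-based training amortization (Boundary C); both inputs are assumptions.
E_TRAIN_KG = 500_000.0  # assumed total training emissions: 500 tCO2e
T_LIFETIME = 1e12       # assumed tokens served over the model's commercial lifetime
e_train_tok = E_TRAIN_KG / T_LIFETIME  # 5e-7 kgCO2e/token, i.e. 0.5 mg/token
```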
For frontier proprietary models, these figures are rarely disclosed, so Boundary C is typically scenario-based. Emerging efforts toward standardized environmental disclosures for AI highlight the need for comparable lifecycle reporting.
Practical Implementation
Minimum Viable Method (Boundary A)
- Measure $\tau$ (tokens/USD) from billing logs for your workload (preferred). If not available, compute $\tau$ from official prices and your observed input/output mix.
- Select grid region(s) and derive location-based $CI$. If the serving region is uncertain, use a low/high region pair (e.g., SRVC vs ERCT).
- Select PUE (baseline 1.15; range if desired).
- Select J/token range from empirical sources. Use at least a low/mid/high $j$.
- Compute:
$$\frac{\text{gCO}_2\text{e}}{\text{USD}} = 1000 \cdot \tau \cdot \left(\frac{j}{3.6 \times 10^6}\right) \cdot \text{PUE} \cdot CI$$
Multi-Model Pipelines
For pipelines invoking multiple models (e.g., GPT-4.1, GPT-5.2, Claude Sonnet 4.5, Gemini 3 Pro), compute per-call emissions and sum:
$$\text{CO}_2\text{e}_{\text{pipeline}} = \sum_{m \in \mathcal{M}} \left( (T_{\text{in},m} + T_{\text{out},m}) \cdot e_{\text{tok},m} \right)$$
If you only know dollars by model, compute:
$$\text{CO}_2\text{e}_{\text{pipeline}} = \sum_{m \in \mathcal{M}} \left( S_m \cdot e_{\text{USD},m} \right)$$
Where $e_{\text{USD},m}$ uses that model’s effective $\tau_m$ (from prices or observed logs).
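The dollars-by-model rollup is a straightforward weighted sum. Model names and per-model factors below are illustrative placeholders, not published values:

```python
# Multi-model pipeline rollup: sum spend_m * e_usd_m across models.

def pipeline_co2e(spend_usd: dict[str, float],
                  e_usd: dict[str, float]) -> float:
    """Total kgCO2e for a pipeline, given USD spend and kgCO2e/USD per model."""
    return sum(s * e_usd[m] for m, s in spend_usd.items())

spend = {"model_a": 120.0, "model_b": 45.0}    # USD by model (illustrative)
factors = {"model_a": 0.40, "model_b": 0.65}   # kgCO2e/USD (hypothetical)
total = pipeline_co2e(spend, factors)          # 120*0.40 + 45*0.65 = 77.25 kg
```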
Limitations and Recommended Disclosures
Key Uncertainties
- Serving utilization / batching: dominant driver of J/token.
- Exact region and time-varying grid: location-based factors can vary seasonally and hourly; eGRID is an annual average.
- Model-specific hardware choices: different providers may use different accelerators and serving stacks, affecting J/token.
- Training and embodied allocations: often not observable for proprietary frontier models.
Disclosure Checklist
When reporting gCO₂e/USD, disclose:
- Boundary (A/B/C)
- CI source and subregion
- PUE assumption and source
- $j$ source and range
- $\tau$ derivation (observed billing vs price schedule)
Conclusion
This framework provides a practical, auditable method to compute emissions per dollar of LLM token spend using location-based grid factors. The key insight is that gCO₂e/USD is not a constant: it scales with (i) tokens per dollar (pricing and workload mix), (ii) joules per token (utilization/batching), (iii) PUE, and (iv) regional grid intensity.
Because utilization dominates uncertainty, the most scientifically defensible approach for API customers is to publish a range (low/mid/high) rather than a single value. Optional modules extend the framework to embodied hardware and training amortization when disclosures or credible estimates exist.
Standardized environmental reporting for AI would materially improve accuracy by reducing uncertainty in $j$, region, and lifecycle allocations.