Inefficient AI models are killing us

Last week I talked about whether AI data centers are stealing all your water. The short answer was no, not yet, but the trajectory is a growing problem. Data centers and AI overuse have become a hot-button issue, and the question I kept getting back, in different shapes from different readers, was, “Fine, but what are the labs actually doing about this?”

So I went looking.

Every frontier lab has a public posture on compute efficiency. They all have a small model. They all have a blog post about quantization or distillation or “responsible scaling.” Every single one of them is also pouring tens of billions of dollars into the largest data center buildout in industrial history.

The question I want to answer is whether the efficiency work is moving the needle, or whether it’s all for show. So I built a timeline. It’s the cleanest way I’ve found to see which labs led, which labs followed, and whether the energy curve is anywhere close to bending.

Jake’s compute-efficient LLM timeline

2015 - 2022

March 2015

Google. Hinton, Vinyals, and Dean publish Distilling the Knowledge in a Neural Network. The foundational distillation paper. Train a big “teacher,” squeeze its predictions into a much smaller “student,” then deploy the student. Most “mini” and “nano” models on the market today trace their lineage back to this paper.
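
For anyone who wants the mechanics, here is a minimal sketch of the soft-target idea in plain Python. The logits, the temperature of 2.0, and the function names are my own illustrative assumptions, not the paper's code: the teacher's softened predictions act as the target, and the student is scored on how closely it matches them.

```python
import math

def softmax(logits, temperature=1.0):
    """Softmax with temperature; a higher temperature gives a softer distribution."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """Cross-entropy between the softened teacher and student predictions."""
    teacher = softmax(teacher_logits, temperature)
    student = softmax(student_logits, temperature)
    return -sum(t * math.log(s) for t, s in zip(teacher, student))

# Hypothetical logits for a three-class problem (assumed values).
teacher_logits = [4.0, 1.0, 0.5]
close_student = [3.5, 1.2, 0.4]   # roughly agrees with the teacher
far_student = [0.5, 1.0, 4.0]     # disagrees with the teacher

loss_close = distillation_loss(teacher_logits, close_student)
loss_far = distillation_loss(teacher_logits, far_student)
print(f"close student loss: {loss_close:.3f}")
print(f"far student loss:   {loss_far:.3f}")
```

A student that tracks the teacher gets the lower loss, and training pushes the small model in that direction. Real pipelines typically combine this term with an ordinary loss on the true labels.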

June 2017

Google. Vaswani et al. publish Attention Is All You Need, introducing the Transformer. Not an efficiency paper, but every efficient model since has tried to compress, sparsify, or replace this architecture. The foundation for the entire conversation (and the modern AI industry).

October 2019

Hugging Face. Sanh et al. release DistilBERT, 40% smaller than BERT and 60% faster while retaining 97% of language understanding. First publicly-loved proof that distillation could turn a research-scale model into something a startup could actually run.

January 2020

OpenAI. Kaplan et al. publish Scaling Laws for Neural Language Models, formalizing the bigger-is-better curve. This is the paper that has justified every $10B training run since, and the one the efficiency-research counter-movement of the next six years was responding to.

June 2020

OpenAI. GPT-3 releases. 175B parameters. Subsequent third-party estimates put the training run’s electricity use at roughly 1,287 MWh. The starting gun for every “how do we do this with less?” research thread in the field.

January 2021

Google. Fedus, Zoph, and Shazeer publish Switch Transformers, scaling sparse Mixture-of-Experts to 1.6T parameters with up to 7x the pre-training speed of a dense baseline. The Western labs largely ignore it in production for the next three years (which they would all come to regret).

March 2022

DeepMind. Hoffmann et al. publish the Chinchilla paper and show that most flagship models are undertrained. The implication, which the field has spent four years half-absorbing, is that a properly trained smaller model beats a bloated bigger one on a fixed compute budget. This is the first rigorous case that “make it bigger” wastes both money and energy.

2023

February 2023

Meta. LLaMA releases (7B, 13B, 33B, 65B) and promptly leaks. The open-weights ecosystem that powers most of today’s efficient inference (llama.cpp, vLLM, MLX, Ollama) traces back to this single release.

May 2023

Microsoft and Helion. Helion Energy announces the world’s first fusion power purchase agreement, with Microsoft as the buyer and a plant targeting 50 MW or more, expected online by 2028. This is the earliest sign that the hyperscaler answer to AI’s energy problem was going to be “build more power,” not “use less.”

June 2023

UC Berkeley Sky Computing Lab. vLLM and PagedAttention release. KV-cache memory is managed like virtual memory in an operating system, for a 2-4x throughput gain at the same latency over the prior state of the art. Several production LLM-serving stacks now run on vLLM (or a fork of it).

September 2023

Mistral. Mistral 7B releases on September 27. Outperforms Llama 2 13B at half the size and demolishes the assumption that small open models had to be hobbyist toys.

December 2023

Microsoft. Phi-2 (2.7B) releases on December 12. Argues that data curation beats parameter count and produces a small model that outperforms 7B peers on reasoning.

Google. Gemini Nano on Pixel 8 Pro. First smartphone engineered for an on-device foundation model. Summarize-in-Recorder and Smart Reply run entirely on-device, no data center round-trip.

Mistral. Mixtral 8x7B ships on December 11. First major open Mixture-of-Experts model. Active parameters per token roughly 13B, total 47B, inference cost about half a comparable dense model.

2024

April 2024

Microsoft. Phi-3 family releases on April 23 (3.8B mini, 7B small, 14B medium). Phi-3-mini runs on a laptop and outperforms GPT-3.5 on a stack of reasoning benchmarks.

Meta. Llama 3 (8B, 70B) releases on April 18. The 8B variant becomes the default on-device open-weights model for the year, displacing Mistral 7B in most local-inference stacks.

July 2024

Apple. Apple publishes its foundation model technical report, documenting a ~3B-parameter on-device model compressed with low-bit weight quantization (a mix of 2-bit and 4-bit weights) and optimized for the Apple Neural Engine. The model outperforms comparable small models in Apple’s own reported evals.

Meta. Llama 3.1 (8B, 70B, 405B) releases. The 405B is the first open model to credibly contend with closed frontier capability, and the field uses its weights as fodder for a year of distillation and fine-tuning work.

OpenAI. GPT-4o-mini releases. 15¢/M input and 60¢/M output tokens, roughly 60% cheaper than GPT-3.5-turbo and an order of magnitude cheaper than GPT-4. The first serious small-model pricing move from OpenAI, and the start of the “mini and nano” ladder that now spans the API.

August 2024

Anthropic. Prompt caching launches on the Claude API. Cached input tokens billed at 10% of the standard rate, which for long-context workflows cuts both cost and the compute behind it by up to 90%.
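
The arithmetic behind that “up to 90%” is worth seeing once. This is a rough sketch under my own simplifying assumptions: the price is normalized to 1.0 per token, only input tokens are counted, and any fee for writing to the cache is ignored. The function names and the example workload are hypothetical.

```python
def input_cost(total_tokens, cached_tokens, price_per_token=1.0, cached_rate=0.10):
    """Input cost when cached tokens are billed at a fraction of the standard rate."""
    if not 0 <= cached_tokens <= total_tokens:
        raise ValueError("cached_tokens must be between 0 and total_tokens")
    fresh_tokens = total_tokens - cached_tokens
    return price_per_token * (fresh_tokens + cached_rate * cached_tokens)

def savings(total_tokens, cached_tokens):
    """Fraction of the uncached input bill that caching removes."""
    baseline = input_cost(total_tokens, 0)
    return 1 - input_cost(total_tokens, cached_tokens) / baseline

# Hypothetical workload: a 100,000-token prompt where 95,000 tokens repeat across calls.
mostly_cached = savings(100_000, 95_000)
fully_cached = savings(100_000, 100_000)
print(f"95% of prompt cached:  {mostly_cached:.1%} saved")
print(f"100% of prompt cached: {fully_cached:.1%} saved")
```

The 90% figure is a ceiling that only applies when the entire prompt is a cache hit. The savings shrink in proportion to how much of each request is new.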

September 2024

Microsoft and Constellation. Constellation announces the Three Mile Island Unit 1 restart, backed by a 20-year power purchase agreement with Microsoft. The 835-MW reactor is expected back online in 2028 to feed Azure AI workloads. Other companies announce nuclear deals in the months that follow.

October 2024

Apple. Apple Intelligence ships in iOS 18.1. First mass-market on-device foundation model deployed in production on phones already in hundreds of millions of pockets. Every query handled on-device is one that didn’t make a data center round-trip, positioning Apple with the most environmentally responsible product strategy in the industry.

December 2024

DeepSeek. DeepSeek V3 releases. A 671B-parameter MoE with 37B active per token. Hits GPT-4-class benchmarks for an API price an order of magnitude below OpenAI’s.

Google. Gemini 2.0 Flash releases, outperforming Gemini 1.5 Pro on key benchmarks at twice the speed. The Flash variant pushes the frontier price floor down by another meaningful chunk.

Microsoft. Phi-4 (14B) releases. Strong showing on math and reasoning benchmarks against models several times its size. The Phi line is now the longest-running efficiency bet of any major lab.

2025

January 2025

DeepSeek. DeepSeek R1 releases, an open reasoning model trained at a small fraction of o1’s reported cost. The release triggered a temporary $600B drop in US AI-related market cap and a permanent shift in how the field talks about training efficiency.

OpenAI, SoftBank, Oracle, MGX. The Stargate Project is announced, a $500B commitment to US AI infrastructure over four years. The largest single capex announcement in the industry’s history, made the day after DeepSeek R1 dropped, and timed for the political stage (rather than the engineering one).

February 2025

Epoch AI. Epoch publishes How much energy does ChatGPT use? The independent reconstruction lands at roughly 0.3 Wh per typical GPT-4o query, an order of magnitude below the widely cited 2023 estimate of ~3 Wh. The 10x drop comes from better hardware (H100 over A100), more accurate token counts (~269 average output tokens, not 2,000), and updated parameter assumptions. The narrative begins to shift from “ChatGPT burns more energy than Google search” to “the labs still don’t publish methodology, and reasoning queries can be orders of magnitude more expensive than chat ones.”
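
To put the per-query number on the same scale as a training run, here is a quick back-of-envelope sketch. It uses only figures already quoted in this timeline (the 1,287 MWh GPT-3 training estimate and the 0.3 Wh and 3 Wh per-query estimates). Treat it as a scale comparison only, since those numbers describe different models and come from different sources; the function and constant names are mine.

```python
WH_PER_MWH = 1_000_000  # 1 MWh = 1,000,000 Wh

def queries_per_training_run(training_mwh, wh_per_query):
    """Number of queries whose combined energy equals one training run."""
    return training_mwh * WH_PER_MWH / wh_per_query

GPT3_TRAINING_MWH = 1_287   # third-party estimate quoted in the June 2020 entry
EPOCH_WH_PER_QUERY = 0.3    # Epoch AI estimate
OLDER_WH_PER_QUERY = 3.0    # widely cited 2023 estimate

at_epoch_rate = queries_per_training_run(GPT3_TRAINING_MWH, EPOCH_WH_PER_QUERY)
at_older_rate = queries_per_training_run(GPT3_TRAINING_MWH, OLDER_WH_PER_QUERY)
print(f"{at_epoch_rate:,.0f} queries at 0.3 Wh each")
print(f"{at_older_rate:,.0f} queries at 3 Wh each")
```

At 0.3 Wh per query, it takes roughly 4.3 billion queries to match the GPT-3 training estimate. That is why the volume of queries, not the per-query number, is the figure worth watching.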

March 2025

Google. Gemma 3 releases (1B, 4B, 12B, 27B). Open-weights line built for laptop and single-GPU deployment with up to 128K context. The 27B variant beats models 3x its size on common reasoning benchmarks.

April 2025

Meta. Llama 4 releases (Scout 17B-active/109B-total/16 experts, Maverick 17B-active/400B-total/128 experts, Behemoth still in training). This is Meta’s first MoE flagship, shipped three months after DeepSeek R1, with active-parameter counts quoted alongside the totals.
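
The active-versus-total split is the whole point of MoE, so here is a small sketch that works out the ratio for the MoE models named so far. The parameter counts are the ones quoted in the entries above; the function name is mine, and the ratio is only a rough proxy for per-token compute.

```python
def active_fraction(active_b, total_b):
    """Share of a MoE model's parameters used for each token."""
    return active_b / total_b

# (active, total) parameter counts in billions, as quoted in this timeline.
models = {
    "Mixtral 8x7B": (13, 47),
    "DeepSeek V3": (37, 671),
    "Llama 4 Scout": (17, 109),
    "Llama 4 Maverick": (17, 400),
}

fractions = {name: active_fraction(a, t) for name, (a, t) in models.items()}
for name, frac in fractions.items():
    print(f"{name}: {frac:.1%} of parameters active per token")
```

One caveat: the inactive experts still have to sit in memory, so a low active fraction cuts the compute per token without shrinking the hardware footprint by the same amount.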

June 2025

OpenAI. Sam Altman’s The Gentle Singularity blog post discloses an average ChatGPT query at “about 0.34 watt-hours” and “about 0.000085 gallons of water.” This is the first official OpenAI datapoint on the question, and a number that conveniently sits inside the range Epoch AI had already published four months earlier. No methodology released, no model breakdown, no audit.

Late 2025

Anthropic. Claude Haiku 4.5 releases and hits a meaningful fraction of Sonnet’s coding benchmark scores at a small fraction of the price-per-token (and the inference compute that goes with it).

2026 (So far)

April 2026

Anthropic. Claude Opus 4.7 releases alongside the Claude Code Skills framework and the sub-agent delegation pattern. This routes orchestration to Opus and grunt work to Haiku by default, which cuts both the bill and the compute for the median multi-step task. Marks the first time efficiency-by-design has shown up as a default product behavior.

Moonshot AI. Kimi K2.6 releases. 1T-parameter MoE with 32B active per token, priced ~88% below Opus 4.7. The second non-American lab in 18 months to undercut US frontier pricing by an order of magnitude.

DeepSeek. DeepSeek V4 releases, pushing the MoE curve further. API costs roughly an order of magnitude below US frontier prices, and the gap on benchmark capability is closer to zero than expected.

OpenAI. GPT-5.5 takes the #1 spot on the Artificial Analysis Intelligence Index with a reported 86% hallucination rate on certain factual benchmark categories. The clearest signal yet that OpenAI is optimizing for raw capability over efficiency or reliability, and it comes from the company most willing to spend an order of magnitude more compute for a single-digit benchmark bump.

May 2026

Anthropic, Cursor, OpenAI. The agentic delegation pattern (small model for routing, big model for synthesis) settles into Claude Code, Codex CLI, and Cursor as the default execution shape.

Eleven years of papers, releases, and counter-moves. You can see which labs led on architecture (Google on the Transformer, distillation, and MoE; DeepMind on Chinchilla; DeepSeek on production MoE plus reasoning), which led on product packaging (OpenAI’s mini ladder, Anthropic’s prompt caching and delegation, Apple’s on-device bet), which led on infrastructure (UC Berkeley’s vLLM), and which led on capex (Microsoft on nuclear, OpenAI on Stargate). The efficiency work is present and it’s accelerating, but the capex work is also real, and it’s accelerating faster.

So is any of this enough?

No.

The unit economics are getting better, but the environmental impact is getting worse. Every lab can show you a chart of cost-per-million-tokens going down and to the right and none of them can show you a chart of their total annual energy consumption going down (because the line is straight up).

The IEA projects global data center electricity demand to roughly double by 2030 to around 945 TWh, driven primarily by AI. Lawrence Berkeley Lab’s 2024 United States Data Center Energy Usage Report puts US data centers at 6.7-12% of national electricity consumption by 2028, up from 4.4% in 2023. Google’s emissions are up 48% since 2019, and the company itself explicitly blames AI. Microsoft’s overall emissions are up nearly 30% since 2020, with Scope 3 (the data-center construction and supply chain piece) accounting for almost all of the increase. AWS doesn’t break its numbers out clearly, which is its own kind of answer.

The labs will tell you they’re working on efficiency, and they are, no lie there. They’ll also tell you efficiency frees them to do more, and it does. But Jevons paradox is doing what Jevons paradox always does: cheaper inference means more inference, not less total power draw. The frontier models are getting bigger, the thinking budgets per query are getting longer, and the number of agents per user is climbing every quarter.
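
The Jevons arithmetic fits in a few lines. This sketch uses made-up scenario numbers purely to show the shape of the problem: total energy is per-query energy times query volume, so a 10x efficiency gain only helps if demand grows by less than 10x. The function name and the scenarios are hypothetical, not measurements.

```python
def total_energy_change(efficiency_gain, demand_growth):
    """Ratio of new total energy to old.

    efficiency_gain: how many times less energy each query uses (10 means 10x less).
    demand_growth: how many times more queries are served.
    """
    return demand_growth / efficiency_gain

# Hypothetical scenarios, all with the same 10x per-query efficiency gain.
scenarios = {
    "efficiency outpaces demand": total_energy_change(10, 5),
    "demand matches efficiency": total_energy_change(10, 10),
    "demand outpaces efficiency": total_energy_change(10, 30),
}
for name, ratio in scenarios.items():
    print(f"{name}: total energy x{ratio:.1f}")
```

Only the first scenario lowers the total. The industry’s own demand forecasts look much more like the third.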

A second pattern worth naming: every architectural efficiency win of the last three years has either originated outside the US frontier labs or arrived in their products only after a non-US competitor forced the move. Meta moved to MoE for Llama 4 three months after DeepSeek’s R1. Anthropic’s prompt caching is the most generous American exception, and even it landed years after the technique was in the literature. The American hyperscalers aren’t leading on efficiency; they’re reacting to it on a delay while spending faster.

The labs have decided on your behalf that productivity gains and scientific upside justify the energy costs.

Why this should scare the industry

There’s a tactical case for caring about this that has nothing to do with the environment (though, that should be enough).

Public opinion on AI is in a worse place than the industry pretends. The most recent Pew Research survey has 51% of US adults more concerned than excited about increased use of AI in daily life, against just 11% who are more excited than concerned. The concerned share has climbed from 37% in 2021 to a steady ~50% across 2023, 2024, and 2025. Environmental impact is one of the top concerns named alongside job displacement and disinformation. Those water and energy stories I was talking about are running on MSNBC and Fox at the same time, which is weird.

People rejected genetically modified food. People rejected fur. People rejected single-use plastic at retail scale. People rejected fast fashion brands, slowly and partially but visibly enough to reshape that industry. The pattern in every case is the same: a 5 to 15% consumer defection layered on top of a regulatory wave that the companies didn’t see coming because their internal polling told them they were fine.

If a meaningful slice of ChatGPT subscribers cancel because of energy guilt, that is a material revenue hit on a consumer side that now accounts for roughly two-thirds of OpenAI’s top line, per recent reporting on the company’s 2025 revenue mix. If a state attorney general launches an “AI emissions disclosure” suit and wins (and I’d bet these are coming), the discovery phase alone forces public methodology on energy reporting. If an EU regulator decides the AI Act’s energy disclosure requirements have actual teeth, then every frontier lab has to publish numbers they’ve spent years hiding.

The labs are betting that capability dazzles people enough to outrun the backlash and they’ve been right on this for three years. Hell, they might be right for three more. But the cultural ground is shifting underneath them, and the efficiency posture they’ve adopted (small model variants buried inside the API page, vague PR numbers, nuclear power deals announced for five years out) is calibrated to an old 2023 conversation. The 2026 conversation is louder, more specific, and more informed.

If you run an AI lab and you’re reading this:

Publish real, audited energy numbers per model, per query class.
Stop the marketing-range game on watt-hours.
Cap the headline model’s compute growth at the rate of demonstrated capability improvement, not at the rate of capex availability.
Ship the small model first, the big model second.
Make efficiency an actual product constraint instead of a PR line.

This is what needs to be done, but the current incentive structures inside a $90B/year AI company won’t reward it. I’d hate for the most important technology of my career to get killed by a backlash that the industry could have priced in years earlier.

Originally published on the Handy AI newsletter →