You're tokenmaxxing yourself broke

When you send ChatGPT or Claude a message and it sends you one back, your words are metered in “tokens.” The term “tokenmaxxing” was coined recently to refer to the practice of blasting through as many tokens as possible like a teenager with their first credit card.

I’ve already gone after the supply side of this. I wrote about models that torch compute to sound smart, and about what the buildout costs the towns it lands in. This one’s about the demand side. About you. Because the single biggest lever on what these systems cost, in dollars and in watts, isn’t the data center. It’s the prompt you just sent (and the seventeen you didn’t need to).

So let’s talk about how to use fewer tokens without using less AI. There’s real money and real environmental cost on the table, and the techniques to claw both back are (mostly) free.

The part where I make you feel bad

Here’s the uncomfortable physics. Every token a model reads or writes is arithmetic running on a chip in a building that drinks power and water to stay cool.

The per-query numbers sound tiny until you multiply them. OpenAI pegs an average ChatGPT query at around 0.34 watt-hours. Google says the median Gemini text prompt lands near 0.24 watt-hours. Independent measurement put a short GPT-4o call at about 0.42 watt-hours, roughly 40% more than a Google search. On its own, nothing. But at the scale of hundreds of heavy reasoning requests a day, the carbon emissions from our AI use begin to look similar to those from our daily commutes. (And with Anthropic’s release of Fable 5 yesterday, tokens-per-query for reasoning models just keep going up and up.)

The good news is that token efficiency has improved about 120x since the GPT-3 era, so the floor keeps dropping. The bad news is that the Jevons paradox is in full effect: cheaper tokens drive AI demand up fast enough to wipe out these efficiency gains nearly as soon as they take hold.

Every token you don’t send is the cleanest token there is. Here’s how to send fewer.

🐯 Before we get started: right now a data center is being planned for construction right next to the Nashville Zoo. While water and energy concerns around data centers are overblown, their noise pollution and growing CO2 emissions directly affect our world and its animals, making construction near a zoo a nonstarter.

I urge you to sign the petition to prevent the Nashville Zoo build, and keep an eye out for nonsensical construction plans in your local community (data center or otherwise).

Practical ways to use fewer tokens

1. Give your agent tools that don’t cost any tokens

When you ask an agent to do a repetitive task, it reasons through the whole thing every single time. Rename 200 files, reconcile a spreadsheet, pull the same five fields off every PDF in a folder: a naive setup re-derives the approach on every item, burning thinking tokens to rediscover a process it already figured out on item one. Prompt caching will save you some tokens on this kind of work, but there’s an even better way to cut usage to nearly zero:

Have your agent write a script the first time, then run the script forever after.

The model reasons once, captures the logic in code, and the code executes for free. Anthropic measured exactly this with their code execution work. A workflow that loaded everything into the model’s context ate 150,000 tokens. The same workflow, restructured so the agent wrote and ran code instead of reasoning over raw data, used 2,000. That’s a 98.7% reduction for the identical result.

This is the whole idea behind Skills: folders of scripts and instructions an agent loads on demand instead of being re-taught the same procedure across every conversation. It’s also why the durable pattern in 2026 is semi-autonomous, not fully agentic. A cron job fires, deterministic scripts do the heavy lifting, and the model only steps in to judge or summarize. One write-up clocked the difference at 50x to 500x on cost: a 24/7 assistant left running on a frontier model can cost $4 to $12 a day, while the same job as a scheduled small-model call with hard caps comes in under two cents.

The product manager in me loves this because it’s just good engineering. You don’t pay a senior engineer to copy-paste the same fix 200 times. You have them write the fix once and automate it. Treat your agent the same way. When you catch yourself prompting the same shape of task twice, stop and say “write me a script that does this, then run it.” Done.
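To make that concrete, here is a minimal sketch of the kind of script an agent might write once for the "rename 200 files" case. The function names, the numbering scheme, and the prefix are all hypothetical choices for illustration; your agent's version will look different. The point is that once it exists, every later run costs zero tokens.

```python
from pathlib import Path


def plan_renames(folder: Path, prefix: str) -> list[tuple[Path, Path]]:
    """Pair every file in `folder` (sorted by name) with a numbered new name."""
    files = sorted(p for p in folder.iterdir() if p.is_file())
    return [
        (old, old.with_name(f"{prefix}_{i:03d}{old.suffix}"))
        for i, old in enumerate(files, start=1)
    ]


def apply_renames(plan: list[tuple[Path, Path]], dry_run: bool = True) -> int:
    """Rename files per the plan and return how many were renamed.

    With dry_run=True nothing is touched; the plan is only printed.
    """
    # Check every target first, so a name collision cannot leave the
    # folder half-renamed.
    for old, new in plan:
        if new != old and new.exists():
            raise FileExistsError(f"{new} already exists")

    renamed = 0
    for old, new in plan:
        if new == old:
            continue  # already has the right name, e.g. on a second run
        if dry_run:
            print(f"{old.name} -> {new.name}")
        else:
            old.rename(new)
            renamed += 1
    return renamed


# Example usage (the folder path is a placeholder):
# plan = plan_renames(Path("scans"), "scan")
# apply_renames(plan, dry_run=True)   # preview first
# apply_renames(plan, dry_run=False)  # then do it for real
```

The dry-run default is a deliberate choice: you get to review what the script will do before it touches anything, which is the one moment where a human (or a model) still needs to look.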

2. Quit asking for dissertations when you need a sentence

Reasoning models are extraordinary at hard problems and ridiculous at easy ones. Handed something trivial, a large reasoning model will still pile on thinking tokens, second-guessing itself with “wait” and “hmm” and backtracking through reflection it never needed. Researchers call it the overthinking trap, and the savings from cutting it are not small. Suppressing those self-doubt tokens shortens reasoning chains by 27% to 51% with no loss in answer quality. Batch prompting similar questions together cut reasoning tokens 76% on average across thirteen benchmarks. Letting the model skip reflection when it’s already confident saves another 18% to 42%.

Match the reasoning to the task.

Most modern models let you set a reasoning effort or thinking budget. Classification, extraction, formatting, summarizing, routine code edits: none of these needs extended chain-of-thought. Turn it down. Save the deep reasoning for the genuinely hard stuff, where it earns its keep. Stop bringing Opus to reformat a CSV.
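If you call models from code, one way to enforce this is to decide the effort level before the request goes out. The sketch below is an assumption-heavy illustration: the task labels, the "low" and "high" levels, and the token budgets are all made up here, and the actual parameter name and allowed values depend on your provider.

```python
# Task types the section above lists as not needing extended reasoning.
ROUTINE_TASKS = {
    "classification",
    "extraction",
    "formatting",
    "summarization",
    "routine_code_edit",
}

# Hypothetical thinking budgets, in tokens. Tune these for your provider.
THINKING_BUDGETS = {"low": 0, "high": 8_000}


def pick_effort(task_type: str) -> str:
    """Return "low" for routine work and "high" for everything else."""
    return "low" if task_type.lower() in ROUTINE_TASKS else "high"


def thinking_budget(task_type: str) -> int:
    """Look up the (hypothetical) thinking budget for a task type."""
    return THINKING_BUDGETS[pick_effort(task_type)]
```

Defaulting unknown task types to "high" is the cautious choice. If your workload is mostly routine, flipping that default is where the bigger savings are.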

3. The cheap wins everyone skips

Beyond those two big ones, there’s a pile of low-effort token savings most people never turn on.

Prompt caching. As mentioned above, prompt caching lets the provider store the words or documents you’ve already sent and reuse them. Providers typically charge 50% to 90% less for cached tokens. ProjectDiscovery cut their LLM bill 59% doing nothing but this.
Mind your output. Output tokens cost 3 to 8 times more than input tokens at every major provider. A rambling answer with heavy reasoning is the expensive kind of token.
Compress the conversation. Long chat threads replay the entire history on every turn. Summarizing older exchanges instead of resending them verbatim trims 20% to 40% off a chatbot’s token use. Most agent frameworks will do this for you if you ask.
Route by complexity. A model router inspects each request and sends the easy ones to a cheap model and the hard ones to the expensive one. You get frontier quality only where it matters and pay small-model prices everywhere else. Most harnesses (Cursor, Claude Code) have a version of routing built in, but it can’t hurt to double up.
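Two of these are easy to reason about with a few lines of code. The sketch below estimates what a request costs with and without cached input, and trims a long chat history down to a summary plus the most recent turns. The prices are hypothetical placeholders (picked so that output costs 5x input and cached input is 90% off, both within the ranges above), and `summarize` stands in for whatever cheap summarizer you would actually use.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Pricing:
    """Hypothetical prices in dollars per million tokens."""
    input_per_m: float
    output_per_m: float
    cached_input_per_m: float


def request_cost(p: Pricing, fresh_input: int, cached_input: int, output: int) -> float:
    """Dollar cost of one request, given token counts of each kind."""
    total = (
        fresh_input * p.input_per_m
        + cached_input * p.cached_input_per_m
        + output * p.output_per_m
    )
    return total / 1_000_000


def compress_history(
    messages: list[str],
    keep_last: int,
    summarize: Callable[[list[str]], str],
) -> list[str]:
    """Replace everything but the last `keep_last` messages with one summary."""
    if len(messages) <= keep_last:
        return list(messages)
    cut = len(messages) - keep_last
    older, recent = messages[:cut], messages[cut:]
    return [summarize(older)] + recent


# Placeholder rate card, not any provider's real prices.
EXAMPLE_PRICING = Pricing(input_per_m=3.0, output_per_m=15.0, cached_input_per_m=0.30)
```

With that placeholder rate card, a request with 10,000 input tokens and 500 output tokens costs about $0.0375 uncached. If 9,000 of those input tokens hit the cache, it drops to about $0.0132, and the 500 output tokens become the largest line item. That is the "mind your output" point in numbers.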

Insourcing: keep some tasks for your own brain

I have a rule I call insourcing: the practice of keeping a deliberate set of tasks you refuse to hand to a model, specifically so the part of your brain that does them doesn’t go soft. The first thing on my list is personal communication. I don’t let an agent write the text to my grandmother, the note to a friend going through it, or the message to someone I actually care about. Not because the model would do it badly, but because the doing is the point.

Beyond this obvious line, I ensure that a not-insignificant portion of my daily work is kept in-house. There is an ever-present itch for me to reach for Claude to build out slides or format a spreadsheet; Claude Cowork has progressed to a point where it can handle these tasks with relative ease. Resist! Make sure you’re putting in the work on these when you can. The initial act of “creation” in work is important for flexing your brain muscle.

An MIT Media Lab team ran a study last year they called “Your Brain on ChatGPT”, wiring up 54 people writing essays with AI, with a search engine, or with nothing but their own head. The AI group showed the weakest brain connectivity and the thinnest memory of what they’d just written. Even after they put the AI away, their brain activity stayed sluggish. The researchers called it “cognitive debt.” You outsource the thinking and your brain doesn’t eagerly grab the wheel back when you ask it to.

This is the not-so-quiet cost of tokenmaxxing your whole life. A kind of mental muscle atrophy sets in with every email, every decision, and every paragraph your agents generate for you.

Insourcing is the cheapest and healthiest token-reduction strategy there is. Keep a few things that stay slow and human, and you’ll spend fewer tokens and keep a sharper head doing it.

The on-device future

Every technique above is aimed at reducing data center usage by the beefy cloud models. But what if these models ran locally, on your phone or laptop, and skipped the trip to the data center altogether?

Apple just bet its whole software stack on that question. The centerpiece of WWDC this week was AFM 3 Core Advanced, a 20-billion-parameter sparse on-device model that activates only 1 to 4 billion parameters per prompt. That pruning trick is what lets it live on an iPhone 17 Pro instead of in a data center. The Foundation Models framework hands the model to any app on the device: no API key, no network call, no token meter. It now accepts images, so receipts, screenshots, and photos get parsed on the Neural Engine without a byte leaving your hand. Apple is even open-sourcing the framework this summer. Apple is seemingly trying to make up for its historically horrible Siri launches by making premier local LLMs a default OS capability, though the hardware floor is real: 12GB of RAM and recent silicon means older iPhones stay in the cloud era a while longer.

Google is racing down the same curve in the open-weights lane. Gemma 4 12B handles text, image, audio, and video, beats last generation’s Gemma 3 27B, and fits on a 16GB laptop. One Ollama command and it runs fully offline with no strings on commercial use. Its edge sibling, Gemma 4 E2B, runs on your iPhone today through an app like Locally AI.
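Once a model is running under Ollama, talking to it from code is a plain HTTP call to your own machine. Here is a minimal sketch using only the Python standard library and Ollama's local `/api/generate` endpoint on its default port. The helper names are mine, and the model tag is left as a parameter because it depends on what you have pulled locally.

```python
import json
import urllib.request

# Ollama's default local endpoint; nothing here leaves your machine.
OLLAMA_URL = "http://localhost:11434/api/generate"


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for a locally pulled model."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    return urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask_local(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt to the local Ollama server and return its reply text."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]


# Example usage (the model tag is a placeholder for whatever you have pulled):
# print(ask_local("your-local-model", "Summarize this receipt: ..."))
```

There is no API key in that code because there is nothing to bill.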

No wifi or signal required. No data leaving your device. No API bill. No query metered against a reservoir somewhere in Virginia.

This is the endgame for both problems at once. On-device models kill the per-token cost, the environmental draw, and the privacy tax. Notice that even Apple’s architecture concedes the split: the heavyweight, Gemini-assisted model behind Siri AI routes through Private Cloud Compute, while summarization, extraction, and classification stay on the chip. The frontier clouds keep the genuinely hard, novel reasoning that earns the watts. The thousand small calls that make up most of what people actually do with AI are moving into your pocket.

But the cost curve points home.

Originally published on the Handy AI newsletter →