Tokenmaxxing Is Over: Spend on AI Like a Cost

For a few giddy months, the instruction from the corner office was simple: use AI for everything, and use more of it. Some leaders branded it openly — “tokenmaxxing” — framing 2026 as the year teams would push generative AI as far as it would go, no questions asked. Usage was a virtue. Internal leaderboards ranked who prompted the most. Then the invoices landed, and the religion met the ledger.

According to reporting from TechCrunch in June 2026, the reckoning has been concrete. One large company reportedly blew through its entire annual AI budget in a matter of months. Some firms have since cut AI-assistant licenses for parts of their organisations. And at least one quietly killed the internal usage leaderboard it had built to celebrate consumption. The mood has shifted from evangelism to arithmetic. This is a guide to spending on AI like an operator — treating it as a cost line you manage, not a faith you profess.

The hype and the bill

The tokenmaxxing mandate made a kind of intuitive sense at the time. If AI makes people faster, then more AI should make them faster still. Leaders worried more about being left behind than about overspending, so they removed friction: unlimited seats, generous limits, and gamified dashboards that rewarded raw volume. The implicit message was that any token not spent was an opportunity missed.

The problem is that AI cost scales with usage in a way most software spend never did. Traditional SaaS is priced per seat, and a seat costs the same whether you log in once a month or all day. Generative AI is metered by consumption, and consumption turned out to be far more elastic, and far more expensive, than budgets assumed. The result, per TechCrunch’s June 2026 reporting, is annual allocations exhausted in a fraction of the year, followed by the unglamorous corrections: pulling licenses from teams that did not demonstrably need them, and retiring the leaderboards that had encouraged the very behaviour now bleeding the budget.

None of this is happening in a vacuum. As dentro.de/ai noted in June 2026, the vendors themselves are moving the industry toward usage-based billing — GitHub Copilot shifted to a metered model from 1 June 2026 — precisely because inference costs make all-you-can-eat pricing unsustainable for the providers too. When the people selling the tokens stop offering a flat rate, it is a strong signal that buyers should stop budgeting as if one existed.

Why usage ran away

Three forces turned enthusiastic adoption into runaway spend, and most organisations saw none of them coming.

The first is technical. The frontier moved from single-shot chat to agentic and reasoning workflows, and those multiply tokens dramatically. A reasoning model that “thinks” before answering can consume many times the tokens of a direct response. An agent that plans, calls tools, reads results, and re-plans can fan out into dozens of model calls for a single user request. What feels like one task to an employee can be a long, expensive chain under the hood — and that chain runs every time, whether the question was trivial or hard.
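To make the multiplication concrete, here is a back-of-the-envelope sketch. Every number in it is a hypothetical assumption chosen for illustration, not a measured figure, and `request_tokens` is a made-up helper name.

```python
# Illustrative arithmetic only: all token counts below are hypothetical.

def request_tokens(calls: int, input_tokens_per_call: int, output_tokens_per_call: int) -> int:
    """Total tokens consumed by one user request that triggers `calls` model calls."""
    return calls * (input_tokens_per_call + output_tokens_per_call)

# A direct, single-shot chat answer.
single_shot = request_tokens(calls=1, input_tokens_per_call=500, output_tokens_per_call=300)

# An agent that plans, calls tools, and re-plans, re-reading a growing context each time.
agentic = request_tokens(calls=24, input_tokens_per_call=2_000, output_tokens_per_call=400)

multiplier = agentic / single_shot
print(f"single-shot: {single_shot} tokens, agentic: {agentic} tokens, {multiplier:.0f}x")
```

Under these assumed figures the same "one task" costs 72 times the tokens. The real ratio depends entirely on the workflow, which is the point: it has to be measured.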

The second is visibility, or the lack of it. Most companies bought AI before they could see what it cost per workflow. Spend showed up as one fat bill from a provider, with no breakdown of which team, feature, or process drove it. You cannot manage what you cannot attribute, and for months teams simply could not tell whether the money went to high-value automation or to someone asking a frontier model to reformat a spreadsheet.
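Attribution can start very simply: tag every model call with who made it and why, then roll the costs up. The sketch below assumes you can record a cost per call; `CallRecord`, `spend_by`, and the team and workflow names are hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, Iterable

@dataclass(frozen=True)
class CallRecord:
    team: str       # who triggered the call
    workflow: str   # which feature or process it served
    cost: float     # cost of the call, in the invoice currency

def spend_by(records: Iterable[CallRecord], key: str) -> Dict[str, float]:
    """Sum call costs grouped by one attribute, e.g. 'team' or 'workflow'."""
    totals: Dict[str, float] = defaultdict(float)
    for record in records:
        totals[getattr(record, key)] += record.cost
    return dict(totals)

# Hypothetical ledger entries.
ledger = [
    CallRecord(team="support", workflow="ticket-triage", cost=1.5),
    CallRecord(team="support", workflow="reply-drafting", cost=2.5),
    CallRecord(team="finance", workflow="spreadsheet-reformat", cost=4.0),
]
print(spend_by(ledger, "team"))
print(spend_by(ledger, "workflow"))
```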

The third is incentives. Leaderboards and usage targets rewarded more, not better. When the metric is tokens consumed or prompts sent, people optimise for activity rather than outcomes. Nobody was incentivised to find the cheapest path to a result; they were incentivised to look busy with AI. Combine multiplying token chains, zero cost attribution, and incentives pointed the wrong way, and overspend was not a risk — it was the design.

Spending on AI like an operator

The fix is not to ban AI or freeze budgets in a panic. It is to apply the same discipline you would to cloud compute or ad spend: match the resource to the job, eliminate waste, and measure what actually matters.

Start with model routing. Not every task needs the most capable, most expensive model. Classification, extraction, summarisation, and formatting can run on smaller, cheaper models — often at a fraction of the cost — while you reserve frontier reasoning models for genuinely hard problems. A routing layer that sends cheap tasks to cheap models, and escalates only when needed, is the single highest-leverage change most teams can make.
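A routing layer does not have to be elaborate to be useful. The sketch below is one minimal shape it could take, assuming each request arrives labelled with a task type; the model names `small-model` and `frontier-model` are placeholders, not real products.

```python
# Task types the article lists as suitable for smaller, cheaper models.
CHEAP_TASKS = {"classification", "extraction", "summarisation", "formatting"}

def route(task_type: str, escalate: bool = False) -> str:
    """Pick a model tier for a task.

    `escalate` lets a caller force the frontier model, for example after
    the small model's answer failed a quality check.
    """
    if task_type in CHEAP_TASKS and not escalate:
        return "small-model"      # placeholder name for a cheap model
    return "frontier-model"       # placeholder name for an expensive model

print(route("extraction"))                  # cheap path
print(route("extraction", escalate=True))   # escalated after a failed check
print(route("multi-step-planning"))         # unknown or hard task defaults to frontier
```

Defaulting unknown task types to the frontier model is a conservative assumption; a cost-first team might prefer the opposite default and escalate on failure.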

Then attack the call volume itself:

Cache aggressively. Many requests are near-duplicates. Caching responses for repeated or similar inputs avoids paying twice for the same answer.
Batch where latency allows. Non-urgent jobs — overnight enrichment, bulk classification — can be batched, and many providers price batch processing lower than real-time calls.
Gate expensive calls. Put a cheap check before an expensive one. Use a small model or a simple rule to decide whether the costly reasoning step is warranted at all, rather than defaulting to it.
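Caching and gating can sit in the same thin wrapper. The sketch below assumes you supply your own `cheap_call`, `expensive_call`, and `needs_expensive` functions (all hypothetical names); it only catches duplicates that match after whitespace and case normalisation, so matching merely similar inputs would need something more sophisticated that is not shown here.

```python
import hashlib
from typing import Callable, Dict

class GatedCachedClient:
    """Wraps a cheap and an expensive model call behind a cache and a gate."""

    def __init__(
        self,
        cheap_call: Callable[[str], str],
        expensive_call: Callable[[str], str],
        needs_expensive: Callable[[str], bool],
    ) -> None:
        self._cheap_call = cheap_call
        self._expensive_call = expensive_call
        self._needs_expensive = needs_expensive   # the cheap check run first
        self._cache: Dict[str, str] = {}

    @staticmethod
    def _key(prompt: str) -> str:
        # Normalise case and whitespace so trivially different prompts share a key.
        normalised = " ".join(prompt.lower().split())
        return hashlib.sha256(normalised.encode("utf-8")).hexdigest()

    def answer(self, prompt: str) -> str:
        key = self._key(prompt)
        if key in self._cache:
            return self._cache[key]               # cache hit: no model call at all
        if self._needs_expensive(prompt):
            result = self._expensive_call(prompt)
        else:
            result = self._cheap_call(prompt)
        self._cache[key] = result
        return result
```

In practice the gate might be a small classifier or a rule on prompt length or task type; the design choice that matters is that the expensive path is opt-in rather than the default.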

Finally, change the metric. Stop measuring seats and tokens; measure cost-per-outcome. What does it cost to resolve a support ticket, qualify a lead, or ship a code review with AI in the loop? When you can attribute spend to outcomes, you can make real decisions — doubling down where the unit economics work, and cutting where they do not. That is also the only honest way to know whether AI is paying for itself, which is the question the leaderboards were never built to answer.
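The metric itself is simple division; the hard part is having attributed spend and counted outcomes to divide. A minimal sketch, with hypothetical workflow names and figures:

```python
from typing import Optional

def cost_per_outcome(total_spend: float, outcomes: int) -> Optional[float]:
    """AI spend divided by completed outcomes; None when nothing was completed."""
    if outcomes == 0:
        return None
    return total_spend / outcomes

# Hypothetical monthly figures: (attributed AI spend, outcomes delivered).
workflows = {
    "support-ticket-resolved": (1200.0, 400),
    "lead-qualified": (900.0, 60),
    "experimental-agent": (500.0, 0),
}

for name, (spend, outcomes) in workflows.items():
    print(name, cost_per_outcome(spend, outcomes))
```

A workflow that spends money and returns `None` is the clearest candidate for review.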

The India read

For Indian teams, this correction is less a setback than an opening. Cost discipline has always been a structural strength here, and the global swing from tokenmaxxing to unit economics rewards exactly the kind of frugal engineering that lean startups and services firms have practised for years. While well-funded Western competitors learn to count tokens for the first time, teams that never had budget to burn are already wired for efficiency.

Open-weight and self-hosted models sharpen that edge. For high-volume, predictable workloads — classification, extraction, internal tooling — running an open-weight model on your own infrastructure can convert an unpredictable per-token bill into a fixed, capacity-based cost you control. It demands engineering investment, but for steady, large-scale tasks the economics often favour ownership over metered API calls. The pragmatic pattern is hybrid: open-weight models for the bulk, frontier APIs for the genuinely hard edge cases.
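Whether ownership beats metered calls comes down to a break-even volume. The sketch below uses a deliberately simplified model, assuming a single fixed monthly cost for self-hosting and a flat API price per million tokens; both figures are hypothetical and ignore engineering time, utilisation, and quality differences.

```python
def break_even_tokens(fixed_monthly_cost: float, api_price_per_million_tokens: float) -> float:
    """Monthly token volume at which a fixed self-hosting cost equals the API bill."""
    return fixed_monthly_cost / api_price_per_million_tokens * 1_000_000

def cheaper_option(monthly_tokens: float, fixed_monthly_cost: float,
                   api_price_per_million_tokens: float) -> str:
    """Compare the two monthly bills at a given volume."""
    api_bill = monthly_tokens / 1_000_000 * api_price_per_million_tokens
    return "self-hosted" if fixed_monthly_cost < api_bill else "api"

# Hypothetical inputs: 4000 per month to run your own capacity, 2.0 per million tokens via API.
print(break_even_tokens(4000.0, 2.0))                 # tokens per month
print(cheaper_option(5_000_000_000, 4000.0, 2.0))     # high, steady volume
print(cheaper_option(200_000_000, 4000.0, 2.0))       # low volume
```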

Whatever the mix, build guardrails before you scale, not after. The companies caught out this year lacked circuit-breakers: hard budget caps per team and per workflow, automatic throttling when spend crosses a threshold, alerts that fire in hours rather than at month-end, and per-feature attribution from day one. With products like GitHub Copilot moving to metered billing, as dentro.de/ai reported in June 2026, assume every token has a price tag and architect accordingly. The teams that win the next phase of AI will not be the ones that used it most. They will be the ones that knew, to the rupee, what each use was worth.
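A budget circuit-breaker of the kind described above can be sketched in a few lines. This assumes spend is recorded per call against a per-team or per-workflow cap; the `BudgetGuard` class, its 80% throttle threshold, and the status strings are hypothetical choices.

```python
class BudgetGuard:
    """Tracks spend against a hard cap and reports when to throttle or block."""

    def __init__(self, cap: float, throttle_at: float = 0.8) -> None:
        self.cap = cap                    # hard budget cap for the period
        self.throttle_at = throttle_at    # fraction of the cap that triggers throttling
        self.spent = 0.0

    def status(self) -> str:
        if self.spent >= self.cap:
            return "blocked"              # hard cap reached: refuse further calls
        if self.spent >= self.throttle_at * self.cap:
            return "throttled"            # slow down and alert someone now
        return "ok"

    def record(self, cost: float) -> str:
        """Add the cost of a completed call and return the new status."""
        self.spent += cost
        return self.status()

guard = BudgetGuard(cap=100.0)
print(guard.record(50.0))   # well under the cap
print(guard.record(30.0))   # crosses the throttle threshold
print(guard.record(20.0))   # reaches the cap
```

A caller would check `status()` before each model call and hook the "throttled" transition up to an alert, so the warning arrives within hours rather than with the invoice.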

Aarav Malhotra

The Signal — one email, every Tuesday.