AI Cost Governance: A Playbook for Variable AI Bills
Two months ago, an AI coding assistant was a fixed cost: one seat, one flat monthly price, use it as hard as you liked. That deal is dead. On June 1, GitHub Copilot moved every plan to usage-based billing. The seat price held, but your requests now drain a monthly bucket of AI Credits one cent at a time. Anthropic, OpenAI, and DeepSeek, with its permanent 75% price cut, are all pushing in the same direction. Here's the part most engineering leaders are missing: this isn't a price increase. It's a category change. Your AI spend just stopped being a subscription and became a utility bill, and almost nobody has the meter, the budget, or the alerts that a utility bill demands.
I've covered the Copilot repricing and the per-call token discipline on their own. This piece is the layer above both: how a team runs usage-based AI as a governed line item, instead of discovering the number at the end of the month.
Flat pricing was a loss leader, and the trial just ended
Per-seat AI pricing was never a real price. It was a land-grab: vendors absorbing the cost of power users to win the market while the market was still up for grabs. Now that AI assistance is table stakes (a JetBrains survey found 46% of senior engineers use Claude Code daily), the subsidy has no strategic purpose left, so it's ending. The mechanics are explicit: on Copilot, one AI credit equals one cent, and credits burn on input, output, and cached tokens at rates that vary by model. Tellingly, autocomplete and next-edit suggestions stay unmetered, but agentic sessions, chat, and automated code review, the capabilities that actually got good this year, are exactly what now cost money. The developer verdict was blunt: "you will get less, but pay the same price."
That's the wrong way to read it. You don't get less — you get charged for what you use, which is fine if you're measuring it and a disaster if you're not.
The number nobody on your team can answer
Try this in your next standup: ask what a single agentic coding session costs. Not the monthly seat — one run that reads the repo, plans, edits ten files, and runs the tests. Most teams can't answer within an order of magnitude. That gap is the whole problem. A variable cost you can't forecast isn't a budget; it's a surprise with a date attached.
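Here is a back-of-envelope way to get that answer, as a sketch rather than a pricing tool. The token counts and per-million-token rates below are placeholders I made up for illustration, not any vendor's price list; the only number taken from the billing change itself is that one credit equals one cent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rates:
    """Hypothetical USD prices per one million tokens (not a real price list)."""
    input_per_m: float
    cached_per_m: float
    output_per_m: float


@dataclass(frozen=True)
class SessionUsage:
    """Token totals for one agentic session, summed across all model calls."""
    input_tokens: int
    cached_tokens: int
    output_tokens: int


def session_cost_usd(usage: SessionUsage, rates: Rates) -> float:
    """Price each token class at its own rate, then sum."""
    return (
        usage.input_tokens * rates.input_per_m
        + usage.cached_tokens * rates.cached_per_m
        + usage.output_tokens * rates.output_per_m
    ) / 1_000_000


def usd_to_credits(usd: float) -> float:
    """One AI credit equals one cent."""
    return usd * 100


# Illustrative numbers only: a run that reads the repo, edits ten files, runs tests.
rates = Rates(input_per_m=3.00, cached_per_m=0.30, output_per_m=15.00)
usage = SessionUsage(input_tokens=400_000, cached_tokens=1_500_000, output_tokens=60_000)

cost = session_cost_usd(usage, rates)
print(f"~${cost:.2f} per session, ~{usd_to_credits(cost):.0f} credits")
```

With these placeholder inputs the session comes out around $2.55. The point is not the figure but that you can produce one: swap in your real rates and the token totals from your own logs, then multiply by sessions per developer per day.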
And the surprise is rarely a dramatic spike. It's quiet. I run automated jobs that call LLM APIs on a schedule, and not long ago one of them drained a balance to zero between two runs — no spike, no warning, just a job that worked yesterday and returned a billing error today. The only reason I caught it fast was a logged error, because nothing was wired to alert on the balance itself. That is the characteristic failure mode of usage-based AI: not a frightening invoice, but a silent zero that takes a service down, or a slow drift upward that nobody owns until finance asks about it.
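The alert that was missing in that story can be sketched as a runway check: take the current balance and recent daily spend, estimate how many days are left, and complain before the job fails. Balance and spend are passed in as plain numbers because how you fetch them depends on your provider; the function names and the three-day threshold are my assumptions.

```python
from statistics import mean
from typing import Optional, Sequence


def days_of_runway(balance_usd: float, daily_spend_usd: Sequence[float]) -> Optional[float]:
    """Estimate days until the balance hits zero at the recent average burn.

    Returns None when there is no spend history or nothing to extrapolate from.
    """
    if not daily_spend_usd:
        return None
    burn = mean(daily_spend_usd)
    if burn <= 0:
        return None
    return max(balance_usd, 0.0) / burn


def balance_alert(
    balance_usd: float,
    daily_spend_usd: Sequence[float],
    min_runway_days: float = 3.0,  # hypothetical threshold; tune to your run schedule
) -> Optional[str]:
    """Return an alert message if the balance will not last, else None."""
    if balance_usd <= 0:
        return "Balance is zero: scheduled jobs will fail with billing errors."
    runway = days_of_runway(balance_usd, daily_spend_usd)
    if runway is not None and runway < min_runway_days:
        return f"Balance ${balance_usd:.2f} covers ~{runway:.1f} days at current burn."
    return None
```

Run something like this ahead of each scheduled job and post any non-empty result to a channel a human reads. It catches both failure modes: the silent zero and the slow drift that shortens the runway week over week.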
Govern it like infrastructure, because that's what it is
Cloud taught us this lesson a decade ago. The instinct with elastic, usage-priced resources is to optimize the unit cost first. Wrong order. You instrument, you budget, you cap, then you optimize. Here's the playbook I'd run for any team that just got moved onto usage-based AI.
1. Instrument before you optimize. You cannot manage a variable cost you can't see. Copilot now exposes per-credit usage; every API provider exposes token usage. Put it on a dashboard someone actually looks at weekly. Optimizing before you measure is just guessing with extra steps.
2. Set auto-reload and a hard alert. Both, always. Auto-reload keeps the lights on. The alert tells you that burn jumped while there is still time to find out why. Auto-reload without an alert is a runaway bill you discover from your card statement; an alert without auto-reload is an outage. You need the pair, not either one.
3. Cap the blast radius. Per-developer and per-project limits stop one looping agent or one over-eager automation from eating the month. Treat an uncapped API key the way you'd treat an uncapped cloud account — as a liability, not a convenience.
4. Right-size the model per task. This is the single biggest line-item waste I see: teams default everything to the most expensive frontier model "to be safe." Match the tier to the job.
| Task | Tier | Why |
|---|---|---|
| Autocomplete, next-edit | Cheapest / unmetered | High volume, low stakes |
| File-level chat, simple Q&A | Mid (Sonnet-class) | Good enough at a fraction of frontier cost |
| Multi-file agentic refactor | Frontier, high effort | Getting it right once beats three cheap wrong tries |
| Bulk / nightly batch work | Batch API, off-peak | 50% discounts exist — use them for anything non-interactive |
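The table above can be turned into a small routing function, so the tier is a decision made once in config rather than a habit repeated in every call site. This is a sketch: the task names, tier labels, and model identifiers are all placeholders, not real model names.

```python
# Tier per task, mirroring the table above. Task names are hypothetical.
TIER_FOR_TASK = {
    "autocomplete": "cheap",
    "file_chat": "mid",
    "agentic_refactor": "frontier",
    "nightly_batch": "batch",
}

# Placeholder model identifiers: swap in whatever your providers actually offer.
MODEL_FOR_TIER = {
    "cheap": "example-small",
    "mid": "example-mid",
    "frontier": "example-frontier",
    "batch": "example-mid-batch",
}


def pick_model(task: str, default_tier: str = "mid") -> str:
    """Resolve a task to a model name.

    Unknown tasks fall back to the mid tier, not to frontier, so that
    "to be safe" never silently means "most expensive".
    """
    tier = TIER_FOR_TASK.get(task, default_tier)
    return MODEL_FOR_TIER[tier]


print(pick_model("agentic_refactor"))
print(pick_model("some_new_workflow"))
```

Keeping both mappings as data means a downgrade is a one-line change you can review, and the fallback makes the cheap choice the default instead of the exception.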
5. Keep a provider exit. DeepSeek's permanent cut and five new models in a single month mean you have leverage — but only if your code isn't welded to one vendor's SDK. An abstraction layer that lets you swap the model behind a call is now a cost control, not just architectural hygiene. The day a competitor undercuts your provider by 75%, you want that to be a config change, not a quarter of refactoring.
6. Move batch work off your interactive budget. Anything asynchronous (content generation, nightly summaries, scheduled analysis) should not draw from the same per-token bucket as your developers' live sessions. Run it on a batch API, off-peak, or even through the automation that a flat-rate subscription already includes. Mixing batch and interactive spend in one meter is how a background job silently starves the tools your team actually works in.
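Steps 3 and 6 share one mechanism: separate meters, each with a ceiling. A minimal in-process sketch follows, assuming you can estimate a call's cost before making it. The pool names, actor names, and cap values are hypothetical, and a real setup would also lean on whatever key-level or project-level limits your provider offers.

```python
from __future__ import annotations

from collections import defaultdict


class BudgetExceeded(Exception):
    """Raised when a charge would push an actor or a pool past its cap."""


class BudgetLedger:
    """Tracks spend in separate pools, with a cap per pool and per actor."""

    def __init__(self, pool_caps_usd: dict[str, float], actor_cap_usd: float) -> None:
        self.pool_caps_usd = dict(pool_caps_usd)
        self.actor_cap_usd = actor_cap_usd
        self.pool_spend: dict[str, float] = defaultdict(float)
        self.actor_spend: dict[tuple[str, str], float] = defaultdict(float)

    def charge(self, pool: str, actor: str, cost_usd: float) -> None:
        """Record a charge, or refuse it before any money is spent."""
        if pool not in self.pool_caps_usd:
            raise KeyError(f"unknown pool: {pool}")
        if self.pool_spend[pool] + cost_usd > self.pool_caps_usd[pool]:
            raise BudgetExceeded(f"pool '{pool}' would exceed its cap")
        if self.actor_spend[(pool, actor)] + cost_usd > self.actor_cap_usd:
            raise BudgetExceeded(f"'{actor}' would exceed the per-actor cap in '{pool}'")
        self.pool_spend[pool] += cost_usd
        self.actor_spend[(pool, actor)] += cost_usd

    def remaining(self, pool: str) -> float:
        """Budget left in a pool."""
        return self.pool_caps_usd[pool] - self.pool_spend[pool]


# Batch jobs and live sessions draw from different pools.
ledger = BudgetLedger({"interactive": 500.0, "batch": 100.0}, actor_cap_usd=50.0)
ledger.charge("batch", "nightly-summaries", 40.0)
print(ledger.remaining("batch"), ledger.remaining("interactive"))
```

A looping nightly job can now exhaust its own allowance and fail loudly with `BudgetExceeded`, while the interactive pool your developers depend on is untouched.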
What to actually do this week
- Pull last month's token usage for every AI tool you pay for. If you can't, that's finding number one — fix the visibility before anything else.
- Turn on auto-reload with a sane cap, and wire a balance or usage alert to a channel a human reads daily.
- Set per-seat or per-project usage limits so no single actor can drain the shared pool.
- Pick one default-model downgrade — one workflow currently on a frontier model that a mid-tier model would handle fine — and ship it.
- Audit your async jobs and move at least one off interactive credits onto a cheaper path.
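For the first item on that list, the visibility pass can start as a few lines over whatever usage export your tools give you. The sketch below assumes a CSV with `tool`, `project`, and `cost_usd` columns; that layout is my invention, so map the column names to your provider's actual export.

```python
from __future__ import annotations

import csv
import io
from collections import defaultdict
from typing import Iterable, TextIO


def load_usage(handle: TextIO) -> list[dict[str, str]]:
    """Read a usage export with a header row into a list of dicts."""
    return list(csv.DictReader(handle))


def spend_by(rows: Iterable[dict[str, str]], key: str) -> dict[str, float]:
    """Sum the cost_usd column grouped by the given column."""
    totals: dict[str, float] = defaultdict(float)
    for row in rows:
        totals[row[key]] += float(row["cost_usd"])
    # Largest spender first, so the top of the report is where to look.
    return dict(sorted(totals.items(), key=lambda kv: kv[1], reverse=True))


# Stand-in for a real export; in practice use open("usage.csv", newline="").
sample = io.StringIO(
    "tool,project,cost_usd\n"
    "copilot,checkout,41.20\n"
    "api,nightly-summaries,118.75\n"
    "copilot,search,12.05\n"
)
rows = load_usage(sample)
for name, total in spend_by(rows, "tool").items():
    print(f"{name:<10} ${total:>8.2f}")
```

Grouping by `project` instead of `tool` is the same call with a different key, which is usually the view that shows which async job deserves to move first.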
The vendors didn't raise prices this year. They stopped hiding them. Usage-based AI quietly rewards the teams that treat every token as a line item and taxes the ones that don't — and the gap between those two teams compounds every single month. Start metering now, while the bill is still small enough to be a lesson instead of an incident.