Stop Paying for Tokens You Don't Need: A Practical Guide to LLM Cost Control
Most teams I talk to are hemorrhaging money on LLM inference without realizing it. Not because the models are expensive — they're getting cheaper every quarter — but because the usage patterns are wasteful by default. Fat system prompts, redundant context, wrong model for the job. Fix those three things and the gains are large.
Here's how I think about it in production.
Prompt Surgery First
Your system prompt is the most expensive line of code you own. Every token in it is paid for on every single call. I've seen teams ship 2,000-token system prompts that contain half a product spec, a style guide, and three paragraphs of "be helpful and professional" filler. Strip it.
The discipline is brutal and worth it:
- Remove any instruction that isn't load-bearing — if removing it doesn't change output quality in eval, it doesn't belong there.
- Move static background knowledge (FAQs, product docs) into retrieval. Don't stuff it into the prompt hoping the model remembers.
- Use structured formats (JSON, XML tags) to compress intent.
<task>classify</task>is cheaper and more reliable than a paragraph explaining the task.
A 40% reduction in prompt tokens is not unusual when you approach this with an eval harness rather than intuition.
Model Routing: Use the Cheapest Model That Can Do the Job
This is the highest-leverage architectural decision you can make. Not every call needs GPT-4o or Claude 3.5 Sonnet. A lot of production workloads decompose cleanly into:
| Task type | Right model tier |
|---|---|
| Classification, intent detection | Small fine-tuned model or GPT-4o-mini |
| Structured extraction from short text | GPT-4o-mini, Haiku |
| Multi-hop reasoning, complex generation | Frontier model |
| Summarization of long docs | Mid-tier with large context window |
At Etera AI, the multi-agent travel platform I lead engineering on, different agents in the same pipeline deliberately run on different model tiers. A routing agent that classifies user intent has no business calling a frontier model. It's overkill and it adds latency.
Implement a router layer — even a simple one based on query complexity heuristics — before you reach for full model cascades.
pythondef route_to_model(query: str, context_tokens: int) -> str: # Simple heuristic router — replace with a trained classifier at scale if context_tokens > 8000 or requires_deep_reasoning(query): return "gpt-4o" return "gpt-4o-mini"
This is a starting point. In production you'd train a lightweight classifier on your own query logs to make this decision more accurately.
Cache Aggressively, But Cache the Right Things
Semantic caching is underused. The idea: before hitting the LLM, embed the incoming query and check it against a vector store of recent queries. If the cosine similarity (computed on L2-normalized embeddings, so the dot product equals cosine similarity) is above a threshold, return the cached response.
pythonimport numpy as np def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float: # Assumes both vectors are already L2-normalized return float(np.dot(a, b)) def check_semantic_cache(query_embedding, cache, threshold=0.97): for cached_embedding, response in cache: if cosine_similarity(query_embedding, cached_embedding) >= threshold: return response return None
For high-traffic, repetitive workloads — customer support bots, FAQ assistants, internal search — cache hit rates can be high enough to meaningfully shift your monthly invoice. The threshold matters: too low and you return wrong answers, too high and you never hit the cache. Tune it with evals.
Exact prompt caching (offered natively by Anthropic and OpenAI for repeated prefix tokens) is also worth enabling if you're not using it. For long, stable system prompts, this cuts the effective cost of those prefix tokens significantly.
Context Window Discipline
LLMs charge you for every token in the context window, input and output. Teams building chat applications make a classic mistake: they send the entire conversation history on every turn.
You don't need the full history. You need relevant history. Strategies:
- Rolling window: Keep last N turns. Simple, often good enough.
- Summarization: Periodically compress older turns into a running summary. More complex to implement but preserves semantic content over long sessions.
- RAG over history: For very long sessions, embed past turns and retrieve only the relevant ones. Overkill for most apps, right answer for a few.
Pick the simplest approach that passes your evals. Don't over-engineer it.
Output Length Control
Models are trained to be verbose. They'll use 300 tokens to say what 80 tokens could cover. This is a prompt engineering problem.
Explicit length constraints work:
Respond in 3 bullet points maximum. Each bullet: one sentence.
For structured outputs, use JSON mode or function calling — constrained decoding is faster and token-efficient compared to free-form prose that you then parse.
Also audit your max_tokens parameter. If you're setting it to 2048 as a default and your average response is 200 tokens, you're not overspending per call — but it signals you haven't thought about what your application actually needs.
The Latency Side of the Equation
Cost and latency are coupled but not identical. A few latency-specific levers:
- Streaming: Start rendering output to users immediately. Perceived latency drops dramatically even when actual latency doesn't.
- Parallel calls: If your agent workflow has independent subtasks, fire them concurrently. Sequential chains are a latency trap.
- Smaller models are faster: A GPT-4o-mini call is not just cheaper — it's significantly lower latency. For user-facing features where response time matters, this is a real UX argument, not just a cost argument.
What I Actually Prioritize
When I audit a team's LLM stack, I look in this order:
- Prompt bloat — immediate wins, no architectural change needed
- Model routing — usually requires a small routing layer, pays back fast
- Caching — depends on traffic patterns, high upside for repetitive workloads
- Context window management — important for chat, critical at scale
- Output constraints — polish, but worth doing
None of this requires exotic infrastructure. It requires discipline, an eval harness, and the willingness to measure before and after every change. Ship, measure, iterate.
Working on something like this? I take on a few fractional-CTO and AI engagements at a time.