What actually cuts Opus token costs

I’ve been running Opus 4.6 on the API and tracking what cuts cost without degrading output. The pricing is $5/MTok input, $25/MTok output, $0.50/MTok cache reads. Output is 5x more expensive than input, so controlling response length is the highest-ROI change. Three lines in CLAUDE.md handle most of it:

- Only make changes directly requested. No bonus refactors.
- Choose an approach and commit. No hedging with "alternatively".
- Be definitive: no "might", "could", "perhaps" in technical statements.

Opus over-engineers, over-thinks, and hedges. Anthropic documents this in their own migration guide. Those three lines fix it.
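The pricing asymmetry is worth sanity-checking with quick arithmetic. A minimal sketch — the token counts are made-up illustrations, not measurements:

```python
# Why output discipline is the high-ROI lever: at $5/MTok input and
# $25/MTok output, every output token costs 5x an input token.
PRICE_IN, PRICE_OUT = 5.0, 25.0  # USD per million tokens

def call_cost(tokens_in: int, tokens_out: int) -> float:
    return (tokens_in * PRICE_IN + tokens_out * PRICE_OUT) / 1_000_000

# Same 20K-token prompt; a hedging, refactor-happy reply vs a terse one.
verbose = call_cost(20_000, 4_000)
terse = call_cost(20_000, 1_500)
print(f"verbose: ${verbose:.4f}  terse: ${terse:.4f}")
```

Halving output length here saves more than trimming the entire prompt would.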

Context reprocessing

Claude has no hidden memory between turns. Every API call sends the full conversation (system prompt, history, new input) and processes it all again. One developer tracked a 100+ message session: 98.5% of tokens went to re-reading history, 1.5% to generating output.
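That 98.5/1.5 split falls out of simple accumulation. A toy model (the per-turn sizes are assumptions, not the developer's actual numbers):

```python
# Toy model of a long session: each API call re-sends the whole transcript.
# Assumes every turn adds a 50-token user message and a 300-token reply.

def token_breakdown(turns: int, user_tokens: int = 50, reply_tokens: int = 300):
    history = input_total = output_total = 0
    for _ in range(turns):
        history += user_tokens    # new user message joins the context
        input_total += history    # the full transcript is processed again
        history += reply_tokens   # the reply joins the context too
        output_total += reply_tokens
    return input_total, output_total

inp, out = token_breakdown(100)
print(f"input (history re-reads): {inp:,}")
print(f"output (new text): {out:,}")
print(f"history share: {inp / (inp + out):.1%}")  # ~98% at 100 turns
```

Input grows quadratically with turn count while output grows linearly, which is why long sessions get so lopsided.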

MCP servers make this worse. Each connected server loads all its tool definitions into context on every single message. One server can add 18,000 tokens per turn. With too many enabled, your 200K window shrinks to about 70K usable. I keep under 10 MCPs and disconnect anything I’m not actively using.
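The window shrinkage is straight subtraction. Using the figures above (200K window, ~18K tokens for a heavy server):

```python
# How MCP tool definitions eat the context window: each connected server's
# tool schemas ride along on every single message.
WINDOW = 200_000
TOKENS_PER_SERVER = 18_000  # a heavy server; many are smaller

def usable_context(servers: int) -> int:
    return WINDOW - servers * TOKENS_PER_SERVER

for n in (0, 3, 7):
    print(f"{n} servers -> {usable_context(n):,} tokens usable")
# 7 heavy servers leave ~74K of the 200K window for actual work
```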

Model routing

Sonnet 4.6 vs Opus 4.6 on SWE-bench: 79.6% vs 80.8%. A 1.2 point gap. Users preferred Sonnet 4.6 over Opus 4.5 59% of the time. I default to Sonnet and switch to Opus per-task with /model. The hidden /model opus plan uses Opus for planning and drops to Sonnet for execution. That alone cuts costs by about 60%.

Session practices

- /clear between unrelated tasks. This is the single biggest session-level optimization, and most people don't use it.
- Start a new session every 15-20 messages.
- /compact at 60% context (not 95%), with explicit instructions: /compact preserve all file paths, tool decisions, and architectural choices.
- Use @file references instead of pasting file contents.
- Batch related questions into a single prompt.

Anthropic adjusts how fast your session window drains based on demand (since March 2026). Peak hours are 8 AM to 2 PM Eastern on weekdays.

The advisor tool

Launched April 9, 2026, still in beta. A cheap executor (Sonnet or Haiku) consults Opus mid-generation within a single API call. Opus sees the full transcript, produces a 400-700 token plan, the executor continues. Easy tasks never touch Opus.

Numbers: Sonnet + Opus advisor gets 74.8% SWE-bench (up 2.7pp from Sonnet alone) at 11.9% less cost than Opus solo.

import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
prompt = "Fix the failing test in the auth module"

response = client.messages.create(
    model="claude-sonnet-4-6-20260612",
    max_tokens=8096,
    # Beta headers go through extra_headers in the Python SDK
    extra_headers={"anthropic-beta": "advisor-tool-2026-03-01"},
    tools=[{
        "type": "advisor_20260301",
        "model": "claude-opus-4-6-20260612",
        "max_uses": 3,
    }],
    messages=[{"role": "user", "content": prompt}],
)

The whole exchange happens inside one /v1/messages request. No extra round trips. Set max_uses to cap how many times the executor escalates. In Claude Code, /model opus plan is the closest equivalent.

Compaction

Auto-compaction is the most complained-about Claude Code feature. Claude follows rules perfectly before compaction, then violates them consistently after. It forgets skills, procedures, sometimes which repo it’s in. Extended thinking makes it worse (the API rejects modified <thinking> blocks, making sessions unrecoverable).

CLAUDE.md is compaction-proof. Move critical rules there. Set CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50 to trigger earlier. You can also add a custom compaction prompt in ~/.claude/settings.json with explicit instructions on what to preserve.
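Concretely, the earlier trigger is just an environment variable (the exact settings.json key for a custom compaction prompt varies by Claude Code version, so check your install's docs):

```shell
# Trigger auto-compaction at 50% of the context window instead of the default.
export CLAUDE_AUTOCOMPACT_PCT_OVERRIDE=50
```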

Tools

squeez compresses bash output by up to 95% via Claude Code hooks. Zero config. Caveman forces terse output and cuts about 65% of output tokens (a March 2026 paper found brevity constraints improved accuracy by 26pp, so there's no quality tradeoff). TOON is a compact serialization format, about 40% fewer tokens than JSON with slightly better retrieval accuracy.

What to skip

Over-aggressive prompt compression on agentic sessions. Suppressing verbosity during multi-step coding causes Claude to lose track. Developers report shorter prompts trigger more back-and-forth, increasing total cost. Reasoning tokens often prevent expensive re-reads later.

Removing markdown headers from CLAUDE.md saves a handful of tokens and degrades instruction compliance. Not worth it. Same for generic prompt-optimizer libraries: unpredictable quality tradeoff, no single optimizer generalizes. CompressGPT-style tools that compress prompts into abbreviations cost more tokens than they save unless you reuse the compressed prompt many times (prompt caching does this without overhead).