12 Ways to Cut Token Consumption in Claude Code

Hiba Fathima

Jun 05, 2026

TL;DR: Claude Code token efficiency techniques

Tip	What it does
Know your hidden baseline	20,000-30,000 tokens load before you type anything
Feed agents clean web data, not HTML soup	94% fewer input tokens per web page with Firecrawl
Strip CLAUDE.md to under 500 tokens	91.9% context reduction, no quality regression
Use .claudeignore AND permissions.deny	Advisory vs. enforced: use both
Move rules to path-scoped .claude/rules/	41% overhead reduction from always-loaded rules
Filter tool output before Claude reads it	80-99% compression on build and test logs
Scope prompts precisely and store your best constraints	Specific instructions + reusable snippets = consistent output
Separate modes: plan first, build second, verify	14% fewer tokens with structured sessions
Control MCP server overhead	10,000-20,000 tokens per server, per session
Match the model to the task	Up to 75% cost reduction from deliberate model routing
Use /compact, /clear, and /rewind	Control context size before it controls you
Use skills for progressive disclosure	30-100 tokens at startup vs. full load on demand

A Reddit user recently posted their May usage stats: 1,156,308,524 input tokens in a single month. The thread went viral. Most of the replies weren't shock — they were recognition.

Claude Code token efficiency has become a hot topic of discussion — see also how it stacks up against Codex as an AI coding tool, and the broader case for why AI agents prefer the CLI over IDE-based alternatives, including the token math behind that preference.

It is not just individual developers. Microsoft gave thousands of its own engineers access to Claude Code in December, and it proved perhaps a little too popular. By May, the company was canceling most of those licenses — with the decision described internally as "also a financial one." Developers are voting with their workflows too:

The symptom shows up before the bill does. Long sessions start to feel sluggish — Claude revisits files it already resolved, drags failed approaches back into its reasoning, and processes stale logs that have nothing to do with the current task. Output quality drops. It is not the model getting worse. It is the context getting noisier.

Another user shared that a "hi" prompt with nothing prior consumed 31,000 tokens before any code was discussed. Most people assume the model is just being difficult. It is a reasonable conclusion. It is also wrong, and that is actually the good news — a context problem is fixable.

What is a token?

A token is the smallest unit a language model processes. For Claude, one token is roughly 3.5 English characters — so a word like "function" is about two tokens, and a full line of code might be 10-20. Tokens are not words and they are not characters; they sit somewhere in between.

Every message you send and every response Claude gives is measured in tokens. Every file Claude reads gets tokenized and added to the running total. That 31,000-token "hi" response was not 31,000 words — it was roughly 108,000 characters of system prompt, config files, and tool schemas that had to travel with your two-letter message.

This is why so many users (including me) are losing their minds.

12 ways to cut token consumption in Claude Code

1. Know your session's hidden baseline

Your session starts with 20,000-30,000 tokens in the context window before you type a single character.

This is not a bug and it is not unique to your setup. It is the fixed overhead of how Claude Code initializes: system prompt, CLAUDE.md (global and project-level), memory files, MCP server tool schemas, and skill names plus descriptions. All of it loads at session start. All of it is present in every message you send.

The GitHub issue #52979 confirming that a "hi" prompt consumed roughly 31,000 tokens is not an outlier. It is the baseline. Every optimization below is reducing from that floor.

Three commands worth knowing:

/context: a live breakdown of everything in the context window, including token counts per element and cumulative usage
/usage: new in Opus 4.8 — shows exactly which component is consuming your tokens, announced by the creator of Claude Code, Boris Cherny, before it shipped:
/memory: exactly which CLAUDE.md and memory files loaded at session start

This is the mental model that makes every other technique make sense. You are not starting from zero. You are starting from 25,000. Everything you do either holds that floor steady or pushes it up.

Past roughly 300,000-400,000 tokens, attention gets distributed so widely that response quality degrades noticeably. Thariq Shihipar, an engineer at Anthropic, calls this "context rot" — the context is not full, it is poisoned. The model is not broken. The context is.

Where the skeptics have a point: Not all of that baseline is waste. The system prompt and skill descriptions are doing real work. The goal is not to eliminate context. It is to make sure what is there is load-bearing.

2. Feed your agent clean web data, not HTML soup

When a Claude Code agent looks something up on the web, the default is a raw HTML fetch. What Claude receives is a full page.

A typical web page returns roughly 38,000 tokens of raw HTML: navigation menus, scripts, cookie banners, ads, footers, and the roughly 2,800 tokens of actual content somewhere inside. Claude processes all 38,000 tokens. You pay for all 38,000 tokens. We benchmark this precisely in Claude web fetch vs Firecrawl for anyone who wants the full side-by-side.

On a research-heavy agent workflow where Claude fetches ten pages per session, that is 380,000 tokens of noise per session. At Sonnet 4.6 pricing ($3/M input), that is $1.14 per session, every session, in content the model cannot reason about.

Firecrawl strips pages to clean markdown before they reach Claude: 94% fewer input tokens per scrape, by default. The median page goes from 38,381 tokens to 2,788 tokens. Claude reads only the signal.

The numbers:

Model	Saved per scrape	Saved per 10k scrapes
Claude Opus 4.7 ($5/M)	$0.18	$1,799
Claude Sonnet 4.6 ($3/M)	$0.11	$1,079
GPT-5.4 ($2.5/M)	$0.09	$890

Hobby plan ($19/mo): breakeven at 176 scrapes (3.5% of plan credits). Full plan utilization returns $540 saved for $19 spent, a 28x ROI. Standard plan: 109x-182x. Growth: 135x-225x.

This is independently validated. The MindStudio benchmark notes:

Firecrawl converts any URL into clean structured data, which translates to up to 80% token reduction compared to feeding raw HTML directly.

Applies to both research modes. Research-heavy agents hit the web in two ways: finding pages (search) and extracting from them (scrape). Firecrawl handles both. The agent never fumbles with cookie banners and nav menus in its context window.

Firecrawl is not infrastructure cost. It is a discount on your LLM bill that happens to also fix all the annoying parts of web scraping. Set it up via the official Firecrawl plugin for Claude Code.

3. Strip CLAUDE.md to under 500 tokens

CLAUDE.md is the most expensive single file you control. It loads on every session and stays for every message, forever.

A benchmark from the token-optimizer repo compared a 3,847-token CLAUDE.md (the kind generated when you let Claude document everything it encounters during onboarding) with a 312-token version stripped to only what Claude cannot infer from the code itself. Result: 91.9% context reduction with no quality regression. On a $500/month API spend, that is roughly $460 saved per month from editing one file.

The rule: if Claude can infer it from reading the codebase, cut it. If a senior developer could figure it out in twenty minutes of reading, cut it.

What stays:

Non-obvious build and test commands
Architecture decisions that go against framework defaults
Project-specific constraints a new engineer would not guess
Things that would genuinely surprise an experienced developer new to the repo

Anthropic's official guidance is under 200 lines. Some teams run at 60.

Three things most CLAUDE.md authors don't know:

HTML comments () are stripped before injection. They cost zero tokens. Use them for teammate notes and rationale that Claude doesn't need.
@path/to/file imports let you split CLAUDE.md across multiple files, but all imported files still load at session start. Splitting is organizational, not a token-saving move.
CLAUDE.local.md (gitignored) is your personal preferences file. Good for local sandbox URLs and notes you don't want in the shared config.

The skeptic's view: HN developer gavinray pushed back on aggressive CLAUDE.md trimming: "If the Very Smart People working on CC haven't integrated a feature... it's probably because it doesn't improve performance." That is a fair prior. The counter: the trimming works because it removes things Claude already knows from training, not things it genuinely needs. Cutting Next.js routing conventions from a CLAUDE.md is not removing valuable context. It is removing redundant context.

4. Use .claudeignore AND permissions.deny (they are not the same)

One is a suggestion. The other is a hard block. You need both.

.claudeignore signals to Claude that certain files are not relevant. It is advisory. Multiple GitHub issues (#36163, #51105) document that Claude can still read ignored files if it decides they are necessary for the task.

permissions.deny in .claude/settings.json is enforced at the permission layer. It blocks the Read tool entirely for matched paths. Claude cannot read them regardless of what it decides.

{
  "permissions": {
    "deny": ["Read(node_modules/**)", "Read(dist/**)", "Read(*.lock)"]
  }
}

In practice: use .claudeignore for the signal layer (Claude won't proactively include these files) and permissions.deny for the hard block (Claude can't read them even if it tries). Measured results show an 85.5% context reduction from .claudeignore discipline alone.

Minimum .claudeignore for any project:

node_modules/
dist/
build/
.next/
__pycache__/
*.lock
coverage/
*.generated.*
*.min.js
*.min.css

Check it into version control. Every team member gets the same context discipline automatically.

5. Move rules to .claude/rules/ with path scoping

Rules without path frontmatter load at session start like a second CLAUDE.md. Rules with paths: frontmatter load only when Claude reads a matching file.

This distinction is the most underused optimization in the Claude Code ecosystem. If you have a .claude/rules/ directory full of rule files with no paths: frontmatter, every one of them loads at session start whether you're working on frontend, database, or tests. They're indistinguishable from a bloated CLAUDE.md in terms of token cost.

Add paths: frontmatter and they become invisible until needed:

---
paths:
  - "src/api/**/*.ts"
---
# API Layer Rules
All endpoints must validate input with Zod schemas.
Response errors must use the shared ApiError class.
Never return raw Prisma errors to the client.

This rule costs nothing during frontend work. It enters context only when Claude first touches a file in src/api/. API layer rules don't load during frontend work. Frontend rules don't load during database migrations.

One documented case (Zenn, 2025) reduced always-loaded rule overhead from 1,358 lines to 807 lines: a 41% overhead reduction by converting procedure-heavy rule files into Skills (on-demand only) and scoping domain rules to their respective directories.

6. Filter tool output before Claude reads it

Raw logs are context poison. A 10,000-line test run gives Claude noise, not signal.

When Claude runs a bash command, the entire output gets injected into the context window as a tool result. That result then travels with every subsequent message for the rest of the session. A failing build log that prints 10,000 lines of compilation output is 10,000 lines of noise Claude has to reason around.

The fix is filtering before Claude sees it. A PostToolUse hook on bash commands can compress a 10,000-line build log to a 200-line error summary before it enters context. The manual version:

npm test 2>&1 | grep -A5 -E "FAIL|ERROR|Expected|Received" | head -100

Three tools that automate this at the hook level:

RTK (github.com/rtk-ai/rtk): a Rust CLI proxy that compresses git, npm, and build log output 80-99% before it hits the context window
Headroom (github.com/chopratejas/headroom): a localhost proxy that applies roughly 34% context compression
Context Mode (github.com/mksglu/context-mode): a Claude Code skill that intercepts verbose shell output and passes only meaningful parts to context, while also maintaining a running session log. When Claude resets mid-session due to context limits, Context Mode restores the log automatically so work resumes where it left off. Sessions that used to die at the 30-minute mark run for hours.

sillysaurusx on HN thread #47581701 shares:

RTK compresses shell output 60-90% before it hits the context window. Stacks on top of Headroom.

The key insight: Claude does not need to see a successful test pass. It needs to see what failed and why. Everything else is overhead that compounds for the rest of the session.

7. Scope prompts precisely and store your best constraints

Vague prompts are token multipliers. Open-ended questions invite Claude to generate possibility space. All of it is billed per token.

"What do you think about the auth flow?" is an invitation. The model will generate options, tradeoffs, philosophy, alternatives, adjacent considerations, and things you will never use. It is not misbehaving. It is doing exactly what continuation probability rewards: keep going until something lands.

The fix is treating prompts like instructions on industrial equipment: specific verb, specific scope, specific constraint.

Before:

what do you think about the auth flow?

After:

Identify the security issue in the token validation logic.
Return the vulnerable line and explain why it's exploitable.
Under 150 words.

Negative constraints work. Fencing the scope with explicit exclusions eliminates both scope creep and unnecessary output:

do not redesign the architecture
do not explain basics
do not add dependencies
do not touch code outside the function I specified
do not rewrite working tests

Answer budgets work. "Answer in five bullets maximum." "Return the patch only, no explanation." "Under 200 tokens." Quality often improves under constraint because the model has to select rather than generate everything plausible.

Once you find constraints that work, store them. If you retype the same instructions every session you are paying the tax twice: once in keystrokes, once in output variance. Different wording across sessions produces surprisingly different model behavior. Build a snippets file and load the relevant subset at session start:

preserve all existing comments
minimal diff only, do not touch working code
TypeScript strict mode, no any
no new dependencies
mobile-first for any UI changes
explain your reasoning before writing code

When output quality drops, you can audit the inputs rather than just re-prompting and hoping. Also consider /commands for multi-step sequences — named aliases Claude executes deterministically are cheaper than natural language prompts, which are probabilistic and can drift across sessions.

8. Separate modes: plan first, build second, verify before closing

Mixed-mode sessions reliably cost the most tokens per hour of useful output.

Planning while implementing. Debugging while also redesigning. Asking "while we're here, should we rethink the whole structure?" during what should have been a fifteen-minute fix. Every architectural tangent becomes working context Claude reasons against. Entropy in the context produces entropy in the response, and that relationship is consistent.

Use plan mode (double Shift+Tab) to separate reading from writing. Claude maps dependencies, reads the relevant files, and produces a plan before touching anything. Misunderstandings caught at this stage cost almost nothing to fix. Misunderstandings discovered after Claude finishes writing code require a new session with fresh context loading. The MindStudio benchmark measured 14% fewer tokens and 9% lower cost from a structured five-phase approach versus unstructured sessions at the same model.

Three distinct session types beyond planning:

Implementation session: starts from a specific section of a plan, not the entire document.

Debugging session: starts from a structured incident format rather than a narrative description:

expected: webhook processes in under 500ms
actual: times out after 30s on payloads over 10KB
reproduction: any request with body > 10KB to /api/webhooks
recent changes: added payload validation in middleware, PR #47
logs: [paste relevant lines only]
suspected scope: likely the base64 encoding step in validatePayload()

Not narrative. Not conversational. Clean surface. This replaces paragraph-form bug descriptions, which are expensive because they require Claude to extract what it needs from prose.

Verify before closing. Bugs caught during the current session, while files are still in context, are fixed cheaply. Bugs found after Claude finishes are fixed in a new session with fresh context loading and re-reading of all the relevant files.

9. Control MCP server overhead

Every connected MCP server loads its full tool schema at session start, whether you use it or not.

Measured overhead: 10,000-20,000 tokens per server per session. A setup with multiple servers connected silently adds 50,000-70,000 tokens of schema overhead before you type anything. That number rides in every message for the entire session.

Two fixes:

ENABLE_TOOL_SEARCH defers schema loading until Claude actually needs a specific tool. Schemas load on demand rather than at startup.
Disconnect servers you are not actively using. This is the simplest fix and the one most people skip because connecting MCP servers feels like setup cost that should be amortized across all sessions.

Critical for API users: connecting or disconnecting an MCP server mid-session wipes your entire prompt cache. Do this at session start, not mid-conversation.

The practical rule: keep only the servers you need for the current session. If you're evaluating which MCP servers for developers are worth the overhead, the answer is almost always: fewer than you currently have connected. Every active tool is a context tax that compounds across every message. The best Claude Code plugins guide covers which ones actually justify the overhead.

10. Match the model to the task

Running Opus on everything is the most expensive default habit in Claude Code.

Model routing can cut costs by up to 75% — I've seen it consistently across my own day-to-day Claude sessions just from being deliberate about which model handles which task. The mechanism is straightforward: Opus costs roughly 5x more per token than Haiku. Most tasks in a typical session don't require Opus-level reasoning.

A practical decision ladder:

Sonnet: default for most tasks, code generation, documentation, refactoring
Opus: deep architectural decisions, tricky multi-file bugs, anything that failed multiple times already
Haiku: subagents, log inspection, file reading, boilerplate generation, anything fast and repetitive

CLAUDE_CODE_SUBAGENT_MODEL=haiku

Set this environment variable so exploration subagents run on Haiku while the main session stays on Sonnet. Also consider MAX_THINKING_TOKENS=0 for simple tasks where extended reasoning is overkill.

Opus 4.8 also ships with Fast mode: the same model running at 2.5x speed, now three times cheaper than fast mode on previous versions. Activate it with /fast for routine tasks that don't need deep reasoning but still need frontier-quality output. Switch back with /standard when the task calls for it.

Where the skeptics are right: HN commenter scosman made the important counter-argument: "Inference time scaling, generating more tokens when getting to an answer, helps produce better answers. Optimizing for minimal length when the model was RL'd on task performance seems detrimental." This is correct for complex tasks. Model routing is about matching horsepower to the job, not turning off the engine. Don't route complex architectural decisions to Haiku to save a few dollars. That trade is not worth it.

The goal is using Haiku for things Haiku can do well, so Sonnet and Opus have budget for what only they can do.

11. Use /compact and /clear strategically

Three commands for reclaiming context — and most people only ever use one.

/rewind is the most underused command in Claude Code. Double-tap Escape (or type /rewind) to open a checkpoint interface, then roll back to the moment just after Claude read your files — before the failed attempt began. Everything after that checkpoint is deleted from context permanently. Then re-instruct with what you learned: "Don't use approach A, the foo module doesn't expose that interface. Go directly to B using the pattern in utils/fetch.ts." Claude has the file knowledge it built up without the noise of the failed path. This produces better results than typing "that didn't work, try X instead" — which leaves the full failed attempt in context forever.

/compact mid-session compresses the conversation history with Claude's understanding of what happened. The critical detail: run it proactively around 250,000-300,000 tokens, not when you've already hit the limit. When context is already rotted, Claude gives you the worst summary at the moment you need the best one. Direct it: "Compact, preserving the current working approach to the auth middleware and the list of files we've already fixed."

The difference between /compact and /handoff is intent. Compact keeps you in the same thread with a compressed summary — you are recovering focus within one conversation. Handoff is for deliberately moving on: to a new session, a fresh worktree, or a completely different agent. After roughly 120,000 tokens, attention relationships strain and response quality degrades noticeably — that is the threshold to run /handoff rather than compact.

/clear wipes the slate entirely. The workflow that makes this not painful:

At the end of a productive stretch, write a session-handoff file: current goal, changed files, key decisions, failing tests, and the specific next step.
Run /clear.
Restart with: "Read session-handoff.md and continue."

The handoff file is cheaper than carrying four hours of conversation history into every subsequent message.

Matt Pocock (@mattpocockuk) built a skill for this that has been widely shared in the community:

I wrote a skill called /handoff. Whenever a session is nearing a compaction limit... it generates and commits a markdown file explaining everything it did... It's an excellent daily report type system. -- (gist)

/btw is less known but useful when you have a sidebar thought you would otherwise inject into the main thread. It opens a parallel inference channel against the current session knowledge without adding to conversation history. The thought gets processed; the main context stays clean.

12. Use skills for progressive disclosure

At session start, Claude reads each skill's name and description: roughly 30-100 tokens each. The full SKILL.md body only loads when Claude determines the skill is relevant.

This is progressive disclosure applied to capabilities. You can have dozens of Claude Code skills installed with no impact on sessions where they don't apply. For a walkthrough of writing your own SKILL.md file with a working Firecrawl example, see our Claude Code skill tutorial.

Two things that make skills work:

Precise descriptions matter. "Use when filling PDF forms and extracting table data" triggers correctly. "Helps with documents" doesn't. Claude pattern-matches your task description against the skill description at startup. Vague descriptions produce no activation or wrong activation.

Skills as CLAUDE.md overflow. Everything you stripped from CLAUDE.md to hit the 500-token target can live in skills with precise trigger descriptions. It stays accessible without loading on every session.

The compounding benefit: as you move content from CLAUDE.md into path-scoped rules and precise skills, the session baseline drops. Everything added to CLAUDE.md is paid every session forever. Skills with correct descriptions are paid only when needed.

One skill worth calling out directly for token efficiency: Caveman (github.com/JuliusBrussee/caveman). It strips narration, filler, and pleasantries from Claude's responses while keeping every technical fact and code block intact. Benchmarks across 10 real prompts show an average 65% output token reduction (range 22-87%). In caveman mode, "The reason your component is re-rendering is likely because you're creating a new object reference on each render cycle. I'd recommend using useMemo." (69 tokens) becomes "New object ref each render. Wrap in useMemo." (19 tokens) — same fix, 75% fewer words. It also ships a /caveman-compress sub-skill that rewrites your CLAUDE.md into compressed form, cutting ~46% of input tokens on every future session.

Bonus: Know when not to use Claude

The most credible token-efficiency advice is also the hardest to take seriously.

Sometimes the fastest path is opening the file and fixing it yourself.

Localized fixes with no ambiguity. Changes where you already know exactly what to write. Problems where the verification overhead exceeds the work itself. Some tasks are faster without an assistant in the loop, and recognizing that boundary is part of using Claude Code well.

Claude Code works best when you treat it like a powerful engineer with limited working memory. Don't dump everything on it. Don't make it read garbage. Don't let old context pile up. Give it the right files, the right errors, the right data, and a clean session.

Entropy in the context produces entropy in the response. That relationship is consistent. That is where the savings actually come from.

Frequently Asked Questions

Why does Claude Code use so many tokens?

Claude Code re-sends the entire conversation history on every message. Every file Claude reads gets added to that history permanently. A PR review that pulls 20 files into context means those files are reprocessed on every subsequent message for the rest of the session. Add the 20,000-30,000 token base overhead that loads before you type anything, and the numbers compound fast.

How do I see how many tokens I'm using in Claude Code?

Run /context at any point in a session to see a live breakdown of everything in the context window, including token counts per element and total usage. Run /memory to see which CLAUDE.md and memory files loaded at session start.

How much can CLAUDE.md optimization actually save?

One benchmark compared a 3,847-token CLAUDE.md with a 312-token version stripped to only what Claude cannot infer from the code. Result: 91.9% context reduction with no quality regression. Since CLAUDE.md loads on every session and persists the entire time, cutting it is the single highest-leverage optimization available.

What is .claudeignore and does it actually work?

A .claudeignore file signals to Claude that certain files are not relevant. Unlike .gitignore, it is advisory rather than enforced. Claude can still read ignored files if it decides they are necessary. For hard enforcement, use permissions.deny in .claude/settings.json, which blocks the Read tool at the permission layer.

Does using fewer tokens hurt output quality?

Not when done correctly. Trimming irrelevant context typically improves quality by reducing noise the model has to reason through. Context discipline applied correctly is a quality upgrade, not a cost sacrifice. The pattern that hurts quality is disabling reasoning on complex tasks or routing complex work to an underpowered model.

What is the best model to use in Claude Code to save tokens?

Use Sonnet as your default. Reserve Opus for deep architectural decisions and tricky multi-file bugs. Use Haiku for subagent tasks like log inspection, file reading, and boilerplate generation. Set CLAUDE_CODE_SUBAGENT_MODEL=haiku to route exploration agents to Haiku automatically while keeping your main thread on Sonnet.

Why do MCP servers add so many tokens?

Every connected MCP server loads its full tool schema into every request at session start, whether you use it or not. On a setup with several MCP servers, this can silently add 50,000-70,000 tokens of overhead per session. Setting ENABLE_TOOL_SEARCH defers schema loading until Claude actually needs a specific tool.

How does Firecrawl reduce Claude Code token usage?

When Claude Code agents look things up on the web, a raw HTML fetch returns roughly 38,000 tokens per page. Firecrawl strips pages to clean markdown at around 2,800 tokens, a 94% reduction. This saves approximately $0.11 per page on Claude Sonnet 4.6, which compounds significantly on any research-heavy agent workflow.

Can I count tokens before sending a message to Claude?

Yes. The Anthropic API has a dedicated token counting endpoint (POST /v1/messages/count_tokens) that accepts the same inputs as a normal message and returns the token count before any inference runs. It is free to use and supports text, images, PDFs, tools, and system prompts. In Claude Code, run /context at any point to see a live breakdown of everything currently in the context window.

What is the extractMemories background process and how do I disable it?

After every user message, Claude Code forks your entire conversation context into a separate parallel API call to extract memories for future sessions. Because both requests overlap, this fork always gets a cache miss and pays full input token price — invisibly, on every turn. To disable it, run /memory in Claude Code and turn auto-memory off. You can also set the environment variable DISABLE_NON_ESSENTIAL_MODEL_CALLS=1 to stop all background model calls that are not essential to your current task.

Does pasting large text into Claude Code waste tokens?

Yes. Pasting large chunks of text directly into the CLI prompt is significantly less efficient than reading from a file. The pasted content gets added to conversation history and reprocessed on every subsequent message for the rest of the session. Instead, save the content to a temporary file and tell Claude to read it. Claude reads the file once, and the file path — not the full content — travels with later messages.

Ready to build?

Table of Contents