Why do agents and LLMs need clean text from search results, not HTML?
Raw HTML from a web page is mostly noise. Navigation menus, cookie banners, ad scripts, inline styles, tracking pixels, and dozens of nested <div> tags surround the few paragraphs that actually contain the answer. When an LLM receives raw HTML, it tokenizes all of it. A page with 2,000 words of useful content can produce 15,000 or more tokens of HTML. Most of those tokens carry zero informational value, and they crowd out the content the agent needs to reason from.
| Content type | Raw HTML | Clean markdown |
|---|---|---|
| Navigation and footers | Included, repeated on every page | Stripped |
| JavaScript and CSS | Included in full | Removed |
| Ad and tracking markup | Included | Removed |
| Useful article content | Buried in nested tags | Preserved as readable text |
| Approximate token count | 5-10x the content length | Close to content length |
| LLM reasoning quality | Degraded by surrounding noise | Accurate, focused |
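The contrast above can be sketched with Python's standard library alone. The snippet below is a minimal, illustrative extractor (not how any production service works): it walks a noisy HTML fragment, skips `<script>`, `<style>`, and `<nav>` subtrees, and keeps only visible text, showing how little of the raw markup survives.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <nav> subtrees."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

# A toy page: styling, tracking, and navigation wrapped around one useful paragraph.
page = """
<html><head><style>.wrapper-inner-container{color:#333}</style>
<script src="tracker.js"></script></head>
<body><nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div class="wrapper-inner-container"><p>The answer the agent needs.</p></div>
</body></html>
"""

extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.parts)
print(f"{len(page)} chars of HTML -> {len(text)} chars of content")
print(text)  # The answer the agent needs.
```

Even in this tiny example the markup outweighs the content several times over; on real pages, with full stylesheets and ad scripts, the ratio is far worse.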
Context window size is finite. Every token spent on <nav>, <script>, and class="wrapper-inner-container" is a token unavailable for actual content. For agents running multi-step research across several pages, this adds up fast. Clean text also reduces hallucination risk: models parsing HTML sometimes misread tag attributes as content, or confuse structured markup with prose. Plain markdown preserves headings, lists, tables, and links without any of the ambiguity.
Firecrawl's Search API returns clean markdown per result, not raw HTML. The extraction step runs server-side, so agents receive content that fits efficiently in a prompt and can be reasoned from immediately.
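A hedged sketch of what a call looks like from an agent's side: the endpoint path, field names (`query`, `limit`, `scrapeOptions`, `markdown`), and response shape below are assumptions based on Firecrawl's public API docs, and the sample response is illustrative, not a real capture — check the current API reference before relying on them. The key point is that the agent reads only the `markdown` field per result.

```python
import json

API_URL = "https://api.firecrawl.dev/v1/search"  # assumed endpoint

# Request body an agent might send (e.g. via `requests.post(API_URL, json=payload, ...)`).
payload = {
    "query": "how do transformers use attention",
    "limit": 3,
    # Ask the server-side extractor for markdown instead of raw HTML.
    "scrapeOptions": {"formats": ["markdown"]},
}

# Illustrative response shape (fabricated for the sketch):
sample_response = {
    "success": True,
    "data": [
        {
            "url": "https://example.com/attention",
            "title": "Attention, explained",
            "markdown": "# Attention\n\nSelf-attention lets each token weigh the others...",
        }
    ],
}

# The prompt gets only the clean markdown per result, never the page's HTML.
snippets = [result["markdown"] for result in sample_response["data"]]
print(json.dumps(payload, indent=2))
```

Because extraction happens before the response leaves the server, the agent never pays the token cost of the surrounding markup.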