Why do agents and LLMs need clean text from search results, not HTML?
Raw HTML from a web page is mostly noise. Navigation menus, cookie banners, ad scripts, inline styles, tracking pixels, and dozens of nested <div> tags surround the few paragraphs that actually contain the answer. When an LLM receives raw HTML, it tokenizes all of it. A page with 2,000 words of useful content can produce 15,000 or more tokens of HTML. Most of those tokens carry zero informational value, and they crowd out the content the agent needs to reason from.
| Content type | Raw HTML | Clean markdown |
|---|---|---|
| Navigation and footers | Included, repeated on every page | Stripped |
| JavaScript and CSS | Included in full | Removed |
| Ad and tracking markup | Included | Removed |
| Useful article content | Buried in nested tags | Preserved as readable text |
| Approximate token count | 5-10x the content length | Close to content length |
| LLM reasoning quality | Degraded by surrounding noise | Accurate, focused |
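The contrast above can be sketched with Python's standard library alone. The snippet below is a minimal, illustrative extractor (not how any production service works): it walks a noisy HTML fragment, skips `<script>`, `<style>`, and `<nav>` subtrees, and keeps only visible text, showing how little of the raw markup survives.

```python
from html.parser import HTMLParser

class TextExtractor(HTMLParser):
    """Collects visible text, skipping <script>, <style>, and <nav> subtrees."""
    SKIP = {"script", "style", "nav"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self.skip_depth = 0  # > 0 while inside a skipped subtree

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIP:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIP and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth and data.strip():
            self.parts.append(data.strip())

# A toy page: styling, tracking, and navigation wrapped around one useful paragraph.
page = """
<html><head><style>.wrapper-inner-container{color:#333}</style>
<script src="tracker.js"></script></head>
<body><nav><a href="/">Home</a> <a href="/about">About</a></nav>
<div class="wrapper-inner-container"><p>The answer the agent needs.</p></div>
</body></html>
"""

extractor = TextExtractor()
extractor.feed(page)
text = " ".join(extractor.parts)
print(f"{len(page)} chars of HTML -> {len(text)} chars of content")
print(text)  # The answer the agent needs.
```

Even in this tiny example the markup outweighs the content several times over; on real pages, with full stylesheets and ad scripts, the ratio is far worse.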
Context window size is finite. Every token spent on <nav>, <script>, and class="wrapper-inner-container" is a token unavailable for actual content. For agents running multi-step research across several pages, this adds up fast. Clean text also reduces hallucination risk: models parsing HTML sometimes misread tag attributes as content, or confuse structured markup with prose. Plain markdown preserves headings, lists, tables, and links without any of the ambiguity.
Firecrawl's Search API returns clean markdown per result, not raw HTML. The extraction step runs server-side, so agents receive content that fits efficiently in a prompt and can be reasoned from immediately.
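A hedged sketch of what a call looks like from an agent's side: the endpoint path, field names (`query`, `limit`, `scrapeOptions`, `markdown`), and response shape below are assumptions based on Firecrawl's public API docs, and the sample response is illustrative, not a real capture — check the current API reference before relying on them. The key point is that the agent reads only the `markdown` field per result.

```python
import json

API_URL = "https://api.firecrawl.dev/v1/search"  # assumed endpoint

# Request body an agent might send (e.g. via `requests.post(API_URL, json=payload, ...)`).
payload = {
    "query": "how do transformers use attention",
    "limit": 3,
    # Ask the server-side extractor for markdown instead of raw HTML.
    "scrapeOptions": {"formats": ["markdown"]},
}

# Illustrative response shape (fabricated for the sketch):
sample_response = {
    "success": True,
    "data": [
        {
            "url": "https://example.com/attention",
            "title": "Attention, explained",
            "markdown": "# Attention\n\nSelf-attention lets each token weigh the others...",
        }
    ],
}

# The prompt gets only the clean markdown per result, never the page's HTML.
snippets = [result["markdown"] for result in sample_response["data"]]
print(json.dumps(payload, indent=2))
```

Because extraction happens before the response leaves the server, the agent never pays the token cost of the surrounding markup.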