How do I convert web pages into markdown for AI?
Converting a web page to markdown for AI means stripping navigation, ads, scripts, and layout markup from the HTML while preserving the content in a format a language model can reason over directly. The main decision is where the conversion happens: in a client-side library you run locally, or in an extraction API that handles fetching, rendering, and cleaning in one call. LLM-ready content requires stripping more than tags: cookie banners, share buttons, and related-article lists inflate the context without contributing information, and most HTML-to-text libraries do not remove them.
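To see why, here is a small sketch using the html2text package on an assumed toy fragment that mixes boilerplate with an article. The library converts the markup faithfully, which means the boilerplate survives into the markdown:

```python
# pip install html2text
# Toy HTML is an assumption for illustration; real pages bury the
# article far deeper in nav, banners, and widgets.
import html2text

html = """
<nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
<div class="cookie-banner">We use cookies. <button>Accept</button></div>
<article><h1>Actual Article</h1><p>The content you wanted.</p></article>
"""

print(html2text.html2text(html))
# The output keeps the nav links and cookie text alongside the article --
# exactly the context inflation described above. Removing it is a
# separate cleaning step that html2text does not perform.
```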
| Approach | Works on JS-rendered pages | Handles arbitrary external URLs | Setup required |
|---|---|---|---|
| html2text / markdownify | No | No (fetch separately) | Minimal |
| python-readability | No | No (fetch separately) | Minimal |
| Headless browser + parser | Yes | Yes | High |
| Extraction API | Yes | Yes | None |
Use html2text or markdownify for pipelines where you supply the raw HTML yourself, the pages are server-rendered, and the structure is predictable; a sketch of that pipeline follows below. Use an extraction API when working with arbitrary external URLs, JavaScript-heavy sites, or bulk conversions where writing per-site cleaning logic is impractical. On modern sites the failure case for client-side libraries is total: frameworks like React or Vue ship a near-empty HTML shell on the initial request and render the content in the browser, so html2text produces a document with no content.
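A minimal client-side pipeline under those assumptions (server-rendered page, predictable structure) fetches the HTML, isolates the main content with python-readability, and converts the result with markdownify. The URL here is a placeholder:

```python
# pip install requests readability-lxml markdownify
import requests
from readability import Document
from markdownify import markdownify

# Placeholder URL -- substitute a real server-rendered page.
resp = requests.get("https://example.com/some-article", timeout=30)
resp.raise_for_status()

doc = Document(resp.text)          # readability scores DOM nodes to find the body
main_html = doc.summary()          # cleaned HTML fragment of the main content
markdown = markdownify(main_html)  # convert that fragment to markdown

print(doc.title())
print(markdown)
# On a React/Vue shell, resp.text contains almost no content, so the
# markdown comes back near-empty -- the failure case described above.
```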
Firecrawl's Scrape API fetches the URL, renders any JavaScript, strips boilerplate, and returns clean markdown ready for embedding, indexing, or direct prompt injection. The onlyMainContent flag (on by default) removes navigation and sidebars automatically, so the output is the article or page body with no post-processing step required.
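As a sketch, the call can be made directly against the REST endpoint. The request and response field names below follow Firecrawl's v1 documentation at the time of writing; verify them against the current docs before relying on them:

```python
# Assumes a Firecrawl API key in the FIRECRAWL_API_KEY environment variable
# and a placeholder URL.
import os
import requests

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={
        "url": "https://example.com/some-article",
        "formats": ["markdown"],
        # onlyMainContent defaults to true; shown explicitly for clarity.
        "onlyMainContent": True,
    },
    timeout=60,
)
resp.raise_for_status()

markdown = resp.json()["data"]["markdown"]  # clean markdown, ready for a prompt
print(markdown)
```

One HTTP call replaces the fetch, render, extract, and convert steps of the client-side pipeline, which is what makes it practical for arbitrary external URLs and bulk jobs.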