How do I convert web pages into markdown for AI?
Converting a web page to markdown for AI means stripping navigation, ads, scripts, and layout markup from the HTML while preserving the content in a format a language model can reason over directly. The main decision is where the conversion happens: in a client-side library you run locally, or in an extraction API that handles fetching, rendering, and cleaning in one call. LLM-ready content requires stripping more than tags: cookie banners, share buttons, and related-article lists inflate the context without contributing information, and most HTML-to-text libraries do not remove them.
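To see why, here is a small sketch using the html2text package on an assumed toy fragment that mixes boilerplate with an article. The library converts the markup faithfully, which means the boilerplate survives into the markdown:

```python
# pip install html2text
# Toy HTML is an assumption for illustration; real pages bury the
# article far deeper in nav, banners, and widgets.
import html2text

html = """
<nav><a href="/">Home</a> <a href="/pricing">Pricing</a></nav>
<div class="cookie-banner">We use cookies. <button>Accept</button></div>
<article><h1>Actual Article</h1><p>The content you wanted.</p></article>
"""

print(html2text.html2text(html))
# The output keeps the nav links and cookie text alongside the article --
# exactly the context inflation described above. Removing it is a
# separate cleaning step that html2text does not perform.
```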
| Approach | Works on JS-rendered pages | Handles arbitrary external URLs | Setup required |
|---|---|---|---|
| html2text / markdownify | No | No (fetch separately) | Minimal |
| python-readability | No | No (fetch separately) | Minimal |
| Headless browser + parser | Yes | Yes | High |
| Extraction API | Yes | Yes | None |
Use html2text or markdownify for pipelines where you supply the raw HTML yourself, the pages are server-rendered, and the structure is predictable; a sketch of that pipeline follows below. Use an extraction API when working with arbitrary external URLs, JavaScript-heavy sites, or bulk conversions where writing per-site cleaning logic is impractical. On modern sites the failure case for client-side libraries is total: frameworks like React or Vue ship a near-empty HTML shell on the initial request and render the content in the browser, so html2text produces a document with no content.
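A minimal client-side pipeline under those assumptions (server-rendered page, predictable structure) fetches the HTML, isolates the main content with python-readability, and converts the result with markdownify. The URL here is a placeholder:

```python
# pip install requests readability-lxml markdownify
import requests
from readability import Document
from markdownify import markdownify

# Placeholder URL -- substitute a real server-rendered page.
resp = requests.get("https://example.com/some-article", timeout=30)
resp.raise_for_status()

doc = Document(resp.text)          # readability scores DOM nodes to find the body
main_html = doc.summary()          # cleaned HTML fragment of the main content
markdown = markdownify(main_html)  # convert that fragment to markdown

print(doc.title())
print(markdown)
# On a React/Vue shell, resp.text contains almost no content, so the
# markdown comes back near-empty -- the failure case described above.
```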
Firecrawl's Scrape API fetches the URL, renders any JavaScript, strips boilerplate, and returns clean markdown ready for embedding, indexing, or direct prompt injection. The onlyMainContent flag (on by default) removes navigation and sidebars automatically, so the output is the article or page body with no post-processing step required.
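As a sketch, the call can be made directly against the REST endpoint. The request and response field names below follow Firecrawl's v1 documentation at the time of writing; verify them against the current docs before relying on them:

```python
# Assumes a Firecrawl API key in the FIRECRAWL_API_KEY environment variable
# and a placeholder URL.
import os
import requests

resp = requests.post(
    "https://api.firecrawl.dev/v1/scrape",
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={
        "url": "https://example.com/some-article",
        "formats": ["markdown"],
        # onlyMainContent defaults to true; shown explicitly for clarity.
        "onlyMainContent": True,
    },
    timeout=60,
)
resp.raise_for_status()

markdown = resp.json()["data"]["markdown"]  # clean markdown, ready for a prompt
print(markdown)
```

One HTTP call replaces the fetch, render, extract, and convert steps of the client-side pipeline, which is what makes it practical for arbitrary external URLs and bulk jobs.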