
How do I convert web pages into markdown for AI?

Converting a web page to markdown for AI means removing HTML navigation, ads, scripts, and layout markup while preserving content in a format a language model can reason from directly. The main decision is where the conversion happens: in a client-side library you run locally, or in an extraction API that handles fetching, rendering, and cleaning in one call. LLM-ready content requires stripping more than just tags: cookie banners, share buttons, and related article lists inflate context without contributing information, and most HTML-to-text libraries do not remove them.
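To make the idea concrete, here is a minimal sketch of the client-side approach using only the standard library's `html.parser`: it drops `<script>`, `<nav>`, `<footer>` and similar boilerplate tags and emits headings and body text as markdown. Real libraries like markdownify handle far more (links, lists, nested formatting); this only illustrates what "stripping more than just tags" means.

```python
from html.parser import HTMLParser

# Tags whose content adds no information for an LLM
SKIP_TAGS = {"script", "style", "nav", "aside", "header", "footer"}

class ContentExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.skip_depth = 0   # nesting depth inside skipped tags
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in SKIP_TAGS:
            self.skip_depth += 1
        elif tag in ("h1", "h2", "h3") and self.skip_depth == 0:
            self.parts.append("#" * int(tag[1]) + " ")

    def handle_endtag(self, tag):
        if tag in SKIP_TAGS and self.skip_depth > 0:
            self.skip_depth -= 1

    def handle_data(self, data):
        if self.skip_depth == 0 and data.strip():
            self.parts.append(data.strip() + "\n")

def html_to_markdown(html: str) -> str:
    parser = ContentExtractor()
    parser.feed(html)
    return "".join(parser.parts)

html = """<html><head><script>track()</script></head>
<body><nav>Home | About | Pricing</nav>
<h1>Article Title</h1><p>The actual body text.</p>
<footer>Share on socials</footer></body></html>"""
print(html_to_markdown(html))
# → "# Article Title" and the body text; nav, footer, and script are gone
```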

| Approach | Works on JS-rendered pages | Handles arbitrary external URLs | Setup required |
| --- | --- | --- | --- |
| html2text / markdownify | No | No (fetch separately) | Minimal |
| python-readability | No | No (fetch separately) | Minimal |
| Headless browser + parser | Yes | Yes | High |
| Extraction API | Yes | Yes | None |

Use html2text or markdownify for pipelines where you supply the raw HTML yourself, pages are server-rendered, and the structure is predictable. Use an extraction API when working with arbitrary external URLs, JavaScript-heavy sites, or bulk conversions where writing per-site cleaning logic is impractical. The failure case for client-side libraries on modern sites is total: frameworks like React or Vue return a near-empty HTML shell on the initial request, so html2text produces a document with no content.
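One way to catch that failure before it poisons a pipeline is a cheap shell check: if the visible text outside `<script>`/`<style>` is tiny, the page is almost certainly a client-rendered app and needs a rendering step. A sketch (the 200-character threshold is an arbitrary assumption to tune for your corpus):

```python
import re

def looks_like_js_shell(html: str, min_text_chars: int = 200) -> bool:
    """Heuristic: React/Vue apps often return a near-empty shell like
    <div id="root"></div>, so a parser-only pipeline yields nothing."""
    # Drop script/style blocks, then every remaining tag
    stripped = re.sub(r"(?s)<(script|style)\b.*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", stripped)
    return len(" ".join(text.split())) < min_text_chars

spa = '<html><body><div id="root"></div><script src="/app.js"></script></body></html>'
article = "<html><body><article>" + "Real paragraph text. " * 20 + "</article></body></html>"
print(looks_like_js_shell(spa))      # True: route to a rendering service
print(looks_like_js_shell(article))  # False: safe for html2text/markdownify
```

Pages flagged `True` can be routed to a headless browser or an extraction API instead of the fast local path.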

Firecrawl's Scrape API fetches the URL, renders any JavaScript, strips boilerplate, and returns clean markdown ready for embedding, indexing, or dropping straight into a prompt. The onlyMainContent flag (on by default) removes navigation and sidebars automatically, so the output is the article or page body with no post-processing step required.
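A request to the Scrape API can be sketched with the standard library alone. The endpoint, `formats` field, and `onlyMainContent` flag follow Firecrawl's v1 API as documented at the time of writing; check the current docs before relying on them, and note that `FIRECRAWL_API_KEY` is a placeholder.

```python
import json
import urllib.request

def build_scrape_request(url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST to Firecrawl's v1 scrape endpoint."""
    payload = {
        "url": url,
        "formats": ["markdown"],
        "onlyMainContent": True,  # the default; shown here for clarity
    }
    return urllib.request.Request(
        "https://api.firecrawl.dev/v1/scrape",
        data=json.dumps(payload).encode(),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )

req = build_scrape_request("https://example.com/article", "FIRECRAWL_API_KEY")
# To send it and read the markdown field from the JSON response:
#   body = json.load(urllib.request.urlopen(req))
#   markdown = body["data"]["markdown"]
```

The same call works equally well through an HTTP client like requests or through Firecrawl's official SDKs.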

Last updated: May 12, 2026