
How do I clean HTML and remove boilerplate for LLM training?

Cleaning HTML for LLM training means removing everything that is not content: navigation menus, cookie banners, ad blocks, footers, sidebars, share buttons, and related article links. What remains should be the main text the page author wrote, with enough structure (headings, code blocks, lists) preserved for the model to learn document organization. The challenge is that boilerplate varies by site: what is a <nav> on one site is a <div class="menu-container"> on another, and rule-based cleaners tuned for one domain break on the next.
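A minimal rule-based sketch of this idea, using only the standard library: drop the subtrees of tags that are usually boilerplate and keep the rest of the text. The tag list here is an assumption for illustration; as noted above, real sites hide menus inside generic `<div>`s, which is exactly where this approach breaks down.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers; real sites often use generic <div>s
# instead, which is why fixed tag lists fail across domains.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}

class MainTextExtractor(HTMLParser):
    """Collects text that sits outside known boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

def clean_html(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

On a server-rendered page this keeps the article text and drops the chrome; it assumes reasonably well-formed markup and makes no attempt to preserve headings or code-block structure.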

| Approach | Works across domains | Handles JS-rendered pages | Preserves structure | Maintenance |
| --- | --- | --- | --- | --- |
| html2text | Partial | No | Minimal | Low |
| python-readability | Partial | No | Good for articles | Low |
| Custom per-site rules | Yes, with effort | No | Full control | High |
| Extraction API with main content flag | Yes | Yes | Headings, lists, code blocks | None |

Use html2text or python-readability for single-domain corpora with consistent article structure (blog posts, news archives) where pages are server-rendered. Use an extraction API for multi-domain training corpora where per-site rules are impractical, or when sources use JavaScript frameworks that return empty shells to a plain HTTP client. The failure case for library-based cleaners on modern sites is total: the parser receives <div id="root"></div> and returns a blank document that looks valid but contains no training signal.
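Because that failure mode produces output that parses cleanly, it is worth detecting before the blank documents enter a corpus. A simple heuristic sketch: strip all markup and flag pages whose remaining visible text is below a threshold. The 200-character cutoff is an arbitrary assumption to tune per corpus, not a standard value.

```python
import re

def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic check for JS-rendered shells: markup present, text absent.

    min_chars is an assumed threshold; tune it for your corpus.
    """
    # Remove script/style bodies first, then every remaining tag.
    no_scripts = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    visible = "".join(text.split())
    return len(visible) < min_chars
```

Running this over a sample of fetched pages before cleaning catches sources that return `<div id="root"></div>` to a plain HTTP client, so they can be routed to a renderer instead of silently yielding empty training documents.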

Firecrawl's Scrape API returns LLM-ready content by default: navigation, ads, and boilerplate are stripped server-side, and the main content is returned as clean markdown with headings, paragraphs, tables, and code blocks intact. For large training corpora, the Crawl API collects entire sites at once with path filters to restrict scope to content-dense sections and exclude auto-generated or archive pages that inflate volume without improving data quality.
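As a sketch of what a scrape call looks like, the snippet below assembles an HTTP request for the Scrape API using only the standard library. The endpoint path and the `formats` / `onlyMainContent` field names are assumptions based on Firecrawl's v1 API; verify them against the current API reference before use.

```python
import json
import os

def build_scrape_request(url: str):
    """Assemble (endpoint, headers, body) for a Firecrawl scrape call.

    Endpoint path and field names are assumptions from the v1 API docs;
    the API key is read from the FIRECRAWL_API_KEY environment variable.
    """
    endpoint = "https://api.firecrawl.dev/v1/scrape"
    headers = {
        "Authorization": f"Bearer {os.environ.get('FIRECRAWL_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "url": url,
        "formats": ["markdown"],   # markdown is the LLM-ready output
        "onlyMainContent": True,   # strip nav/ads/boilerplate server-side
    }).encode()
    return endpoint, headers, body
```

Sending the request with any HTTP client (for example `urllib.request`) should return the cleaned page as markdown in the response body, ready to write to a training corpus without further boilerplate removal.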

Last updated: May 12, 2026