
How do I clean HTML and remove boilerplate for LLM training?

Cleaning HTML for LLM training means removing everything that is not content: navigation menus, cookie banners, ad blocks, footers, sidebars, share buttons, and related article links. What remains should be the main text the page author wrote, with enough structure (headings, code blocks, lists) preserved for the model to learn document organization. The challenge is that boilerplate varies by site: what is a <nav> on one site is a <div class="menu-container"> on another, and rule-based cleaners tuned for one domain break on the next.
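A minimal rule-based sketch of this idea, using only the standard library: drop the subtrees of tags that are usually boilerplate and keep the rest of the text. The tag list here is an assumption for illustration; as noted above, real sites hide menus inside generic `<div>`s, which is exactly where this approach breaks down.

```python
from html.parser import HTMLParser

# Assumed boilerplate containers; real sites often use generic <div>s
# instead, which is why fixed tag lists fail across domains.
BOILERPLATE_TAGS = {"nav", "header", "footer", "aside", "script", "style", "form"}

class MainTextExtractor(HTMLParser):
    """Collects text that sits outside known boilerplate containers."""

    def __init__(self):
        super().__init__()
        self.skip_depth = 0  # > 0 while inside a boilerplate subtree
        self.chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in BOILERPLATE_TAGS:
            self.skip_depth += 1

    def handle_endtag(self, tag):
        if tag in BOILERPLATE_TAGS and self.skip_depth:
            self.skip_depth -= 1

    def handle_data(self, data):
        if not self.skip_depth:
            text = data.strip()
            if text:
                self.chunks.append(text)

def clean_html(html: str) -> str:
    parser = MainTextExtractor()
    parser.feed(html)
    return "\n".join(parser.chunks)
```

On a server-rendered page this keeps the article text and drops the chrome; it assumes reasonably well-formed markup and makes no attempt to preserve headings or code-block structure.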

| Approach | Works across domains | Handles JS-rendered pages | Preserves structure | Maintenance |
| --- | --- | --- | --- | --- |
| html2text | Partial | No | Minimal | Low |
| python-readability | Partial | No | Good for articles | Low |
| Custom per-site rules | Yes, with effort | No | Full control | High |
| Extraction API with main content flag | Yes | Yes | Headings, lists, code blocks | None |

Use html2text or python-readability for single-domain corpora with consistent article structure (blog posts, news archives) where pages are server-rendered. Use an extraction API for multi-domain training corpora where per-site rules are impractical, or when sources use JavaScript frameworks that return empty shells to a plain HTTP client. The failure case for library-based cleaners on modern sites is total: the parser receives <div id="root"></div> and returns a blank document that looks valid but contains no training signal.
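Because that failure mode produces output that parses cleanly, it is worth detecting before the blank documents enter a corpus. A simple heuristic sketch: strip all markup and flag pages whose remaining visible text is below a threshold. The 200-character cutoff is an arbitrary assumption to tune per corpus, not a standard value.

```python
import re

def looks_like_empty_shell(html: str, min_chars: int = 200) -> bool:
    """Heuristic check for JS-rendered shells: markup present, text absent.

    min_chars is an assumed threshold; tune it for your corpus.
    """
    # Remove script/style bodies first, then every remaining tag.
    no_scripts = re.sub(r"(?is)<(script|style).*?</\1>", " ", html)
    text = re.sub(r"(?s)<[^>]+>", " ", no_scripts)
    visible = "".join(text.split())
    return len(visible) < min_chars
```

Running this over a sample of fetched pages before cleaning catches sources that return `<div id="root"></div>` to a plain HTTP client, so they can be routed to a renderer instead of silently yielding empty training documents.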

Firecrawl's Scrape API returns LLM-ready content by default: navigation, ads, and boilerplate are stripped server-side, and the main content is returned as clean markdown with headings, paragraphs, tables, and code blocks intact. For large training corpora, the Crawl API collects entire sites at once with path filters to restrict scope to content-dense sections and exclude auto-generated or archive pages that inflate volume without improving data quality.
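As a sketch of what a scrape call looks like, the snippet below assembles an HTTP request for the Scrape API using only the standard library. The endpoint path and the `formats` / `onlyMainContent` field names are assumptions based on Firecrawl's v1 API; verify them against the current API reference before use.

```python
import json
import os

def build_scrape_request(url: str):
    """Assemble (endpoint, headers, body) for a Firecrawl scrape call.

    Endpoint path and field names are assumptions from the v1 API docs;
    the API key is read from the FIRECRAWL_API_KEY environment variable.
    """
    endpoint = "https://api.firecrawl.dev/v1/scrape"
    headers = {
        "Authorization": f"Bearer {os.environ.get('FIRECRAWL_API_KEY', '')}",
        "Content-Type": "application/json",
    }
    body = json.dumps({
        "url": url,
        "formats": ["markdown"],   # markdown is the LLM-ready output
        "onlyMainContent": True,   # strip nav/ads/boilerplate server-side
    }).encode()
    return endpoint, headers, body
```

Sending the request with any HTTP client (for example `urllib.request`) should return the cleaned page as markdown in the response body, ready to write to a training corpus without further boilerplate removal.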

Last updated: May 12, 2026