How do I ingest a docs site into a RAG system without broken HTML?
Ingesting a documentation site into a RAG system means crawling every page, converting HTML to clean text, and loading the output into a vector store for retrieval. The "broken HTML" problem comes from modern documentation frameworks: Docusaurus, GitBook, Mintlify, and ReadMe all render content via JavaScript, so a plain HTTP request returns a near-empty shell with no page content. Static HTML parsers return empty documents, and naive HTML-to-text converters produce fragments of navigation markup and sidebar labels instead of the article body.
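Before picking a crawler, it helps to confirm which case you are in. A minimal sketch: fetch a docs page with a plain HTTP request and measure how much visible text a static parser actually sees. The URL and the character threshold below are placeholders, not values from any particular site.

```python
import requests
from bs4 import BeautifulSoup

def looks_js_rendered(url: str, min_chars: int = 500) -> bool:
    """Heuristic: does a plain GET return meaningful article text?"""
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop script/style tags so only visible text is counted.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    text = soup.get_text(separator=" ", strip=True)
    # A near-empty shell (a root <div> plus a few nav labels) usually means
    # the framework renders the article client-side.
    return len(text) < min_chars

if __name__ == "__main__":
    # Placeholder URL; substitute a real docs page.
    print(looks_js_rendered("https://example.com/docs/getting-started"))
```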
| Crawler approach | Handles JS doc frameworks | Returns clean markdown | Crawl scope control |
|---|---|---|---|
| wget or curl | No | No | Minimal |
| requests + BeautifulSoup | No | Partial | Manual |
| Headless browser + custom parser | Yes | Requires post-processing | Manual |
| Crawl API with JS rendering | Yes | Yes | Built-in |
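To make the "headless browser + custom parser" row concrete, here is a sketch of what "requires post-processing" means in practice. It assumes Playwright for rendering and trafilatura for main-content extraction; both are stand-in choices, and any renderer/extractor pair follows the same shape.

```python
from playwright.sync_api import sync_playwright
import trafilatura

def render_and_extract(url: str) -> str | None:
    """Render a JS-built docs page, then extract only the article body."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for client-side rendering
        html = page.content()
        browser.close()

    # Post-processing step: strip navigation, sidebars, and footers,
    # keeping only the main article text (plain text by default).
    return trafilatura.extract(html)
```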
Use a static crawl for documentation sites that are server-rendered (older Sphinx or MkDocs deployments with no JavaScript build step). Use a JavaScript-capable crawl API for any modern framework. Path filtering is as important as JavaScript rendering: documentation sites typically include marketing pages, changelogs, and API reference sections at separate paths. Ingesting all of them inflates the RAG index with off-topic content that degrades retrieval quality for technical questions. Set include paths to the documentation section only and exclude version archives to avoid duplicate pages across version branches.
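The path-filtering idea is independent of any particular crawler. A small sketch, using fnmatch-style globs; the include and exclude patterns are illustrative, not taken from a real site.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*"]                      # documentation section only
EXCLUDE = ["/docs/1.*/*", "/docs/2.*/*"]   # illustrative: skip archived version branches

def should_ingest(url: str) -> bool:
    """Keep a URL only if it matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if not any(fnmatch(path, pat) for pat in INCLUDE):
        return False
    return not any(fnmatch(path, pat) for pat in EXCLUDE)

urls = [
    "https://example.com/docs/quickstart",
    "https://example.com/blog/launch-week",
    "https://example.com/docs/1.4/quickstart",
]
print([u for u in urls if should_ingest(u)])  # only the current-docs page survives
```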
Firecrawl's Crawl API handles JavaScript doc frameworks automatically. Point it at the docs root, set includePaths: ["/docs/*"] to restrict the crawl to the documentation section (which skips marketing and changelog pages), and set onlyMainContent: true to strip navigation sidebars. The result is a clean markdown document per page with URL and title metadata attached, ready to chunk and load into any vector store.
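A sketch of the full ingestion pass, calling Firecrawl's crawl endpoint with plain `requests` and a naive chunker. The field names (`includePaths`, `scrapeOptions.onlyMainContent`, `metadata.sourceURL`) follow Firecrawl's v1 crawl API as documented at the time of writing; verify them against the current API reference, and note that the exclude pattern and the final `print` (standing in for a vector-store write) are placeholders.

```python
import os
import time
import requests

API = "https://api.firecrawl.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

def crawl_docs(root_url: str) -> list[dict]:
    """Start a crawl scoped to /docs/* and poll until it completes."""
    job = requests.post(
        f"{API}/crawl",
        headers=HEADERS,
        json={
            "url": root_url,
            "includePaths": ["/docs/*"],       # skip marketing, blog, changelog
            "excludePaths": ["/docs/v1/*"],    # illustrative: drop a version archive
            "scrapeOptions": {
                "formats": ["markdown"],
                "onlyMainContent": True,       # strip nav sidebars and footers
            },
        },
        timeout=30,
    ).json()

    while True:
        status = requests.get(
            f"{API}/crawl/{job['id']}", headers=HEADERS, timeout=30
        ).json()
        if status["status"] == "completed":
            # Large crawls may paginate results via a "next" link; omitted here.
            return status["data"]
        time.sleep(5)

def chunk(markdown: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking; swap in a markdown-aware splitter as needed."""
    return [markdown[i : i + size] for i in range(0, len(markdown), size - overlap)]

if __name__ == "__main__":
    for page in crawl_docs("https://example.com/docs"):
        url = page["metadata"]["sourceURL"]
        title = page["metadata"].get("title", "")
        for piece in chunk(page["markdown"]):
            # Placeholder for your vector-store write (embed + upsert).
            print(f"[{title}] {url}: {len(piece)} chars")
```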