How do I ingest a docs site into a RAG system without broken HTML?
Ingesting a documentation site into a RAG system means crawling every page, converting HTML to clean text, and loading the output into a vector store for retrieval. The "broken HTML" problem comes from modern documentation frameworks: Docusaurus, GitBook, Mintlify, and ReadMe all render content via JavaScript, so a plain HTTP request returns a near-empty shell with no page content. Static HTML parsers return empty documents, and naive HTML-to-text converters produce fragments of navigation markup and sidebar labels instead of the article body.
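Before picking a crawler, it helps to confirm which case you are in. A minimal sketch: fetch a docs page with a plain HTTP request and measure how much visible text a static parser actually sees. The URL and the character threshold below are placeholders, not values from any particular site.

```python
import requests
from bs4 import BeautifulSoup

def looks_js_rendered(url: str, min_chars: int = 500) -> bool:
    """Heuristic: does a plain GET return meaningful article text?"""
    html = requests.get(url, timeout=15).text
    soup = BeautifulSoup(html, "html.parser")

    # Drop script/style tags so only visible text is counted.
    for tag in soup(["script", "style", "noscript"]):
        tag.decompose()

    text = soup.get_text(separator=" ", strip=True)
    # A near-empty shell (a root <div> plus a few nav labels) usually means
    # the framework renders the article client-side.
    return len(text) < min_chars

if __name__ == "__main__":
    # Placeholder URL; substitute a real docs page.
    print(looks_js_rendered("https://example.com/docs/getting-started"))
```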
| Crawler approach | Handles JS doc frameworks | Returns clean markdown | Crawl scope control |
|---|---|---|---|
| wget or curl | No | No | Minimal |
| requests + BeautifulSoup | No | Partial | Manual |
| Headless browser + custom parser | Yes | Requires post-processing | Manual |
| Crawl API with JS rendering | Yes | Yes | Built-in |
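To make the "headless browser + custom parser" row concrete, here is a sketch of what "requires post-processing" means in practice. It assumes Playwright for rendering and trafilatura for main-content extraction; both are stand-in choices, and any renderer/extractor pair follows the same shape.

```python
from playwright.sync_api import sync_playwright
import trafilatura

def render_and_extract(url: str) -> str | None:
    """Render a JS-built docs page, then extract only the article body."""
    with sync_playwright() as p:
        browser = p.chromium.launch()
        page = browser.new_page()
        page.goto(url, wait_until="networkidle")  # wait for client-side rendering
        html = page.content()
        browser.close()

    # Post-processing step: strip navigation, sidebars, and footers,
    # keeping only the main article text (plain text by default).
    return trafilatura.extract(html)
```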
Use a static crawl for documentation sites that are server-rendered (older Sphinx or MkDocs deployments with no JavaScript build step). Use a JavaScript-capable crawl API for any modern framework. Path filtering is as important as JavaScript rendering: documentation sites typically include marketing pages, changelogs, and API reference sections at separate paths. Ingesting all of them inflates the RAG index with off-topic content that degrades retrieval quality for technical questions. Set include paths to the documentation section only and exclude version archives to avoid duplicate pages across version branches.
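The path-filtering idea is independent of any particular crawler. A small sketch, using fnmatch-style globs; the include and exclude patterns are illustrative, not taken from a real site.

```python
from fnmatch import fnmatch
from urllib.parse import urlparse

INCLUDE = ["/docs/*"]                      # documentation section only
EXCLUDE = ["/docs/1.*/*", "/docs/2.*/*"]   # illustrative: skip archived version branches

def should_ingest(url: str) -> bool:
    """Keep a URL only if it matches an include pattern and no exclude pattern."""
    path = urlparse(url).path
    if not any(fnmatch(path, pat) for pat in INCLUDE):
        return False
    return not any(fnmatch(path, pat) for pat in EXCLUDE)

urls = [
    "https://example.com/docs/quickstart",
    "https://example.com/blog/launch-week",
    "https://example.com/docs/1.4/quickstart",
]
print([u for u in urls if should_ingest(u)])  # only the current-docs page survives
```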
Firecrawl's Crawl API handles JavaScript doc frameworks automatically. Point it at the docs root, set includePaths: ["/docs/*"] to restrict the crawl to the documentation section (which skips marketing and changelog pages), and set onlyMainContent: true to strip navigation sidebars. The result is a clean markdown document per page with URL and title metadata attached, ready to chunk and load into any vector store.
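A sketch of the full ingestion pass, calling Firecrawl's crawl endpoint with plain `requests` and a naive chunker. The field names (`includePaths`, `scrapeOptions.onlyMainContent`, `metadata.sourceURL`) follow Firecrawl's v1 crawl API as documented at the time of writing; verify them against the current API reference, and note that the exclude pattern and the final `print` (standing in for a vector-store write) are placeholders.

```python
import os
import time
import requests

API = "https://api.firecrawl.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

def crawl_docs(root_url: str) -> list[dict]:
    """Start a crawl scoped to /docs/* and poll until it completes."""
    job = requests.post(
        f"{API}/crawl",
        headers=HEADERS,
        json={
            "url": root_url,
            "includePaths": ["/docs/*"],       # skip marketing, blog, changelog
            "excludePaths": ["/docs/v1/*"],    # illustrative: drop a version archive
            "scrapeOptions": {
                "formats": ["markdown"],
                "onlyMainContent": True,       # strip nav sidebars and footers
            },
        },
        timeout=30,
    ).json()

    while True:
        status = requests.get(
            f"{API}/crawl/{job['id']}", headers=HEADERS, timeout=30
        ).json()
        if status["status"] == "completed":
            # Large crawls may paginate results via a "next" link; omitted here.
            return status["data"]
        time.sleep(5)

def chunk(markdown: str, size: int = 1500, overlap: int = 200) -> list[str]:
    """Naive fixed-size chunking; swap in a markdown-aware splitter as needed."""
    return [markdown[i : i + size] for i in range(0, len(markdown), size - overlap)]

if __name__ == "__main__":
    for page in crawl_docs("https://example.com/docs"):
        url = page["metadata"]["sourceURL"]
        title = page["metadata"].get("title", "")
        for piece in chunk(page["markdown"]):
            # Placeholder for your vector-store write (embed + upsert).
            print(f"[{title}] {url}: {len(piece)} chars")
```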