How do I turn a list of URLs into clean documents for embeddings?
Turning a list of URLs into embedding-ready documents requires two steps: extracting clean text from each page and chunking the result into segments that fit within the embedding model's token window. The bottleneck in practice is the extraction step. Fetching URLs one at a time takes minutes per hundred pages, and raw HTML output requires a cleaning pass before embedding. A document that arrives as raw HTML with navigation, ads, and script tags embedded in the content produces embeddings that represent noise as much as signal, which degrades retrieval quality.
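The chunking step itself is mechanical. A minimal sketch, using a whitespace word count as a rough stand-in for the embedding model's tokenizer; the window and overlap sizes are illustrative, not prescriptive:

```python
# Split cleaned text into overlapping segments sized by a rough
# word-count proxy for the embedding model's token window.
def chunk(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```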
| Approach | Speed on 1,000 URLs | Output quality | Maintenance |
|---|---|---|---|
| Sequential HTTP + html2text | Slow (minutes to hours) | Inconsistent; fails on JavaScript-rendered pages | Per-site cleaning logic |
| Parallel async scraper | Faster, but gets blocked by target sites | Inconsistent without post-cleaning | Error handling and retry overhead |
| Batch scraping API | Fast (server-side parallel) | Clean markdown per URL | None |
Use sequential extraction for small lists (under 50 URLs) where you control the source pages and they are statically rendered. Use a batch API for larger lists or external URLs: submitting the full list in one call moves the parallelism server-side, avoids the bot detection and rate limiting that come with running your own async scraper, and returns consistently formatted output per URL that feeds directly into the embedding step without a separate cleaning pass.
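For the small-list case, a minimal sequential sketch; it assumes the `requests` and `html2text` packages are installed, and the URLs are placeholders:

```python
# Sequential fetch-and-clean for a handful of statically rendered pages.
import requests
import html2text

urls = [
    "https://example.com/docs/page-1",  # illustrative URLs
    "https://example.com/docs/page-2",
]

converter = html2text.HTML2Text()
converter.ignore_links = True    # drop link markup that adds noise to embeddings
converter.ignore_images = True

documents = []
for url in urls:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    text = converter.handle(resp.text)  # HTML -> markdown-ish plain text
    documents.append({"url": url, "text": text})
```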
Firecrawl's Batch Scrape endpoint accepts an array of URLs and returns clean markdown per page. Submit the list once, poll for completion, and pipe the results into your embedding model or vector store. Each document comes with the source URL as metadata, so retrieval includes provenance without a separate tracking layer.
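A sketch of that submit-then-poll flow over HTTP; the endpoint paths, request fields, and response keys below are assumptions modeled on the v1 REST API, so check them against the current Firecrawl docs:

```python
# Submit a list of URLs to Batch Scrape, poll until done, collect markdown.
# Endpoint paths and response keys are assumptions; verify against the docs.
import os
import time
import requests

API = "https://api.firecrawl.dev/v1"
headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

urls = ["https://example.com/docs/page-1", "https://example.com/docs/page-2"]

# Submit the whole list in one call; parallelism happens server-side.
job = requests.post(
    f"{API}/batch/scrape",
    headers=headers,
    json={"urls": urls, "formats": ["markdown"]},
).json()

# Poll until the job completes.
while True:
    status = requests.get(f"{API}/batch/scrape/{job['id']}", headers=headers).json()
    if status.get("status") == "completed":
        break
    time.sleep(5)

# Each result carries clean markdown plus the source URL as metadata,
# ready for the chunking step sketched earlier.
documents = [
    {"url": doc["metadata"]["sourceURL"], "text": doc["markdown"]}
    for doc in status["data"]
]
```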