How do I turn a list of URLs into clean documents for embeddings?
Turning a list of URLs into embedding-ready documents requires two steps: extracting clean text from each page and chunking the result into segments that fit within the embedding model's token window. The bottleneck in practice is the extraction step. Fetching URLs one at a time takes minutes per hundred pages, and raw HTML output requires a cleaning pass before embedding. A document that arrives as raw HTML with navigation, ads, and script tags embedded in the content produces embeddings that represent noise as much as signal, which degrades retrieval quality.
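The chunking step itself is mechanical. A minimal sketch, using a whitespace word count as a rough stand-in for the embedding model's tokenizer; the window and overlap sizes are illustrative, not prescriptive:

```python
# Split cleaned text into overlapping segments sized by a rough
# word-count proxy for the embedding model's token window.
def chunk(text: str, max_words: int = 500, overlap: int = 50) -> list[str]:
    words = text.split()
    step = max_words - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + max_words]))
    return chunks
```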
| Approach | Speed on 1,000 URLs | Output quality | Maintenance |
|---|---|---|---|
| Sequential HTTP + html2text | Slow (minutes to hours) | Inconsistent; fails on JavaScript-rendered pages | Per-site cleaning logic |
| Parallel async scraper | Faster, but gets blocked by target sites | Inconsistent without post-cleaning | Error handling and retry overhead |
| Batch scraping API | Fast (server-side parallel) | Clean markdown per URL | None |
Use sequential extraction for small lists (under 50 URLs) where you control the source pages and they are statically rendered. Use a batch API for larger lists or external URLs: submitting the full list in one call moves the parallelism server-side, avoids the bot detection and rate limiting that come with running your own async scraper, and returns consistently formatted output per URL that feeds directly into the embedding step without a separate cleaning pass.
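For the small-list case, a minimal sequential sketch; it assumes the `requests` and `html2text` packages are installed, and the URLs are placeholders:

```python
# Sequential fetch-and-clean for a handful of statically rendered pages.
import requests
import html2text

urls = [
    "https://example.com/docs/page-1",  # illustrative URLs
    "https://example.com/docs/page-2",
]

converter = html2text.HTML2Text()
converter.ignore_links = True    # drop link markup that adds noise to embeddings
converter.ignore_images = True

documents = []
for url in urls:
    resp = requests.get(url, timeout=10)
    resp.raise_for_status()
    text = converter.handle(resp.text)  # HTML -> markdown-ish plain text
    documents.append({"url": url, "text": text})
```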
Firecrawl's Batch Scrape endpoint accepts an array of URLs and returns clean markdown per page. Submit the list once, poll for completion, and pipe the results into your embedding model or vector store. Each document comes with the source URL as metadata, so retrieval includes provenance without a separate tracking layer.
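A sketch of that submit-then-poll flow over HTTP; the endpoint paths, request fields, and response keys below are assumptions modeled on the v1 REST API, so check them against the current Firecrawl docs:

```python
# Submit a list of URLs to Batch Scrape, poll until done, collect markdown.
# Endpoint paths and response keys are assumptions; verify against the docs.
import os
import time
import requests

API = "https://api.firecrawl.dev/v1"
headers = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

urls = ["https://example.com/docs/page-1", "https://example.com/docs/page-2"]

# Submit the whole list in one call; parallelism happens server-side.
job = requests.post(
    f"{API}/batch/scrape",
    headers=headers,
    json={"urls": urls, "formats": ["markdown"]},
).json()

# Poll until the job completes.
while True:
    status = requests.get(f"{API}/batch/scrape/{job['id']}", headers=headers).json()
    if status.get("status") == "completed":
        break
    time.sleep(5)

# Each result carries clean markdown plus the source URL as metadata,
# ready for the chunking step sketched earlier.
documents = [
    {"url": doc["metadata"]["sourceURL"], "text": doc["markdown"]}
    for doc in status["data"]
]
```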