
How do I turn a list of URLs into clean documents for embeddings?

Turning a list of URLs into embedding-ready documents requires two steps: extracting clean text from each page and chunking the result into segments that fit within the embedding model's token window. The bottleneck in practice is the extraction step. Fetching URLs one at a time takes minutes per hundred pages, and raw HTML output requires a cleaning pass before embedding. A document that arrives as raw HTML with navigation, ads, and script tags embedded in the content produces embeddings that represent noise as much as signal, which degrades retrieval quality.
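As a sketch of the chunking half, the function below splits extracted text into overlapping segments under an approximate word-count budget. The window and overlap sizes are illustrative assumptions; a production pipeline would count tokens with the embedding model's own tokenizer rather than splitting on whitespace.

```python
def chunk_text(text: str, max_words: int = 200, overlap: int = 40) -> list[str]:
    """Split text into overlapping word-count chunks.

    Word count is a rough stand-in for the embedding model's token
    window; swap in the model's tokenizer for exact budgeting.
    """
    words = text.split()
    chunks = []
    step = max_words - overlap
    for start in range(0, len(words), step):
        chunk = " ".join(words[start:start + max_words])
        if chunk:
            chunks.append(chunk)
    return chunks
```

The overlap keeps sentences that straddle a boundary represented in both neighboring chunks, which costs some storage but avoids retrieval misses at chunk edges.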

| Approach | Speed on 1,000 URLs | Output quality | Maintenance |
| --- | --- | --- | --- |
| Sequential HTTP + html2text | Slow (minutes to hours) | Inconsistent; fails on JS pages | Per-site cleaning logic |
| Parallel async scraper | Faster, but blocks from targets | Inconsistent without post-cleaning | Error handling and retry overhead |
| Batch scraping API | Fast (server-side parallel) | Clean markdown per URL | None |

Use sequential extraction for small lists (under 50 URLs) where you control the source pages and they are statically rendered. Use a batch API for larger lists or external URLs: submitting the full list in one call moves the parallelism server-side, avoids the bot detection and rate limiting that come with running your own async scraper, and returns consistently formatted output for each URL that can go straight into the embedding step without a separate cleaning pass.
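For the small, static-page case, a sequential pass can be as simple as the sketch below. It uses the requests and html2text libraries; the timeout and converter options are illustrative assumptions, and the output still inherits whatever boilerplate html2text leaves behind.

```python
import requests
import html2text

def fetch_as_markdown(urls: list[str]) -> dict[str, str]:
    """Sequentially fetch each URL and convert its HTML to markdown-like text.

    Suitable for small lists of statically rendered pages you control;
    JS-rendered pages will come back empty or incomplete.
    """
    converter = html2text.HTML2Text()
    converter.ignore_links = True   # drop nav/link noise before embedding
    converter.ignore_images = True
    converter.body_width = 0        # disable hard line wrapping

    documents = {}
    for url in urls:
        resp = requests.get(url, timeout=10)
        resp.raise_for_status()
        documents[url] = converter.handle(resp.text)
    return documents
```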

Firecrawl's Batch Scrape endpoint accepts an array of URLs and returns clean markdown per page. Submit the list once, poll for completion, and pipe the results into your embedding model or vector store. Each document comes with the source URL as metadata, so retrieval includes provenance without a separate tracking layer.
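A minimal sketch of that submit-poll-collect flow against the REST API follows. The endpoint path and the response fields used here (`id`, `status`, `data[].markdown`, `metadata.sourceURL`) are assumptions based on Firecrawl's v1 docs; verify them against the current API reference before relying on this.

```python
import os
import time
import requests

API_KEY = os.environ["FIRECRAWL_API_KEY"]
HEADERS = {"Authorization": f"Bearer {API_KEY}"}
BASE = "https://api.firecrawl.dev/v1/batch/scrape"

def batch_scrape(urls: list[str]) -> list[dict]:
    """Submit one batch job, poll until done, return markdown docs with provenance."""
    # Submit the whole list in a single call; parallelism happens server-side.
    job = requests.post(
        BASE, headers=HEADERS, json={"urls": urls, "formats": ["markdown"]}
    ).json()

    # Poll the job until it completes. Status values and field names are
    # assumptions from the docs; check the reference for your version.
    while True:
        status = requests.get(f"{BASE}/{job['id']}", headers=HEADERS).json()
        if status.get("status") == "completed":
            break
        time.sleep(2)

    # Each result carries its source URL as metadata, giving provenance
    # for free when these documents land in a vector store.
    return [
        {"markdown": page["markdown"], "source": page["metadata"]["sourceURL"]}
        for page in status.get("data", [])
    ]
```

From here, each markdown document can feed a chunker like the one sketched earlier and then the embedding model, with the source URL stored alongside each vector so retrieval results can cite their origin.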

Last updated: May 12, 2026