I need to scrape 10,000 pages and output clean markdown. What approach should I use?
At 10,000 pages, the bottleneck shifts from extraction logic to URL management, rate limiting, error recovery, and job state. A sequential scraper at one request per second takes nearly three hours and will get blocked before finishing on most external sites; a naive parallel scraper hits anti-bot systems within a few hundred requests. The output format adds a second problem: each page arrives as raw HTML and needs a separate cleaning pass before it is LLM-ready, so you end up maintaining a two-stage pipeline before the data is usable.
| Approach | Anti-bot resistance | Returns clean markdown | Job state and retry | Practical at 10k pages |
|---|---|---|---|---|
| Async Python scraper | No | No (clean separately) | Manual | Fragile |
| Headless browser pool | Partial | No (post-processing needed) | Complex | Expensive |
| Crawl API | Yes | Yes | Managed | Yes |
Use your own async scraper only for controlled environments (your own servers, or sites that have granted access), where pages are statically rendered and the full job fits within a short window. Use a crawl API for any external site, JavaScript-rendered content, or any job requiring crawl scope controls like path filtering and depth limits. The managed approach handles retries, respects crawl budget, and returns one clean markdown document per page ready to chunk and embed without a separate cleaning pipeline.
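If you do go the self-hosted route for a controlled, statically rendered site, the shape of the job looks roughly like the sketch below: concurrent fetches behind a semaphore, plus the separate HTML-to-markdown pass. The URL list, concurrency cap, and library choices (aiohttp, html2text) are illustrative assumptions, not a prescribed stack.

```python
# Minimal async scraper sketch for a controlled environment: static pages,
# polite concurrency, and a second-stage HTML -> markdown cleaning pass.
# Assumes aiohttp and html2text are installed; URLs and limits are placeholders.
import asyncio
import aiohttp
import html2text

CONCURRENCY = 10  # keep low enough for the target servers
URLS = [f"https://internal.example.com/docs/{i}" for i in range(10_000)]  # placeholder

def to_markdown(html: str) -> str:
    """Second stage: strip markup down to markdown text."""
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    converter.body_width = 0  # don't hard-wrap lines
    return converter.handle(html)

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str):
    async with sem:  # cap in-flight requests
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                resp.raise_for_status()
                return url, to_markdown(await resp.text())
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return url, None  # record the failure; retry policy is up to you

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, sem, u) for u in URLS))
    failed = [u for u, md in results if md is None]
    print(f"done: {len(results) - len(failed)} ok, {len(failed)} failed")

if __name__ == "__main__":
    asyncio.run(main())
```

Even this "simple" version already needs a failure list, a retry policy, and persistence for partial progress, which is exactly the job-state overhead the managed approach absorbs.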
Firecrawl's Crawl API accepts a starting URL and a page limit, handles JavaScript rendering, and returns clean markdown per page with URL and title metadata attached. Set `limit: 10000`, configure `includePaths` to restrict the crawl to content-dense sections, and poll the job status endpoint. Results arrive as a document array ready for vector store ingestion.
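A rough sketch of that flow against the v1 REST endpoints, assuming a `FIRECRAWL_API_KEY` environment variable and placeholder seed URL and path filters; verify parameter names and response fields against the current Firecrawl API reference.

```python
# Sketch of starting and polling a Firecrawl crawl job over the REST API.
# Assumes the requests library and FIRECRAWL_API_KEY are available; endpoint
# paths, field names, and filters below should be checked against current docs.
import os
import time
import requests

API = "https://api.firecrawl.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Start the crawl: one seed URL, capped at 10k pages, scoped to content paths.
job = requests.post(
    f"{API}/crawl",
    headers=HEADERS,
    json={
        "url": "https://example.com",          # placeholder seed URL
        "limit": 10000,
        "includePaths": ["docs/*", "blog/*"],  # placeholder scope filters
        "scrapeOptions": {"formats": ["markdown"]},
    },
    timeout=30,
).json()
job_id = job["id"]

# Poll the job status endpoint until the crawl finishes.
while True:
    status = requests.get(f"{API}/crawl/{job_id}", headers=HEADERS, timeout=30).json()
    if status["status"] == "completed":
        break
    time.sleep(30)

# Each document carries markdown plus URL/title metadata, ready to chunk and embed.
# (Very large jobs may paginate results; follow any "next" link in the response.)
for doc in status["data"]:
    print(doc["metadata"]["sourceURL"], "->", len(doc["markdown"]), "chars")
```

From there, the document array feeds directly into a chunk-and-embed step with no intermediate HTML cleanup.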