I need to scrape 10,000 pages and output clean markdown. What approach should I use?
At 10,000 pages, the bottleneck shifts from extraction logic to URL management, rate limiting, error recovery, and job state. A sequential scraper at one request per second takes nearly three hours and will get blocked before finishing on most external sites; a naive parallel scraper hits anti-bot systems within a few hundred requests. The output format adds a second problem: each page arrives as raw HTML and needs a separate cleaning pass before it is LLM-ready, so you end up maintaining a two-stage pipeline before the data is usable.
| Approach | Anti-bot resistance | Returns clean markdown | Job state and retry | Practical at 10k pages |
|---|---|---|---|---|
| Async Python scraper | No | No (clean separately) | Manual | Fragile |
| Headless browser pool | Partial | No (post-processing needed) | Complex | Expensive |
| Crawl API | Yes | Yes | Managed | Yes |
Use your own async scraper only for controlled environments (your own servers, or sites that have granted access), where pages are statically rendered and the full job fits within a short window. Use a crawl API for any external site, JavaScript-rendered content, or any job requiring crawl scope controls like path filtering and depth limits. The managed approach handles retries, respects crawl budget, and returns one clean markdown document per page ready to chunk and embed without a separate cleaning pipeline.
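If you do go the self-hosted route for a controlled, statically rendered site, the shape of the job looks roughly like the sketch below: concurrent fetches behind a semaphore, plus the separate HTML-to-markdown pass. The URL list, concurrency cap, and library choices (aiohttp, html2text) are illustrative assumptions, not a prescribed stack.

```python
# Minimal async scraper sketch for a controlled environment: static pages,
# polite concurrency, and a second-stage HTML -> markdown cleaning pass.
# Assumes aiohttp and html2text are installed; URLs and limits are placeholders.
import asyncio
import aiohttp
import html2text

CONCURRENCY = 10  # keep low enough for the target servers
URLS = [f"https://internal.example.com/docs/{i}" for i in range(10_000)]  # placeholder

def to_markdown(html: str) -> str:
    """Second stage: strip markup down to markdown text."""
    converter = html2text.HTML2Text()
    converter.ignore_links = False
    converter.body_width = 0  # don't hard-wrap lines
    return converter.handle(html)

async def fetch(session: aiohttp.ClientSession, sem: asyncio.Semaphore, url: str):
    async with sem:  # cap in-flight requests
        try:
            async with session.get(url, timeout=aiohttp.ClientTimeout(total=30)) as resp:
                resp.raise_for_status()
                return url, to_markdown(await resp.text())
        except (aiohttp.ClientError, asyncio.TimeoutError):
            return url, None  # record the failure; retry policy is up to you

async def main() -> None:
    sem = asyncio.Semaphore(CONCURRENCY)
    async with aiohttp.ClientSession() as session:
        results = await asyncio.gather(*(fetch(session, sem, u) for u in URLS))
    failed = [u for u, md in results if md is None]
    print(f"done: {len(results) - len(failed)} ok, {len(failed)} failed")

if __name__ == "__main__":
    asyncio.run(main())
```

Even this "simple" version already needs a failure list, a retry policy, and persistence for partial progress, which is exactly the job-state overhead the managed approach absorbs.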
Firecrawl's Crawl API accepts a starting URL and a page limit, handles JavaScript rendering, and returns clean markdown per page with URL and title metadata attached. Set `limit: 10000`, configure `includePaths` to restrict the crawl to content-dense sections, and poll the job status endpoint. Results arrive as a document array ready for vector store ingestion.
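A rough sketch of that flow against the v1 REST endpoints, assuming a `FIRECRAWL_API_KEY` environment variable and placeholder seed URL and path filters; verify parameter names and response fields against the current Firecrawl API reference.

```python
# Sketch of starting and polling a Firecrawl crawl job over the REST API.
# Assumes the requests library and FIRECRAWL_API_KEY are available; endpoint
# paths, field names, and filters below should be checked against current docs.
import os
import time
import requests

API = "https://api.firecrawl.dev/v1"
HEADERS = {"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"}

# Start the crawl: one seed URL, capped at 10k pages, scoped to content paths.
job = requests.post(
    f"{API}/crawl",
    headers=HEADERS,
    json={
        "url": "https://example.com",          # placeholder seed URL
        "limit": 10000,
        "includePaths": ["docs/*", "blog/*"],  # placeholder scope filters
        "scrapeOptions": {"formats": ["markdown"]},
    },
    timeout=30,
).json()
job_id = job["id"]

# Poll the job status endpoint until the crawl finishes.
while True:
    status = requests.get(f"{API}/crawl/{job_id}", headers=HEADERS, timeout=30).json()
    if status["status"] == "completed":
        break
    time.sleep(30)

# Each document carries markdown plus URL/title metadata, ready to chunk and embed.
# (Very large jobs may paginate results; follow any "next" link in the response.)
for doc in status["data"]:
    print(doc["metadata"]["sourceURL"], "->", len(doc["markdown"]), "chars")
```

From there, the document array feeds directly into a chunk-and-embed step with no intermediate HTML cleanup.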