What is focused crawling?
Focused crawling limits a crawler to a specific topic, domain set, or content category rather than following all discoverable links. Where a general crawler builds broad coverage, a focused crawler prioritizes relevance: it either pre-filters URLs by pattern before fetching, or evaluates page content after fetching and discards pages that fall outside the target subject. The result is a smaller dataset with higher signal-to-noise ratio, at lower bandwidth and compute cost than crawling everything.
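The pre-fetch side of this trade-off can be as simple as URL pattern matching. The sketch below shows a minimal pre-fetch filter; the specific patterns and URLs are illustrative assumptions, not from any particular crawler.

```python
import re

# Hypothetical scope for a documentation-focused crawl: keep URLs under
# /docs/ or /guides/, skip paths that are clearly not content.
INCLUDE = re.compile(r"https?://[^/]+/(docs|guides)/")
EXCLUDE = re.compile(r"/(login|cart|search)\b")

def should_fetch(url: str) -> bool:
    """Pre-fetch filter: decide from the URL alone, before spending a request."""
    return bool(INCLUDE.match(url)) and not EXCLUDE.search(url)

urls = [
    "https://example.com/docs/api/auth",
    "https://example.com/blog/announcement",
    "https://example.com/docs/search?q=x",
]
print([u for u in urls if should_fetch(u)])
```

This is the "fast but shallow" option: it costs nothing per rejected URL, but a relevant page living outside the expected path patterns is silently missed.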
| Factor | General crawling | Focused crawling |
|---|---|---|
| Scope | All reachable URLs | Topic or domain subset |
| Dataset size | Large, mixed relevance | Smaller, high relevance |
| Filtering approach | Minimal | URL patterns or content classifiers |
| Crawl strategy | Typically breadth-first | Often depth-first, to reach deep topic clusters |
| Primary use case | Search indexes, web archives | LLM training data, domain research |
Focused crawling makes the most sense when you need high-quality, on-topic content: training a domain-specific model on technical documentation, collecting product data from a curated list of retailers, or building a research corpus from a specific publication type. The core design decision is when to filter: pre-fetching by URL pattern is fast but shallow, since relevant pages may live at unpredictable paths; post-fetching by content classifier is more accurate but costs a request per discarded page. Most production focused crawlers combine both: tight crawl scope rules as a first pass, followed by content-level filtering on what gets kept. A depth-first strategy works better than breadth-first when relevant content clusters deep in a site's structure.
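The two-pass design described above can be sketched as a small depth-first crawler. This is a toy illustration, not a production implementation: the in-memory `SITE` dict stands in for real fetching, and a keyword check stands in for a real content classifier.

```python
# Toy site: each path maps to page text and outgoing links.
SITE = {
    "/": {"text": "welcome", "links": ["/docs/a", "/about"]},
    "/docs/a": {"text": "crawler tuning guide", "links": ["/docs/a/deep"]},
    "/docs/a/deep": {"text": "focused crawling internals", "links": []},
    "/about": {"text": "company history", "links": []},
}

def in_scope(url: str) -> bool:
    """Pass 1: cheap pre-fetch URL filter (crawl scope rules)."""
    return url.startswith("/docs") or url == "/"

def relevant(text: str) -> bool:
    """Pass 2: post-fetch content filter; a stand-in for a real classifier."""
    return "crawl" in text

def focused_crawl(start: str) -> list[str]:
    kept, seen, stack = [], {start}, [start]  # a stack gives depth-first order
    while stack:
        url = stack.pop()
        page = SITE[url]                      # "fetch" the page
        if relevant(page["text"]):
            kept.append(url)                  # keep only on-topic pages
        for link in page["links"]:
            if in_scope(link) and link not in seen:  # discard off-scope URLs early
                seen.add(link)
                stack.append(link)
    return kept

print(focused_crawl("/"))
```

Off-scope URLs like `/about` are never fetched, and fetched-but-irrelevant pages like `/` are fetched once and then dropped, mirroring the cost profile described above.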
Firecrawl's Crawl API supports focused crawling through path filters and domain constraints, returning clean Markdown per page so content filtering can run directly on extracted text without needing to parse HTML.
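A focused crawl configured this way might look like the payload below. The parameter names (`includePaths`, `excludePaths`, `scrapeOptions`) reflect Firecrawl's v1 Crawl API as of this writing; check the current API reference before relying on them.

```python
# Request body for a focused crawl: scope the crawl to documentation paths
# and return each page as Markdown. Values here are illustrative.
payload = {
    "url": "https://example.com",
    "includePaths": ["^/docs/.*"],        # crawl scope: docs pages only
    "excludePaths": ["^/docs/archive/.*"],  # skip stale content
    "limit": 500,                          # cap total pages fetched
    "scrapeOptions": {"formats": ["markdown"]},  # clean Markdown per page
}
print(sorted(payload))
```

The returned Markdown can then be fed straight into a content classifier for the second filtering pass, with no HTML parsing step in between.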