What is incremental crawling?
Incremental crawling fetches only pages that are new or have changed since the previous run, skipping everything else. Instead of re-downloading an entire site on every execution, an incremental crawler stores a fingerprint from the previous crawl (a checksum, HTTP ETag, or Last-Modified header) and compares it against the current response before processing the page. If the fingerprint matches, the page is skipped; if it differs, the crawler processes the updated content.
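The fingerprint check described above can be sketched in a few lines. This is a minimal illustration using a SHA-256 content checksum and an in-memory dict as the fingerprint store; the function name, URLs, and store shape are illustrative, not part of any library API, and an HTTP ETag or Last-Modified value could replace the checksum.

```python
import hashlib

def fingerprint(body: bytes) -> str:
    # Content checksum; an HTTP ETag or Last-Modified value works the same way.
    return hashlib.sha256(body).hexdigest()

def should_process(url: str, body: bytes, store: dict[str, str]) -> bool:
    """Return True if the page is new or changed since the last run."""
    current = fingerprint(body)
    if store.get(url) == current:
        return False          # fingerprint matches: skip the page
    store[url] = current      # new or changed: record fingerprint, process page
    return True

store = {}
should_process("https://example.com/a", b"v1", store)  # first sight: process
should_process("https://example.com/a", b"v1", store)  # unchanged: skip
should_process("https://example.com/a", b"v2", store)  # changed: process
```

In production the store would live in a database or key-value cache so fingerprints survive between runs.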
| Factor | Full crawl | Incremental crawl |
|---|---|---|
| Pages fetched per run | All pages | New and changed only |
| Server load on target | High | Low |
| Time per run | Grows with site size | Proportional to change rate |
| State required | None | Fingerprints from last run |
| Best for | Initial index, major site changes | Recurring pipelines |
Incremental crawling is the right default for any recurring data pipeline where most content stays static between runs. A nightly crawl of a 100,000-page documentation site might find only a few hundred changed pages: re-fetching the rest wastes bandwidth, delays results, and adds unnecessary load to the target server. Full crawls remain necessary for seeding a fresh index, after site-wide changes like domain migrations or URL restructures, or when your stored fingerprints are too old to be reliable. The main operational cost is maintaining the fingerprint store and handling pages that change their URLs: a moved page shows up as a deletion plus a new page rather than an update.
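The URL-change caveat is easy to see by diffing the fingerprint maps from two consecutive runs. This is a hypothetical sketch (the URLs and checksums are made up): identical content under a new URL surfaces as one removal and one addition, never as a change.

```python
def diff_runs(prev: dict[str, str], curr: dict[str, str]):
    """Compare url -> fingerprint maps from two crawl runs."""
    added = set(curr) - set(prev)
    removed = set(prev) - set(curr)
    changed = {u for u in set(prev) & set(curr) if prev[u] != curr[u]}
    return added, removed, changed

prev = {"/docs/old-name": "abc123"}
curr = {"/docs/new-name": "abc123"}  # same content, moved to a new URL
added, removed, changed = diff_runs(prev, curr)
# The move appears as an addition plus a removal; `changed` stays empty.
```

Detecting moves would require a second index keyed by content checksum rather than URL, at the cost of extra state.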
Teams building recurring pipelines with Firecrawl's Crawl API implement incremental logic by comparing crawled content against checksums stored in their database before passing pages downstream, combining Firecrawl's extraction output with their own state layer.
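A minimal sketch of that state layer, assuming crawl output shaped like Firecrawl's (a list of page dicts with a `metadata.sourceURL` field and extracted `markdown`; the field names, function name, and the in-memory dict standing in for a database are all assumptions of this example):

```python
import hashlib

def filter_changed(pages: list[dict], store: dict[str, str]) -> list[dict]:
    """Keep only pages whose content checksum differs from the stored one."""
    changed = []
    for page in pages:
        url = page["metadata"]["sourceURL"]
        checksum = hashlib.sha256(page["markdown"].encode()).hexdigest()
        if store.get(url) != checksum:
            store[url] = checksum   # update state, then pass the page downstream
            changed.append(page)
    return changed

# Pages returned by a crawl would be fed through the filter before
# downstream processing; only new or changed pages survive.
store = {}
pages = [{"metadata": {"sourceURL": "https://example.com/a"}, "markdown": "hello"}]
filter_changed(pages, store)   # first run: page passes through
filter_changed(pages, store)   # second run, unchanged: filtered out
```

Swapping the dict for a database table keyed by URL gives the persistent fingerprint store the pipeline needs between runs.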