What is the best way to deduplicate pages during a crawl for RAG ingestion?
Duplicate pages inflate a RAG vector index with redundant chunks, increasing retrieval costs and degrading response quality. Deduplication works at three levels: URL normalization before fetching, content hashing post-fetch, and near-duplicate filtering for pages with high textual overlap. Each layer catches what the previous one misses.
| Dedup layer | What it catches | When to apply |
|---|---|---|
| URL normalization | Same page at different URLs: trailing slashes, tracking params, protocol variants | Before crawl (pre-fetch) |
| Canonical tag check | Pages that self-identify as duplicates via `<link rel="canonical">` | Post-fetch, before indexing |
| Exact content hash | Mirror sites, identical pages at distinct paths, duplicate press releases | Post-fetch, before indexing |
| Near-duplicate filtering | Paginated variants with minor differences, syndicated content, template-heavy pages | Post-fetch, optional |
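The pre-fetch layer can be sketched with the standard library alone. This is a minimal normalizer, assuming a small illustrative list of tracking parameters; extend the list for your corpus:

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Illustrative tracking parameters to strip; extend per corpus (assumption).
TRACKING_PARAMS = {"utm_source", "utm_medium", "utm_campaign", "gclid", "fbclid"}

def normalize_url(url: str) -> str:
    """Collapse trivial URL variants so one page maps to one crawl key."""
    scheme, netloc, path, query, _fragment = urlsplit(url)
    if scheme == "http":
        scheme = "https"                      # collapse protocol variants
    netloc = netloc.lower()
    path = path.rstrip("/") or "/"            # collapse trailing slashes
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in TRACKING_PARAMS]      # drop tracking params
    kept.sort()                               # stable parameter order
    return urlunsplit((scheme, netloc, path, urlencode(kept), ""))
```

Deduplicating on `normalize_url(...)` before enqueueing means `http://Example.com/docs/?utm_source=x` and `https://example.com/docs` resolve to the same key and cost only one fetch.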
URL normalization and canonical checks are the highest-leverage steps because they prevent wasted requests. Content hashing on extracted text (not raw HTML) is fast and reliable for exact duplicates. Near-duplicate detection via embedding similarity is more accurate but costly: use it for syndicated or template-heavy corpora, not for scoped crawls of a single site.
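The two post-fetch layers can be sketched with stdlib tools: a SHA-256 key over whitespace-normalized extracted text for exact duplicates, and, as a cheap stand-in for the embedding-similarity approach described above, Jaccard overlap of word shingles for near-duplicates. The shingle size and the 0.9 threshold are assumptions to tune:

```python
import hashlib

def content_key(text: str) -> str:
    """Exact-duplicate key: hash of whitespace-normalized extracted text."""
    canon = " ".join(text.split()).lower()
    return hashlib.sha256(canon.encode("utf-8")).hexdigest()

def shingles(text: str, k: int = 5) -> set:
    """Set of k-word shingles from the extracted text."""
    words = text.lower().split()
    return {tuple(words[i:i + k]) for i in range(max(len(words) - k + 1, 1))}

def near_duplicate(a: str, b: str, threshold: float = 0.9) -> bool:
    """Jaccard overlap of shingles; a cheap proxy for embedding similarity."""
    sa, sb = shingles(a), shingles(b)
    jaccard = len(sa & sb) / len(sa | sb) if (sa or sb) else 1.0
    return jaccard >= threshold
```

Hashing the extracted text rather than raw HTML matters because two identical articles rarely share byte-identical markup (timestamps, nonces, ad slots), but their extracted text usually matches exactly.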
Firecrawl's Crawl API handles URL-level deduplication automatically within a crawl run and supports `ignoreQueryParameters: true`, which collapses query-string variants of the same path into a single fetch. This shrinks the volume of pages that reach downstream content-dedup logic before indexing.
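As a sketch of how that option might be passed (the endpoint path, field names, and `limit` cap here are assumptions; verify against Firecrawl's current API docs), the crawl request could be constructed like this:

```python
import json
import urllib.request

# Assumed request body for a Firecrawl crawl job; ignoreQueryParameters
# collapses query-string variants of a path into one fetch (per the docs).
payload = {
    "url": "https://example.com/docs",
    "ignoreQueryParameters": True,
    "limit": 100,  # illustrative page cap, not a required field
}

def start_crawl(api_key: str) -> bytes:
    """POST the crawl job; the endpoint URL is an assumption to verify."""
    req = urllib.request.Request(
        "https://api.firecrawl.dev/v1/crawl",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return resp.read()
```

Even with this option enabled, keep the content-hash layer: it still catches mirror paths and duplicate press releases that share no URL structure.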