What is URL normalization in web crawling?
URL normalization converts different representations of the same URL into a single canonical form before adding it to the URL frontier. Without it, http://example.com/page, https://example.com/page/, https://www.example.com/page, and https://example.com/page?utm_source=email all resolve to identical content but appear as four distinct URLs, causing the crawler to fetch and process the same page four times.
| Normalization step | Before | After |
|---|---|---|
| Enforce HTTPS | http://example.com | https://example.com |
| Remove trailing slash | example.com/page/ | example.com/page |
| Strip tracking params | example.com?utm_source=x | example.com |
| Lowercase path | example.com/Page | example.com/page |
| Remove fragment | example.com/page#section | example.com/page |
Apply normalization before inserting URLs into the frontier, not after fetching; once a duplicate has been fetched, the request is already wasted. The main tradeoff is over-normalization: stripping query parameters that differentiate real pages (?page=2, ?category=shoes) merges distinct pages into one and loses content. Normalization rules need per-site tuning because URL conventions vary widely across CMS platforms and frameworks. Normalization works alongside robots.txt and crawl scope rules to keep crawl budget focused on unique, in-scope content.
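The steps in the table above can be sketched in a few lines with Python's standard library. The tracking-parameter denylist here is hypothetical; as noted, a real crawler tunes it per site, and some sites instead need an allowlist of parameters that genuinely distinguish pages.

```python
from urllib.parse import urlsplit, urlunsplit, parse_qsl, urlencode

# Hypothetical denylist; tune per site to avoid over-normalization.
TRACKING_PARAMS = {"ref", "fbclid", "gclid"}

def normalize(url: str) -> str:
    """Canonicalize a URL before it enters the frontier."""
    scheme, netloc, path, query, _fragment = urlsplit(url)  # fragment is dropped
    netloc = netloc.lower()
    path = path.lower().rstrip("/")  # lowercase path, remove trailing slash
    kept = [(k, v) for k, v in parse_qsl(query, keep_blank_values=True)
            if k not in TRACKING_PARAMS and not k.startswith("utm_")]
    return urlunsplit(("https", netloc, path, urlencode(kept), ""))
```

For example, `normalize("http://example.com/Page/?utm_source=email#section")` returns `https://example.com/page`, while `?page=2` survives because it is not on the denylist.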
Firecrawl's Crawl API deduplicates exact URL matches automatically within a crawl run. For query parameter variants of the same path (e.g. ?ref=email vs ?ref=social), pass ignoreQueryParameters: true to treat them as the same URL and avoid re-scraping the same page with different parameters.