What is link extraction in web crawling?
Link extraction is the step in the crawl loop where a crawler parses a downloaded page and collects all URLs to visit next. The crawler finds anchor tags (`<a href>`), resolves relative URLs against the page's base URL, applies scope filters, deduplicates against already-visited URLs, and adds the remainder to the URL frontier. Every URL a crawler ever visits was first discovered through link extraction, making it the mechanism that determines the shape and completeness of any crawl.
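The loop above can be sketched with Python's standard library alone; this is a minimal illustration, not a production crawler, and the `extract_links` and `LinkExtractor` names are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(html, base_url, visited):
    """Parse anchors, resolve relative URLs, and dedupe against visited."""
    parser = LinkExtractor()
    parser.feed(html)
    frontier = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve against the page's base URL
        if absolute not in visited:
            visited.add(absolute)
            frontier.append(absolute)
    return frontier

page = '<a href="/about">About</a> <a href="https://example.com/">Home</a>'
print(extract_links(page, "https://example.com/", {"https://example.com/"}))
# -> ['https://example.com/about']
```

Note that `urljoin` handles both absolute and relative hrefs uniformly, which is why the frontier step needs the page's base URL as input.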
| Decision | Options | Tradeoff |
|---|---|---|
| Link sources | Anchor tags only vs all attributes | Broader sources add noise |
| URL resolution | Absolute and relative | Relative links require base URL context |
| Canonical tag handling | Follow `<link rel="canonical">` vs ignore | Reduces duplication, may skip variants |
| External links | Follow vs restrict to domain | Broader coverage vs crawl scope |
| Fragment stripping | Remove #section from URLs | Prevents near-duplicate queue entries |
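Two of these decisions, fragment stripping and domain scoping, are easy to show concretely. A minimal sketch using Python's `urllib.parse` (the `normalize` function name is an assumption for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url, allowed_host=None):
    """Strip the #fragment and optionally enforce a same-host scope filter."""
    parts = urlsplit(url)
    if allowed_host and parts.netloc != allowed_host:
        return None  # external link: out of crawl scope
    # Rebuild the URL with an empty fragment so /docs#intro and /docs
    # collapse to one frontier entry.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

print(normalize("https://example.com/docs?page=2#intro", "example.com"))
# -> https://example.com/docs?page=2
print(normalize("https://other.org/", "example.com"))
# -> None
```

Fragments never reach the server, so two URLs differing only after `#` fetch the same page; stripping them before dedup prevents near-duplicate queue entries.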
HTML anchor parsing covers the majority of sites, but JavaScript-rendered pages require a headless browser to execute scripts before link extraction can run. A crawler parsing raw HTML on a React or Next.js app may find zero links in the initial response because all navigation is injected by JavaScript after load. For sites with XML sitemaps, link extraction can be bypassed entirely for the initial seed: sitemaps provide a complete URL list without requiring page-by-page traversal. Sitemaps are faster but only cover what the site owner has explicitly listed; link extraction discovers pages that sitemaps omit.
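Sitemap-based seeding skips HTML parsing entirely: a sitemap is XML whose `<loc>` elements list page URLs under the standard `sitemaps.org` namespace. A minimal sketch (parsing a local string rather than fetching over the network; the `parse_sitemap` name is an assumption):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return the list of <loc> URLs declared in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
print(parse_sitemap(sitemap))
# -> ['https://example.com/', 'https://example.com/pricing']
```

In practice a crawler would use both: seed the frontier from the sitemap, then run link extraction on each fetched page to discover URLs the sitemap omits.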
Firecrawl's Crawl API handles link extraction across both static and JavaScript-rendered pages, following links through a full browser render so dynamically injected navigation is captured alongside standard HTML anchors.