What is link extraction in web crawling?
Link extraction is the step in the crawl loop where a crawler parses a downloaded page and collects all URLs to visit next. The crawler finds anchor tags (`<a href>`), resolves relative URLs against the page's base URL, applies scope filters, deduplicates against already-visited URLs, and adds the remainder to the URL frontier. Every URL a crawler ever visits was first discovered through link extraction, making it the mechanism that determines the shape and completeness of any crawl.
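The loop above can be sketched with Python's standard library alone; this is a minimal illustration, not a production crawler, and the `extract_links` and `LinkExtractor` names are hypothetical:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    """Collects href values from <a> tags in an HTML document."""
    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)

def extract_links(html, base_url, visited):
    """Parse anchors, resolve relative URLs, and dedupe against visited."""
    parser = LinkExtractor()
    parser.feed(html)
    frontier = []
    for href in parser.hrefs:
        absolute = urljoin(base_url, href)  # resolve against the page's base URL
        if absolute not in visited:
            visited.add(absolute)
            frontier.append(absolute)
    return frontier

page = '<a href="/about">About</a> <a href="https://example.com/">Home</a>'
print(extract_links(page, "https://example.com/", {"https://example.com/"}))
# -> ['https://example.com/about']
```

Note that `urljoin` handles both absolute and relative hrefs uniformly, which is why the frontier step needs the page's base URL as input.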
| Decision | Options | Tradeoff |
|---|---|---|
| Link sources | Anchor tags only vs all attributes | Broader sources add noise |
| URL resolution | Absolute and relative | Relative links require base URL context |
| Canonical tag handling | Follow `<link rel="canonical">` vs ignore | Reduces duplication, may skip variants |
| External links | Follow vs restrict to domain | Broader coverage vs crawl scope |
| Fragment stripping | Remove #section from URLs | Prevents near-duplicate queue entries |
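Two of these decisions, fragment stripping and domain scoping, are easy to show concretely. A minimal sketch using Python's `urllib.parse` (the `normalize` function name is an assumption for illustration):

```python
from urllib.parse import urlsplit, urlunsplit

def normalize(url, allowed_host=None):
    """Strip the #fragment and optionally enforce a same-host scope filter."""
    parts = urlsplit(url)
    if allowed_host and parts.netloc != allowed_host:
        return None  # external link: out of crawl scope
    # Rebuild the URL with an empty fragment so /docs#intro and /docs
    # collapse to one frontier entry.
    return urlunsplit((parts.scheme, parts.netloc, parts.path, parts.query, ""))

print(normalize("https://example.com/docs?page=2#intro", "example.com"))
# -> https://example.com/docs?page=2
print(normalize("https://other.org/", "example.com"))
# -> None
```

Fragments never reach the server, so two URLs differing only after `#` fetch the same page; stripping them before dedup prevents near-duplicate queue entries.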
HTML anchor parsing covers the majority of sites, but JavaScript-rendered pages require a headless browser to execute scripts before link extraction can run. A crawler parsing raw HTML on a React or Next.js app may find zero links in the initial response because all navigation is injected by JavaScript after load. For sites with XML sitemaps, link extraction can be bypassed entirely for the initial seed: sitemaps provide a complete URL list without requiring page-by-page traversal. Sitemaps are faster but only cover what the site owner has explicitly listed; link extraction discovers pages that sitemaps omit.
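Sitemap-based seeding skips HTML parsing entirely: a sitemap is XML whose `<loc>` elements list page URLs under the standard `sitemaps.org` namespace. A minimal sketch (parsing a local string rather than fetching over the network; the `parse_sitemap` name is an assumption):

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org 0.9 protocol
SITEMAP_NS = "{http://www.sitemaps.org/schemas/sitemap/0.9}"

def parse_sitemap(xml_text):
    """Return the list of <loc> URLs declared in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [loc.text.strip() for loc in root.iter(SITEMAP_NS + "loc")]

sitemap = """<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/</loc></url>
  <url><loc>https://example.com/pricing</loc></url>
</urlset>"""
print(parse_sitemap(sitemap))
# -> ['https://example.com/', 'https://example.com/pricing']
```

In practice a crawler would use both: seed the frontier from the sitemap, then run link extraction on each fetched page to discover URLs the sitemap omits.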
Firecrawl's Crawl API handles link extraction across both static and JavaScript-rendered pages, following links through a full browser render so dynamically injected navigation is captured alongside standard HTML anchors.