What is link extraction in web crawling?

Link extraction is the step in the crawl loop where a crawler parses a downloaded page and collects the URLs to visit next. The crawler finds anchor tags (<a href>), resolves relative URLs against the page's base URL, applies scope filters, deduplicates against already-visited URLs, and adds the remainder to the URL frontier. Apart from seed URLs and sitemap entries, every URL a crawler visits was first discovered through link extraction, which makes it the mechanism that determines the shape and completeness of a crawl.
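
The steps in that paragraph can be sketched with Python's standard library. This is a minimal illustration, not how any particular crawler implements it: the names `AnchorCollector` and `extract_links`, and the single `allowed_host` scope rule, are assumptions made for the example.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag, urlparse


class AnchorCollector(HTMLParser):
    """Collects raw href values from <a> tags, plus the first <base href>."""

    def __init__(self):
        super().__init__()
        self.hrefs = []
        self.base_href = None

    def handle_starttag(self, tag, attrs):
        attr_map = dict(attrs)
        if tag == "a" and attr_map.get("href"):
            self.hrefs.append(attr_map["href"])
        elif tag == "base" and attr_map.get("href") and self.base_href is None:
            self.base_href = attr_map["href"]


def extract_links(html, page_url, visited, allowed_host):
    """Return new, in-scope, absolute URLs found in `html` (hypothetical helper)."""
    parser = AnchorCollector()
    parser.feed(html)
    # A <base href> element, if present, overrides the page URL for resolution.
    base = urljoin(page_url, parser.base_href) if parser.base_href else page_url

    seen = set(visited)
    new_urls = []
    for href in parser.hrefs:
        # Resolve relative links, then drop the #fragment.
        absolute = urldefrag(urljoin(base, href.strip())).url
        parts = urlparse(absolute)
        if parts.scheme not in ("http", "https"):
            continue  # skip mailto:, javascript:, tel:, and similar
        if parts.hostname != allowed_host:
            continue  # scope filter: assumed here to be "same host only"
        if absolute in seen:
            continue  # already visited or already queued from this page
        seen.add(absolute)
        new_urls.append(absolute)
    return new_urls


if __name__ == "__main__":
    page = '<a href="/docs">Docs</a> <a href="pricing#plans">Pricing</a>'
    frontier = extract_links(page, "https://example.com/blog/post",
                             visited=set(), allowed_host="example.com")
    print(frontier)
```

The returned list is what would be appended to the URL frontier; a real crawler would typically add further normalization (for example, lowercasing the host or sorting query parameters) before the deduplication check.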

Decision | Options | Tradeoff
Link sources | Anchor tags only vs all attributes | Broader sources add noise
URL resolution | Absolute and relative | Relative links require base URL context
Canonical tag handling | Follow <link rel=canonical> vs ignore | Reduces duplication, may skip variants
External links | Follow vs restrict to domain | Broader coverage vs crawl scope
Fragment stripping | Remove #section from URLs | Prevents near-duplicate queue entries
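
As one illustration of the canonical-tag decision in the table, the sketch below computes the key a crawler might use for deduplication. It is an assumption-laden example: `CanonicalFinder`, `dedup_key`, and the `follow_canonical` flag are hypothetical names, not part of any specific crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urldefrag


class CanonicalFinder(HTMLParser):
    """Records the href of the first <link rel="canonical"> element."""

    def __init__(self):
        super().__init__()
        self.canonical = None

    def handle_starttag(self, tag, attrs):
        if tag != "link" or self.canonical is not None:
            return
        attr_map = dict(attrs)
        # rel is a space-separated token list, compared case-insensitively.
        rel_tokens = (attr_map.get("rel") or "").lower().split()
        if "canonical" in rel_tokens and attr_map.get("href"):
            self.canonical = attr_map["href"]


def dedup_key(html, page_url, follow_canonical=True):
    """Return the URL under which this page is recorded as visited."""
    key = page_url
    if follow_canonical:
        finder = CanonicalFinder()
        finder.feed(html)
        if finder.canonical:
            # The canonical href may itself be relative.
            key = urljoin(page_url, finder.canonical)
    # Fragments never identify a distinct document, so strip them either way.
    return urldefrag(key).url
```

With `follow_canonical=True`, variant URLs that declare the same canonical collapse into a single entry, which is the "reduces duplication" side of the tradeoff; with `False`, each variant is kept as its own page.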

HTML anchor parsing covers the majority of sites, but JavaScript-rendered pages require a headless browser to execute scripts before link extraction can run. A crawler parsing raw HTML on a client-rendered React or Next.js app may find zero links in the initial response because all navigation is injected by JavaScript after load. For sites with XML sitemaps, link extraction can be bypassed entirely for the initial seed: a sitemap provides a ready-made URL list without requiring page-by-page traversal. Sitemaps are faster but only cover what the site owner has explicitly listed; link extraction discovers pages that sitemaps omit.
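
Seeding from a sitemap can be sketched as follows, assuming a standard `<urlset>` document in the sitemaps.org 0.9 namespace that has already been downloaded as text. The function names `sitemap_urls` and `not_in_sitemap` are hypothetical, and sitemap index files (`<sitemapindex>`) are not handled here.

```python
import xml.etree.ElementTree as ET

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def sitemap_urls(xml_text):
    """Return the <loc> values of every <url> entry in a sitemap document."""
    root = ET.fromstring(xml_text)
    return [
        loc.text.strip()
        for loc in root.findall("sm:url/sm:loc", SITEMAP_NS)
        if loc.text and loc.text.strip()
    ]


def not_in_sitemap(discovered, listed):
    """URLs found by link extraction that the sitemap does not list."""
    return sorted(set(discovered) - set(listed))


if __name__ == "__main__":
    xml_text = (
        '<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">'
        "<url><loc>https://example.com/</loc></url>"
        "<url><loc>https://example.com/about</loc></url>"
        "</urlset>"
    )
    seeds = sitemap_urls(xml_text)
    print(seeds)
    print(not_in_sitemap(["https://example.com/about",
                          "https://example.com/hidden"], seeds))
```

Using the sitemap for seeds and link extraction for everything else combines the speed of the first with the coverage of the second.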

Firecrawl's Crawl API handles link extraction across both static and JavaScript-rendered pages, following links through a full browser render so dynamically injected navigation is captured alongside standard HTML anchors.

Last updated: Mar 11, 2026