What is crawl scope?
Crawl scope defines which URLs a crawler is permitted to fetch through a set of boundary rules: allowed domains, path include patterns, path exclude patterns, and maximum depth. Without explicit scope rules, a crawler following external links drifts outside the target site and never returns, or crawls admin pages, session URLs, and pagination variants that inflate crawl budget without adding useful content.
| Scope rule | What it controls | Example |
|---|---|---|
| Allowed domains | Restricts crawl to specific hostnames | docs.example.com only |
| Include paths | Crawl only URLs matching a pattern | /blog/* |
| Exclude paths | Skip URLs matching a pattern | /admin/*, /search* |
| Max depth | Limits how far from the root URL the crawler ventures | maxDiscoveryDepth: 3 |
| File type filters | Skip non-target content types | Exclude .pdf, .zip |
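The rules in the table above compose into a single in-scope check. A minimal sketch in Python using only the standard library (`in_scope` and its parameter names are illustrative, not from any particular crawler):

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

def in_scope(url: str, *, allowed_domains, include_paths=None,
             exclude_paths=None, depth=0, max_depth=3) -> bool:
    """Return True if the URL passes every boundary rule:
    allowed domain, exclude patterns, include patterns, and max depth."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Domain boundary: reject anything outside the allowed hostnames.
    if parts.hostname not in allowed_domains:
        return False
    # Depth boundary: reject links discovered too far from the root.
    if depth > max_depth:
        return False
    # Exclude patterns win over include patterns.
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False
    # If include patterns are set, the URL must match at least one.
    if include_paths and not any(fnmatch(path, p) for p in include_paths):
        return False
    return True
```

A crawler would call this on every discovered link before queueing it, so out-of-scope URLs are never fetched in the first place.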
Set scope before starting a crawl, not after. Crawling the full site and filtering results post-hoc wastes requests and puts unnecessary load on the target server. Include patterns are the most precise control: crawling a documentation site with /docs/* as the only include path ensures the crawler never touches marketing pages, blog posts, or changelogs that would dilute a technical dataset. Exclude patterns handle the inverse, blocking known noise such as pagination variants (?page=*), session tokens in URLs, and auto-generated search results. Scope rules and robots.txt address complementary concerns: robots.txt reflects what the site owner allows; scope rules reflect what you actually need.
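One subtlety with noise like ?page=* is that it lives in the query string, not the path, so a path-only match will miss it. A small sketch that matches patterns against path plus query (the `NOISE_PATTERNS` list here is a hypothetical example, not a recommended default):

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical exclude list covering pagination, search pages, and session tokens.
NOISE_PATTERNS = ["*?page=*", "*/search*", "*sessionid=*"]

def is_noise(url: str) -> bool:
    """Match exclude patterns against the path AND query string,
    so query-based noise like ?page=2 is caught."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(fnmatch(target, p) for p in NOISE_PATTERNS)
```

Running exclusions against the full path-plus-query is what lets a single pattern block every pagination variant of a page instead of enumerating them.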
Firecrawl's Crawl API accepts path filters and domain constraints directly as crawl parameters, so you can define scope on any job without building custom URL filtering or post-processing logic.
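As a sketch of what that looks like in practice, the scope rules above map onto a crawl parameter object. The parameter names here (includePaths, excludePaths, maxDiscoveryDepth, limit) follow Firecrawl's crawl options but should be verified against the current API reference:

```python
# Scope defined as crawl parameters rather than post-processing logic.
# Names assume Firecrawl's crawl options; check the current docs.
crawl_params = {
    "includePaths": ["/docs/*"],              # crawl only documentation pages
    "excludePaths": ["/admin/*", "/search*"], # skip admin and search noise
    "maxDiscoveryDepth": 3,                   # stay close to the root URL
    "limit": 500,                             # hard cap on pages fetched
}

# With the Python SDK this would be passed roughly as:
# app = FirecrawlApp(api_key="fc-...")
# app.crawl_url("https://docs.example.com", params=crawl_params)
```

Because exclude patterns are evaluated alongside includes, /docs/search* style pages would still be skipped even though they match the include path.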