What is crawl scope?
Crawl scope defines which URLs a crawler is permitted to fetch through a set of boundary rules: allowed domains, path include patterns, path exclude patterns, and maximum depth. Without explicit scope rules, a crawler following external links drifts outside the target site and never returns, or crawls admin pages, session URLs, and pagination variants that inflate crawl budget without adding useful content.
| Scope rule | What it controls | Example |
|---|---|---|
| Allowed domains | Restricts crawl to specific hostnames | docs.example.com only |
| Include paths | Crawl only URLs matching a pattern | /blog/* |
| Exclude paths | Skip URLs matching a pattern | /admin/*, /search* |
| Max depth | Limits how far from the root URL the crawler ventures | maxDiscoveryDepth: 3 |
| File type filters | Skip non-target content types | Exclude .pdf, .zip |
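The rules in the table above compose into a single in-scope check. A minimal sketch in Python using only the standard library (`in_scope` and its parameter names are illustrative, not from any particular crawler):

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

def in_scope(url: str, *, allowed_domains, include_paths=None,
             exclude_paths=None, depth=0, max_depth=3) -> bool:
    """Return True if the URL passes every boundary rule:
    allowed domain, exclude patterns, include patterns, and max depth."""
    parts = urlsplit(url)
    path = parts.path or "/"
    # Domain boundary: reject anything outside the allowed hostnames.
    if parts.hostname not in allowed_domains:
        return False
    # Depth boundary: reject links discovered too far from the root.
    if depth > max_depth:
        return False
    # Exclude patterns win over include patterns.
    if exclude_paths and any(fnmatch(path, p) for p in exclude_paths):
        return False
    # If include patterns are set, the URL must match at least one.
    if include_paths and not any(fnmatch(path, p) for p in include_paths):
        return False
    return True
```

A crawler would call this on every discovered link before queueing it, so out-of-scope URLs are never fetched in the first place.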
Set scope before starting a crawl, not after. Crawling the full site and filtering results post-hoc wastes requests and puts unnecessary load on the target server. Include patterns are the most precise control: crawling a documentation site with /docs/* as the only include path ensures the crawler never touches marketing pages, blog posts, or changelogs that would dilute a technical dataset. Exclude patterns handle the inverse, blocking known noise such as pagination variants (?page=*), session tokens in URLs, and auto-generated search results. Scope rules and robots.txt address complementary concerns: robots.txt reflects what the site owner allows; scope rules reflect what you actually need.
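One subtlety with noise like ?page=* is that it lives in the query string, not the path, so a path-only match will miss it. A small sketch that matches patterns against path plus query (the `NOISE_PATTERNS` list here is a hypothetical example, not a recommended default):

```python
from fnmatch import fnmatch
from urllib.parse import urlsplit

# Hypothetical exclude list covering pagination, search pages, and session tokens.
NOISE_PATTERNS = ["*?page=*", "*/search*", "*sessionid=*"]

def is_noise(url: str) -> bool:
    """Match exclude patterns against the path AND query string,
    so query-based noise like ?page=2 is caught."""
    parts = urlsplit(url)
    target = parts.path + ("?" + parts.query if parts.query else "")
    return any(fnmatch(target, p) for p in NOISE_PATTERNS)
```

Running exclusions against the full path-plus-query is what lets a single pattern block every pagination variant of a page instead of enumerating them.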
Firecrawl's Crawl API accepts path filters and domain constraints directly as crawl parameters, so you can define scope on any job without building custom URL filtering or post-processing logic.
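As a sketch of what that looks like in practice, the scope rules above map onto a crawl parameter object. The parameter names here (includePaths, excludePaths, maxDiscoveryDepth, limit) follow Firecrawl's crawl options but should be verified against the current API reference:

```python
# Scope defined as crawl parameters rather than post-processing logic.
# Names assume Firecrawl's crawl options; check the current docs.
crawl_params = {
    "includePaths": ["/docs/*"],              # crawl only documentation pages
    "excludePaths": ["/admin/*", "/search*"], # skip admin and search noise
    "maxDiscoveryDepth": 3,                   # stay close to the root URL
    "limit": 500,                             # hard cap on pages fetched
}

# With the Python SDK this would be passed roughly as:
# app = FirecrawlApp(api_key="fc-...")
# app.crawl_url("https://docs.example.com", params=crawl_params)
```

Because exclude patterns are evaluated alongside includes, /docs/search* style pages would still be skipped even though they match the include path.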