
AI Model Training Data

Add web data to your training pipelines.
Firecrawl turns sites, docs, and PDFs into clean datasets for pre-training, fine-tuning, and RL.

Used by over 500,000 developers
Trusted by 5000+ companies of all sizes
10x faster dataset collection
100k+ URLs crawled per project
24/7 scheduled refresh pipelines

Perfect for

Model training teams

Collect domain-specific datasets for pre-training and fine-tuning without custom crawlers.

RAG and evaluation pipelines

Build fresh eval sets and benchmarks from real docs and sites, with each record's source URL preserved.

RLHF and instruction data

Extract structured sections so you can generate prompts, pairs, and preference data in code.

Compliance-minded orgs

Scope allowed sources by domain and path so you can audit what goes into your training data.

Use Cases

AI training pipeline stages: Web Data Collection, Data Cleaning & Processing, Pre-training, Fine-tuning, RLHF & Post-training.

How it works

Crawl approved sources

Crawl target domains and docs portals into structured, domain-specific text datasets so your models train on the same pages your users read, not a generic crawl of the public web.

Extract structure to JSON

Extract headings, sections, and metadata into JSON so you can generate instruction pairs, Q&A datasets, and RLHF prompts in code instead of hand-labeling examples.
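As a rough illustration of generating instruction pairs in code, the sketch below turns one extracted page into prompt/response records. The input shape (`url`, `title`, `sections` with `heading` and `text`) and the function name `sections_to_pairs` are assumptions made for this example, not a documented Firecrawl output schema, so adapt the keys to whatever your extraction step actually returns.

```python
import json


def sections_to_pairs(page):
    """Turn one extracted page into instruction-style records.

    `page` is assumed (hypothetically) to look like:
    {"url": str, "title": str, "sections": [{"heading": str, "text": str}, ...]}
    """
    pairs = []
    for section in page.get("sections", []):
        heading = section.get("heading", "").strip()
        text = section.get("text", "").strip()
        if not heading or not text:
            continue  # skip sections that cannot form a complete pair
        pairs.append({
            "instruction": f"Explain the section '{heading}' from {page['title']}.",
            "response": text,
            "source_url": page["url"],  # keep provenance with every record
        })
    return pairs


# Made-up example page; the URL and content are placeholders.
example_page = {
    "url": "https://docs.example.com/auth",
    "title": "Authentication Guide",
    "sections": [
        {"heading": "API keys", "text": "Pass the key in the Authorization header."},
        {"heading": "Empty section", "text": ""},
    ],
}

pairs = sections_to_pairs(example_page)
print(json.dumps(pairs, indent=2))
```

The prompt template here is deliberately simple. In practice you would likely vary templates per section type, but keeping `source_url` on every record is what lets you trace a training example back to its page.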

Filter and scope the surface

Filter pages by domain, path, or custom rules so you can enforce which web content is allowed into training sets and answer “where did this come from?” with a concrete list of URLs instead of guesses.
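One way to picture domain and path scoping is an allowlist check applied to every candidate URL, with both the kept and rejected lists retained for auditing. This is a minimal client-side sketch using only the Python standard library. The `ALLOWED` policy, the function names, and the example URLs are all hypothetical, and it does not represent how Firecrawl implements filtering internally.

```python
from urllib.parse import urlparse

# Hypothetical allowlist: domain -> permitted path prefixes.
ALLOWED = {
    "docs.example.com": ["/guides/", "/api/"],
    "blog.example.com": ["/engineering/"],
}


def is_allowed(url, allowed=ALLOWED):
    """Return True if the URL's host is listed and its path matches a prefix."""
    parsed = urlparse(url)
    prefixes = allowed.get(parsed.hostname or "", [])
    return any(parsed.path.startswith(prefix) for prefix in prefixes)


def split_by_policy(urls, allowed=ALLOWED):
    """Partition URLs into (kept, rejected) so both lists can be audited."""
    kept, rejected = [], []
    for url in urls:
        (kept if is_allowed(url, allowed) else rejected).append(url)
    return kept, rejected


candidates = [
    "https://docs.example.com/guides/setup",
    "https://docs.example.com/careers",
    "https://other.example.org/guides/setup",
    "https://blog.example.com/engineering/crawling",
]

kept, rejected = split_by_policy(candidates)
print("kept:", kept)
print("rejected:", rejected)
```

Storing `kept` alongside the dataset gives you the concrete URL list to point at when someone asks where a training example came from.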

Schedule refreshes

Schedule recurring Firecrawl crawls so fine-tuning and evaluation datasets stay fresh without manually rerunning scrape jobs every time a source changes.
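To make the refresh idea concrete, here is a small client-side sketch that decides which sources are due for a re-crawl based on a per-source interval. It does not call any Firecrawl scheduling feature. The record fields (`refresh_hours`, `last_crawled`) and the function `due_sources` are assumptions for illustration, and a real setup would run a check like this from cron or a workflow orchestrator.

```python
from datetime import datetime, timedelta, timezone


def due_sources(sources, now):
    """Return URLs whose last crawl is older than their refresh interval."""
    due = []
    for source in sources:
        interval = timedelta(hours=source["refresh_hours"])
        last = source["last_crawled"]
        # Never-crawled sources are always due.
        if last is None or now - last >= interval:
            due.append(source["url"])
    return due


# Fixed timestamp so the example is reproducible.
now = datetime(2025, 1, 10, 12, 0, tzinfo=timezone.utc)

sources = [
    {   # crawled 25 hours ago, daily refresh
        "url": "https://docs.example.com",
        "refresh_hours": 24,
        "last_crawled": datetime(2025, 1, 9, 11, 0, tzinfo=timezone.utc),
    },
    {   # crawled 48 hours ago, weekly refresh
        "url": "https://blog.example.com",
        "refresh_hours": 168,
        "last_crawled": datetime(2025, 1, 8, 12, 0, tzinfo=timezone.utc),
    },
    {   # never crawled
        "url": "https://new.example.com",
        "refresh_hours": 24,
        "last_crawled": None,
    },
]

due = due_sources(sources, now)
print(due)
```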

Export to your training stack

Export data in the formats your training stack expects so PyTorch, TensorFlow, or custom orchestrators can load it without brittle HTML parsers or cleanup scripts.
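JSON Lines is one common interchange format for training data, so the sketch below writes crawled records as one JSON object per line and assigns a deterministic train/validation split by hashing the URL. The record keys (`url`, `markdown`), the 10% validation fraction, and the helper names are assumptions for this example rather than a Firecrawl export feature.

```python
import hashlib
import io
import json


def assign_split(url, val_fraction=0.1):
    """Map a URL to 'train' or 'val' deterministically via its SHA-256 hash."""
    digest = hashlib.sha256(url.encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:4], "big") / 2**32  # value in [0, 1)
    return "val" if bucket < val_fraction else "train"


def write_jsonl(records, handle):
    """Write one JSON object per line; return the number of rows written."""
    count = 0
    for record in records:
        row = {
            "text": record["markdown"],
            "url": record["url"],
            "split": assign_split(record["url"]),
        }
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


# Placeholder records standing in for crawled pages.
records = [
    {"url": "https://docs.example.com/guides/setup", "markdown": "# Setup\nInstall the CLI."},
    {"url": "https://docs.example.com/api/auth", "markdown": "# Auth\nUse an API key."},
]

buffer = io.StringIO()  # swap for open("dataset.jsonl", "w", encoding="utf-8")
written = write_jsonl(records, buffer)
lines = buffer.getvalue().splitlines()
print(written, "rows written")
```

Hashing the URL rather than sampling at random means a page keeps the same split across refreshes, which helps avoid leaking validation pages into training after a re-crawl.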

Discover new sources over time

Combine Firecrawl’s search and crawl endpoints so you can discover new relevant sources over time and grow or refresh datasets as your model scope expands.
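The discovery loop can be sketched as: take URLs returned by a search step, drop any whose domain you already crawl, and queue the rest for review. The snippet below covers only that deduplication step on plain lists. It makes no Firecrawl calls, and `search_results`, `known_domains`, and `new_domains` are hypothetical names standing in for whatever your search and source-tracking code provides.

```python
from urllib.parse import urlparse


def new_domains(search_result_urls, known_domains):
    """Return previously unseen hostnames, in first-seen order."""
    seen = {domain.lower() for domain in known_domains}
    discovered = []
    for url in search_result_urls:
        host = (urlparse(url).hostname or "").lower()
        if host and host not in seen:
            seen.add(host)  # avoid reporting the same new host twice
            discovered.append(host)
    return discovered


# Placeholder data: domains already crawled, and URLs from a search step.
known_domains = ["docs.example.com"]
search_results = [
    "https://docs.example.com/guides/setup",
    "https://forum.example.net/t/crawling-tips",
    "https://forum.example.net/t/rate-limits",
    "https://wiki.example.org/Scraping",
]

candidates_for_review = new_domains(search_results, known_domains)
print(candidates_for_review)
```

Routing discovered domains through a review step before adding them to the crawl allowlist keeps the dataset's scope auditable as it grows.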

What Our Customers Say

People love building with Firecrawl

Discover why developers choose Firecrawl every day.

How Firecrawl compares to alternatives

Feature | Firecrawl | Manual CSV uploads | Browser extensions | Generic scrapers
Web search API (/search)
Site crawling (/crawl)
Extract to JSON (/extract)
Structured markdown output
Automatic scheduling & refresh
JavaScript rendering
URL metadata preserved
Multi-tenant scoping
API-first integration
Built-in rate limiting & retries
No manual intervention required
Get started
Ready to scale your training data?
Start collecting high-quality web data for your AI models today.
The easiest way to extract data from the web
Backed by Y Combinator
SOC 2 Type 2 (AICPA)