AI Model Training Data
Add web data to your training pipelines.
Firecrawl turns sites, docs, and PDFs into clean datasets for pre-training, fine-tuning, and RL.
Perfect for
Model training teams
Collect domain-specific datasets for pre-training and fine-tuning without custom crawlers.
RAG and evaluation pipelines
Build fresh eval sets and benchmarks from real docs and sites, with each record's source URL preserved.
RLHF and instruction data
Extract structured sections so you can generate prompts, pairs, and preference data in code.
Compliance-minded orgs
Scope allowed sources by domain and path so you can audit what goes into your training data.
How it works
Crawl approved sources
Crawl target domains and docs portals into structured, domain-specific text datasets so your models train on the same pages your users read, not a generic crawl of the public web.
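As a rough sketch of how a scoped crawl request might be assembled, the snippet below builds a JSON body for the /crawl endpoint named in the comparison table. The helper `build_crawl_request` is hypothetical, and the field names (`url`, `includePaths`, `limit`, `scrapeOptions`) are assumptions not confirmed by this page; check them against the Firecrawl API reference before sending anything.

```python
import json


def build_crawl_request(start_url, include_paths, page_limit=100):
    """Build a request body for a scoped crawl.

    Hypothetical helper: the field names below are assumptions and
    should be verified against the API reference.
    """
    if page_limit <= 0:
        raise ValueError("page_limit must be positive")
    return {
        "url": start_url,                            # approved starting point
        "includePaths": list(include_paths),         # assumed: path patterns to keep
        "limit": page_limit,                         # assumed: max pages to crawl
        "scrapeOptions": {"formats": ["markdown"]},  # assumed: output format
    }


body = build_crawl_request("https://docs.example.com", ["^/guides/.*"], page_limit=50)
print(json.dumps(body, indent=2))
```

Keeping the request in a plain dictionary makes the crawl scope easy to review and version alongside the dataset it produced.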
Extract structure to JSON
Extract headings, sections, and metadata into JSON so you can generate instruction pairs, Q&A datasets, and RLHF prompts in code instead of hand-labeling examples.
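To illustrate generating instruction pairs in code, here is a minimal sketch that splits a page's markdown into heading-delimited sections and turns each non-empty section into a prompt/response record. The functions `split_sections` and `to_instruction_pairs`, the "Explain: ..." prompt template, and the example URL are all hypothetical. The splitter is deliberately naive: it ignores text before the first heading and would misread `#` lines inside code fences.

```python
import json
import re

# Matches ATX-style markdown headings such as "## Configure".
HEADING = re.compile(r"^(#{1,6})\s+(.*\S)\s*$")


def split_sections(markdown):
    """Split markdown into a list of {"heading", "level", "text"} dicts."""
    sections = []
    current = None
    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            current = {
                "heading": match.group(2),
                "level": len(match.group(1)),
                "lines": [],
            }
            sections.append(current)
        elif current is not None:
            # Body text belongs to the most recent heading.
            current["lines"].append(line)
    return [
        {
            "heading": s["heading"],
            "level": s["level"],
            "text": "\n".join(s["lines"]).strip(),
        }
        for s in sections
    ]


def to_instruction_pairs(sections, source_url):
    """Turn non-empty sections into simple prompt/response records."""
    return [
        {
            "prompt": f"Explain: {s['heading']}",
            "response": s["text"],
            "source_url": source_url,
        }
        for s in sections
        if s["text"]
    ]


page = "# Install\nRun the installer.\n\n## Configure\nSet the API key.\n\n## Empty\n"
pairs = to_instruction_pairs(split_sections(page), "https://docs.example.com/setup")
print(json.dumps(pairs, indent=2))
```

Carrying `source_url` on every record keeps each generated pair traceable to the page it came from.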
Filter and scope the surface
Filter pages by domain, path, or custom rules to control which web content is allowed into training sets. When someone asks “where did this come from?”, you can answer with a concrete list of URLs instead of guesses.
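One way to enforce such a scope on your own side is an explicit allowlist checked against every URL before it enters a dataset. The sketch below uses only the Python standard library. The `ALLOWED` mapping, the `is_allowed` and `audit` helpers, and the example hosts are hypothetical and not part of Firecrawl.

```python
from urllib.parse import urlsplit

# Hypothetical allowlist: host -> permitted path prefixes.
ALLOWED = {
    "docs.example.com": ["/guides/", "/api/"],
    "example.org": ["/"],
}


def is_allowed(url, allowed=ALLOWED):
    """Return True if the URL's host and path fall inside the allowlist."""
    parts = urlsplit(url)
    if parts.scheme not in ("http", "https"):
        return False
    host = (parts.hostname or "").lower()
    path = parts.path or "/"
    return any(path.startswith(prefix) for prefix in allowed.get(host, []))


def audit(urls, allowed=ALLOWED):
    """Split URLs into (kept, dropped) so both lists can be logged."""
    kept = [u for u in urls if is_allowed(u, allowed)]
    dropped = [u for u in urls if not is_allowed(u, allowed)]
    return kept, dropped


kept, dropped = audit([
    "https://docs.example.com/guides/intro",
    "https://docs.example.com/blog/launch",
    "https://evil.example.net/guides/",
])
print("kept:", kept)
print("dropped:", dropped)
```

Logging the dropped list alongside the kept list gives auditors evidence of what was excluded as well as what was included.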
Schedule refreshes
Schedule recurring Firecrawl crawls so fine-tuning datasets and evaluation sets stay fresh without manually rerunning scrape jobs every time a source changes.
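After each scheduled refresh, a pipeline usually only needs to reprocess pages whose content actually changed. The following sketch shows one possible approach: hash each page's text and compare against the previous run's snapshot. It does not use any Firecrawl scheduling feature; `fingerprint`, `diff_snapshot`, and the in-memory snapshot format are assumptions made for illustration.

```python
import hashlib


def fingerprint(text):
    """Stable content hash for one page's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()


def diff_snapshot(previous, pages):
    """Compare a new crawl against the last snapshot.

    previous: {url: hash} from the prior run (empty on the first run)
    pages:    {url: text} from the current run
    Returns (changed_urls, removed_urls, new_snapshot).
    """
    current = {url: fingerprint(text) for url, text in pages.items()}
    # New pages count as changed because they have no previous hash.
    changed = sorted(u for u, h in current.items() if previous.get(u) != h)
    removed = sorted(u for u in previous if u not in current)
    return changed, removed, current


first_run = {"https://docs.example.com/a": "v1", "https://docs.example.com/b": "v1"}
changed, removed, snapshot = diff_snapshot({}, first_run)

second_run = {"https://docs.example.com/a": "v2", "https://docs.example.com/c": "v1"}
changed2, removed2, snapshot2 = diff_snapshot(snapshot, second_run)
print("changed:", changed2)
print("removed:", removed2)
```

In practice the snapshot would be persisted between runs, for example as a JSON file stored next to the dataset.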
Export to your training stack
Export data in the formats your training stack expects so PyTorch, TensorFlow, or custom orchestrators can consume it without brittle HTML parsers or cleanup scripts.
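JSON Lines is one format many training loaders accept. The sketch below writes one `{"text", "url"}` row per page. It assumes each crawled record is a dictionary with a `markdown` field and a `metadata.sourceURL` field; that record shape and the `write_jsonl` helper are assumptions for illustration, so adjust the keys to match the output you actually receive.

```python
import io
import json


def write_jsonl(records, handle):
    """Write one JSON object per line and return the number of rows written.

    Assumed record shape: {"markdown": str, "metadata": {"sourceURL": str}}.
    Records with empty text are skipped.
    """
    count = 0
    for record in records:
        text = (record.get("markdown") or "").strip()
        if not text:
            continue
        row = {
            "text": text,
            "url": record.get("metadata", {}).get("sourceURL"),
        }
        handle.write(json.dumps(row, ensure_ascii=False) + "\n")
        count += 1
    return count


records = [
    {"markdown": "# Intro\nHello.", "metadata": {"sourceURL": "https://docs.example.com/intro"}},
    {"markdown": "   ", "metadata": {"sourceURL": "https://docs.example.com/empty"}},
]
buffer = io.StringIO()  # swap for open("train.jsonl", "w", encoding="utf-8") on disk
written = write_jsonl(records, buffer)
print(buffer.getvalue(), end="")
```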
Discover new sources over time
Combine Firecrawl’s search and crawl endpoints so you can discover new relevant sources over time and grow or refresh datasets as your model scope expands.
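When search results feed new crawl targets, it helps to deduplicate them against sources you already track. This is a minimal sketch of that bookkeeping step only; it makes no API calls. The `normalize` and `new_sources` helpers and the normalization rules (lowercase scheme and host, drop fragments and trailing slashes) are assumptions chosen for illustration.

```python
from urllib.parse import urlsplit, urlunsplit


def normalize(url):
    """Canonicalize a URL so trivially different forms compare equal."""
    parts = urlsplit(url)
    path = parts.path.rstrip("/") or "/"
    # Drop the fragment; keep the query string as-is.
    return urlunsplit((parts.scheme.lower(), parts.netloc.lower(), path, parts.query, ""))


def new_sources(known, discovered):
    """Return normalized URLs from `discovered` that are not already known."""
    seen = {normalize(u) for u in known}
    fresh = []
    for url in discovered:
        key = normalize(url)
        if key not in seen:
            seen.add(key)
            fresh.append(key)
    return fresh


known = ["https://docs.example.com/guides/"]
discovered = [
    "https://DOCS.example.com/guides",
    "https://docs.example.com/api#auth",
    "https://docs.example.com/api",
]
fresh = new_sources(known, discovered)
print(fresh)
```

Newly discovered URLs should still pass through whatever domain and path scoping you enforce before they are crawled.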
People love building with Firecrawl
Firecrawl is an open-source framework that takes a URL, crawls it, and conver..."

Upload a CSV of emails and..."
How Firecrawl compares to alternatives
| Feature | Firecrawl | Manual CSV uploads | Browser extensions | Generic scrapers |
|---|---|---|---|---|
| Web search API (/search) | ||||
| Site crawling (/crawl) | ||||
| Extract to JSON (/extract) | ||||
| Structured markdown output | ||||
| Automatic scheduling & refresh | ||||
| JavaScript rendering | ||||
| URL metadata preserved | ||||
| Multi-tenant scoping | ||||
| API-first integration | ||||
| Built-in rate limiting & retries | ||||
| No manual intervention required | ||||
Tutorials & Guides

How to Create Custom Instruction Datasets for LLM Fine-tuning
A comprehensive guide to creating instruction datasets for fine-tuning LLMs, including best practices and a practical code documentation example.

Fine-tuning Gemma 3 on a Custom Web Dataset With Firecrawl and Unsloth AI
Learn how to efficiently fine-tune Google's Gemma 3 language model on your custom dataset using Firecrawl for data collection.

How to Create a Dermatology Q&A Dataset with OpenAI Harmony & Firecrawl Search
A step by step guide on collecting dermatology data from the web using Firecrawl and generating a structured Q&A dataset.
Frequently asked questions