
AI Model Training Data

Add web data to your training pipelines.
Firecrawl turns sites, docs, and PDFs into clean datasets for pre-training, fine-tuning, and RL.

Used by over 500,000 developers. Trusted by 80,000+ companies of all sizes.
Shopify logo
Lovable logo
Zapier logo
Canva logo
Apple logo
Alibaba logo
PHMG logo
DoorDash logo
Gamma logo
You.com logo
Sprinklr logo
Cognism logo
Ada logo
11x logo
Botpress logo
Aleph Alpha logo
Sierra logo
10x faster dataset collection
100k+ URLs crawled per project
24/7 scheduled refresh pipelines

Perfect for

Model training teams

Collect domain-specific datasets for pre-training and fine-tuning without custom crawlers.

RAG and evaluation pipelines

Build fresh eval sets and benchmarks from real docs and sites with preserved URLs.

RLHF and instruction data

Extract structured sections so you can generate prompts, pairs, and preference data in code.

Compliance-minded orgs

Scope allowed sources by domain and path so you can audit what goes into your training data.

[ 01 / 03 ]
·
Use Cases
AI Training Pipeline
[Interactive demo: web data collection, data cleaning and processing, pre-training, fine-tuning, and RLHF & post-training, with live metrics for web pages scraped, training tokens, model accuracy, and data quality score.]

How it works

Crawl approved sources

Crawl target domains and docs portals into structured, domain-specific text datasets so your models train on the same pages your users read, not a generic crawl of the public web.
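As a sketch of this step, the helper below builds the JSON body for a crawl job scoped to one approved docs portal. The option names (`limit`, `includePaths`, `scrapeOptions`) follow Firecrawl's v1 crawl API as an assumption; confirm field names against the current API reference before sending the request.

```python
# Sketch: build a scoped crawl request for one approved domain.
# Field names are assumed from Firecrawl's v1 /crawl API -- verify
# against the current API reference.

def build_crawl_request(domain: str, include_paths: list[str], limit: int = 500) -> dict:
    """Return a JSON body for a crawl job limited to one domain and path set."""
    return {
        "url": f"https://{domain}",
        "limit": limit,                          # cap pages per job
        "includePaths": include_paths,           # only crawl approved sections
        "scrapeOptions": {"formats": ["markdown"]},
    }

req = build_crawl_request("docs.example.com", ["docs/.*", "guides/.*"])
```

The body would then be POSTed to the crawl endpoint with your API key; keeping it in a function means the allowed surface is reviewable in code.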

Extract structure to JSON

Extract headings, sections, and metadata into JSON so you can generate instruction pairs, Q&A datasets, and RLHF prompts in code instead of hand-labeling examples.
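A minimal sketch of that code-driven generation step: given a page of extracted sections (the input shape here is a hypothetical example, not Firecrawl's fixed output schema), emit instruction pairs that keep the source URL for provenance.

```python
# Sketch: turn extracted sections into instruction pairs.
# The page dict shape is illustrative -- adapt the keys to whatever
# JSON schema your extraction step produces.

def sections_to_pairs(page: dict) -> list[dict]:
    """One instruction/response pair per extracted section, with provenance."""
    pairs = []
    for section in page.get("sections", []):
        pairs.append({
            "instruction": f"Explain: {section['heading']}",
            "response": section["text"],
            "source_url": page["url"],   # preserved for audits and evals
        })
    return pairs

page = {
    "url": "https://docs.example.com/auth",
    "sections": [{"heading": "API keys", "text": "Create a key in settings."}],
}
pairs = sections_to_pairs(page)
```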

Filter and scope the surface

Filter pages by domain, path, or custom rules so you can enforce which web content is allowed into training sets and answer “where did this come from?” with a concrete list of URLs instead of guesses.
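Such a filter can live as a few lines of code next to your pipeline. The rule format below (host plus path prefix) is illustrative, not a Firecrawl API; the point is that the allow-list is a diffable artifact.

```python
from urllib.parse import urlparse

# Sketch: an allow-list gate a URL must pass before entering a training set.
# The (host, path-prefix) rule format is illustrative.

ALLOWED = [
    ("docs.example.com", "/guides/"),
    ("example.com", "/blog/"),
]

def is_allowed(url: str, rules=ALLOWED) -> bool:
    parsed = urlparse(url)
    return any(
        parsed.hostname == host and parsed.path.startswith(prefix)
        for host, prefix in rules
    )

kept = [u for u in [
    "https://docs.example.com/guides/setup",
    "https://evil.example.net/guides/setup",
] if is_allowed(u)]
```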

Schedule refreshes

Schedule recurring Firecrawl crawls so fine-tuning datasets and evaluation sets stay fresh without rerunning scrape jobs every time something changes.
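The scheduling itself can live in cron, an orchestrator, or Firecrawl's own scheduling; a small sketch of the staleness check that decides which datasets are due for a refresh crawl, assuming you track last-crawl timestamps per dataset:

```python
from datetime import datetime, timedelta, timezone

# Sketch: pick the datasets whose last crawl is older than the cadence.
# The actual trigger (cron, Airflow, etc.) is outside this snippet.

def stale_datasets(last_crawled: dict[str, datetime],
                   cadence: timedelta,
                   now: datetime) -> list[str]:
    """Names of datasets due for a refresh crawl."""
    return [name for name, ts in last_crawled.items() if now - ts >= cadence]

now = datetime(2024, 6, 1, tzinfo=timezone.utc)
due = stale_datasets(
    {"docs": now - timedelta(days=8), "blog": now - timedelta(days=2)},
    cadence=timedelta(days=7),
    now=now,
)
```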

Export to your training stack

Export data in formats your training stack expects so PyTorch, TensorFlow, or custom orchestrators plug it in without brittle HTML parsers or cleanup scripts.
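For example, one line of JSON per document (JSONL) is a format most training loaders ingest directly. A minimal serializer, assuming records are plain dicts of scraped fields:

```python
import io
import json

# Sketch: serialize scraped records as JSONL -- one document per line,
# the de facto exchange format for training data pipelines.

def to_jsonl(records: list[dict]) -> str:
    buf = io.StringIO()
    for rec in records:
        buf.write(json.dumps(rec, ensure_ascii=False) + "\n")
    return buf.getvalue()

out = to_jsonl([{"url": "https://docs.example.com", "text": "Hello"}])
```

In practice you would stream this to object storage and point your PyTorch or TensorFlow data loader at the resulting files.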

Discover new sources over time

Combine Firecrawl’s search and crawl endpoints so you can discover new relevant sources over time and grow or refresh datasets as your model scope expands.
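A sketch of the review loop: diff domains surfaced by search results against the domains you already crawl, and queue the new ones for human approval. The search-result shape here is a hypothetical example, not the exact `/search` response schema.

```python
from urllib.parse import urlparse

# Sketch: surface domains from search results that are not yet in the
# approved crawl set. Result dicts are illustrative.

def new_domains(search_results: list[dict], known: set[str]) -> set[str]:
    """Domains seen in search results but not yet crawled -- candidates for review."""
    found = {urlparse(r["url"]).hostname for r in search_results}
    return found - known

candidates = new_domains(
    [{"url": "https://blog.example.org/post"}, {"url": "https://docs.example.com/x"}],
    known={"docs.example.com"},
)
```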

[ 02 / 03 ]
·
What Our Customers Say
Community

People love building with Firecrawl

Discover why developers choose Firecrawl every day.

How Firecrawl compares to alternatives

Feature | Firecrawl | Manual CSV uploads | Browser extensions | Generic scrapers
Web search API (/search)
Site crawling (/crawl)
Extract to JSON (/extract)
Structured markdown output
Automatic scheduling & refresh
JavaScript rendering
URL metadata preserved
Multi-tenant scoping
API-first integration
Built-in rate limiting & retries
No manual intervention required
FAQ

Frequently asked questions

Everything you need to know about this use case.
General

What kinds of AI training data can teams build with Firecrawl?
Teams use Firecrawl to build domain-specific pre-training sets, instruction and Q&A datasets for fine-tuning, reinforcement learning from human feedback (RLHF) preference data, and evaluation sets derived from real docs and sites. The pattern is simple: define domains and paths, crawl with Firecrawl, then turn the structured output into your preferred training format.

How do I control which sources end up in my training data?
Each Firecrawl job can be scoped by domain, path, and custom filters so you only ingest the sources you’ve explicitly allowed. Because those settings live in code or config, you can review and diff what changed any time someone updates the training data surface.
Technical

Can I schedule recurring crawls to keep datasets fresh?
Yes. You can schedule Firecrawl jobs to run daily, weekly, or on any cadence you need. Many teams refresh key domains regularly so their fine-tuned models and evaluation sets reflect the latest docs, policies, and product changes without one-off scraping runs.

How does Firecrawl fit into an existing training pipeline?
Firecrawl exposes a simple HTTP API and SDKs. Most teams add a data collection step at the start of their pipelines that calls Firecrawl, writes structured outputs to object storage, and then hands those files to their existing preprocessing and training jobs.
Integration

Does Firecrawl respect robots.txt and site terms?
Firecrawl is designed to respect robots.txt and standard web crawling conventions. You still own the decision about which sites to include and are responsible for making sure your AI training use complies with each site’s terms and any regulatory requirements.
Why Firecrawl?
The world's most comprehensive web data API. Our custom browser stack and semantic index deliver superior data quality across any website, handling more content types and edge cases than any competitor.
JavaScript rendering, dynamic content, and robust request handling built-in.
Process millions of pages with automatic rate limiting, caching, and distributed infrastructure.
Optimized scraping engine with parallel processing and smart caching for instant results.
Comprehensive docs, SDKs for all major languages, and dedicated support to help you succeed.
[ 03 / 03 ]
·
Pricing
[ MAP ]
[ AGENT ]
[ SCRAPE ]
[ SEARCH ]
Get started
Ready to scale your training data?
Start collecting high-quality web data for your AI models today.
The easiest way to extract data from the web
Backed by Y Combinator
LinkedIn · GitHub · YouTube · X (Twitter) · Discord
SOC 2 Type 2 (AICPA)