Introducing Scrape-Evals - An Open Benchmark for Web Scraping
Rafael Miller
Nov 20, 2025

TL;DR: We built scrape-evals, an open-source benchmark to test web scraping engines on 1,000 real URLs. Measure the coverage and quality of any scraping engine.

Why This Exists

If you’re choosing a web scraping solution today, you’re basically flying blind. There are dozens of options (Playwright, Scrapy, Puppeteer, plus API services like Firecrawl, ScraperAPI, Zyte, etc.), and everyone claims to be the best. But there’s no standard way to actually compare them. So we built one.

Full disclosure: I work at Firecrawl, and yes, we did well in these benchmarks. But the entire framework is open source. Run it yourself, add your own scrapers, check our methodology. If you find issues, open a PR.

What We Measure

Coverage (Success Rate): Can the system successfully retrieve complete, rendered web page content? This metric measures the ability to access publicly available pages while properly handling modern web technologies such as JavaScript rendering, dynamic content loading, and standard authentication flows.

Quality (F1 Score): How much of the important content did you capture, and how much junk did you include? We use human-curated ground truth snippets for each URL and calculate precision/recall to get an F1 score.
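To make the scoring concrete, here is a minimal sketch of how these two top-line numbers could be rolled up per engine. The field names and the choice to average F1 over successful scrapes only are assumptions for illustration, not necessarily the exact aggregation scrape-evals uses.

# Minimal sketch of per-engine aggregation. Field names and the decision to
# average F1 over successful scrapes only are illustrative assumptions.

def aggregate(results: list[dict]) -> dict:
    successes = [r for r in results if r["success"]]
    coverage = 100 * len(successes) / len(results)  # success rate, in %
    quality = (
        sum(r["f1"] for r in successes) / len(successes) if successes else 0.0
    )  # mean F1 across pages that were scraped successfully
    return {"coverage": round(coverage, 1), "quality": round(quality, 2)}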

The Results

We tested 13 engines on 1,000 URLs. Here’s what happened:

Engine             Coverage (Success Rate, %)    Quality (F1)
Firecrawl          80.9                          0.68
Exa                76.3                          0.53
Tavily             67.6                          0.50
ScraperAPI         63.5                          0.45
Zyte               62.9                          0.47
ScrapingBee        60.6                          0.45
Apify              60.2                          0.42
Crawl4ai           58.0                          0.45
Selenium           55.0                          0.40
Scrapy             54.0                          0.43
Puppeteer          53.7                          0.41
Rest (requests)    50.6                          0.36
Playwright         39.5                          0.34

What surprised us

Playwright landing at the bottom on coverage genuinely surprised us. It’s one of the most capable browser automation tools out there, but that’s also the trap: in-house Playwright stacks that skip the unglamorous work (fingerprints, retries, anti-bot handling, timeouts) look great in demos and quietly fall over on the messy edges of the real web.
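For context, the gap between a demo-grade Playwright script and a hardened one is exactly that plumbing. The sketch below is illustrative only: the timeout, retry count, and user agent are arbitrary examples, and this is not the Playwright setup used in the benchmark.

# Illustrative only: a bare Playwright fetch wrapped with the unglamorous
# extras (timeout, retries, a realistic user agent). Not the benchmark's
# actual Playwright engine; all values are placeholder examples.
from playwright.sync_api import sync_playwright, TimeoutError as PlaywrightTimeout

def fetch(url: str, retries: int = 3, timeout_ms: int = 30_000) -> str | None:
    with sync_playwright() as p:
        browser = p.chromium.launch(headless=True)
        context = browser.new_context(
            user_agent="Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36",
            viewport={"width": 1280, "height": 800},
        )
        page = context.new_page()
        try:
            for _ in range(retries):
                try:
                    # Wait for network to settle so JS-rendered content is present
                    page.goto(url, timeout=timeout_ms, wait_until="networkidle")
                    return page.content()
                except PlaywrightTimeout:
                    continue  # real stacks also rotate proxies, fingerprints, etc.
            return None
        finally:
            browser.close()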

Some Technical Decisions

The Dataset: 1,000 Pages from the Real Web

We curated 1,000 URLs with human-annotated ground truth, now available on Hugging Face. The selection process was deliberately simple: random URLs from unique domains across the web. No cherry-picking tech blogs or well-structured content. We wanted real-world messiness: e-commerce sites with dynamic pricing, news articles behind paywalls, documentation with nested navigation, SPAs that load content via JavaScript, forums with infinite scroll, etc. For each URL, we manually scraped and annotated:

  • A ~100-word core snippet: the actual content you want (article text, product details, documentation)
  • A ~10-word noise snippet: the junk you don’t want (nav bars, footers, ads, cookie banners)

This manual curation took time, but it’s the only way to get ground truth that actually reflects what users care about.
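If you want to poke at the data directly, loading it with the Hugging Face datasets library looks roughly like the sketch below. The dataset id and column names are placeholders; check the Hugging Face page for the real identifiers.

# Rough sketch of loading the benchmark data with the `datasets` library.
# The dataset id and column names are placeholders, not the real schema.
from datasets import load_dataset

ds = load_dataset("firecrawl/scrape-evals", split="train")  # hypothetical id

for row in ds:
    url = row["url"]            # page to scrape
    core = row["core_snippet"]  # ~100 words of content you want
    noise = row["noise_snippet"]  # ~10 words of junk you don't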

Why F1 Score as Quality?

When evaluating the quality of scrapers, especially for AI applications, you need to care about two things:

Precision: Of everything you extracted, how much was actually useful?

Precision = relevant_content_extracted / total_content_extracted

Recall: Of all the useful content on the page, how much did you capture?

Recall = relevant_content_extracted / total_relevant_content_available

Here’s why both matter:

  • High recall but low precision = you got all the good stuff, but also grabbed tons of navigation menus, ads, and footer links. Your AI now wastes tokens processing garbage.
  • High precision but low recall = everything you extracted is clean, but you missed half the article. Your AI doesn’t have enough context to answer questions.

F1 score balances these trade-offs:

F1 = 2 * (Precision * Recall) / (Precision + Recall)

For AI-ready scrapers, this is critical. LLMs have finite context windows, and even with longer contexts you’re paying per token. If your scraper dumps 10,000 tokens of HTML tags or navigation text into the context, you’re wasting money and degrading performance. But if it only extracts 500 tokens and misses key information, your AI can’t do its job. F1 score rewards scrapers that extract complete, clean content, which is what you actually need in production.
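To make the math concrete, here is a simplified word-overlap version of the scoring. The actual scrape-evals matching against the curated core and noise snippets is more involved, so treat this as an illustration of the formulas rather than the implementation.

# Simplified illustration of precision/recall/F1 using word overlap.
# The real scrape-evals scoring against the curated snippets may differ.

def f1_score(extracted: str, relevant: str) -> float:
    extracted_words = set(extracted.lower().split())
    relevant_words = set(relevant.lower().split())
    if not extracted_words or not relevant_words:
        return 0.0

    overlap = extracted_words & relevant_words
    precision = len(overlap) / len(extracted_words)  # how much of the output is useful
    recall = len(overlap) / len(relevant_words)      # how much useful content was captured

    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)

# Example: an extraction that includes the article plus a nav bar keeps high
# recall but loses precision, and the F1 score drops accordingly.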

Why unique domains?

We wanted diversity. If we scraped 100 pages from the same news site, the benchmark would just measure “how well do you handle this one site’s structure?” Unique domains force scrapers to handle different frameworks, rendering patterns, and anti-bot strategies.

How It Works

At a high level, the loop is simple:

  1. Take the dataset of URLs plus ground-truth snippets (from Hugging Face).
  2. Let each engine scrape every URL under the same constraints (timeouts, headers, etc.).
  3. Compare the scraped content to the ground truth to compute coverage and precision/recall.
  4. Aggregate the scores so you can line up engines side by side.

Under the hood, we save the raw HTML, extracted content, and metrics for every run. You can plug in new engines via a small Python interface, run them sync or async, and then dig into specific failures instead of just staring at a single top-line number.
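As a rough picture of what plugging in an engine involves, an adapter looks something like the sketch below. The class and method names are assumptions for illustration; the repo defines the actual Python interface.

# Illustrative shape of a plug-in engine. Class and method names are
# assumptions for this sketch; see the repo for the real interface.
import requests

class MyEngine:
    name = "my-engine"

    def scrape(self, url: str) -> dict:
        """Fetch one URL and return raw HTML plus extracted content."""
        resp = requests.get(url, timeout=30)
        resp.raise_for_status()
        return {
            "html": resp.text,
            "content": resp.text,  # a real engine would extract main content here
        }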

What’s Next

We’re going to keep growing and maintaining this dataset over time. It’s already wired into our internal monitoring, so as pages break, move, or start serving junk, we’ll release a new version. We’ll also keep adding new, manually scraped pages from fresh domains so the benchmark doesn’t go stale or overfit to a narrow slice of the web.

The dataset will stay public on Hugging Face, and we’d like it to become a shared asset for the web scraping community. If you spot bad entries, have ideas for new sites, or want to help maintain it, contributions are welcome.

Try It Yourself

Everything is in the repo:

  • Clone it
  • Run the evals on your own machine
  • Compare your numbers to ours
  • Plug in your own scraper implementation
  • Open issues or PRs if you see something off

If you break our benchmark with something better, we genuinely want to see it!

— Rafael and the Firecrawl team

Rafael Miller @rafaelmmiller
Head of Evaluations at Firecrawl
About the Author
Rafael Miller is the Head of Evaluations at Firecrawl. He was the CEO/CTO and founder of Lampejo, where he developed an educational app with over 200,000 downloads. He was also the CTO at FRST and Neomove, leading software engineering and data teams.