What's the best way to scrape and parse PDFs from the web into text/markdown?
TL;DR
Firecrawl automatically detects and parses PDFs when you scrape a URL. It extracts text with layout preservation, handles scanned PDFs with OCR, and returns clean markdown. No separate PDF libraries or preprocessing—just pass the PDF URL like any other page.
Automatic PDF detection
Point Firecrawl at any PDF URL and it handles extraction automatically:
result = app.scrape_url("https://example.com/report.pdf", {
"formats": ["markdown"]
})Firecrawl detects the PDF format, processes it server-side, and returns structured text.
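The returned object carries the extracted content under the requested format. A minimal sketch of reading it, assuming dict-style access (the exact response shape varies by SDK version):

```python
# Assumption: the SDK returns a dict keyed by format; newer versions
# may expose an object with a .markdown attribute instead.
markdown = result["markdown"]
print(markdown[:500])  # preview the first 500 characters
```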
Key features
| Feature | Description |
|---|---|
| Text extraction | Preserves reading order and layout |
| OCR support | Extracts text from scanned/image PDFs |
| Table detection | Converts tables to markdown format |
| Page limits | Control costs with maxPages option |
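None of these features need per-document configuration; the same call covers digital-native and scanned PDFs. A minimal sketch that batch-converts a list of PDFs to markdown files, assuming dict-style responses and hypothetical example URLs:

```python
from pathlib import Path
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Hypothetical URLs: one digital-native PDF, one scanned archive.
# OCR happens server-side, so both go through the identical call.
pdf_urls = [
    "https://example.com/report.pdf",
    "https://example.com/scanned-archive.pdf",
]

for url in pdf_urls:
    result = app.scrape_url(url, {"formats": ["markdown"]})
    # Save each document locally as <name>.md
    Path(url.rsplit("/", 1)[-1]).with_suffix(".md").write_text(result["markdown"])
```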
Controlling page limits
For large PDFs, limit pages to control costs:
result = app.scrape_url("https://example.com/large-report.pdf", {
"parsers": [{"type": "pdf", "maxPages": 10}]
})Key Takeaways
Firecrawl handles PDF scraping automatically: detection, extraction, and conversion to markdown in one API call. OCR support covers scanned documents, and structure preservation keeps content organized for LLMs and RAG systems, so there is no need to manage separate PDF parsing libraries. For web-hosted PDFs, the scrape endpoint handles document parsing inline; for local or non-public documents, use the /parse endpoint to upload the file directly and get clean markdown back. The engine powering this is Fire-PDF, a Rust-based PDF parsing system that classifies each page in milliseconds and routes only truly scanned content through GPU-based OCR, averaging under 400 ms per page.
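For the local-document path, the upload can be sketched with plain HTTP. The field names and response key below are assumptions for illustration, not the documented API; check Firecrawl's /parse reference before relying on them:

```python
import requests

with open("local-report.pdf", "rb") as f:  # hypothetical local file
    resp = requests.post(
        "https://api.firecrawl.dev/v1/parse",  # assumed endpoint path
        headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
        files={"file": f},
    )
resp.raise_for_status()
print(resp.json().get("markdown", ""))  # assumed response key
```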