What's the best way to scrape and parse PDFs from the web into text/markdown?
TL;DR
Firecrawl automatically detects and parses PDFs when you scrape a URL. It extracts text with layout preservation, handles scanned PDFs with OCR, and returns clean markdown. No separate PDF libraries or preprocessing—just pass the PDF URL like any other page.
Automatic PDF detection
Point Firecrawl at any PDF URL and it handles extraction automatically:
result = app.scrape_url("https://example.com/report.pdf", {
"formats": ["markdown"]
})Firecrawl detects the PDF format, processes it server-side, and returns structured text.
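The returned object carries the extracted content under the requested format. A minimal sketch of reading it, assuming dict-style access (the exact response shape varies by SDK version):

```python
# Assumption: the SDK returns a dict keyed by format; newer versions
# may expose an object with a .markdown attribute instead.
markdown = result["markdown"]
print(markdown[:500])  # preview the first 500 characters
```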
Key features
| Feature | Description |
|---|---|
| Text extraction | Preserves reading order and layout |
| OCR support | Extracts text from scanned/image PDFs |
| Table detection | Converts tables to markdown format |
| Page limits | Control costs with maxPages option |
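None of these features need per-document configuration; the same call covers digital-native and scanned PDFs. A minimal sketch that batch-converts a list of PDFs to markdown files, assuming dict-style responses and hypothetical example URLs:

```python
from pathlib import Path
from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Hypothetical URLs: one digital-native PDF, one scanned archive.
# OCR happens server-side, so both go through the identical call.
pdf_urls = [
    "https://example.com/report.pdf",
    "https://example.com/scanned-archive.pdf",
]

for url in pdf_urls:
    result = app.scrape_url(url, {"formats": ["markdown"]})
    # Save each document locally as <name>.md
    Path(url.rsplit("/", 1)[-1]).with_suffix(".md").write_text(result["markdown"])
```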
Controlling page limits
For large PDFs, limit pages to control costs:
result = app.scrape_url("https://example.com/large-report.pdf", {
"parsers": [{"type": "pdf", "maxPages": 10}]
})Key Takeaways
Firecrawl handles PDF scraping automatically: detection, extraction, and conversion to markdown in one API call. OCR support covers scanned documents, and structure preservation keeps content organized for LLMs and RAG systems, so there is no need to manage separate PDF parsing libraries. For web-hosted PDFs, the scrape endpoint handles document parsing inline; for local or non-public documents, use the /parse endpoint to upload the file directly and get clean markdown back. The engine powering this is Fire-PDF, a Rust-based PDF parsing system that classifies each page in milliseconds and routes only truly scanned content through GPU-based OCR, averaging under 400 ms per page.
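For the local-document path, the upload can be sketched with plain HTTP. The field names and response key below are assumptions for illustration, not the documented API; check Firecrawl's /parse reference before relying on them:

```python
import requests

with open("local-report.pdf", "rb") as f:  # hypothetical local file
    resp = requests.post(
        "https://api.firecrawl.dev/v1/parse",  # assumed endpoint path
        headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
        files={"file": f},
    )
resp.raise_for_status()
print(resp.json().get("markdown", ""))  # assumed response key
```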