What's the best way to scrape and parse PDFs from the web into text/markdown?
TL;DR
Firecrawl automatically detects and parses PDFs when you scrape a URL. It extracts text while preserving layout, runs OCR on scanned PDFs, and returns clean markdown. You don't need separate PDF libraries or preprocessing: pass the PDF URL like any other page.
Automatic PDF detection
Point Firecrawl at any PDF URL and it handles extraction automatically:
result = app.scrape_url("https://example.com/report.pdf", {
    "formats": ["markdown"]
})

Firecrawl detects the PDF format, processes it server-side, and returns structured text.
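How Firecrawl detects PDFs internally is not described here. As a rough illustration of the general idea, a scraper can tell a PDF from an HTML page by checking the `Content-Type` header and the file signature (PDF files begin with the bytes `%PDF-`). The helper name `looks_like_pdf` and its parameters are hypothetical and are not part of the Firecrawl SDK.

```python
from typing import Optional


def looks_like_pdf(first_bytes: bytes, content_type: Optional[str] = None) -> bool:
    """Heuristic PDF check. Illustrative only, not Firecrawl's implementation."""
    # Signal 1: the HTTP Content-Type header, ignoring parameters such as charset.
    if content_type is not None:
        media_type = content_type.split(";", 1)[0].strip().lower()
        if media_type == "application/pdf":
            return True
    # Signal 2: the file signature. Servers sometimes send PDFs with a generic
    # type such as application/octet-stream, so the leading bytes are checked too.
    return first_bytes.lstrip().startswith(b"%PDF-")
```

Checking both signals matters because a URL ending in `.pdf` can still return an HTML error page, and a URL without that extension can still return a PDF.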
Key features
| Feature | Description |
|---|---|
| Text extraction | Preserves reading order and layout |
| OCR support | Extracts text from scanned/image PDFs |
| Table detection | Converts tables to markdown format |
| Page limits | Control costs with maxPages option |
Controlling page limits
For large PDFs, limit pages to control costs:
result = app.scrape_url("https://example.com/large-report.pdf", {
    "parsers": [{"type": "pdf", "maxPages": 10}]
})

Key Takeaways
Firecrawl handles PDF scraping automatically: it detects the PDF, extracts the text, and converts it to markdown in one API call. OCR support covers scanned documents, and structure preservation keeps content organized for LLMs and RAG systems. You can skip the complexity of managing separate PDF parsing libraries.
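To illustrate why preserved structure helps RAG pipelines, here is a minimal sketch that splits markdown text into chunks at each heading. It assumes the markdown has already been obtained as a plain string; the function name `split_markdown_by_heading` and the chunk shape are hypothetical and are not part of Firecrawl.

```python
import re
from typing import Dict, List

# Matches ATX-style markdown headings such as "# Title" or "### Subsection".
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_markdown_by_heading(markdown: str) -> List[Dict[str, str]]:
    """Split markdown into chunks, one per heading (illustrative sketch).

    Limitation: lines starting with '#' inside code fences are also
    treated as headings.
    """
    chunks: List[Dict[str, str]] = []
    title = ""            # text before the first heading gets an empty title
    body: List[str] = []

    def add_chunk(chunk_title: str, lines: List[str]) -> None:
        text = "\n".join(lines).strip()
        if chunk_title or text:   # skip completely empty chunks
            chunks.append({"title": chunk_title, "text": text})

    for line in markdown.splitlines():
        match = HEADING.match(line)
        if match:
            add_chunk(title, body)          # close the previous chunk
            title = match.group(2).strip()
            body = []
        else:
            body.append(line)
    add_chunk(title, body)                  # close the final chunk
    return chunks
```

Each chunk keeps its heading as a title, so a retriever can embed or filter sections on their own instead of treating the whole document as one block.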