What is the difference between scanned and text-based PDFs for data extraction?

Text-based PDFs contain embedded, selectable text that can be read directly from the file structure. Scanned PDFs are rasterized images of physical or digital documents, with no embedded text at all. Every character visible on screen exists only as pixels, so OCR must run before any extraction tool can return usable text. Many documents mix both: a contract may have a text-based cover page and scanned signature pages mid-document.

| Factor | Text-based PDF | Scanned PDF |
| --- | --- | --- |
| Text access | Directly readable | Requires OCR |
| Extraction speed | Fast | Slower (OCR inference) |
| Table accuracy | High for structured layouts | Depends on scan quality |
| Copy-paste in viewer | Works | Blocked or garbled |
| Parser compatibility | Most tools work | Needs dedicated OCR pipeline |

Knowing which type you have matters before choosing a parsing approach. PDFs linked from government portals, academic repositories, and financial databases are frequently scanned, while annual reports and contracts downloaded from corporate sites are often text-based. When a library returns empty strings or garbled output, a scanned source is usually why.
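The "empty strings or garbled output" symptom can be turned into a rough check. The sketch below is a heuristic, not a definitive detector: it assumes you already have one extracted string per page from whatever PDF library you use, and the function names and thresholds (`classify_page`, `min_chars`, `min_readable_ratio`) are hypothetical values chosen for illustration.

```python
# Characters treated as "readable" in addition to letters and digits.
# This set is an assumption made for the sketch; adjust it for your documents.
COMMON_PUNCTUATION = set(".,;:!?'\"()-%$/&")


def classify_page(text: str, min_chars: int = 20, min_readable_ratio: float = 0.7) -> str:
    """Label one page's extracted text as 'text' or 'needs_ocr'."""
    stripped = "".join(text.split())  # drop all whitespace
    if len(stripped) < min_chars:
        # Empty or near-empty output: most likely an image-only page.
        return "needs_ocr"
    readable = sum(ch.isalnum() or ch in COMMON_PUNCTUATION for ch in stripped)
    if readable / len(stripped) < min_readable_ratio:
        # Mostly symbols or replacement characters: treat as garbled.
        return "needs_ocr"
    return "text"


def classify_document(pages: list[str]) -> str:
    """Summarize a document as 'text-based', 'scanned', 'mixed', or 'unknown'."""
    labels = {classify_page(page) for page in pages}
    if not labels:
        return "unknown"  # no pages were supplied
    if labels == {"text"}:
        return "text-based"
    if labels == {"needs_ocr"}:
        return "scanned"
    return "mixed"
```

Running `classify_document` on per-page output before choosing a pipeline gives an early signal about whether OCR is needed. Garbled text can also come from font-encoding problems in a text-based PDF, so a `needs_ocr` label is a hint and not proof of a scan.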

Firecrawl's document parsing handles both types automatically. The default auto mode extracts embedded text and falls back to OCR for scanned pages. Pass mode: "ocr" to force OCR on every page regardless of content type. Either way, the output is clean Markdown ready for downstream processing, with no parser configuration or OCR pipeline to set up.
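The auto-versus-forced behavior described above can be pictured as a per-page dispatch. The following is a conceptual sketch only, not Firecrawl's implementation or API: `extract_pages` and `run_ocr` are hypothetical names, and the OCR step is passed in as a callable so the example stays self-contained. Only the mode names `auto` and `ocr` come from the text.

```python
from typing import Callable, Sequence


def extract_pages(
    embedded_text: Sequence[str],
    run_ocr: Callable[[int], str],
    mode: str = "auto",
) -> list[str]:
    """Return one text string per page.

    embedded_text: text already read from each page ("" for image-only pages).
    run_ocr: hypothetical callable that OCRs the page at the given index.
    mode: "auto" uses embedded text and falls back to OCR for empty pages;
          "ocr" runs OCR on every page regardless of content.
    """
    if mode not in ("auto", "ocr"):
        raise ValueError(f"unsupported mode: {mode!r}")

    results = []
    for index, text in enumerate(embedded_text):
        if mode == "ocr" or not text.strip():
            results.append(run_ocr(index))  # forced OCR, or fallback for an empty page
        else:
            results.append(text)  # embedded text is usable as-is
    return results
```

In this sketch, auto mode only pays the OCR cost on pages that have no embedded text, which is why mixed documents still process quickly. Forcing `ocr` is the slower option, useful mainly when embedded text exists but cannot be trusted.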

Last updated: Mar 01, 2026