What is the difference between scanned and text-based PDFs for data extraction?
Text-based PDFs contain embedded, selectable text that a parser can read directly from the file structure. Scanned PDFs are rasterized images of physical or digital documents, with no embedded text at all: every character visible on screen exists only as pixels, so OCR must run before any extraction tool can return usable text. Many documents mix both: a contract may have a text-based cover page followed by scanned pages whose handwritten signature blocks OCR handles poorly.
| Factor | Text-based PDF | Scanned PDF |
|---|---|---|
| Text access | Directly readable | Requires OCR |
| Extraction speed | Fast | Slower (OCR inference) |
| Table accuracy | High for structured layouts | Depends on scan quality |
| Copy-paste in viewer | Works | Blocked or garbled |
| Parser compatibility | Most tools work | Needs dedicated OCR pipeline |
Knowing which type you have matters before choosing a parsing approach, and the source alone does not settle it. PDFs linked from government portals, academic repositories, and financial databases can be either type: older archived material is often scanned, while recent born-digital files are usually text-based. Annual reports and contracts downloaded from corporate sites are often text-based. When a library returns empty strings or garbled output, a scanned source is usually why.
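One rough way to check which type you have is to extract text page by page and see which pages come back nearly empty. The sketch below assumes the third-party `pypdf` library for extraction; the helper names and the 20-character threshold are illustrative choices, not a standard.

```python
from typing import List


def classify_pages(page_texts: List[str], min_chars: int = 20) -> List[str]:
    """Label each page 'text' or 'scanned' by how much text was extracted.

    min_chars is an arbitrary illustrative threshold; tune it for your files.
    """
    labels = []
    for text in page_texts:
        # Count non-whitespace characters so blank or whitespace-only
        # extractions are treated as pages without embedded text.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        labels.append("text" if visible >= min_chars else "scanned")
    return labels


def summarize(labels: List[str]) -> str:
    """Collapse per-page labels into a document-level verdict."""
    kinds = set(labels)
    if not kinds:
        return "empty"
    if kinds == {"text"}:
        return "text-based"
    if kinds == {"scanned"}:
        return "scanned"
    return "mixed"


def extract_page_texts(path: str) -> List[str]:
    """Extract embedded text per page (assumes pypdf is installed)."""
    from pypdf import PdfReader  # third-party: pip install pypdf

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]
```

Calling `summarize(classify_pages(extract_page_texts("contract.pdf")))` on a hypothetical file would return `"text-based"`, `"scanned"`, or `"mixed"`. A scanned PDF that already carries a hidden OCR text layer will be labeled as text by this check, so treat the result as a hint rather than a guarantee.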
Firecrawl's document parsing handles both types automatically. The default auto mode extracts embedded text and falls back to OCR for scanned pages. Pass mode: "ocr" to force OCR on every page regardless of content type. Either way, the output is clean Markdown ready for downstream processing, with no parser configuration or OCR pipeline to set up.
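As a hedged illustration of choosing between the two modes, the sketch below only builds a request body and makes no network call. The `"auto"` and `"ocr"` mode values come from the text above. The surrounding field names (`url`, `formats`, `parsers`, `type`) and the helper `build_scrape_request` are assumptions for illustration and should be checked against the current Firecrawl API reference.

```python
import json

# Mode values named in the text above.
VALID_MODES = {"auto", "ocr"}


def build_scrape_request(pdf_url: str, mode: str = "auto") -> dict:
    """Build a request body for parsing a PDF (hypothetical helper).

    ASSUMPTION: the PDF mode is nested under a "parsers" list; verify the
    exact field layout in the Firecrawl documentation before using it.
    """
    if mode not in VALID_MODES:
        raise ValueError(f"unsupported mode: {mode!r}")
    return {
        "url": pdf_url,
        "formats": ["markdown"],  # ask for Markdown output
        "parsers": [{"type": "pdf", "mode": mode}],  # assumed field layout
    }


if __name__ == "__main__":
    # Force OCR on every page of a (placeholder) scanned document.
    body = build_scrape_request("https://example.com/scanned.pdf", mode="ocr")
    print(json.dumps(body, indent=2))
```

Leaving `mode` at `"auto"` suits mixed documents, since only the scanned pages pay the OCR cost. Forcing `"ocr"` is mainly useful when the embedded text layer is known to be unreliable.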