What is the difference between scanned and text-based PDFs for data extraction?

Text-based PDFs contain embedded, selectable text that can be read directly from the file structure. Scanned PDFs are rasterized images of physical or digital documents and, unless an OCR text layer was added afterward, contain no embedded text at all. Every character visible on screen exists only as pixels, so OCR must run before any extraction tool can return usable text. Many documents mix both: a contract may have a text-based cover page followed by scanned pages whose handwritten signature blocks OCR handles poorly.

| Factor | Text-based PDF | Scanned PDF |
| --- | --- | --- |
| Text access | Directly readable | Requires OCR |
| Extraction speed | Fast | Slower (OCR inference) |
| Table accuracy | High for structured layouts | Depends on scan quality |
| Copy-paste in viewer | Works | Blocked or garbled |
| Parser compatibility | Most tools work | Needs dedicated OCR pipeline |

Identify which type you have before choosing a parsing approach. Many PDFs linked from government portals, academic repositories, and financial databases are scanned, particularly older or archived records. Annual reports and contracts downloaded from corporate sites are often text-based. When a library returns empty strings or garbled output, a scanned source is usually the cause.
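The "empty strings" symptom can be turned into a rough per-page check. The sketch below is one possible heuristic, not a definitive detector: it assumes you already have one extracted string per page, and the names `classify_pages`, `summarize`, and the `min_chars` threshold of 20 are illustrative choices rather than part of any library. It catches empty or near-empty pages only; garbled output needs a separate check.

```python
from dataclasses import dataclass


@dataclass
class PageReport:
    page_number: int      # 1-based page index
    char_count: int       # non-whitespace characters extracted from the page
    likely_scanned: bool  # True when almost no text came back


def classify_pages(page_texts, min_chars=20):
    """Flag pages whose extracted text is too short to be a real text layer.

    `min_chars` is an assumed threshold; tune it for your documents.
    """
    reports = []
    for number, text in enumerate(page_texts, start=1):
        # Count only non-whitespace characters so blank lines do not count as text.
        char_count = sum(1 for ch in (text or "") if not ch.isspace())
        reports.append(PageReport(number, char_count, char_count < min_chars))
    return reports


def summarize(reports):
    """Return 'empty', 'text-based', 'scanned', or 'mixed' for the document."""
    if not reports:
        return "empty"
    scanned = sum(r.likely_scanned for r in reports)
    if scanned == 0:
        return "text-based"
    if scanned == len(reports):
        return "scanned"
    return "mixed"
```

The per-page strings can come from any extraction library, for example `page.extract_text()` on each page of a pypdf `PdfReader`. A result of "scanned" or "mixed" suggests that OCR is needed for at least part of the file.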

Firecrawl's document parsing handles both types automatically. The default auto mode extracts embedded text and falls back to OCR for scanned pages. Pass mode: "ocr" to force OCR on every page regardless of content type. Either way, the output is clean Markdown ready for downstream processing, with no parser configuration or OCR pipeline to set up. The underlying PDF parsing engine, Fire-PDF, classifies each page in milliseconds by analyzing PDF internals. It routes text-based pages to fast native extraction and sends only truly scanned pages through GPU-based OCR, which avoids unnecessary OCR work on mixed documents.
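The routing behavior described above can be illustrated with a small sketch. This is not Fire-PDF's implementation or Firecrawl's API: `plan_extraction` and the `page_has_text` flags are hypothetical names, and the sketch models only the two modes named here, "auto" and "ocr".

```python
def plan_extraction(page_has_text, mode="auto"):
    """Choose an extraction path for each page.

    page_has_text: one boolean per page, True if the page has embedded text.
    mode: "auto" uses OCR only for pages without embedded text;
          "ocr" forces OCR on every page.
    Returns a list of "native" or "ocr", one entry per page.
    """
    if mode not in ("auto", "ocr"):
        raise ValueError(f"unsupported mode in this sketch: {mode!r}")
    if mode == "ocr":
        # Forced OCR ignores the per-page classification entirely.
        return ["ocr"] * len(page_has_text)
    # Auto mode: native extraction where text exists, OCR as the fallback.
    return ["native" if has_text else "ocr" for has_text in page_has_text]


# A three-page mixed document: text cover page, scanned page, text appendix.
print(plan_extraction([True, False, True]))              # ['native', 'ocr', 'native']
print(plan_extraction([True, False, True], mode="ocr"))  # ['ocr', 'ocr', 'ocr']
```

In auto mode only the middle page pays the OCR cost, which is why per-page routing matters for mixed documents.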

Last updated: Mar 01, 2026