What is the difference between scanned and text-based PDFs for data extraction?

Text-based PDFs contain embedded, selectable text readable directly from the file structure. Scanned PDFs are rasterized images of physical or digital documents, with no embedded text at all. Every character visible on screen exists only as pixels, which means OCR must run before any extraction tool can return usable text. Many documents mix both: a scanned contract may have a text-based cover page and OCR-resistant signature blocks mid-document.

Factor	Text-based PDF	Scanned PDF
Text access	Directly readable	Requires OCR
Extraction speed	Fast	Slower (OCR inference)
Table accuracy	High for structured layouts	Depends on scan quality
Copy-paste in viewer	Works	Blocked or garbled
Parser compatibility	Most tools work	Needs dedicated OCR pipeline

Knowing which type you have matters before choosing a parsing approach. Many PDFs linked from government portals, academic repositories, and financial databases, particularly older archived records, are scanned. Annual reports and contracts downloaded from corporate sites are often text-based. When a library returns empty strings or garbled output, a scanned source is usually why.
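
One rough way to check which type you have is to look at how much text a native extractor returns per page. The sketch below is a heuristic, not a definitive detector: the function names and the `min_chars` threshold of 20 are assumptions made for illustration, and it expects you to supply the per-page text yourself.

```python
from typing import List, Optional


def classify_pages(page_texts: List[Optional[str]], min_chars: int = 20) -> List[str]:
    """Label each page 'text' or 'scanned' based on its natively extracted text.

    min_chars is an assumed threshold; tune it for your documents.
    """
    labels = []
    for text in page_texts:
        # Count visible characters only: scanned pages usually yield an
        # empty string, None, or a few stray whitespace characters.
        visible = sum(1 for ch in (text or "") if not ch.isspace())
        labels.append("text" if visible >= min_chars else "scanned")
    return labels


def classify_document(page_texts: List[Optional[str]], min_chars: int = 20) -> str:
    """Summarize a document as 'text-based', 'scanned', 'mixed', or 'empty'."""
    labels = set(classify_pages(page_texts, min_chars))
    if not labels:
        return "empty"
    if labels == {"text"}:
        return "text-based"
    if labels == {"scanned"}:
        return "scanned"
    return "mixed"
```

With a library such as pypdf, the input could be built as `[page.extract_text() for page in PdfReader(path).pages]`. A scan that already carries a hidden OCR text layer would be labeled as text by this check, which is usually the desired outcome for extraction.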

Firecrawl handles both types automatically. For web-hosted PDFs, scrape with document parsing detects scanned pages and falls back to OCR. For local or non-public documents, upload them to the /parse endpoint, which exposes the same auto and ocr modes. The default auto mode extracts embedded text and falls back to OCR for scanned pages. Pass mode: "ocr" to force OCR on every page regardless of content type. Either way, the output is clean Markdown ready for downstream processing, with no parser configuration or OCR pipeline to set up.

The underlying PDF parsing engine, Fire-PDF, classifies each page in milliseconds by analyzing PDF internals. It routes text-based pages to fast native extraction and sends only truly scanned pages through GPU-based OCR, which eliminates unnecessary processing for mixed documents. For a comparison of how different AI PDF parsers (including Docling, Marker-PDF, and LlamaParse) handle scanned vs. text-based content, see the best PDF parsers guide.
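
The difference between the auto and ocr modes can be pictured as a per-page routing decision. The sketch below only illustrates that idea and is not Firecrawl's or Fire-PDF's implementation: `extract_pages` and the `run_ocr` callback are hypothetical names, and a simple "is there any native text?" check stands in for the real classifier, which the text above says inspects PDF internals.

```python
from typing import Callable, List, Optional


def extract_pages(
    native_texts: List[Optional[str]],
    run_ocr: Callable[[int], str],
    mode: str = "auto",
) -> List[str]:
    """Return one text string per page.

    native_texts: text already extracted from each page's embedded layer
                  (empty or None for scanned pages).
    run_ocr:      hypothetical callback that OCRs the page at a given index.
    mode:         'auto' uses native text when present and OCR otherwise;
                  'ocr' forces OCR on every page.
    """
    if mode not in ("auto", "ocr"):
        raise ValueError(f"unsupported mode: {mode!r}")

    results = []
    for index, native in enumerate(native_texts):
        has_text = bool(native and native.strip())
        if mode == "auto" and has_text:
            # Text-based page: skip OCR entirely.
            results.append(native)
        else:
            # Scanned page, or OCR forced for every page.
            results.append(run_ocr(index))
    return results
```

In auto mode the OCR callback runs only for pages without embedded text, which is where the savings on mixed documents come from. Forcing ocr mode costs more per page but can help when the embedded text layer is present yet unreliable.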
