What is the difference between scanned and text-based PDFs for data extraction?

Text-based PDFs contain embedded, selectable text that can be read directly from the file structure. Scanned PDFs are rasterized images of physical or digital documents, with no embedded text at all. Every character visible on screen exists only as pixels, which means OCR must run before any extraction tool can return usable text. Many documents mix both: a contract may have a text-based cover page and scanned signature blocks mid-document.

| Factor | Text-based PDF | Scanned PDF |
| --- | --- | --- |
| Text access | Directly readable | Requires OCR |
| Extraction speed | Fast | Slower (OCR inference) |
| Table accuracy | High for structured layouts | Depends on scan quality |
| Copy-paste in viewer | Works | Blocked or garbled |
| Parser compatibility | Most tools work | Needs dedicated OCR pipeline |

Knowing which type you have matters before choosing a parsing approach. Many PDFs linked from government portals, academic repositories, and financial databases are scanned. Annual reports and contracts downloaded from corporate sites are often text-based. When a library returns empty strings or garbled output, a scanned source is usually why.
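The empty-output symptom described above can be turned into a rough pre-check. The sketch below labels each page by how much text a parser managed to extract from it. The function names and the `min_chars` threshold are hypothetical choices for illustration, and the optional helper assumes the `pypdf` package is installed; this is a heuristic under those assumptions, not how any particular tool classifies pages.

```python
from typing import List


def classify_pages(page_texts: List[str], min_chars: int = 20) -> List[str]:
    """Label each page 'text' or 'scanned' from its extracted text.

    A page with fewer than `min_chars` alphanumeric characters is treated
    as likely scanned. The threshold is an illustrative assumption.
    """
    labels = []
    for text in page_texts:
        # Count only letters and digits so whitespace and stray symbols
        # do not make an empty page look like a text page.
        alnum_count = sum(1 for ch in (text or "") if ch.isalnum())
        labels.append("text" if alnum_count >= min_chars else "scanned")
    return labels


def extract_page_texts(path: str) -> List[str]:
    """Return the embedded text of each page (requires pypdf)."""
    from pypdf import PdfReader  # imported lazily so the classifier works without it

    reader = PdfReader(path)
    return [page.extract_text() or "" for page in reader.pages]


if __name__ == "__main__":
    sample = ["", "This page has plenty of embedded text to read.", "  \n"]
    print(classify_pages(sample))
```

For a real file, `classify_pages(extract_page_texts("contract.pdf"))` would give one label per page. A mix of labels suggests a mixed document, where only some pages need OCR.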

Firecrawl handles both types automatically. For web-hosted PDFs, scrape with document parsing detects scanned pages and falls back to OCR. For local or non-public documents, upload them to the /parse endpoint, which exposes the same auto and ocr modes. The default auto mode extracts embedded text and falls back to OCR for scanned pages. Pass mode: "ocr" to force OCR on every page regardless of content type. Either way, the output is clean Markdown ready for downstream processing, with no parser configuration or OCR pipeline to set up.

The underlying PDF parsing engine, Fire-PDF, classifies each page in milliseconds by analyzing PDF internals. It routes text-based pages to fast native extraction and sends only truly scanned pages through GPU-based OCR, which avoids unnecessary OCR work on mixed documents. For a comparison of how different AI PDF parsers handle scanned vs. text-based content, including Docling, Marker-PDF, and LlamaParse, see the best PDF parsers guide.
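As a rough illustration of choosing between the two modes, the sketch below builds a JSON request body for scraping a web-hosted PDF. Only the mode values `auto` and `ocr` come from the description above. The field names `url`, `formats`, `parsers`, and `type`, and the helper name `build_scrape_payload`, are assumptions made for this example and should be checked against the current API reference before use. No request is sent.

```python
import json

# The two modes named in the text above.
SUPPORTED_MODES = {"auto", "ocr"}


def build_scrape_payload(pdf_url: str, mode: str = "auto") -> dict:
    """Build a request body for scraping a web-hosted PDF.

    The payload layout is an assumption for illustration, not a
    confirmed schema. Only the `mode` values are taken from the text.
    """
    if mode not in SUPPORTED_MODES:
        raise ValueError(f"mode must be one of {sorted(SUPPORTED_MODES)}, got {mode!r}")
    return {
        "url": pdf_url,
        "formats": ["markdown"],  # assumed field: ask for Markdown output
        "parsers": [{"type": "pdf", "mode": mode}],  # assumed field layout
    }


if __name__ == "__main__":
    # Force OCR on every page, e.g. when embedded text is known to be garbled.
    payload = build_scrape_payload("https://example.com/report.pdf", mode="ocr")
    print(json.dumps(payload, indent=2))
```

Leaving `mode` at its default keeps the fallback behavior, so OCR runs only on pages that need it. Forcing `ocr` is mainly useful when a file has an embedded text layer that is present but unreliable.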

Last updated: Mar 01, 2026