
How do you convert PDFs to RAG-ready data?

To convert PDFs to RAG-ready data, extract the text into clean, structured Markdown using a document parser, then split it into chunks for embedding in a vector store. The extraction step is the hardest part: rule-based parsers work for machine-generated PDFs with consistent layouts but fail on scanned documents, complex tables, or PDFs where multi-column layouts produce garbled reading order.
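
As a minimal sketch of the second step, assuming the extraction stage has already produced clean Markdown (here a placeholder string), the chunking side looks like this with LangChain's splitter. The chunk size, overlap, and separators are illustrative starting points, not recommended defaults:

```python
# pip install langchain-text-splitters
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Placeholder: assume `markdown` holds the parser output for one document
markdown = "# Quarterly Report\n\n## Revenue\n\nRevenue grew 12% year over year...\n"

# Try Markdown-structure boundaries (headings, paragraphs) before
# falling back to raw character splits.
splitter = RecursiveCharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=200,
    separators=["\n## ", "\n### ", "\n\n", "\n", " "],
)
chunks = splitter.split_text(markdown)
# Each string in `chunks` is ready to embed and upsert into a vector store.
```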

| Factor | PyMuPDF / pdfplumber | LLM-based extraction | Firecrawl document parsing |
| --- | --- | --- | --- |
| Scanned PDF support | No | Yes | Yes (auto or OCR mode) |
| Table structure | Partial | Semantically understood | Preserved in Markdown |
| Reading order | Can garble multi-column | Correct | Correct |
| Output format | Raw text strings | Schema-defined JSON | Structured Markdown |
| Setup | Local install | API + prompt engineering | Single API call |

Use rule-based parsers for structured, machine-generated PDFs where extraction is deterministic. For mixed corpora (research papers, filings, contracts, reports) where formats vary, LLM-based extraction or a managed parsing API produces more consistent chunks.
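
For the rule-based path, a minimal PyMuPDF sketch looks like this (the file path is a placeholder). Note the failure modes from the table: `get_text()` returns nothing useful for scanned pages and can interleave columns in multi-column layouts.

```python
# pip install pymupdf
import fitz  # PyMuPDF

# Placeholder path: any machine-generated, text-based PDF
doc = fitz.open("report.pdf")

pages = []
for page in doc:
    # get_text() reads the embedded text layer; it is empty for scanned
    # pages and may garble reading order in multi-column layouts.
    pages.append(page.get_text())

text = "\n\n".join(pages)
doc.close()
```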

Firecrawl's document parsing converts any PDF URL into clean Markdown in one API call. The output preserves headings, tables, and lists in a format that chunking libraries like LangChain's RecursiveCharacterTextSplitter can split directly. For scanned sources, set mode: "ocr" to force OCR on every page before chunking. No local dependencies, no layout configuration needed. See the PDF parser v2 release for the v2 feature set, or Fire-PDF for Firecrawl's current Rust-based PDF parsing engine, which routes text-based pages to native extraction and only scanned content through GPU OCR, averaging under 400ms per page on mixed corpora.
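
A sketch of the one-call flow against the REST API, using plain requests. The exact payload shape for the OCR option is an assumption based on the mode: "ocr" setting described above, so verify field names against the current API reference; the URL is a placeholder:

```python
# pip install requests langchain-text-splitters
import os
import requests
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Assumed endpoint and payload shape; check the current Firecrawl docs.
resp = requests.post(
    "https://api.firecrawl.dev/v2/scrape",
    headers={"Authorization": f"Bearer {os.environ['FIRECRAWL_API_KEY']}"},
    json={
        "url": "https://example.com/annual-report.pdf",  # placeholder URL
        "formats": ["markdown"],
        # Assumed placement of the OCR option described above; forces
        # OCR on every page for scanned sources.
        "parsers": [{"type": "pdf", "mode": "ocr"}],
    },
    timeout=120,
)
resp.raise_for_status()
markdown = resp.json()["data"]["markdown"]

# The Markdown output drops straight into the chunking step from earlier.
splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_text(markdown)
```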

Last updated: Mar 01, 2026