
With Firecrawl, you can already pull clean markdown from any URL, including PDFs hosted on the web. But a lot of the documents you need to process (contracts, reports, invoices, uploaded files) live on disk, not on the web. Today we're launching /parse, so you can upload files directly and get back the same clean, structured output Firecrawl returns for web pages.
## What is Firecrawl /parse?
/parse runs local files through the same parsing engine that powers /scrape. PDFs come back with reading order preserved and tables intact. Word docs shed their XML noise. Spreadsheets become clean tabular markdown. You can ask for a summary or structured JSON extraction in the same call. No post-processing needed.
Supported formats: PDF, DOCX, DOC, ODT, RTF, XLSX, XLS, and HTML. Files up to 50 MB.
## A Rust-based engine that's up to 5x faster
Under the hood, /parse is powered by a Rust-based engine averaging under 400ms per page. Instead of routing every document through OCR, it classifies pages first and only sends what actually needs it to the GPU.
- Native extraction for text-based pages. Our open-source Rust library `pdf-inspector` reads PDF internals (fonts, text operators, image coverage) to pull text directly in milliseconds, without rendering.
- GPU only where it matters. Scanned and image-heavy pages get routed through a GPU fleet with lane-based isolation, so a 200-page report never slows down a single-page invoice.
- Layout-aware accuracy. A neural layout model detects tables, formulas, text blocks, and headers individually, then tunes parameters per region. Tables get higher token budgets, formulas are preserved in LaTeX, and reading order is predicted neurally for multi-column documents.
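The classify-then-route idea above can be sketched in a few lines. This is an illustrative model only, not Firecrawl's actual engine: the page representation and the text-coverage threshold are assumptions made for the sketch.

```python
# Conceptual sketch of classify-then-route (illustrative, not the real engine).
# Each "page" here is a dict with a text_coverage ratio: the fraction of the
# page area covered by extractable text rather than images.

def classify_page(page):
    """Label a page 'native' if mostly extractable text, 'ocr' if image-heavy.
    The 0.5 threshold is an assumption for illustration."""
    return "native" if page["text_coverage"] >= 0.5 else "ocr"

def route_pages(pages):
    """Split pages into fast native-extraction work and GPU OCR work."""
    native = [p for p in pages if classify_page(p) == "native"]
    scanned = [p for p in pages if classify_page(p) == "ocr"]
    return native, scanned

pages = [
    {"num": 1, "text_coverage": 0.9},  # born-digital text page
    {"num": 2, "text_coverage": 0.1},  # scanned image page
]
native, scanned = route_pages(pages)
```

The payoff is that only the second page would ever touch the GPU; the first is handled by direct text extraction.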
## How /parse makes document processing easier
### One pipeline for web pages and files
If you're already using Firecrawl to research the web, your pipeline can now also read email attachments, downloaded reports, and user-uploaded files.
```python
import requests
import json

with open("contract.pdf", "rb") as f:
    response = requests.post(
        "https://api.firecrawl.dev/v2/parse",
        headers={"Authorization": "Bearer fc-YOUR_API_KEY"},
        files={"file": f},
        data={
            "options": json.dumps({
                "formats": ["markdown", "json"],
                "json": {
                    "schema": {
                        "type": "object",
                        "properties": {
                            "parties": {"type": "array", "items": {"type": "string"}},
                            "effective_date": {"type": "string"},
                            "total_value": {"type": "string"}
                        }
                    }
                }
            })
        }
    )

data = response.json()["data"]
print(data["markdown"])
print(data["json"])
```

### Structured extraction from internal documents
Pass a JSON schema alongside your file and /parse returns typed fields like line items, dates, parties, and totals in a single call. No separate extraction layer required.
### RAG ingestion for user uploads
When users upload PDFs or DOCX files to your app, /parse turns them into embedding-ready markdown in one call. Structure is preserved, tables stay intact, and a summary comes back in the same response, ready to chunk and send to your vector store.
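From there, preparing the markdown for embedding is a short step. A minimal sketch of a fixed-window chunker; the window size and overlap are illustrative choices, not part of the /parse API:

```python
# Sketch: split /parse markdown output into overlapping chunks for embedding.
# Sizes are illustrative; tune max_chars and overlap for your embedding model.

def chunk_markdown(markdown: str, max_chars: int = 1000, overlap: int = 100):
    """Split text into character windows that overlap by `overlap` chars,
    so sentences cut at a boundary still appear whole in one chunk."""
    chunks = []
    start = 0
    while start < len(markdown):
        chunks.append(markdown[start:start + max_chars])
        start += max_chars - overlap
    return chunks
```

Each chunk can then be embedded and written to your vector store, with the /parse summary stored alongside as document-level metadata.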
## A few things to know
- 50 MB limit, fixed file types. HTML, PDF, DOCX, DOC, ODT, RTF, XLSX, and XLS are supported. Other formats return an `UNSUPPORTED_FILE_TYPE` error.
- Every call re-parses. Results are never cached, so repeat uploads of the same file are billed each time. Same credit model as /scrape: one call plus any LLM formats you request.
- Scanned PDFs depend on scan quality. Image-only PDFs go through OCR. Clean scans parse cleanly; low-resolution or handwritten scans produce lower-quality output.
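Since results are never cached server-side, repeat uploads of identical files are billed every time. One way to avoid that is a small client-side cache keyed on a content hash; a minimal sketch, where `call_parse` is a placeholder for your actual /parse request:

```python
# Sketch: dedupe repeat uploads client-side by hashing file bytes before
# calling /parse. The dict cache is illustrative; use Redis or disk in practice.
import hashlib

_cache = {}

def parse_with_cache(file_bytes: bytes, call_parse):
    """Return a cached result for identical bytes; otherwise call the API.
    `call_parse` is a hypothetical callable wrapping your /parse request."""
    key = hashlib.sha256(file_bytes).hexdigest()
    if key not in _cache:
        _cache[key] = call_parse(file_bytes)
    return _cache[key]
```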
## Try it today
/parse is available now for all Firecrawl API users. Send it a document, get back clean context your agents can use.