How do you scrape PDFs from a website?
To scrape PDFs from a website, first crawl the page to collect PDF URLs, then pass each link to a document parser that converts the file to structured text. The approach depends on how PDFs are served: directly linked .pdf files are straightforward to collect and parse, but embedded JavaScript viewers (like Scholarvox or FlipHTML5) load documents page by page, making the underlying file inaccessible through a plain HTTP request and requiring browser automation or a managed API instead.
| Source type | How PDFs are served | Approach |
|---|---|---|
| Direct links | Static .pdf URL in anchor tag | Crawl page, collect URLs, pass to parser |
| Embedded viewer (JS) | Document loaded page by page via JS | Browser automation or managed API |
| Authenticated portal | PDF behind login or session | Auth flow required before access |
| Paginated viewer | Each page fetched as a separate request | Intercept requests or use API |
For directly linked PDFs, the crawl-then-parse flow is all you need. For embedded or paginated viewers, the underlying PDF URL may never appear in the page's HTML, so you need tooling that drives a real browser and intercepts the document fetch as it happens. Authenticated portals add one step: establish a logged-in session before requesting the file.
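As a rough illustration of the direct-link case, the sketch below collects `.pdf` links from a page's HTML using only the Python standard library. The names `PdfLinkCollector` and `collect_pdf_links` are hypothetical, and the sketch assumes you already have the HTML as a string; it does not handle JavaScript viewers or authenticated portals.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs from <a href> values whose path ends in .pdf."""

    def __init__(self, base_url: str) -> None:
        super().__init__()
        self.base_url = base_url
        self.pdf_urls: list[str] = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Inspect only the path, so query strings such as ?download=1 are ignored.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def collect_pdf_links(html: str, base_url: str) -> list[str]:
    """Return de-duplicated PDF URLs in the order they appear in the HTML."""
    parser = PdfLinkCollector(base_url)
    parser.feed(html)
    return parser.pdf_urls


if __name__ == "__main__":
    sample = '<a href="/files/report.pdf">Report</a> <a href="about.html">About</a>'
    print(collect_pdf_links(sample, "https://example.com/docs/"))
```

Matching on the file extension is a heuristic: some sites serve PDFs from extensionless URLs, in which case checking the response's `Content-Type` header is the more reliable signal.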
For web-hosted PDFs, Firecrawl's scrape endpoint with document parsing accepts any accessible PDF URL and returns clean Markdown. Use the Crawl API to extract all PDF links from a site first, then pass each URL to scrape. For local or non-public documents, upload the files directly to the /parse endpoint, which returns the same clean Markdown output. For scanned PDFs, set mode: "ocr" to run OCR across all pages before returning content. The engine handling this is Fire-PDF, which classifies pages automatically and processes text-based content natively without a GPU, averaging under 400 ms per page for mixed documents.

For a side-by-side comparison of PDF parsers for web scraping and AI workflows, including tools like Docling, Marker-PDF, and LlamaParse, see the dedicated guide. For a broader evaluation of managed document parsing APIs, covering PDF parsing APIs, document extraction APIs, and AI document processing services, see the best document parsing APIs guide.

A basic scrape call for a web-hosted PDF looks like this:
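```python
from firecrawl import Firecrawl

# Replace the placeholder with your own API key.
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Scrape a publicly accessible PDF URL and request Markdown output.
doc = firecrawl.scrape("https://example.com/report.pdf", formats=["markdown"])
print(doc.markdown)
```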
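When a crawl yields many PDF URLs, one way to process them is a small loop that keeps going when an individual document fails. The following is a minimal sketch, not part of any SDK: `parse_pdfs` and its `fetch_markdown` parameter are hypothetical names, and the parser is passed in as a plain callable so the loop stays independent of any particular client.

```python
from typing import Callable, Dict, Iterable, Tuple


def parse_pdfs(
    urls: Iterable[str],
    fetch_markdown: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run fetch_markdown over each URL, separating successes from failures.

    Returns (parsed, failed): parsed maps URL -> Markdown text,
    failed maps URL -> error message.
    """
    parsed: Dict[str, str] = {}
    failed: Dict[str, str] = {}
    for url in urls:
        try:
            parsed[url] = fetch_markdown(url)
        except Exception as exc:  # one unreadable PDF should not stop the batch
            failed[url] = str(exc)
    return parsed, failed
```

With the client from the example above, the callable could be something like `lambda url: firecrawl.scrape(url, formats=["markdown"]).markdown`. Recording failures separately makes it easy to retry only the URLs that did not parse.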