How do you scrape PDFs from a website?
To scrape PDFs from a website, first crawl the page to collect PDF URLs, then pass each link to a document parser that converts the file to structured text. The approach depends on how PDFs are served: directly linked .pdf files are straightforward to collect and parse, but embedded JavaScript viewers (like Scholarvox or FlipHTML5) load documents page by page, making the underlying file inaccessible through a plain HTTP request and requiring browser automation or a managed API instead.
| Source type | How PDFs are served | Approach |
|---|---|---|
| Direct links | Static .pdf URL in anchor tag | Crawl page, collect URLs, pass to parser |
| Embedded viewer (JS) | Document loaded page by page via JS | Browser automation or managed API |
| Authenticated portal | PDF behind login or session | Auth flow required before access |
| Paginated viewer | Each page fetched as a separate request | Intercept requests or use API |
For directly linked PDFs, the flow is simple: crawl the page, collect the `.pdf` URLs, and pass each one to a document parser. For embedded or paginated viewers, the underlying PDF URL may never be exposed, so you need tooling that drives a real browser and intercepts the actual document fetch.
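For the direct-link case, the crawl step needs nothing beyond the standard library. A minimal sketch (the page URL and HTML snippet here are illustrative; in practice the HTML would come from an HTTP GET of the page):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PDFLinkCollector(HTMLParser):
    """Collects absolute URLs of anchors that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Ignore query strings when checking the extension.
        if href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))

html = '<a href="/files/report.pdf">Report</a> <a href="/about">About</a>'
collector = PDFLinkCollector("https://example.com/docs/")
collector.feed(html)
print(collector.pdf_urls)  # → ['https://example.com/files/report.pdf']
```

This only finds PDFs that appear as anchor `href`s in the static HTML; anything injected by JavaScript falls into the embedded-viewer cases above.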
Firecrawl's document parsing accepts any accessible PDF URL and returns clean Markdown. Use the Crawl API to extract all PDF links from a site first, then pass each URL to the scrape endpoint with document parsing enabled. For scanned PDFs, set `mode: "ocr"` to run OCR across all pages before returning content.
```python
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape("https://example.com/report.pdf", formats=["markdown"])
print(doc.markdown)
```
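Once the links are collected, the crawl-then-parse loop is just iteration with deduplication and per-URL error handling. A sketch with a stand-in `scrape` callable (in practice this would wrap the Firecrawl scrape call above; the function and names here are illustrative):

```python
def parse_pdfs(pdf_urls, scrape):
    """Run a parser over each collected PDF URL, skipping duplicates
    and recording per-URL failures instead of aborting the batch."""
    results, errors, seen = {}, {}, set()
    for url in pdf_urls:
        if url in seen:
            continue
        seen.add(url)
        try:
            results[url] = scrape(url)
        except Exception as exc:
            errors[url] = exc
    return results, errors

# Demo with a fake scraper standing in for the real API call.
fake = lambda url: f"# markdown for {url}"
docs, errs = parse_pdfs(
    ["https://example.com/a.pdf", "https://example.com/a.pdf"], fake
)
print(len(docs), len(errs))  # → 1 0
```

Collecting failures instead of raising matters in practice: large sites usually contain a few dead or malformed PDF links, and one bad URL shouldn't abort the whole batch.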