
How do you scrape PDFs from a website?

To scrape PDFs from a website, first crawl the page to collect PDF URLs, then pass each link to a document parser that converts the file to structured text. The approach depends on how PDFs are served: directly linked .pdf files are straightforward to collect and parse, but embedded JavaScript viewers (like Scholarvox or FlipHTML5) load documents page by page, making the underlying file inaccessible through a plain HTTP request and requiring browser automation or a managed API instead.

| Source type | How PDFs are served | Approach |
| --- | --- | --- |
| Direct links | Static .pdf URL in anchor tag | Crawl page, collect URLs, pass to parser |
| Embedded viewer (JS) | Document loaded page by page via JS | Browser automation or managed API |
| Authenticated portal | PDF behind login or session | Auth flow required before access |
| Paginated viewer | Each page fetched as a separate request | Intercept requests or use API |
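
One rough way to tell which row of the table applies is to look at what a plain HTTP fetch of the URL actually returns. The sketch below is a heuristic, not a definitive test: it assumes you already have the response bytes in hand, and the function name `classify_response` is hypothetical. It relies on the fact that PDF files carry a `%PDF-` header near the start, whereas a JavaScript viewer typically answers with an HTML page.

```python
def classify_response(body: bytes) -> str:
    """Guess whether fetched bytes are a PDF file, an HTML page, or neither."""
    head = body[:1024]
    # PDF files carry a "%PDF-" header; search the first 1024 bytes because
    # some files have a few stray bytes before it.
    if b"%PDF-" in head:
        return "pdf"
    # An HTML response for a ".pdf"-looking URL usually means a viewer page
    # or a login wall rather than the document itself.
    lowered = head.lstrip().lower()
    if lowered.startswith(b"<!doctype html") or lowered.startswith(b"<html"):
        return "html"
    return "unknown"
```

A result of `"html"` suggests the embedded-viewer or authenticated-portal rows, where a plain request will not reach the file.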

For directly linked PDFs, crawl the page to collect URLs, then pass each one to a document parser. For embedded or paginated viewers, the underlying PDF URL may not be exposed at all, so tooling that drives a browser is required to intercept the actual document fetch.
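
For the direct-link case, the collection step can be sketched with the Python standard library alone. This is a minimal illustration, assuming the page HTML has already been fetched; `PdfLinkCollector` and `collect_pdf_links` are hypothetical names, and the check only looks at the URL path, so PDFs served from extensionless URLs would be missed.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin, urlparse


class PdfLinkCollector(HTMLParser):
    """Collects absolute URLs of anchor tags whose path ends in .pdf."""

    def __init__(self, base_url: str):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Resolve relative links against the page URL.
        absolute = urljoin(self.base_url, href)
        # Compare the path only, so query strings like ?dl=1 do not matter.
        is_pdf = urlparse(absolute).path.lower().endswith(".pdf")
        if is_pdf and absolute not in self.pdf_urls:
            self.pdf_urls.append(absolute)


def collect_pdf_links(html: str, base_url: str) -> list:
    collector = PdfLinkCollector(base_url)
    collector.feed(html)
    return collector.pdf_urls
```

The returned list preserves page order and drops duplicates, so each URL can be handed to a parser once.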

For web-hosted PDFs, Firecrawl's scrape endpoint with document parsing accepts any accessible PDF URL and returns clean Markdown. Use the Crawl API to extract all PDF links from a site first, then pass each URL to scrape. For local or non-public files, upload them directly to the /parse endpoint, which returns the same clean Markdown output. For scanned PDFs, set mode: "ocr" to run OCR across all pages before returning content. The engine handling this is Fire-PDF, which classifies pages automatically and processes text-based content natively without a GPU, averaging under 400 ms per page for mixed documents.

For a side-by-side comparison of PDF parsers for web scraping and AI workflows, including options like Docling, Marker-PDF, and LlamaParse, see the dedicated guide. For a broader evaluation of managed document parsing APIs (PDF parsing APIs, document extraction APIs, and AI document processing services), see the best document parsing APIs guide.

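from firecrawl import Firecrawl

# Create a client with your Firecrawl API key
firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Scrape a web-hosted PDF and request Markdown output
doc = firecrawl.scrape("https://example.com/report.pdf", formats=["markdown"])
print(doc.markdown)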
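
When a crawl yields many PDF URLs, the single call above is typically wrapped in a loop that keeps going when one document fails. The following is a sketch under stated assumptions: `scrape_all` is a hypothetical helper, and the scraping call is passed in as a function so the loop does not depend on any particular client.

```python
from typing import Callable, Dict, Iterable, Tuple


def scrape_all(
    urls: Iterable[str],
    scrape_fn: Callable[[str], str],
) -> Tuple[Dict[str, str], Dict[str, str]]:
    """Run scrape_fn on each URL; return (markdown by URL, error by URL)."""
    results: Dict[str, str] = {}
    failures: Dict[str, str] = {}
    for url in urls:
        try:
            results[url] = scrape_fn(url)
        except Exception as exc:
            # Record the failure and continue with the remaining URLs.
            failures[url] = str(exc)
    return results, failures
```

With the client from the example above, `scrape_fn` could be `lambda url: firecrawl.scrape(url, formats=["markdown"]).markdown`.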
Last updated: Mar 01, 2026