How do you scrape PDFs from a website?
To scrape PDFs from a website, first crawl the page to collect PDF URLs, then pass each link to a document parser that converts the file to structured text. The approach depends on how PDFs are served: directly linked .pdf files are straightforward to collect and parse, but embedded JavaScript viewers (like Scholarvox or FlipHTML5) load documents page by page, making the underlying file inaccessible through a plain HTTP request and requiring browser automation or a managed API instead.
| Source type | How PDFs are served | Approach |
|---|---|---|
| Direct links | Static .pdf URL in anchor tag | Crawl page, collect URLs, pass to parser |
| Embedded viewer (JS) | Document loaded page by page via JS | Browser automation or managed API |
| Authenticated portal | PDF behind login or session | Auth flow required before access |
| Paginated viewer | Each page fetched as a separate request | Intercept requests or use API |
For directly linked PDFs, the flow is simple: crawl the page, collect the `.pdf` URLs, and pass each one to a document parser. For embedded or paginated viewers, the underlying PDF URL may never be exposed, so you need tooling that drives a real browser and intercepts the actual document fetch.
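For the direct-link case, the crawl step needs nothing beyond the standard library. A minimal sketch (the page URL and HTML snippet here are illustrative; in practice the HTML would come from an HTTP GET of the page):

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PDFLinkCollector(HTMLParser):
    """Collects absolute URLs of anchors that point at .pdf files."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Ignore query strings when checking the extension.
        if href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))

html = '<a href="/files/report.pdf">Report</a> <a href="/about">About</a>'
collector = PDFLinkCollector("https://example.com/docs/")
collector.feed(html)
print(collector.pdf_urls)  # → ['https://example.com/files/report.pdf']
```

This only finds PDFs that appear as anchor `href`s in the static HTML; anything injected by JavaScript falls into the embedded-viewer cases above.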
Firecrawl's document parsing accepts any accessible PDF URL and returns clean Markdown. Use the Crawl API to extract all PDF links from a site first, then pass each URL to the scrape endpoint with document parsing enabled. For scanned PDFs, set `mode: "ocr"` to run OCR across all pages before returning content.
```python
from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

doc = firecrawl.scrape("https://example.com/report.pdf", formats=["markdown"])
print(doc.markdown)
```
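Once the links are collected, the crawl-then-parse loop is just iteration with deduplication and per-URL error handling. A sketch with a stand-in `scrape` callable (in practice this would wrap the Firecrawl scrape call above; the function and names here are illustrative):

```python
def parse_pdfs(pdf_urls, scrape):
    """Run a parser over each collected PDF URL, skipping duplicates
    and recording per-URL failures instead of aborting the batch."""
    results, errors, seen = {}, {}, set()
    for url in pdf_urls:
        if url in seen:
            continue
        seen.add(url)
        try:
            results[url] = scrape(url)
        except Exception as exc:
            errors[url] = exc
    return results, errors

# Demo with a fake scraper standing in for the real API call.
fake = lambda url: f"# markdown for {url}"
docs, errs = parse_pdfs(
    ["https://example.com/a.pdf", "https://example.com/a.pdf"], fake
)
print(len(docs), len(errs))  # → 1 0
```

Collecting failures instead of raising matters in practice: large sites usually contain a few dead or malformed PDF links, and one bad URL shouldn't abort the whole batch.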