
How do you scrape PDFs from a website?

To scrape PDFs from a website, first crawl the page to collect PDF URLs, then pass each link to a document parser that converts the file to structured text. The approach depends on how PDFs are served: directly linked .pdf files are straightforward to collect and parse, but embedded JavaScript viewers (like Scholarvox or FlipHTML5) load documents page by page, making the underlying file inaccessible through a plain HTTP request and requiring browser automation or a managed API instead.
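The first step above, collecting directly linked `.pdf` URLs from a page's HTML, can be sketched with Python's standard library alone. The HTML snippet and base URL here are illustrative, not from a real site:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

class PdfLinkCollector(HTMLParser):
    """Collect absolute URLs of directly linked .pdf files from anchor tags."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.pdf_urls = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href", "")
        # Ignore query strings when checking the extension
        if href.lower().split("?")[0].endswith(".pdf"):
            self.pdf_urls.append(urljoin(self.base_url, href))

# Hypothetical page content for illustration
html = '<a href="/docs/report.pdf">Report</a> <a href="/about">About</a>'
collector = PdfLinkCollector("https://example.com")
collector.feed(html)
print(collector.pdf_urls)  # only the .pdf link, as an absolute URL
```

In practice you would fetch the page HTML first (or let a crawler do it), then feed each page's markup through a collector like this before handing the URLs to a parser.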

| Source type | How PDFs are served | Approach |
| --- | --- | --- |
| Direct links | Static `.pdf` URL in anchor tag | Crawl page, collect URLs, pass to parser |
| Embedded viewer (JS) | Document loaded page by page via JS | Browser automation or managed API |
| Authenticated portal | PDF behind login or session | Auth flow required before access |
| Paginated viewer | Each page fetched as a separate request | Intercept requests or use API |

For directly linked PDFs, crawl the page to collect URLs, then pass each one to a document parser. For embedded or paginated viewers, the underlying PDF URL may not be exposed at all, so tooling that drives a browser is required to intercept the actual document fetch.
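A quick way to tell whether a plain HTTP request returned the actual document or a viewer's HTML shell is to check the PDF magic bytes. Every PDF file begins with `%PDF-`, so a response that starts with HTML markup instead signals that browser automation (or a managed API) is needed. A minimal check:

```python
def looks_like_pdf(data: bytes) -> bool:
    """True if the response body is an actual PDF, not a viewer's HTML shell."""
    # Every PDF begins with the magic bytes %PDF- (possibly after whitespace)
    return data[:1024].lstrip().startswith(b"%PDF-")

print(looks_like_pdf(b"%PDF-1.7\n%..."))        # → True  (real PDF payload)
print(looks_like_pdf(b"<!DOCTYPE html><html>"))  # → False (JS viewer page)
```

Running this on the first kilobyte of each fetched response lets a scraper decide per-URL whether to parse directly or fall back to a browser-based approach.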

Firecrawl's document parsing accepts any accessible PDF URL and returns clean Markdown. Use the Crawl API to extract all PDF links from a site first, then pass each URL to the scrape endpoint with document parsing enabled. For scanned PDFs, set mode: "ocr" to run OCR across all pages before returning content.

from firecrawl import Firecrawl

firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")

# Scrape a directly linked PDF; document parsing returns clean Markdown
doc = firecrawl.scrape("https://example.com/report.pdf", formats=["markdown"])
print(doc.markdown)
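The crawl-then-parse workflow described above can be sketched as follows. The filtering helper is plain Python; the commented-out Firecrawl calls are an assumption about the crawl response shape (`crawl.data`, per-page `metadata.url`) and should be checked against the current SDK docs:

```python
def pdf_links(urls):
    """Keep only URLs that point at .pdf files (query strings ignored)."""
    return [u for u in urls if u.lower().split("?")[0].endswith(".pdf")]

# Hypothetical workflow sketch (response field names are assumptions):
# from firecrawl import Firecrawl
# firecrawl = Firecrawl(api_key="fc-YOUR-API-KEY")
# crawl = firecrawl.crawl("https://example.com", limit=50)
# for url in pdf_links([page.metadata.url for page in crawl.data]):
#     doc = firecrawl.scrape(url, formats=["markdown"])
#     print(doc.markdown)

print(pdf_links(["https://example.com/a.pdf", "https://example.com/about"]))
```

The same filter works regardless of which crawler produced the URL list.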
Last updated: Mar 01, 2026