How do I get Codex to fetch webpages for documentation?

Codex's built-in web search returns snippets, by default from a pre-indexed cache, rather than the full content of a webpage. If you ask Codex to read a library's documentation page, it gets a brief excerpt instead of the complete text it needs to answer questions about the API accurately. To give Codex full-page access to documentation, connect Firecrawl via MCP, which adds `firecrawl_scrape` and `firecrawl_crawl` as native tools alongside the built-in search.

| Method | What Codex receives | Full page content | Best for |
| --- | --- | --- | --- |
| Built-in search (cached) | Pre-indexed snippets | No | Quick factual lookups |
| Built-in search (`web_search = "live"`) | Live snippets | No | Recent information |
| `firecrawl_scrape` via MCP | Full page as clean markdown | Yes | Single documentation pages |
| `firecrawl_crawl` via MCP | All pages on a site | Yes | Complete documentation sites |

Use built-in search when a snippet is enough and you want no external setup. Use `firecrawl_scrape` when Codex needs to read a specific page in full, such as an API reference or a changelog. Use `firecrawl_crawl` when you want Codex to ingest an entire documentation site so it can answer questions across multiple pages without repeated lookups.
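The selection guidance above can be written down as a small decision helper. This is only an illustrative sketch: `FetchMethod` and `choose_method` are hypothetical names, not part of Codex or Firecrawl, and the two boolean inputs are a simplification of the criteria described in the text.

```python
from enum import Enum


class FetchMethod(Enum):
    """The three options discussed above (names are illustrative)."""
    BUILTIN_SEARCH = "built-in search"
    FIRECRAWL_SCRAPE = "firecrawl_scrape"
    FIRECRAWL_CRAWL = "firecrawl_crawl"


def choose_method(needs_full_page: bool, spans_whole_site: bool) -> FetchMethod:
    """Pick a fetch method following the guidance in the text.

    - whole documentation site -> firecrawl_crawl
    - one page, read in full   -> firecrawl_scrape
    - a snippet is enough      -> built-in search
    """
    if spans_whole_site:
        return FetchMethod.FIRECRAWL_CRAWL
    if needs_full_page:
        return FetchMethod.FIRECRAWL_SCRAPE
    return FetchMethod.BUILTIN_SEARCH
```

For example, reading a single changelog in full corresponds to `choose_method(needs_full_page=True, spans_whole_site=False)`, which selects `firecrawl_scrape`.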

Firecrawl's agent-first web index converts pages to clean, LLM-ready markdown rather than raw HTML, so Codex gets content it can use immediately without post-processing noise. The Firecrawl CLI is the fastest way to get started: install it, authorize with your API key, and Codex can fetch any documentation page on demand.

npx -y firecrawl-cli@latest init --all --browser
firecrawl login --api-key fc-YOUR-API-KEY
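To see what "full page as clean markdown" means outside of Codex, the sketch below requests a single page as markdown over HTTP. It rests on assumptions not stated on this page: that Firecrawl exposes a hosted REST endpoint at `https://api.firecrawl.dev/v1/scrape`, that it accepts a JSON body with `url` and `formats`, and that the response has the shape `{"success": ..., "data": {"markdown": ...}}`. Check the current API reference before relying on any of these. The function names are hypothetical.

```python
import json
import urllib.request

# Assumption: hosted v1 scrape endpoint; verify against the current API reference.
SCRAPE_URL = "https://api.firecrawl.dev/v1/scrape"


def build_scrape_request(page_url: str, api_key: str) -> urllib.request.Request:
    """Build (but do not send) a POST request asking for the page as markdown."""
    body = json.dumps({"url": page_url, "formats": ["markdown"]}).encode("utf-8")
    return urllib.request.Request(
        SCRAPE_URL,
        data=body,
        headers={
            "Authorization": f"Bearer {api_key}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


def extract_markdown(response_json: dict) -> str:
    """Return the markdown from a response of the assumed shape, or raise on failure."""
    if not response_json.get("success"):
        error = response_json.get("error", "unknown error")
        raise RuntimeError(f"scrape failed: {error}")
    return response_json["data"]["markdown"]


def fetch_markdown(page_url: str, api_key: str, timeout: float = 60.0) -> str:
    """Send the request and return the page content as markdown."""
    request = build_scrape_request(page_url, api_key)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return extract_markdown(json.load(response))
```

Calling `fetch_markdown("https://example.com/docs", "fc-YOUR-API-KEY")` would perform the network request; building the request and parsing the response are kept separate so each part can be checked without network access.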

Last updated: May 06, 2026
