How do I build an agent that reads webpages and returns structured citations + text?
TL;DR
Firecrawl's agent endpoint autonomously searches the web, reads relevant pages, and returns structured data with source URLs. Define a schema that includes citation fields, and the agent handles research, extraction, and attribution in one call.
The agent approach
Instead of manually scraping pages and tracking sources yourself, Firecrawl's agent searches autonomously, extracts data matching your schema, and includes a source URL for every piece of data:
```python
result = app.agent({
    "prompt": "Find the latest funding rounds for AI startups in 2024",
    "schema": {
        "type": "object",
        "properties": {
            "findings": {
                "type": "array",
                "items": {
                    "type": "object",
                    "properties": {
                        "company": {"type": "string"},
                        "amount": {"type": "string"},
                        "source_url": {"type": "string"},
                    },
                },
            },
        },
    },
})
```

Why citations matter
RAG applications and research tools need source attribution. Users want to verify information, and LLMs benefit from grounded context. Firecrawl tracks provenance automatically, so every extracted fact links back to its source.
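As a rough sketch of how downstream code might consume such a result (the `findings` shape comes from the schema above; the sample data here is hypothetical):

```python
# Hypothetical agent output matching the schema above -- in practice this
# dict comes back from app.agent(...).
result = {
    "findings": [
        {
            "company": "Acme AI",
            "amount": "$10M",
            "source_url": "https://example.com/acme-funding",
        },
    ]
}

# Pair each extracted fact with its citation, e.g. for display or RAG context.
citations = [
    f'{f["company"]} raised {f["amount"]} (source: {f["source_url"]})'
    for f in result["findings"]
]
print("\n".join(citations))
```

Because the schema requires a `source_url` on every item, attribution survives the extraction step without any bookkeeping on your side.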
For known URLs
When you already have URLs, scrape returns source metadata:
```python
result = app.scrape_url("https://example.com/article", {
    "formats": ["markdown"]
})

# The source URL is available in result["metadata"]["sourceURL"]
```

Key Takeaways
Firecrawl's agent endpoint builds citation-aware research into a single API call. Define your schema with source fields, and the agent handles web search, content extraction, and attribution. For RAG systems and research assistants, this eliminates the complexity of tracking sources across multiple operations.