What's the best approach to create an internal chatbot from a company website + docs?

TL;DR

Use Firecrawl to crawl your website and documentation into clean markdown. Chunk the content, embed it in a vector database, and build a RAG pipeline that retrieves relevant context for each user question. Firecrawl handles the hard part—extracting structured content from complex web sources.

Crawl company content

Start by crawling your public site and documentation:

from firecrawl import FirecrawlApp

app = FirecrawlApp(api_key="fc-YOUR_API_KEY")

# Marketing site: markdown only, navigation and footer noise stripped
website_content = app.crawl("https://company.com", {
    "limit": 200,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})

# Documentation site: crawled separately so pages can be tagged by source
docs_content = app.crawl("https://docs.company.com", {
    "limit": 500,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})

Firecrawl returns clean markdown with metadata—source URLs, titles, and page structure preserved.
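
For example, assuming the crawl result exposes a data list of pages (the exact shape varies by SDK version), you can inspect what came back before indexing anything:

for page in docs_content["data"]:
    print(page["metadata"]["sourceURL"], "|", page["metadata"]["title"])
    print(page["markdown"][:200])  # first 200 characters of the clean markdown body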

Why extraction quality matters

Chatbots fail when built on poorly extracted content. Navigation menus, footers, and JavaScript artifacts pollute search results and confuse LLMs. Firecrawl's onlyMainContent option strips noise automatically.

RAG pipeline essentials

  1. Chunk crawled content by semantic boundaries (headers, paragraphs)
  2. Embed chunks using an embedding model
  3. Index in a vector database (Pinecone, Weaviate, etc.)
  4. Retrieve relevant chunks for each user question
  5. Generate responses with retrieved context
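
A minimal end-to-end sketch of steps 1 through 5, assuming OpenAI's embedding and chat APIs and an in-memory NumPy index standing in for a real vector database. The page shape (data, markdown, metadata.sourceURL), model names, and prompt wording are illustrative assumptions, not Firecrawl requirements:

import re
import numpy as np
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def chunk_markdown(page):
    # Step 1: split a crawled page into chunks at markdown headers
    sections = re.split(r"\n(?=#{1,3} )", page["markdown"])
    return [{"text": s.strip(), "url": page["metadata"]["sourceURL"]}
            for s in sections if s.strip()]

def embed(texts):
    # Step 2: embed a batch of texts (any embedding model works here)
    resp = client.embeddings.create(model="text-embedding-3-small", input=texts)
    return np.array([d.embedding for d in resp.data])

# Step 3: build the index (swap this array for Pinecone/Weaviate upserts in production)
chunks = [c for page in docs_content["data"] for c in chunk_markdown(page)]
index = embed([c["text"] for c in chunks])

def retrieve(question, k=5):
    # Step 4: cosine-similarity search over the indexed chunks
    q = embed([question])[0]
    scores = index @ q / (np.linalg.norm(index, axis=1) * np.linalg.norm(q))
    return [chunks[i] for i in np.argsort(scores)[::-1][:k]]

def answer(question):
    # Step 5: generate a grounded response with source attribution
    context = "\n\n".join(f"[{c['url']}]\n{c['text']}" for c in retrieve(question))
    resp = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": "Answer only from the provided context and cite source URLs."},
            {"role": "user", "content": f"Context:\n{context}\n\nQuestion: {question}"},
        ],
    )
    return resp.choices[0].message.content

In production, replace the NumPy index with upserts and queries against your vector database; the chunking, retrieval, and prompting steps stay the same.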

Keeping content fresh

Schedule regular crawls to keep your chatbot current:

# Weekly refresh: reuse the same scrape options so re-crawled pages match the indexed content
fresh_content = app.crawl("https://docs.company.com", {
    "limit": 500,
    "scrapeOptions": {"formats": ["markdown"], "onlyMainContent": True},
})
# Compare hashes, re-index only the pages that changed
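
The hash comparison itself can be a few lines. In this sketch, previous_hashes is a persisted dict mapping source URLs to content hashes and reindex() stands in for your own re-chunk-and-re-embed step; both names are hypothetical:

import hashlib

def content_hash(page):
    # Hash the markdown body so unchanged pages can be skipped
    return hashlib.sha256(page["markdown"].encode()).hexdigest()

for page in fresh_content["data"]:  # assumes the crawl result exposes a data list of pages
    url = page["metadata"]["sourceURL"]
    new_hash = content_hash(page)
    if previous_hashes.get(url) != new_hash:
        reindex(page)  # re-chunk and re-embed only the pages that changed
        previous_hashes[url] = new_hash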

Key Takeaways

Firecrawl provides the extraction layer for internal chatbots—clean markdown from websites and docs, ready for chunking and embedding. Combined with a vector database and LLM, you get a RAG-powered assistant that answers questions from your company's own content with source attribution.
