What are document chunking strategies for RAG?

Document chunking splits source documents into segments before they are embedded and stored in a vector store. Chunk size controls a direct tradeoff: small chunks produce targeted retrievals but may lack the surrounding context needed to answer a question; large chunks preserve context but dilute the embedding's similarity signal by folding too many concepts into one vector. For a corpus of tens of thousands of documents, retrieval failures often trace back to chunks split at the wrong boundaries or sized well outside the 256 to 512 token range where most embedding models perform well. In practice, the chunking strategy can affect retrieval quality as much as, or more than, the choice of embedding model.
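To make the size tradeoff concrete, here is a minimal sketch of fixed-size chunking. It assumes whitespace-separated words as a rough stand-in for model tokens (a real pipeline would count tokens with the embedding model's own tokenizer), and the `overlap` parameter is an illustrative addition, not something prescribed above. The function name `chunk_fixed` is hypothetical.

```python
def chunk_fixed(text: str, size: int = 256, overlap: int = 32) -> list[str]:
    """Split text into windows of `size` whitespace tokens.

    Consecutive windows share `overlap` tokens so that a sentence cut at a
    window edge still appears whole in one of the two neighbouring chunks.
    Whitespace tokens only approximate model tokens (an assumption).
    """
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("need size > 0 and 0 <= overlap < size")

    tokens = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # this window already reached the end of the text
    return chunks
```

Lowering `size` yields more, narrower chunks that match specific queries closely; raising it keeps more surrounding context in each vector. Because the boundaries are arbitrary, this approach fits the prototyping use case better than structured documents.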

Strategy	Boundary type	Context preservation	Complexity
Fixed-size (character or token count)	Arbitrary	Low	Minimal
Sentence or paragraph boundary	Natural language	Medium	Low
Heading boundary	Document structure	High for structured docs	Low
Recursive character splitting	Layered (heading, paragraph, sentence)	High	Medium
Semantic (embedding-based)	Concept shift	Highest	High (second embedding pass)

Use fixed-size chunking for quick prototyping or plain-prose corpora where layout structure is minimal. Use paragraph-boundary or heading-boundary chunking for structured documents such as documentation, reports, and legal text, where natural divisions already exist in the source. Semantic chunking produces the most coherent chunks but requires a second embedding pass to detect concept boundaries, which adds significant cost at scale. For most production RAG grounding pipelines over web content, recursive character splitting with a paragraph fallback at 512 tokens balances retrieval precision with ingestion speed.
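The recursive approach can be sketched in plain Python as follows. This is an illustrative implementation under stated assumptions, not the code of any particular library: the separator order (paragraph, line, sentence, word), the whitespace-based token count, and the names `recursive_split` and `n_tokens` are all hypothetical choices made for the example.

```python
# Coarsest boundary first: paragraph, line, sentence, word (assumed order).
SEPARATORS = ("\n\n", "\n", ". ", " ")


def n_tokens(text: str) -> int:
    """Approximate token count using whitespace-separated words."""
    return len(text.split())


def recursive_split(text: str, max_tokens: int = 512,
                    seps: tuple[str, ...] = SEPARATORS) -> list[str]:
    """Split on the coarsest separator, recursing only into oversized pieces."""
    text = text.strip()
    if not text:
        return []
    if n_tokens(text) <= max_tokens or not seps:
        return [text]  # fits, or no finer boundary is left to try

    head, rest = seps[0], seps[1:]
    pieces = []
    for part in text.split(head):
        pieces.extend(recursive_split(part, max_tokens, rest))

    # Greedily merge neighbouring pieces back together while they still fit,
    # so chunks end up close to max_tokens instead of one piece each.
    chunks, current = [], ""
    for piece in pieces:
        candidate = f"{current}{head}{piece}" if current else piece
        if n_tokens(candidate) <= max_tokens:
            current = candidate
        else:
            if current:
                chunks.append(current)
            current = piece
    if current:
        chunks.append(current)
    return chunks
```

A paragraph that fits under the limit is never broken apart; only oversized paragraphs fall through to line, sentence, and finally word boundaries. One simplification to be aware of: splitting on `". "` drops the period from a sentence that ends up at the end of a chunk.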

Firecrawl's Scrape API outputs clean markdown with headings and paragraph structure preserved. The markdown heading markers (`#`) and blank lines give chunking libraries such as LangChain's RecursiveCharacterTextSplitter natural split points, so paragraph-boundary and heading-boundary strategies work without custom post-processing. For PDF corpora, the parse endpoint converts multi-page files to the same structured markdown format before the chunking step.
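As an illustration of heading-boundary chunking over markdown like this, the sketch below splits a markdown string into one section per heading. It uses only the Python standard library and does not call Firecrawl or LangChain; the function name `split_by_heading` and the returned dictionary shape are hypothetical, and it assumes ATX-style headings (`#` through `######`) only.

```python
import re

# ATX-style markdown heading: one to six '#' characters, then the title.
HEADING = re.compile(r"^(#{1,6})\s+(.*)$")


def split_by_heading(markdown: str) -> list[dict]:
    """Return one section per heading, keeping the heading text and level."""
    sections = []
    current = {"heading": None, "level": 0, "lines": []}
    in_fence = False

    def flush() -> None:
        body = "\n".join(current["lines"]).strip()
        # Skip an empty preamble (no heading and no text before the first one).
        if current["heading"] is not None or body:
            sections.append({"heading": current["heading"],
                             "level": current["level"],
                             "text": body})

    for line in markdown.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence  # ignore '#' lines inside code fences
        match = None if in_fence else HEADING.match(line)
        if match:
            flush()
            current = {"heading": match.group(2).strip(),
                       "level": len(match.group(1)),
                       "lines": []}
        else:
            current["lines"].append(line)
    flush()
    return sections
```

Keeping the heading alongside each section's text makes it possible to prepend it to the chunk before embedding, which restores some of the context a bare paragraph would lose. Sections that exceed the size limit can then be passed to a finer-grained splitter.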
