What are document chunking strategies for RAG?
Document chunking splits source documents into segments before they are embedded and stored in a vector store. Chunk size controls a direct tradeoff: small chunks produce targeted retrievals but may lack the surrounding context needed to answer a question; large chunks preserve context but dilute the embedding's similarity signal by blending too many concepts into one vector. For a corpus of tens of thousands of documents, retrieval failures often trace back to chunks split at the wrong boundaries or sized too far outside the 256 to 512 token range where most embedding models perform well. In practice, the chunking strategy often affects retrieval quality more than the choice of embedding model.
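To make the size tradeoff concrete, here is a minimal sketch of fixed-size chunking. It is an illustration under stated assumptions, not a reference implementation: the function name `fixed_size_chunks` and the `overlap` parameter are hypothetical, and whitespace-delimited words stand in for model tokens, which a real pipeline would count with the embedding model's own tokenizer.

```python
def fixed_size_chunks(text: str, chunk_size: int = 512, overlap: int = 0) -> list[str]:
    """Split text into windows of at most `chunk_size` words.

    Assumption: words approximate tokens. Real token counts depend on
    the tokenizer of the embedding model in use.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    words = text.split()
    step = chunk_size - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        # Stop once a window has reached the end of the text, so an
        # overlapping tail is not emitted twice.
        if start + chunk_size >= len(words):
            break
    return chunks
```

Because the windows ignore sentence and paragraph boundaries, a chunk can start or end mid-thought, which is the "arbitrary boundary" weakness of this strategy.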
| Strategy | Boundary type | Context preservation | Complexity |
|---|---|---|---|
| Fixed-size (character or token count) | Arbitrary | Low | Minimal |
| Sentence or paragraph boundary | Natural language | Medium | Low |
| Heading boundary | Document structure | High for structured docs | Low |
| Recursive character splitting | Layered (heading, paragraph, sentence) | High | Medium |
| Semantic (embedding-based) | Concept shift | Highest | High (second embedding pass) |
Use fixed-size chunking for quick prototyping or plain-prose corpora where layout structure is minimal. Use paragraph-boundary or heading-boundary chunking for structured documents such as documentation, reports, and legal text, where natural divisions already exist in the source. Semantic chunking produces the most coherent chunks but requires a second embedding pass to detect concept boundaries, which adds significant cost at scale. For most production RAG pipelines over web content, recursive character splitting capped at 512 tokens balances retrieval precision with ingestion speed: it splits at paragraph breaks first and falls back to finer separators only when a piece is still too long.
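The recursive strategy can be sketched in plain Python as follows. This is a simplified illustration of the idea, not the implementation used by any particular library: the function name `recursive_split`, the separator list, and the use of word counts in place of token counts are all assumptions made for the example.

```python
# Separators ordered from coarsest to finest: paragraph, line, sentence, word.
SEPARATORS = ["\n\n", "\n", ". ", " "]


def recursive_split(text: str, max_words: int = 512, separators=SEPARATORS) -> list[str]:
    """Split on the coarsest separator first, recursing with finer
    separators only for pieces that are still over the limit.

    Assumption: word counts approximate token counts.
    """
    if len(text.split()) <= max_words:
        return [text.strip()] if text.strip() else []
    if not separators:
        # Nothing left to split on: fall back to a hard cut by word count.
        words = text.split()
        return [" ".join(words[i:i + max_words]) for i in range(0, len(words), max_words)]

    sep, finer = separators[0], separators[1:]
    pieces = [p for p in text.split(sep) if p.strip()]
    chunks, buffer = [], ""
    for piece in pieces:
        # Greedily merge neighbouring pieces while they fit in one chunk.
        candidate = f"{buffer}{sep}{piece}" if buffer else piece
        if len(candidate.split()) <= max_words:
            buffer = candidate
            continue
        if buffer:
            chunks.append(buffer.strip())
            buffer = ""
        if len(piece.split()) <= max_words:
            buffer = piece
        else:
            # The piece alone is too long: retry with the next finer separator.
            chunks.extend(recursive_split(piece, max_words, finer))
    if buffer:
        chunks.append(buffer.strip())
    return chunks
```

Note that a separator falling exactly on a chunk boundary is dropped in this sketch, so a chunk that ends at a sentence split loses its trailing period. Short paragraphs are merged until the limit is reached, and only an oversized paragraph is broken at a finer level.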
Firecrawl's Scrape API outputs clean markdown with headings and paragraph structure preserved. The heading markers (`#`) and blank lines give chunking libraries such as LangChain's RecursiveCharacterTextSplitter natural split points, so paragraph-boundary and heading-boundary strategies work without custom post-processing. For PDF corpora, the parse endpoint converts multi-page files to the same structured markdown format before the chunking step.
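As an illustration of why preserved headings help, the sketch below splits a markdown string at heading lines (`#` through `######`) using only the standard library. It makes no calls to Firecrawl or LangChain; the function name `split_by_heading` and the shape of the returned dictionaries are hypothetical choices for this example.

```python
import re

# A heading line: one to six '#' characters, spaces or tabs, then the title.
HEADING = re.compile(r"^(#{1,6})[ \t]+(.*)$", re.MULTILINE)


def split_by_heading(markdown: str) -> list[dict]:
    """Return one section per heading, plus any text before the first heading.

    Limitation: a line starting with '#' inside a fenced code block is
    treated as a heading by this simple pattern.
    """
    matches = list(HEADING.finditer(markdown))
    sections = []

    preamble = markdown[:matches[0].start()] if matches else markdown
    if preamble.strip():
        sections.append({"heading": None, "level": 0, "text": preamble.strip()})

    for i, match in enumerate(matches):
        # A section body runs until the next heading or the end of the document.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(markdown)
        sections.append({
            "heading": match.group(2).strip(),
            "level": len(match.group(1)),
            "text": markdown[match.end():end].strip(),
        })
    return sections
```

Keeping the heading text alongside each section makes it possible to prepend it to the chunk before embedding, so that a chunk carries some of its surrounding context. Sections that exceed the size limit can then be passed to a paragraph-level or recursive splitter.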