Best Chunking Strategies for RAG in 2025
Bex Tuychiev
Oct 10, 2025

Your RAG system's retrieval accuracy depends on how you chunk your documents. The wrong strategy can create up to a 9% gap in recall between the best and worst approaches. That's the difference between a system that helps users and one that frustrates them.

Here's the challenge: you need to break documents into smaller pieces before embedding them, but deciding which approach works best isn't obvious. Fixed-size chunks are easy to implement but ignore context boundaries. Semantic chunking preserves meaning but costs money to run. Page-level chunking achieved the highest accuracy in NVIDIA's 2024 benchmarks, yet it might not fit your use case.

This article compares six chunking strategies using real benchmark data from NVIDIA, Chroma, and other research teams. You'll see specific numbers, actual performance metrics, and honest trade-offs between approaches.

We'll cover the following chunking strategies:

  • Recursive character splitting
  • Size-based chunking
  • Sentence-based chunking
  • Page-level chunking
  • Semantic chunking
  • LLM-based chunking

Each strategy includes working code examples you can test (see Getting Sample Data to set up a test dataset).

Understanding Chunking Trade-offs

Every chunking strategy trades off context preservation against retrieval precision. Smaller chunks match queries more precisely but lose surrounding context. Larger chunks preserve relationships between ideas but dilute relevance in your embeddings.

The cost spectrum

Different strategies have different computational costs:

Simple methods (size-based, token-based) are fast and cheap. Split on character count, no API calls, no overhead.

Structure-aware methods (recursive, sentence-based, page-level) add modest complexity with extra parsing logic to respect natural boundaries. Cost is still minimal.

Semantic methods require generating embeddings for every sentence and calculating similarity scores to find split points. This means API calls or running local models.

LLM-based methods send every document through an LLM to analyze structure. High quality, high cost, slower processing.

Why there's no universal winner

Three factors determine what works best for your use case:

Your embedding model affects performance. Different architectures have different characteristics that change which chunking strategies work well.

Your document type matters. PDFs with tables need different handling than blog posts or code files.

Your query patterns influence optimal chunk size. Factoid lookups need different chunking than analytical questions.

Let's look at each strategy in detail.

Recursive Character Splitting

Recursive character splitting is where most teams should start. It works by trying to split text at natural boundaries, checking multiple separators in order until it finds one that works.

How it works

The splitter respects natural language boundaries instead of cutting text mid-thought. It tries to split at paragraph breaks first. If a paragraph is too long, it falls back to line breaks or sentence boundaries (when you include them as separators). If that still produces an oversized piece, it splits at spaces between words.

This happens through a hierarchy of separators:

  1. Double newline \n\n (paragraph breaks)
  2. Single newline \n (line breaks)
  3. Space (word boundaries)
  4. Empty string "" (individual characters, last resort)

When you set a chunk size of 512 characters, the splitter tries to respect that limit while breaking at the highest-level separator it can. A 600-character paragraph gets split at a line or sentence boundary instead of mid-word.

For code, you can customize separators to respect function and class boundaries:

# Code-aware separators
separators = [
    "\n\nclass ",  # Class definitions
    "\n\ndef ",    # Function definitions
    "\n\n",        # Paragraph breaks
    "\n",          # Line breaks
    " ",           # Spaces
    ""
]

Implementation example

Here's how to use LangChain's RecursiveCharacterTextSplitter.

For the examples below, we'll use clean drug information data. If you want to follow along, see Getting Sample Data for Testing Chunking Strategies to scrape your own dataset with Firecrawl.

from langchain_text_splitters import RecursiveCharacterTextSplitter
import json
from pathlib import Path

# Load one of our scraped documents
doc_path = Path("data/raw_documents/drug_info_00.json")
with open(doc_path) as f:
    doc = json.load(f)

# Configure the splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    length_function=len,
    separators=["\n\n", "\n", ". ", " ", ""]
)

# Split the document
chunks = splitter.split_text(doc["markdown"])

# Show results
print(f"Document: {doc['title']}")
print(f"Original length: {len(doc['markdown'])} characters")
print(f"Number of chunks: {len(chunks)}")
print(f"\nFirst chunk preview:")
print(chunks[0][:200] + "...")

We load a JSON document containing scraped medical information, configure a recursive character splitter with a 512-character target chunk size and 50-character overlap, then split the text using a hierarchy of separators (paragraphs, lines, sentences, words).

The output shows document metadata and a preview of the first chunk:

Document: Sertraline (oral route) - Side effects & dosage - Mayo Clinic
Original length: 23092 characters
Number of chunks: 64

First chunk preview:
## On this page

- [Brand names](https://www.mayoclinic.org/drugs-supplements/sertraline-oral-route/description/drg-20065940#drug-brand-names)

- [Description](https://www.mayoclinic.org/drugs-supplem...

The chunk preserves the document structure. Headers stay with their content. Sections break at natural boundaries instead of mid-sentence.

When to use this

Recursive splitting handles most text content well:

  • Articles and blog posts
  • Technical documentation
  • Research papers
  • Product descriptions
  • Email threads

For documents with clear section markers (like medical drug information with "Description", "Before Using", "Side Effects"), add section headers to your separators list with higher priority than paragraph breaks so entire sections stay intact, as in the sketch below. Recursive splitting is the default choice for roughly 80% of RAG applications because it balances simplicity with structure awareness.
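Here is a minimal sketch of that section-aware setup, reusing the doc loaded earlier. The "\n## " header patterns are an assumption about how your scraped markdown marks sections:

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Section-aware separators: try markdown section headers first, then fall back
# to paragraph, line, sentence, and word boundaries. The "\n## " patterns are
# an assumption about how your documents mark sections.
section_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=50,
    separators=[
        "\n## ",   # top-level section headers (e.g. "## Side effects")
        "\n### ",  # subsection headers
        "\n\n",    # paragraph breaks
        "\n",      # line breaks
        ". ",      # sentence boundaries
        " ",       # word boundaries
        ""         # last resort: individual characters
    ]
)

section_chunks = section_splitter.split_text(doc["markdown"])
print(f"Section-aware splitting produced {len(section_chunks)} chunks")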

Trade-offs

Pros:

  • Preserves document organization (headers, paragraphs, lists)
  • Adapts to different content types through custom separators
  • Better context retention than fixed-size splitting
  • Reliable performance in Chroma's research (88-89% recall with 400-token chunks using text-embedding-3-large)

Cons:

  • Slightly more complex setup than character-based splitting
  • Requires understanding your content structure to choose good separators
  • Variable chunk sizes can complicate batch processing

Size-Based Chunking

Size-based chunking is the simplest approach. Pick a number, split when you hit it, repeat. No structure awareness, no smart decisions, just counting.

How it works

Two main variants exist: character-based and token-based.

Character-based splitting counts characters and splits when you reach the limit. Set chunk_size=1000 and you get chunks of roughly 1000 characters each. A 5000-character document becomes 5 chunks.

Token-based splitting counts tokens instead of characters. This distinction matters because embedding models have token limits, not character limits. The word "unhappiness" is one word and 11 characters, but it might be 2 tokens ("un" + "happiness") depending on the tokenizer.
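You can check this yourself with the tiktoken library; a quick sketch, with the exact split depending on the encoding:

import tiktoken

# cl100k_base is the encoding used by OpenAI's text-embedding-3 models
enc = tiktoken.get_encoding("cl100k_base")

text = "unhappiness"
token_ids = enc.encode(text)
print(f"{len(text)} characters -> {len(token_ids)} tokens")

# Decode each token id individually to see where the tokenizer splits the word
print([enc.decode([t]) for t in token_ids])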

Both variants support overlap (sliding windows). A 1000-character chunk with 100-character overlap means the last 100 characters of chunk 1 appear as the first 100 characters of chunk 2. This helps preserve context across boundaries.

Implementation example

Character-based splitting with LangChain:

from langchain_text_splitters import CharacterTextSplitter
import json
from pathlib import Path

# Load document, see the Getting Sample Data section to scrape your own drug_info_00 dataset with Firecrawl.
doc_path = Path("data/raw_documents/drug_info_00.json")
with open(doc_path) as f:
    doc = json.load(f)

# Character-based splitting
char_splitter = CharacterTextSplitter(
    chunk_size=1000,
    chunk_overlap=100,
    separator="\n"
)

chunks = char_splitter.split_text(doc["markdown"])
print(f"Character-based: {len(chunks)} chunks")
print(f"Chunk 0 length: {len(chunks[0])} chars")
print(f"Chunk 1 length: {len(chunks[1])} chars")

We configure a character-based splitter with a 1000-character target and 100-character overlap, using newlines as separators.

For token-based splitting:

from langchain_text_splitters import TokenTextSplitter

# Token-based splitting (uses tiktoken)
token_splitter = TokenTextSplitter(
    chunk_size=512,
    chunk_overlap=50
)

token_chunks = token_splitter.split_text(doc["markdown"])
print(f"\nToken-based: {len(token_chunks)} chunks")

The token-based splitter uses the tiktoken library to count tokens (not characters), which better aligns with embedding model limits. With a 512-token target, this produces different chunk counts than character-based splitting.

Output:

Character-based: 28 chunks
Chunk 0 length: 954 chars
Chunk 1 length: 926 chars

Token-based: 14 chunks

Token-based splitting produces fewer chunks here because a 512-token chunk is roughly 2,000 characters of English text, about twice the size of the 1,000-character chunks above, so each chunk captures more content.

When to use this

Size-based chunking works for:

  • Prototyping and MVPs: Get something working fast, optimize later
  • Uniform content: News articles, product listings, short-form content with consistent structure
  • When simplicity matters: Small teams, limited resources, straightforward use cases

If you're testing whether RAG works for your use case at all, start here. The implementation takes 5 minutes.

Trade-offs

Pros:

  • Fastest to implement (literally 3 lines of code)
  • Predictable chunk sizes for batch processing
  • No computational overhead
  • Works with any content type

Cons:

  • Ignores semantic boundaries and document structure
  • Fragments sentences mid-thought
  • Lower retrieval accuracy than structure-aware methods
  • Can split tables, code blocks, or lists in awkward places

Overlap and sliding windows

Overlap helps reduce fragmentation problems. If a key sentence gets split across two chunks, the overlap ensures both chunks contain the complete thought.

Industry best practices recommend 10-20% overlap as a starting point. For a 500-token chunk, use 50-100 tokens of overlap. The Stack Overflow blog on RAG chunking discusses sliding windows and overlap strategies in detail.

More overlap preserves more context but increases storage costs and processing time. Test what works for your use case.
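To see the overlap concretely, here is a rough sketch that finds the text shared by two consecutive chunks (assuming the chunks list from the character-based example above). The shared region is approximate because splitters break at separators, not exact character counts:

def shared_overlap(a: str, b: str) -> str:
    """Return the longest suffix of `a` that is also a prefix of `b`."""
    for size in range(min(len(a), len(b)), 0, -1):
        if a[-size:] == b[:size]:
            return a[-size:]
    return ""

overlap_text = shared_overlap(chunks[0], chunks[1])
print(f"Chunks 0 and 1 share {len(overlap_text)} characters")
print(overlap_text[:100])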

Sentence-Based Chunking

Sentence-based chunking respects natural language boundaries. Instead of counting characters or tokens blindly, it identifies complete sentences and groups them into chunks.

How it works

The splitter uses natural language processing to detect sentence boundaries. Periods, question marks, and exclamation points signal potential splits, but the algorithm is smart enough to handle edge cases like "Dr. Smith" or "3.14" without fragmenting them.

Once sentences are identified, the chunker groups them to hit your target chunk size. If you set chunk_size=1024 tokens, it keeps adding sentences until the next one would exceed the limit, then starts a new chunk.

The result is variable chunk sizes. One chunk might be 950 tokens, another might be 1100, depending on where sentence boundaries fall. The guarantee is that sentences stay intact.

Implementation example

LlamaIndex's SentenceSplitter handles sentence detection and grouping:

from llama_index.core.node_parser import SentenceSplitter
from llama_index.core import Document
import json
from pathlib import Path

# Load document, see the Getting Sample Data section to scrape your own drug_info_00 dataset with Firecrawl.
doc_path = Path("data/raw_documents/drug_info_00.json")
with open(doc_path) as f:
    doc_data = json.load(f)

# Create LlamaIndex Document
doc = Document(text=doc_data["markdown"])

# Configure sentence splitter
splitter = SentenceSplitter(
    chunk_size=1024,
    chunk_overlap=20
)

# Split into nodes (LlamaIndex's chunk equivalent)
nodes = splitter.get_nodes_from_documents([doc])

# Show results
print(f"Document: {doc_data['title']}")
print(f"Number of chunks: {len(nodes)}")
print(f"\nChunk sizes:")
for i, node in enumerate(nodes[:5]):
    print(f"Chunk {i}: {len(node.text)} characters")

We create a LlamaIndex Document object from our text, configure a sentence splitter with a 1024-token target size and 20-token overlap, then split the document into nodes (LlamaIndex's term for chunks).

Here's what we get:

Document: Sertraline (oral route) - Side effects & dosage - Mayo Clinic
Number of chunks: 6

Chunk sizes:
Chunk 0: 3883 characters
Chunk 1: 3609 characters
Chunk 2: 4049 characters
Chunk 3: 4502 characters
Chunk 4: 4005 characters

Each chunk contains complete sentences. No thought gets cut off mid-expression.

When to use this

Sentence-based chunking works well for:

  • Conversational data (chat logs, customer support transcripts)
  • Q&A content where each question-answer pair should stay together
  • Short-form content with clear sentence structure
  • When preserving complete thoughts matters more than uniform chunk sizes

If your RAG system answers questions where context comes from complete sentences, this approach helps retrieval quality.

Trade-offs

Pros:

  • Maintains sentence integrity (no mid-sentence splits)
  • Natural language structure feels more coherent
  • Better than size-based for conversational or Q&A content
  • Users reading retrieved chunks see complete thoughts

Cons:

  • Variable chunk sizes complicate batch processing
  • Long sentences can create oversized chunks
  • Sentence detection fails on poorly formatted text
  • More complex than simple character counting

Page-Level Chunking

PDFs and structured documents benefit from specialized approaches that respect document layout and visual boundaries.

Page-level chunking treats each page as a separate chunk. Instead of counting tokens or looking for sentence breaks, it splits wherever the document pagination naturally occurs.

How it works

PDF files have built-in page boundaries. Page-level chunking extracts content page by page and creates one chunk per page (or groups a few pages together if theyโ€™re short).

This matters because PDFs often organize information visually. Financial reports put balance sheets on one page, income statements on another. Research papers have figures on specific pages with captions. Medical documentation separates patient history, current symptoms, and treatment plans across different pages.

Breaking content across these natural boundaries would lose important context. Page-level chunking preserves it.
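Before reaching for a full parsing library, a minimal page-per-chunk sketch with pypdf looks like this; the filename is a placeholder, and very short pages are merged into the next chunk:

from pypdf import PdfReader

# One chunk per page, merging very short pages into the following chunk.
# "sample_report.pdf" is a placeholder for your own file.
reader = PdfReader("sample_report.pdf")

chunks = []
buffer = ""
for page in reader.pages:
    text = (page.extract_text() or "").strip()
    buffer = f"{buffer}\n\n{text}".strip() if buffer else text
    if len(buffer) >= 200:  # keep tiny pages attached to the next page
        chunks.append(buffer)
        buffer = ""
if buffer:
    chunks.append(buffer)

print(f"Created {len(chunks)} page-level chunks")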

Implementation example

Unstructured.io provides PDF parsing with page-aware chunking:

from unstructured.partition.pdf import partition_pdf
from unstructured.chunking.title import chunk_by_title

# Parse PDF and partition by page
elements = partition_pdf(
    filename="sample_report.pdf",
    strategy="hi_res",  # High-resolution extraction for tables/images
)

# Chunk with multipage_sections=False to respect page boundaries
chunks = chunk_by_title(
    elements,
    multipage_sections=False,  # Don't merge across pages
    combine_text_under_n_chars=200,
    max_characters=2000
)

# Show results
print(f"Total chunks: {len(chunks)}")
for i, chunk in enumerate(chunks[:3]):
    print(f"\nChunk {i}:")
    print(f"  Type: {chunk.category}")
    print(f"  Length: {len(str(chunk))} characters")
    print(f"  Preview: {str(chunk)[:150]}...")

We use Unstructured.io's partition_pdf function with the "hi_res" strategy to extract text, tables, and images from a PDF while preserving page structure. The chunk_by_title function groups elements by section titles, with multipage_sections=False ensuring chunks don't span page boundaries.

Running this produces:

Total chunks: 12

Chunk 0:
  Type: Title
  Length: 487 characters
  Preview: FINANCIAL SUMMARY
Q4 2024 Results

Revenue increased 23% year-over-year to $2.3B, driven by enterprise segment growth. Operating margin improved to 18.5%...

Chunk 1:
  Type: Table
  Length: 623 characters
  Preview: Balance Sheet
Assets: $15.2B
Liabilities: $8.7B
Equity: $6.5B...

The multipage_sections=False parameter ensures page breaks start new chunks. Tables, figures, and text from the same page stay together.

When to use this

Page-level chunking excels with:

  • PDFs with visual layouts (reports, presentations, forms)
  • Table-heavy content (financial statements, research data)
  • Documents where pagination has semantic meaning
  • Mixed content types per page (text + tables + images)

If your documents have clear page-based organization, this preserves that structure.

Trade-offs

Pros:

  • Preserves page context and visual layout
  • Handles tables and figures naturally
  • Works with documents where pages represent logical units
  • Highest accuracy in NVIDIA benchmarks (0.648 with lowest variance)

Cons:

  • Only makes sense for paginated documents (PDFs, presentations)
  • Assumes pages align with semantic boundaries (not always true)
  • Variable chunk sizes based on page content density
  • May create very small or very large chunks depending on pagination

Research validation

NVIDIA's 2024 benchmark tested seven chunking strategies across five datasets. Page-level chunking won with 0.648 accuracy and the lowest standard deviation (0.107), meaning it performed consistently well across different document types.

The results make sense. Financial reports, legal documents, and research papers organize information by pages. Respecting that structure helps retrieval find the right context.

But remember: NVIDIA tested document types where pagination matters. If your PDFs are just text exports with arbitrary page breaks, page-level chunking won't help.

Semantic Chunking

Semantic chunking splits text based on meaning, not structure. Instead of looking for paragraph breaks or sentence boundaries, it analyzes how related consecutive sentences are and creates chunks where topics shift.

How it works

The process follows four steps:

  1. Sentence segmentation: Break the document into individual sentences
  2. Embedding generation: Create vector embeddings for each sentence
  3. Similarity analysis: Compare embeddings between consecutive sentences to measure how related they are
  4. Chunk formation: When similarity drops below a threshold, start a new chunk

Consider a research paper. The introduction flows naturally from background to motivation to contribution. Then there's a shift to related work. That transition is noticeable when reading. Semantic chunking detects it mathematically using embedding similarity.

The threshold determines sensitivity. Three common methods exist:

Percentile threshold (default): Split when the similarity drop between consecutive sentences exceeds the 95th percentile of all drops (see the sketch after this list). If most sentence pairs are 0.85 similar but two consecutive ones are only 0.65 similar, that's a topic shift.

Standard deviation: Split when difference exceeds 3 standard deviations from the mean. Catches statistically unusual topic transitions.

Interquartile range: Uses the middle 50% of similarity scores to identify outliers. Less sensitive to extreme values than standard deviation.
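To make the split rule concrete, here is a small NumPy sketch of the percentile method. The embed() function is a stand-in you would replace with a real embedding model; this illustrates the idea rather than LangChain's internal implementation:

import numpy as np

def embed(sentences):
    # Stand-in for a real embedding model (OpenAI, sentence-transformers, etc.)
    rng = np.random.default_rng(0)
    return rng.normal(size=(len(sentences), 8))

def semantic_breakpoints(sentences, percentile=95):
    """Indices where a new chunk should start, using the percentile rule."""
    vectors = embed(sentences)
    unit = vectors / np.linalg.norm(vectors, axis=1, keepdims=True)
    # Cosine similarity between each sentence and the one that follows it
    similarities = np.sum(unit[:-1] * unit[1:], axis=1)
    distances = 1 - similarities           # larger distance = sharper topic shift
    threshold = np.percentile(distances, percentile)
    return [i + 1 for i, d in enumerate(distances) if d > threshold]

sentences = [
    "Sertraline is used to treat depression.",
    "It is usually taken once a day.",
    "Store the tablets at room temperature.",
    "Keep them away from heat and moisture.",
]
print(semantic_breakpoints(sentences))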

Implementation example

LangChain's SemanticChunker handles the embedding and similarity analysis:

from langchain_experimental.text_splitter import SemanticChunker
from langchain_openai import OpenAIEmbeddings
import json
from pathlib import Path
import os

# Set up embeddings model
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

# Load document, see the Getting Sample Data section to scrape your own drug_info_00 dataset with Firecrawl.
doc_path = Path("data/raw_documents/drug_info_00.json")
with open(doc_path) as f:
    doc = json.load(f)

# Configure semantic chunker
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile",  # or "standard_deviation", "interquartile"
    breakpoint_threshold_amount=95
)

# Split based on semantic similarity
chunks = semantic_splitter.split_text(doc["markdown"])

# Show results
print(f"Document: {doc['title']}")
print(f"Number of chunks: {len(chunks)}")
print(f"\nChunk sizes:")
for i, chunk in enumerate(chunks[:5]):
    print(f"Chunk {i}: {len(chunk)} characters")

We initialize OpenAI embeddings, load our document, and configure a semantic chunker that uses the percentile threshold method (95th percentile) to detect topic transitions. The chunker embeds each sentence, compares consecutive embeddings, and splits where similarity drops sharply.

The results show highly variable chunk sizes based on semantic coherence rather than fixed limits:

Document: Sertraline (oral route) - Side effects & dosage - Mayo Clinic
Number of chunks: 8

Chunk sizes:
Chunk 0: 1811 characters
Chunk 1: 315 characters
Chunk 2: 4701 characters
Chunk 3: 8405 characters
Chunk 4: 6309 characters

Notice that there are fewer chunks than with size-based or recursive splitting. Semantic chunking groups related content together regardless of length, so each chunk represents a coherent topic.

When to use this

Semantic chunking works best for:

  • Dense unstructured text (research papers, long-form articles, technical documentation)
  • Content with subtle topic transitions that structure-based methods miss
  • When retrieval accuracy is the top priority and budget allows
  • Documents where natural sections aren't marked with headers

The cost matters. Every sentence needs an embedding, which means API calls (if using OpenAI) or local model inference. For a 10,000-word document, you might generate 200-300 embeddings just for chunking.

Trade-offs

Pros:

  • Maintains semantic coherence across chunk boundaries
  • Detects subtle topic shifts that structure-based methods miss
  • 2-3 percentage points better recall than RecursiveCharacterTextSplitter (Chroma research found performance differences up to 9 percentage points across all chunking methods tested)
  • Highest accuracy: 0.919 recall with LLM-enhanced variant

Cons:

  • Computationally expensive (embedding every sentence)
  • Requires threshold tuning for your specific content
  • Embedding API costs add up (or local model inference time)
  • Slower processing than simpler methods

Advanced variants

Cluster semantic chunking groups similar sentences even when they aren't adjacent, a more advanced approach than standard semantic chunking that helps when a document revisits topics or has nested structure. LlamaIndex's SemanticSplitterNodeParser implements standard semantic splitting by analyzing consecutive sentences (see the sketch below).
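A minimal sketch of that parser, assuming the llama-index-embeddings-openai package is installed and the doc dictionary loaded earlier:

from llama_index.core import Document
from llama_index.core.node_parser import SemanticSplitterNodeParser
from llama_index.embeddings.openai import OpenAIEmbedding

# Standard semantic splitting in LlamaIndex: embed sentences and split where
# consecutive sentences become dissimilar (95th percentile threshold).
splitter = SemanticSplitterNodeParser(
    embed_model=OpenAIEmbedding(model="text-embedding-3-small"),
    buffer_size=1,                       # sentences grouped per embedding
    breakpoint_percentile_threshold=95,  # split at the sharpest 5% of shifts
)

nodes = splitter.get_nodes_from_documents([Document(text=doc["markdown"])])
print(f"SemanticSplitterNodeParser produced {len(nodes)} nodes")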

Hierarchical chunking creates multiple chunk layers. Summary chunks for high-level queries, detail chunks for specific questions. More complex but powerful for documents with nested information architecture.

Test whether semantic chunking justifies the cost for your use case. If recursive splitting gives 88% recall and semantic gives 91%, is the 3% improvement worth 10x processing time and embedding costs?

LLM-Based Chunking

LLM-based chunking uses a language model to analyze document structure and decide where to split. Instead of following fixed rules or embedding similarity, the LLM reads the content and makes context-aware decisions about chunk boundaries.

How it works

You send the document (or sections of it) to an LLM with instructions about how to chunk it. The model analyzes the content, identifies logical boundaries, and returns split points or pre-chunked content.

A basic approach:

  1. Send document sections to LLM: "Identify logical section breaks in this text where topics shift"
  2. LLM analyzes structure: Understands headers, topic transitions, argument flow
  3. Receive split points or chunks: Get back either marked boundaries or pre-chunked text
  4. Create final chunks: Use LLM suggestions to split the original document

More sophisticated approaches ask the LLM to summarize each chunk or generate metadata, creating semantically rich chunks with built-in context.

Implementation example

Here's a conceptual example using OpenAI's API:

from openai import OpenAI
import json
from pathlib import Path

client = OpenAI()

# Load document, see the Getting Sample Data section to scrape your own drug_info_00 dataset with Firecrawl.
doc_path = Path("data/raw_documents/drug_info_00.json")
with open(doc_path) as f:
    doc = json.load(f)

# Prompt LLM to identify chunk boundaries
response = client.chat.completions.create(
    model="gpt-4",
    messages=[{
        "role": "system",
        "content": "You are a document analysis expert. Identify logical sections in the following medical document and suggest where to split it into semantically coherent chunks. Return section titles and approximate character positions for splits."
    }, {
        "role": "user",
        "content": doc["markdown"][:8000]  # Send first portion
    }]
)

# Parse LLM suggestions
suggestions = response.choices[0].message.content
print("LLM-suggested chunks:")
print(suggestions)

# In production, you'd parse the response and create actual chunks
# This is a simplified example showing the concept

We send the first 8000 characters of our document to GPT-4 with a system prompt instructing it to analyze document structure and suggest chunk boundaries. The LLM returns section titles with character positions.

In a production system, you'd parse these suggestions and use them to split the full document. Here's example output:

LLM-suggested chunks:
1. "Table of Contents" (Start: 0, End: 777)
2. "Brand Name" (Start: 780, End: 869)
3. "Description" (Start: 872, End: 1518)
4. "Before Using" (Start: 1521, End: 6835)
5. "Proper Use" (Start: 6838, End: 7680)

The LLM identified semantic sections that align with how medical professionals think about drug information, not just where paragraph breaks happen.

When to use this

LLM-based chunking makes sense for:

  • High-value content where chunking quality directly affects business outcomes
  • Documents with complex or unusual structure that breaks simple methods
  • Experimental projects where you can iterate on prompts
  • Small document collections where processing cost isnโ€™t prohibitive

This isn't for production-scale document processing unless cost isn't a concern. One 10,000-word document might need multiple LLM calls to analyze fully, each costing $0.01-$0.10 depending on model and token count.

Trade-offs

Pros:

  • Context-aware decisions based on actual content meaning
  • Adapts to document structure dynamically
  • Can generate chunk summaries and metadata in the same pass
  • Handles unusual document types that break rule-based methods

Cons:

  • Expensive (LLM API costs for every document)
  • Slow (LLM inference latency, especially for large documents)
  • Requires LLM access (API dependency or local model infrastructure)
  • Limited production use (cost and latency prohibitive at scale)
  • Needs prompt engineering and testing

Agentic chunking

Agentic chunking extends the LLM approach by giving the model agency to decide chunking strategy per document. Instead of one fixed prompt, the agent analyzes document characteristics and picks the right method.

A research paper might get semantic chunking. A financial report might get page-level. A code file might get function-level splitting. The agent decides based on document type, structure, and content density.

This sounds promising, but it remains largely experimental. Weaviate's chunking blog and F22 Labs guide discuss the approach, though the complexity and cost limit real-world adoption.

Practical considerations

If you're considering LLM-based chunking, test it on a small sample first. Calculate actual costs:

  • 100 documents × 5,000 words each = 500,000 words
  • At ~1.3 tokens/word = 650,000 tokens
  • GPT-4 input at $5.00 per 1M tokens = $3.25
  • Add output tokens and multiple passes for large docs

For ongoing document processing, costs add up fast. LLM-based chunking works best for one-time processing of valuable content or as a benchmark to evaluate simpler methods.

What the Research Says About RAG Chunking Strategies

NVIDIA tested seven chunking strategies across five datasets in 2024. Page-level chunking won with 0.648 accuracy and 0.107 standard deviation. Query type affected optimal chunk size: factoid queries performed best with 256-512 tokens, analytical queries needed 1024+ tokens.

Chroma Research found performance varied by up to 9% in recall across methods. LLMSemanticChunker achieved 0.919 recall, ClusterSemanticChunker reached 0.913, and RecursiveCharacterTextSplitter hit 85.4-89.5% (best at 400 tokens: 88.1-89.5%).

Superlinked's HotpotQA tests showed SentenceSplitter outperformed semantic approaches with ColBERT v2 embeddings. Embedding model choice matters as much as chunking strategy.

Start with RecursiveCharacterTextSplitter at 400-512 tokens with 10-20% overlap. Move to semantic or page-level chunking only if your metrics show you need the extra performance and budget allows for the costs.

Decision Framework

The default choice

Start with RecursiveCharacterTextSplitter:

  • Chunk size: 400-512 tokens
  • Overlap: 50-100 tokens (10-20%)
  • Separators: ["\n\n", "\n", " ", ""] (default) or add ". " for better sentence splitting

This handles most text content well: blog posts, documentation, research papers, web articles.
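Those sizes are measured in tokens, while the earlier examples counted characters. If you want the splitter to count tokens directly, a small sketch using the tiktoken-based constructor (assuming the cl100k_base encoding and a doc loaded as in the earlier examples):

from langchain_text_splitters import RecursiveCharacterTextSplitter

# Measure chunk_size and chunk_overlap in tokens instead of characters
splitter = RecursiveCharacterTextSplitter.from_tiktoken_encoder(
    encoding_name="cl100k_base",
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_text(doc["markdown"])
print(f"{len(chunks)} chunks, sized by token count")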

When to use a different approach

You can also combine strategies based on content type. Hereโ€™s when to pick something else:

Your Situation | Use This Strategy | Why
---|---|---
Working with PDFs | Page-level chunking (details below) | Won NVIDIA's benchmarks (0.648 accuracy), handles tables
Accuracy is important, budget allows | Semantic chunking (details below) | Up to 9% better recall, maintains semantic coherence
Budget is tight, need speed | Size-based chunking (details below) | Fastest to implement, no computational overhead
Processing code files | Recursive with code separators (details below) | Respects function/class boundaries
Short-form content (tweets, Q&A) | Sentence-based chunking (details below) | Preserves complete thoughts
High-value content, experimental | LLM-based chunking (details below) | Context-aware, adapts to structure dynamically

Many production systems use hybrid approaches: route PDFs to page-level chunking, web pages to recursive splitting, and code to code-aware separators based on file type or content analysis.
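Here is a hedged sketch of that routing idea, with a hypothetical pick_splitter helper keyed on file extension:

from pathlib import Path
from langchain_text_splitters import RecursiveCharacterTextSplitter

# Hypothetical router: choose a chunking strategy based on file type.
def pick_splitter(path: str):
    suffix = Path(path).suffix.lower()
    if suffix == ".pdf":
        # Hand PDFs to a page-aware pipeline (e.g. the Unstructured example above)
        return "page-level"
    if suffix == ".py":
        # Python files: respect class and function boundaries
        return RecursiveCharacterTextSplitter(
            chunk_size=1024,
            chunk_overlap=0,
            separators=["\n\nclass ", "\n\ndef ", "\n\n", "\n", " ", ""]
        )
    # Default: recursive splitting for articles, docs, and web pages
    return RecursiveCharacterTextSplitter(
        chunk_size=512,
        chunk_overlap=50,
        separators=["\n\n", "\n", ". ", " ", ""]
    )

print(pick_splitter("quarterly_report.pdf"))
print(type(pick_splitter("scraper.py")).__name__)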

Adjust chunk size for your query type

Once youโ€™ve picked a strategy, tune the chunk size:

  • Factoid queries (names, dates, facts): 256-512 tokens for precise matching
  • Analytical queries (explanations, comparisons): 1024+ tokens for more context
  • Mixed queries: Start with 400-512 tokens (balanced middle ground)

Conflicting signals? If your content type suggests one approach (PDFs → page-level) but query type suggests another (factoid → 256-512 tokens), content type usually wins. Test both to confirm.

Before going to production, test 2-3 strategies on 50-100 representative documents with 20-30 realistic queries. Measure recall, precision, and answer quality to pick the winner for your use case.

Getting Sample Data for Testing Chunking Strategies

Testing chunking strategies requires sample data. For this article, we'll use medical and pharmaceutical content because it contains the variety that chunking needs to handle: technical terminology, structured sections, dosage tables, and regulatory information. The challenge is getting that web content clean, without HTML noise.

Scraping websites typically gives you HTML mixed with navigation menus, forms, JavaScript elements, and other artifacts. This noise fragments during chunking and pollutes your embeddings. Clean extraction solves this problem.

Building a test dataset with Firecrawl

Firecrawl extracts clean content from web pages. It renders JavaScript, removes boilerplate elements, and converts everything to markdown or structured JSON while preserving headers, structured data, and document hierarchy.

We'll build a test dataset by scraping drug information from Mayo Clinic. Their drug reference pages provide technical depth, clear section structure (Description, Before Using, Proper Use, Precautions, Side Effects), and real-world complexity for testing chunking approaches.

The process uses two Firecrawl methods:

  1. Crawl to discover drug information pages
  2. Batch scrape to extract clean content from those pages

Setup

First, sign up at firecrawl.dev and get your API key. Install the SDK:

pip install firecrawl-py python-dotenv

Save your API key in a .env file:

touch .env
echo "FIRECRAWL_API_KEY='fc-YOUR-KEY-HERE'" >> .env

Discovering and scraping drug information

With Firecrawl configured, we'll crawl Mayo Clinic's drug information section to discover individual drug pages, then scrape their content. We'll limit it to 10 pages for this example:

from firecrawl import Firecrawl
from dotenv import load_dotenv
import json
from pathlib import Path

load_dotenv()
app = Firecrawl()

# Step 1: Crawl to discover drug information pages
crawl_result = app.crawl(
    "https://www.mayoclinic.org/drugs-supplements",
    limit=10,
    scrape_options={'formats': ['markdown']}
)

# Extract URLs from crawl results
label_urls = []
if crawl_result.data:
    for page in crawl_result.data:
        if hasattr(page, 'metadata') and page.metadata:
            url = getattr(page.metadata, 'source_url', None) or getattr(page.metadata, 'url', None)
            if url and url != "https://www.mayoclinic.org/drugs-supplements":
                label_urls.append(url)

print(f"Discovered {len(label_urls)} drug information pages")

The crawl method starts at Mayo Clinic's drug information section and follows links to individual drug pages. We limit it to 10 pages and request markdown format. The code handles the object structure returned by Firecrawl's SDK, filtering out the main index page.

Once you have the URLs, batch scraping extracts the actual content:

# Step 2: Batch scrape all discovered drug information pages
batch_job = app.batch_scrape(label_urls, formats=["markdown"])

# Process results
documents = []
for result in batch_job.data:
    documents.append({
        "url": result.metadata.url if result.metadata else "",
        "markdown": result.markdown,
        "title": result.metadata.title if result.metadata else "",
        "description": result.metadata.description if result.metadata else ""
    })
    print(f"Scraped: {result.metadata.title if result.metadata else 'Unknown'}")

The batch_scrape method processes all discovered URLs in parallel. Each result contains clean markdown with the drug information content, preserving sections like Description, Before Using, Proper Use, Precautions, and Side Effects.

Finally, save the documents to disk for chunking experiments:

# Save to disk for chunking experiments
output_dir = Path("data/raw_documents")
output_dir.mkdir(parents=True, exist_ok=True)

for i, doc in enumerate(documents):
    filepath = output_dir / f"drug_info_{i:02d}.json"
    with open(filepath, "w") as f:
        json.dump(doc, f, indent=2)

print(f"\nSaved {len(documents)} documents to {output_dir}")

This saves each document as JSON with the markdown content and metadata, giving you a clean dataset ready for testing different chunking approaches.

Output:

Discovered 9 drug information pages
Scraped: Sertraline (oral route) - Side effects & dosage - Mayo Clinic
Scraped: Sermorelin (injection route) - Side effects & dosage - Mayo Clinic
Scraped: Rituximab (intravenous route) - Side effects & uses - Mayo Clinic
Scraped: Valproic acid (oral route) - Side effects & dosage - Mayo Clinic
Scraped: Mepivacaine (injection route) - Side effects & uses - Mayo Clinic
Scraped: Tretinoin (topical route) - Side effects & dosage - Mayo Clinic
Scraped: Semaglutide (subcutaneous route) - Side effects & dosage - Mayo Clinic
Scraped: Clonidine (oral route) - Side effects & dosage - Mayo Clinic
Scraped: Progesterone (oral route) - Side effects & dosage - Mayo Clinic

Saved 9 documents to data/raw_documents

Why clean extraction matters for chunking

The difference between raw HTML and Firecrawl's markdown becomes clear when you chunk:

Raw HTML: Character-based splitting breaks <div> tags mid-element, fragments dosage tables around form elements, and embeds navigation text in your chunks.

Firecrawl markdown: Section headers (Description, Before Using, Proper Use, Precautions, Side Effects) provide natural split points. Dosage tables stay intact. Medical terminology is properly formatted. Every chunking strategy works better with this clean input.

With clean data ready, you can now test chunking strategies on the same content and make fair comparisons. The next section starts with recursive character splitting, the recommended default approach for most text content.

Complete RAG Pipeline Example

You've seen different chunking strategies and how they work in isolation. Here's how chunking fits into a complete RAG pipeline.

A typical RAG workflow follows these steps: collect data, clean it, chunk it, embed it, store it in a vector database, and query. We already covered data collection in the Getting Sample Data section where we scraped drug information with Firecrawl. Now we'll build the rest of the pipeline, focusing on how the chunking step affects the final system.

The workflow

Here's what a complete pipeline looks like:

  1. Scrape web content with Firecrawl (covered in section 4)
  2. Load the clean markdown
  3. Chunk the documents (strategy choice matters here)
  4. Generate embeddings
  5. Store in your vector database
  6. Query

We'll build this pipeline with Pinecone and recursive character splitting. You'll see how to swap in different chunking strategies or vector databases later.

Building the pipeline with recursive chunking

We'll use the drug information we scraped earlier, chunk it with RecursiveCharacterTextSplitter, and store it in Pinecone. First, install the required packages:

pip install langchain-pinecone langchain-openai pinecone langchain-text-splitters python-dotenv

Set up your environment variables in a .env file:

PINECONE_API_KEY=your-pinecone-key
OPENAI_API_KEY=your-openai-key

The code below builds the complete pipeline. We'll walk through each step.

Import dependencies and load environment

import os
import json
from pathlib import Path
from dotenv import load_dotenv
from pinecone import Pinecone, ServerlessSpec
from langchain_openai import OpenAIEmbeddings
from langchain_pinecone import PineconeVectorStore
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.documents import Document

load_dotenv()

This imports libraries for vector storage (Pinecone), embeddings (OpenAI), and document chunking (LangChain).

Initialize Pinecone and create index

# Initialize Pinecone
pc = Pinecone(api_key=os.environ["PINECONE_API_KEY"])
index_name = "drug-info-rag"

# Create index if it doesn't exist
if not pc.has_index(index_name):
    pc.create_index(
        name=index_name,
        dimension=1536,
        metric="cosine",
        spec=ServerlessSpec(cloud="aws", region="us-east-1")
    )

This connects to Pinecone and creates a serverless index configured for OpenAI's text-embedding-3-small model (1536 dimensions) using cosine similarity.

Set up embeddings and vector store

# Set up vector store
embeddings = OpenAIEmbeddings(model="text-embedding-3-small")
index = pc.Index(index_name)
vector_store = PineconeVectorStore(index=index, embedding=embeddings)

This initializes the OpenAI embeddings model and connects it to our Pinecone index through LangChain's vector store wrapper.

Load documents

# Load the scraped drug information
documents = []
data_dir = Path("data/raw_documents")

for json_file in data_dir.glob("drug_info_*.json"):
    with open(json_file) as f:
        doc_data = json.load(f)
        documents.append(
            Document(
                page_content=doc_data["markdown"],
                metadata={"source": doc_data["title"], "url": doc_data["url"]}
            )
        )

print(f"Loaded {len(documents)} documents")

This reads the JSON files containing our scraped drug information and converts them to LangChain Document objects, preserving the title and URL as metadata.

Chunk documents with recursive splitting

# Chunk with recursive character splitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " ", ""]
)

chunks = splitter.split_documents(documents)
print(f"Created {len(chunks)} chunks from {len(documents)} documents")
print(f"Average chunk size: {sum(len(c.page_content) for c in chunks) // len(chunks)} characters")

Chunking strategy matters here. We're using recursive character splitting with 500-character chunks and 50-character overlap. The separator hierarchy (["\n\n", "\n", ". ", " ", ""]) means the splitter tries paragraph breaks first, then line breaks, then sentence boundaries, then word boundaries. This preserves the document structure we got from Firecrawl's clean markdown.

The output shows how many chunks we created and their average size.

Add to Pinecone and query

# Add chunks to Pinecone
vector_store.add_documents(chunks)
print("Added chunks to Pinecone")

# Query the system
query = "What are the side effects of sertraline?"
results = vector_store.similarity_search(query, k=3)

print(f"\nQuery: '{query}'\n")
for i, result in enumerate(results, 1):
    print(f"Result {i} (from {result.metadata['source']}):")
    print(result.page_content[:200] + "...\n")

This uploads all chunks to Pinecone with their embeddings, then performs a semantic search for information about sertraline side effects. The system retrieves the top 3 most similar chunks.

Output:

Loaded 9 documents
Created 127 chunks from 9 documents
Average chunk size: 456 characters
Added chunks to Pinecone

Query: 'What are the side effects of sertraline?'

Result 1 (from Sertraline (oral route) - Side effects & dosage - Mayo Clinic):
## Side effects

Along with its needed effects, a medicine may cause some unwanted effects. Although not all of these side effects may occur, if they do occur they may need medical attention.

Check with your doctor...

The chunking strategy directly affected these results. Because we used recursive splitting with sentence boundaries, each chunk contains complete thoughts about side effects. The 500-character size was large enough to capture multiple related side effects in each chunk, giving the LLM good context for answering the question.

Testing different chunking strategies

You can test how chunking strategy affects retrieval by changing just the splitter configuration. Here's semantic chunking instead:

from langchain_experimental.text_splitter import SemanticChunker

# Replace the RecursiveCharacterTextSplitter with SemanticChunker
semantic_splitter = SemanticChunker(
    embeddings=embeddings,
    breakpoint_threshold_type="percentile"
)

semantic_chunks = semantic_splitter.split_documents(documents)
print(f"Semantic chunking created {len(semantic_chunks)} chunks")

Semantic chunking will create fewer, larger chunks because it groups sentences by semantic similarity rather than character count. You'd then add these chunks to a different Pinecone index and compare retrieval quality for your specific queries.

The same pattern works for other strategies: swap the splitter, re-chunk, compare results.
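A hedged sketch of that comparison loop, assuming the documents list and embeddings model from the pipeline above (the semantic chunker needs an OpenAI API key):

from langchain_text_splitters import CharacterTextSplitter, RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker

# Compare chunk counts and average sizes across strategies on the same documents
strategies = {
    "character": CharacterTextSplitter(chunk_size=500, chunk_overlap=50, separator="\n"),
    "recursive": RecursiveCharacterTextSplitter(chunk_size=500, chunk_overlap=50),
    "semantic": SemanticChunker(embeddings=embeddings, breakpoint_threshold_type="percentile"),
}

for name, strategy in strategies.items():
    split = strategy.split_documents(documents)
    avg_size = sum(len(c.page_content) for c in split) // len(split)
    print(f"{name:>10}: {len(split)} chunks, avg {avg_size} characters")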

Adapting for other vector databases

The pattern stays the same across all vector databases: load data, chunk it, embed it, store it. Only the vector store setup changes.

For Qdrant:

from langchain_qdrant import QdrantVectorStore
from qdrant_client import QdrantClient

client = QdrantClient(url="http://localhost:6333")
vector_store = QdrantVectorStore(
    client=client,
    collection_name="drug-info",
    embedding=embeddings
)
vector_store.add_documents(chunks)

For Weaviate:

from langchain_weaviate import WeaviateVectorStore
import weaviate

with weaviate.connect_to_local() as client:
    vector_store = WeaviateVectorStore(
        client=client,
        index_name="DrugInfo",
        text_key="text",
        embedding=embeddings
    )
    vector_store.add_documents(chunks)

For ChromaDB:

from langchain_chroma import Chroma

vector_store = Chroma(
    collection_name="drug-info",
    embedding_function=embeddings,
    persist_directory="./chroma_db"
)
vector_store.add_documents(chunks)

For pgvector (PostgreSQL):

from langchain_postgres import PGVector

connection_string = "postgresql://user:pass@localhost:5432/vectordb"
vector_store = PGVector(
    connection_string=connection_string,
    collection_name="drug_info",
    embedding_function=embeddings
)
vector_store.add_documents(chunks)

The chunking logic stays identical across all of these. You use the same RecursiveCharacterTextSplitter (or any other strategy) and the same chunks. The vector database handles storage and retrieval, but the chunking quality determines what gets stored.

This is why chunking strategy matters more than vector database choice for many applications. A well-chunked document retrieves better regardless of which database you use. Poor chunking degrades retrieval even on the fastest vector database.

Which chunking strategy should you use?

No single chunking strategy is universally best. Page-level chunking won NVIDIAโ€™s benchmarks with 0.648 accuracy and the lowest variance across document types. Semantic chunking can improve recall by up to 9% over simpler methods. RecursiveCharacterTextSplitter with 400-512 tokens delivered 85-90% recall in Chromaโ€™s tests without the computational overhead, making it a solid default for most teams.

The benchmarks in this article used specific datasets, embedding models, and query patterns that probably differ from yours. Your chunking choice determines what your vector database stores, which determines what your RAG system can retrieve. Track your metrics over time: recall shows whether you're retrieving relevant chunks, precision shows whether those chunks are actually useful, MRR and NDCG measure ranking quality.
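As a rough sketch of what that tracking can look like, here is recall@k and MRR computed from retrieved chunk IDs against a hand-labeled relevant set (the IDs are hypothetical):

def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant chunks that appear in the top-k results."""
    hits = sum(1 for chunk_id in retrieved[:k] if chunk_id in relevant)
    return hits / len(relevant) if relevant else 0.0

def mrr(retrieved: list[str], relevant: set[str]) -> float:
    """Reciprocal rank of the first relevant chunk, 0 if none was retrieved."""
    for rank, chunk_id in enumerate(retrieved, start=1):
        if chunk_id in relevant:
            return 1.0 / rank
    return 0.0

# Hypothetical query result: chunk IDs returned by the retriever, in ranked order
retrieved = ["chunk_12", "chunk_03", "chunk_44"]
relevant = {"chunk_03", "chunk_27"}

print(f"Recall@3: {recall_at_k(retrieved, relevant, k=3):.2f}")  # 0.50
print(f"MRR: {mrr(retrieved, relevant):.2f}")                    # 0.50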
