What is hybrid search for document retrieval?

Hybrid search runs two retrieval methods in a single pipeline, sparse keyword matching (typically BM25) and dense vector search over embeddings, and then merges their ranked lists into one result set. Sparse retrieval is precise for exact terms such as product codes, error strings, and proper names, but misses synonyms and rephrased questions. Dense retrieval captures semantic similarity across vocabulary mismatches but ranks poorly when a query contains a specific identifier that appears verbatim in only a handful of documents. At scale, across corpora of 10,000 or more documents, either method alone leaves accuracy gaps that the other fills.

Factor	BM25 (sparse)	Vector search (dense)	Hybrid
Matches exact terms	Yes	Weak	Yes
Handles synonyms and paraphrases	No	Yes	Yes
Performance on technical queries	High	Moderate	High
Performance on conversational queries	Moderate	High	High
Index complexity	Low	Medium (vector store)	Higher (both indexes)
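
To make the sparse column above concrete, here is a minimal BM25 scorer in plain Python. It is a sketch under stated assumptions, not a production index: whitespace tokenization, the common defaults k1=1.5 and b=0.75, a Lucene-style IDF, and three made-up example documents. It illustrates why an exact identifier scores well while a paraphrase with no shared terms scores zero.

```python
import math
from collections import Counter

def tokenize(text):
    # Assumption: lowercase whitespace tokenization; real systems use analyzers.
    return text.lower().split()

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Return one BM25 score per document for the given query."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    avgdl = sum(len(toks) for toks in tokenized) / n
    # Document frequency: number of documents containing each term.
    df = Counter()
    for toks in tokenized:
        df.update(set(toks))
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in tokenize(query):
            if term not in tf:
                continue  # no lexical overlap means no contribution
            idf = math.log(1 + (n - df[term] + 0.5) / (df[term] + 0.5))
            norm = tf[term] + k1 * (1 - b + b * len(toks) / avgdl)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Hypothetical example corpus.
docs = [
    "error E1234 raised on startup",
    "the application fails to launch",
    "how to reset a password",
]
exact = bm25_scores("E1234", docs)
paraphrase = bm25_scores("app crashes when starting", docs)
```

With this corpus, only the first document gets a nonzero score for the identifier query, and the paraphrased query matches nothing, which is the gap the dense path is meant to cover.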

Use hybrid search when a document corpus contains both natural-language content and precise identifiers. A pure vector index over technical documentation misranks exact function names; a pure BM25 index over FAQ content misses rephrased questions. The merge step (often reciprocal rank fusion) combines the two ranked lists by rank position rather than raw score, so BM25 scores and vector similarities, which sit on different scales, never need to be normalized against each other. Systems searching 10,000 or more documents benefit most from hybrid approaches because the accuracy gains outweigh the added infrastructure cost of maintaining two indexes. For smaller corpora, semantic search alone is usually sufficient.
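
The merge step can be sketched in a few lines. The function below is an illustrative implementation of reciprocal rank fusion, assuming each retriever returns a list of document IDs ordered best first; the constant k=60 is a commonly used default, and the document IDs are hypothetical.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc IDs into one list, best first.

    Each document earns 1 / (k + rank) from every list it appears in,
    so only rank positions matter, never the raw retriever scores.
    """
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by fused score descending; break ties by doc ID for stable output.
    return sorted(scores, key=lambda d: (-scores[d], d))

# Hypothetical ranked results from the two retrieval paths.
bm25_ranking = ["doc_a", "doc_b", "doc_c"]
dense_ranking = ["doc_b", "doc_d", "doc_a"]

fused = reciprocal_rank_fusion([bm25_ranking, dense_ranking])
print(fused)  # ['doc_b', 'doc_a', 'doc_d', 'doc_c']
```

Documents that appear in both lists (doc_a and doc_b here) rise above documents found by only one retriever, which is the behavior hybrid search relies on.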

For the content layer, Firecrawl's Scrape API extracts clean markdown from any web page. Feed that output to both your BM25 index and your embedding model to build the sparse and dense retrieval layers without a per-source cleaning step. For PDF and document corpora, the parse endpoint converts uploaded files to structured markdown with the same consistent output format, so both web pages and documents flow into the same indexing pipeline.
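
As a rough sketch of feeding the same markdown to both layers, the function below builds a simple inverted index (standing in for a BM25 index) and a vector table from one dictionary of markdown strings. It makes no Firecrawl calls: the markdown is assumed to have been fetched already, and `embed`, the document IDs, and the toy embedding are all hypothetical placeholders for your own embedding model and data.

```python
from collections import defaultdict

def build_indexes(markdown_docs, embed):
    """Index every markdown document on both the sparse and dense paths.

    markdown_docs: dict mapping doc ID -> markdown text (web pages and
                   parsed files alike, since both arrive as markdown).
    embed:         caller-supplied function mapping text -> list of floats
                   (hypothetical; substitute a real embedding model).
    """
    inverted = defaultdict(set)  # term -> set of doc IDs (sparse layer)
    vectors = {}                 # doc ID -> embedding (dense layer)
    for doc_id, markdown in markdown_docs.items():
        for term in markdown.lower().split():
            inverted[term].add(doc_id)
        vectors[doc_id] = embed(markdown)
    return dict(inverted), vectors

def toy_embed(text):
    # Placeholder embedding for illustration only: a one-dimensional "vector".
    return [float(len(text))]

# Hypothetical inputs: one scraped page and one parsed PDF, both as markdown.
markdown_docs = {
    "web/page-1": "# Install\nRun setup",
    "pdf/manual": "# Reset\nRun reset",
}
inverted, vectors = build_indexes(markdown_docs, toy_embed)
```

Because both source types share one format, a single loop populates both indexes, and the two retrieval paths stay keyed by the same document IDs for the later merge.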
