What is hybrid search for document retrieval?
Hybrid search combines two retrieval methods in a single pipeline: sparse keyword matching (typically BM25) and dense vector search over embeddings, then merges the ranked lists into one result set. Sparse retrieval is precise for exact terms such as product codes, error strings, and proper names, but misses synonyms and rephrased questions. Dense retrieval captures semantic similarity across vocabulary mismatches but scores poorly when a query contains a specific identifier that appears verbatim in only a handful of documents. At scale, over tens of thousands of documents, either method alone leaves accuracy gaps that the other fills.
| Factor | BM25 (sparse) | Vector search (dense) | Hybrid |
|---|---|---|---|
| Matches exact terms | Yes | Weak | Yes |
| Handles synonyms and paraphrases | No | Yes | Yes |
| Performance on technical queries | High | Moderate | High |
| Performance on conversational queries | Moderate | High | High |
| Index complexity | Low | Medium (vector store) | Higher (both indexes) |
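
To make the sparse column concrete, here is a minimal sketch of BM25 scoring in plain Python. It assumes whitespace tokenization, the common defaults `k1=1.5` and `b=0.75`, and a three-document toy corpus invented for illustration; the function name `bm25_scores` is hypothetical, and a production index would use a proper analyzer and an inverted index.

```python
import math
from collections import Counter


def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score every document against the query with a basic BM25 formula."""
    # Assumption: lowercase + whitespace split stands in for a real analyzer.
    tokenized = [doc.lower().split() for doc in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(tokens) for tokens in tokenized) / n_docs
    # Number of documents that contain each term at least once.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

    scores = []
    for tokens in tokenized:
        tf = Counter(tokens)
        score = 0.0
        for term in query.lower().split():
            if term not in tf:
                continue  # no lexical overlap, no contribution
            idf = math.log(1 + (n_docs - doc_freq[term] + 0.5) / (doc_freq[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(tokens) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores


# Toy corpus, invented for illustration only.
docs = [
    "error E4012 raised when the upload token expires",
    "how to fix a failed file transfer",
    "billing plans and invoices",
]

exact = bm25_scores("E4012", docs)                        # identifier appears verbatim
paraphrase = bm25_scores("sending documents fails", docs)  # no shared vocabulary
```

In this toy setup the identifier query scores only the first document, while the rephrased query scores zero everywhere even though the second document is topically relevant. That is the gap dense retrieval is meant to cover.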
Use hybrid search when a document corpus contains both natural-language content and precise identifiers. A pure vector index over technical documentation misranks exact function names; a pure BM25 index over FAQ content misses rephrased questions. The merge step (often reciprocal rank fusion) combines the two result lists using each document's rank position rather than its raw score, so BM25 scores and vector similarities never need to be normalized onto a common scale. Systems searching 10,000 or more documents benefit most from hybrid approaches because the accuracy gains outweigh the added infrastructure cost of maintaining two indexes. For smaller corpora, semantic search alone is usually sufficient.
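
The merge step can be sketched in a few lines. This example assumes reciprocal rank fusion with the commonly used constant `k = 60`; the function name `reciprocal_rank_fusion` and the document IDs are hypothetical, and the two input lists stand in for whatever your BM25 and vector indexes return.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists of document IDs into one ranking.

    Each document earns 1 / (k + rank) from every list it appears in,
    where rank starts at 1. Only positions are used, never raw scores.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical top-3 results from each retrieval path.
bm25_hits = ["doc_api_ref", "doc_changelog", "doc_faq"]
dense_hits = ["doc_faq", "doc_api_ref", "doc_tutorial"]

fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
# fused == ["doc_api_ref", "doc_faq", "doc_changelog", "doc_tutorial"]
```

Documents that appear in both lists rise to the top, and a document found by only one path is still retained lower in the ranking. A larger `k` flattens the difference between adjacent ranks; treat 60 as a starting point rather than a tuned value.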
For the content layer, Firecrawl's Scrape API extracts clean markdown from any web page. Feed that output to both your BM25 index and your embedding model to build the sparse and dense retrieval layers without a per-source cleaning step. For PDF and document corpora, the parse endpoint converts uploaded files to structured markdown with the same consistent output format, so both web pages and documents flow into the same indexing pipeline.