
Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

📄 Abstract

Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline in which a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains, including mathematics, physics, computer science, and related fields, which serve as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval: one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared with traditional corpus-based mining methods (LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN) on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
Authors (1)
Aarush Sinha
Submitted
April 20, 2025
arXiv Category
cs.IR
arXiv PDF

Key Contributions

This paper proposes a corpus-free approach for training dense retrieval models by using LLMs to generate synthetic hard negative (HN) examples. Instead of relying on traditional mining methods that require full corpus access, the proposed pipeline prompts an LLM to first generate a query from a positive passage and then produce an HN using only the generated query text, eliminating the need for corpus access during negative mining.
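The two-step pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for any text-completion API, and the prompt wordings are assumptions.

```python
# Sketch of a corpus-free hard-negative pipeline: passage -> query -> HN.
# `call_llm` is a hypothetical callable (prompt: str) -> str backed by any LLM API.

def build_query_prompt(passage: str) -> str:
    """Step 1: ask the LLM to write a search query answered by the passage."""
    return (
        "Write a short search query that the following passage answers.\n\n"
        f"Passage: {passage}\nQuery:"
    )

def build_negative_prompt(query: str) -> str:
    """Step 2 (query-only variant): ask for a passage that looks topically
    relevant to the query but does not actually answer it."""
    return (
        "Write a passage that appears topically relevant to the query below "
        "but does NOT answer it.\n\n"
        f"Query: {query}\nHard negative passage:"
    )

def make_triplet(passage: str, call_llm) -> tuple[str, str, str]:
    """Produce a (query, positive, hard-negative) training triplet
    without touching any document corpus."""
    query = call_llm(build_query_prompt(passage))
    hard_negative = call_llm(build_negative_prompt(query))
    return query, passage, hard_negative
```

The paper's second configuration would additionally pass the positive passage into the negative-generation prompt; the resulting triplets can then be used to fine-tune a dense retriever such as DistilBERT with a standard contrastive loss.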

Business Value

Enables the development of more efficient and effective search and retrieval systems, especially in scenarios where full corpus access is limited or computationally expensive, potentially reducing infrastructure costs.