📄 Abstract
Training effective dense retrieval models typically relies on hard negative
(HN) examples mined from large document corpora using methods such as BM25 or
cross-encoders (CE), which require full corpus access. We propose a corpus-free
alternative: an end-to-end pipeline where a Large Language Model (LLM) first
generates a query from a passage and then produces a hard negative example
using only the generated query text. Our dataset comprises 7,250 arXiv
abstracts spanning diverse domains including mathematics, physics, computer
science, and related fields, serving as positive passages for query generation.
We evaluate two fine-tuning configurations of DistilBERT for dense retrieval:
one using LLM-generated hard negatives conditioned solely on the query, and
another using negatives generated with both the query and its positive document
as context. Compared to traditional corpus-based mining methods (LLM Query
$\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN) on multiple BEIR
benchmark datasets, our all-LLM pipeline outperforms strong lexical mining
baselines and achieves performance comparable to cross-encoder-based methods,
demonstrating the potential of corpus-free hard negative generation for
retrieval model training.
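The following is a minimal sketch of the corpus-free two-stage generation step described above. The specific LLM, prompts, and decoding settings shown here are illustrative assumptions, not the paper's reported configuration; the key point is that the hard negative is produced from the generated query text alone, with no corpus lookup.

```python
# Sketch of the two-stage LLM pipeline: passage -> query -> hard negative.
# Model choice, prompts, and decoding parameters are assumptions for illustration.
from transformers import pipeline

# Any instruction-tuned causal LLM could stand in as the generator.
generator = pipeline("text-generation", model="HuggingFaceH4/zephyr-7b-beta")

def generate_query(passage: str) -> str:
    """Stage 1: generate a search query that the positive passage answers."""
    prompt = (
        "Write a short search query that the following abstract would answer.\n\n"
        f"Abstract: {passage}\n\nQuery:"
    )
    out = generator(prompt, max_new_tokens=32, do_sample=False)[0]["generated_text"]
    return out[len(prompt):].strip()

def generate_hard_negative(query: str) -> str:
    """Stage 2: generate a hard negative from the query text alone (no corpus access)."""
    prompt = (
        "Write a paragraph that appears relevant to the query below but does not "
        "actually answer it.\n\n"
        f"Query: {query}\n\nParagraph:"
    )
    out = generator(prompt, max_new_tokens=128, do_sample=True, temperature=0.8)
    return out[0]["generated_text"][len(prompt):].strip()

# Each arXiv abstract (positive passage) yields one (query, positive, negative) triple:
# query = generate_query(abstract_text)
# negative = generate_hard_negative(query)
# triple = (query, abstract_text, negative)
```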
Key Contributions
This paper proposes a corpus-free approach for training dense retrieval models by using LLMs to generate synthetic hard negative (HN) examples. Instead of relying on traditional mining methods that require full corpus access, the proposed pipeline uses an LLM to first generate a query from a passage and then produce an HN using only the generated query text, removing the need for corpus access when mining negatives. A training sketch follows below.
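To make the training side concrete, here is a minimal sketch of fine-tuning DistilBERT as a bi-encoder on the generated (query, positive, negative) triples. The sentence-transformers library, mean pooling, the ranking loss, and the hyperparameters are assumptions for illustration; the abstract does not state the exact fine-tuning setup, and `triples` is assumed to come from the generation step sketched earlier.

```python
# Sketch: fine-tune DistilBERT as a dense retriever on LLM-generated triples.
# Library, pooling, loss, and hyperparameters are illustrative assumptions.
from torch.utils.data import DataLoader
from sentence_transformers import SentenceTransformer, InputExample, losses, models

# Build a DistilBERT bi-encoder with mean pooling over token embeddings.
word_embedding = models.Transformer("distilbert-base-uncased", max_seq_length=256)
pooling = models.Pooling(word_embedding.get_word_embedding_dimension(), pooling_mode="mean")
model = SentenceTransformer(modules=[word_embedding, pooling])

# triples: list of (query, positive_abstract, llm_generated_hard_negative),
# e.g. produced by the generation sketch above.
train_examples = [InputExample(texts=[q, pos, neg]) for q, pos, neg in triples]
train_dataloader = DataLoader(train_examples, shuffle=True, batch_size=32)

# With (anchor, positive, hard negative) triples, this loss uses the explicit
# hard negative plus the other in-batch passages as additional negatives.
train_loss = losses.MultipleNegativesRankingLoss(model)

model.fit(
    train_objectives=[(train_dataloader, train_loss)],
    epochs=1,
    warmup_steps=100,
)
```

The same training code can be reused for the corpus-based baselines (BM25- or cross-encoder-mined negatives) by swapping the source of the negative in each triple, which is how the comparison in the abstract is framed.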
Business Value
Enables the development of more efficient and effective search and retrieval systems, especially in scenarios where full corpus access is limited or computationally expensive, potentially reducing infrastructure costs.