
Don't Retrieve, Generate: Prompting LLMs for Synthetic Training Data in Dense Retrieval

📄 Abstract

Training effective dense retrieval models typically relies on hard negative (HN) examples mined from large document corpora using methods such as BM25 or cross-encoders (CE), which require full corpus access. We propose a corpus-free alternative: an end-to-end pipeline in which a Large Language Model (LLM) first generates a query from a passage and then produces a hard negative example using only the generated query text. Our dataset comprises 7,250 arXiv abstracts spanning diverse domains, including mathematics, physics, computer science, and related fields, which serve as positive passages for query generation. We evaluate two fine-tuning configurations of DistilBERT for dense retrieval: one using LLM-generated hard negatives conditioned solely on the query, and another using negatives generated with both the query and its positive document as context. Compared with traditional corpus-based mining methods (LLM Query $\rightarrow$ BM25 HN and LLM Query $\rightarrow$ CE HN) on multiple BEIR benchmark datasets, our all-LLM pipeline outperforms strong lexical mining baselines and achieves performance comparable to cross-encoder-based methods, demonstrating the potential of corpus-free hard negative generation for retrieval model training.
Authors (1)
Aarush Sinha
Submitted
April 20, 2025
arXiv Category
cs.IR
arXiv PDF

Key Contributions

This paper proposes a corpus-free approach for training dense retrieval models by using LLMs to generate synthetic hard negative (HN) examples. Instead of relying on traditional mining methods that require full corpus access, the proposed pipeline prompts an LLM to first generate a query from a positive passage and then produce an HN using only the generated query text, eliminating the need for corpus access during negative mining.
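The two-step pipeline above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `call_llm` is a hypothetical stand-in for any text-completion API, and the prompt wordings are assumptions.

```python
# Sketch of a corpus-free hard-negative pipeline: passage -> query -> HN.
# `call_llm` is a hypothetical callable (prompt: str) -> str backed by any LLM API.

def build_query_prompt(passage: str) -> str:
    """Step 1: ask the LLM to write a search query answered by the passage."""
    return (
        "Write a short search query that the following passage answers.\n\n"
        f"Passage: {passage}\nQuery:"
    )

def build_negative_prompt(query: str) -> str:
    """Step 2 (query-only variant): ask for a passage that looks topically
    relevant to the query but does not actually answer it."""
    return (
        "Write a passage that appears topically relevant to the query below "
        "but does NOT answer it.\n\n"
        f"Query: {query}\nHard negative passage:"
    )

def make_triplet(passage: str, call_llm) -> tuple[str, str, str]:
    """Produce a (query, positive, hard-negative) training triplet
    without touching any document corpus."""
    query = call_llm(build_query_prompt(passage))
    hard_negative = call_llm(build_negative_prompt(query))
    return query, passage, hard_negative
```

The paper's second configuration would additionally pass the positive passage into the negative-generation prompt; the resulting triplets can then be used to fine-tune a dense retriever such as DistilBERT with a standard contrastive loss.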

Business Value

Enables the development of more efficient and effective search and retrieval systems, especially in scenarios where full corpus access is limited or computationally expensive, potentially reducing infrastructure costs.