Talk2Ref: A Dataset for Reference Prediction from Scientific Talks

📄 Abstract

Scientific talks are a growing medium for disseminating research, and automatically identifying relevant literature that grounds or enriches a talk would be highly valuable for researchers and students alike. We introduce Reference Prediction from Talks (RPT), a new task that maps long and unstructured scientific presentations to relevant papers. To support research on RPT, we present Talk2Ref, the first large-scale dataset of its kind, containing 6,279 talks and 43,429 cited papers (26 per talk on average), where relevance is approximated by the papers cited in the talk's corresponding source publication. We establish strong baselines by evaluating state-of-the-art text embedding models in zero-shot retrieval scenarios, and propose a dual-encoder architecture trained on Talk2Ref. We further explore strategies for handling long transcripts, as well as training for domain adaptation. Our results show that fine-tuning on Talk2Ref significantly improves citation prediction performance, demonstrating both the challenges of the task and the effectiveness of our dataset for learning semantic representations from spoken scientific content. The dataset and trained models are released under an open license to foster future research on integrating spoken scientific communication into citation recommendation systems.
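
The retrieval setup described in the abstract (embed a long transcript, embed candidate papers, rank by similarity) can be illustrated with a minimal sketch. Everything named here is an assumption made for illustration: `embed` is a toy hashed bag-of-words function standing in for a real text embedding model, the window sizes are arbitrary, and mean-pooling overlapping chunks is only one possible long-transcript strategy, not necessarily one the paper evaluates.

```python
import hashlib
import math
import re

DIM = 256  # size of the toy embedding space (arbitrary choice)


def embed(text: str) -> list[float]:
    """Toy hashed bag-of-words vector; a stand-in for a real embedding model."""
    vec = [0.0] * DIM
    for token in re.findall(r"[a-z0-9]+", text.lower()):
        bucket = int(hashlib.md5(token.encode()).hexdigest(), 16) % DIM
        vec[bucket] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


def chunk(words: list[str], size: int, overlap: int) -> list[str]:
    """Split a word list into overlapping windows of at most `size` words."""
    if size <= overlap:
        raise ValueError("size must be larger than overlap")
    step = size - overlap
    stop = max(len(words) - overlap, 1)
    return [" ".join(words[i:i + size]) for i in range(0, stop, step)]


def embed_transcript(transcript: str, size: int = 50, overlap: int = 10) -> list[float]:
    """Embed each chunk, then mean-pool and re-normalise into one talk vector."""
    vectors = [embed(c) for c in chunk(transcript.split(), size, overlap)]
    mean = [sum(col) / len(vectors) for col in zip(*vectors)]
    norm = math.sqrt(sum(v * v for v in mean))
    return [v / norm for v in mean] if norm else mean


def rank_papers(transcript: str, papers: dict[str, str], top_k: int = 3):
    """Return the top_k (paper_id, cosine score) pairs for a transcript."""
    query = embed_transcript(transcript)
    scores = {
        pid: sum(q * p for q, p in zip(query, embed(text)))
        for pid, text in papers.items()
    }
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:top_k]


if __name__ == "__main__":
    talk = "we train a dual encoder for dense retrieval of scientific papers " * 20
    candidates = {
        "a": "dual encoder dense retrieval for scientific papers",
        "b": "protein folding with molecular dynamics simulation",
    }
    print(rank_papers(talk, candidates))
```

Swapping `embed` for a call to an actual embedding model would leave the chunking, pooling, and ranking steps unchanged.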

Authors (3)

Frederik Broy

Maike Züfle

Jan Niehues

Submitted

October 28, 2025

arXiv Category

cs.CL

Key Contributions

Introduces the Reference Prediction from Talks (RPT) task and the Talk2Ref dataset, the first large-scale resource for mapping scientific talks to relevant papers. It establishes baselines using state-of-the-art embedding models and proposes a dual-encoder architecture, addressing the challenge of finding literature grounding scientific presentations.
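
This summary does not say which objective the dual encoder is trained with. As a hedged illustration only, dual encoders are commonly trained with an in-batch contrastive loss, where each talk vector is pulled toward the vector of one paper it cites and pushed away from the other papers in the batch. The function name, the temperature value, and the pairing of one positive paper per talk are all assumptions made for this sketch.

```python
import math


def dot(u: list[float], v: list[float]) -> float:
    return sum(a * b for a, b in zip(u, v))


def in_batch_contrastive_loss(
    talk_vecs: list[list[float]],
    paper_vecs: list[list[float]],
    temperature: float = 0.05,
) -> float:
    """Mean cross-entropy where paper i is the positive for talk i and
    every other paper in the batch acts as a negative."""
    if not talk_vecs or len(talk_vecs) != len(paper_vecs):
        raise ValueError("need equally sized, non-empty batches")
    losses = []
    for i, talk in enumerate(talk_vecs):
        logits = [dot(talk, paper) / temperature for paper in paper_vecs]
        # Numerically stable log-sum-exp over the batch.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(x - peak) for x in logits))
        losses.append(log_z - logits[i])
    return sum(losses) / len(losses)


if __name__ == "__main__":
    talks = [[1.0, 0.0], [0.0, 1.0]]
    matched = [[1.0, 0.0], [0.0, 1.0]]     # each talk aligned with its paper
    mismatched = [[0.0, 1.0], [1.0, 0.0]]  # positives swapped
    print(in_batch_contrastive_loss(talks, matched))
    print(in_batch_contrastive_loss(talks, mismatched))
```

The loss is near zero when each talk vector matches its own paper and large when the positives are swapped, which is the signal a trainable encoder pair would follow.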

Business Value

Accelerates knowledge dissemination by letting researchers and students quickly find the literature that grounds a talk, which shortens literature search and improves research efficiency.

Paper Metadata

Innovation Type

Dataset/Task Definition/Methodology

Deployment Feasibility

Moderate; requires infrastructure for processing talk transcripts and running retrieval models.

Limitations Addressed

Lack of datasets and established tasks for automatically linking scientific talks to relevant literature.

Performance Gains

Fine-tuning on Talk2Ref significantly improves citation prediction performance.
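
The summary does not name the metric behind "citation prediction performance". One common choice for retrieval tasks of this kind is Recall@k: the fraction of a talk's reference papers that appear among the top k retrieved candidates. The sketch below assumes that metric; the function names and paper IDs are hypothetical.

```python
def recall_at_k(ranked_ids: list[str], relevant_ids: set[str], k: int) -> float:
    """Fraction of relevant papers found in the top-k ranked candidates."""
    if not relevant_ids:
        return 0.0
    hits = len(set(ranked_ids[:k]) & relevant_ids)
    return hits / len(relevant_ids)


def mean_recall_at_k(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average Recall@k over (ranking, relevant set) pairs, one per talk."""
    scores = [recall_at_k(ranked, relevant, k) for ranked, relevant in runs]
    return sum(scores) / len(scores) if scores else 0.0


if __name__ == "__main__":
    ranking = ["p1", "p9", "p3", "p4"]   # hypothetical retrieval output
    references = {"p1", "p3", "p7"}      # hypothetical cited papers
    print(recall_at_k(ranking, references, k=3))
```

Comparing this score for a zero-shot embedding model and a fine-tuned one on the same talks is one way to quantify the kind of improvement reported above.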

Technical Tags

reference prediction, scientific talks, citation prediction, information retrieval, text embedding models, dual-encoder architecture, long transcripts, domain adaptation

Research Topics

Information Retrieval from Scientific Talks, Citation Recommendation, Cross-Modal Information Retrieval, Handling Long Documents

Methods & Architectures

Reference Prediction from Talks (RPT) task, Talk2Ref dataset, Zero-shot retrieval using text embedding models, Dual-encoder architecture, Strategies for long transcripts, Domain adaptation

Applications & Tasks

Scientific Research, Information Retrieval, Natural Language Processing, Knowledge Management

Difficulty in automatically identifying relevant literature from scientific talks; Need for a dataset to support research on reference prediction from talks; Handling long and unstructured presentation transcripts

Reference prediction, Citation prediction, Information retrieval

Datasets & Benchmarks

Datasets

Talk2Ref

Related Fields

Information Retrieval, Natural Language Processing, Machine Learning, Scientific Communication

Keywords

Reference Prediction, Scientific Talks, Citation Prediction, Information Retrieval, Dataset, LLM, Text Embeddings, Dual-Encoder, Long Documents, Domain Adaptation, Talk2Ref

Academic Context

#Information Retrieval from Scientific Talks, #Citation Recommendation, #Cross-Modal Information Retrieval, #Handling Long Documents

Commercial Potential

Potential Products

Tools for automatically generating bibliographies from talks; Research discovery platforms; Academic search engines enhanced with talk content

Target Industries

Academia, Research Institutions, Publishing, Technology

Use Case Examples

Finding papers mentioned in a conference presentation; Building a knowledge graph of research connections from talks; Assisting students in literature review for a specific research area

Competitive Edge

Pioneers the task of reference prediction from scientific talks and provides the first large-scale dataset, establishing a new research direction and benchmark.

Resource Requirements

Compute Needs

Moderate (for training and inference)

Data Requirements

Scientific talks with transcripts and corresponding cited papers.

Deployment Constraints

Handling very long transcripts and computational cost of dual-encoder models.

Scalability

Scalability depends on the efficiency of the dual-encoder architecture and retrieval system.

Production Readiness

Maturity Level

Research

Time to Market

2-4 years
