Abstract
Scientific talks are a growing medium for disseminating research, and
automatically identifying relevant literature that grounds or enriches a talk
would be highly valuable for researchers and students alike. We introduce
Reference Prediction from Talks (RPT), a new task that maps long and
unstructured scientific presentations to relevant papers. To support research
on RPT, we present Talk2Ref, the first large-scale dataset of its kind,
containing 6,279 talks and 43,429 cited papers (26 per talk on average), where
relevance is approximated by the papers cited in the talk's corresponding
source publication. We establish strong baselines by evaluating
state-of-the-art text embedding models in zero-shot retrieval scenarios, and
propose a dual-encoder architecture trained on Talk2Ref. We further explore
strategies for handling long transcripts, as well as training for domain
adaptation. Our results show that fine-tuning on Talk2Ref significantly
improves citation prediction performance, demonstrating both the challenges of
the task and the effectiveness of our dataset for learning semantic
representations from spoken scientific content. The dataset and trained models
are released under an open license to foster future research on integrating
spoken scientific communication into citation recommendation systems.
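
To make the retrieval setup concrete, the following is a minimal sketch of embedding-based reference prediction with one simple way of handling a long transcript: split it into chunks, embed each chunk, mean-pool the chunk vectors, and rank candidate papers by cosine similarity. Everything here is an illustrative assumption rather than the authors' method: `embed` is a toy bag-of-words stand-in for a real text embedding model, and `chunk`, `mean_pool`, `rank_papers`, the chunk size, and the sample texts are hypothetical.

```python
import math
from collections import Counter


def embed(text, vocab):
    """Toy stand-in for a text embedding model: a bag-of-words count vector."""
    counts = Counter(text.lower().split())
    return [float(counts[w]) for w in vocab]


def chunk(words, size):
    """Split a list of words into consecutive chunks of at most `size` words."""
    return [words[i:i + size] for i in range(0, len(words), size)]


def mean_pool(vectors):
    """Average a list of equal-length vectors into a single vector."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector is all zeros."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return dot / (na * nb) if na and nb else 0.0


def rank_papers(transcript, papers, chunk_size=8, top_k=2):
    """Rank candidate papers against a talk transcript (hypothetical helper)."""
    # Shared vocabulary so that talk and paper vectors live in the same space.
    texts = [transcript, *papers.values()]
    vocab = sorted({w for t in texts for w in t.lower().split()})
    # Long-input handling: embed each chunk separately, then mean-pool.
    chunks = chunk(transcript.split(), chunk_size)
    talk_vec = mean_pool([embed(" ".join(c), vocab) for c in chunks])
    scored = [(cosine(talk_vec, embed(text, vocab)), pid)
              for pid, text in papers.items()]
    scored.sort(reverse=True)
    return [pid for _, pid in scored[:top_k]]


# Made-up sample data, not drawn from Talk2Ref.
papers = {
    "paper_a": "dense retrieval with dual encoder models",
    "paper_b": "protein folding with deep learning",
    "paper_c": "speech transcripts for citation recommendation",
}
transcript = ("today we talk about citation recommendation from speech transcripts "
              "and we train dual encoder retrieval models on talks")
ranked = rank_papers(transcript, papers)
print(ranked)
```

In a trained dual-encoder setting, the toy `embed` function would be replaced by learned talk and paper encoders; the ranking step by similarity would stay the same in spirit.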
Authors (3)
Frederik Broy
Maike Züfle
Jan Niehues
Submitted
October 28, 2025
Key Contributions
Introduces the Reference Prediction from Talks (RPT) task and the Talk2Ref dataset, the first large-scale resource for mapping scientific talks to relevant papers. Establishes baselines using state-of-the-art embedding models in zero-shot retrieval and proposes a dual-encoder architecture trained on Talk2Ref, addressing the challenge of finding the literature that grounds a scientific presentation.
Business Value
Could accelerate scientific discovery and knowledge dissemination by helping researchers and students quickly find the literature relevant to a talk, improving research efficiency.