
Cite Pretrain: Retrieval-Free Knowledge Attribution for Large Language Models

📄 Abstract

Trustworthy language models should provide both correct and verifiable answers. However, citations generated directly by standalone LLMs are often unreliable. As a result, current systems insert citations by querying an external retriever at inference time, introducing latency, infrastructure dependence, and vulnerability to retrieval noise. We explore whether LLMs can be made to reliably attribute to the documents seen during continual pretraining without test-time retrieval, by revising the training process. To study this, we construct CitePretrainBench, a benchmark that mixes real-world corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both short-form (single-fact) and long-form (multi-fact) citation tasks. Our approach follows a two-stage process: (1) continual pretraining to index factual knowledge by binding it to persistent document identifiers; and (2) instruction tuning to elicit citation behavior. We introduce Active Indexing for the first stage, which creates generalizable, source-anchored bindings by augmenting training with synthetic data that (i) restate each fact in diverse, compositional forms and (ii) enforce bidirectional training (source-to-fact and fact-to-source). This equips the model to both generate content from a cited source and attribute its own answers, improving robustness to paraphrase and composition. Experiments with Qwen-2.5-7B and Qwen-2.5-3B show that Active Indexing consistently outperforms a Passive Indexing baseline, which simply appends an identifier to each document, achieving citation precision gains of up to 30.2% across all tasks and models. Our ablation studies reveal that performance continues to improve as we scale the amount of augmented data, showing a clear upward trend even at 16x the original token count. Finally, we show that internal citations complement external ones by making the model more robust to retrieval noise.
Authors (5)
Yukun Huang
Sanxing Chen
Jian Pei
Manzil Zaheer
Bhuwan Dhingra
Submitted
June 21, 2025
arXiv Category
cs.AI
arXiv PDF

Key Contributions

This paper proposes a retrieval-free method for knowledge attribution in LLMs, enabling them to reliably cite sources seen during continual pretraining without requiring test-time retrieval. It introduces a two-stage process: continual pretraining with 'Active Indexing' to bind factual knowledge to document identifiers, followed by instruction tuning to elicit citation behavior, aiming to produce correct and verifiable answers.
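The contrast between the two indexing strategies can be sketched as data augmentation over a pretraining corpus. The function names, identifier format, and templates below are illustrative assumptions, not the paper's actual implementation; the point is the bidirectional (source-to-fact and fact-to-source) binding that distinguishes Active from Passive Indexing.

```python
# Hypothetical sketch of Passive vs. Active Indexing augmentation.
# Templates and IDs are illustrative, not taken from the paper.

def passive_indexing(doc_id: str, text: str) -> list[str]:
    """Baseline: simply append a persistent identifier to the document."""
    return [f"{text} [Doc: {doc_id}]"]

def active_indexing(doc_id: str, facts: list[str]) -> list[str]:
    """Create bidirectional, source-anchored training examples:
    source-to-fact (generate content conditioned on the citation) and
    fact-to-source (attribute the content back to its citation)."""
    examples = []
    for fact in facts:
        # Source-to-fact: the identifier precedes the content.
        examples.append(f"According to [Doc: {doc_id}]: {fact}")
        # Fact-to-source: the content is followed by its attribution.
        examples.append(f"{fact} (Source: [Doc: {doc_id}])")
    return examples

# Usage with a hypothetical document identifier and restated facts.
augmented = active_indexing("doc_001", [
    "The capital of France is Paris.",
    "Paris lies on the Seine river.",
])
```

In practice the paper also restates each fact in diverse, compositional paraphrases before binding, so the augmented set grows well beyond the two examples per fact shown here.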

Business Value

Increases the trustworthiness and reliability of LLM-generated content, crucial for applications requiring factual accuracy and verifiability, such as research, journalism, and legal services.