📄 Abstract
Trustworthy language models should provide both correct and verifiable
answers. However, citations generated directly by standalone LLMs are often
unreliable. As a result, current systems insert citations by querying an
external retriever at inference time, introducing latency, infrastructure
dependence, and vulnerability to retrieval noise. We explore whether LLMs can
be made to reliably attribute to the documents seen during continual
pretraining without test-time retrieval, by revising the training process. To
study this, we construct CitePretrainBench, a benchmark that mixes real-world
corpora (Wikipedia, Common Crawl, arXiv) with novel documents and probes both
short-form (single-fact) and long-form (multi-fact) citation tasks. Our
approach follows a two-stage process: (1) continual pretraining to index
factual knowledge by binding it to persistent document identifiers; and (2)
instruction tuning to elicit citation behavior. We introduce Active Indexing
for the first stage, which creates generalizable, source-anchored bindings by
augmenting training with synthetic data that (i) restates each fact in diverse,
compositional forms and (ii) enforces bidirectional training (source-to-fact and
fact-to-source). This equips the model to both generate content from a cited
source and attribute its own answers, improving robustness to paraphrase and
composition. Experiments with Qwen-2.5-7B and Qwen-2.5-3B show that Active Indexing
consistently outperforms a Passive Indexing baseline, which simply appends an
identifier to each document, achieving citation precision gains of up to 30.2%
across all tasks and models. Our ablation studies reveal that performance
continues to improve as we scale the amount of augmented data, showing a clear
upward trend even at 16x the original token count. Finally, we show that
internal citations complement external ones by making the model more robust to
retrieval noise.
Authors (5)
Yukun Huang
Sanxing Chen
Jian Pei
Manzil Zaheer
Bhuwan Dhingra
Key Contributions
This paper proposes a retrieval-free method for knowledge attribution in LLMs, enabling them to reliably cite sources seen during continual pretraining without requiring test-time retrieval. It introduces a two-stage process: continual pretraining with 'Active Indexing' to bind factual knowledge to document identifiers, followed by instruction tuning to elicit citation behavior, aiming to produce correct and verifiable answers.
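The bidirectional binding at the heart of Active Indexing can be sketched as follows. This is an illustrative toy, not the paper's implementation: the paraphrase function is a trivial stand-in for an LLM rewriter, and the identifier format and prompt templates are hypothetical.

```python
def paraphrase(fact: str) -> str:
    """Stand-in for an LLM-based rewriter that restates a fact in a new form."""
    return f"Put differently, {fact[0].lower()}{fact[1:]}"


def make_bidirectional_examples(doc_id: str, fact: str) -> list[str]:
    """Bind a fact to its persistent document identifier in both directions."""
    restated = paraphrase(fact)
    # Source-to-fact: the model learns to generate content given a cited source.
    forward = f"Document [{doc_id}] states: {restated}"
    # Fact-to-source: the model learns to attribute a claim back to its source.
    backward = f"{restated} (source: [{doc_id}])"
    return [forward, backward]


examples = make_bidirectional_examples(
    "doc-0042", "The Eiffel Tower is 330 metres tall."
)
for ex in examples:
    print(ex)
```

In training, each fact would be restated many times in diverse, compositional forms, so the identifier binding generalizes beyond one surface wording rather than memorizing a single string.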
Business Value
Increases the trustworthiness and reliability of LLM-generated content, crucial for applications requiring factual accuracy and verifiability, such as research, journalism, and legal services.