Abstract
Investigative journalists routinely confront large document collections.
Large language models (LLMs) with retrieval-augmented generation (RAG)
capabilities promise to accelerate the process of document discovery, but
newsroom adoption remains limited due to hallucination risks, verification
burden, and data privacy concerns. We present a journalist-centered approach to
LLM-powered document search that prioritizes transparency and editorial control
through a five-stage pipeline -- corpus summarization, search planning,
parallel thread execution, quality evaluation, and synthesis -- using small,
locally deployable language models that preserve data security and maintain
complete auditability through explicit citation chains. Evaluating three
quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we
find substantial variation in reliability. All models achieved high citation
validity and ran effectively on standard desktop hardware (e.g., 24 GB of
memory), demonstrating feasibility for resource-constrained newsrooms. However,
systematic challenges emerged, including error propagation through multi-stage
synthesis and dramatic performance variation based on training data overlap
with corpus content. These findings suggest that effective newsroom AI
deployment requires careful model selection and system design, alongside human
oversight for maintaining standards of accuracy and accountability.
Key Contributions
This paper presents a journalist-centered approach for on-premise AI-powered document search using small, locally deployable language models (SLMs) integrated with RAG. The five-stage pipeline prioritizes transparency, editorial control, and data security by maintaining explicit citation chains and auditability, addressing key concerns hindering LLM adoption in newsrooms.
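To make the five-stage flow concrete, the sketch below shows one possible orchestration of corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis. It is a minimal illustration, not the authors' implementation: generate stands in for any call to a locally served quantized model, retrieve for RAG retrieval over a local index, and all names (run_pipeline, ThreadResult, run_thread) are hypothetical.

# Minimal sketch of the five-stage pipeline, assuming generic generate(prompt) -> str
# and retrieve(query) -> list[doc_id] callables backed by on-premise components.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class ThreadResult:
    query: str
    answer: str
    citations: list[str] = field(default_factory=list)  # doc IDs backing the answer

def run_pipeline(corpus: dict[str, str], question: str, generate, retrieve) -> dict:
    # Stage 1: corpus summarization, giving the planner a compact view of the collection.
    summary = generate("Summarize this corpus:\n" + " ".join(list(corpus.values())[:50]))

    # Stage 2: search planning, decomposing the question into independent search queries.
    plan = generate(f"Corpus summary: {summary}\nQuestion: {question}\n"
                    "List search queries, one per line.")
    queries = [q.strip() for q in plan.splitlines() if q.strip()]

    # Stage 3: parallel thread execution, each thread retrieving and answering with citations.
    def run_thread(query: str) -> ThreadResult:
        doc_ids = retrieve(query)  # RAG retrieval over the local index
        context = "\n".join(f"[{d}] {corpus[d]}" for d in doc_ids if d in corpus)
        answer = generate(f"Answer '{query}' using only these documents, "
                          f"citing their [doc IDs]:\n{context}")
        return ThreadResult(query, answer, citations=doc_ids)

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_thread, queries))

    # Stage 4: quality evaluation, keeping only threads whose citations resolve to real documents.
    kept = [r for r in results if r.citations and all(d in corpus for d in r.citations)]

    # Stage 5: synthesis, combining surviving threads while preserving the citation chain.
    synthesis = generate("Synthesize these findings, keeping every [doc ID] citation:\n"
                         + "\n".join(f"{r.query}: {r.answer}" for r in kept))
    return {"answer": synthesis, "citations": sorted({d for r in kept for d in r.citations})}

Keeping retrieval, evaluation, and synthesis as separate stages is what preserves the explicit citation chain: every claim in the final synthesis can be traced back through a surviving thread to the documents it retrieved, which is the auditability property the paper emphasizes.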
Business Value
Enables news organizations to leverage AI for faster and more efficient document analysis while maintaining strict data privacy and editorial control, potentially uncovering critical information more effectively.