Abstract
Investigative journalists routinely confront large document collections.
Large language models (LLMs) with retrieval-augmented generation (RAG)
capabilities promise to accelerate the process of document discovery, but
newsroom adoption remains limited due to hallucination risks, verification
burden, and data privacy concerns. We present a journalist-centered approach to
LLM-powered document search that prioritizes transparency and editorial control
through a five-stage pipeline -- corpus summarization, search planning,
parallel thread execution, quality evaluation, and synthesis -- using small,
locally deployable language models that preserve data security and maintain
complete auditability through explicit citation chains. Evaluating three
quantized models (Gemma 3 12B, Qwen 3 14B, and GPT-OSS 20B) on two corpora, we
find substantial variation in reliability. All models achieved high citation
validity and ran effectively on standard desktop hardware (e.g., 24 GB of
memory), demonstrating feasibility for resource-constrained newsrooms. However,
systematic challenges emerged, including error propagation through multi-stage
synthesis and dramatic performance variation based on training data overlap
with corpus content. These findings suggest that effective newsroom AI
deployment requires careful model selection and system design, alongside human
oversight for maintaining standards of accuracy and accountability.
Key Contributions
This paper presents a journalist-centered approach for on-premise AI-powered document search using small, locally deployable language models (SLMs) integrated with RAG. The five-stage pipeline prioritizes transparency, editorial control, and data security by maintaining explicit citation chains and auditability, addressing key concerns hindering LLM adoption in newsrooms.
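To make the five-stage flow concrete, the sketch below shows one possible orchestration of corpus summarization, search planning, parallel thread execution, quality evaluation, and synthesis. It is a minimal illustration, not the authors' implementation: generate stands in for any call to a locally served quantized model, retrieve for RAG retrieval over a local index, and all names (run_pipeline, ThreadResult, run_thread) are hypothetical.

# Minimal sketch of the five-stage pipeline, assuming generic generate(prompt) -> str
# and retrieve(query) -> list[doc_id] callables backed by on-premise components.
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field

@dataclass
class ThreadResult:
    query: str
    answer: str
    citations: list[str] = field(default_factory=list)  # doc IDs backing the answer

def run_pipeline(corpus: dict[str, str], question: str, generate, retrieve) -> dict:
    # Stage 1: corpus summarization, giving the planner a compact view of the collection.
    summary = generate("Summarize this corpus:\n" + " ".join(list(corpus.values())[:50]))

    # Stage 2: search planning, decomposing the question into independent search queries.
    plan = generate(f"Corpus summary: {summary}\nQuestion: {question}\n"
                    "List search queries, one per line.")
    queries = [q.strip() for q in plan.splitlines() if q.strip()]

    # Stage 3: parallel thread execution, each thread retrieving and answering with citations.
    def run_thread(query: str) -> ThreadResult:
        doc_ids = retrieve(query)  # RAG retrieval over the local index
        context = "\n".join(f"[{d}] {corpus[d]}" for d in doc_ids if d in corpus)
        answer = generate(f"Answer '{query}' using only these documents, "
                          f"citing their [doc IDs]:\n{context}")
        return ThreadResult(query, answer, citations=doc_ids)

    with ThreadPoolExecutor() as pool:
        results = list(pool.map(run_thread, queries))

    # Stage 4: quality evaluation, keeping only threads whose citations resolve to real documents.
    kept = [r for r in results if r.citations and all(d in corpus for d in r.citations)]

    # Stage 5: synthesis, combining surviving threads while preserving the citation chain.
    synthesis = generate("Synthesize these findings, keeping every [doc ID] citation:\n"
                         + "\n".join(f"{r.query}: {r.answer}" for r in kept))
    return {"answer": synthesis, "citations": sorted({d for r in kept for d in r.citations})}

Keeping retrieval, evaluation, and synthesis as separate stages is what preserves the explicit citation chain: every claim in the final synthesis can be traced back through a surviving thread to the documents it retrieved, which is the auditability property the paper emphasizes.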
Business Value
Enables news organizations to leverage AI for faster and more efficient document analysis while maintaining strict data privacy and editorial control, potentially uncovering critical information more effectively.