To Err Is Human; To Annotate, SILICON? Reducing Measurement Error in LLM Annotation

📄 Abstract

Unstructured text data annotation is foundational to management research, and Large Language Models (LLMs) promise a cost-effective and scalable alternative to human annotation. The validity of insights drawn from LLM-annotated data critically depends on minimizing the discrepancy between LLM-assigned labels and the unobserved ground truth, as well as on ensuring long-term reproducibility of results. We address the gap in the literature on LLM annotation by decomposing measurement error in LLM-based text annotation into four distinct sources: (1) guideline-induced error from inconsistent annotation criteria, (2) baseline-induced error from unreliable human reference standards, (3) prompt-induced error from suboptimal meta-instruction formatting, and (4) model-induced error from architectural differences across LLMs. We develop the SILICON methodology to systematically reduce measurement error from LLM annotation across all four sources. Empirical validation across seven management research cases shows that iteratively refined guidelines substantially increase LLM-human agreement compared to one-shot guidelines; that expert-generated baselines exhibit higher inter-annotator agreement and are less prone to producing misleading LLM-human agreement estimates than crowdsourced baselines; that placing content in the system prompt reduces prompt-induced error; and that model performance varies substantially across tasks. To further reduce error, we introduce a cost-effective multi-LLM labeling method in which only low-confidence items receive additional labels from alternative models. Finally, to address closed-source model retirement cycles, we introduce an intuitive regression-based methodology for establishing robust reproducibility protocols. Our evidence indicates that reducing each error source is necessary and that SILICON supports reproducible, rigorous annotation in management research.
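
As an illustration of the multi-LLM labeling step described above, the sketch below routes only low-confidence items to alternative models and resolves them by majority vote. The `primary_label` and `secondary_labels` callables are hypothetical stand-ins for calls to different LLM annotation backends, and the threshold is an assumed parameter; this is a minimal reading of the idea, not the paper's reference implementation.

```python
# Illustrative sketch of confidence-routed multi-LLM labeling (assumptions noted above).
from collections import Counter
from typing import Callable, List, Tuple

def multi_llm_labels(
    items: List[str],
    primary_label: Callable[[str], Tuple[str, float]],  # hypothetical: returns (label, confidence)
    secondary_labels: List[Callable[[str], str]],       # hypothetical fallback annotators
    confidence_threshold: float = 0.8,                   # assumed cutoff for "low confidence"
) -> List[str]:
    """Label every item with the primary model; re-label only low-confidence
    items with alternative models and take a majority vote."""
    final = []
    for text in items:
        label, confidence = primary_label(text)
        if confidence >= confidence_threshold:
            final.append(label)  # high confidence: keep the single, cheap label
            continue
        # Low confidence: gather extra labels from alternative models and vote.
        votes = [label] + [annotate(text) for annotate in secondary_labels]
        final.append(Counter(votes).most_common(1)[0][0])
    return final
```
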
Authors (3)
Xiang Cheng
Raveesh Mayya
João Sedoc
Submitted
December 19, 2024
arXiv Category
cs.CL

Key Contributions

Introduces the SILICON methodology to systematically reduce four sources of measurement error (guideline-, baseline-, prompt-, and model-induced) in LLM-based text annotation. This enhances the validity and reproducibility of insights derived from LLM-annotated data, particularly for management research.
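
One plausible reading of the regression-based reproducibility protocol mentioned in the abstract is sketched below: regress a successor model's item-level annotation scores on the archived (retired) model's scores and check for a near-unit slope and near-zero intercept. The function name, tolerances, and pass criterion are illustrative assumptions, not the paper's exact procedure.

```python
# Hedged sketch of a regression-based reproducibility check between an archived
# model and its successor; thresholds and synthetic data are illustrative only.
import numpy as np

def reproducibility_check(old_scores: np.ndarray, new_scores: np.ndarray,
                          slope_tol: float = 0.1, intercept_tol: float = 0.1) -> bool:
    """Fit new_scores ~ old_scores by OLS; a near-unit slope and near-zero
    intercept suggest the successor model reproduces the archived results."""
    slope, intercept = np.polyfit(old_scores, new_scores, deg=1)
    return abs(slope - 1.0) <= slope_tol and abs(intercept) <= intercept_tol

# Example with synthetic scores (illustration only).
rng = np.random.default_rng(0)
old = rng.uniform(0, 1, 200)
new = old + rng.normal(0, 0.02, 200)    # successor closely tracks the archived model
print(reproducibility_check(old, new))  # True under these tolerances
```
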

Business Value

Enables more reliable and cost-effective data annotation for research and business intelligence, leading to higher quality insights from text data.