📄 Abstract
Unstructured text data annotation is foundational to management research, and
Large Language Models (LLMs) promise a cost-effective and scalable alternative
to human annotation. The validity of insights drawn from LLM-annotated data
critically depends on minimizing the discrepancy between LLM-assigned labels
and the unobserved ground truth, as well as on ensuring the long-term
reproducibility of results. We address the gap in the literature on LLM
annotation by decomposing measurement error in LLM-based text annotation into
four distinct sources: (1) guideline-induced error from inconsistent annotation
criteria, (2) baseline-induced error from unreliable human reference standards,
(3) prompt-induced error from suboptimal meta-instruction formatting, and (4)
model-induced error from architectural differences across LLMs. We develop the
SILICON methodology to systematically reduce measurement error from all four
sources. Empirical validation across seven management research cases shows that
iteratively refined guidelines substantially increase LLM-human agreement
compared to one-shot guidelines; that expert-generated baselines exhibit higher
inter-annotator agreement and are less prone to producing misleading LLM-human
agreement estimates than crowdsourced baselines; that placing content in the
system prompt reduces prompt-induced error; and that model performance varies
substantially across tasks. To further reduce error, we introduce a
cost-effective multi-LLM labeling method in which only low-confidence items
receive additional labels from alternative models. Finally, to address
closed-source model retirement cycles, we introduce an intuitive
regression-based methodology for establishing robust reproducibility protocols.
Our evidence indicates that reducing each error source is necessary and that
SILICON supports reproducible, rigorous annotation in management research.
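The abstract repeatedly leans on LLM-human agreement as the quality measure. A minimal sketch of how such agreement is typically computed, using Cohen's kappa as an illustrative chance-corrected statistic (the abstract does not specify which metric the paper uses, and the labels below are toy data):

```python
# Illustrative LLM-human agreement check; Cohen's kappa is one common
# chance-corrected choice, not necessarily the paper's exact metric.
from sklearn.metrics import cohen_kappa_score

human_labels = ["pos", "neg", "pos", "neu", "pos"]  # expert baseline (toy data)
llm_labels   = ["pos", "neg", "neu", "neu", "pos"]  # LLM annotations (toy data)

kappa = cohen_kappa_score(human_labels, llm_labels)
print(f"LLM-human agreement (kappa): {kappa:.2f}")  # 1.0 = perfect, 0 ~ chance
```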
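The multi-LLM labeling method is described only at a high level here. Below is a minimal sketch of one plausible implementation, assuming each annotator call returns a label with a confidence score (for example, derived from token log-probabilities). The `Annotator` callable type, the 0.8 threshold, and the majority-vote resolution are all assumptions for illustration, not details taken from the paper:

```python
# Hypothetical sketch of confidence-routed multi-LLM labeling: a primary
# model labels everything, and only low-confidence items are sent to
# alternative models, keeping the extra cost proportional to uncertainty.
from collections import Counter
from typing import Callable

Annotator = Callable[[str], tuple[str, float]]  # text -> (label, confidence)

def multi_llm_label(
    items: list[str],
    primary: Annotator,
    alternates: list[Annotator],
    threshold: float = 0.8,  # assumed cutoff; tune per task
) -> list[str]:
    labels = []
    for text in items:
        label, conf = primary(text)
        if conf >= threshold:
            labels.append(label)  # high confidence: keep the single label
            continue
        # Low confidence: collect extra labels from alternative models
        votes = Counter([label] + [alt(text)[0] for alt in alternates])
        labels.append(votes.most_common(1)[0][0])  # resolve by majority vote
    return labels
```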
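Similarly, the regression-based reproducibility protocol is only named in the abstract. One plausible reading, sketched under the assumption that each item carries a numeric score (for instance, agreement with the human baseline) under both the retired and the replacement model: regress the new scores on the old and check that the slope is near 1 and the intercept near 0. The function name and synthetic data below are hypothetical:

```python
# Hypothetical sketch of a regression-based reproducibility check across
# a model retirement; the paper's exact specification is not given here.
import numpy as np
import statsmodels.api as sm

def reproducibility_regression(old_scores, new_scores):
    """Regress replacement-model scores on retired-model scores.

    A slope near 1 and intercept near 0 (with tight confidence intervals)
    suggests the replacement model reproduces the retired model's behavior.
    """
    X = sm.add_constant(old_scores)
    fit = sm.OLS(new_scores, X).fit()
    intercept, slope = fit.params
    return slope, intercept, fit.conf_int()

# Example with synthetic per-item scores:
rng = np.random.default_rng(0)
old = rng.uniform(0, 1, 200)
new = old + rng.normal(0, 0.05, 200)  # nearly reproducible replacement
print(reproducibility_regression(old, new))
```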
Authors (3)
Xiang Cheng
Raveesh Mayya
João Sedoc
Submitted
December 19, 2024
Key Contributions
Introduces the SILICON methodology to systematically reduce four sources of measurement error (guideline-, baseline-, prompt-, and model-induced) in LLM-based text annotation. This enhances the validity and reproducibility of insights derived from LLM-annotated data, particularly for management research.
Business Value
Enables more reliable and cost-effective data annotation for research and business intelligence, leading to higher-quality insights from text data.