📄 Abstract
While coreference resolution is attracting more interest than ever from
computational literature researchers, representative datasets of fully
annotated long documents remain surprisingly scarce. In this paper, we
introduce a new annotated corpus of three full-length French novels, totaling
over 285,000 tokens. Unlike previous datasets focused on shorter texts, our
corpus addresses the challenges posed by long, complex literary works, enabling
evaluation of coreference models in the context of long reference chains. We
present a modular coreference resolution pipeline that allows for fine-grained
error analysis. We show that our approach is competitive and scales effectively
to long documents. Finally, we demonstrate its usefulness for inferring the gender
of fictional characters, showcasing its relevance for both literary analysis
and downstream NLP tasks.
Authors (2)
Antoine Bourgois
Thierry Poibeau
Submitted
October 17, 2025
Key Contributions
This paper introduces a new annotated corpus of three full-length French novels for coreference resolution, addressing the scarcity of long-document datasets. It presents a modular pipeline that scales effectively and demonstrates its utility for inferring character gender, benefiting both literary analysis and NLP.
Business Value
Enables deeper computational analysis of literary works, such as following characters through long reference chains in full-length novels and inferring their gender, and may support automated content analysis and summarization of fictional texts.