Research Paper. Audience: Computational Linguists, Literary Scholars, NLP Researchers, Digital Humanities Researchers

The Elephant in the Coreference Room: Resolving Coreference in Full-Length French Fiction Works


Abstract

While coreference resolution is attracting more interest than ever from computational literature researchers, representative datasets of fully annotated long documents remain surprisingly scarce. In this paper, we introduce a new annotated corpus of three full-length French novels, totaling over 285,000 tokens. Unlike previous datasets focused on shorter texts, our corpus addresses the challenges posed by long, complex literary works, enabling evaluation of coreference models in the context of long reference chains. We present a modular coreference resolution pipeline that allows for fine-grained error analysis. We show that our approach is competitive and scales effectively to long documents. Finally, we demonstrate its usefulness to infer the gender of fictional characters, showcasing its relevance for both literary analysis and downstream NLP tasks.

Authors (2)

Antoine Bourgois

Thierry Poibeau

Submitted

October 17, 2025

arXiv Category

cs.CL


Key Contributions

This paper introduces a new annotated corpus of three full-length French novels for coreference resolution, addressing the scarcity of long-document datasets. It presents a modular pipeline that scales effectively and demonstrates its utility for inferring character gender, benefiting both literary analysis and NLP.

Business Value

Enables deeper computational analysis of literary works, unlocking new insights for researchers and potentially aiding in automated content analysis and summarization of fictional texts.

Paper Metadata

Innovation Type

Dataset Creation / Methodological Application

Deployment Feasibility

Moderate. Requires specialized annotation and NLP expertise, but the pipeline is modular.

Limitations Addressed

Addresses the lack of representative datasets for coreference resolution in long, complex literary works and the challenges associated with long reference chains.
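To make "long reference chains" concrete, the sketch below measures each chain by its number of mentions and by the token distance between its first and last mention. The input layout (pairs of chain identifier and token offset) and the function name `chain_stats` are assumptions made for illustration; the corpus's actual annotation format is not described here.

```python
from collections import defaultdict


def chain_stats(mentions):
    """Summarize coreference chains.

    mentions: iterable of (chain_id, token_offset) pairs (hypothetical layout).
    Returns {chain_id: {"mentions": count, "span": last_offset - first_offset}}.
    """
    offsets = defaultdict(list)
    for chain_id, offset in mentions:
        offsets[chain_id].append(offset)
    return {
        cid: {"mentions": len(offs), "span": max(offs) - min(offs)}
        for cid, offs in offsets.items()
    }


# Made-up offsets: chain "A" is mentioned near the start and near the end
# of a novel-length text, chain "B" only once.
example = [("A", 10), ("B", 40), ("A", 120), ("A", 250_000)]
stats = chain_stats(example)
print(stats["A"])
print(stats["B"])
```

A chain whose span covers most of a novel is the kind of case that short-document datasets cannot exercise.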

Technical Tags

coreference resolution, French fiction, annotated corpus, long documents, long reference chains, literary analysis, gender inference, NLP pipeline

Research Topics

Natural Language Processing, Computational Linguistics, Corpus Linguistics, Literary Studies, Information Extraction

Methods & Architectures

Coreference resolution pipeline, Modular design, Fine-grained error analysis, Gender inference module
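One simple way a gender inference step could sit on top of resolved coreference chains is a majority vote over gendered cues in a character's mentions. The sketch below is illustrative only and is not the paper's method: the cue lists, the function name `infer_gender`, the `min_margin` parameter, and the example chains are all assumptions.

```python
from collections import Counter

# Hypothetical cue lists, chosen for illustration (not taken from the paper).
FEMININE_CUES = {"elle", "madame", "mademoiselle", "mme", "mlle"}
MASCULINE_CUES = {"il", "monsieur", "m."}


def infer_gender(chain, min_margin=1):
    """Return 'F', 'M', or 'unknown' for one coreference chain.

    chain: list of mention strings that refer to the same character.
    min_margin: minimum vote difference required to commit to a label.
    """
    votes = Counter()
    for mention in chain:
        for token in mention.lower().split():
            if token in FEMININE_CUES:
                votes["F"] += 1
            elif token in MASCULINE_CUES:
                votes["M"] += 1
    margin = votes["F"] - votes["M"]
    if margin >= min_margin:
        return "F"
    if -margin >= min_margin:
        return "M"
    return "unknown"


# Made-up chains for two fictional characters and one uninformative chain.
chain_a = ["Jeanne", "elle", "Madame Dupont", "elle", "la jeune femme"]
chain_b = ["Paul", "il", "Monsieur Dupont", "il"]
chain_c = ["le notaire"]
print(infer_gender(chain_a), infer_gender(chain_b), infer_gender(chain_c))
```

Because the vote is taken over a whole chain, errors in the upstream coreference step propagate directly into the gender label, which is one reason a modular design with per-stage error analysis is useful.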

Applications & Tasks

Literary Analysis, Digital Humanities, Natural Language Processing, Scarcity of Annotated Long Documents, Challenges of Long Reference Chains, Evaluating Coreference Models on Fiction, Coreference Resolution in French Fiction, Inferring Character Gender, Enabling Literary Analysis via NLP

Datasets & Benchmarks

Datasets

Annotated corpus of three full-length French novels

Related Fields

Natural Language Processing, Computational Linguistics, Digital Humanities, Literary Studies

Keywords

Coreference Resolution, French Fiction, Annotated Corpus, Long Documents, Literary Analysis, NLP, Digital Humanities, Gender Inference, Computational Linguistics, Corpus

Academic Context

#Natural Language Processing, #Computational Linguistics, #Corpus Linguistics, #Literary Studies, #Information Extraction

Commercial Potential

Potential Products

Literary analysis software, Coreference resolution tools for long texts, Character analysis tools

Target Industries

Academia, Publishing, Digital Humanities, Technology (NLP)

Use Case Examples

Analyzing character relationships in novels; Automating the identification of pronouns and their referents in literature; Studying narrative structure through coreference patterns

Competitive Edge

Provides a valuable, specialized dataset and a robust pipeline for a challenging NLP task within the domain of literary analysis.

Market Opportunity

Growing field of Digital Humanities and NLP applications in literature.

Revenue Models

Licensing of the corpus; development of specialized NLP tools for literary analysis.

Resource Requirements

Compute Needs

Standard computational resources for NLP model training and inference.

Data Requirements

The newly created annotated corpus of French fiction.

Deployment Constraints

Requires significant effort for annotation and model training; domain-specific challenges of literary texts.

Scalability

The pipeline is designed to scale effectively to long documents.

Production Readiness

Maturity Level

Research

Time to Market

Medium (for tools based on the corpus)

Patent Potential

Low
