
Excision Score: Evaluating Edits with Surgical Precision

Abstract

Many tasks revolve around editing a document, whether code or text. We formulate the revision similarity problem to unify a wide range of machine learning evaluation problems whose goal is to assess a revision to an existing document. We observe that revisions usually change only a small portion of an existing document, so the existing document and its immediate revisions share most of their content. We formulate five adequacy criteria for revision similarity measures, designed to align them with human judgement. We show that popular pairwise measures, like BLEU, fail to meet these criteria because their scores are dominated by the shared content: they report high similarity between two revisions when humans would assess them as quite different. This is a fundamental flaw we address. We propose a novel static measure, Excision Score (ES), which uses longest common subsequence (LCS) computations to remove the content an existing document shares with the ground-truth and predicted revisions, then compares only the remaining divergent regions. This is analogous to a surgeon creating a sterile field to focus on the work area. We use approximation to reduce the standard cubic LCS computation to quadratic. In code-editing evaluation, where static measures are often used as a cheap proxy for passing tests, we demonstrate that ES surpasses existing measures. When aligned with test execution on HumanEvalFix, ES improves over its nearest competitor, SARI, by 12% Pearson correlation and by more than 21% over standard measures like BLEU. The key criterion is invariance to shared context: when we perturb HumanEvalFix with increased shared context, ES's improvement over SARI grows to 20%, and to more than 30% over standard measures. ES also handles corner cases that other measures do not, such as correctly aligning moved code blocks and appropriately rewarding matching insertions or deletions.
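
As a rough illustration of the idea, the sketch below excises the content each revision shares with the original document before scoring only what remains. It is a minimal approximation of the approach described above, not the authors' implementation: it assumes whitespace tokenization, uses Python's difflib as a stand-in for a true LCS alignment, and scores the excised remainders with a simple matching ratio. All function names are hypothetical.

```python
# Minimal sketch of the Excision Score idea. Every detail below (tokenization,
# alignment method, final scoring) is an illustrative assumption; the paper's
# actual algorithm and its quadratic approximation are not reproduced here.
from difflib import SequenceMatcher


def lcs_mask(source_tokens, revision_tokens):
    """Mark tokens in `revision_tokens` that align with `source_tokens`.

    Uses difflib's matching blocks as a practical stand-in for a true LCS.
    """
    mask = [False] * len(revision_tokens)
    sm = SequenceMatcher(None, source_tokens, revision_tokens, autojunk=False)
    for block in sm.get_matching_blocks():
        for j in range(block.b, block.b + block.size):
            mask[j] = True
    return mask


def excise(revision_tokens, shared_mask):
    """Drop tokens shared with the original document (the 'sterile field')."""
    return [t for t, shared in zip(revision_tokens, shared_mask) if not shared]


def excision_score(original, ground_truth, prediction):
    """Compare only the divergent regions of two revisions of `original`."""
    orig_toks = original.split()
    gt_rest = excise(ground_truth.split(),
                     lcs_mask(orig_toks, ground_truth.split()))
    pred_rest = excise(prediction.split(),
                       lcs_mask(orig_toks, prediction.split()))
    if not gt_rest and not pred_rest:
        return 1.0  # both revisions match the original in the excised view
    return SequenceMatcher(None, gt_rest, pred_rest, autojunk=False).ratio()
```

The `autojunk=False` flag keeps difflib from discarding frequently occurring tokens, which matters for code, where identifiers and keywords repeat heavily.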
Authors (4)
Nikolai Gruzinov, Ksenia Sycheva, Earl T. Barr, Alex Bezzubov

Submitted
October 24, 2025

arXiv Category
cs.LG

Key Contributions

Introduces the Excision Score (ES), a novel metric for evaluating document revisions that addresses a fundamental flaw of traditional pairwise measures like BLEU: their scores are dominated by content shared with the original document. ES uses a longest common subsequence (LCS) computation to excise that shared content and score only the changed portions. The paper also proposes five adequacy criteria for revision similarity measures and shows that ES, unlike existing measures, satisfies them, aligning it more closely with human judgment.
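
The snippet below exercises the sketch from the Abstract section to illustrate the invariance-to-shared-context criterion: a whole-document similarity is inflated by the unchanged surrounding context, while the excision-based score compares only the edits themselves. The token strings are contrived toy data, not from the paper.

```python
# One "token" differs between the ground-truth fix and the prediction, buried
# in 20 tokens of unchanged context.
from difflib import SequenceMatcher

original = "a b c d e f g h i j BUG k l m n o p q r s t"
ground_truth = "a b c d e f g h i j FIX k l m n o p q r s t"
prediction = "a b c d e f g h i j WRONG k l m n o p q r s t"

whole_doc = SequenceMatcher(
    None, ground_truth.split(), prediction.split(), autojunk=False
).ratio()
print(f"whole-document ratio: {whole_doc:.2f}")  # ~0.95 despite a wrong fix
print(f"excision score: {excision_score(original, ground_truth, prediction):.2f}")  # 0.00
```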

Business Value

Provides a more accurate way to evaluate the quality of edits to documents and code, which is crucial for collaborative platforms, automated editing tools, and version-control systems.