Abstract
Large Language Models (LLMs) are increasingly applied to creative domains,
yet their performance in classical Chinese poetry generation and evaluation
remains poorly understood. We propose a three-step evaluation framework that
combines computational metrics, LLM-as-a-judge assessment, and human expert
validation. Using this framework, we evaluate six state-of-the-art LLMs across
multiple dimensions of poetic quality, including themes, emotions, imagery,
form, and style. Our analysis reveals systematic generation and evaluation
biases: LLMs exhibit "echo chamber" effects when assessing creative quality,
often converging on flawed standards that diverge from human judgments. These
findings highlight both the potential and the limitations of current LLMs as
proxies for literary generation and evaluation, and demonstrate the continued
need for hybrid validation by both humans and models in culturally and
technically complex creative tasks.
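To make the three-step framework concrete, the sketch below shows one way its stages could be wired together. This is an illustrative assumption, not the authors' code: the function names (tonal_pattern_score, judge_poem, validate_against_humans), the judge prompt, and the toy formal metric are all hypothetical placeholders.

```python
# Minimal sketch of the three-step evaluation framework described above.
# All names here are hypothetical, not the authors' implementation.
from statistics import correlation
from typing import Callable

def tonal_pattern_score(poem: str) -> float:
    """Step 1 (computational metrics): a placeholder formal check.
    A real metric would verify tonal patterns (pingze), rhyme, and meter."""
    lines = [line for line in poem.splitlines() if line.strip()]
    # Toy proxy: classical jueju/lushi forms have 4 or 8 lines.
    return 1.0 if len(lines) in (4, 8) else 0.0

def judge_poem(poem: str, llm: Callable[[str], str]) -> float:
    """Step 2 (LLM-as-a-judge): ask a model for a 1-5 quality rating.
    `llm` is any callable that maps a prompt string to a reply string."""
    prompt = (
        "Rate this classical Chinese poem from 1 (poor) to 5 (excellent) "
        "on theme, emotion, imagery, form, and style. Reply with one number.\n\n"
        + poem
    )
    digits = [ch for ch in llm(prompt) if ch.isdigit()]
    return float(digits[0]) if digits else 3.0  # midpoint fallback

def validate_against_humans(judge_scores: list[float],
                            human_scores: list[float]) -> float:
    """Step 3 (human validation): correlate judge and expert ratings.
    A low correlation would signal the 'echo chamber' divergence the
    paper reports, where models converge on standards humans reject."""
    return correlation(judge_scores, human_scores)  # Pearson's r, Python 3.10+
```

In this reading, the final correlation step is what keeps the pipeline honest: the computational and LLM-judge scores are only trusted to the extent they track expert judgments.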
Authors (3)
Bolei Ma
Yina Yao
Anna-Carolina Haensch
Submitted
October 17, 2025
Key Contributions
This paper proposes a novel three-step evaluation framework (computational metrics, LLM-as-a-judge, human validation) for classical Chinese poetry generation by LLMs. It reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects and converge on flawed standards that diverge from human judgments, underscoring the need for hybrid validation.
Business Value
Enhances the development of AI systems capable of nuanced creative tasks, ensuring they align with human cultural values and aesthetic standards, leading to more meaningful AI-generated art.