Abstract
Large Language Models (LLMs) are increasingly applied to creative domains,
yet their performance in classical Chinese poetry generation and evaluation
remains poorly understood. We propose a three-step evaluation framework that
combines computational metrics, LLM-as-a-judge assessment, and human expert
validation. Using this framework, we evaluate six state-of-the-art LLMs across
multiple dimensions of poetic quality, including themes, emotions, imagery,
form, and style. Our analysis reveals systematic generation and evaluation
biases: LLMs exhibit "echo chamber" effects when assessing creative quality,
often converging on flawed standards that diverge from human judgments. These
findings highlight both the potential and the limitations of current LLMs as
proxies for literary generation and evaluation, and demonstrate the continued
need for hybrid validation by both humans and models in culturally and
technically complex creative tasks.
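To make the three-step framework concrete, the sketch below shows one way its stages could be wired together. This is an illustrative assumption, not the authors' code: the function names (tonal_pattern_score, judge_poem, validate_against_humans), the judge prompt, and the toy formal metric are all hypothetical placeholders.

```python
# Minimal sketch of the three-step evaluation framework described above.
# All names here are hypothetical, not the authors' implementation.
from statistics import correlation
from typing import Callable

def tonal_pattern_score(poem: str) -> float:
    """Step 1 (computational metrics): a placeholder formal check.
    A real metric would verify tonal patterns (pingze), rhyme, and meter."""
    lines = [line for line in poem.splitlines() if line.strip()]
    # Toy proxy: classical jueju/lushi forms have 4 or 8 lines.
    return 1.0 if len(lines) in (4, 8) else 0.0

def judge_poem(poem: str, llm: Callable[[str], str]) -> float:
    """Step 2 (LLM-as-a-judge): ask a model for a 1-5 quality rating.
    `llm` is any callable that maps a prompt string to a reply string."""
    prompt = (
        "Rate this classical Chinese poem from 1 (poor) to 5 (excellent) "
        "on theme, emotion, imagery, form, and style. Reply with one number.\n\n"
        + poem
    )
    digits = [ch for ch in llm(prompt) if ch.isdigit()]
    return float(digits[0]) if digits else 3.0  # midpoint fallback

def validate_against_humans(judge_scores: list[float],
                            human_scores: list[float]) -> float:
    """Step 3 (human validation): correlate judge and expert ratings.
    A low correlation would signal the 'echo chamber' divergence the
    paper reports, where models converge on standards humans reject."""
    return correlation(judge_scores, human_scores)  # Pearson's r, Python 3.10+
```

In this reading, the final correlation step is what keeps the pipeline honest: the computational and LLM-judge scores are only trusted to the extent they track expert judgments.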
Authors (3)
Bolei Ma
Yina Yao
Anna-Carolina Haensch
Submitted
October 17, 2025
Key Contributions
This paper proposes a novel three-step evaluation framework (computational metrics, LLM-as-a-judge, human validation) for classical Chinese poetry generation by LLMs. It reveals systematic generation and evaluation biases: LLMs exhibit "echo chamber" effects and converge on flawed standards that diverge from human judgments, underscoring the need for hybrid validation.
Business Value
Enhances the development of AI systems capable of nuanced creative tasks, ensuring they align with human cultural values and aesthetic standards, leading to more meaningful AI-generated art.