📄 Abstract
This paper introduces ExpertLongBench, an expert-level benchmark containing
11 tasks from 9 domains that reflect realistic expert workflows and
applications. Beyond question answering, the application-driven tasks in
ExpertLongBench demand long-form outputs that can exceed 5,000 tokens and
strict adherence to domain-specific requirements. Notably, each task in
ExpertLongBench includes a rubric, designed or validated by domain experts, to
specify task requirements and guide output evaluation. Furthermore, we propose
CLEAR, an evaluation framework that supports accurate evaluation of long-form
model outputs in our benchmark. To achieve fine-grained, expert-aligned
evaluation, CLEAR derives checklists from both model outputs and references by
extracting information corresponding to items in the task-specific rubric.
Checklist items of model outputs are then compared with corresponding items of
reference outputs to assess their correctness, enabling grounded evaluation. We
benchmark 13 popular large language models (LLMs) and analyze components in
CLEAR, showing that (1) existing LLMs, with the top performer Gemini-2.5-Pro
achieving only a 33.4 F1 score, require significant improvement for
expert-level tasks; (2) models can generate content corresponding to the
required aspects, but that content is far from correct; and (3) accurate checklist extraction
and comparison in CLEAR can be achieved by open-weight models for more
scalable, reproducible, and low-cost usage.
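The checklist comparison described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rubric item names, the `exact_match` comparator (a stand-in for whatever judge CLEAR actually uses), and the precision/recall definitions behind the F1 score are all assumptions made for illustration, since the abstract does not specify them.

```python
from typing import Callable, Dict, Optional

# A checklist maps each rubric item to the text extracted for it,
# or None when nothing was found for that item.
Checklist = Dict[str, Optional[str]]


def exact_match(model_item: str, reference_item: str) -> bool:
    """Hypothetical comparator: case-insensitive string equality.
    A real evaluator would likely use an LLM or expert judgment here."""
    return model_item.strip().lower() == reference_item.strip().lower()


def checklist_f1(
    model: Checklist,
    reference: Checklist,
    is_correct: Callable[[str, str], bool] = exact_match,
) -> Dict[str, float]:
    """Score a model checklist against a reference checklist.

    Assumed definitions (not taken from the paper):
      precision = correct items / items the model filled in
      recall    = correct items / items the reference filled in
    """
    model_filled = [k for k, v in model.items() if v]
    reference_filled = [k for k, v in reference.items() if v]
    correct = sum(
        1
        for k in model_filled
        if reference.get(k) and is_correct(model[k], reference[k])
    )
    precision = correct / len(model_filled) if model_filled else 0.0
    recall = correct / len(reference_filled) if reference_filled else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Hypothetical rubric items for a made-up legal summarization task.
reference_checklist: Checklist = {
    "parties": "Smith v. Jones",
    "claims": "breach of contract",
    "remedy": "damages",
}
model_checklist: Checklist = {
    "parties": "smith v. jones",  # matches the reference
    "claims": "negligence",       # present but wrong
    "remedy": None,               # missing from the model output
}

scores = checklist_f1(model_checklist, reference_checklist)
```

In this toy case the model fills two of three rubric items and gets one right, which mirrors finding (2): content can cover the required aspects while still being incorrect, and a checklist-level score separates coverage from correctness.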
Key Contributions
Introduces ExpertLongBench, a benchmark for expert-level long-form generation comprising 11 tasks across 9 domains, with outputs that can exceed 5,000 tokens and must adhere strictly to domain-specific requirements. It also proposes CLEAR, an evaluation framework that extracts checklists from both model outputs and references according to task-specific rubrics, then compares them item by item for fine-grained, expert-aligned assessment.
Business Value
Enables more accurate assessment of LLMs for professional applications requiring high-quality, long-form content generation, such as legal drafting, scientific writing, or complex report generation. This leads to better selection and development of AI tools for specialized industries.