ScholarBench: A Bilingual Benchmark for Abstraction, Comprehension, and Reasoning Evaluation in Academic Contexts

Abstract

Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving that evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature and encompasses five distinct problem types. Unlike prior benchmarks, it evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, the benchmark is an English-Korean bilingual dataset, enabling simultaneous evaluation of LLMs' linguistic capabilities in both languages. It comprises 5,031 examples in Korean and 5,309 in English; even state-of-the-art models such as o3-mini achieve an average evaluation score of only 0.543, demonstrating the challenging nature of this benchmark.
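The abstract reports a single average evaluation score (e.g., 0.543 for o3-mini) aggregated over roughly 10,000 bilingual examples spanning eight domains. Below is a minimal sketch of how such a per-example score could be averaged overall and broken down by language and domain. The record fields, domain labels, and exact-match scorer are illustrative assumptions, not the paper's released data format or official metric.

```python
# Hypothetical sketch of scoring a model on a bilingual benchmark such as
# ScholarBench. Field names, domains, and the scoring rule are assumptions
# made for illustration only.
from dataclasses import dataclass
from collections import defaultdict
from typing import Callable, Iterable

@dataclass
class Example:
    language: str   # "en" or "ko"
    domain: str     # one of the eight research domains (label assumed)
    question: str
    reference: str  # gold answer or rubric target

def evaluate(examples: Iterable[Example],
             generate: Callable[[str], str],
             score: Callable[[str, str], float]) -> dict:
    """Average a per-example score in [0, 1] overall and per language/domain."""
    totals, counts = defaultdict(float), defaultdict(int)
    for ex in examples:
        s = score(generate(ex.question), ex.reference)
        for key in ("overall", f"lang:{ex.language}", f"domain:{ex.domain}"):
            totals[key] += s
            counts[key] += 1
    return {k: totals[k] / counts[k] for k in totals}

if __name__ == "__main__":
    demo = [
        Example("en", "social_sciences", "Summarize the main hypothesis...", "..."),
        Example("ko", "natural_sciences", "다음 논문의 핵심 주장을 요약하시오.", "..."),
    ]
    # Dummy model and exact-match scorer, just to make the sketch runnable.
    results = evaluate(demo,
                       generate=lambda q: "model answer",
                       score=lambda pred, ref: float(pred.strip() == ref.strip()))
    print(results)
```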
Authors (7)
Dongwon Noh
Donghyeok Koh
Junghun Yuk
Gyuwan Kim
Jaeyong Lee
Kyungtae Lim
+1 more
Submitted
May 22, 2025
arXiv Category
cs.CL

Key Contributions

Introduces ScholarBench, a novel bilingual benchmark designed to evaluate the deep expert knowledge and complex academic problem-solving abilities of LLMs. It addresses the scalability limitations of prior benchmarks by focusing on specialized, logically complex contexts derived from academic literature across eight research domains, evaluating abstraction, comprehension, and reasoning.

Business Value

Enables more accurate and comprehensive evaluation of LLMs for academic and research applications, leading to better model development for specialized domains.