📄 Abstract
Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, the benchmark is an English-Korean bilingual dataset, enabling simultaneous evaluation of LLMs' linguistic capabilities in both languages. It comprises 5,031 examples in Korean and 5,309 in English, and even state-of-the-art models such as o3-mini achieve an average evaluation score of only 0.543, demonstrating the benchmark's difficulty.
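
The abstract reports a single average evaluation score (e.g., 0.543 for o3-mini). As a minimal sketch of how such an aggregate might be computed, and not the authors' actual evaluation code, the snippet below averages hypothetical per-example scores and breaks them down by domain and language; the record fields (`domain`, `language`, `score`) and the assumption that scores lie in [0, 1] are illustrative only.

```python
from collections import defaultdict
from statistics import mean

def average_scores(records):
    """Return the overall mean score plus per-domain and per-language means.

    Each record is assumed (hypothetically) to be a dict with
    'domain', 'language', and a numeric 'score' in [0, 1].
    """
    by_domain = defaultdict(list)
    by_language = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r["score"])
        by_language[r["language"]].append(r["score"])
    return {
        "overall": mean(r["score"] for r in records),
        "per_domain": {d: mean(s) for d, s in by_domain.items()},
        "per_language": {l: mean(s) for l, s in by_language.items()},
    }

# Toy usage; real ScholarBench examples span eight research domains
# and two languages (Korean and English).
records = [
    {"domain": "physics", "language": "en", "score": 0.6},
    {"domain": "physics", "language": "ko", "score": 0.4},
    {"domain": "law", "language": "en", "score": 0.7},
]
print(average_scores(records))
```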
Authors (7)
Dongwon Noh
Donghyeok Koh
Junghun Yuk
Gyuwan Kim
Jaeyong Lee
Kyungtae Lim
+1 more
Key Contributions
Introduces ScholarBench, a novel bilingual benchmark designed to evaluate the deep expert knowledge and complex academic problem-solving abilities of LLMs. It addresses the scalability limitations of prior benchmarks by focusing on specialized, logically complex contexts derived from academic literature across eight research domains, evaluating abstraction, comprehension, and reasoning.
Business Value
Enables more accurate and comprehensive evaluation of LLMs for academic and research applications, leading to better model development for specialized domains.