📄 Abstract
Prior benchmarks for evaluating the domain-specific knowledge of large language models (LLMs) lack the scalability to handle complex academic tasks. To address this, we introduce ScholarBench, a benchmark centered on deep expert knowledge and complex academic problem-solving, which evaluates the academic reasoning ability of LLMs and is constructed through a three-step process. ScholarBench targets more specialized and logically complex contexts derived from academic literature, encompassing five distinct problem types. Unlike prior benchmarks, ScholarBench evaluates the abstraction, comprehension, and reasoning capabilities of LLMs across eight distinct research domains. To ensure high-quality evaluation data, we define category-specific example attributes and design questions aligned with the characteristic research methodologies and discourse structures of each domain. Additionally, the benchmark is an English-Korean bilingual dataset, enabling simultaneous evaluation of LLMs' linguistic capabilities in both languages. It comprises 5,031 examples in Korean and 5,309 in English, and even state-of-the-art models such as o3-mini achieve an average evaluation score of only 0.543, demonstrating the benchmark's difficulty.
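
The abstract reports a single average evaluation score (e.g., 0.543 for o3-mini). As a minimal sketch of how such an aggregate might be computed, and not the authors' actual evaluation code, the snippet below averages hypothetical per-example scores and breaks them down by domain and language; the record fields (`domain`, `language`, `score`) and the assumption that scores lie in [0, 1] are illustrative only.

```python
from collections import defaultdict
from statistics import mean

def average_scores(records):
    """Return the overall mean score plus per-domain and per-language means.

    Each record is assumed (hypothetically) to be a dict with
    'domain', 'language', and a numeric 'score' in [0, 1].
    """
    by_domain = defaultdict(list)
    by_language = defaultdict(list)
    for r in records:
        by_domain[r["domain"]].append(r["score"])
        by_language[r["language"]].append(r["score"])
    return {
        "overall": mean(r["score"] for r in records),
        "per_domain": {d: mean(s) for d, s in by_domain.items()},
        "per_language": {l: mean(s) for l, s in by_language.items()},
    }

# Toy usage; real ScholarBench examples span eight research domains
# and two languages (Korean and English).
records = [
    {"domain": "physics", "language": "en", "score": 0.6},
    {"domain": "physics", "language": "ko", "score": 0.4},
    {"domain": "law", "language": "en", "score": 0.7},
]
print(average_scores(records))
```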
Authors (7)
Dongwon Noh
Donghyeok Koh
Junghun Yuk
Gyuwan Kim
Jaeyong Lee
Kyungtae Lim
+1 more
Key Contributions
Introduces ScholarBench, a novel bilingual benchmark designed to evaluate the deep expert knowledge and complex academic problem-solving abilities of LLMs. It addresses the scalability limitations of prior benchmarks by focusing on specialized, logically complex contexts derived from academic literature across eight research domains, evaluating abstraction, comprehension, and reasoning.
Business Value
Enables more accurate and comprehensive evaluation of LLMs for academic and research applications, leading to better model development for specialized domains.