Abstract
As AI systems progress, we rely more on them to make decisions with us and for us. To ensure that such decisions are aligned with human values, it is imperative that we understand not only what decisions they make but also how they come to those decisions. Reasoning language models, which provide both final responses and (partially transparent) intermediate thinking traces, present a timely opportunity to study AI procedural reasoning. Unlike math and code problems, which often have objectively correct answers, moral dilemmas are an excellent testbed for process-focused evaluation because they allow for multiple defensible conclusions. To enable such evaluation, we present MoReBench: 1,000 moral scenarios, each paired with a set of rubric criteria that experts consider essential to include (or avoid) when reasoning about the scenarios. MoReBench contains over 23,000 criteria, including identifying moral considerations, weighing trade-offs, and giving actionable recommendations, covering both cases of AI advising humans on moral decisions and cases of AI making moral decisions autonomously. Separately, we curate MoReBench-Theory: 150 examples to test whether AI can reason under five major frameworks in normative ethics. Our results show that scaling laws and existing benchmarks on math, code, and scientific reasoning tasks fail to predict models' abilities to perform moral reasoning. Models also show partiality towards specific moral frameworks (e.g., Benthamite Act Utilitarianism and Kantian Deontology), which might be side effects of popular training paradigms. Together, these benchmarks advance process-focused reasoning evaluation towards safer and more transparent AI.
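To make the rubric-based, process-focused setup concrete, here is a minimal Python sketch of how such scoring could work: a scenario bundles expert criteria to include or avoid, and a judge function checks each criterion against a model's reasoning trace. The class names, fields, and toy keyword judge are hypothetical illustrations under assumed semantics, not MoReBench's actual schema or grading pipeline.

```python
# Minimal sketch of rubric-based, process-focused scoring in the spirit of
# MoReBench: each scenario carries expert-written criteria that a model's
# reasoning trace should include (or avoid). All names below are illustrative
# assumptions, not the paper's actual data schema or grading pipeline.
from dataclasses import dataclass
from typing import Callable

@dataclass
class Criterion:
    description: str      # e.g. "identifies the conflicting obligations"
    should_include: bool  # True = trace should satisfy it; False = should avoid it

@dataclass
class MoralScenario:
    prompt: str
    criteria: list[Criterion]

def rubric_score(trace: str, scenario: MoralScenario,
                 judge: Callable[[str, Criterion], bool]) -> float:
    """Fraction of criteria handled correctly by a reasoning trace.

    `judge(trace, criterion)` stands in for the grader (in practice an LLM
    or human annotator) and returns True if the trace meets the criterion.
    """
    correct = 0
    for c in scenario.criteria:
        met = judge(trace, c)
        # "Include" criteria count when met; "avoid" criteria count when not met.
        correct += int(met == c.should_include)
    return correct / len(scenario.criteria)

def keyword_judge(trace: str, criterion: Criterion) -> bool:
    # Toy stand-in: substring match. A real grader needs semantic judgment.
    return criterion.description.lower() in trace.lower()

if __name__ == "__main__":
    scenario = MoralScenario(
        prompt="A friend asks you to lie to protect them. What should they say?",
        criteria=[
            Criterion("weighs honesty against loyalty", should_include=True),
            Criterion("dismisses one party's interests outright", should_include=False),
        ],
    )
    trace = "This weighs honesty against loyalty: deception harms trust, ..."
    print(rubric_score(trace, scenario, keyword_judge))  # -> 1.0
```

The key design point the sketch illustrates is that the score is computed over the reasoning trace rather than the final answer, which is what distinguishes process-focused evaluation from outcome-only accuracy metrics.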
Authors (18)
Yu Ying Chiu
Michael S. Lee
Rachel Calcott
Brandon Handoko
Paul de Font-Reaulx
Paula Rodriguez
+12 more
Submitted
October 18, 2025
Key Contributions
Introduces MoReBench, a novel benchmark comprising 1,000 moral scenarios with detailed rubric criteria, designed to evaluate procedural and pluralistic moral reasoning in language models beyond just their final outcomes. This benchmark enables a deeper understanding of how LLMs arrive at decisions in complex ethical dilemmas, moving beyond simple accuracy metrics.
Business Value
Crucial for developing trustworthy AI systems that can be deployed in sensitive applications requiring ethical decision-making. It provides a standardized way to assess and improve the ethical reasoning capabilities of AI, fostering user trust and responsible AI deployment.