Abstract
Logical reasoning with large language models (LLMs) has received growing
attention. One mainstream approach translates natural language into formal
logic and then applies symbolic solvers for deduction. While effective in many
tasks, these LLM-based translators often fail to generate consistent symbolic
representations when the same concept appears in different linguistic forms.
Such inconsistencies break logical coherence and lead to solver errors.
However, most existing benchmarks lack this type of linguistic variation, which
frequently occurs in real-world text, leaving the problem underexplored. To
address this gap, we present SoLT, a benchmark that systematically rewrites
reasoning datasets into diverse yet logically equivalent forms across multiple
levels. Beyond evaluation, SoLT also provides a general method to enrich any
dataset with linguistic diversity while preserving both meaning and logic. To
further enhance the stability of LLM-based reasoning, we propose MenTaL, which
explicitly guides models to build a concept-symbol mapping table during
translation. By linking equivalent expressions to shared symbols, MenTaL
maintains consistency and mitigates symbol drift. Experiments on SoLT
demonstrate that LLMs indeed suffer from inconsistent symbol mapping under
linguistic variation, leading to significant drops in reasoning accuracy.
In contrast, applying MenTaL brings clear and stable performance improvements
across diverse inputs. Overall, our findings reveal that overlooking linguistic
diversity hides key weaknesses in LLM-based translators, and our work offers a
step toward more reliable logical reasoning in varied real-world scenarios. Our
code is available at https://github.com/wufeiwuwoshihua/LinguDiver.
Authors (7)
Qingchuan Li
Jiatong Li
Zirui Liu
Mingyue Cheng
Yuting Zeng
Qi Liu
+1 more
Key Contributions
Addresses the instability of LLMs in translating natural language to formal logic under linguistic variation. Introduces SoLT, a benchmark that systematically enriches datasets with diverse yet logically equivalent forms, and proposes MenTaL to enhance LLM reasoning stability, aiming to improve the reliability of LLM-based logical deduction.
Business Value
Enhances the reliability of AI systems performing logical reasoning, crucial for applications in legal tech, formal verification, and complex decision support systems.