Abstract
Chain-of-thought (CoT) traces promise transparency for reasoning language
models, but prior work shows they are not always faithful reflections of
internal computation. This raises challenges for oversight: practitioners may
misinterpret decorative reasoning as genuine. We introduce Concept Walk, a
general framework for tracing how a model's internal stance evolves with
respect to a concept direction during reasoning. Unlike surface text, Concept
Walk operates in activation space, projecting each reasoning step onto the
concept direction learned from contrastive data. This allows us to observe
whether reasoning traces shape outcomes or are discarded. As a case study, we
apply Concept Walk to the domain of Safety using Qwen3-4B. We find that in
'easy' cases, perturbed CoTs are quickly ignored, indicating decorative
reasoning, whereas in 'hard' cases, perturbations induce sustained shifts in
internal activations, consistent with faithful reasoning. The contribution is
methodological: Concept Walk provides a lens to re-examine faithfulness through
concept-specific internal dynamics, helping identify when reasoning traces can
be trusted and when they risk misleading practitioners.
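Mechanically, the abstract describes two steps: learning a concept direction from contrastive data and projecting each reasoning step's activation onto it. Below is a minimal sketch of that idea, assuming a difference-of-means estimator and per-step hidden states taken from a fixed layer; the function names and the PyTorch framing are illustrative, not the authors' released code.

```python
import torch

def concept_direction(pos_acts: torch.Tensor, neg_acts: torch.Tensor) -> torch.Tensor:
    """Estimate a concept direction from contrastive activations.

    pos_acts / neg_acts: [n_examples, d_model] hidden states collected from
    prompts that do / do not express the concept (e.g. safety). A
    difference-of-means direction is one common choice; the paper may use a
    different estimator.
    """
    direction = pos_acts.mean(dim=0) - neg_acts.mean(dim=0)
    return direction / direction.norm()

def concept_walk(step_acts: torch.Tensor, direction: torch.Tensor) -> torch.Tensor:
    """Project each reasoning step's activation onto the concept direction.

    step_acts: [n_steps, d_model] activations at a fixed layer, one per
    chain-of-thought step. Returns a 1-D trajectory of scalar projections:
    the model's internal 'stance' over the reasoning trace.
    """
    return step_acts @ direction
```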
Authors (5)
Jiazheng Li
Andreas Damianou
J Rosser
José Luis Redondo García
Konstantina Palla
Submitted
October 25, 2025
Key Contributions
This paper introduces Concept Walk, a general framework for tracing how a language model's internal stance toward a concept evolves during reasoning. Because it operates in activation space rather than on surface text, it can distinguish genuine computational shifts from decorative reasoning traces, which is particularly useful for oversight and AI safety.
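The decorative/faithful distinction comes down to whether a perturbation to the CoT produces a sustained shift in the projection trajectory. A crude heuristic for that check, building on the sketch above, might look as follows; the `window` and `threshold` parameters are hypothetical, and the paper's actual criterion is not specified here.

```python
import torch

def sustained_shift(base_traj: torch.Tensor,
                    perturbed_traj: torch.Tensor,
                    window: int = 5,
                    threshold: float = 0.1) -> bool:
    """Check whether a CoT perturbation leaves a lasting mark.

    base_traj / perturbed_traj: 1-D projection trajectories (see sketch
    above) for the original and perturbed reasoning traces, aligned by step.
    If the mean absolute gap over the final `window` steps stays above
    `threshold`, the perturbation persisted (consistent with faithful
    reasoning); otherwise it was quickly ignored (decorative reasoning).
    """
    n = min(len(base_traj), len(perturbed_traj))
    gap = (perturbed_traj[:n] - base_traj[:n]).abs()
    return gap[-window:].mean().item() > threshold
```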
Business Value
Enhances trust and safety in LLMs by providing tools to verify their reasoning processes, which is crucial for high-stakes applications and regulatory compliance.