📄 Abstract
Transformer-based language models excel at both recall (retrieving memorized facts) and reasoning (performing multi-step inference), but whether these abilities rely on distinct internal mechanisms remains unclear. Distinguishing recall from reasoning is crucial for predicting model generalization, designing targeted evaluations, and building safer interventions that affect one ability without disrupting the other. We approach this question through mechanistic interpretability, using controlled datasets of synthetic linguistic puzzles to probe transformer models at the layer, head, and neuron level. Our pipeline combines activation patching and structured ablations to causally measure component contributions to each task type. Across two model families (Qwen and LLaMA), we find that interventions on distinct layers and attention heads lead to selective impairments: disabling identified "recall circuits" reduces fact-retrieval accuracy by up to 15% while leaving reasoning intact, whereas disabling "reasoning circuits" reduces multi-step inference by a comparable margin. At the neuron level, we observe task-specific firing patterns, though these effects are less robust, consistent with neuronal polysemanticity. Our results provide the first causal evidence that recall and reasoning rely on separable but interacting circuits in transformer models. These findings advance mechanistic interpretability by linking circuit-level structure to functional specialization and demonstrate how controlled datasets and causal interventions can yield mechanistic insights into model cognition, informing safer deployment of large language models.
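As a concrete illustration of the kind of causal intervention the abstract describes, the sketch below shows a minimal activation-patching setup in PyTorch. It is not the paper's code: the module name "model.layers.10" and the hook-based caching are assumptions about a generic LLaMA/Qwen-style causal language model whose submodules can be addressed by name.

```python
# Hypothetical minimal sketch of activation patching (not the authors' pipeline).
# Assumes a PyTorch model whose blocks are addressable via named_modules();
# "model.layers.10" is a placeholder for a Qwen/LLaMA decoder layer.
import torch

def get_activation(model, inputs, module_name):
    """Run the model on `inputs` and cache one module's output."""
    cache = {}

    def hook(_module, _inp, out):
        # Decoder blocks often return tuples (hidden_states, ...); keep the tensor.
        cache["act"] = out[0] if isinstance(out, tuple) else out

    handle = dict(model.named_modules())[module_name].register_forward_hook(hook)
    with torch.no_grad():
        model(**inputs)
    handle.remove()
    return cache["act"]

def run_with_patch(model, inputs, module_name, patched_act):
    """Re-run the model, overwriting one module's output with `patched_act`.

    `patched_act` must match the shape produced for `inputs` (i.e. the clean
    and corrupted prompts should have the same sequence length).
    """
    def hook(_module, _inp, out):
        if isinstance(out, tuple):
            return (patched_act,) + out[1:]
        return patched_act

    handle = dict(model.named_modules())[module_name].register_forward_hook(hook)
    with torch.no_grad():
        out = model(**inputs)
    handle.remove()
    return out

# Usage sketch: cache an activation from a "clean" recall prompt and patch it
# into the forward pass on a "corrupted" prompt; the resulting change in the
# answer logit estimates that layer's causal contribution to fact retrieval.
# clean_act = get_activation(model, clean_inputs, "model.layers.10")
# patched_out = run_with_patch(model, corrupt_inputs, "model.layers.10", clean_act)
```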
Key Contributions
Develops a mechanistic interpretability pipeline that uses activation patching and structured ablations to disentangle recall and reasoning in transformer models, identifying distinct 'recall circuits' and 'reasoning circuits' at the layer and head level across Qwen and LLaMA models (see the ablation sketch below).
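To complement the patching sketch above, here is a hedged sketch of the structured-ablation side: zeroing a chosen set of attention heads at one layer via a forward pre-hook. The module name "model.layers.12.self_attn.o_proj", the head indices, and head_dim=128 are illustrative assumptions for a LLaMA/Qwen-style model, not the circuits identified in the paper.

```python
# Hypothetical sketch of a structured head ablation (not the authors' code).
# Assumes a LLaMA/Qwen-style block where per-head outputs are concatenated
# before the attention output projection (o_proj).
import torch

def ablate_heads(model, inputs, o_proj_name, head_indices, head_dim):
    """Zero the listed heads' contribution at one layer and return model output."""
    def pre_hook(_module, args):
        hidden = args[0].clone()  # [batch, seq, n_heads * head_dim]
        for h in head_indices:
            hidden[..., h * head_dim:(h + 1) * head_dim] = 0.0
        return (hidden,) + args[1:]

    module = dict(model.named_modules())[o_proj_name]
    handle = module.register_forward_pre_hook(pre_hook)
    with torch.no_grad():
        out = model(**inputs)
    handle.remove()
    return out

# Usage sketch: compare recall accuracy with and without a candidate
# "recall circuit" (a small set of heads at one layer) disabled.
# For a Hugging Face causal LM, the answer logits would be out.logits.
# out = ablate_heads(model, inputs, "model.layers.12.self_attn.o_proj",
#                    head_indices=[3, 7], head_dim=128)
```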
Business Value
Enables more precise control over LLM behavior, improving reliability and safety and supporting targeted fine-tuning for specific applications.