📄 Abstract
Reinforcement learning with verifiable rewards (RLVR) can elicit strong
reasoning in large language models (LLMs), while their performance after RLVR
varies dramatically across different base models. This raises a fundamental
question: what microscopic property of pre-trained models leads to this
variation? To investigate, we formalize reasoning as chains of Horn clauses
("if-then" rules) built from features extracted from the LLM's latent space via
cross-layer sparse autoencoders (SAEs). We estimate the transition
probabilities between these features, and further categorize each rule by its
semantic soundness level (e.g., strict, plausible, noisy) with an LLM. Our key
discovery is that high-potential models are inherently soundness-aware: their
internal probability distributions systematically shift across rules' soundness
levels, becoming highly distinct for "strict" versus "noisy" rules. In
contrast, weaker models are soundness-agnostic, collapsing to one distribution
regardless of soundness levels. To quantify this, we introduce the
Soundness-Aware Level (SAL), a microscopic metric using the Jensen-Shannon
Divergence to measure the separation between these distributions. We show that
SAL's predictions of post-RLVR reasoning performance follow a precise empirical
law (R^2=0.87) across diverse model families (Qwen, Mistral, Llama, DeepSeek)
and scales (0.5B-14B). This reveals that a model's reasoning potential is tied
to its intrinsic, pre-trained ability to distinguish sound from unsound
knowledge. These findings underscore the critical role of model pre-training
in shaping reasoning and offer a practical metric grounded in the model's
internal mechanisms for selecting or designing stronger base models.
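
The SAL metric uses the Jensen-Shannon Divergence to measure how far apart a model's rule-probability distributions are across soundness levels. As a rough, self-contained illustration of that idea (not the authors' implementation), the sketch below compares hypothetical transition probabilities assigned to "strict" versus "noisy" rules; the function name, the histogram binning scheme, and the toy Beta-distributed inputs are assumptions for demonstration only.

```python
# Minimal sketch, assuming per-rule transition probabilities have already been
# estimated and each rule has been labeled "strict" or "noisy" by an LLM judge.
# This is an illustration of the separation idea behind SAL, not the paper's code.
import numpy as np
from scipy.spatial.distance import jensenshannon

def soundness_separation(strict_probs, noisy_probs, bins=20):
    """Histogram each group's transition probabilities and return the
    Jensen-Shannon divergence between the two empirical distributions."""
    edges = np.linspace(0.0, 1.0, bins + 1)
    p, _ = np.histogram(strict_probs, bins=edges)
    q, _ = np.histogram(noisy_probs, bins=edges)
    # Normalize to probability mass; add a small constant to avoid empty bins.
    p = (p + 1e-9) / (p + 1e-9).sum()
    q = (q + 1e-9) / (q + 1e-9).sum()
    # scipy returns the JS *distance* (the square root); square it for the divergence.
    return jensenshannon(p, q, base=2) ** 2

# Toy usage: a "soundness-aware" model assigns systematically higher probabilities
# to strict rules than to noisy ones, so the two histograms are well separated.
strict = np.random.beta(8, 2, size=1000)  # probabilities skewed toward 1
noisy = np.random.beta(2, 8, size=1000)   # probabilities skewed toward 0
print(soundness_separation(strict, noisy))
```

Under this toy setup, a soundness-aware model yields a large divergence because strict and noisy rules receive systematically different probabilities, whereas a soundness-agnostic model's two histograms overlap and the divergence approaches zero.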
Authors (4)
Xuansheng Wu
Xiaoman Pan
Wenlin Yao
Jianshu Chen
Submitted
October 17, 2025
Key Contributions
This paper identifies 'soundness-awareness' as a microscopic signature of pre-trained LLMs that predicts their reasoning performance after RLVR training. It formalizes reasoning as chains of Horn clauses built from latent-space features (extracted via cross-layer SAEs) and shows that high-potential models exhibit distinct probability distributions across soundness levels, whereas weaker models collapse to a single distribution.
Business Value
Enables better selection and fine-tuning of LLMs for tasks requiring complex reasoning, leading to more reliable and capable AI systems.