Abstract: Large language models perform near-Bayesian inference yet violate permutation
invariance on exchangeable data. We resolve this by showing that transformers
minimize the expected conditional description length (cross-entropy) over
orderings, $\mathbb{E}_\pi[\ell(Y \mid \Gamma_\pi(X))]$, which admits a
Kolmogorov-complexity interpretation up to additive constants, rather than the
permutation-invariant description length $\ell(Y \mid X)$. This makes them
Bayesian in expectation, not in realization. We derive (i) a Quantified
Martingale Violation bound showing order-induced deviations scale as $O(\log
n)$ with explicit constants; (ii) the Expectation-level Decompression Law linking
information budgets to reliability for Bernoulli predicates; and (iii)
deployable planners (B2T/RoH/ISR) for answer/abstain decisions. Empirically,
permutation dispersion follows $a+b\ln n$ (Qwen2-7B $b \approx 0.377$,
Llama-3.1-8B $b \approx 0.147$); permutation mixtures improve ground-truth
likelihood/accuracy; and a randomized dose-response experiment shows the
hallucination rate drops by $\sim 0.13$ per additional nat. A pre-specified
audit with a fixed ISR = 1.0 threshold achieves near-0\% hallucinations via
calibrated refusal at 24\% abstention. The
framework turns hallucinations into predictable compression failures and
enables principled information budgeting.
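
To make the ordering-averaged objective $\mathbb{E}_\pi[\ell(Y \mid \Gamma_\pi(X))]$ and the reported $a+b\ln n$ dispersion concrete, here is a minimal Python sketch. The scorer `nll(y, ordered_context)` is a hypothetical stand-in for a model's negative log-likelihood in nats, and the Monte Carlo averaging over random orderings is an assumed estimation procedure, not the paper's exact protocol.

```python
import math
import random

def permutation_averaged_nll(nll, y, x_items, n_orders=16, seed=0):
    """Monte Carlo estimate of E_pi[ l(Y | Gamma_pi(X)) ]: average the
    conditional description length (negative log-likelihood, in nats) of y
    over random orderings of the exchangeable context items x_items.

    `nll(y, ordered_context)` is a hypothetical scorer, not part of the
    paper; it should return the model's NLL of y for that ordering.
    Returns (mean NLL, order-induced dispersion); the abstract reports the
    dispersion growing roughly as a + b*ln(n) in the context size n.
    """
    rng = random.Random(seed)
    scores = []
    for _ in range(n_orders):
        order = list(x_items)
        rng.shuffle(order)               # draw a random permutation pi
        scores.append(nll(y, order))     # l(Y | Gamma_pi(X)) for this pi
    mean = sum(scores) / len(scores)
    var = sum((s - mean) ** 2 for s in scores) / max(len(scores) - 1, 1)
    return mean, math.sqrt(var)
```

Fitting the returned dispersion against $\ln n$ across context sizes would recover a slope comparable to the $b$ values quoted above.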
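Similarly, the answer/abstain decision at a fixed ISR = 1.0 can be sketched as a ratio test between an available and a required information budget. The required-budget formula below ($-\ln$ of the target error) is an illustrative placeholder, not the paper's B2T/RoH/ISR planners or its Expectation-level Decompression Law.

```python
import math

def isr_decision(available_nats, target_error, isr_threshold=1.0):
    """Answer/abstain rule in the spirit of an ISR-style planner: compare
    the information budget available for a Bernoulli predicate with the
    budget required for a target error rate, and abstain when the ratio
    falls below the threshold (the audit in the abstract fixes it at 1.0).

    The required budget -ln(target_error) is an assumed placeholder bound,
    not the paper's actual Decompression Law.
    """
    required_nats = -math.log(target_error)
    isr = available_nats / required_nats
    return ("answer" if isr >= isr_threshold else "abstain"), isr

# Example: 2.5 nats available, 5% target error -> ~3.0 nats required -> abstain.
print(isr_decision(2.5, 0.05))
```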