
CarBoN: Calibrated Best-of-N Sampling Improves Test-time Reasoning

📄 Abstract

Allocating more computation during inference time (test-time scaling) improves language model performance, especially for reasoning tasks. However, popular methods like Best-of-$N$ sampling often show diminishing returns as $N$ increases. To address this inefficiency, we introduce a general test-time calibration framework that adaptively modifies the model toward high-reward reasoning paths, with theoretical guarantees of improving the lower bound of expected reward under finite sampling, all without large language model (LLM) retraining. Within this framework, we propose CarBoN (Calibrated Best-of-$N$), a two-phase method that first explores the solution space and then learns a calibration of the logits via an input-specific temperature $T$ and additive shift vector $\delta$, guiding generation toward more reliable reasoning. Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency, with up to $4\times$ fewer rollouts to reach the same accuracy, while often achieving higher accuracy under fixed budgets. We also analyze the complementary roles of $T$ and $\delta$ in balancing output diversity and correctness, and demonstrate that the framework also generalizes to step-level sampling strategies such as beam search. For more information, please refer to our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.
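
At the core of the method, generation is steered by rescaling and shifting the model's next-token logits rather than by retraining. The sketch below illustrates only that calibration step; the function name, the toy vocabulary, and the PyTorch framing are assumptions for illustration, and how CarBoN actually learns $T$ and $\delta$ during its exploration phase is described in the paper.

```python
import torch
import torch.nn.functional as F

def calibrated_next_token_probs(logits: torch.Tensor,
                                temperature: float,
                                delta: torch.Tensor) -> torch.Tensor:
    """Apply an input-specific temperature T and additive shift delta
    to raw next-token logits before sampling.

    logits: (vocab_size,) raw logits from the frozen LLM
    temperature: scalar T > 0 learned for this input
    delta: (vocab_size,) additive shift vector learned for this input
    """
    # The base model stays frozen; only the sampling distribution changes.
    return F.softmax((logits + delta) / temperature, dim=-1)

# Toy usage with a 5-token vocabulary (values are made up for illustration).
logits = torch.tensor([2.0, 1.0, 0.5, -1.0, 0.0])
delta = torch.zeros(5)
delta[0] = 0.3  # nudge probability mass toward a presumed higher-reward token
probs = calibrated_next_token_probs(logits, temperature=0.8, delta=delta)
next_token = torch.multinomial(probs, num_samples=1)
```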
Authors (3)
Yung-Chen Tang
Pin-Yu Chen
Andrea Cavallaro
Submitted
October 17, 2025
arXiv Category
cs.LG

Key Contributions

Introduces CarBoN, a general test-time calibration framework that adaptively calibrates an LLM's output logits to improve reasoning performance without retraining. It learns an input-specific temperature and additive shift vector that guide generation toward higher-reward reasoning paths, achieving better accuracy with fewer samples.
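
To make the two-phase structure concrete, here is a minimal, self-contained sketch of explore-then-calibrate Best-of-N. Everything in it is a toy stand-in under stated assumptions: the sampler and reward function are dummies, and the scalar "calibration" only mirrors the shape of CarBoN's learned temperature and shift vector, not the paper's actual fitting procedure.

```python
import random

# Toy stand-ins so the two-phase structure runs end to end. A real setup
# would use an LLM sampler over logits and a learned reward model; all
# names and signatures here are illustrative assumptions.
def sample_completion(prompt: str, temperature: float = 1.0,
                      shift: float = 0.0) -> float:
    # Stand-in "completion": a number drawn around `shift`.
    return random.gauss(shift, temperature)

def reward(prompt: str, completion: float) -> float:
    # Stand-in reward: completions closer to the true answer score higher.
    return -abs(completion - 4.0)

def calibrated_best_of_n(prompt: str, n_explore: int = 8,
                         n_exploit: int = 8) -> float:
    # Phase 1: explore with the uncalibrated sampler and score rollouts.
    explored = [sample_completion(prompt) for _ in range(n_explore)]
    scored = [(reward(prompt, c), c) for c in explored]

    # Derive a crude input-specific calibration from the scored rollouts.
    # CarBoN learns a temperature T and a shift vector delta over logits;
    # this scalar "shift toward the best rollout" is only a placeholder.
    _, best_completion = max(scored)
    temperature, shift = 0.5, best_completion

    # Phase 2: sample with the calibration and return the best by reward.
    exploited = [sample_completion(prompt, temperature, shift)
                 for _ in range(n_exploit)]
    scored += [(reward(prompt, c), c) for c in exploited]
    return max(scored)[1]

print(calibrated_best_of_n("Solve: x + 1 = 5"))
```

In the actual method, the exploration-phase rollouts are scored by a reward signal and used to fit $T$ and $\delta$ over the logits, after which the calibrated second phase tends to reach a given accuracy with fewer rollouts.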

Business Value

Enables more cost-effective deployment of LLMs for complex reasoning tasks, such as automated problem-solving or advanced tutoring systems, by reducing the inference computation (rollouts) needed to reach a target accuracy.