📄 Abstract
Allocating more computation during inference time (test-time scaling)
improves language model performance, especially for reasoning tasks. However,
popular methods like Best-of-$N$ sampling often show diminishing returns as $N$
increases. To address this inefficiency, we introduce a general test-time
calibration framework that adaptively modifies the model toward high-reward
reasoning paths, with theoretical guarantees of improving the lower bound of
expected reward under finite sampling, all without large language model (LLM)
retraining. Within this framework, we propose CarBoN (Calibrated Best-of-$N$),
a two-phase method that first explores the solution space and then learns a
calibration of the logits via an input-specific temperature $T$ and additive
shift vector $\delta$, guiding generation toward more reliable reasoning.
Experiments on MATH-500 and AIME-2024 show that CarBoN improves efficiency,
with up to $4\times$ fewer rollouts to reach the same accuracy, while often
achieving higher accuracy under fixed budgets. We also analyze the
complementary roles of $T$ and $\delta$ in balancing output diversity and
correctness, and demonstrate that the framework also generalizes to step-level
sampling strategies such as beam search. For more information, please refer to
our project page at huggingface.co/spaces/TrustSafeAI/Test-Time-Calibration.
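To make the calibration idea concrete, below is a minimal, purely illustrative sketch (not the authors' implementation). It assumes a toy setup in which the "model" is just a list of per-step logit vectors and `reward_fn` is a hypothetical scorer standing in for a reward model; it only shows the abstract's core operation of rescaling logits with an input-specific temperature $T$ and additive shift $\delta$ before Best-of-$N$ selection.

```python
# Illustrative sketch only: NOT the CarBoN implementation from the paper.
# It demonstrates Best-of-N sampling from a calibrated distribution where
# each step's logits are transformed as (logits + delta) / T.
# `step_logits_seq` (a list of per-step logit vectors) and `reward_fn`
# are hypothetical stand-ins for a real LLM and reward model.
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

def sample_response(step_logits_seq, T=1.0, delta=None, rng=None):
    """Sample one token sequence from calibrated logits (logits + delta) / T."""
    rng = rng or np.random.default_rng()
    tokens = []
    for logits in step_logits_seq:
        z = logits if delta is None else logits + delta
        probs = softmax(z / T)
        tokens.append(int(rng.choice(len(probs), p=probs)))
    return tokens

def calibrated_best_of_n(step_logits_seq, reward_fn, n=8, T=1.0, delta=None, seed=0):
    """Draw n candidates under the calibrated distribution and keep the highest-reward one."""
    rng = np.random.default_rng(seed)
    candidates = [sample_response(step_logits_seq, T, delta, rng) for _ in range(n)]
    rewards = [reward_fn(c) for c in candidates]
    best = int(np.argmax(rewards))
    return candidates[best], rewards[best]
```

In the paper's two-phase setup, an initial exploration phase with the uncalibrated model is used to learn input-specific values of $T$ and $\delta$ before the calibrated sampling phase; that fitting step is deliberately omitted from the sketch above.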
Authors (3)
Yung-Chen Tang
Pin-Yu Chen
Andrea Cavallaro
Submitted
October 17, 2025
Key Contributions
Introduces CarBoN, a general test-time calibration framework that adaptively modifies LLMs to improve reasoning performance without retraining. It uses input-specific temperature and additive shifts to guide generation towards higher-reward paths, achieving better accuracy with fewer samples.
Business Value
Enables more cost-effective deployment of LLMs for complex reasoning tasks, such as automated problem-solving or advanced tutoring systems, by reducing inference computation.