
TPO: Aligning Large Language Models with Multi-branch & Multi-step Preference Trees

Abstract

In the domain of complex reasoning tasks, such as mathematical reasoning, recent work has proposed using Direct Preference Optimization (DPO) to suppress dispreferred responses, thereby enhancing the long-chain reasoning capabilities of large language models (LLMs). To this end, these studies employ LLMs to generate preference trees via Tree-of-thoughts (ToT) and sample the paired preference responses required by the DPO algorithm. However, DPO, being based on binary preference optimization, cannot learn from the multiple responses with varying degrees of preference or dispreference that the preference trees provide, resulting in incomplete preference learning. In this work, we introduce Tree Preference Optimization (TPO), which does not sample paired preference responses from the preference tree; instead, it learns directly from the entire preference tree during fine-tuning. Specifically, TPO formulates language model alignment as a Preference List Ranking problem, where the policy can potentially learn more effectively from a ranked preference list of responses given the prompt. In addition, to help LLMs identify discriminative steps within long-chain reasoning and to increase the relative reward margin in the preference list, TPO uses an Adaptive Step Reward to adjust the reward value of each step in a trajectory, enabling fine-grained preference optimization. We carry out extensive experiments on mathematical reasoning tasks to evaluate TPO. The results show that TPO consistently outperforms DPO across five public large language models on four datasets. Our code is publicly available at https://github.com/MrBlankness/TPO.git.
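To make the listwise idea concrete, below is a minimal sketch of one way a Preference List Ranking objective over DPO-style implicit rewards could be written. This is an illustrative assumption, not the paper's exact loss: the function name list_ranking_loss, the Plackett-Luce ranking form, and the beta scaling are all placeholders chosen for the sketch.

```python
# Hypothetical sketch of a listwise preference-ranking loss in the spirit of TPO.
# The paper's actual objective may differ; names and the Plackett-Luce form here
# are assumptions made for illustration only.
import torch
import torch.nn.functional as F

def list_ranking_loss(policy_logps: torch.Tensor,
                      ref_logps: torch.Tensor,
                      beta: float = 0.1) -> torch.Tensor:
    """Plackett-Luce style loss over a list of responses ranked best-to-worst.

    policy_logps, ref_logps: shape (K,), sequence log-probs of K responses
    under the policy and a frozen reference model, ordered from most preferred
    (index 0) to least preferred (index K-1).
    """
    # DPO-style implicit rewards: scaled log-ratio of policy to reference.
    rewards = beta * (policy_logps - ref_logps)          # shape (K,)
    loss = rewards.new_zeros(())
    # At each rank position k, the k-th response should out-score every
    # response ranked below it (Plackett-Luce likelihood of the ranking).
    for k in range(rewards.shape[0] - 1):
        loss = loss - F.log_softmax(rewards[k:], dim=0)[0]
    return loss

# Example: a preference list of 4 responses drawn from a ToT preference tree.
policy_logps = torch.tensor([-12.3, -14.1, -15.8, -19.0])
ref_logps = torch.tensor([-13.0, -13.9, -15.5, -18.2])
print(list_ranking_loss(policy_logps, ref_logps))
```

In this sketch, a single gradient step uses the whole ranked list rather than one chosen/rejected pair, which is the contrast with binary DPO that the abstract describes.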
Authors (3)
Weibin Liao
Xu Chu
Yasha Wang
Submitted
October 10, 2024
arXiv Category
cs.CL

Key Contributions

Introduces Tree Preference Optimization (TPO), a method for aligning LLMs that learns directly from entire preference trees generated via Tree-of-thoughts (ToT), overcoming a limitation of Direct Preference Optimization (DPO), which samples only binary preference pairs. TPO enables learning from multiple responses with varying degrees of preference, enhancing long-chain reasoning; a sketch of the step-level reward adjustment follows below.
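The abstract also mentions an Adaptive Step Reward that adjusts per-step rewards so that discriminative steps contribute a larger margin. Below is a minimal sketch of one way such a reweighting could look; the similarity-based weighting, the function name adaptive_step_rewards, and its inputs are assumptions for illustration, not the paper's exact formulation.

```python
# Hypothetical sketch of an "adaptive step reward" adjustment: steps that
# diverge between two branches of the preference tree are up-weighted, while
# shared (non-discriminative) steps are down-weighted. Illustrative only.
import torch

def adaptive_step_rewards(step_logps: torch.Tensor,
                          step_sims: torch.Tensor,
                          beta: float = 0.1) -> torch.Tensor:
    """Aggregate per-step rewards with similarity-based weights.

    step_logps: (T,) per-step log-prob ratios (policy minus reference) for one trajectory.
    step_sims:  (T,) similarity in [0, 1] of each step to its counterpart in the
                contrasted trajectory; near-identical steps get weight near 0.
    """
    weights = 1.0 - step_sims            # discriminative steps get weight near 1
    return beta * (weights * step_logps).sum()

# Example: the later steps diverge between branches, so they dominate the reward.
step_logps = torch.tensor([0.2, 0.1, -0.4, -0.9])
step_sims = torch.tensor([0.95, 0.90, 0.30, 0.05])
print(adaptive_step_rewards(step_logps, step_sims))
```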

Business Value

Leads to more capable and reliable LLMs for complex tasks like mathematical problem solving, code generation, and advanced dialogue systems, improving user experience and enabling new applications.