Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: How to design reinforcement learning (RL) tasks that effectively unleash the
reasoning capability of large language models (LLMs) remains an open question.
Existing RL tasks (e.g., math, programming, and constructing reasoning tasks)
suffer from three key limitations: (1) Scalability. They rely heavily on human
annotation or expensive LLM synthesis to generate sufficient training data. (2)
Verifiability. LLMs' outputs are hard to verify automatically and reliably. (3)
Controllable Difficulty. Most tasks lack fine-grained difficulty control,
making it hard to train LLMs to develop reasoning ability from easy to hard.
To address these limitations, we propose Saturn, a SAT-based RL framework
that uses Boolean Satisfiability (SAT) problems to train and evaluate LLMs
reasoning. Saturn enables scalable task construction, rule-based verification,
and precise difficulty control. Saturn designs a curriculum learning pipeline
that continuously improves LLMs' reasoning capability by constructing SAT tasks
of increasing difficulty and training LLMs from easy to hard. To ensure stable
training, we design a principled mechanism to control difficulty transitions.
We introduce Saturn-2.6k, a dataset of 2,660 SAT problems with varying
difficulty. It supports the evaluation of how LLM reasoning changes with
problem difficulty. We apply Saturn to DeepSeek-R1-Distill-Qwen and obtain
Saturn-1.5B and Saturn-7B. We achieve several notable results: (1) On SAT
problems, Saturn-1.5B and Saturn-7B achieve average pass@3 improvements of
+14.0 and +28.1, respectively. (2) On math and programming tasks, Saturn-1.5B
and Saturn-7B improve average scores by +4.9 and +1.8 on benchmarks (e.g.,
AIME, LiveCodeBench). (3) Compared to the state-of-the-art (SOTA) approach in
constructing RL tasks, Saturn achieves further improvements of +8.8%. We
release the source code, data, and models to support future research.
Authors (6)
Huanyu Liu
Jia Li
Hao Zhu
Kechi Zhang
Yihong Dong
Ge Li
Key Contributions
Proposes SATURN, a SAT-based RL framework to address limitations in training LLMs for reasoning. SATURN enables scalable task construction, rule-based verification, and precise difficulty control, which are crucial for developing robust reasoning abilities in LLMs.
Business Value
Improves the efficiency and reliability of training LLMs for complex reasoning tasks, potentially leading to more capable AI assistants and tools for education and research.