Abstract
Applications of machine learning in chemistry are often limited by the
scarcity and expense of labeled data, restricting traditional supervised
methods. In this work, we introduce a framework for molecular reasoning using
general-purpose Large Language Models (LLMs) that operates without requiring
labeled training data. Our method anchors chain-of-thought reasoning to the
molecular structure by using unique atomic identifiers. First, the LLM performs
a one-shot task to identify relevant fragments and their associated chemical
labels or transformation classes. In an optional second step, this
position-aware information is used in a few-shot task with provided class
examples to predict the chemical transformation. We apply our framework to
single-step retrosynthesis, a task where LLMs have previously underperformed.
Across academic benchmarks and expert-validated drug discovery molecules, our
work enables LLMs to achieve high success rates in identifying chemically
plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and
final reactants ($\geq74\%$). Beyond solving complex chemical tasks, our work
also provides a method to generate theoretically grounded synthetic datasets by
mapping chemical knowledge onto the molecular structure, thereby addressing
data scarcity.
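The idea of anchoring reasoning to unique atomic identifiers can be illustrated with a small sketch. The abstract does not specify the identifier format, so everything below is an assumption: the function name `tag_atoms` is hypothetical, the identifiers are written in the style of SMILES atom-map numbers, and the input is assumed to be a simple SMILES string without existing atom maps. The tagged string is meant only as a text label that a prompt can refer to (for example, "the bond between atoms 2 and 4"). Wrapping atoms in brackets drops implicit hydrogens under strict SMILES semantics, so the output should not be re-parsed as a molecule.

```python
import re

# Assumption: atoms are bracket atoms, two-letter halogens, or single-letter
# organic-subset atoms (aromatic lowercase included). Bonds, branches and
# ring-closure digits are left untouched.
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def tag_atoms(smiles):
    """Return (tagged_string, atoms), giving every atom a 1-based identifier.

    Hypothetical helper for illustration; not the paper's implementation.
    """
    atoms = []

    def label(match):
        token = match.group(0)
        # Strip existing brackets so the symbol can be re-wrapped with an id.
        symbol = token[1:-1] if token.startswith("[") else token
        index = len(atoms) + 1
        atoms.append((index, symbol))
        return f"[{symbol}:{index}]"

    tagged = ATOM_PATTERN.sub(label, smiles)
    return tagged, atoms


if __name__ == "__main__":
    tagged, atoms = tag_atoms("CC(=O)O")  # acetic acid
    print(tagged)
    print(atoms)
```

With identifiers in place, a model's answer such as "disconnect at atoms 2 and 4" can be checked against the `atoms` list instead of being matched against free-form fragment descriptions.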
Authors (6)
Alan Kai Hassen
Andrius Bernatavicius
Antonius P. A. Janssen
Mike Preuss
Gerard J. P. van Westen
Djork-Arné Clevert
Submitted
October 18, 2025
Key Contributions
This paper introduces a framework that lets general-purpose LLMs perform molecular reasoning for single-step retrosynthesis without labeled training data, by anchoring chain-of-thought reasoning to the molecular structure through unique atomic identifiers. This sidesteps the data scarcity that limits supervised methods in chemistry. Across academic benchmarks and expert-validated drug discovery molecules, the reported success rates are at least 90% for identifying chemically plausible reaction sites, at least 40% for named reaction classes, and at least 74% for final reactants.
Business Value
Accelerates drug discovery and chemical synthesis by automating complex molecular reasoning tasks, potentially reducing R&D costs and time-to-market for new pharmaceuticals and materials.