
Atom-anchored LLMs speak Chemistry: A Retrosynthesis Demonstration


📄 Abstract

Applications of machine learning in chemistry are often limited by the scarcity and expense of labeled data, restricting traditional supervised methods. In this work, we introduce a framework for molecular reasoning using general-purpose Large Language Models (LLMs) that operates without requiring labeled training data. Our method anchors chain-of-thought reasoning to the molecular structure by using unique atomic identifiers. First, the LLM performs a one-shot task to identify relevant fragments and their associated chemical labels or transformation classes. In an optional second step, this position-aware information is used in a few-shot task with provided class examples to predict the chemical transformation. We apply our framework to single-step retrosynthesis, a task where LLMs have previously underperformed. Across academic benchmarks and expert-validated drug discovery molecules, our work enables LLMs to achieve high success rates in identifying chemically plausible reaction sites ($\geq90\%$), named reaction classes ($\geq40\%$), and final reactants ($\geq74\%$). Beyond solving complex chemical tasks, our work also provides a method to generate theoretically grounded synthetic datasets by mapping chemical knowledge onto the molecular structure and thereby addressing data scarcity.
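To make the idea of unique atomic identifiers concrete, here is a minimal sketch that numbers the heavy atoms of a SMILES string in reading order. It is an illustration only: the function names, the regular expression, and the token:ID output format are assumptions, not the authors' implementation, and the identifier format actually used in the paper is not stated in this summary. The tokenizer covers only bracket atoms and the organic subset, and its output is a plain listing rather than valid atom-mapped SMILES (which would also need explicit hydrogen counts, something a cheminformatics toolkit would normally handle).

```python
import re

# Assumed, simplified atom pattern: bracket atoms first, then two-letter
# halogens, then one-letter organic-subset atoms (aliphatic and aromatic).
ATOM_PATTERN = re.compile(r"\[[^\]]+\]|Br|Cl|[BCNOPSFI]|[bcnops]")


def label_atoms(smiles):
    """Return (atom_id, atom_token) pairs in SMILES reading order, IDs from 1."""
    matches = ATOM_PATTERN.finditer(smiles)
    return [(i, m.group(0)) for i, m in enumerate(matches, start=1)]


def anchored_listing(smiles):
    """Render the atoms as 'token:ID' entries that a prompt can refer to."""
    return ", ".join(f"{token}:{atom_id}" for atom_id, token in label_atoms(smiles))


if __name__ == "__main__":
    # Acetanilide; bonds, branches and ring-closure digits are skipped.
    print(anchored_listing("CC(=O)Nc1ccccc1"))
```

For acetanilide this prints `C:1, C:2, O:3, N:4, c:5, c:6, c:7, c:8, c:9, c:10`, so a model's reasoning can refer to, say, the bond between atoms 2 and 4 instead of describing the amide bond in words.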

Authors (6)

Alan Kai Hassen

Andrius Bernatavicius

Antonius P. A. Janssen

Mike Preuss

Gerard J. P. van Westen

Djork-Arné Clevert

Submitted

October 18, 2025

arXiv Category

cs.LG


Key Contributions

This paper introduces a framework that lets general-purpose LLMs perform molecular reasoning for retrosynthesis without labeled training data, by anchoring chain-of-thought reasoning to the molecular structure through unique atomic identifiers. This sidesteps the data scarcity that limits supervised chemistry ML. On academic benchmarks and expert-validated drug discovery molecules, the reported success rates are at least 90% for chemically plausible reaction sites, at least 40% for named reaction classes, and at least 74% for final reactants.
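The two-step procedure from the abstract (a one-shot step that picks out reaction sites and a reaction class, then an optional few-shot step that predicts the transformation) could be wired together roughly as follows. This is a hedged sketch, not the paper's code: the `LLM` callable, the prompt wording, the `IDS=...; CLASS=...` answer format, and every function name are hypothetical. The one thing it is meant to show is that, because atoms carry IDs, the model's answer can be checked against the molecule before it is passed on.

```python
import re
from typing import Callable, List, Optional, Tuple

# Hypothetical interface: any function that maps a prompt to generated text.
LLM = Callable[[str], str]


def site_prompt(anchored_molecule: str, worked_example: str) -> str:
    """Step 1 (one-shot): ask for reaction-site atom IDs and a reaction class."""
    return (
        "Each atom is written as token:ID.\n"
        f"Worked example:\n{worked_example}\n\n"
        f"Molecule: {anchored_molecule}\n"
        "Name the atom IDs of a plausible reaction site and the reaction class.\n"
        "Answer as: IDS=<comma-separated ids>; CLASS=<name>"
    )


def parse_site_answer(answer: str, n_atoms: int) -> Optional[Tuple[List[int], str]]:
    """Return (ids, class) or None if the format is wrong or an ID is out of range."""
    match = re.search(r"IDS=([\d,\s]+);\s*CLASS=(.+)", answer)
    if match is None:
        return None
    ids = [int(part) for part in match.group(1).split(",") if part.strip()]
    if not ids or any(i < 1 or i > n_atoms for i in ids):
        return None
    return ids, match.group(2).strip()


def reactant_prompt(anchored_molecule: str, ids: List[int],
                    reaction_class: str, class_examples: List[str]) -> str:
    """Step 2 (few-shot): ask for reactants given the site and class examples."""
    shots = "\n".join(class_examples)
    return (
        f"Examples of {reaction_class}:\n{shots}\n\n"
        f"Molecule: {anchored_molecule}\n"
        f"Reaction site atoms: {', '.join(str(i) for i in ids)}\n"
        "Give the reactants as SMILES separated by '.'."
    )


def run_two_step(llm: LLM, anchored_molecule: str, n_atoms: int,
                 worked_example: str, class_examples: List[str]) -> Optional[str]:
    """Run both steps; return None when step 1 cannot be validated."""
    parsed = parse_site_answer(llm(site_prompt(anchored_molecule, worked_example)), n_atoms)
    if parsed is None:
        return None
    ids, reaction_class = parsed
    return llm(reactant_prompt(anchored_molecule, ids, reaction_class, class_examples))
```

In this sketch an answer that cites an atom ID outside the molecule is rejected before the second call, which is the kind of structural check that free-text reasoning without identifiers would not allow.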

Business Value

Accelerates drug discovery and chemical synthesis by automating complex molecular reasoning tasks, potentially reducing R&D costs and time-to-market for new pharmaceuticals and materials.

Paper Metadata

Innovation Type

Methodological Innovation

Deployment Feasibility

High: the approach leverages existing LLMs and can be integrated into cheminformatics workflows.

Limitations Addressed

Scarcity and expense of labeled data in chemistry, restricting traditional supervised methods; LLMs' previous underperformance in retrosynthesis.

Performance Gains

Reported success rates of at least 90% for chemically plausible reaction sites, at least 40% for named reaction classes, and at least 74% for final reactants.

Technical Tags

LLM for chemistry, retrosynthesis, molecular reasoning, zero-shot learning, few-shot learning, chain-of-thought, atomic identifiers, drug discovery, unsupervised learning

Research Topics

AI in Chemistry, Molecular Design, LLM Applications, Scientific Discovery, Machine Learning for Science

Methods & Architectures

Anchored chain-of-thought reasoning, One-shot learning, Few-shot learning, Atomic identifier mapping, Large Language Models (LLMs)

Applications & Tasks

Drug Discovery, Chemical Synthesis, Materials Science, Molecular reasoning, Retrosynthesis prediction, Data scarcity in chemistry ML, Predicting chemical transformations, Identifying reaction pathways, Molecular fragment identification

Datasets & Benchmarks

Datasets

Academic benchmarks, Expert-validated drug discovery molecules

Success rate, Chemical plausibility

Related Fields

Cheminformatics, Computational Chemistry, Drug Discovery, Machine Learning, Artificial Intelligence

Keywords

LLM, Chemistry, Retrosynthesis, Molecular Reasoning, Zero-shot, Few-shot, Chain-of-thought, Drug Discovery, Data Scarcity, Atomic Identifiers, Chemical Synthesis, Machine Learning

Academic Context

#AI in Chemistry, #Molecular Design, #LLM Applications, #Scientific Discovery, #Machine Learning for Science

Commercial Potential

Potential Products

AI-powered retrosynthesis planning software, Automated chemical reaction prediction tools, Drug candidate generation platforms

Target Industries

Pharmaceuticals, Biotechnology, Chemical Manufacturing, Materials Science

Use Case Examples

Predicting synthetic routes for novel drug molecules, Identifying potential precursors for complex chemical compounds, Automating parts of the chemical process design

Competitive Edge

Offers a data-efficient approach for molecular reasoning using general LLMs, overcoming limitations of traditional supervised methods in data-scarce chemistry domains.

Market Opportunity

Significant market for drug discovery and chemical synthesis optimization tools.

Revenue Models

Licensing of software, API access for R&D platforms

Resource Requirements

Compute Needs

Moderate (for LLM inference)

Data Requirements

Minimal labeled data required; relies on molecular structure representations and chemical knowledge.

Deployment Constraints

Accuracy depends on the LLM's inherent chemical knowledge and the effectiveness of the anchoring mechanism.

Scalability

Scalable to various molecular reasoning tasks and different LLM backbones.

Regulatory Considerations

Ensuring safety and validity of predicted reactions in a real-world chemical synthesis context.

Production Readiness

Maturity Level

Research

Time to Market

2-3 years (for a specialized cheminformatics tool)

Patent Potential

Moderate (for novel anchoring techniques or specific applications)
