📄 Abstract
The development of synthesis procedures remains a fundamental challenge in
materials discovery, with procedural knowledge scattered across decades of
scientific literature in unstructured formats that are challenging for
systematic analysis. In this paper, we propose a multi-modal toolbox that
employs large language models (LLMs) and vision language models (VLMs) to
automatically extract and organize synthesis procedures and performance data
from materials science publications, covering text and figures. We curated 81k
open-access papers, yielding LeMat-Synth (v 1.0): a dataset containing
synthesis procedures spanning 35 synthesis methods and 16 material classes,
structured according to an ontology specific to materials science. The
extraction quality is rigorously evaluated on a subset of 2.5k synthesis
procedures through a combination of expert annotations and a scalable
LLM-as-a-judge framework. Beyond the dataset, we release a modular, open-source
software library designed to support community-driven extension to new corpora
and synthesis domains. Altogether, this work provides an extensible
infrastructure to transform unstructured literature into machine-readable
information. This lays the groundwork for predictive modeling of synthesis
procedures as well as modeling synthesis–structure–property relationships.
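To make "machine-readable" concrete, here is a minimal sketch of what one structured synthesis record might look like. It is only an illustration: the class names, field names, and example values (`SynthesisRecord`, `SynthesisStep`, the TiO2 sol-gel entry) are assumptions made up for this example and are not taken from the LeMat-Synth ontology or library.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List, Optional


@dataclass
class SynthesisStep:
    """One action in a procedure (hypothetical vocabulary, e.g. "mix", "heat")."""
    action: str
    temperature_c: Optional[float] = None  # None when the text gives no temperature
    duration_h: Optional[float] = None     # None when the text gives no duration


@dataclass
class SynthesisRecord:
    """Hypothetical record layout; the real LeMat-Synth schema may differ."""
    material: str
    material_class: str
    synthesis_method: str
    steps: List[SynthesisStep] = field(default_factory=list)

    def to_json(self) -> str:
        # asdict() recurses into nested dataclasses, so steps become plain dicts.
        return json.dumps(asdict(self), sort_keys=True)


# Invented example values, for illustration only.
record = SynthesisRecord(
    material="TiO2",
    material_class="oxide",
    synthesis_method="sol-gel",
    steps=[
        SynthesisStep(action="mix"),
        SynthesisStep(action="heat", temperature_c=450.0, duration_h=2.0),
    ],
)

print(record.to_json())
```

Leaving missing values as explicit `None` rather than guessing them keeps the record faithful to what the source text actually states, which matters if extraction quality is later checked against expert annotations.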
Authors (19)
Magdalena Lederbauer
Siddharth Betala
Xiyao Li
Ayush Jain
Amine Sehaba
Georgia Channing
+13 more
Submitted
October 28, 2025
Key Contributions
Develops a multi-modal toolbox that uses LLMs and VLMs to automatically extract and organize synthesis procedures and performance data from the text and figures of materials science publications. Introduces the LeMat-Synth dataset (v1.0), built from 81k open-access papers and spanning 35 synthesis methods and 16 material classes, along with a modular, open-source software library for extending the approach to new corpora and synthesis domains.
Business Value
Could accelerate materials discovery by giving researchers and developers structured, machine-readable data on synthesis procedures, which may reduce R&D time and costs.