arxiv_cv | 85% Match | Research Paper | AI researchers in chemistry, Computational chemists, Data scientists in pharma/materials | 20 hours ago

RxnCaption: Reformulating Reaction Diagram Parsing as Visual Prompt Guided Captioning


Abstract

Large-scale chemical reaction datasets are crucial for AI research in chemistry. However, existing chemical reaction data often exist as images within papers, making them not machine-readable and unusable for training machine learning models. In response to this challenge, we propose the RxnCaption framework for the task of chemical Reaction Diagram Parsing (RxnDP). Our framework reformulates the traditional coordinate prediction driven parsing process into an image captioning problem, which Large Vision-Language Models (LVLMs) handle naturally. We introduce a strategy termed "BBox and Index as Visual Prompt" (BIVP), which uses our state-of-the-art molecular detector, MolYOLO, to pre-draw molecular bounding boxes and indices directly onto the input image. This turns the downstream parsing into a natural-language description problem. Extensive experiments show that the BIVP strategy significantly improves structural extraction quality while simplifying model design. We further construct the RxnCaption-11k dataset, an order of magnitude larger than prior real-world literature benchmarks, with a balanced test subset across four layout archetypes. Experiments demonstrate that RxnCaption-VL achieves state-of-the-art performance on multiple metrics. We believe our method, dataset, and models will advance structured information extraction from chemical literature and catalyze broader AI applications in chemistry. We will release data, models, and code on GitHub.
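
The BIVP idea described in the abstract (pre-drawing molecule boxes and indices onto the input) can be illustrated with a minimal sketch. This is not the authors' implementation: the `Box` type, the reading-order indexing rule, the `diagram.png` file name, and the use of an SVG overlay instead of drawing on the raster image are all assumptions made here to keep the example dependency-free.

```python
from dataclasses import dataclass
from xml.sax.saxutils import quoteattr


@dataclass(frozen=True)
class Box:
    """Axis-aligned molecule box in pixels (hypothetical detector output)."""
    x1: float
    y1: float
    x2: float
    y2: float


def order_boxes(boxes):
    """Sort boxes top-to-bottom, then left-to-right (an assumed index convention)."""
    return sorted(boxes, key=lambda b: (b.y1, b.x1))


def visual_prompt_svg(width, height, boxes, image_href="diagram.png"):
    """Return an SVG that overlays numbered boxes on the referenced image."""
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">',
        f'<image href={quoteattr(image_href)} width="{width}" height="{height}"/>',
    ]
    for index, box in enumerate(order_boxes(boxes), start=1):
        w, h = box.x2 - box.x1, box.y2 - box.y1
        # Outline of the detected molecule.
        parts.append(
            f'<rect x="{box.x1}" y="{box.y1}" width="{w}" height="{h}" '
            'fill="none" stroke="red" stroke-width="2"/>'
        )
        # Index label placed just inside the top-left corner of the box.
        parts.append(
            f'<text x="{box.x1 + 4}" y="{box.y1 + 16}" fill="red" '
            f'font-size="14">{index}</text>'
        )
    parts.append("</svg>")
    return "\n".join(parts)
```

With the boxes and indices visible in the image, a captioning model can refer to "molecule 2" in plain text rather than emitting coordinates, which is the simplification the abstract attributes to BIVP.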

Key Contributions

Reformulates chemical reaction diagram parsing as a visual prompt guided captioning problem, enabling Large Vision-Language Models (LVLMs) to process chemical data. Introduces the BBox and Index as Visual Prompt (BIVP) strategy using a molecular detector (MolYOLO) to improve structural extraction quality and simplify model design.
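
To show why an index-based caption can simplify downstream handling, here is a rough sketch of turning one caption line back into a structured reaction. The source does not specify the caption format, so the `reactants: ... | conditions: ... | products: ...` layout, the `Reaction` class, and both helper functions are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Reaction:
    reactants: list = field(default_factory=list)   # molecule indices drawn on the image
    conditions: list = field(default_factory=list)  # free-text reagents, solvents, etc.
    products: list = field(default_factory=list)    # molecule indices drawn on the image


def parse_caption_line(line):
    """Parse 'reactants: 1, 2 | conditions: Pd/C, H2 | products: 3' (hypothetical format)."""
    reaction = Reaction()
    for segment in line.split("|"):
        key, _, value = segment.partition(":")
        key = key.strip().lower()
        items = [item.strip() for item in value.split(",") if item.strip()]
        if key in ("reactants", "products"):
            # Molecule roles are expressed as integer indices, not coordinates.
            setattr(reaction, key, [int(item) for item in items])
        elif key == "conditions":
            reaction.conditions = items
    return reaction


def resolve(reaction, boxes_by_index):
    """Swap each index for the detector box it labels; KeyError on an unknown index."""
    return {
        "reactants": [boxes_by_index[i] for i in reaction.reactants],
        "conditions": reaction.conditions,
        "products": [boxes_by_index[i] for i in reaction.products],
    }
```

Under these assumptions the language model only has to name indices, and the coordinates come from a lookup against the detector output.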

Business Value

Automates the extraction of crucial chemical reaction data from scientific literature, accelerating research and development in areas like drug discovery and materials science by making vast amounts of existing knowledge machine-readable.

Paper Metadata

Innovation Type

Novel problem formulation and methodology

Deployment Feasibility

Feasible, as it leverages existing LVLM architectures and a molecular detector. Requires integration into a pipeline for processing large volumes of documents.

Limitations Addressed

Machine unreadability of chemical reaction data in papers, which prevents its use for training ML models.

Technical Tags

visual prompt, image captioning, molecular detector, bounding box prediction, large vision-language models, chemical reaction parsing, LVLMs, MolYOLO

Research Topics

AI in Chemistry, Computer Vision, Natural Language Processing, Multimodal Learning, Information Extraction

Methods & Architectures

Visual Prompt Guided Captioning, Molecular Detection, Bounding Box Prediction, Large Vision-Language Models (LVLMs), Molecular Detector (MolYOLO)

Applications & Tasks

Chemistry, Drug Discovery, Materials Science, Information Extraction, Data Digitization, Image Understanding, Chemical Reaction Diagram Parsing (RxnDP), Molecular Structure Extraction, Reaction Representation

Related Fields

Computational Chemistry, Natural Language Processing, Computer Vision, Machine Learning

Keywords

chemical reactions, reaction parsing, image captioning, visual prompting, LVLM, MolYOLO, information extraction, chemistry data, machine readability, drug discovery, materials science, AI for science

Academic Context

#AI in Chemistry, #Computer Vision, #Natural Language Processing, #Multimodal Learning, #Information Extraction

Technology Stack

Frameworks & Libraries

LVLM

Commercial Potential

Potential Products

Automated chemical literature analysis platform, Drug discovery intelligence tool, Materials informatics database

Target Industries

Pharmaceuticals, Chemicals, Materials Science, Biotechnology

Use Case Examples

Digitizing reaction schemes from patents, Building large-scale reaction databases, Predicting reaction outcomes

Competitive Edge

Offers a novel approach by reframing the problem for LVLMs, potentially outperforming traditional parsing methods that rely on coordinate prediction.

Market Opportunity

Large, driven by the need for efficient data extraction in chemical R&D.

Revenue Models

SaaS subscription for data analysis platforms, licensing of technology.

Resource Requirements

Compute Needs

Likely significant, requiring GPU resources for training and inference of LVLMs and molecular detectors.

Data Requirements

Large datasets of chemical reaction diagrams paired with structured information.

Deployment Constraints

Integration with existing document processing pipelines, potential computational cost.

Scalability

Scalable to large corpora of chemical literature, provided sufficient computational resources.

Production Readiness

Maturity Level

Research/Development

Time to Market

1-3 years for a robust product.

Patent Potential

Moderate, for the novel BIVP strategy and the reformulation of RxnDP.
