arxiv_ml Research Paper Computational chemists, Drug discovery scientists, Structural biologists, Machine learning researchers in life sciences

Pearl: A Foundation Model for Placing Every Atom in the Right Location

📄 Abstract

Accurately predicting the three-dimensional structures of protein-ligand complexes remains a fundamental challenge in computational drug discovery that limits the pace and success of therapeutic design. Deep learning methods have recently shown strong potential as structural prediction tools, achieving promising accuracy across diverse biomolecular systems. However, their performance and utility are constrained by scarce experimental data, inefficient architectures, physically invalid poses, and the limited ability to exploit auxiliary information available at inference. To address these issues, we introduce Pearl (Placing Every Atom in the Right Location), a foundation model for protein-ligand cofolding at scale. Pearl addresses these challenges with three key innovations: (1) training recipes that include large-scale synthetic data to overcome data scarcity; (2) architectures that incorporate an SO(3)-equivariant diffusion module to inherently respect 3D rotational symmetries, improving generalization and sample efficiency, and (3) controllable inference, including a generalized multi-chain templating system supporting both protein and non-polymeric components as well as dual unconditional/conditional modes. Pearl establishes a new state-of-the-art performance in protein-ligand cofolding. On the key metric of generating accurate (RMSD < 2 Å) and physically valid poses, Pearl surpasses AlphaFold 3 and other open source baselines on the public Runs N' Poses and PoseBusters benchmarks, delivering 14.5% and 14.2% improvements, respectively, over the next best model. In the pocket-conditional cofolding regime, Pearl delivers 3.6× improvement on a proprietary set of challenging, real-world drug targets at the more rigorous RMSD < 1 Å threshold. Finally, we demonstrate that model performance correlates directly with synthetic dataset size used in training.
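The headline metric in the abstract combines a ligand RMSD threshold with a physical-validity requirement. The following is a minimal sketch of how such a success rate could be computed, not the paper's evaluation code. It assumes predicted and reference atoms are already matched one-to-one and expressed in the same coordinate frame, and it takes validity as a precomputed boolean per pose. The function names `rmsd` and `success_rate` are hypothetical, and benchmark-specific details such as symmetry correction are omitted.

```python
import math

def rmsd(pred, ref):
    """Root-mean-square deviation (in Å) between two matched lists of (x, y, z) atoms."""
    if len(pred) != len(ref) or not pred:
        raise ValueError("coordinate lists must be non-empty and of equal length")
    squared = sum((p - r) ** 2 for a, b in zip(pred, ref) for p, r in zip(a, b))
    return math.sqrt(squared / len(pred))

def success_rate(rmsds, valid_flags, threshold=2.0):
    """Fraction of poses that are both below the RMSD threshold and flagged valid."""
    if len(rmsds) != len(valid_flags) or not rmsds:
        raise ValueError("inputs must be non-empty and of equal length")
    hits = sum(1 for r, ok in zip(rmsds, valid_flags) if r < threshold and ok)
    return hits / len(rmsds)

# Toy example: every atom of the predicted pose is shifted by 1 Å along z.
reference = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
predicted = [(0.0, 0.0, 1.0), (1.0, 0.0, 1.0)]
pose_rmsd = rmsd(predicted, reference)

# Four hypothetical poses: the second is accurate but invalid, the third is inaccurate.
rate = success_rate([0.5, 1.5, 3.0, 1.0], [True, False, True, True])
```

Passing `threshold=1.0` gives the stricter criterion the abstract mentions for the pocket-conditional setting.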

Authors (40)

Genesis Research Team

Alejandro Dobles

Nina Jovic

Kenneth Leidal

Pranav Murugan

David C. Williams

+34 more

Submitted

October 28, 2025

arXiv Category

cs.LG

Key Contributions

Introduces Pearl, a foundation model for protein-ligand cofolding that addresses key challenges in structure prediction. It utilizes large-scale synthetic data, SO(3)-equivariant diffusion modules for inherent 3D symmetry, and improved architectures to predict accurate and physically valid protein-ligand complex structures at scale.
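To illustrate what SO(3)-equivariance means in practice, here is a small sketch that checks the property numerically on a toy coordinate update. The `toy_update` function is a hypothetical stand-in chosen because it is trivially equivariant. It is not Pearl's diffusion module, whose architecture is not described in this summary. The check is that rotating the input and then applying the update gives the same result as applying the update and then rotating.

```python
import math

def rotate(points, angle_z, angle_x):
    """Rotate 3D points about the z axis, then about the x axis (angles in radians)."""
    cz, sz = math.cos(angle_z), math.sin(angle_z)
    cx, sx = math.cos(angle_x), math.sin(angle_x)
    rotated = []
    for x, y, z in points:
        x, y = cz * x - sz * y, sz * x + cz * y  # rotation about z
        y, z = cx * y - sx * z, sx * y + cx * z  # rotation about x
        rotated.append((x, y, z))
    return rotated

def toy_update(points, step=0.1):
    """Hypothetical equivariant layer: move every point a fraction of the way to the centroid."""
    n = len(points)
    centroid = tuple(sum(p[i] for p in points) / n for i in range(3))
    return [tuple(p[i] + step * (centroid[i] - p[i]) for i in range(3)) for p in points]

def max_abs_diff(a, b):
    """Largest coordinate-wise difference between two point lists."""
    return max(abs(u - v) for p, q in zip(a, b) for u, v in zip(p, q))

atoms = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (0.0, 0.0, 3.0), (1.0, 1.0, 1.0)]
rotate_then_update = toy_update(rotate(atoms, 0.7, -1.2))
update_then_rotate = rotate(toy_update(atoms), 0.7, -1.2)
equivariance_error = max_abs_diff(rotate_then_update, update_then_rotate)
```

For an equivariant function, `equivariance_error` is zero up to floating-point rounding. A model without this property has to learn the same behavior for every orientation from data, which is the generalization and sample-efficiency argument the abstract makes.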

Business Value

Significantly accelerates the drug discovery pipeline by enabling faster and more accurate prediction of how drug candidates bind to target proteins. Reduces R&D costs and time-to-market for new therapeutics.

Paper Metadata

Innovation Type

Foundation Model Architecture and Training Strategy

Deployment Feasibility

Moderate. Requires significant computational resources for training and inference. Integration into drug discovery workflows is feasible.

Limitations Addressed

Scarce experimental data for training, Inefficient deep learning architectures, Physically invalid poses generated by models, Limited ability to exploit auxiliary information, Challenge of predicting 3D structures of protein-ligand complexes

Performance Gains

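On the combined metric of accurate (RMSD < 2 Å) and physically valid poses, Pearl improves on the next best model by 14.5% on Runs N' Poses and 14.2% on PoseBusters, surpassing AlphaFold 3 and other open source baselines. In pocket-conditional cofolding, it delivers a 3.6× improvement at the stricter RMSD < 1 Å threshold on a proprietary set of challenging drug targets.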

Technical Tags

Protein-Ligand Docking, Structure Prediction, Foundation Models, Deep Learning, Computational Drug Discovery, SO(3)-Equivariant Diffusion, Synthetic Data Generation, 3D Structure Prediction, Biomolecular Systems, Therapeutic Design

Research Topics

Computational Biology, Drug Discovery, Machine Learning for Chemistry, Generative Models, Structural Bioinformatics

Methods & Architectures

Foundation Model, SO(3)-Equivariant Diffusion Module, Large-scale synthetic data generation, Deep learning architectures, Diffusion Model, SO(3)-Equivariant Network

Applications & Tasks

Drug Discovery, Computational Chemistry, Biotechnology, Pharmaceuticals

Protein-Ligand Complex Structure Prediction, Protein Folding, Molecular Docking, Therapeutic Design

Accurately predicting 3D structures of protein-ligand complexes, Accelerating therapeutic design, Overcoming data scarcity, Generating physically valid poses

Related Fields

Computational Chemistry, Structural Biology, Machine Learning, Drug Development, Generative AI

Keywords

protein-ligand docking, structure prediction, drug discovery, foundation model, diffusion model, SO(3)-equivariant, computational chemistry, biotechnology, 3D structure, molecular modeling, therapeutic design, synthetic data

Academic Context

#Computational Biology, #Drug Discovery, #Machine Learning for Chemistry, #Generative Models, #Structural Bioinformatics

Commercial Potential

Potential Products

Drug candidate screening platforms, Molecular modeling software, AI-driven drug design services

Target Industries

Pharmaceuticals, Biotechnology, Chemicals, Research Institutions

Use Case Examples

Identifying potential drug candidates by predicting how they bind to target proteins, Designing novel molecules with desired binding properties, Understanding protein-ligand interactions at an atomic level

Competitive Edge

Represents a significant advancement over previous methods by being a foundation model trained at scale, incorporating physical symmetries, and using synthetic data to overcome scarcity.

Market Opportunity

Massive market for drug discovery and development tools and services.

Revenue Models

Licensing of the model/platform, providing drug discovery services, partnerships with pharmaceutical companies.

Resource Requirements

Compute Needs

High compute requirements for training large foundation models, likely involving significant GPU resources.

Data Requirements

Requires large datasets of protein-ligand complexes, supplemented by large-scale synthetic data.

Deployment Constraints

Computational cost for inference, integration into existing drug discovery pipelines, need for specialized expertise.

Scalability

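The abstract describes Pearl as a model for cofolding "at scale" and reports that performance correlates directly with the size of the synthetic training dataset, which suggests room for further gains from larger training sets.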

Regulatory Considerations

Regulatory approval processes for new drugs developed using this technology.

Production Readiness

Maturity Level

Research/Advanced Development

Time to Market

Medium-term for integration into drug discovery pipelines; long-term for direct impact on approved drugs.

Patent Potential

High potential for patents on the foundation model architecture, training methods, and specific applications in drug discovery.
