Research Paper for: computational biologists, protein engineers, bioinformaticians, researchers in drug discovery

PRISM: Enhancing Protein Inverse Folding through Fine-Grained Retrieval on Structure-Sequence Multimodal Representations

Abstract

Designing protein sequences that fold into a target three-dimensional structure, known as the inverse folding problem, is central to protein engineering but remains challenging due to the vast sequence space and the importance of local structural constraints. Existing deep learning approaches achieve strong recovery rates, yet they lack explicit mechanisms to reuse fine-grained structure-sequence patterns that are conserved across natural proteins. We present PRISM, a multimodal retrieval-augmented generation framework for inverse folding that retrieves fine-grained representations of potential motifs from known proteins and integrates them with a hybrid self-cross attention decoder. PRISM is formulated as a latent-variable probabilistic model and implemented with an efficient approximation, combining theoretical grounding with practical scalability. Across five benchmarks (CATH-4.2, TS50, TS500, CAMEO 2022, and the PDB date split), PRISM establishes a new state of the art in both perplexity and amino acid recovery, while also improving foldability metrics (RMSD, TM-score, pLDDT), demonstrating that fine-grained multimodal retrieval is a powerful and efficient paradigm for protein sequence design.
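
The abstract describes PRISM as a latent-variable probabilistic model with an efficient approximation but does not state its form. As a hedged illustration only, a generic retrieval-augmented formulation (an assumption, not an equation taken from the paper) treats the retrieved entries z as the latent variable, with x the target structure, s the designed sequence, and R_k(x) a hypothetical top-k retrieval set that truncates the sum:

```latex
% Generic retrieval-augmented latent-variable sketch.
% Assumed form for illustration; not the paper's own equation.
p_\theta(s \mid x) \;=\; \sum_{z} p(z \mid x)\, p_\theta(s \mid x, z)
\;\approx\; \sum_{z \in \mathcal{R}_k(x)} p(z \mid x)\, p_\theta(s \mid x, z)
```

Restricting the sum to a small retrieved set is one common way such models trade exactness for tractability; whether PRISM uses this particular truncation is not stated in the abstract.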

Authors (3)

Sazan Mahbub

Souvik Kundu

Eric P. Xing

Submitted

October 12, 2025

arXiv Category

q-bio.QM

Key Contributions

Introduces PRISM, a retrieval-augmented generation framework for the inverse folding problem. It retrieves fine-grained structure-sequence patterns from known proteins and integrates them through a hybrid self-cross attention decoder. PRISM is formulated as a latent-variable probabilistic model with an efficient approximation, and it reports new state-of-the-art results across five benchmarks by reusing conserved motifs.
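
The retrieve-then-attend idea can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the paper's architecture: the function names `retrieve_top_k` and `hybrid_attention` are hypothetical, embeddings are random placeholders, retrieval uses plain cosine similarity, and the attention has no learned projections or multiple heads.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def retrieve_top_k(queries, bank, k):
    """For each per-residue query embedding, return the indices of the
    k most cosine-similar entries in a memory bank (hypothetical helper)."""
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    b = bank / np.linalg.norm(bank, axis=1, keepdims=True)
    sims = q @ b.T                              # (L, N) cosine similarities
    return np.argsort(-sims, axis=1)[:, :k]     # (L, k) best matches first

def hybrid_attention(h, retrieved):
    """Each position attends jointly over all target positions (self)
    and over its own retrieved entries (cross), under one softmax."""
    L, d = h.shape
    self_scores = h @ h.T / np.sqrt(d)                                 # (L, L)
    cross_scores = np.einsum("ld,lkd->lk", h, retrieved) / np.sqrt(d)  # (L, k)
    weights = softmax(np.concatenate([self_scores, cross_scores], axis=1), axis=1)
    out = weights[:, :L] @ h + np.einsum("lk,lkd->ld", weights[:, L:], retrieved)
    return out, weights

# Toy usage with random placeholder embeddings (assumed sizes).
rng = np.random.default_rng(0)
L, N, d, k = 6, 50, 16, 3
h = rng.normal(size=(L, d))         # per-residue target representations
bank = rng.normal(size=(N, d))      # memory bank built from known proteins
idx = retrieve_top_k(h, bank, k)    # (L, k) indices into the bank
out, w = hybrid_attention(h, bank[idx])
```

Sharing a single softmax across self and cross scores lets each residue decide how much weight to place on retrieved patterns versus its own structural context; a real decoder would add learned query, key, and value projections.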

Business Value

Enables the rational design of novel proteins with specific functions, accelerating the development of new enzymes, therapeutics, biomaterials, and other protein-based technologies.

Paper Metadata

Innovation Type

Algorithmic Development

Deployment Feasibility

Moderate. Requires significant computational resources and expertise in bioinformatics and deep learning. Integration into protein engineering workflows is feasible.

Limitations Addressed

Addresses the challenge of designing protein sequences that fold correctly, particularly the difficulty of capturing local structural constraints and reusing conserved patterns from existing proteins, which limits the effectiveness of purely generative models.

Performance Gains

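Establishes a new state of the art in perplexity and amino acid recovery across five benchmarks (CATH-4.2, TS50, TS500, CAMEO 2022, and the PDB date split), and improves foldability metrics (RMSD, TM-score, pLDDT).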
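
For readers unfamiliar with the two headline metrics, the sketch below shows how amino acid recovery and perplexity are conventionally computed. The function names and toy inputs are illustrative assumptions, not the paper's evaluation code.

```python
import math

def recovery(native, designed):
    """Fraction of positions where the designed residue matches the native one."""
    if len(native) != len(designed):
        raise ValueError("sequences must have equal length")
    matches = sum(a == b for a, b in zip(native, designed))
    return matches / len(native)

def perplexity(native_probs):
    """Exponentiated mean negative log-likelihood, given the probability the
    model assigns to the native residue at each position."""
    nll = -sum(math.log(p) for p in native_probs) / len(native_probs)
    return math.exp(nll)

# Toy example: 6 of 8 residues match, so recovery is 0.75.
rec = recovery("MKTAYIAK", "MKSAYLAK")
# A model that assigns 0.25 to every native residue has perplexity of about 4.
ppl = perplexity([0.25, 0.25, 0.25, 0.25])
```

Higher recovery and lower perplexity are better; a uniform guess over the 20 standard amino acids would give a perplexity of 20.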

Technical Tags

inverse folding • protein design • retrieval-augmented generation • multimodal representations • structure-sequence • self-cross attention • latent variable models • protein engineering • motif retrieval • state-of-the-art

Research Topics

Computational Biology • Protein Engineering • Machine Learning for Biology • Generative Models • Bioinformatics

Methods & Architectures

Retrieval-augmented generation • Multimodal learning • Self-cross attention decoder • Latent-variable probabilistic model • Transformer-based decoders • Retrieval models • Generative models

Applications & Tasks

Protein Engineering • Drug Discovery • Synthetic Biology • Biotechnology • Protein sequence design • Inverse folding problem • Structure-sequence modeling • Designing protein sequences for target structures • Predicting structure from designed sequences (foldability evaluation) • Reusing conserved structure-sequence patterns

Datasets & Benchmarks

Datasets

CATH-4.2, TS50, TS500, CAMEO 2022, PDB date split

Benchmarks

CATH-4.2 • TS50 • TS500 • CAMEO 2022 • PDB date split

Recovery rates • Structure prediction accuracy • Designability metrics

Related Fields

Computational Biology • Bioinformatics • Machine Learning • Structural Biology • Protein Engineering

Keywords

inverse folding • protein design • retrieval-augmented generation • multimodal • structure-sequence • attention • latent variable model • protein engineering • bioinformatics • computational biology • motif • state-of-the-art

Academic Context

#Computational Biology • #Protein Engineering • #Machine Learning for Biology • #Generative Models • #Bioinformatics

Commercial Potential

Potential Products

Custom protein design services • Databases of designed protein sequences • Tools for protein engineering

Target Industries

Biotechnology • Pharmaceuticals • Agriculture • Materials Science • Chemicals

Use Case Examples

Designing novel enzymes for industrial processes • Creating therapeutic proteins (e.g., antibodies) • Developing new biomaterials with specific properties

Competitive Edge

Advances the state of the art in inverse folding by incorporating retrieval of fine-grained patterns, offering a more effective approach than purely generative methods.

Market Opportunity

Large and growing market for protein engineering and synthetic biology solutions.

Revenue Models

Licensing of designed proteins/sequences • Contract research services • Development of protein-based products

Resource Requirements

Compute Needs

High (training large multimodal models)

Data Requirements

Large datasets of protein structures and sequences.

Deployment Constraints

Requires significant computational resources and specialized expertise.

Scalability

The framework is designed for practical scalability, using efficient approximations.

Regulatory Considerations

Biosecurity (for engineered organisms) • Regulatory approval for therapeutic proteins

Production Readiness

Maturity Level

Research

Time to Market

3-5 years (for specific applications like therapeutics)

Patent Potential

High (novel protein design methods and applications)
