
Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

📄 Abstract

Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether sequences are treated as a specialized language (risking the loss of functional motif information) or as a separate modality (introducing formidable alignment challenges), current strategies fundamentally limit the models' reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level, noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, including the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in a nascent ability to interpret biomolecular syntax from scratch, but in a profound capacity for reasoning over structured, human-readable knowledge. We therefore argue for reframing Sci-LLMs not as sequence decoders but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at github.com/opendatalab-raise-dev/CoKE.
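
The evaluation design is simple to reproduce in spirit: each benchmark item can be presented to a Sci-LLM in one of three input modes. The sketch below is a minimal, hypothetical illustration (record fields, prompt wording, and helper names are assumptions, not the CoKE reference implementation) of how the sequence-only, context-only, and combined prompts differ.

```python
# Minimal sketch (hypothetical, not the CoKE reference implementation) of how the
# three input modes compared in the paper -- sequence-only, context-only, and
# combined -- could be assembled into prompts for a Sci-LLM evaluation harness.
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    sequence: str   # raw amino-acid sequence
    context: str    # human-readable annotations from bioinformatics tools
    question: str   # biological reasoning question about the record


def build_prompt(record: ProteinRecord, mode: str) -> str:
    """Render one of the three input modes studied in the paper."""
    parts = []
    if mode in ("sequence-only", "combined"):
        parts.append(f"Protein sequence:\n{record.sequence}")
    if mode in ("context-only", "combined"):
        parts.append(f"Structured context (tool-derived annotations):\n{record.context}")
    parts.append(f"Question: {record.question}")
    return "\n\n".join(parts)


# Toy usage (illustrative data only):
rec = ProteinRecord(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    context="Pfam: response regulator receiver domain; GO: phosphorelay signal transduction",
    question="What is the most likely molecular function of this protein?",
)
for mode in ("sequence-only", "context-only", "combined"):
    print(f"=== {mode} ===")
    print(build_prompt(rec, mode))
```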
Authors (13)
Kai Zhuang
Jiawei Zhang
Yumou Liu
Hanqun Cao
Chunbin Gu
Mengdi Liu
+7 more
Submitted: October 27, 2025
arXiv Category: cs.AI

Key Contributions

Challenges the sequence-centric paradigm in Scientific LLMs by proposing that providing high-level structured context (derived from bioinformatics tools) is more effective than direct sequence processing. This bypasses tokenization issues and significantly enhances biological reasoning capabilities.
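
As a concrete illustration of what "high-level structured context" could look like in practice, the sketch below renders per-tool annotations into a compact, human-readable block that a Sci-LLM can reason over. The tool names, fields, and findings are illustrative assumptions, not the paper's actual annotation pipeline.

```python
# Hypothetical sketch of turning bioinformatics tool output into the kind of
# high-level, human-readable context the paper argues Sci-LLMs reason over best.
# The annotation schema and example findings are assumptions for illustration.
from typing import Dict, List


def render_context(annotations: Dict[str, List[str]]) -> str:
    """Flatten per-tool annotations into a compact, readable context block."""
    lines = []
    for tool, findings in annotations.items():
        for finding in findings:
            lines.append(f"- [{tool}] {finding}")
    return "\n".join(lines)


# Toy example: annotations that might come from domain, family, and topology tools.
annotations = {
    "Pfam": ["PF00072: response regulator receiver domain (positions 5-120)"],
    "InterPro": ["IPR001789: signal transduction response regulator"],
    "DeepTMHMM": ["No transmembrane helices predicted"],
}
print(render_context(annotations))
```

The rendered block can then be dropped into the context-only or combined prompt modes shown earlier, keeping the model's input in the structured, human-readable form the paper finds most effective.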

Business Value

Accelerates biological research and discovery by enabling AI models to better understand and reason about complex biomolecular data. This can lead to faster development of new drugs, therapies, and biotechnologies.