
Lost in Tokenization: Context as the Key to Unlocking Biomolecular Understanding in Scientific LLMs

📄 Abstract

Scientific Large Language Models (Sci-LLMs) have emerged as a promising frontier for accelerating biological discovery. However, these models face a fundamental challenge when processing raw biomolecular sequences: the tokenization dilemma. Whether sequences are treated as a specialized language (risking the loss of functional motif information) or as a separate modality (introducing formidable alignment challenges), current strategies fundamentally limit the models' reasoning capacity. We challenge this sequence-centric paradigm by positing that a more effective strategy is to provide Sci-LLMs with high-level structured context derived from established bioinformatics tools, thereby bypassing the need to interpret low-level, noisy sequence data directly. Through a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we tested three input modes: sequence-only, context-only, and a combination of both. Our findings are striking: the context-only approach consistently and substantially outperforms all other modes. Even more revealing, including the raw sequence alongside its high-level context consistently degrades performance, indicating that raw sequences act as informational noise even for models with specialized tokenization schemes. These results suggest that the primary strength of existing Sci-LLMs lies not in a nascent ability to interpret biomolecular syntax from scratch, but in a profound capacity for reasoning over structured, human-readable knowledge. We therefore argue for reframing Sci-LLMs not as sequence decoders but as powerful reasoning engines over expert knowledge. This work lays the foundation for a new class of hybrid scientific AI agents, repositioning the developmental focus from direct sequence interpretation towards high-level knowledge synthesis. The code is available at github.com/opendatalab-raise-dev/CoKE.
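
The evaluation design is simple to reproduce in spirit: each benchmark item can be presented to a Sci-LLM in one of three input modes. The sketch below is a minimal, hypothetical illustration (record fields, prompt wording, and helper names are assumptions, not the CoKE reference implementation) of how the sequence-only, context-only, and combined prompts differ.

```python
# Minimal sketch (hypothetical, not the CoKE reference implementation) of how the
# three input modes compared in the paper -- sequence-only, context-only, and
# combined -- could be assembled into prompts for a Sci-LLM evaluation harness.
from dataclasses import dataclass


@dataclass
class ProteinRecord:
    sequence: str   # raw amino-acid sequence
    context: str    # human-readable annotations from bioinformatics tools
    question: str   # biological reasoning question about the record


def build_prompt(record: ProteinRecord, mode: str) -> str:
    """Render one of the three input modes studied in the paper."""
    parts = []
    if mode in ("sequence-only", "combined"):
        parts.append(f"Protein sequence:\n{record.sequence}")
    if mode in ("context-only", "combined"):
        parts.append(f"Structured context (tool-derived annotations):\n{record.context}")
    parts.append(f"Question: {record.question}")
    return "\n\n".join(parts)


# Toy usage (illustrative data only):
rec = ProteinRecord(
    sequence="MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
    context="Pfam: response regulator receiver domain; GO: phosphorelay signal transduction",
    question="What is the most likely molecular function of this protein?",
)
for mode in ("sequence-only", "context-only", "combined"):
    print(f"=== {mode} ===")
    print(build_prompt(rec, mode))
```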
Authors (13)
Kai Zhuang
Jiawei Zhang
Yumou Liu
Hanqun Cao
Chunbin Gu
Mengdi Liu
+7 more
Submitted: October 27, 2025
arXiv Category: cs.AI

Key Contributions

Challenges the sequence-centric paradigm in Scientific LLMs by proposing that providing high-level structured context (derived from bioinformatics tools) is more effective than direct sequence processing. This bypasses tokenization issues and significantly enhances biological reasoning capabilities.
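
As a concrete illustration of what "high-level structured context" could look like in practice, the sketch below renders per-tool annotations into a compact, human-readable block that a Sci-LLM can reason over. The tool names, fields, and findings are illustrative assumptions, not the paper's actual annotation pipeline.

```python
# Hypothetical sketch of turning bioinformatics tool output into the kind of
# high-level, human-readable context the paper argues Sci-LLMs reason over best.
# The annotation schema and example findings are assumptions for illustration.
from typing import Dict, List


def render_context(annotations: Dict[str, List[str]]) -> str:
    """Flatten per-tool annotations into a compact, readable context block."""
    lines = []
    for tool, findings in annotations.items():
        for finding in findings:
            lines.append(f"- [{tool}] {finding}")
    return "\n".join(lines)


# Toy example: annotations that might come from domain, family, and topology tools.
annotations = {
    "Pfam": ["PF00072: response regulator receiver domain (positions 5-120)"],
    "InterPro": ["IPR001789: signal transduction response regulator"],
    "DeepTMHMM": ["No transmembrane helices predicted"],
}
print(render_context(annotations))
```

The rendered block can then be dropped into the context-only or combined prompt modes shown earlier, keeping the model's input in the structured, human-readable form the paper finds most effective.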

Business Value

Accelerates biological research and discovery by enabling AI models to better understand and reason about complex biomolecular data. This can lead to faster development of new drugs, therapies, and biotechnologies.