Abstract
Scientific Large Language Models (Sci-LLMs) have emerged as a promising
frontier for accelerating biological discovery. However, these models face a
fundamental challenge when processing raw biomolecular sequences: the
tokenization dilemma. Whether current strategies treat sequences as a specialized
language, risking the loss of functional motif information, or as a separate
modality, introducing formidable alignment challenges, they fundamentally limit
the models' reasoning capacity. We challenge this sequence-centric paradigm by
positing that a more effective strategy is to provide Sci-LLMs with high-level
structured context derived from established bioinformatics tools, thereby
bypassing the need to interpret low-level noisy sequence data directly. Through
a systematic comparison of leading Sci-LLMs on biological reasoning tasks, we
tested three input modes: sequence-only, context-only, and a combination of
both. Our findings are striking: the context-only approach consistently and
substantially outperforms all other modes. Even more revealingly, the inclusion
of the raw sequence alongside its high-level context consistently degrades
performance, indicating that raw sequences act as informational noise, even for
models with specialized tokenization schemes. These results suggest that the
primary strength of existing Sci-LLMs lies not in their nascent ability to
interpret biomolecular syntax from scratch, but in their profound capacity for
reasoning over structured, human-readable knowledge. Therefore, we argue for
reframing Sci-LLMs not as sequence decoders, but as powerful reasoning engines
over expert knowledge. This work lays the foundation for a new class of hybrid
scientific AI agents, repositioning the developmental focus from direct
sequence interpretation towards high-level knowledge synthesis. The code is
available at github.com/opendatalab-raise-dev/CoKE.
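To make the three input modes concrete, the sketch below shows how such a comparison could be assembled. It is a minimal illustration only: the prompt wording, the example annotations, and the query_llm placeholder are assumptions for exposition and are not taken from the CoKE codebase.

```python
# Minimal sketch of the three input modes compared in the paper:
# sequence-only, context-only, and sequence + context.
# Prompt templates, example data, and query_llm() are hypothetical.

def build_prompt(mode: str, sequence: str, context: str, question: str) -> str:
    """Assemble a prompt for one of the three input modes."""
    parts = []
    if mode in ("sequence", "both"):
        parts.append(f"Protein sequence:\n{sequence}")
    if mode in ("context", "both"):
        parts.append(f"Structured annotations from bioinformatics tools:\n{context}")
    parts.append(f"Question: {question}")
    return "\n\n".join(parts)

def query_llm(prompt: str) -> str:
    """Placeholder for a call to a Sci-LLM client; not implemented here."""
    raise NotImplementedError

if __name__ == "__main__":
    sequence = "MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ"  # illustrative fragment, not a real target
    context = (
        "- Pfam domain: Globin (PF00042), residues 5-110\n"
        "- GO term: oxygen transport (GO:0015671)\n"
        "- Predicted subcellular location: cytoplasm"
    )
    question = "What is the most likely molecular function of this protein?"

    for mode in ("sequence", "context", "both"):
        print(f"--- {mode} ---")
        print(build_prompt(mode, sequence, context, question))
        print()
```

Under this framing, the paper's finding is that the context-only variant yields the strongest reasoning performance, while adding the raw sequence on top of the context tends to hurt rather than help.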
Authors (13)
Kai Zhuang
Jiawei Zhang
Yumou Liu
Hanqun Cao
Chunbin Gu
Mengdi Liu
+7 more
Submitted
October 27, 2025
Key Contributions
Challenges the sequence-centric paradigm in Scientific LLMs by proposing that providing high-level structured context (derived from bioinformatics tools) is more effective than direct sequence processing. This bypasses tokenization issues and significantly enhances biological reasoning capabilities.
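As a rough illustration of what such high-level structured context might look like, the sketch below renders hypothetical tool-derived annotations into a human-readable block that stands in for the raw sequence; the record fields, tool names, and formatting are illustrative assumptions, not the paper's actual pipeline.

```python
# Hypothetical example: converting bioinformatics-tool outputs into a
# structured, human-readable context block for a Sci-LLM.

from dataclasses import dataclass

@dataclass
class Annotation:
    source: str  # e.g. "Pfam", "SignalP", "GO"
    label: str   # human-readable finding
    detail: str  # span, score, or identifier

def to_context_block(annotations: list[Annotation]) -> str:
    """Render tool-derived annotations as lines of structured context."""
    return "\n".join(f"[{a.source}] {a.label} ({a.detail})" for a in annotations)

annotations = [
    Annotation("Pfam", "Globin domain", "residues 5-110, E-value 1e-30"),
    Annotation("SignalP", "no signal peptide predicted", "likelihood 0.02"),
    Annotation("GO", "oxygen transport", "GO:0015671"),
]
print(to_context_block(annotations))
```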
Business Value
Accelerates biological research and discovery by enabling AI models to better understand and reason about complex biomolecular data. This can lead to faster development of new drugs, therapies, and biotechnologies.