Abstract
Pre-trained language models exemplified by the Transformer have been shown to possess strong base capabilities, and the Transformer's representative self-attention mechanism has become a classic among sequence modeling architectures. Unlike work that proposes new sequence modeling architectures to improve the efficiency of the attention mechanism, this work focuses on the impact of sequence modeling architectures on base capabilities. Specifically, our concern is: how exactly do sequence modeling architectures affect the base capabilities of pre-trained language models? In this work, we first point out that the mixed-domain pre-training setting commonly adopted in existing architecture design work fails to adequately reveal the differences in base capabilities among architectures. To address this, we propose a limited-domain pre-training setting with out-of-distribution testing, which successfully uncovers significant differences in base capabilities among architectures at an early stage. Next, we analyze the base capabilities of stateful sequence modeling architectures and find that they degrade significantly compared to the Transformer. Then, through a series of architecture component analyses, we distill a key architecture design principle: a sequence modeling architecture must possess full-sequence arbitrary selection capability to avoid degradation in base capabilities. Finally, we empirically validate this principle with an extremely simple Top-1 element selection architecture and further generalize it to a more practical Top-1 chunk selection architecture. Experimental results support the proposed design principle and suggest that this work can serve as a useful reference for future architecture improvements and novel designs.
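
As a reading aid, here is a minimal, hypothetical sketch of what a Top-1 chunk selection layer could look like: the sequence is split into fixed-size chunks, each query scores the chunks, hard-selects the single best one, and attends only within it. This is not the authors' implementation; the module name, the chunk-scoring rule (mean key per chunk), and the omission of causal masking are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F
from torch import nn


class Top1ChunkSelection(nn.Module):
    """Toy non-causal layer: each query hard-selects one chunk of the sequence
    and attends only within it. Names and details are illustrative, not the
    paper's implementation; causal masking is omitted for brevity."""

    def __init__(self, d_model: int, chunk_size: int):
        super().__init__()
        self.chunk_size = chunk_size
        self.q_proj = nn.Linear(d_model, d_model)
        self.k_proj = nn.Linear(d_model, d_model)
        self.v_proj = nn.Linear(d_model, d_model)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model); seq_len assumed divisible by chunk_size.
        b, t, d = x.shape
        c = self.chunk_size
        n = t // c

        q = self.q_proj(x)                              # (b, t, d)
        k = self.k_proj(x).view(b, n, c, d)             # (b, n, c, d)
        v = self.v_proj(x).view(b, n, c, d)             # (b, n, c, d)

        # Score every chunk by its mean key and pick the top-1 chunk per query.
        chunk_keys = k.mean(dim=2)                      # (b, n, d)
        chunk_scores = torch.einsum("btd,bnd->btn", q, chunk_keys) / d ** 0.5
        best = chunk_scores.argmax(dim=-1)              # (b, t)

        # Gather the keys/values of each query's selected chunk.
        batch_idx = torch.arange(b, device=x.device)[:, None]   # (b, 1)
        sel_k = k[batch_idx, best]                      # (b, t, c, d)
        sel_v = v[batch_idx, best]                      # (b, t, c, d)

        # Ordinary softmax attention restricted to the selected chunk.
        attn = torch.einsum("btd,btcd->btc", q, sel_k) / d ** 0.5
        weights = F.softmax(attn, dim=-1)
        return torch.einsum("btc,btcd->btd", weights, sel_v)


# Example usage on random input.
layer = Top1ChunkSelection(d_model=64, chunk_size=16)
out = layer(torch.randn(2, 128, 64))                    # shape (2, 128, 64)
```

The point of the sketch is the design principle itself: even with a hard Top-1 choice, every query can still reach any part of the sequence, which is the full-sequence arbitrary selection capability the paper argues is needed to avoid degradation in base capabilities.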
Authors (6)
Xin Lu
Yanyan Zhao
Si Wei
Shijin Wang
Bing Qin
Ting Liu
Key Contributions
This paper investigates how sequence modeling architectures influence the base capabilities of pre-trained language models. It proposes a limited-domain pre-training setting with out-of-distribution testing that reveals architectural differences more clearly than the mixed-domain pre-training commonly used in existing work, and it distills a key design principle: an architecture needs full-sequence arbitrary selection capability to avoid degradation in base capabilities, validated with Top-1 element and Top-1 chunk selection architectures. A sketch of the evaluation setting follows below.
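
To make the evaluation protocol concrete, the sketch below shows one way a limited-domain pre-training setting with out-of-distribution testing could be scored: pre-train each candidate architecture on a single source domain and compare in-domain versus OOD perplexity. The paper's actual domains, datasets, and metrics are not specified here, so `pretrain`, `source_domain_batches`, and `ood_domain_batches` are hypothetical placeholders, and the model is assumed to be a causal LM that returns next-token logits.

```python
import math
import torch
import torch.nn.functional as F


def evaluate_perplexity(model, token_batches) -> float:
    """Mean per-token perplexity of a causal LM over an iterable of
    (batch, seq_len) LongTensors of token ids."""
    model.eval()
    total_loss, total_tokens = 0.0, 0
    with torch.no_grad():
        for tokens in token_batches:
            # Standard next-token objective: shift inputs/targets by one.
            logits = model(tokens[:, :-1])               # (b, t-1, vocab)
            loss = F.cross_entropy(
                logits.reshape(-1, logits.size(-1)),
                tokens[:, 1:].reshape(-1),
                reduction="sum",
            )
            total_loss += loss.item()
            total_tokens += tokens[:, 1:].numel()
    return math.exp(total_loss / total_tokens)


# Limited-domain setting (hypothetical harness): pre-train on one source
# domain only, then compare in-domain vs. out-of-distribution perplexity.
# model = pretrain(architecture, source_domain_batches)
# print("in-domain ppl:", evaluate_perplexity(model, source_domain_batches))
# print("OOD ppl      :", evaluate_perplexity(model, ood_domain_batches))
```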
Business Value
Improved understanding of LLM architectures can lead to more efficient and effective model development, reducing training costs and improving performance for downstream applications.