📄 Abstract
Recent large language model (LLM) research has undergone an architectural
shift from encoder-decoder modeling to the now-dominant decoder-only
modeling. This rapid transition, however, has come without a rigorous comparative
analysis, especially from the scaling perspective, raising concerns
that the potential of encoder-decoder models may have been overlooked. To fill
this gap, we revisit encoder-decoder LLM (RedLLM), enhancing it with recent
recipes from decoder-only LLM (DecLLM). We conduct a comprehensive comparison
between RedLLM, pretrained with prefix language modeling (LM), and DecLLM,
pretrained with causal LM, at different model scales ranging from ~150M
to ~8B parameters. Using RedPajama V1 (1.6T tokens) for pretraining and FLAN for
instruction tuning, our experiments show that RedLLM exhibits compelling
scaling properties and surprisingly strong performance. While DecLLM is overall
more compute-optimal during pretraining, RedLLM demonstrates comparable scaling
and context length extrapolation capabilities. After instruction tuning, RedLLM
achieves comparable or even better results on various downstream tasks while
enjoying substantially better inference efficiency. We hope our findings
inspire further efforts to re-examine RedLLM, unlocking its potential for
developing powerful and efficient LLMs.
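To make the two pretraining objectives concrete: prefix LM lets the model attend bidirectionally over the input (prefix) and causally over the target, whereas causal LM is strictly left-to-right everywhere. The sketch below illustrates the two attention masks; it is a minimal illustration of the standard formulation of these objectives, and the function names are ours rather than anything from the paper's codebase.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Causal LM (DecLLM-style) mask: position i attends only to positions <= i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def prefix_lm_mask(prefix_len: int, target_len: int) -> np.ndarray:
    """Prefix LM (RedLLM-style) mask: the prefix attends bidirectionally to
    itself; the target attends causally to itself and fully to the prefix."""
    total = prefix_len + target_len
    mask = np.tril(np.ones((total, total), dtype=bool))
    mask[:prefix_len, :prefix_len] = True  # make the prefix block bidirectional
    return mask

# Example: a 3-token prefix (input) followed by a 2-token target.
print(causal_mask(5).astype(int))
print(prefix_lm_mask(3, 2).astype(int))
```

The same structural split plausibly underlies the inference-efficiency advantage noted in the abstract: an encoder-decoder encodes the prompt once and only runs the decoder during generation, though the paper itself should be consulted for the exact analysis.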
Authors (6)
Biao Zhang
Yong Cheng
Siamak Shakeri
Xinyi Wang
Min Ma
Orhan Firat
Submitted
October 30, 2025
Key Contributions
This paper rigorously revisits and enhances encoder-decoder LLMs (RedLLM) with modern techniques, conducting a comprehensive, scale-aware comparison against dominant decoder-only LLMs (DecLLM). It demonstrates that RedLLM exhibits compelling scaling properties and strong performance, suggesting that the potential of encoder-decoder architectures may have been overlooked due to a lack of rigorous comparative analysis.
Business Value
Provides insights into optimal LLM architectures for different computational budgets and performance goals, potentially leading to more efficient development and deployment of LLM-based applications.