Abstract
Bidirectional language models capture context more effectively and outperform
unidirectional models on natural language understanding tasks, yet
the theoretical reasons behind this advantage remain unclear. In this work, we
investigate this disparity through the lens of the Information Bottleneck (IB)
principle, which formalizes a trade-off between compressing input information
and preserving task-relevant content. We propose FlowNIB, a dynamic and
scalable method for estimating mutual information during training that
addresses key limitations of classical IB approaches, including computational
intractability and fixed trade-off schedules. Theoretically, we show that
bidirectional models retain more mutual information and exhibit higher
effective dimensionality than unidirectional models. To support this, we
present a generalized framework for measuring representational complexity and
prove that bidirectional representations are strictly more informative under
mild conditions. We further validate our findings through extensive experiments
across multiple models and tasks using FlowNIB, revealing how information is
encoded and compressed throughout training. Together, our work provides a
principled explanation for the effectiveness of bidirectional architectures and
introduces a practical tool for analyzing information flow in deep language
models.
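For reference, the compression/prediction trade-off mentioned above is classically formalized as the IB Lagrangian over a stochastic encoding Z of the input X with target Y. This is the standard formulation only; the page does not state the exact objective or estimator that FlowNIB uses, and the "fixed trade-off schedule" it addresses corresponds to keeping the multiplier β constant during training.

```latex
% Classical Information Bottleneck objective (standard formulation):
% learn an encoding Z of input X that is maximally compressed
% while remaining predictive of the target Y.
% \beta sets the compression/prediction trade-off.
\min_{p(z \mid x)} \; \mathcal{L}_{\mathrm{IB}}
  \;=\; I(X; Z) \;-\; \beta \, I(Z; Y)
```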
Key Contributions
Investigates the theoretical advantage of bidirectional language models over unidirectional ones using the Information Bottleneck (IB) principle. Proposes FlowNIB, a dynamic and scalable method for estimating mutual information during training, showing theoretically and empirically that bidirectional models retain more mutual information and have higher effective dimensionality.
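The page does not define how effective dimensionality is measured in the paper. A common proxy is the participation ratio of the representation covariance spectrum, sketched below purely for illustration; the function name and the example arrays are placeholders, not the paper's API or data.

```python
import numpy as np

def effective_dimensionality(hidden_states: np.ndarray) -> float:
    """Participation-ratio proxy for effective dimensionality.

    hidden_states: array of shape (num_examples, hidden_dim), e.g. pooled
    encoder representations for a probe dataset. This is a common proxy,
    not necessarily the measure used in the paper.
    """
    # Center the representations and compute the covariance spectrum.
    centered = hidden_states - hidden_states.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / max(len(centered) - 1, 1)
    eigvals = np.clip(np.linalg.eigvalsh(cov), 0.0, None)  # guard tiny negatives
    # Participation ratio: (sum lambda)^2 / sum lambda^2, in [1, hidden_dim].
    return float(eigvals.sum() ** 2 / (np.square(eigvals).sum() + 1e-12))

# Toy comparison: an isotropic representation vs. one confined to a
# low-rank subspace (placeholder arrays standing in for model activations).
if __name__ == "__main__":
    rng = np.random.default_rng(0)
    broad_repr = rng.normal(size=(512, 768))                      # full-rank
    low_rank_repr = rng.normal(size=(512, 64)) @ rng.normal(size=(64, 768))
    print(effective_dimensionality(broad_repr))     # close to 768
    print(effective_dimensionality(low_rank_repr))  # close to 64
```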
Business Value
Deeper theoretical understanding can guide the development of more efficient and effective language models, leading to better performance in downstream NLP applications.