Abstract
We develop hybrid memory architectures for general-purpose sequence-processing
neural networks that combine key-value memory using softmax attention
(KV-memory) with fast weight memory through dynamic synaptic modulation
(FW-memory) -- the core principles of quadratic and linear transformers,
respectively. These two memory systems have complementary but individually
limited properties: KV-memory offers precise retrieval but is constrained by
quadratic complexity in sequence length, while FW-memory supports arbitrarily
long sequences and enables more expressive computation but sacrifices precise
recall. We propose and compare three methods to blend these two systems into a
single memory system, differing in how and when input information is delivered
to each system, to leverage the strengths of both. We conduct experiments on
general language modeling and retrieval tasks by training 340M- and
1.3B-parameter models from scratch, as well as on synthetic algorithmic tasks
designed to precisely illustrate the benefits of certain hybrid methods over
others. We also evaluate our hybrid memory systems on reinforcement learning in
partially observable environments. Overall, we demonstrate how a well-designed
hybrid can overcome the limitations of its individual components, offering new
insights into the design principles of neural memory systems.
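The contrast between the two memory systems can be made concrete with a minimal sketch. The snippet below is an illustrative NumPy implementation of a softmax-attention KV cache alongside a linear-transformer-style fast-weight matrix; the ReLU feature map and the plain outer-product (Hebbian) update are assumptions made for illustration and need not match the update rules used in the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

class KVMemory:
    """Key-value memory via softmax attention: precise recall, but the
    cache (and the per-step read cost) grows with sequence length."""
    def __init__(self):
        self.keys, self.values = [], []

    def write(self, k, v):
        self.keys.append(k)
        self.values.append(v)

    def read(self, q):
        K = np.stack(self.keys)                    # (t, d)
        V = np.stack(self.values)                  # (t, d)
        att = softmax(K @ q / np.sqrt(len(q)))     # attention over all t steps
        return att @ V

class FWMemory:
    """Fast-weight memory: a fixed-size matrix updated with outer products
    (linear-transformer style), so the per-step cost is constant regardless
    of sequence length, at the price of approximate recall."""
    def __init__(self, d):
        self.W = np.zeros((d, d))

    def write(self, k, v):
        phi = np.maximum(k, 0.0)                   # simple positive feature map (assumption)
        self.W += np.outer(v, phi)

    def read(self, q):
        phi = np.maximum(q, 0.0)
        return self.W @ phi
```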
Authors (3)
Kazuki Irie
Morris Yau
Samuel J. Gershman
Key Contributions
Develops hybrid memory architectures for sequence processing neural networks by combining Key-Value (KV) memory (quadratic complexity, precise retrieval) with Fast Weight (FW) memory (linear complexity, expressive computation). Three methods are proposed to blend these complementary systems, aiming to leverage the strengths of both for improved performance on language modeling and algorithmic tasks.
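As a rough illustration of what "blending" could look like, the sketch below delivers each input to both memories and mixes their read-outs with a fixed scalar weight. This hypothetical "write-to-both, mix-the-reads" scheme is only a stand-in built on the sketch classes above; it does not correspond to any of the three specific methods proposed in the paper.

```python
def hybrid_step(kv_mem, fw_mem, k, v, q, mix=0.5):
    """Write the current (key, value) pair into both memories, then mix the
    two read-outs with a fixed weight `mix` (a hypothetical choice; in
    practice the mixing could be learned or input-dependent)."""
    kv_mem.write(k, v)
    fw_mem.write(k, v)
    return mix * kv_mem.read(q) + (1.0 - mix) * fw_mem.read(q)


# Example usage with the sketch classes above.
d = 8
rng = np.random.default_rng(0)
kv, fw = KVMemory(), FWMemory(d)
k, v, q = rng.normal(size=(3, d))
out = hybrid_step(kv, fw, k, v, q)
```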
Business Value
Enables the development of more efficient and capable sequence models, potentially leading to better performance in NLP applications and handling longer contexts.