Abstract
Memory Mosaics [Zhang et al., 2025], networks of associative memories, have demonstrated appealing compositional and in-context learning capabilities on medium-scale networks (GPT-2 scale) and small synthetic datasets. This work shows that these favorable properties remain when memory mosaics are scaled to large language model sizes (llama-8B scale) and real-world datasets.
To this end, we scale memory mosaics to 10B parameters, train them on one trillion tokens, introduce a couple of architectural modifications ("Memory Mosaics v2"), and assess their capabilities across three evaluation dimensions: training-knowledge storage, new-knowledge storage, and in-context learning.
Throughout the evaluation, Memory Mosaics v2 match transformers on the learning of training knowledge (first dimension) and significantly outperform transformers on carrying out new tasks at inference time (second and third dimensions). These improvements cannot be easily replicated by simply increasing the training data for transformers: a Memory Mosaics v2 model trained on one trillion tokens still performs better on these tasks than a transformer trained on eight trillion tokens.
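The building block behind these models is an associative memory that retrieves values by kernel smoothing over stored (key, value) pairs, as described in the earlier Memory Mosaics work. The sketch below is a minimal NumPy illustration of that retrieval step only; the function name `associative_memory_retrieve` and the bandwidth parameter `beta` are illustrative assumptions rather than the paper's API, and the sketch omits the learned key/value feature extractors and the network-of-memories composition studied in the paper.

```python
import numpy as np

def associative_memory_retrieve(query, keys, values, beta=1.0):
    """Kernel-smoothing retrieval from an associative memory.

    Returns a weighted average of stored values, where the weights are a
    softmax over query-key similarities (a Gaussian-kernel estimator when
    the keys are unit-normalized).
    """
    scores = beta * keys @ query             # similarity of the query to each stored key
    weights = np.exp(scores - scores.max())  # numerically stable softmax
    weights /= weights.sum()
    return weights @ values                  # convex combination of stored values

# Store a few (key, value) pairs, then retrieve with a slightly noisy query.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
keys /= np.linalg.norm(keys, axis=1, keepdims=True)
values = rng.normal(size=(4, 8))
query = keys[2] + 0.05 * rng.normal(size=8)
print(associative_memory_retrieve(query, keys, values, beta=8.0))
```

Because the stored pairs are written at inference time rather than baked into weights, this kind of unit gives a concrete picture of why the architecture is evaluated separately on training-knowledge storage, new-knowledge storage, and in-context learning.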
Authors (2)
Jianyu Zhang
Léon Bottou
Key Contributions
This work demonstrates that Memory Mosaics, networks of associative memories, retain their appealing compositional and in-context learning capabilities when scaled to large language model sizes (llama-8B scale) and real-world datasets. Memory Mosaics v2 significantly outperform transformers in new-knowledge storage and in-context learning.
Business Value
Paves the way for more capable and efficient large language models that can better learn and adapt to new information, potentially leading to more dynamic and personalized AI applications.