
On the Emergence of Linear Analogies in Word Embeddings

Abstract

Models such as Word2Vec and GloVe construct word embeddings from the co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The resulting vectors $W_i$ not only group semantically similar words but also exhibit a striking linear analogy structure -- for example, $W_{\text{king}} - W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose theoretical origin remains unclear. Previous observations indicate that this analogy structure: (i) already emerges in the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$; (ii) strengthens and then saturates as more eigenvectors of $M(i,j)$ are included, the number retained setting the embedding dimension; (iii) is enhanced when using $\log M(i,j)$ rather than $M(i,j)$; and (iv) persists even when all word pairs involved in a specific analogy relation (e.g., king-queen, man-woman) are removed from the corpus. To explain these phenomena, we introduce a theoretical generative model in which words are defined by binary semantic attributes, and co-occurrence probabilities are derived from attribute-based interactions. This model analytically reproduces the emergence of linear analogy structure, naturally accounts for properties (i)-(iv), and gives fine-grained resolution into the role of each additional embedding dimension. It is robust to various forms of noise and agrees well with co-occurrence statistics measured on Wikipedia and with the analogy benchmark introduced by Mikolov et al.
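As a rough illustration of the pipeline the abstract describes, the sketch below builds $M(i,j) = P(i,j)/P(i)P(j)$ from co-occurrence counts, takes $\log M$, and forms embeddings from its top eigenvectors. This is a minimal sketch, not the authors' code: the counts are random placeholders (so no genuine analogy can emerge from them), and the vocabulary, add-one smoothing, and dimension $d = 4$ are illustrative choices.

```python
# Minimal sketch (not the authors' code): embeddings from the top eigenvectors
# of log M(i,j), where M(i,j) = P(i,j) / (P(i) P(j)).
import numpy as np

rng = np.random.default_rng(0)
vocab = ["king", "queen", "man", "woman", "apple", "pear"]  # toy vocabulary
V = len(vocab)

# Placeholder co-occurrence counts; real corpus statistics are needed for
# meaningful analogies. Add-one smoothing keeps log M finite.
C = rng.poisson(lam=20.0, size=(V, V)).astype(float) + 1.0
C = (C + C.T) / 2.0                        # co-occurrence is symmetric

P_ij = C / C.sum()                         # joint probability P(i,j)
P_i = P_ij.sum(axis=1)                     # marginal probability P(i)
M = P_ij / np.outer(P_i, P_i)              # the matrix M(i,j) from the abstract
logM = np.log(M)                           # property (iii): log M works better

# Keep the top-d eigenvectors; d is the embedding dimension (property (ii):
# analogy quality strengthens then saturates as d grows).
eigvals, eigvecs = np.linalg.eigh(logM)
order = np.argsort(-np.abs(eigvals))[:4]
W = eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))

# Analogy test: nearest neighbor of W_king - W_man + W_woman, excluding the
# three query words, as in the Mikolov et al. benchmark.
idx = {w: i for i, w in enumerate(vocab)}
q = W[idx["king"]] - W[idx["man"]] + W[idx["woman"]]
sims = (W @ q) / (np.linalg.norm(W, axis=1) * np.linalg.norm(q) + 1e-12)
for w in ("king", "man", "woman"):
    sims[idx[w]] = -np.inf
print("nearest to king - man + woman:", vocab[int(np.argmax(sims))])
```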
Authors (4)
Daniel J. Korchinski
Dhruva Karkada
Yasaman Bahri
Matthieu Wyart
Submitted
May 24, 2025
arXiv Category
cs.CL

Key Contributions

Introduces a theoretical generative model, in which words are defined by binary semantic attributes, to explain the emergence of linear analogies (e.g., king - man + woman ≈ queen) in word embeddings such as Word2Vec and GloVe. The model ties the phenomenon to the top eigenvectors of the matrix $M(i,j) = P(i,j)/P(i)P(j)$ and explains why analogies persist even when all word pairs instantiating a given relation are removed from the corpus; a toy version of the mechanism follows below.
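A minimal, self-contained toy version of this mechanism (our illustration, not the authors' exact model; the three attributes are hypothetical): if each word $i$ carries a binary attribute vector $a_i$ and co-occurrence satisfies $\log M(i,j) = a_i \cdot a_j$, then $\log M$ is a low-rank Gram matrix, its eigenvector embeddings are the attribute vectors up to an orthogonal transform, and attribute arithmetic becomes exact embedding arithmetic.

```python
# Toy version of the attribute mechanism (illustration, not the authors'
# exact model): words carry binary attributes and log M(i,j) = a_i . a_j.
import numpy as np

# Hypothetical attributes: (royal, male, female).
attrs = {
    "king":  np.array([1.0, 1.0, 0.0]),
    "queen": np.array([1.0, 0.0, 1.0]),
    "man":   np.array([0.0, 1.0, 0.0]),
    "woman": np.array([0.0, 0.0, 1.0]),
}
vocab = list(attrs)
A = np.stack([attrs[w] for w in vocab])   # row i is the attribute vector a_i

logM = A @ A.T                            # attribute-based interactions (rank 3)

# Embeddings from the nonzero eigenvectors of log M. Because log M = A A^T,
# W equals A up to an orthogonal transform, so linear relations among
# attributes carry over to the embeddings exactly.
eigvals, eigvecs = np.linalg.eigh(logM)
keep = eigvals > 1e-10
W = eigvecs[:, keep] * np.sqrt(eigvals[keep])

idx = {w: i for i, w in enumerate(vocab)}
lhs = W[idx["king"]] - W[idx["man"]] + W[idx["woman"]]
print(np.allclose(lhs, W[idx["queen"]]))  # True: the analogy is exact here
```

In this picture the gender offset is a shared attribute direction rather than something learned per word pair, which gives a rough intuition for why analogies can survive the removal of any specific pair's co-occurrences (property (iv) in the abstract).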

Business Value

Provides fundamental insight into how word embeddings capture semantic relationships, informing the design of more effective and interpretable NLP models.