Abstract
Models such as Word2Vec and GloVe construct word embeddings based on the
co-occurrence probability $P(i,j)$ of words $i$ and $j$ in text corpora. The
resulting vectors $W_i$ not only group semantically similar words but also
exhibit a striking linear analogy structure -- for example, $W_{\text{king}} -
W_{\text{man}} + W_{\text{woman}} \approx W_{\text{queen}}$ -- whose
theoretical origin remains unclear. Previous observations indicate that this
analogy structure: (i) already emerges in the top eigenvectors of the matrix
$M(i,j) = P(i,j)/[P(i)P(j)]$, (ii) strengthens and then saturates as more
eigenvectors of $M(i,j)$ are included (the number of eigenvectors retained sets
the dimension of the embeddings), (iii) is enhanced when using $\log M(i,j)$
rather than $M(i,j)$, and
(iv) persists even when all word pairs involved in a specific analogy relation
(e.g., king-queen, man-woman) are removed from the corpus. To explain these
phenomena, we introduce a theoretical generative model in which words are
defined by binary semantic attributes, and co-occurrence probabilities are
derived from attribute-based interactions. This model analytically reproduces
the emergence of linear analogy structure and naturally accounts for properties
(i)-(iv). It can be viewed as giving fine-grained resolution into the role of
each additional embedding dimension. It is robust to various forms of noise and
agrees well with co-occurrence statistics measured on Wikipedia and the analogy
benchmark introduced by Mikolov et al.
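The pipeline sketched in the abstract, namely forming $M(i,j) = P(i,j)/[P(i)P(j)]$ from co-occurrence counts, taking $\log M$, keeping its top eigenvectors as embedding coordinates, and probing a linear analogy, can be illustrated in a few lines. The snippet below is a toy sketch, not the authors' code: the `vocab` list, the random `counts` matrix, and the $\sqrt{|\lambda|}$ scaling of the eigenvectors are assumptions made for the example.

```python
import numpy as np

# Toy sketch: embeddings from the top eigenvectors of log M(i,j),
# where M(i,j) = P(i,j) / [P(i) P(j)]. `vocab` and `counts` are
# hypothetical stand-ins for corpus co-occurrence statistics.
vocab = ["king", "queen", "man", "woman", "apple", "pear"]
rng = np.random.default_rng(0)
counts = rng.integers(1, 50, size=(6, 6)).astype(float)
counts = (counts + counts.T) / 2          # symmetric co-occurrence counts

P_ij = counts / counts.sum()              # joint probability P(i,j)
P_i = P_ij.sum(axis=1)                    # marginal P(i)
M = P_ij / np.outer(P_i, P_i)             # M(i,j) = P(i,j) / [P(i) P(j)]
logM = np.log(M)                          # property (iii): log M works better

# Embeddings = top-d eigenvectors of the symmetric matrix log M,
# scaled here (one common choice) by sqrt of the eigenvalue magnitudes.
d = 3
eigvals, eigvecs = np.linalg.eigh(logM)
order = np.argsort(-np.abs(eigvals))[:d]
W = eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))

# Linear analogy probe: W_king - W_man + W_woman should be closest to W_queen
# (with realistic corpus statistics; the random toy counts above will not).
idx = {w: i for i, w in enumerate(vocab)}
target = W[idx["king"]] - W[idx["man"]] + W[idx["woman"]]
sims = W @ target / (np.linalg.norm(W, axis=1) * np.linalg.norm(target) + 1e-12)
print(vocab[int(np.argmax(sims))])
```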
Authors (4)
Daniel J. Korchinski
Dhruva Karkada
Yasaman Bahri
Matthieu Wyart
Key Contributions
Introduces a theoretical generative model that explains the emergence of linear analogies (e.g., king − man + woman ≈ queen) in word embeddings such as Word2Vec and GloVe. It ties this phenomenon to the top eigenvectors of the normalized co-occurrence matrix $M(i,j)$ and explains why the analogies persist even when all word pairs involved in a specific analogy relation are removed from the corpus. A toy illustration of the attribute-based mechanism follows below.
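As a rough illustration of the attribute-based picture, the sketch below encodes four words by two binary semantic attributes (royal, female) and assumes a log-bilinear interaction $\log M(i,j) = a_i^\top Q a_j$, which is an assumption for this example and not necessarily the paper's exact parameterization. Because $\log M$ is then low-rank and linear in the attributes, embeddings built from its top eigenvectors satisfy the king/queen analogy exactly.

```python
import numpy as np

# Minimal sketch (assumed parameterization): words are binary attribute
# vectors, and log M(i,j) = a_i^T Q a_j for an attribute-interaction matrix Q.
attrs = {                      # attributes: [royal, female]
    "king":  np.array([1, 0]),
    "queen": np.array([1, 1]),
    "man":   np.array([0, 0]),
    "woman": np.array([0, 1]),
}
vocab = list(attrs)
A = np.stack([attrs[w] for w in vocab]).astype(float)

Q = np.array([[0.8, 0.1],      # assumed symmetric attribute interactions
              [0.1, 0.6]])
logM = A @ Q @ A.T             # log M(i,j) derived from attribute interactions

# Embeddings from the top eigenvectors of log M (rank <= number of attributes).
eigvals, eigvecs = np.linalg.eigh(logM)
order = np.argsort(-np.abs(eigvals))[:2]
W = eigvecs[:, order] * np.sqrt(np.abs(eigvals[order]))

# Since a_king - a_man + a_woman = a_queen, the embedding analogy is exact.
idx = {w: i for i, w in enumerate(vocab)}
analogy = W[idx["king"]] - W[idx["man"]] + W[idx["woman"]]
print(np.allclose(analogy, W[idx["queen"]], atol=1e-8))  # True
```

In this toy setting the embedding rows are a linear map of the attribute vectors, so any analogy that flips a single attribute becomes a linear relation between embeddings, which is the qualitative behavior the abstract attributes to the full model.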
Business Value
Provides fundamental insights into how word embeddings capture semantic relationships, which can lead to the development of more effective and interpretable NLP models for various applications.