📄 Abstract
Statistical analysis of corpora provides an approach to quantitatively
investigate natural languages. This approach has revealed that several power
laws consistently emerge across different corpora and languages, suggesting
universal mechanisms underlying languages. In particular, the power-law decay of
correlation has been interpreted as evidence for underlying hierarchical
structures in syntax, semantics, and discourse. This perspective has also been
extended to child speech and animal signals. However, the argument supporting
this interpretation has not been empirically tested in natural languages. To
address this problem, the present study examines the validity of the argument
for syntactic structures. Specifically, we test whether the statistical
properties of parse trees align with the assumptions in the argument. Using
English and Japanese corpora, we analyze the mutual information, deviations
from probabilistic context-free grammars (PCFGs), and other properties in
natural language parse trees, as well as in the PCFG that approximates these
parse trees. Our results indicate that the assumptions do not hold for
syntactic structures and that it is difficult to apply the proposed argument to
child speech and animal signals, highlighting the need to reconsider the
relationship between the power law and hierarchical structures.
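As a rough illustration of the kind of correlation measure the abstract refers to, the sketch below estimates the mutual information between symbols separated by a distance d in a flat symbol sequence. It is a simple plug-in (frequency-count) estimator written for this summary, not the authors' method: the paper analyzes parse trees, and the function name `mutual_information` and the toy inputs are assumptions made here for illustration.

```python
from collections import Counter
from math import log2


def mutual_information(seq, d):
    """Plug-in estimate (in bits) of I(X_t; X_{t+d}) for a symbol sequence.

    Hypothetical helper for illustration only; it counts co-occurrences of
    symbols that are d positions apart and applies the standard formula
    sum_{a,b} p(a,b) * log2(p(a,b) / (p(a) * p(b))).
    """
    pairs = list(zip(seq, seq[d:]))  # all (x_t, x_{t+d}) pairs
    n = len(pairs)
    if n == 0:
        return 0.0
    joint = Counter(pairs)
    left = Counter(a for a, _ in pairs)   # marginal counts of x_t
    right = Counter(b for _, b in pairs)  # marginal counts of x_{t+d}
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a,b) / (p(a) p(b)) simplifies to c * n / (left[a] * right[b])
        mi += (c / n) * log2(c * n / (left[a] * right[b]))
    return mi


if __name__ == "__main__":
    text = "ab" * 50  # toy, strictly periodic sequence
    for d in (1, 2, 3):
        print(d, mutual_information(text, d))
```

Plotting such estimates against d on log-log axes is one common way to check for power-law decay. Note that plug-in estimates are biased upward for small samples, so any real analysis would need a bias correction or a shuffled-baseline comparison.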
Key Contributions
This paper empirically tests the long-standing argument that power-law decay in correlation implies hierarchical structures in natural language. By analyzing syntactic structures in English and Japanese corpora, it investigates the alignment of statistical properties of parse trees with the assumptions of this argument, providing a quantitative assessment of a core linguistic hypothesis.
Business Value
Understanding fundamental linguistic structures can inform the development of more sophisticated NLP models, leading to better language understanding and generation capabilities in applications like translation and text analysis.