Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: We examine the extent to which sublinear-sample property testing and
estimation apply to settings where samples are independently but not
identically distributed. Specifically, we consider the following distributional
property testing framework: Suppose there is a set of distributions over a
discrete support of size $k$, $p_1, p_2,\ldots,p_T$, and we obtain $c$
independent draws from each distribution. Suppose the goal is to learn or test
a property of the average distribution, $p_{avg}$. This setup models a number
of important practical settings where the individual distributions correspond
to heterogeneous entities -- either individuals, chronologically distinct time
periods, spatially separated data sources, etc. From a learning standpoint,
even with $c=1$ samples from each distribution, $\Theta(k/\varepsilon^2)$
samples are necessary and sufficient to learn $p_{avg}$ to within error
$\varepsilon$ in $\ell_1$ distance. To test uniformity or identity --
distinguishing the case that $p_{avg}$ is equal to some reference distribution,
versus has $\ell_1$ distance at least $\varepsilon$ from the reference
distribution, we show that a linear number of samples in $k$ is necessary given
$c=1$ samples from each distribution. In contrast, for $c \ge 2$, we recover
the usual sublinear sample testing guarantees of the i.i.d.\ setting: we show
that $O(\sqrt{k}/\varepsilon^2 + 1/\varepsilon^4)$ total samples are
sufficient, matching the optimal sample complexity in the i.i.d.\ case in the
regime where $\varepsilon \ge k^{-1/4}$. Additionally, we show that in the
$c=2$ case, there is a constant $\rho > 0$ such that even in the linear regime
with $\rho k$ samples, no tester that considers the multiset of samples
(ignoring which samples were drawn from the same $p_i$) can perform uniformity
testing. We also extend our techniques to the problem of testing "closeness" of
two distributions.
Key Contributions
This paper extends property testing and estimation to settings with non-identically distributed samples, focusing on learning or testing properties of an average distribution. It establishes sample complexity bounds for learning the average distribution in L1 distance, even with a small number of samples per distribution, and considers testing uniformity or identity.
Business Value
Enables more robust statistical analysis and machine learning on datasets with inherent heterogeneity, such as data from different users, time periods, or sources, leading to more accurate insights and models.