Abstract: The success of deep networks is attributed in large part to their ability to
capture latent features within a representation space. In this work, we
investigate whether the underlying learned features of a model can be
efficiently retrieved through feedback from an agent, such as a large language
model (LLM), in the form of relative triplet comparisons. These features
may represent various constructs, including dictionaries in LLMs or a
covariance matrix of Mahalanobis distances. We analyze the feedback complexity
associated with learning a feature matrix in sparse settings. Our results
establish tight bounds when the agent is permitted to construct activations and
demonstrate strong upper bounds in sparse scenarios when the agent's feedback
is limited to distributional information. We validate our theoretical findings
through experiments on two distinct applications: feature recovery from
Recursive Feature Machines and dictionary extraction from sparse autoencoders
trained on Large Language Models.
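
To make the feedback model concrete, the following is a minimal sketch (not the paper's actual protocol or code) of a triplet-comparison oracle over a Mahalanobis distance defined by a sparse, low-rank feature matrix; the names `mahalanobis_sq`, `triplet_feedback`, `U`, and `M` are illustrative assumptions.

```python
import numpy as np

def mahalanobis_sq(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) under feature matrix M."""
    d = x - y
    return d @ M @ d

def triplet_feedback(x, a, b, M):
    """Return +1 if x is closer to a than to b under M, else -1."""
    return 1 if mahalanobis_sq(x, a, M) < mahalanobis_sq(x, b, M) else -1

# Toy example: a sparse, rank-2 feature matrix M = U U^T
rng = np.random.default_rng(0)
U = np.zeros((5, 2))
U[[0, 3], 0] = 1.0   # first feature touches coordinates 0 and 3
U[[1, 4], 1] = 1.0   # second feature touches coordinates 1 and 4
M = U @ U.T

x, a, b = rng.standard_normal((3, 5))
print(triplet_feedback(x, a, b, M))  # +1 or -1
```

In this reading, recovering the feature matrix amounts to reconstructing M (equivalently, U) from the signs returned by such comparisons, which is the quantity whose feedback complexity the abstract refers to.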