📄 Abstract
Event cameras offer microsecond-level latency and robustness to motion blur,
making them ideal for understanding dynamic environments. Yet, connecting these
asynchronous streams to human language remains an open challenge. We introduce
Talk2Event, the first large-scale benchmark for language-driven object
grounding in event-based perception. Built from real-world driving data, we
provide over 30,000 validated referring expressions, each enriched with four
grounding attributes -- appearance, status, relation to viewer, and relation to
other objects -- bridging spatial, temporal, and relational reasoning. To fully
exploit these cues, we propose EventRefer, an attribute-aware grounding
framework that dynamically fuses multi-attribute representations through a
Mixture of Event-Attribute Experts (MoEE). Our method adapts to different
modalities and scene dynamics, achieving consistent gains over state-of-the-art
baselines in event-only, frame-only, and event-frame fusion settings. We hope
our dataset and approach will establish a foundation for advancing multimodal,
temporally-aware, and language-driven perception in real-world robotics and
autonomy.
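To make the fusion idea concrete, below is a minimal sketch of what a Mixture of Event-Attribute Experts could look like in PyTorch, assuming one lightweight expert per grounding attribute and a softmax gate over a pooled context feature. The module name, layer sizes, and gating design are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical MoEE-style fusion, inferred only from the abstract: one expert per
# grounding attribute (appearance, status, relation to viewer, relation to other
# objects) and a learned gate that dynamically weights the expert outputs.
import torch
import torch.nn as nn

ATTRIBUTES = ["appearance", "status", "relation_viewer", "relation_objects"]

class MoEEFusion(nn.Module):
    def __init__(self, dim: int = 256):
        super().__init__()
        # One small expert network per grounding attribute (assumed design).
        self.experts = nn.ModuleDict({
            name: nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
            for name in ATTRIBUTES
        })
        # Gate predicts per-attribute weights from a pooled scene/text context feature.
        self.gate = nn.Linear(dim, len(ATTRIBUTES))

    def forward(self, attr_feats, context):
        # attr_feats: {attribute name -> (B, dim) feature}; context: (B, dim).
        weights = torch.softmax(self.gate(context), dim=-1)            # (B, 4)
        expert_out = torch.stack(
            [self.experts[name](attr_feats[name]) for name in ATTRIBUTES], dim=1
        )                                                               # (B, 4, dim)
        return (weights.unsqueeze(-1) * expert_out).sum(dim=1)         # (B, dim)

if __name__ == "__main__":
    fusion = MoEEFusion(dim=256)
    feats = {name: torch.randn(2, 256) for name in ATTRIBUTES}
    fused = fusion(feats, context=torch.randn(2, 256))
    print(fused.shape)  # torch.Size([2, 256])
```

The same gating idea would apply regardless of whether the attribute features come from the event stream, the frame, or a fused event-frame encoder, which matches the paper's claim that the method adapts across modalities.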
Key Contributions
Introduces Talk2Event, the first large-scale benchmark for language-driven object grounding with event cameras, and proposes EventRefer, an attribute-aware grounding framework that fuses multi-attribute representations through a Mixture of Event-Attribute Experts (MoEE). This work bridges the gap between high-speed, low-latency event camera data and human language, enabling richer scene understanding.
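For intuition, a single benchmark sample would pair a referring expression with its four grounding attributes and a target object, roughly like the hypothetical Python record below. All field names and values here are invented for illustration and are not the dataset's actual schema.

```python
# Illustrative (made-up) structure of one Talk2Event-style annotation,
# based on the abstract's description of the four grounding attributes.
sample = {
    "expression": "the white van slowing down just ahead of us in the right lane",
    "attributes": {
        "appearance": "white van",
        "status": "slowing down",
        "relation_to_viewer": "just ahead of us",
        "relation_to_objects": "in the right lane, beside the sedan",
    },
    "target_bbox": [412, 188, 596, 330],   # illustrative (x1, y1, x2, y2) pixel box
    "modalities": ["events", "frame"],     # event-only, frame-only, or event-frame fusion
}
```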
Business Value
Enhances the perception capabilities of autonomous systems and robots by enabling them to ground natural-language references to objects in their environment. This can lead to safer, more intuitive human-robot interaction and more reliable navigation in dynamic scenes.