📄 Abstract
Modeling human-object interactions (HOI) from an egocentric perspective is a
largely unexplored yet important problem due to the increasing adoption of
wearable devices, such as smart glasses and watches. We investigate how much
information about an interaction can be recovered from head and wrist tracking
alone. Our answer is ECHO (Ego-Centric modeling of Human-Object
interactions), which, for the first time, proposes a unified framework to
recover three modalities (human pose, object motion, and contact) from such
minimal observations. ECHO employs a Diffusion Transformer architecture and a
unique three-variate diffusion process, which jointly models human motion,
object trajectory, and contact sequence, allowing for flexible input
configurations. Our method operates in a head-centric canonical space,
enhancing robustness to global orientation. We propose a conveyor-based
inference scheme, which progressively increases the diffusion timestep with the
frame position, allowing us to process sequences of any length. Through extensive
evaluation, we demonstrate that ECHO outperforms existing methods that do not
offer the same flexibility, setting a new state of the art in egocentric HOI
reconstruction.
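
The conveyor-based inference idea can be illustrated with a minimal bookkeeping sketch. This is one possible reading of the abstract, not the authors' implementation. It assumes a fixed-size sliding window in which the oldest frame has the lowest noise level and the newest has the highest, and it assumes that each iteration emits the oldest frame and admits one new frame. The names `ramp_timesteps` and `conveyor_windows`, the linear ramp, and the one-frame stride are all hypothetical. No denoising model is included.

```python
from typing import Iterator, List, Tuple


def ramp_timesteps(window: int, num_steps: int) -> List[int]:
    """Assumed schedule: the timestep grows linearly with frame position.

    Position 0 (oldest frame) gets the lowest noise level and the last
    position (newest frame) gets the maximum timestep `num_steps`.
    """
    return [round((i + 1) * num_steps / window) for i in range(window)]


def conveyor_windows(total_frames: int, window: int) -> Iterator[Tuple[List[int], int]]:
    """Yield (frame indices in the window, index of the frame emitted).

    Assumption: every iteration moves the window forward by one frame, so a
    sequence of any length is processed with a constant-size window. Near the
    end of the sequence no new frames remain and the window shrinks.
    """
    for start in range(total_frames):
        stop = min(start + window, total_frames)
        yield list(range(start, stop)), start


if __name__ == "__main__":
    schedule = ramp_timesteps(window=4, num_steps=1000)
    for frames, emitted in conveyor_windows(total_frames=6, window=4):
        # Pair each frame in the window with its assumed noise level.
        levels = schedule[: len(frames)]
        print(f"emit frame {emitted}: window {frames} at timesteps {levels}")
```

Under these assumptions, the window size stays constant regardless of sequence length, which is what would allow arbitrarily long sequences to be processed.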