Redirecting to original paper in 30 seconds...
Click below to go immediately or wait for automatic redirect
📄 Abstract
Abstract: Video Understanding, Scene Interpretation and Commonsense Reasoning are
highly challenging tasks enabling the interpretation of visual information,
allowing agents to perceive, interact with and make rational decisions in its
environment. Large Language Models (LLMs) and Visual Language Models (VLMs)
have shown remarkable advancements in these areas in recent years, enabling
domain-specific applications as well as zero-shot open vocabulary tasks,
combining multiple domains. However, the required computational complexity
poses challenges for their application on edge devices and in the context of
Mobile Robotics, especially considering the trade-off between accuracy and
inference time. In this paper, we investigate the capabilities of
state-of-the-art VLMs for the task of Scene Interpretation and Action
Recognition, with special regard to small VLMs capable of being deployed to
edge devices in the context of Mobile Robotics. The proposed pipeline is
evaluated on a diverse dataset consisting of various real-world cityscape,
on-campus and indoor scenarios. The experimental evaluation discusses the
potential of these small models on edge devices, with particular emphasis on
challenges, weaknesses, inherent model biases and the application of the gained
information. Supplementary material is provided via the following repository:
https://datahub.rz.rptu.de/hstr-csrl-public/publications/scene-interpretation-on-edge-devices/
Key Contributions
Investigates the capabilities of state-of-the-art VLMs for scene interpretation and action recognition on edge devices for mobile robotics. It focuses on evaluating smaller VLMs that balance accuracy and inference time for real-world robotic applications, addressing the computational challenges of deploying large models.
Business Value
Enables more intelligent and adaptable mobile robots capable of understanding their environment and actions in real-time, even with limited onboard processing power, leading to safer and more versatile robotic applications.