Abstract: Surgical scene understanding is critical for surgical training and robotic
decision-making in robot-assisted surgery. Recent advances in Multimodal Large
Language Models (MLLMs) have demonstrated great potential for advancing scene
perception in the medical domain, helping surgeons understand surgical scenes
and procedures. However, these methods are primarily oriented towards
image-based analysis or global video understanding, overlooking the
fine-grained video reasoning that is crucial for analyzing specific processes
and capturing detailed task execution within a surgical procedure. To bridge
this gap, we propose SurgVidLM, the first video language model designed to
address both full and fine-grained surgical video comprehension. To train
SurgVidLM, we construct SVU-31K, a large-scale dataset of over 31K
video-instruction pairs that enables both holistic understanding and detailed
analysis of surgical procedures. Building on this resource, SurgVidLM
incorporates a two-stage StageFocus mechanism: the first stage extracts global
procedural context, while the second stage performs high-frequency local
analysis guided by temporal cues. We also develop Multi-frequency Fusion
Attention to integrate low- and high-frequency visual tokens, preserving
critical task-specific details. Experimental
results demonstrate that SurgVidLM significantly outperforms state-of-the-art
Vid-LLMs of comparable parameter scale in both full and fine-grained video
understanding tasks, showcasing its superior capability in capturing the
context of complex robot-assisted surgeries. Our code and dataset will be
made publicly available soon.
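
The abstract does not give implementation details, so the following PyTorch sketch is only a rough illustration of the described pipeline: sparse sampling for global procedural context, dense sampling over a temporally cued clip, and a cross-attention block that fuses the two token streams. All module names, sampling rates, and tensor shapes here are hypothetical assumptions, not taken from the paper.

```python
import torch
import torch.nn as nn


class MultiFrequencyFusion(nn.Module):
    """Hypothetical fusion block: high-frequency (densely sampled clip) tokens
    query low-frequency (sparsely sampled full-video) tokens so that global
    procedural context is injected without discarding fine-grained detail."""

    def __init__(self, dim: int = 768, num_heads: int = 8):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm_q = nn.LayerNorm(dim)
        self.norm_kv = nn.LayerNorm(dim)

    def forward(self, high_freq: torch.Tensor, low_freq: torch.Tensor) -> torch.Tensor:
        # high_freq: (B, N_h, dim) clip tokens; low_freq: (B, N_l, dim) global tokens
        kv = self.norm_kv(low_freq)
        fused, _ = self.cross_attn(self.norm_q(high_freq), kv, kv)
        return high_freq + fused  # residual keeps task-specific details


def two_stage_inference(video_tokens: torch.Tensor, segment: slice,
                        fusion: MultiFrequencyFusion) -> torch.Tensor:
    """Sketch of a two-stage, StageFocus-style pass (details assumed).

    video_tokens: (B, T, P, dim) per-frame visual tokens for the full video.
    segment:      temporal cue from stage 1 selecting the clip to re-examine.
    """
    B, T, P, D = video_tokens.shape
    # Stage 1: sparse (low-frequency) sampling over the whole procedure.
    low = video_tokens[:, ::8].reshape(B, -1, D)
    # Stage 2: dense (high-frequency) sampling inside the cued segment.
    high = video_tokens[:, segment].reshape(B, -1, D)
    # Fuse so the clip-level tokens carry global procedural context.
    return fusion(high, low)


if __name__ == "__main__":
    tokens = torch.randn(1, 64, 16, 768)        # 64 frames, 16 tokens each
    out = two_stage_inference(tokens, slice(24, 32), MultiFrequencyFusion())
    print(out.shape)                             # torch.Size([1, 128, 768])
```

In this sketch the residual connection and query/key asymmetry are design choices made for illustration: the densely sampled clip tokens remain the primary representation, while the sparse global tokens only modulate them, which matches the stated goal of preserving fine-grained task execution details.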