Abstract
State space models (SSMs) have emerged as a competitive alternative to
transformers in various tasks. Their linear complexity and hidden-state
recurrence make them particularly attractive for modeling long sequences,
whereas attention becomes quadratically expensive. However, current training
methods for video understanding are tailored towards transformers and fail to
fully leverage the unique attributes of SSMs. For example, video models are
often trained at a fixed resolution and video length to balance the quadratic
scaling of attention cost against performance. Consequently, these models
suffer from degraded performance when evaluated on videos with spatial and
temporal resolutions unseen during training, a property we call spatio-temporal
inflexibility. In the context of action recognition, this severely limits a
model's ability to retain performance across both short- and long-form videos.
Therefore, we propose a flexible training method that leverages and improves
the inherent adaptability of SSMs. Our method samples videos at varying
temporal and spatial resolutions during training and dynamically interpolates
model weights to accommodate any spatio-temporal scale. This instills our SSM,
which we call StretchySnake, with spatio-temporal flexibility and enables it to
seamlessly handle videos ranging from short, fine-grained clips to long,
complex activities. We introduce and compare five different variants of
flexible training, and identify the most effective strategy for video SSMs. On
short-action (UCF-101, HMDB-51) and long-action (COIN, Breakfast) benchmarks,
StretchySnake outperforms transformer and SSM baselines alike by up to 28%,
with strong adaptability to fine-grained actions (SSV2, Diving-48). Therefore,
our method provides a simple drop-in training recipe that makes video SSMs more
robust, resolution-agnostic, and efficient across diverse action recognition
scenarios.
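The training recipe described in the abstract has two moving parts: sampling a different spatio-temporal resolution for each batch, and interpolating resolution-dependent weights to match the sampled scale. The sketch below shows one plausible realization in PyTorch, under the assumption that learned spatio-temporal positional embeddings are the weights being interpolated; the helper names (`sample_scale`, `interpolate_pos_embed`) and the candidate scales are illustrative assumptions, not the authors' released code.

```python
import random
import torch
import torch.nn.functional as F

# Candidate spatio-temporal scales drawn each batch (illustrative values,
# not taken from the paper).
SPATIAL_SIZES = [112, 160, 224]   # frame side length in pixels
TEMPORAL_LENS = [8, 16, 32]       # frames per clip

def sample_scale():
    """Draw a random (frames, height, width) for the next training batch."""
    side = random.choice(SPATIAL_SIZES)
    return random.choice(TEMPORAL_LENS), side, side

def interpolate_pos_embed(pos_embed, base_grid, new_grid):
    """Resize a learned positional embedding of shape (1, T0*H0*W0, C) from
    base_grid = (T0, H0, W0) to new_grid = (T, H, W) via trilinear
    interpolation -- one plausible form of dynamic weight interpolation."""
    t0, h0, w0 = base_grid
    c = pos_embed.shape[-1]
    grid = pos_embed.reshape(1, t0, h0, w0, c).permute(0, 4, 1, 2, 3)
    grid = F.interpolate(grid, size=new_grid, mode="trilinear",
                         align_corners=False)
    t, h, w = new_grid
    return grid.permute(0, 2, 3, 4, 1).reshape(1, t * h * w, c)

# Example: stretch a 16-frame, 14x14-token embedding grid to 32 frames at
# 10x10 tokens (e.g. after sampling a longer, lower-resolution clip).
pos = torch.randn(1, 16 * 14 * 14, 192)
stretched = interpolate_pos_embed(pos, (16, 14, 14), (32, 10, 10))
print(stretched.shape)  # torch.Size([1, 3200, 192])
```

Because the interpolation is differentiable and cheap, it can run inside every training step, which is what would let a single set of base weights serve any spatio-temporal scale at inference time.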
Authors
Nyle Siddiqui
Rohit Gupta
Sirnam Swetha
Mubarak Shah
Submitted
October 17, 2025
Key Contributions
Proposes a flexible training method for state space models (SSMs) in video understanding that addresses spatio-temporal inflexibility: models trained at a fixed resolution and clip length degrade on spatial and temporal scales unseen during training. The method samples varying resolutions during training and interpolates resolution-dependent weights, leveraging SSMs' linear complexity to retain performance from short clips to long videos.
Business Value
Enables more robust video analysis systems that can handle diverse video inputs without significant performance degradation, useful for applications like content moderation, surveillance, and autonomous systems.