NAUTILUS: A Large Multimodal Model for Underwater Scene Understanding

Abstract

Underwater exploration offers critical insights into our planet and attracts increasing attention for its broader applications in resource exploration, national security, etc. We study the underwater scene understanding methods, which aim to achieve automated underwater exploration. The underwater scene understanding task demands multi-task perceptions from multiple granularities. However, the absence of large-scale underwater multi-task instruction-tuning datasets hinders the progress of this research. To bridge this gap, we construct NautData, a dataset containing 1.45M image-text pairs supporting eight underwater scene understanding tasks. It enables the development and thorough evaluation of the underwater scene understanding models. Underwater image degradation is a widely recognized challenge that interferes with underwater tasks. To improve the robustness of underwater scene understanding, we introduce physical priors derived from underwater imaging models and propose a plug-and-play vision feature enhancement (VFE) module, which explicitly restores clear underwater information. We integrate this module into renowned baselines LLaVA-1.5 and Qwen2.5-VL and build our underwater LMM, NAUTILUS. Experiments conducted on the NautData and public underwater datasets demonstrate the effectiveness of the VFE module, consistently improving the performance of both baselines on the majority of supported tasks, thus ensuring the superiority of NAUTILUS in the underwater scene understanding area. Data and models are available at https://github.com/H-EmbodVis/NAUTILUS.

Authors (7)

Wei Xu

Cheng Wang

Dingkang Liang

Zongchuang Zhao

Xingyu Jiang

Peng Zhang

+1 more

Submitted

October 31, 2025

arXiv Category

cs.CV

Key Contributions

NAUTILUS addresses the challenge of underwater scene understanding by introducing NautData, a large-scale dataset (1.45M image-text pairs) for instruction tuning. It also incorporates physical priors from underwater imaging models to improve robustness against image degradation, enabling better multi-task perception for automated underwater exploration.
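
The physical priors mentioned above are derived from underwater imaging models. This summary does not say which formulation NAUTILUS uses or how the VFE module applies it, so the following is only a minimal sketch, assuming the commonly used simplified image formation model I_c = J_c * t_c + B_c * (1 - t_c) with per-channel transmission t_c = exp(-beta_c * d). The function names, attenuation coefficients, and background-light values are hypothetical and chosen purely for illustration; the paper's module operates on vision features, not on single pixels as shown here.

```python
import math

# Hypothetical per-channel (R, G, B) attenuation coefficients in 1/m and
# background light. Red is attenuated fastest in water. Illustrative values only.
BETA = (0.60, 0.12, 0.08)
BACKGROUND = (0.05, 0.35, 0.45)


def transmission(distance_m, beta=BETA):
    """Per-channel transmission t_c = exp(-beta_c * d)."""
    return tuple(math.exp(-b * distance_m) for b in beta)


def degrade(clear_pixel, distance_m, beta=BETA, background=BACKGROUND):
    """Forward model: I_c = J_c * t_c + B_c * (1 - t_c)."""
    t = transmission(distance_m, beta)
    return tuple(j * tc + bc * (1.0 - tc)
                 for j, tc, bc in zip(clear_pixel, t, background))


def restore(observed_pixel, distance_m, beta=BETA, background=BACKGROUND,
            t_min=0.05):
    """Invert the forward model; t_min guards against dividing by near-zero."""
    t = transmission(distance_m, beta)
    restored = []
    for i, tc, bc in zip(observed_pixel, t, background):
        j = (i - bc * (1.0 - tc)) / max(tc, t_min)
        restored.append(min(1.0, max(0.0, j)))  # clamp to the valid [0, 1] range
    return tuple(restored)


clear = (0.80, 0.50, 0.30)                 # scene radiance of one pixel
observed = degrade(clear, distance_m=3.0)  # what the camera would record
recovered = restore(observed, distance_m=3.0)
```

In this toy setting the distance and water parameters are known, so `restore` recovers the clear pixel up to floating-point error. In real footage they must be estimated, which is where a learned, prior-guided module becomes useful.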

Business Value

Enables more effective and automated exploration and monitoring of underwater environments, supporting scientific research, resource management, and security operations.

Paper Metadata

Innovation Type

Dataset Creation & Methodological Improvement

Deployment Feasibility

Moderate; requires specialized hardware for underwater deployment and robust model performance in challenging conditions.

Limitations Addressed

Absence of large-scale underwater multi-task instruction-tuning datasets, Underwater image degradation affecting perception tasks, Need for robust models for diverse underwater applications

Technical Tags

underwater scene understanding, multimodal model, large dataset, instruction tuning, physical priors, underwater imaging models, image-text pairs, multi-task perception

Research Topics

Computer Vision, Scene Understanding, Multimodal Learning, Robotics, Data Curation

Methods & Architectures

NAUTILUS, NautData dataset, Instruction Tuning, Incorporation of Physical Priors, Large Multimodal Model

Applications & Tasks

Oceanography, Marine Biology, Resource Exploration, National Security, Robotics, Lack of Large-Scale Underwater Datasets, Underwater Image Degradation, Multi-Task Perception Demands, Automated Underwater Exploration, Underwater Scene Understanding, Object Detection in Water, Image Captioning (Underwater), Multi-task Perception

Datasets & Benchmarks

Datasets

NautData

Related Fields

Ocean Engineering, Robotics, Remote Sensing, Machine Learning

Keywords

underwater, scene understanding, multimodal, dataset, instruction tuning, robotics, oceanography, computer vision, image processing, deep learning

Academic Context

#Computer Vision, #Scene Understanding, #Multimodal Learning, #Robotics, #Data Curation

Commercial Potential

Potential Products

Autonomous underwater vehicles (AUVs) with advanced perception, Underwater monitoring systems, Data analysis tools for marine research

Target Industries

Ocean Exploration, Defense, Oil & Gas, Aquaculture, Environmental Monitoring

Use Case Examples

Automated mapping of the ocean floor, Monitoring marine ecosystems, Inspection of underwater infrastructure, Search and rescue operations at sea

Competitive Edge

Addresses a critical gap in underwater AI by providing a large-scale dataset and a robust multimodal model tailored for the unique challenges of underwater environments.

Market Opportunity

Significant, driven by increasing interest in ocean exploration and resource management.

Revenue Models

Licensing of the model and dataset, development of specialized underwater robotics and sensing solutions.

Resource Requirements

Compute Needs

High, for training large multimodal models and processing large datasets.

Data Requirements

Large-scale, diverse underwater image-text pairs covering various tasks.

Deployment Constraints

Harsh underwater conditions, limited bandwidth for data transmission, real-time processing needs.

Scalability

Scalability depends on the efficiency of the multimodal model and the ability to deploy it on underwater platforms.

Regulatory Considerations

Data privacy concerns for surveillance applications, Environmental regulations for underwater operations

Production Readiness

Maturity Level

Research

Time to Market

3-5 years for robust deployment in operational systems.

Patent Potential

Moderate, for the dataset and specific model adaptations.
