📄 Abstract
Existing benchmarks for multimodal learning in Earth science offer limited,
siloed coverage of Earth's spheres and their cross-sphere interactions,
typically restricting evaluation to the human-activity sphere or the atmosphere
and to at most 16 tasks. These limitations include narrow-source heterogeneity
(single or few data sources), constrained scientific granularity, and
limited-sphere extensibility. Therefore, we introduce
OmniEarth-Bench, the first multimodal benchmark that systematically
spans all six spheres (atmosphere, lithosphere, oceanosphere, cryosphere,
biosphere, and human-activity sphere) as well as cross-sphere interactions.
Built with a scalable, modular-topology data inference framework, native
multi-observation sources, and expert-in-the-loop curation, OmniEarth-Bench
produces 29,855 standardized,
expert-curated annotations. All annotations are organized into a four-level
hierarchy (Sphere, Scenario, Ability, Task), encompassing 109 expert-curated
evaluation tasks. Experiments on 9 state-of-the-art MLLMs show that even the
most advanced models struggle on our benchmark: none of them reaches
35% accuracy, revealing systematic gaps in Earth-system cognitive ability. The
dataset and evaluation code are released at OmniEarth-Bench
(https://anonymous.4open.science/r/OmniEarth-Bench-B1BD).
Key Contributions
OmniEarth-Bench is introduced as the first multimodal benchmark to systematically span all six spheres of Earth (atmosphere, lithosphere, oceanosphere, cryosphere, biosphere, human-activity) and their interactions. It utilizes a scalable data inference framework and expert curation to provide 109 evaluation tasks organized hierarchically.
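As a rough illustration of the four-level hierarchy (Sphere, Scenario, Ability, Task), the sketch below shows one way annotations might be grouped and scored at any level. It is a minimal example under stated assumptions, not the benchmark's released evaluation code: the `Annotation` field names, the `accuracy_by_level` helper, and the toy records (including the scenario, ability, and task labels) are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    # Hypothetical schema; the released dataset may name or structure fields differently.
    sphere: str
    scenario: str
    ability: str
    task: str
    answer: str


LEVELS = ("sphere", "scenario", "ability", "task")


def accuracy_by_level(annotations, predictions, level):
    """Return {hierarchy_prefix: accuracy}, grouping by all levels down to `level`."""
    depth = LEVELS.index(level) + 1
    correct = defaultdict(int)
    total = defaultdict(int)
    for ann, pred in zip(annotations, predictions):
        # The key is the path from the sphere down to the requested level.
        key = tuple(getattr(ann, name) for name in LEVELS[:depth])
        total[key] += 1
        correct[key] += int(pred == ann.answer)
    return {key: correct[key] / total[key] for key in total}


# Toy records invented for illustration only (not taken from the benchmark).
annotations = [
    Annotation("atmosphere", "scenario_a", "perception", "task_1", "A"),
    Annotation("atmosphere", "scenario_a", "perception", "task_1", "B"),
    Annotation("cryosphere", "scenario_b", "reasoning", "task_2", "C"),
]
predictions = ["A", "C", "C"]

sphere_acc = accuracy_by_level(annotations, predictions, "sphere")
task_acc = accuracy_by_level(annotations, predictions, "task")
```

Keying each group by its full path from the sphere downward keeps identically named tasks or abilities in different spheres separate, which matters when results are reported per sphere.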
Business Value
Provides a standardized and comprehensive platform for developing and evaluating AI models for Earth science applications, accelerating progress in areas like climate change monitoring, disaster prediction, and resource management.