Abstract: Accurately predicting distributed cortical responses to naturalistic stimuli
requires models that integrate visual, auditory, and semantic information over
time. We present a hierarchical multimodal recurrent ensemble that maps
pretrained video, audio, and language embeddings to fMRI time series recorded
while four subjects watched almost 80 hours of movies provided by the Algonauts
2025 challenge. Modality-specific bidirectional RNNs encode temporal dynamics;
their hidden states are fused and passed to a second recurrent layer, and
lightweight subject-specific heads output responses for 1000 cortical parcels.
Training relies on a composite MSE-correlation loss and a curriculum that
gradually shifts emphasis from early sensory to late association regions.
Averaging 100 model variants further boosts robustness. The resulting system
ranked third on the competition leaderboard, achieving an overall Pearson r =
0.2094 and the highest single-parcel peak score (mean r = 0.63) among all
participants, with particularly strong gains for the most challenging subject
(Subject 5). The approach establishes a simple, extensible baseline for future
multimodal brain-encoding benchmarks.
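
The abstract only sketches the model, so the following is a minimal PyTorch illustration of the described pipeline, not the authors' implementation: one bidirectional RNN per modality, a second recurrent layer over the fused hidden states, and lightweight subject-specific heads predicting 1000 parcels. The choice of GRU cells, the class and argument names, the layer sizes, and the embedding dimensions are assumptions made for the example.

```python
import torch
import torch.nn as nn


class MultimodalEncoder(nn.Module):
    """Hypothetical encoder: per-modality bidirectional GRUs, a recurrent
    fusion layer, and per-subject linear readouts over cortical parcels."""

    def __init__(self, feat_dims, hidden=256, n_parcels=1000, n_subjects=4):
        super().__init__()
        # One bidirectional RNN per modality encodes temporal dynamics.
        self.encoders = nn.ModuleDict({
            name: nn.GRU(dim, hidden, batch_first=True, bidirectional=True)
            for name, dim in feat_dims.items()
        })
        # Second recurrent layer over the concatenated modality states.
        self.fusion = nn.GRU(2 * hidden * len(feat_dims), hidden,
                             batch_first=True, bidirectional=True)
        # Lightweight subject-specific heads map to parcel responses.
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, n_parcels) for _ in range(n_subjects)]
        )

    def forward(self, feats, subject):
        # feats: dict of (batch, time, feat_dim) stimulus embeddings per modality.
        states = [self.encoders[name](x)[0] for name, x in feats.items()]
        fused, _ = self.fusion(torch.cat(states, dim=-1))
        return self.heads[subject](fused)  # (batch, time, n_parcels)


# Example with random embeddings standing in for pretrained video/audio/text features.
dims = {"video": 768, "audio": 512, "text": 768}
model = MultimodalEncoder(dims)
feats = {name: torch.randn(2, 30, d) for name, d in dims.items()}
out = model(feats, subject=0)  # -> torch.Size([2, 30, 1000])
```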
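
The composite MSE-correlation loss can likewise be illustrated with a hedged sketch. The mixing weight `alpha`, the per-parcel formulation, and the optional `parcel_weights` argument, through which a curriculum could gradually shift emphasis from early sensory to late association parcels, are assumptions rather than the exact objective used in the paper.

```python
import torch


def composite_loss(pred, target, parcel_weights=None, alpha=0.5, eps=1e-8):
    """pred, target: (batch, time, n_parcels) predicted and measured responses.

    Mixes mean-squared error with (1 - Pearson r), where r is computed over
    the time axis separately for each parcel.
    """
    # Per-parcel squared error, averaged over batch and time.
    mse = ((pred - target) ** 2).mean(dim=(0, 1))                     # (n_parcels,)

    # Per-parcel Pearson correlation over time.
    p = pred - pred.mean(dim=1, keepdim=True)
    t = target - target.mean(dim=1, keepdim=True)
    r = (p * t).sum(dim=1) / (
        p.pow(2).sum(dim=1).sqrt() * t.pow(2).sum(dim=1).sqrt() + eps
    )                                                                  # (batch, n_parcels)

    per_parcel = alpha * mse + (1.0 - alpha) * (1.0 - r).mean(dim=0)   # (n_parcels,)

    # Hypothetical curriculum hook: parcel_weights could be re-weighted over
    # epochs from early sensory parcels toward late association parcels.
    if parcel_weights is not None:
        per_parcel = per_parcel * parcel_weights
    return per_parcel.mean()
```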
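
Finally, the reported averaging of 100 model variants is consistent with simple prediction-level ensembling; the snippet below assumes that interpretation.

```python
import torch


@torch.no_grad()
def ensemble_predict(models, feats, subject):
    """Average the parcel predictions of several trained model variants."""
    preds = [m(feats, subject) for m in models]    # each (batch, time, n_parcels)
    return torch.stack(preds, dim=0).mean(dim=0)   # unweighted mean over variants
```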