📄 Abstract
Recent advances in end-to-end (E2E) autonomous driving have been enabled by
training on diverse large-scale driving datasets, yet autonomous driving models
still struggle in out-of-distribution (OOD) scenarios. The COOOL benchmark
targets this gap by encouraging hazard understanding beyond closed taxonomies,
and the 2COOOL challenge extends it to generating human-interpretable incident
reports. We present a hierarchical reasoning framework for incident report
generation from dashcam videos that integrates frame-level captioning, incident
frame detection, and fine-grained reasoning within vision-language models
(VLMs). We further improve factual accuracy and readability through model
ensembling and a Blind A/B Scoring selection protocol. On the official 2COOOL
open leaderboard, our method ranks 2nd among 29 teams and achieves the best
CIDEr-D score, producing accurate and coherent incident narratives. These
results indicate that hierarchical reasoning with VLMs is a promising direction
for accident analysis and for broader understanding of safety-critical traffic
events. The implementation and code are available at
https://github.com/riron1206/kaggle-2COOOL-2nd-Place-Solution.
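
The three stages named in the abstract (frame-level captioning, incident frame detection, fine-grained reasoning) can be pictured as a simple pipeline. The following is a minimal sketch under stated assumptions, not the authors' implementation: the `VLM` callable, every function name, the prompts, and the `window` parameter are hypothetical, and `stub_vlm` stands in for a real vision-language model so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, List, Sequence

# Hypothetical VLM interface (an assumption for this sketch):
# takes a text prompt and a list of frames, returns generated text.
VLM = Callable[[str, Sequence[object]], str]


@dataclass
class IncidentReport:
    incident_frame: int   # index of the frame judged to show the incident
    captions: List[str]   # one caption per input frame
    narrative: str        # final human-readable report


def caption_frames(vlm: VLM, frames: Sequence[object]) -> List[str]:
    """Stage 1: caption each frame independently."""
    return [vlm("Describe the traffic scene in this frame.", [f]) for f in frames]


def detect_incident_frame(vlm: VLM, captions: Sequence[str]) -> int:
    """Stage 2: ask the model for the incident frame index; fall back to 0."""
    listing = "\n".join(f"{i}: {c}" for i, c in enumerate(captions))
    answer = vlm("Return only the index of the frame where the incident begins.\n" + listing, [])
    try:
        idx = int(answer.strip())
    except ValueError:
        return 0
    return idx if 0 <= idx < len(captions) else 0


def reason_about_incident(vlm: VLM, frames: Sequence[object],
                          captions: Sequence[str], idx: int, window: int = 1) -> str:
    """Stage 3: reason over a small window of frames around the incident."""
    lo, hi = max(0, idx - window), min(len(frames), idx + window + 1)
    context = " ".join(captions[lo:hi])
    return vlm("Write an incident report. Context: " + context, list(frames[lo:hi]))


def generate_report(vlm: VLM, frames: Sequence[object]) -> IncidentReport:
    """Run the three stages in order, coarse to fine."""
    captions = caption_frames(vlm, frames)
    idx = detect_incident_frame(vlm, captions)
    narrative = reason_about_incident(vlm, frames, captions, idx)
    return IncidentReport(idx, captions, narrative)


def stub_vlm(prompt: str, frames: Sequence[object]) -> str:
    """Deterministic stand-in for a real model, used only to exercise the pipeline."""
    if prompt.startswith("Describe"):
        return f"caption for {frames[0]}"
    if prompt.startswith("Return only"):
        return "2"
    return f"report over {len(frames)} frames"


report = generate_report(stub_vlm, ["f0", "f1", "f2", "f3"])
```

The point of the hierarchy in this sketch is that the expensive reasoning step only sees the frames around the detected incident rather than the whole video.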
Key Contributions
The paper presents a hierarchical reasoning framework that uses vision-language models (VLMs) to generate incident reports from dashcam videos. The framework integrates frame-level captioning, incident frame detection, and fine-grained reasoning, and it uses model ensembling with a Blind A/B Scoring selection protocol to improve factual accuracy and readability. The work targets out-of-distribution (OOD) driving scenarios, where current models still struggle.
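
The summary does not describe how the Blind A/B Scoring protocol works. One plausible reading of the name is a pairwise comparison in which a judge sees two candidate reports without knowing which model produced them, with presentation order randomized. The sketch below illustrates that reading only; `blind_ab_select`, the `Judge` type, and the win-count rule are assumptions, not the authors' protocol.

```python
import random
from typing import Callable, Dict

# Hypothetical judge (an assumption): sees two anonymized texts, returns "A" or "B".
Judge = Callable[[str, str], str]


def blind_ab_select(candidates: Dict[str, str], judge: Judge, seed: int = 0) -> str:
    """Return the source name whose report wins the most blind pairwise comparisons."""
    rng = random.Random(seed)
    names = sorted(candidates)
    wins = {name: 0 for name in names}
    for i, first in enumerate(names):
        for second in names[i + 1:]:
            pair = [first, second]
            rng.shuffle(pair)  # randomize which report is shown as "A"
            verdict = judge(candidates[pair[0]], candidates[pair[1]])
            winner = pair[0] if verdict == "A" else pair[1]
            wins[winner] += 1
    # Ties resolve to the alphabetically first name because `names` is sorted.
    return max(names, key=lambda name: wins[name])


# Toy judge for demonstration only: prefers the longer text.
def longer_is_better(a: str, b: str) -> str:
    return "A" if len(a) >= len(b) else "B"


reports = {
    "model_1": "short",
    "model_2": "a much longer incident report",
    "model_3": "medium length",
}
best = blind_ab_select(reports, longer_is_better)
```

Because the judge only receives the texts, the source names never reach it, and shuffling each pair guards against a judge that systematically favors one position.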
Business Value
Clear, human-readable explanations of incidents can improve the safety and transparency of autonomous driving systems. Such reports can support accident analysis, insurance claims, and regulatory compliance.