
FutrTrack: A Camera-LiDAR Fusion Transformer for 3D Multiple Object Tracking

Abstract

We propose FutrTrack, a modular camera-LiDAR multi-object tracking framework that builds on existing 3D detectors by introducing a transformer-based smoother and a fusion-driven tracker. Inspired by query-based tracking frameworks, FutrTrack employs a multimodal two-stage transformer refinement and tracking pipeline. Our fusion tracker integrates bounding boxes with multimodal bird's-eye-view (BEV) fusion features from multiple cameras and LiDAR, without the need for an explicit motion model. The tracker assigns and propagates identities across frames, leveraging both geometric and semantic cues for robust re-identification under occlusion and viewpoint changes. Prior to tracking, a temporal smoother refines sequences of bounding boxes over a moving window, reducing jitter and improving spatial consistency. Evaluated on nuScenes and KITTI, FutrTrack demonstrates that query-based transformer tracking methods benefit significantly from multimodal sensor features compared with previous single-sensor approaches. With an aMOTA of 74.7 on the nuScenes test set, FutrTrack achieves strong performance on 3D MOT benchmarks, reducing identity switches while maintaining competitive accuracy. Our approach provides an efficient framework for improving transformer-based trackers so that they compete with other neural-network-based methods, even with limited data and without pretraining.
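
The abstract describes a two-stage pipeline: a temporal smoother that refines sequences of boxes over a moving window, followed by a fusion tracker. The paper's smoother is transformer-based; the sketch below is only a minimal, hypothetical illustration of the sliding-window data flow (recent boxes in, refined current box out). The class name `SlidingWindowSmoother`, the weighted-average update, and all parameter choices are assumptions for illustration, not the authors' implementation.

```python
import numpy as np
from collections import deque

class SlidingWindowSmoother:
    """Hypothetical sketch: refine per-frame 3D boxes over a moving window.

    FutrTrack uses a transformer-based smoother; here a simple recency-weighted
    average stands in for that refinement step, purely to show the data flow
    (window of recent boxes for one object -> refined current box).
    """

    def __init__(self, window_size: int = 5):
        self.window_size = window_size
        # One buffer of recent boxes per tracked object id.
        self.history: dict[int, deque] = {}

    def update(self, obj_id: int, box: np.ndarray) -> np.ndarray:
        """box: (x, y, z, w, l, h, yaw) for one object in one frame."""
        buf = self.history.setdefault(obj_id, deque(maxlen=self.window_size))
        buf.append(box)
        # Weight recent frames more heavily; a transformer would instead attend
        # over the whole window. Note this naive average ignores yaw wrap-around.
        weights = np.linspace(0.5, 1.0, num=len(buf))
        weights /= weights.sum()
        return np.average(np.stack(buf), axis=0, weights=weights)
```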
Authors (3)
Martha Teiko Teye
Ori Maoz
Matthias Rottmann
Submitted
October 22, 2025
arXiv Category
cs.CV

Key Contributions

FutrTrack is a modular camera-LiDAR fusion framework for 3D Multiple Object Tracking (MOT) that combines a transformer-based smoother with a fusion-driven tracker. It employs a multimodal two-stage transformer pipeline, integrates BEV fusion features without an explicit motion model, and leverages geometric and semantic cues for robust re-identification, reaching 74.7 aMOTA on the nuScenes test set while reducing identity switches. A rough sketch of the fusion-driven association idea follows.
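
To make the "geometric plus semantic cues, no explicit motion model" idea concrete, here is a hypothetical association sketch. The linear cost combination, the cosine-similarity appearance term, and the Hungarian matching via `scipy.optimize.linear_sum_assignment` are assumptions about how such cues could be fused; the paper's actual tracker is a query-based transformer, not this hand-crafted matcher.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def associate(track_centers, track_feats, det_centers, det_feats,
              w_geom: float = 0.5, max_cost: float = 1.0):
    """Hypothetical sketch: match tracks to detections with fused cues.

    track_centers, det_centers: (T, 3) / (D, 3) 3D box centers.
    track_feats, det_feats:     (T, C) / (D, C) L2-normalized fusion features.
    Returns a list of (track_idx, det_idx) matches below the cost threshold.
    """
    # Geometric cue: normalized center distance (no explicit motion model).
    geom = np.linalg.norm(track_centers[:, None] - det_centers[None], axis=-1)
    geom = geom / (geom.max() + 1e-6)
    # Semantic cue: cosine distance between fused BEV appearance features.
    sem = 1.0 - track_feats @ det_feats.T
    cost = w_geom * geom + (1.0 - w_geom) * sem
    rows, cols = linear_sum_assignment(cost)
    return [(r, c) for r, c in zip(rows, cols) if cost[r, c] < max_cost]
```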

Business Value

Crucial for the development of safe and reliable autonomous driving systems. Accurate 3D object tracking is essential for perception, prediction, and planning modules, enabling vehicles to navigate complex environments safely.