📄 Abstract
Reconstructing large-scale colored point clouds is an important task in robotics, supporting perception, navigation, and scene understanding. Despite advances in LiDAR-inertial-visual odometry (LIVO), its performance remains highly sensitive to extrinsic calibration. Meanwhile, 3D vision foundation models such as VGGT suffer from limited scalability in large environments and inherently lack metric scale. To overcome these limitations, we propose LiDAR-VGGT, a novel framework that tightly couples LiDAR-inertial odometry with the state-of-the-art VGGT model through a two-stage coarse-to-fine fusion pipeline. First, a pre-fusion module with robust initialization refinement efficiently estimates VGGT poses and point clouds with coarse metric scale within each session. Then, a post-fusion module refines the cross-modal 3D similarity transformation, using bounding-box-based regularization to reduce scale distortions caused by the inconsistent fields of view (FOVs) of the LiDAR and camera sensors. Extensive experiments across multiple datasets demonstrate that LiDAR-VGGT produces dense, globally consistent colored point clouds and outperforms both VGGT-based methods and LIVO baselines. The implementation of our proposed colored point cloud evaluation toolkit will be released as open source.
Authors (6)
Lijie Wang
Lianjie Guo
Ziyi Xu
Qianhao Wang
Fei Gao
Xieyuanli Chen
Submitted
November 3, 2025
Key Contributions
LiDAR-VGGT proposes a novel framework that tightly couples LiDAR-inertial odometry with the VGGT model for globally consistent, metric-scale dense mapping. It addresses the limitations of existing methods with a two-stage coarse-to-fine fusion pipeline that refines poses and point clouds and refines the cross-modal 3D similarity transformation to reduce scale distortions, enabling more accurate large-scale 3D reconstructions.
Business Value
Improves the accuracy and reliability of 3D mapping for autonomous systems and AR/VR applications, reducing the need for manual calibration and enabling more robust navigation and scene understanding.