FreeReg: Image-to-Point Cloud Registration Leveraging Pretrained Diffusion Models and Monocular Depth Estimators

ICLR 2024
Haiping Wang1,* Yuan Liu2,* Bing Wang3 Yujing Sun2
Zhen Dong†1 Wenping Wang2 Bisheng Yang1,†
1 Wuhan University 2 The University of Hong Kong
3 The Hong Kong Polytechnic University 4 Texas A&M University
*The first two authors contributed equally. †Corresponding authors.


What can FreeReg do?

FreeReg, without any task-specific training or fine-tuning, is able to register (a) 2D RGB images with 3D point clouds of both indoor and outdoor scenes. The key idea of FreeReg is to extract cross-modality diffusion and geometric features by utilizing pretrained diffusion models and monocular depth estimators as shown in (b), which enables reliable pixel-to-point correspondence estimation (c) even in challenging cases with small overlaps, large viewpoint changes, and sparse point density (d).


Matching cross-modality features between images and point clouds is a fundamental problem for image-to-point cloud registration. However, due to the modality difference between images and points, it is difficult to learn robust and discriminative cross-modality features with existing metric learning methods for feature matching. Instead of applying metric learning to cross-modality data, we propose to first unify the modality of images and point clouds using pretrained large-scale models, and then establish robust correspondences within the same modality. We show that the intermediate features, called diffusion features, extracted by depth-to-image diffusion models are semantically consistent between images and point clouds, which enables the building of coarse but robust cross-modality correspondences. We further extract geometric features on depth maps produced by the monocular depth estimator. By matching such geometric features, we significantly improve the accuracy of the coarse correspondences produced by diffusion features. Extensive experiments demonstrate that, without any task-specific training, direct utilization of both features produces accurate image-to-point cloud registration. On three public indoor and outdoor benchmarks, the proposed method achieves on average a 20.6% improvement in Inlier Ratio, a three-fold higher Inlier Number, and a 48.6% improvement in Registration Recall over existing state-of-the-art methods. The code is available in the supplementary material and will be released upon acceptance.
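To make the Inlier Ratio metric concrete, the following is a minimal sketch of how a set of pixel-to-point correspondences might be scored against a known ground-truth pose. It is not the benchmark's evaluation code: this page does not state the exact inlier criterion, so the pinhole projection model, the reprojection-error test, the function names `project_points` and `inlier_ratio`, and the `pixel_thresh` value are all assumptions made for illustration.

```python
import numpy as np

def project_points(points_world, R, t, K):
    """Project Nx3 points with a pinhole camera: x ~ K (R X + t).

    Returns Nx2 pixel coordinates and the N depths in the camera frame.
    """
    cam = points_world @ R.T + t          # world frame -> camera frame
    uvw = cam @ K.T                       # apply intrinsics
    return uvw[:, :2] / uvw[:, 2:3], cam[:, 2]

def inlier_ratio(pixels, points, R_gt, t_gt, K, pixel_thresh=5.0):
    """Fraction of correspondences whose reprojection error is below a threshold.

    `pixel_thresh` is a hypothetical value, not one taken from the paper.
    """
    if len(pixels) == 0:
        return 0.0
    uv, depth = project_points(points, R_gt, t_gt, K)
    err = np.linalg.norm(uv - pixels, axis=1)
    # A correspondence counts only if the point lies in front of the camera.
    inliers = (depth > 0) & (err < pixel_thresh)
    return float(inliers.mean())
```

Under these assumptions, a higher value means that more of the estimated correspondences agree with the ground-truth camera pose.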



Zero-shot registration results

Comparison with the baseline method

(a) Input RGB images and point clouds for registration. (b) Estimated correspondences from the baseline method I2P-Matr.
(c-e) Estimated correspondences by a nearest neighbor (NN) matcher utilizing Diffusion (FreeReg-D) / Geometric (FreeReg-G) / Fused features (FreeReg).
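As a rough illustration of what matching with fused features could look like, the sketch below combines per-keypoint diffusion and geometric descriptors and matches them by nearest neighbor. The details are assumptions rather than the released implementation: fusion is shown as a weighted concatenation of L2-normalized features with a hypothetical weight `w`, the matcher keeps only mutual nearest neighbors, and the function names are invented for this example.

```python
import numpy as np

def l2_normalize(feats, eps=1e-8):
    """Scale each row to unit length so dot products act as cosine similarities."""
    return feats / (np.linalg.norm(feats, axis=1, keepdims=True) + eps)

def fuse_features(diff_feats, geo_feats, w=0.5):
    """Weighted concatenation of diffusion and geometric features.

    `w` is a hypothetical balance weight, not a value taken from this page.
    """
    return np.concatenate(
        [w * l2_normalize(diff_feats), (1.0 - w) * l2_normalize(geo_feats)],
        axis=1,
    )

def mutual_nn_match(feats_img, feats_pc):
    """Return (image_idx, point_idx) pairs that are each other's nearest neighbor."""
    sim = feats_img @ feats_pc.T               # similarity of every image/point pair
    nn_img_to_pc = sim.argmax(axis=1)          # best point for each image keypoint
    nn_pc_to_img = sim.argmax(axis=0)          # best image keypoint for each point
    idx_img = np.arange(feats_img.shape[0])
    keep = nn_pc_to_img[nn_img_to_pc] == idx_img   # keep only mutual agreements
    return np.stack([idx_img[keep], nn_img_to_pc[keep]], axis=1)
```

Setting `w=1.0` or `w=0.0` in this sketch reduces the matcher to diffusion-only or geometric-only features, loosely mirroring the FreeReg-D and FreeReg-G variants named in the caption.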

Visualization of feature maps and established correspondences

(a) Input RGB images and point clouds for registration. (b) The ground-truth RGB images corresponding to the point clouds, shown only to help readers see the overlapping regions. (c-e) Diffusion / Geometric / Fused feature maps of the input RGB images and point clouds. (g) Estimated correspondences from FreeReg.


  title={FreeReg: Image-to-Point Cloud Registration Leveraging Pretrained Diffusion Models and Monocular Depth Estimators},
  author={Haiping Wang and Yuan Liu and Bing Wang and Yujing Sun and Zhen Dong and Wenping Wang and Bisheng Yang},
  journal={arXiv preprint arXiv:2310.03420},

Acknowledgements: We borrow this template from A-tale-of-two-features.