Abstract
OpenStreetMap (OSM), a rich and versatile source of volunteered geographic
information (VGI), facilitates human self-localization and scene understanding
by integrating nearby visual observations with vectorized map data. However,
the disparity in modalities and perspectives poses a major challenge for
effectively matching camera imagery with compact map representations, thereby limiting the full potential
of VGI data in real-world localization applications. Inspired by the fact that the human brain relies
on the fusion of geometric and semantic understanding for spatial localization, we propose
OSMLoc, a brain-inspired visual localization approach that matches first-person-view images
against OSM maps. It integrates semantic and geometric guidance to significantly improve
accuracy, robustness, and generalization.
First, we equip OSMLoc with a visual foundation model to extract powerful image features.
Second, a geometry-guided depth distribution adapter is proposed to bridge monocular depth
estimation and the camera-to-bird's-eye-view (BEV) transform. Third, semantic embeddings from
the OSM data are utilized as auxiliary guidance for image-to-OSM feature matching. To validate the proposed OSMLoc,
we collect a worldwide cross-area and cross-condition (CC) benchmark for extensive evaluation.
Experiments on the MGL dataset, the CC validation benchmark, and the KITTI dataset demonstrate the
superiority of our method.
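
For readers unfamiliar with depth-distribution-based view lifting, the snippet below is a minimal, illustrative sketch of how a predicted per-pixel depth distribution can weight image features onto a BEV-style plane. It is not the authors' implementation; the layer names, bin count, and tensor sizes are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class DepthLiftToBEV(nn.Module):
    """Minimal sketch (hypothetical, not the OSMLoc code): lift image features
    to a column-wise BEV plane using a categorical per-pixel depth distribution."""
    def __init__(self, feat_dim: int, num_depth_bins: int):
        super().__init__()
        # Predict a categorical depth distribution per pixel from the image features.
        self.depth_head = nn.Conv2d(feat_dim, num_depth_bins, kernel_size=1)

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) image features from the backbone.
        depth_logits = self.depth_head(feats)                    # (B, D, H, W)
        depth_prob = depth_logits.softmax(dim=1)                 # depth distribution per pixel
        # Outer product: spread each pixel's feature over its depth bins.
        frustum = depth_prob.unsqueeze(1) * feats.unsqueeze(2)   # (B, C, D, H, W)
        # Collapse the image-height axis: depth bins along one BEV axis,
        # image columns along the other.
        bev = frustum.sum(dim=3)                                 # (B, C, D, W)
        return bev

# Usage with illustrative sizes: 128-dim features, 48 depth bins.
feats = torch.randn(1, 128, 32, 64)
bev = DepthLiftToBEV(feat_dim=128, num_depth_bins=48)(feats)
print(bev.shape)  # torch.Size([1, 128, 48, 64])
```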
Acknowledgements:
We borrow this template from FreeReg.