TOL: Textual Localization with OpenStreetMap

Youqi Liao¹ Shuhao Kang² Jingyu Xu³ Olaf Wysocki⁴ Yan Xia⁵
Jianping Li⁶,† Zhen Dong¹ Bisheng Yang¹ Xieyuanli Chen⁷
¹ Wuhan University ² Technical University of Munich ³ Institute of Artificial Intelligence (TeleAI), China Telecom
⁴ Nanyang Technological University ⁵ University of Cambridge ⁶ University of Science and Technology of China ⁷ National University of Defense Technology † Corresponding author.


What can TOL do?

TOL localizes a text query in a city-scale OSM database with meter-level accuracy. (a) Text-to-OSM localization first retrieves the most similar OSM tile from the database and then estimates an accurate 2-DoF position. (b) Comparison with existing methods. Compared with the point cloud maps used by text-to-point-cloud localization methods, OSM data is much lighter to construct, store, and update. Compared with existing text-driven place recognition methods, our approach targets global localization with meter-level accuracy rather than only retrieving OSM tiles from textual queries. This enables finer-grained spatial understanding and more precise localization beyond tile-level retrieval.
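The retrieve-then-refine flow in (a) can be illustrated with a minimal sketch. It is not the TOLoc implementation: the descriptors are toy vectors, `predict_offset` is a hypothetical stand-in for the learned fine stage, and expressing the final position as the tile center plus a predicted offset is an assumption made only for illustration.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def retrieve_top1(query_desc, tile_descs):
    """Coarse stage: index of the tile whose descriptor best matches the query."""
    return max(range(len(tile_descs)),
               key=lambda i: cosine(query_desc, tile_descs[i]))

def localize(query_desc, tile_descs, tile_centers, predict_offset):
    """Coarse-to-fine: retrieve the top-1 tile, then refine inside it.

    predict_offset is a hypothetical callable standing in for a learned
    regressor; it returns an (dx, dy) offset in meters from the tile center.
    """
    idx = retrieve_top1(query_desc, tile_descs)
    cx, cy = tile_centers[idx]
    dx, dy = predict_offset(idx)
    return idx, (cx + dx, cy + dy)

# Toy database of three tiles with 2-D descriptors and metric tile centers.
tile_descs = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
tile_centers = [(0.0, 0.0), (50.0, 0.0), (0.0, 50.0)]
query_desc = [0.1, 0.9]

idx, position = localize(query_desc, tile_descs, tile_centers,
                         lambda i: (2.0, -3.0))  # fixed dummy offset
print(idx, position)
```

In a real system the dummy offset would be replaced by a model that sees both the query text and the retrieved tile.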

Abstract

Natural language provides an intuitive way to express spatial intent in geospatial applications. While existing localization methods often rely on dense point cloud maps or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available map representation that encodes rich semantic and structural information, making it well suited for large-scale localization. However, text-to-OSM (T2O) localization remains largely unexplored. In this paper, we formulate the T2O global localization task, which aims to estimate accurate 2-degree-of-freedom (DoF) positions in urban environments from textual scene descriptions without relying on geometric observations or a GNSS-based initial location. To support the proposed task, we introduce TOL, a large-scale benchmark spanning multiple continents and diverse urban environments. TOL contains approximately 121K textual queries paired with OSM map tiles and covers about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further propose TOLoc, a coarse-to-fine localization framework that explicitly models the semantics of surrounding objects and their directional information. In the coarse stage, direction-aware features are extracted from both textual descriptions and OSM tiles to construct global descriptors, which are used to retrieve candidate locations for the query. In the fine stage, the query text and the top-1 retrieved tile are jointly processed, and a dedicated alignment module fuses the textual descriptor with local map features to regress the 2-DoF pose. Experimental results demonstrate that TOLoc achieves strong localization performance, outperforming the best existing method by 6.53%, 9.93%, and 8.31% at the 5 m, 10 m, and 25 m thresholds, respectively, and shows strong generalization to unseen environments.
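As a rough illustration of how results at distance thresholds can be computed, the sketch below reports the fraction of queries whose 2-D position error falls within 5 m, 10 m, and 25 m. It assumes the metric is this simple per-threshold recall; the page does not spell out the exact protocol, and the function name and toy coordinates are hypothetical.

```python
import math

def recall_at_thresholds(pred, gt, thresholds=(5.0, 10.0, 25.0)):
    """Fraction of queries whose Euclidean error is within each threshold (meters)."""
    errors = [math.hypot(px - gx, py - gy)
              for (px, py), (gx, gy) in zip(pred, gt)]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds}

# Toy predictions and ground truth in a local metric frame (made-up values).
pred = [(3.0, 3.0), (0.0, 8.0), (20.0, 0.0), (30.0, 40.0)]
gt = [(0.0, 0.0)] * 4

recalls = recall_at_thresholds(pred, gt)
for t, r in recalls.items():
    print(f"recall@{t:g}m = {r:.2f}")
```

Because the thresholds are nested, the recall values are non-decreasing from 5 m to 25 m.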

Introduction Video

Place recognition & Localization results

BibTeX

 @misc{liao2026toltextuallocalizationopenstreetmap,
      title={TOL: Textual Localization with OpenStreetMap}, 
      author={Youqi Liao and Shuhao Kang and Jingyu Xu and Olaf Wysocki and Yan Xia and Jianping Li and Zhen Dong and Bisheng Yang and Xieyuanli Chen},
      year={2026},
      eprint={2604.01644},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2604.01644}, 
}

Acknowledgements: We borrow this template from FreeReg.