Abstract
Natural language provides an intuitive way to express spatial intent in geospatial
applications. While existing localization methods often rely on dense point cloud maps
or high-resolution imagery, OpenStreetMap (OSM) offers a compact and freely available
map representation that encodes rich semantic and structural information, making it
well suited for large-scale localization. However, text-to-OSM (T2O) localization
remains largely unexplored. In this paper, we formulate the T2O global localization
task, which aims to estimate accurate 2-degree-of-freedom (2-DoF) positions in urban
environments from textual scene descriptions without relying on geometric observations
or a GNSS-based initial location. To support the proposed task, we introduce TOL, a
large-scale benchmark spanning multiple continents and diverse urban environments.
TOL contains approximately 121K textual queries paired with OSM map tiles and covers
about 316 km of road trajectories across Boston, Karlsruhe, and Singapore. We further
propose TOLoc, a coarse-to-fine localization framework that explicitly models the
semantics of surrounding objects and their directional information. In the coarse
stage, direction-aware features are extracted from both textual descriptions and
OSM tiles to construct global descriptors, which are used to retrieve candidate
locations for the query. In the fine stage, the query text and the top-1 retrieved
tile are processed jointly: a dedicated alignment module fuses the textual
descriptor with local map features to regress the 2-DoF position. Experimental
results demonstrate that TOLoc achieves strong localization performance,
outperforming the best existing method by 6.53%, 9.93%, and 8.31% at the 5 m,
10 m, and 25 m thresholds, respectively, and generalizes well to
unseen environments.
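
As a rough illustration of the coarse retrieval step and of threshold-based evaluation, the sketch below ranks map tiles by cosine similarity between global descriptors and then reports the fraction of queries localized within 5 m, 10 m, and 25 m. It is a minimal sketch under stated assumptions, not the TOLoc implementation: the function names (`retrieve_top_k`, `localization_recall`) are hypothetical, the descriptors are plain lists of floats standing in for learned direction-aware features, cosine similarity is assumed as the matching score, and reading the reported thresholds as a recall-style success rate is likewise an assumption.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length descriptor vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # Guard against zero vectors, which have no defined direction.
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query_desc, tile_descs, k=1):
    """Return indices of the k tiles most similar to the query descriptor.

    Assumption: descriptors come from some text/map encoder that is not
    modeled here; ties are broken by the lower tile index.
    """
    scored = [(cosine(query_desc, d), i) for i, d in enumerate(tile_descs)]
    scored.sort(key=lambda s: (-s[0], s[1]))
    return [i for _, i in scored[:k]]


def localization_recall(pred_xy, gt_xy, thresholds=(5.0, 10.0, 25.0)):
    """Fraction of queries whose 2-DoF position error is within each threshold.

    Positions are (x, y) pairs in meters in a shared metric frame (assumed).
    """
    errors = [math.dist(p, g) for p, g in zip(pred_xy, gt_xy)]
    return {t: sum(e <= t for e in errors) / len(errors) for t in thresholds}


if __name__ == "__main__":
    tiles = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7]]
    print(retrieve_top_k([0.9, 0.1], tiles, k=2))

    preds = [(3.0, 0.0), (0.0, 8.0), (20.0, 0.0), (30.0, 0.0)]
    truth = [(0.0, 0.0)] * 4
    print(localization_recall(preds, truth))
```

In a coarse-to-fine setup of this kind, only the top-ranked tile would be passed on to the fine stage, so retrieval errors cannot be recovered later; retrieving more than one candidate (`k > 1`) is a common way to study that trade-off.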
Acknowledgements:
We borrow this template from FreeReg.