CityAnchor: City-scale 3D Visual Grounding with Multi-modality LLMs

ICLR 2025
Jinpeng Li1,* Haiping Wang1,* Jiabin Chen1 Yuan Liu2,† Zhiyang Dou3 Yuexin Ma4
Sibei Yang4 Yuan Li5 Wenping Wang6 Zhen Dong1 Bisheng Yang1,†
1 Wuhan University 2 Hong Kong University of Science and Technology 3 University of Pennsylvania
4 ShanghaiTech University 5 Sun Yat-Sen University 6 Texas A&M University
*The first two authors contributed equally.    †Corresponding authors.

[Paper] [Code] [BibTeX]

Do you want to locate an object in a large city based only on a text description?

Please stay tuned for our CityAnchor!


We present CityAnchor, a multi-modality LLM that can accurately localize a target in a city-scale point cloud from a text description of the target. CityAnchor achieves this by extracting features from the point cloud to capture the intricate attributes and spatial relationships of urban objects. It then comprehends the text description and searches the city-scale point cloud for the objects that the description refers to.

Abstract

3D visual grounding is a critical task in computer vision with transformative applications in robotics, AR/VR, and autonomous driving. Taking this to the next level by scaling 3D visual grounding to city-scale point clouds opens up thrilling new possibilities. We present a 3D visual grounding method called CityAnchor for localizing an urban object in a city-scale point cloud. Recent developments in multiview reconstruction enable us to reconstruct city-scale point clouds, but how to conduct visual grounding on such a large-scale urban point cloud remains an open problem. Previous 3D visual grounding systems mainly concentrate on localizing an object in an image or a small-scale point cloud, which is neither accurate nor efficient enough to scale up to a city-scale point cloud. We address this problem with a multi-modality LLM that consists of two stages, coarse localization and fine-grained matching. Given the text descriptions, the coarse localization stage locates possible regions on a projected 2D map of the point cloud, while the fine-grained matching stage accurately determines the most matched object in these possible regions. We conduct experiments on the CityRefer dataset and a new synthetic dataset annotated by us, both of which demonstrate that our method produces accurate 3D visual grounding on city-scale 3D point clouds.
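The control flow of the two-stage design described above can be summarized with a minimal sketch. This is not the released CityAnchor implementation: the function names, signatures, and the candidate cutoff `top_k` are illustrative assumptions; in the paper both stages are carried out by a multi-modality LLM.

```python
# A minimal sketch of the two-stage grounding pipeline described in the abstract.
# The interfaces below (CoarseFn, FineFn, ground_object) are illustrative
# assumptions, not the released CityAnchor API.
from typing import Callable, Dict, List, Tuple

import numpy as np

# Stage-1 model: given a top-view 2D map and a text query, return candidate
# region IDs with coarse confidence scores.
CoarseFn = Callable[[np.ndarray, str], List[Tuple[int, float]]]

# Stage-2 model: given the 3D points of one candidate object and the query,
# return a fine-grained matching score.
FineFn = Callable[[np.ndarray, str], float]


def ground_object(map_2d: np.ndarray,
                  objects: Dict[int, np.ndarray],   # region_id -> (N_i, 3) points
                  query: str,
                  coarse_fn: CoarseFn,
                  fine_fn: FineFn,
                  top_k: int = 8) -> int:
    """Return the region_id of the object that best matches the query."""
    # Stage 1: coarse localization proposes regions on the projected 2D map.
    candidates = sorted(coarse_fn(map_2d, query), key=lambda x: -x[1])[:top_k]

    # Stage 2: fine-grained matching is restricted to the proposed candidates.
    best_id, best_score = -1, float("-inf")
    for region_id, _ in candidates:
        score = fine_fn(objects[region_id], query)
        if score > best_score:
            best_id, best_score = region_id, score
    return best_id
```

Restricting the expensive fine-grained matching to a handful of coarse candidates is what keeps the search tractable on a city-scale point cloud.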

City-scale Grounding Results

Qualitative grounding results on the CityRefer dataset. The projected 2D map is obtained from the city-scale point cloud by top-view projection. The candidate objects from CLM are represented by red masks. In the query text, the target object is marked in red, the landmark name is marked in blue, and the neighborhood statement is marked in green. Grounding results are shown in red boxes.
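For readers curious how a point cloud becomes the projected 2D map mentioned in the caption, the sketch below rasterizes a colored point cloud by top-view projection. The grid resolution and the "highest point wins" coloring rule are assumptions for illustration and may differ from the preprocessing used in the paper.

```python
# A minimal sketch of top-view projection of a colored point cloud onto a 2D map.
# The 0.5 m cell size and the highest-point coloring rule are assumptions,
# not necessarily the paper's exact rasterization.
import numpy as np


def top_view_projection(points: np.ndarray,   # (N, 3) xyz coordinates
                        colors: np.ndarray,   # (N, 3) RGB values in [0, 1]
                        resolution: float = 0.5) -> np.ndarray:
    """Rasterize a point cloud into a top-view RGB map of shape (H, W, 3)."""
    xy_min = points[:, :2].min(axis=0)
    ij = np.floor((points[:, :2] - xy_min) / resolution).astype(int)
    h, w = ij[:, 1].max() + 1, ij[:, 0].max() + 1

    image = np.zeros((h, w, 3), dtype=np.float32)
    height = np.full((h, w), -np.inf, dtype=np.float32)

    # Keep the color of the highest point in each cell (roofs and treetops),
    # which is what a bird's-eye view would show.
    for (i, j), z, c in zip(ij, points[:, 2], colors):
        if z > height[j, i]:
            height[j, i] = z
            image[j, i] = c
    return image
```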

Comparison to the Baseline Method CityRefer

Qualitative comparisons between the baseline method CityRefer and the proposed framework CityAnchor. The ground-truth and predicted boxes are displayed in green and red, respectively. Although both our CityAnchor and the baseline method CityRefer are capable of understanding simple textual descriptions (e.g., white house, blue car, etc.), CityAnchor demonstrates a superior ability to ground objects accurately when guided by complex textual descriptions (e.g., it is next to another house with a red car parked in its front yard, it is in front of a multicolored white and beige house with a brown roof, etc.).



Acknowledgements: We borrow this template from A-tale-of-two-features.