We describe an approach to predict an open-vocabulary 3D semantic voxel occupancy map from input 2D images, with the objective of enabling 3D grounding, segmentation and retrieval of free-form language queries. This is a challenging problem because of the 2D-3D ambiguity and the open-vocabulary nature of the target tasks, for which obtaining annotated training data in 3D is difficult. The contributions of this work are three-fold. First, we design a new model architecture for open-vocabulary 3D semantic occupancy prediction. The architecture consists of a 2D-3D encoder together with occupancy prediction and 3D-language heads. The output is a dense voxel map of 3D-grounded language embeddings, enabling a range of open-vocabulary tasks. Second, we develop a tri-modal self-supervised learning algorithm that leverages three modalities, (i) images, (ii) language and (iii) LiDAR point clouds, and enables training the proposed architecture using a strong pre-trained vision-language model without the need for any manual 3D language annotations. Finally, we demonstrate quantitatively the strengths of the proposed model on several open-vocabulary tasks: zero-shot 3D semantic segmentation using existing datasets, and 3D grounding and retrieval of free-form language queries, using a small dataset that we propose as an extension of nuScenes.
Proposed approach. In (a), we show the architecture of the proposed method. Having only surround-view images on the input, the model first extracts a dense voxel feature grid that is then fed to two parallel heads: an occupancy head g producing voxel-level occupancy predictions, and a 3D-language feature head h which outputs features aligned with text representations. In (b), we show how we train our approach, namely the occupancy loss \(\mathcal{L}_\text{occ}\) used to train class-agnostic occupancy predictions, and the feature loss \(\mathcal{L}_\text{ft}\) that enforces the 3D-language head h to output features aligned with text representations.
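The two training signals can be illustrated with a minimal NumPy sketch. Everything below is a toy illustration, not the paper's implementation: the shapes, the random placeholder targets, and the choice of binary cross-entropy for \(\mathcal{L}_\text{occ}\) and a masked L2 distance for \(\mathcal{L}_\text{ft}\) are all assumptions made for clarity.

```python
import numpy as np

# Toy voxel grid: X x Y x Z cells, D-dimensional language-aligned features.
# All shapes and targets below are made up for illustration.
X, Y, Z, D = 4, 4, 2, 8
rng = np.random.default_rng(0)

# Outputs of the two heads g and h on the shared voxel features (placeholders).
occ_logits = rng.normal(size=(X, Y, Z))       # occupancy head g
feats_3d = rng.normal(size=(X, Y, Z, D))      # 3D-language head h

# Self-supervised targets: LiDAR points mark occupied voxels (class-agnostic),
# and 2D vision-language features, looked up at the pixels the LiDAR points
# project to, give the per-voxel feature targets (both random stand-ins here).
occ_target = (rng.random(size=(X, Y, Z)) > 0.5).astype(float)
feat_target = rng.normal(size=(X, Y, Z, D))

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# L_occ: binary cross-entropy between predicted and LiDAR-derived occupancy.
p = sigmoid(occ_logits)
l_occ = -np.mean(occ_target * np.log(p) + (1 - occ_target) * np.log(1 - p))

# L_ft: pull the 3D-language features towards the 2D vision-language
# features, here as an L2 distance computed over occupied voxels only.
mask = occ_target.astype(bool)
l_ft = np.mean(np.sum((feats_3d[mask] - feat_target[mask]) ** 2, axis=-1))

loss = l_occ + l_ft
```

Because the feature targets come from a pre-trained vision-language model, no manual 3D language annotation enters either loss term.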
Qualitative results of zero-shot 3D occupancy prediction. Left: six input surround-view images. Right: our prediction; the training grid resolution of 100×100×8 is upsampled to 300×300×24 by interpolating the trained representation space. It is worth noting that the model successfully segments even the bus class, despite its limited occurrence in the training set.
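Upsampling a trained voxel representation by interpolation can be sketched as separable linear interpolation along each spatial axis. This is a hedged illustration with toy grid sizes; the paper's actual interpolation scheme and resolutions (100×100×8 → 300×300×24) are only mirrored here at a smaller scale.

```python
import numpy as np

def upsample_axis(grid, axis, new_size):
    """Linearly interpolate `grid` along one axis to `new_size` samples."""
    old_size = grid.shape[axis]
    xp = np.linspace(0.0, 1.0, old_size)   # source sample positions
    xq = np.linspace(0.0, 1.0, new_size)   # target sample positions
    moved = np.moveaxis(grid, axis, 0)
    flat = moved.reshape(old_size, -1)
    out = np.empty((new_size, flat.shape[1]))
    for j in range(flat.shape[1]):
        out[:, j] = np.interp(xq, xp, flat[:, j])
    out = out.reshape((new_size,) + moved.shape[1:])
    return np.moveaxis(out, 0, axis)

def upsample_grid(grid, new_spatial_shape):
    """Upsample the first three (spatial) axes of a voxel feature grid."""
    for axis, size in enumerate(new_spatial_shape):
        grid = upsample_axis(grid, axis, size)
    return grid

# Toy stand-in for the trained representation: 10x10x4 grid, 8-dim features,
# tripled in each spatial dimension (analogous to 100x100x8 -> 300x300x24).
small = np.random.default_rng(0).normal(size=(10, 10, 4, 8))
big = upsample_grid(small, (30, 30, 12))
```

Interpolating in the feature space, rather than in hard class labels, is what lets the denser grid still be decoded into smooth open-vocabulary predictions.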
Qualitative results of open-vocabulary language-driven retrieval. Left: six input surround-view images. Right: given a text query "Black hatchback", we retrieve the relevant parts of the 3D scene.
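Retrieval of this kind reduces to comparing each voxel's language-aligned feature with the embedding of the text query. The sketch below assumes cosine similarity with a fixed threshold and uses a random planted embedding in place of a real text encoder; in the actual pipeline the query embedding would come from the same pre-trained vision-language text encoder the voxel features are aligned with.

```python
import numpy as np

def retrieve(voxel_feats, query_emb, threshold=0.6):
    """Boolean voxel mask of cells whose language-aligned feature has
    cosine similarity with the text query embedding above `threshold`."""
    f = voxel_feats / (np.linalg.norm(voxel_feats, axis=-1, keepdims=True) + 1e-8)
    q = query_emb / (np.linalg.norm(query_emb) + 1e-8)
    sim = f @ q                      # cosine similarity per voxel
    return sim > threshold

# Toy example: a made-up feature grid with one voxel planted to match a
# made-up "query" embedding (no real text encoder is used here).
D = 8
rng = np.random.default_rng(1)
query = rng.normal(size=D)
feats = rng.normal(size=(6, 6, 3, D))
feats[2, 3, 1] = query               # this voxel should be retrieved
mask = retrieve(feats, query, threshold=0.9)
```

Because only a dot product per voxel is needed at query time, arbitrary free-form queries such as "Black hatchback" can be answered without retraining.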
The details of our benchmark for open-vocabulary language-driven 3D retrieval are available here.
Example from the dataset:
@inproceedings{vobecky2023pop3d,
author = {Vobecky, Antonin and Sim\'{e}oni, Oriane and Hurych, David and Gidaris, Spyridon and Bursuc, Andrei and P\'{e}rez, Patrick and Sivic, Josef},
booktitle = {Advances in Neural Information Processing Systems},
editor = {A. Oh and T. Naumann and A. Globerson and K. Saenko and M. Hardt and S. Levine},
pages = {50545--50557},
publisher = {Curran Associates, Inc.},
title = {POP-3D: Open-Vocabulary 3D Occupancy Prediction from Images},
url = {https://proceedings.neurips.cc/paper_files/paper/2023/file/9e30acdeff572463c1db9b7de59de64c-Paper-Conference.pdf},
volume = {36},
year = {2023}
}
This work was supported by the European Regional Development Fund under the project IMPACT (no. CZ.02.1.01/0.0/0.0/15_003/0000468), by the Ministry of Education, and by the CTU Student Grant SGS21/184/OHK3/3T/37. This work was also supported by the Ministry of Education, Youth and Sports of the Czech Republic through the e-INFRA CZ (ID: 90254). This research received the support of EXA4MIND, a European Union's Horizon Europe Research and Innovation programme under grant agreement N° 101092944. Views and opinions expressed are however those of the author(s) only and do not necessarily reflect those of the European Union or the European Commission. Neither the European Union nor the granting authority can be held responsible for them. The authors have no competing interests to declare that are relevant to the content of this article. Antonin Vobecky acknowledges travel support from ELISE (GA no. 951847).