Prior studies on 3D scene understanding have primarily developed specialized models for specific tasks or required task-specific fine-tuning. In this study, we propose Grounded 3D-LLM, which explores the potential of 3D large multi-modal models (3D LMMs) to consolidate various 3D vision tasks within a unified generative framework. The model uses scene referent tokens as special noun phrases to reference 3D scenes, enabling it to handle sequences that interleave 3D and textual data. Per-task instruction-following templates are employed to ensure naturalness and diversity when translating 3D vision tasks into language formats. To facilitate the use of referent tokens in subsequent language modeling, we provide a large-scale, automatically curated grounded scene-text dataset with over 1 million phrase-to-region correspondences, and we introduce Contrastive Language-Scene Pre-training (CLASP) to perform phrase-level scene-text alignment on this data. Our comprehensive evaluation covers open-ended tasks such as dense captioning and 3D question answering, alongside close-ended tasks such as object detection and language grounding. Experiments across multiple 3D benchmarks demonstrate the leading performance and broad applicability of Grounded 3D-LLM.
To reference phraseable objects in language modeling, a vision-language model is first pre-trained on large-scale grounded scene-text data to align text phrases with their corresponding 3D objects. Subsequently, a large language model (LLM) is fine-tuned using multi-modal instruction-following data, where referent tokens serve as interleaved soft text tokens representing the phraseable objects. Per-task instruction-following templates are employed to address the diverse range of 3D vision tasks within the unified language modeling framework.
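As a rough illustration, the phrase-level pre-training stage can be viewed as a contrastive objective between text-phrase embeddings and per-object scene embeddings. The sketch below is a minimal approximation under our own assumptions (function name, tensor shapes, and the single-positive setup are illustrative, not the released implementation):

```python
import torch
import torch.nn.functional as F

def phrase_scene_contrastive_loss(phrase_emb, object_emb, pos_object_idx, temperature=0.07):
    """Minimal sketch of phrase-to-object contrastive alignment.

    phrase_emb:     (P, D) embeddings of noun phrases from the text encoder.
    object_emb:     (O, D) embeddings of 3D object queries from the scene encoder.
    pos_object_idx: (P,) index of the ground-truth object for each phrase,
                    taken from the phrase-to-region correspondences.
    """
    phrase_emb = F.normalize(phrase_emb, dim=-1)
    object_emb = F.normalize(object_emb, dim=-1)
    logits = phrase_emb @ object_emb.t() / temperature  # (P, O) cosine similarities
    # InfoNCE-style objective: each phrase should score highest on its matched object.
    return F.cross_entropy(logits, pos_object_idx)

# Toy usage with random features (P=4 phrases, O=6 objects, D=256):
phrases = torch.randn(4, 256)
objects = torch.randn(6, 256)
targets = torch.tensor([0, 2, 2, 5])
loss = phrase_scene_contrastive_loss(phrases, objects, targets)
```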
We propose an automated pipeline that uses ChatGPT and 2D vision-language models to generate grounded language data, producing the Grounded Scene Caption dataset (G-SceneCap).
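For illustration only, a single grounded scene caption entry might look like the following. The field names, scene ID, and end-exclusive character-span convention are assumed here, not the official file schema:

```python
# Illustrative (assumed) schema: a scene-level caption whose noun phrases are
# annotated with character spans and linked to 3D instance IDs in the scene.
example_entry = {
    "scene_id": "scene0000_00",  # ScanNet-style ID, assumed for illustration
    "caption": "A brown wooden table sits between two chairs near the window.",
    "phrase_groundings": [
        {"phrase": "brown wooden table", "span": [2, 20], "object_ids": [7]},
        {"phrase": "two chairs",         "span": [34, 44], "object_ids": [3, 5]},
        {"phrase": "the window",         "span": [50, 60], "object_ids": [12]},
    ],
}
```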
Existing 3D vision-language tasks can also be transformed into instruction-following formats, including single- and multi-object grounding, object detection, dense captioning, 3D QA, etc. For each task, we use approximately 10-20 structured, task-specific instruction-following templates. Note that the referent correspondences are converted from the phrase-to-region correspondences of the grounded scene-text annotations.
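The sketch below gives a hedged sense of what such per-task templates could look like; the exact wording and special-token names (e.g., `<ref>`, `{text}`) are illustrative assumptions rather than the project's actual template set:

```python
# Illustrative per-task instruction templates (assumed wording and tokens).
TEMPLATES = {
    "single_object_grounding": [
        "Find the object that matches the description: {text}.",
        'Which object does "{text}" refer to? Point it out.',
    ],
    "object_detection": [
        "Detect all instances of {category} in the scene.",
        "List every {category} you can find in this room.",
    ],
    "dense_captioning": [
        "Describe the object <ref> in detail.",
        "What can you say about <ref>?",
    ],
    "3d_qa": [
        "Answer the question about the scene: {question}",
    ],
}
# Responses reference grounded objects with interleaved referent tokens,
# e.g. "It is <ref> next to <ref>.", where each <ref> is tied to a 3D
# instance via the phrase-to-region correspondences.
```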
@article{chen2024grounded3dllm,
title={Grounded 3D-LLM with Referent Tokens},
author={Chen, Yilun and Yang, Shuai and Huang, Haifeng and Wang, Tai and Lyu, Ruiyuan and Xu, Runsen and Lin, Dahua and Pang, Jiangmiao},
journal={arXiv preprint arXiv:2405.10370},
year={2024},
}