Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

CVPR 2025

1Beijing Institute for General Artificial Intelligence (BIGAI)
2Peking University 3Tsinghua University

✶ indicates equal contribution

Overview

Illustrative overview of Beacon3D

An illustration of Beacon3D, a novel benchmark for 3D grounding and question answering (QA) tasks. Beacon3D features an object-centric evaluation framework with Grounding Chains (G-Chains) and Grounding-QA Chains (GQA-Chains) for each object. The evaluation adopts object-centric metrics to ensure robustness and uses chain-of-analysis to study task coherence. The annotations also cover diverse knowledge types, including class, appearance ("App."), geometry ("Geo."), spatial ("Spa."), and existence ("Exi.").

The 30 high-quality real 3D scenes in Beacon3D, and Beacon3D data statistics.

Beacon3D is built on 30 high-quality real 3D scenes meticulously selected from ScanNet, 3RScan, and MultiScan. The object-centric evaluation includes more than 800 objects and shows a diverse distribution of knowledge types in grounding and QA tasks.

Summary

🤔Limitations of existing 3D-VL benchmarks

  • 🗃️  Flawed test data
  • 📊  Insufficient evaluation metrics
  • 🔗  Isolation of grounding and QA tasks

💡Highlights of the Beacon3D benchmark

  • 🗃️  High-quality test data
  • 📊  Object-centric evaluation metrics
  • 🔗  Grounding Chain and Grounding-QA Chain

From Existing 3D-VL Benchmarks to Beacon3D

🗃️ Flawed test data. We observe notable data flaws, including ambiguous referential texts in the grounding task, and ambiguous questions and incomplete answers in the QA task. Such flawed test data can undermine the reliability of evaluation results.

Flawed test data in existing 3D-VL benchmarks
Flawed test data in existing 3D-VL benchmarks: the top row shows grounding data and the bottom row shows QA data.

🗃️ Beacon3D: high-quality test data. We establish detailed annotation guidelines that ensure precise and natural language, addressing the data flaws of prior benchmarks. A human study across different 3D-VL benchmarks highlights the quality of the Beacon3D test data.

Human study on the quality of test data: the left shows grounding data and the right shows QA data.

📊 Insufficient evaluation metrics. We find that simple metrics, such as accuracy averaged over individual QA pairs, are vulnerable to pitfalls like visual ignorance and weak language robustness, and thus fall short of capturing true model capability.

Model pitfall: visual ignorance
Model pitfall: weak language robustness
We present two pilot studies to show the vulnerability of existing evaluation metrics to two model pitfalls:
  • Visual ignorance: fine-tuning blind (text-only) LLMs on SQA3D yields unexpectedly high accuracy, indicating a deficiency in evaluating the visual capability of 3D-VL models.
  • Weak language robustness: rephrasing the language yields a moderate performance shift in the grounding task and a significant shift in the QA task, indicating that current 3D-VL models are susceptible to language variations.
Performance of blind LLMs on SQA3D (the marked entry denotes a 3D-VL model)
Effect of rephrasing language on grounding (left) and QA (right) tasks

📊 Beacon3D: object-centric evaluation metrics. In contrast to previous per-case average metrics, we design three diverse test cases per object and adopt object-centric metrics, which require the model to make correct predictions in all three cases.

Data example: three grounding texts per object
Object-centric metrics
Case-centric metrics vs. Object-centric metrics
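To make the difference concrete, here is a minimal Python sketch of the two metrics; the result layout and identifiers are hypothetical and not the benchmark's actual evaluation code.

from collections import defaultdict

def per_case_accuracy(results):
    """Average correctness over individual test cases."""
    return sum(r["correct"] for r in results) / len(results)

def object_centric_accuracy(results):
    """Credit an object only if all of its test cases are correct."""
    by_object = defaultdict(list)
    for r in results:
        by_object[r["object_id"]].append(r["correct"])
    return sum(all(v) for v in by_object.values()) / len(by_object)

# Hypothetical results: three grounding cases for each of two objects.
results = [
    {"object_id": "armchair_1", "correct": True},
    {"object_id": "armchair_1", "correct": True},
    {"object_id": "armchair_1", "correct": False},
    {"object_id": "box_2", "correct": True},
    {"object_id": "box_2", "correct": True},
    {"object_id": "box_2", "correct": True},
]
print(per_case_accuracy(results))        # 0.833... under per-case averaging
print(object_centric_accuracy(results))  # 0.5 under the object-centric metric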

🔗 Beacon3D: Grounding Chain. We organize the grounding data into Grounding Chains following a coarse-to-fine scheme, which helps assess performance coherence across different granularities and identify the boundary of grounding capability.
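As a rough illustration of how a coarse-to-fine G-Chain can expose the boundary of grounding capability, the sketch below walks one chain from coarse to fine and reports where grounding first fails; the chain structure and ordering are assumptions for illustration, not the released format.

def grounding_boundary(chain_results: list[bool]) -> int:
    """Given grounding outcomes ordered from coarse to fine along one G-Chain,
    return how many levels succeed before the first failure."""
    for level, correct in enumerate(chain_results):
        if not correct:
            return level
    return len(chain_results)

# Hypothetical G-Chain outcomes for one object, ordered coarse -> fine,
# e.g. ["Large chair.", "The armchair near the printer.",
#       "The armchair directly facing the desk."]
print(grounding_boundary([True, True, False]))  # 2: grounding fails only at the finest text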

🔗 Beacon3D: Grounding-QA Chain. Beacon3D links QA data to grounding data via shared referential texts of the target object. Each question queries a specific aspect (e.g., appearance) of the object, forming a Grounding-QA Chain that enables analysis of grounding-QA coherence. We identify two types of broken coherence (see the sketch after this list):
  • Type 1: the model fails to answer the question even though it can recognize the queried content in the grounding task, indicating a lack of QA skills.
  • Type 2: the model correctly answers the question but fails to ground the target object, indicating shortcut behavior in QA.
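A minimal sketch of how paired grounding and QA outcomes on the same object could be bucketed into coherent chains, the two broken types above, and joint failures; the data layout is hypothetical.

from collections import Counter

def gqa_chain_type(grounding_correct: bool, qa_correct: bool) -> str:
    """Classify one Grounding-QA Chain by the coherence of its two outcomes."""
    if grounding_correct and qa_correct:
        return "coherent"            # grounds the object and answers correctly
    if grounding_correct:
        return "broken, type 1"      # grounds the object but fails the question
    if qa_correct:
        return "broken, type 2"      # answers without grounding (shortcut behavior)
    return "both fail"

# Hypothetical per-chain outcomes: (grounding_correct, qa_correct)
chains = [(True, True), (True, False), (False, True), (False, False)]
print(Counter(gqa_chain_type(g, q) for g, q in chains))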

Data Visualizer

Example object-centric data from the data visualizer, listing each object's grounding texts and linked QA pairs:

Example 1 (armchair):
  • Grounding 1: The armchair directly facing the desk.
  • Grounding 2: The armchair near the printer.
  • Grounding 3: Large chair.
  • Question 1: What is in front of the large chair? Answer 1: Desk.
  • Question 2: What color is the armchair near the printer? Answer 2: Brown.
  • Question 3: What size is "the armchair near the printer" compared with the other chair? Answer 3: Large.

Example 2 (box):
  • Grounding 1: The blue box next to the piano.
  • Grounding 2: Blue box.
  • Grounding 3: The lower box on the stool.
  • Question 1: What color is the lower box on the stool? Answer 1: Blue.
  • Question 2: Is there a similar box next to the blue box? Answer 2: Yes.
  • Question 3: What size is the blue box compared with the adjacent box? Answer 3: Large.

Example 3 (backpack):
  • Grounding 1: The backpack closer to chair.
  • Grounding 2: The black backpack far from piano.
  • Grounding 3: The backpack closer to desk.
  • Question 1: What color is the backpack closer to chair? Answer 1: Black.
  • Question 2: Is "the black backpack far from piano" on the couch? Answer 2: No.
  • Question 3: Is there a trash can next to "the backpack closer to desk"? Answer 3: No.
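For programmatic use of such object-centric data, one possible (purely illustrative) record layout is sketched below; the field names and schema are assumptions, not the released annotation format.

from dataclasses import dataclass, field

@dataclass
class ObjectRecord:
    """Illustrative container for one object's grounding texts and QA pairs.
    The field names are assumptions, not the released annotation schema."""
    scene_id: str
    object_id: str
    grounding_texts: list[str] = field(default_factory=list)       # referential texts
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)

record = ObjectRecord(
    scene_id="scene_example",   # hypothetical identifier
    object_id="armchair_1",     # hypothetical identifier
    grounding_texts=[
        "The armchair directly facing the desk.",
        "The armchair near the printer.",
        "Large chair.",
    ],
    qa_pairs=[
        ("What is in front of the large chair?", "Desk."),
        ("What color is the armchair near the printer?", "Brown."),
    ],
)
print(len(record.grounding_texts), len(record.qa_pairs))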

Results and Findings

Metrics. Object-centric metrics reveal a significant performance drop compared with per-case metrics, suggesting that current 3D-VL models lack a comprehensive understanding of objects and are susceptible to language variations.

Model performance in grounding (left) and QA (right) tasks: "Case" denotes per-case metrics and "Obj." denotes object-centric metrics.

Chain analysis: GQA-Chain. We visualize four types of GQA-Chains and observe only a limited proportion of chains with good grounding-QA coherence. R1 and R2 measure the two types of broken coherence, and both hover around 50%. This reveals a substantial gap between grounding and QA skills, as well as frequent shortcut behavior in QA.

Chain analysis across GQA-Chains
GQA-Chain analysis: distribution of four types of GQA-Chains (left) and two metrics for evaluating broken grounding-QA coherence (right).
Chain analysis: G-Chain. Evaluation across coarse-to-fine G-Chains shows that fine-grained grounding is more challenging than coarse-grained grounding, and the difficulty is especially pronounced when the model already fails on the coarse texts. Since fine-grained grounding is crucial to solid QA performance, our chain analysis highlights the need to improve fine-grained grounding in 3D-VL models to achieve better grounding-QA coherence.
Chain analysis across G-Chains
G-Chain analysis: distribution of four types of G-Chains.

Model insights. Our evaluation indicates that incorporating LLMs into 3D-VL models weakens grounding capability and does not fundamentally enhance QA capability. This suggests that the main bottleneck lies in 3D perception and vision-language alignment rather than in language modeling or reasoning, which are the strengths of LLMs. Therefore, advancing 3D-VL models may rely more on stronger foundation models for 3D scene understanding than on leveraging LLMs.

BibTeX

If you find our work helpful, please consider citing us:

@inproceedings{huang2025unveiling,
  title={Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis},
  author={Huang, Jiangyong and Jia, Baoxiong and Wang, Yan and Zhu, Ziyu and Linghu, Xiongkun and Li, Qing and Zhu, Song-Chun and Huang, Siyuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}