Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis

CVPR 2025

1Beijing Institute for General Artificial Intelligence (BIGAI)
2Peking University 3Tsinghua University

✶ indicates equal contribution

Overview

Illustrative overview of Beacon3D

An illustration of Beacon3D, a novel benchmark for 3D grounding and question answering (QA) tasks. Beacon3D features an object-centric evaluation framework with Grounding Chains (G-Chains) and Grounding-QA Chains (GQA-Chains) for each object. The evaluation adopts object-centric metrics to ensure robustness and uses chain-of-analysis to study task coherence. The annotations also cover diverse knowledge types, including class, appearance ("App."), geometry ("Geo."), spatial ("Spa."), and existence ("Exi.").

The 30 high-quality real 3D scenes in Beacon3D, and Beacon3D data statistics.

Beacon3D is built on 30 high-quality real 3D scenes meticulously selected from ScanNet, 3RScan, and MultiScan. The object-centric evaluation includes more than 800 objects and shows a diverse distribution of knowledge types in grounding and QA tasks.

Summary

🤔Limitations of existing 3D-VL benchmarks

  • 🗃️  Flawed test data
  • 📊  Insufficient evaluation metrics
  • 🔗  Isolation of grounding and QA tasks

💡Highlights of the Beacon3D benchmark

  • 🗃️  High-quality test data
  • 📊  Object-centric evaluation metrics
  • 🔗  Grounding Chain and Grounding-QA Chain

From Existing 3D-VL Benchmarks to Beacon3D

🗃️ Flawed test data. We observe notable data flaws, including ambiguous referential texts in the grounding task, and ambiguous questions and incomplete answers in the QA task. Such flawed test data can undermine the reliability of evaluation results.

Flawed test data in existing 3D-VL benchmarks
Flawed test data in existing 3D-VL benchmarks: the top row shows grounding data and the bottom row shows QA data.

🗃️ Beacon3D: high-quality test data. We establish detailed annotation guidelines that ensure precise and natural language, addressing the data flaws of prior benchmarks. A human study across different 3D-VL benchmarks highlights the quality of the Beacon3D test data.

Human study on the quality of test data: the left shows grounding data and the right shows QA data.

📊 Insufficient evaluation metrics. We find that simple metrics, such as accuracy averaged over individual QA pairs, are vulnerable to pitfalls like visual ignorance and weak language robustness, and thus fall short of capturing true model capability.

Model pitfall: visual ignorance
Model pitfall: weak language robustness
We present two pilot studies to show the vulnerability of existing evaluation metrics to two model pitfalls:
  • Visual ignorance: fine-tuning blind (text-only) LLMs on SQA3D yields unexpectedly high accuracy, indicating a deficiency in evaluating the visual capability of 3D-VL models.
  • Weak language robustness: rephrasing the language yields a moderate performance shift in the grounding task and a significant shift in the QA task, indicating that current 3D-VL models are susceptible to language variations.
Performance of blind LLMs on SQA3D (the marked entry denotes a 3D-VL model)
Effect of rephrasing language on grounding (left) and QA (right) tasks

📊 Beacon3D: object-centric evaluation metrics. In contrast to previous per-case average metrics, we design three diverse test cases per object and adopt object-centric metrics, which require the model to make correct predictions in all three cases.

Data example: three grounding texts per object
Object-centric metrics
Case-centric metrics vs. Object-centric metrics
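To make the difference concrete, here is a minimal Python sketch of the two metrics; the result layout and identifiers are hypothetical and not the benchmark's actual evaluation code.

from collections import defaultdict

def per_case_accuracy(results):
    """Average correctness over individual test cases."""
    return sum(r["correct"] for r in results) / len(results)

def object_centric_accuracy(results):
    """Credit an object only if all of its test cases are correct."""
    by_object = defaultdict(list)
    for r in results:
        by_object[r["object_id"]].append(r["correct"])
    return sum(all(v) for v in by_object.values()) / len(by_object)

# Hypothetical results: three grounding cases for each of two objects.
results = [
    {"object_id": "armchair_1", "correct": True},
    {"object_id": "armchair_1", "correct": True},
    {"object_id": "armchair_1", "correct": False},
    {"object_id": "box_2", "correct": True},
    {"object_id": "box_2", "correct": True},
    {"object_id": "box_2", "correct": True},
]
print(per_case_accuracy(results))        # 0.833... under per-case averaging
print(object_centric_accuracy(results))  # 0.5 under the object-centric metric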

🔗 Beacon3D: Grounding Chain. We organize the grounding data into Grounding Chains following a coarse-to-fine scheme, which helps assess performance coherence across different granularities and identify the boundary of grounding capability.
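As a rough illustration of how a coarse-to-fine G-Chain can expose the boundary of grounding capability, the sketch below walks one chain from coarse to fine and reports where grounding first fails; the chain structure and ordering are assumptions for illustration, not the released format.

def grounding_boundary(chain_results: list[bool]) -> int:
    """Given grounding outcomes ordered from coarse to fine along one G-Chain,
    return how many levels succeed before the first failure."""
    for level, correct in enumerate(chain_results):
        if not correct:
            return level
    return len(chain_results)

# Hypothetical G-Chain outcomes for one object, ordered coarse -> fine,
# e.g. ["Large chair.", "The armchair near the printer.",
#       "The armchair directly facing the desk."]
print(grounding_boundary([True, True, False]))  # 2: grounding fails only at the finest text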

🔗 Beacon3D: Grounding-QA Chain. Beacon3D links QA data to grounding data via shared referential texts of the target object. Each question queries a specific aspect (e.g., appearance) of the object, forming a Grounding-QA Chain that enables analysis of grounding-QA coherence. We identify two types of broken coherence (see the sketch after this list):
  • Type 1: the model fails to answer the question even though it can recognize the queried content in the grounding task, indicating a lack of QA skills.
  • Type 2: the model correctly answers the question but fails to ground the target object, indicating shortcut behavior in QA.
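A minimal sketch of how paired grounding and QA outcomes on the same object could be bucketed into coherent chains, the two broken types above, and joint failures; the data layout is hypothetical.

from collections import Counter

def gqa_chain_type(grounding_correct: bool, qa_correct: bool) -> str:
    """Classify one Grounding-QA Chain by the coherence of its two outcomes."""
    if grounding_correct and qa_correct:
        return "coherent"            # grounds the object and answers correctly
    if grounding_correct:
        return "broken, type 1"      # grounds the object but fails the question
    if qa_correct:
        return "broken, type 2"      # answers without grounding (shortcut behavior)
    return "both fail"

# Hypothetical per-chain outcomes: (grounding_correct, qa_correct)
chains = [(True, True), (True, False), (False, True), (False, False)]
print(Counter(gqa_chain_type(g, q) for g, q in chains))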

Data Visualizer

Example object-centric data from the data visualizer, listing each object's grounding texts and linked QA pairs:

Example 1 (armchair):
  • Grounding 1: The armchair directly facing the desk.
  • Grounding 2: The armchair near the printer.
  • Grounding 3: Large chair.
  • Question 1: What is in front of the large chair? Answer 1: Desk.
  • Question 2: What color is the armchair near the printer? Answer 2: Brown.
  • Question 3: What size is "the armchair near the printer" compared with the other chair? Answer 3: Large.

Example 2 (box):
  • Grounding 1: The blue box next to the piano.
  • Grounding 2: Blue box.
  • Grounding 3: The lower box on the stool.
  • Question 1: What color is the lower box on the stool? Answer 1: Blue.
  • Question 2: Is there a similar box next to the blue box? Answer 2: Yes.
  • Question 3: What size is the blue box compared with the adjacent box? Answer 3: Large.

Example 3 (backpack):
  • Grounding 1: The backpack closer to chair.
  • Grounding 2: The black backpack far from piano.
  • Grounding 3: The backpack closer to desk.
  • Question 1: What color is the backpack closer to chair? Answer 1: Black.
  • Question 2: Is "the black backpack far from piano" on the couch? Answer 2: No.
  • Question 3: Is there a trash can next to "the backpack closer to desk"? Answer 3: No.
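For programmatic use of such object-centric data, one possible (purely illustrative) record layout is sketched below; the field names and schema are assumptions, not the released annotation format.

from dataclasses import dataclass, field

@dataclass
class ObjectRecord:
    """Illustrative container for one object's grounding texts and QA pairs.
    The field names are assumptions, not the released annotation schema."""
    scene_id: str
    object_id: str
    grounding_texts: list[str] = field(default_factory=list)       # referential texts
    qa_pairs: list[tuple[str, str]] = field(default_factory=list)  # (question, answer)

record = ObjectRecord(
    scene_id="scene_example",   # hypothetical identifier
    object_id="armchair_1",     # hypothetical identifier
    grounding_texts=[
        "The armchair directly facing the desk.",
        "The armchair near the printer.",
        "Large chair.",
    ],
    qa_pairs=[
        ("What is in front of the large chair?", "Desk."),
        ("What color is the armchair near the printer?", "Brown."),
    ],
)
print(len(record.grounding_texts), len(record.qa_pairs))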

Results and Findings

Metrics. Object-centric metrics reveal a significant performance drop compared with per-case metrics, suggesting that current 3D-VL models lack a comprehensive understanding of objects and are susceptible to language variations.

Model performance in grounding (left) and QA (right) tasks: "Case" denotes per-case metrics and "Obj." denotes object-centric metrics.

Chain analysis: GQA-Chain. We visualize four types of GQA-Chains and observe only a limited proportion of chains with good grounding-QA coherence. R1 and R2 measure the two types of broken coherence, and both hover around 50%. This reveals a substantial gap between grounding and QA skills, as well as frequent shortcut behavior in QA.

Chain analysis across GQA-Chains
GQA-Chain analysis: distribution of four types of GQA-Chains (left) and two metrics for evaluating broken grounding-QA coherence (right).
Chain analysis: G-Chain. Evaluation across coarse-to-fine G-Chains shows that fine-grained grounding is more challenging than coarse-grained grounding, and the difficulty is especially pronounced when the model already fails on the coarse texts. Since fine-grained grounding is crucial to solid QA performance, our chain analysis highlights the need to improve fine-grained grounding in 3D-VL models to achieve better grounding-QA coherence.
Chain analysis across G-Chains
G-Chain analysis: distribution of four types of G-Chains.

Model insights. Our evaluation indicates that incorporating LLMs into 3D-VL models weakens grounding capability and does not fundamentally enhance QA capability. This suggests that the main bottleneck lies in 3D perception and vision-language alignment rather than in language modeling or reasoning, which are the strengths of LLMs. Therefore, advancing 3D-VL models may rely more on stronger foundation models for 3D scene understanding than on leveraging LLMs.

BibTeX

If you find our work helpful, please consider citing us:

@inproceedings{huang2025unveiling,
  title={Unveiling the Mist over 3D Vision-Language Understanding: Object-centric Evaluation with Chain-of-Analysis},
  author={Huang, Jiangyong and Jia, Baoxiong and Wang, Yan and Zhu, Ziyu and Linghu, Xiongkun and Li, Qing and Zhu, Song-Chun and Huang, Siyuan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2025}
}