Contextual Information and Large Language Model for Open-Vocabulary Indoor 3D Object Detection
Authors: ZHANG Sheng, CHENG Jun
CLC Number: TP183

    Abstract:

    Existing indoor three-dimensional (3D) object detectors can recognize only a limited set of object categories, which restricts their application to intelligent robotics. Open-vocabulary object detection can detect all objects of interest in a given scene without a predefined category list, thereby overcoming this shortcoming. Meanwhile, large language models, equipped with prior knowledge, can significantly improve the performance of visual tasks. However, existing research on open-vocabulary indoor 3D object detection focuses only on object information and ignores contextual information. The input to indoor 3D object detection is mainly a point cloud, which suffers from sparsity and noise; relying on the object point cloud alone can therefore degrade detection results. Contextual information captures the surrounding scene and complements object information, aiding the recognition of object categories. To this end, this paper proposes an open-vocabulary indoor 3D object detection algorithm assisted by contextual information. The algorithm fuses contextual information and object information through a large language model and then performs chain-of-thought reasoning. The proposed algorithm is validated on the SUN RGB-D and ScanNetV2 datasets, and the experimental results demonstrate its effectiveness.
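    As a concrete illustration of the pipeline described above, the following is a minimal Python sketch of how scene-level context and an object-level description might be fused into a chain-of-thought prompt for a large language model. This is not the authors' released code: every function name, prompt, and category list below is a hypothetical illustration, and the language model is stubbed out so the sketch runs on its own.

        from typing import Callable, List

        def build_cot_prompt(scene_caption: str, object_caption: str,
                             vocabulary: List[str]) -> str:
            # Fuse scene-level context with the object description and ask the
            # model to reason step by step before committing to a category.
            return (
                f"Scene context: {scene_caption}\n"
                f"Object description: {object_caption}\n"
                f"Candidate categories: {', '.join(vocabulary)}\n"
                "Let's think step by step. First infer the room type from the "
                "scene context, then pick the candidate category that best "
                "matches the object in that room. Answer with one category name."
            )

        def classify_object(llm: Callable[[str], str], scene_caption: str,
                            object_caption: str, vocabulary: List[str]) -> str:
            # Query the language model and match its reply against the open
            # vocabulary; fall back to the raw reply if nothing matches.
            reply = llm(build_cot_prompt(scene_caption, object_caption, vocabulary))
            reply_lower = reply.lower()
            for category in vocabulary:
                if category.lower() in reply_lower:
                    return category
            return reply.strip()

        if __name__ == "__main__":
            # Stub LLM so the sketch runs offline; a real system would call a
            # model such as GPT-4 or MiniCPM-V here.
            stub = lambda prompt: "The room is a bedroom, so this is a nightstand."
            print(classify_object(stub,
                                  "a bedroom with a bed and a lamp",
                                  "a small wooden box-shaped object beside the bed",
                                  ["chair", "nightstand", "sofa", "desk"]))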

Get Citation

ZHANG Sheng, CHENG Jun. Contextual information and large language model for open-vocabulary indoor 3D object detection [J]. Journal of Integration Technology, 2025, 14(3): 51-63.

History
  • Received: December 01, 2024
  • Revised: January 11, 2025
  • Online: May 09, 2025