Contextual Information and Large Language Model for Open-Vocabulary Indoor 3D Object Detection
Authors: ZHANG Sheng, CHENG Jun
CLC Number: TP183

    Abstract:

    Existing indoor three-dimensional (3D) object detectors can recognize only a limited set of object categories, which restricts their application to intelligent robotics. Open-vocabulary object detection can detect all objects of interest in a given scene without a predefined category list, thereby overcoming this shortcoming. Meanwhile, large language models, equipped with prior knowledge, can significantly improve the performance of visual tasks. However, existing research on open-vocabulary indoor 3D object detection focuses only on object information and ignores contextual information. The input to indoor 3D object detection is mainly a point cloud, which suffers from sparsity and noise; relying on the object point cloud alone can therefore degrade detection results. Contextual information captures the surrounding scene and complements object information, aiding the recognition of object categories. To this end, this paper proposes an open-vocabulary indoor 3D object detection algorithm assisted by contextual information. The algorithm fuses contextual information and object information through a large language model and then performs chain-of-thought reasoning. The proposed algorithm is validated on the SUN RGB-D and ScanNetV2 datasets, and the experimental results demonstrate its effectiveness.
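    As a concrete illustration of the pipeline described above, the following is a minimal Python sketch of how scene-level context and an object-level description might be fused into a chain-of-thought prompt for a large language model. This is not the authors' released code: every function name, prompt, and category list below is a hypothetical illustration, and the language model is stubbed out so the sketch runs on its own.

        from typing import Callable, List

        def build_cot_prompt(scene_caption: str, object_caption: str,
                             vocabulary: List[str]) -> str:
            # Fuse scene-level context with the object description and ask the
            # model to reason step by step before committing to a category.
            return (
                f"Scene context: {scene_caption}\n"
                f"Object description: {object_caption}\n"
                f"Candidate categories: {', '.join(vocabulary)}\n"
                "Let's think step by step. First infer the room type from the "
                "scene context, then pick the candidate category that best "
                "matches the object in that room. Answer with one category name."
            )

        def classify_object(llm: Callable[[str], str], scene_caption: str,
                            object_caption: str, vocabulary: List[str]) -> str:
            # Query the language model and match its reply against the open
            # vocabulary; fall back to the raw reply if nothing matches.
            reply = llm(build_cot_prompt(scene_caption, object_caption, vocabulary))
            reply_lower = reply.lower()
            for category in vocabulary:
                if category.lower() in reply_lower:
                    return category
            return reply.strip()

        if __name__ == "__main__":
            # Stub LLM so the sketch runs offline; a real system would call a
            # model such as GPT-4 or MiniCPM-V here.
            stub = lambda prompt: "The room is a bedroom, so this is a nightstand."
            print(classify_object(stub,
                                  "a bedroom with a bed and a lamp",
                                  "a small wooden box-shaped object beside the bed",
                                  ["chair", "nightstand", "sofa", "desk"]))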

Get Citation

ZHANG Sheng, CHENG Jun. Contextual information and large language model for open-vocabulary indoor 3D object detection [J]. Journal of Integration Technology, 2025, 14(3): 51-63.

History
  • Received: December 01, 2024
  • Revised: January 11, 2025
  • Online: May 09, 2025