Cross-Modal Interaction-Driven Embodied Intelligence: A Review of Vision–Language–Action Models, Data, and Platforms

Abstract: The core of embodied intelligence is to enable an agent to perceive, understand, decide, and act in real or simulated environments through a closed “vision–language–action” loop, thereby acquiring transferable skill representations. Toward this goal, the vision-language-action (VLA) paradigm unifies visual representation, language reasoning, and action control within a single decision pipeline, and is steadily moving toward open-source, reproducible practice. First, this survey examines VLA-driven embodied intelligence at the model level, comparing end-to-end VLA architectures, hierarchical planning–control architectures, and language/code-driven high-level task planners, and analyzing how they differ in generalization, interpretability, real-time deployability, and cross-embodiment transfer. Second, at the data and evaluation level, we summarize cross-robot demonstration datasets and automated data-collection pipelines, review typical evaluation benchmarks and their metrics (task success rate, long-horizon goal completion rate, and safety-constraint measures), and provide a minimal reproducible experimental protocol. Finally, at the platform level, we assemble a “simulation–real robot–evaluation” loop spanning interactive simulation, GPU-parallel physics simulation for accelerated training, real-world robot validation, and unified evaluation standards, and discuss its implications for sim-to-real deployment. Building on these three levels, we propose a “cross-modal interaction–alignment” analysis framework that characterizes error-propagation pathways among perception, language understanding and planning, action generation, and environmental feedback, and we use it to identify key open challenges and an engineering roadmap in data coverage, cross-modal robustness, energy efficiency, and safety governance.
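
To make the evaluation metrics named above concrete, the sketch below shows one common way to aggregate per-episode rollout results into a task success rate, a long-horizon goal completion rate, and a simple safety-violation count. It is a minimal illustration in Python; every name in it (Episode, summarize, the field names) is an assumption made for exposition, not an interface defined by this survey or any cited benchmark.

from dataclasses import dataclass
from typing import List

@dataclass
class Episode:
    # One evaluation rollout; all field names are illustrative.
    subgoals_total: int        # subgoals that make up the long-horizon task
    subgoals_completed: int    # subgoals the policy actually achieved
    safety_violations: int     # e.g., collisions or force-limit breaches

    @property
    def success(self) -> bool:
        # Full task success: every subgoal reached with no safety violation.
        return (self.subgoals_completed == self.subgoals_total
                and self.safety_violations == 0)

def summarize(episodes: List[Episode]) -> dict:
    # Aggregate the three metric families named in the abstract.
    n = len(episodes)
    return {
        # Task success rate: fraction of fully successful rollouts.
        "success_rate": sum(e.success for e in episodes) / n,
        # Long-horizon goal completion: mean fraction of subgoals achieved.
        "goal_completion_rate": sum(
            e.subgoals_completed / e.subgoals_total for e in episodes) / n,
        # Safety metric: mean constraint violations per episode.
        "violations_per_episode": sum(e.safety_violations for e in episodes) / n,
    }

if __name__ == "__main__":
    runs = [Episode(4, 4, 0), Episode(4, 2, 0), Episode(4, 4, 1)]
    print(summarize(runs))
    # -> {'success_rate': 0.33..., 'goal_completion_rate': 0.83..., 'violations_per_episode': 0.33...}

Reporting all three together matters because a policy can trade them off: a controller that completes most subgoals but occasionally breaches a force limit scores well on goal completion yet fails the stricter success criterion.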
