

Cross-Modal Interaction-Driven Embodied Intelligence: A Review of Vision–Language–Action Models, Data, and Platforms


Abstract: The core of embodied intelligence is to enable an agent to perceive, understand, decide, and act in real or simulated environments through a "vision–language–action" loop, thereby acquiring transferable skill representations. Toward this goal, the vision–language–action (VLA) paradigm is unifying visual representation, language reasoning, and action control within a single decision pipeline, and is gradually moving toward open-source, reproducible practice. First, this work examines VLA-driven embodied intelligence at the model level by comparing end-to-end VLA architectures, planning–control hierarchical architectures, and language/code-driven high-level task planners, and analyzes their differences in generalization ability, interpretability, real-time deployability, and cross-morphology transferability. Second, at the data and evaluation level, we summarize cross-robot demonstration datasets and automated data-collection pipelines, outline typical evaluation benchmarks (such as task success rate, long-horizon goal completion rate, and safety-constraint metrics), and provide a minimal reproducible experimental protocol. Finally, at the platform level, we construct a "simulation–real robot–evaluation" loop that spans interactive simulation, GPU-parallelized physics-accelerated training, real-world robot validation, and unified evaluation standards, and we discuss its significance for Sim-to-Real deployment. On this basis, we further propose a "cross-modal interaction and alignment" analysis framework that characterizes the error-propagation pathways among perception, language understanding and planning, action generation, and environmental feedback. We then summarize key future challenges and engineering roadmaps in areas such as data coverage, cross-modal robustness, energy efficiency, and safety governance.
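To make the evaluation metrics named above concrete, the sketch below shows one plausible way to aggregate them from per-episode rollout records. This is an illustrative assumption, not the paper's actual protocol: the `Episode` record, the subgoal-based definition of long-horizon completion, and the metric names are all hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Episode:
    """One evaluation rollout of a long-horizon task (hypothetical record format)."""
    subgoals_total: int        # number of subgoals the task decomposes into
    subgoals_done: int         # how many subgoals the agent actually completed
    safety_violations: int = 0 # count of safety-constraint breaches in the rollout

    @property
    def success(self) -> bool:
        # Count an episode as a full success only if every subgoal is
        # completed and no safety constraint was violated.
        return (self.subgoals_done == self.subgoals_total
                and self.safety_violations == 0)


def evaluate(episodes: List[Episode]) -> Dict[str, float]:
    """Aggregate the three metric families mentioned in the abstract."""
    n = len(episodes)
    return {
        # Fraction of episodes that fully succeed.
        "task_success_rate": sum(e.success for e in episodes) / n,
        # Mean fraction of subgoals reached, giving partial credit
        # on long-horizon tasks that fail midway.
        "long_horizon_goal_rate": sum(
            e.subgoals_done / e.subgoals_total for e in episodes) / n,
        # Mean number of safety-constraint violations per episode.
        "safety_violations_per_episode": sum(
            e.safety_violations for e in episodes) / n,
    }
```

Separating full success from partial subgoal credit matters for long-horizon tasks: a policy that always completes half the subgoals and one that succeeds on half the episodes have the same goal-completion rate but very different success rates.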

     

