Cross-Modal Interaction-Driven Embodied Intelligence: A Review of Vision–Language–Action Models, Data, and Platforms
Graphical Abstract
Abstract
The core of embodied intelligence is to enable an agent to perceive, understand, decide, and act in real or simulated environments through a "vision–language–action" loop, thereby acquiring transferable skill representations. Toward this goal, the vision–language–action (VLA) paradigm unifies visual representation, language reasoning, and action control within a single decision pipeline and is gradually moving toward openness and reproducibility. First, at the model level, this review compares end-to-end VLA architectures, hierarchical planning–control architectures, and language- or code-driven high-level task planners, and analyzes their differences in generalization, interpretability, real-time deployability, and cross-morphology transfer. Then, at the data and evaluation level, we summarize cross-robot demonstration datasets and automated data-collection pipelines, outline typical evaluation benchmarks and metrics (such as task success rate, long-horizon goal completion rate, and safety-constraint metrics), and provide a minimal reproducible experimental protocol. Finally, at the platform level, we construct a "simulation–real robot–evaluation" loop that spans interactive simulation, GPU-parallelized physics-accelerated training, real-world robot validation, and unified evaluation standards, and discuss its significance for Sim-to-Real deployment. On this basis, we further propose a "cross-modal interaction and alignment" analysis framework that characterizes error-propagation pathways among perception, language understanding and planning, action generation, and environmental feedback. We then summarize key open challenges and engineering roadmaps in areas such as data coverage, cross-modal robustness, energy efficiency, and safety governance.
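As a minimal illustration of the evaluation metrics named above, the following Python sketch shows one way to aggregate per-episode results into a task success rate, a long-horizon goal completion rate, and a mean safety-violation count. The EpisodeResult fields and function names are illustrative assumptions, not the protocol defined in this review or the API of any specific benchmark.

from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeResult:
    # Illustrative per-episode record; field names are assumptions, not a benchmark API.
    subgoals_total: int       # number of subgoals in the long-horizon task
    subgoals_completed: int   # subgoals the policy actually finished
    success: bool             # whether the full task was completed within the step budget
    safety_violations: int    # e.g. collisions or force-limit breaches logged during the run

def task_success_rate(episodes: List[EpisodeResult]) -> float:
    """Fraction of episodes in which the whole task succeeded."""
    return sum(e.success for e in episodes) / len(episodes)

def goal_completion_rate(episodes: List[EpisodeResult]) -> float:
    """Average fraction of subgoals completed, giving partial credit per episode."""
    return sum(e.subgoals_completed / e.subgoals_total for e in episodes) / len(episodes)

def mean_safety_violations(episodes: List[EpisodeResult]) -> float:
    """Mean number of safety-constraint violations per episode."""
    return sum(e.safety_violations for e in episodes) / len(episodes)

# Example with three hypothetical evaluation episodes.
results = [
    EpisodeResult(subgoals_total=4, subgoals_completed=4, success=True,  safety_violations=0),
    EpisodeResult(subgoals_total=4, subgoals_completed=2, success=False, safety_violations=1),
    EpisodeResult(subgoals_total=4, subgoals_completed=3, success=False, safety_violations=0),
]
print(task_success_rate(results))       # ~0.33
print(goal_completion_rate(results))    # 0.75
print(mean_safety_violations(results))  # ~0.33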