Cross-Modal Interaction-Driven Embodied Intelligence: A Review of Vision–Language–Action Models, Data, and Platforms
Graphical Abstract
Abstract
The core of embodied intelligence is to enable an agent to perceive, understand, decide, and act in real or simulated environments through a "vision–language–action" loop, thereby acquiring transferable skill representations. Toward this goal, the vision–language–action (VLA) paradigm unifies visual representation, language reasoning, and action control within a single decision pipeline and is gradually moving toward openness and reproducibility. First, at the model level, this review compares end-to-end VLA architectures, hierarchical planning–control architectures, and language- or code-driven high-level task planners, and analyzes their differences in generalization, interpretability, real-time deployability, and cross-morphology transfer. Then, at the data and evaluation level, we summarize cross-robot demonstration datasets and automated data-collection pipelines, outline typical evaluation benchmarks and metrics (such as task success rate, long-horizon goal completion rate, and safety-constraint metrics), and provide a minimal reproducible experimental protocol. Finally, at the platform level, we construct a "simulation–real robot–evaluation" loop that spans interactive simulation, GPU-parallelized physics-accelerated training, real-world robot validation, and unified evaluation standards, and discuss its significance for Sim-to-Real deployment. On this basis, we further propose a "cross-modal interaction and alignment" analysis framework that characterizes error-propagation pathways among perception, language understanding and planning, action generation, and environmental feedback. We then summarize key open challenges and engineering roadmaps in areas such as data coverage, cross-modal robustness, energy efficiency, and safety governance.
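As a minimal illustration of the evaluation metrics named above, the following Python sketch shows one way to aggregate per-episode results into a task success rate, a long-horizon goal completion rate, and a mean safety-violation count. The EpisodeResult fields and function names are illustrative assumptions, not the protocol defined in this review or the API of any specific benchmark.

from dataclasses import dataclass
from typing import List

@dataclass
class EpisodeResult:
    # Illustrative per-episode record; field names are assumptions, not a benchmark API.
    subgoals_total: int       # number of subgoals in the long-horizon task
    subgoals_completed: int   # subgoals the policy actually finished
    success: bool             # whether the full task was completed within the step budget
    safety_violations: int    # e.g. collisions or force-limit breaches logged during the run

def task_success_rate(episodes: List[EpisodeResult]) -> float:
    """Fraction of episodes in which the whole task succeeded."""
    return sum(e.success for e in episodes) / len(episodes)

def goal_completion_rate(episodes: List[EpisodeResult]) -> float:
    """Average fraction of subgoals completed, giving partial credit per episode."""
    return sum(e.subgoals_completed / e.subgoals_total for e in episodes) / len(episodes)

def mean_safety_violations(episodes: List[EpisodeResult]) -> float:
    """Mean number of safety-constraint violations per episode."""
    return sum(e.safety_violations for e in episodes) / len(episodes)

# Example with three hypothetical evaluation episodes.
results = [
    EpisodeResult(subgoals_total=4, subgoals_completed=4, success=True,  safety_violations=0),
    EpisodeResult(subgoals_total=4, subgoals_completed=2, success=False, safety_violations=1),
    EpisodeResult(subgoals_total=4, subgoals_completed=3, success=False, safety_violations=0),
]
print(task_success_rate(results))       # ~0.33
print(goal_completion_rate(results))    # 0.75
print(mean_safety_violations(results))  # ~0.33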