Integrated Data Analysis and Processing Platform Based on Spark and MPI

周梦兵; 李秋彦; 吴欧; 王洋

doi:10.12146/j.issn.2095-3135.20241203002

Abstract: AI workloads, typified by machine learning, are now both compute-intensive and data-intensive, a "dual-intensive" profile. Such applications must support the storage, transmission, and fault tolerance of massive data sets while also optimizing the performance of complex computational logic. A single traditional big data framework or high-performance computing (HPC) framework can no longer meet these demands. This paper proposes a high-performance hybrid big data processing platform based on Spark and MPI. Built on a typical large-scale cluster and targeting the storage and computing characteristics of dual-intensive applications, the platform centers on three modules: dual-paradigm hybrid computing, heterogeneous storage, and integrated high-performance communication. Because dual-intensive applications involve both data-intensive big data processing and compute-intensive HPC, the computing module combines the Spark and MPI paradigms: tasks are split and classified, and compute-intensive tasks are offloaded to the MPI computing module, which improves dual-paradigm hybrid computing. To handle the different types of data that arise during computation, the paper designs a heterogeneous storage structure and a strategy that separates data from metadata; storing data according to its type yields a high-performance storage system. To match the communication patterns of dual-intensive computing, the paper proposes a way of integrating high-performance communication technologies that serves both the computing and storage modules. Test results show that the platform provides efficient dual-paradigm hybrid computing for dual-intensive applications and that, compared with a standalone Spark big data platform, it improves the performance of various computing tasks by 4.2% to 17.3%.
