Integrated Data Analysis and Processing Platform Based on Spark and MPI

Author:

Affiliation:

1. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences; 2. University of Chinese Academy of Sciences; 3. Software Institute, Nanjing University; 4. Shenzhen University of Advanced Technology

CLC Number: G

Fund Project:

Guangdong Province Key-Area R&D Program, "Software, Chips, and Computing" Major Science and Technology Project (No. 2021B0101400005); Shenzhen industrialization-oriented application research project "Cloud Computing Architecture and Platform for Human-Cyber-Physical Integration" (No. CJGJZD20230724093659004); Big Data and AI Joint Laboratory of Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, and Shenzhen Guangdao Digital Co., Ltd. (No. E3Z092); "Edge-Cloud Intelligent Collaborative Computing Methods and Applications for C-V2X", Shenzhen Science and Technology Program (Shenzhen-Hong Kong-Macao Category C, No. SGDX20220530111001003)


    Abstract:

    Currently, AI application workloads, represented by machine learning, exhibit a dual-density characteristic, combining both compute-intensive and data-intensive traits. These applications not only require support for the storage, transmission, and fault tolerance of massive data but also need to optimize the performance of complex logical computations. Traditional single-purpose big data frameworks or high-performance computing frameworks can no longer meet the challenges posed by these applications. This paper proposes a hybrid high-performance big data processing platform based on Spark and MPI. Built on a typical large-scale cluster and targeting the storage and computing characteristics of dual-density applications such as machine learning, the platform comprises three key modules: dual-paradigm hybrid computation, heterogeneous storage, and integrated high-performance communication. To address the dual-density nature of these applications, which involve both data-intensive big data processing and compute-intensive high-performance computing, a computation module combining the Spark and MPI paradigms is designed: by splitting and classifying tasks, compute-intensive tasks are offloaded to the MPI computation module, enhancing the dual-paradigm hybrid computation capability. To address the differing characteristics of data during computation, a heterogeneous storage structure and a data-metadata separation strategy are designed; optimizing storage through data classification yields a highly available, high-performance storage system. In response to the communication needs of dual-density computing, an integrated high-performance communication technique is proposed, providing strong communication support for the computing and storage modules. Test results show that this platform provides efficient dual-paradigm hybrid computation for dual-density applications, achieving performance improvements of 4.2% to 17.3% over a standalone Spark big data platform across various computation tasks.
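The task splitting and classification step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the arithmetic-intensity heuristic, the threshold value, and the task names are all hypothetical, and the Spark and MPI execution paths are stubbed out (a real deployment would submit a Spark stage or launch an `mpirun` job) so that only the routing logic is shown.

```python
from dataclasses import dataclass


@dataclass
class Task:
    name: str
    flops: float     # estimated floating-point operations
    bytes_io: float  # estimated data volume read/written


def classify(task: Task, threshold: float = 10.0) -> str:
    """Route a task by its compute-to-data ratio (hypothetical heuristic).

    Tasks whose arithmetic intensity (FLOPs per byte moved) exceeds the
    threshold are treated as compute-intensive and offloaded to the MPI
    path; the rest stay on the data-intensive Spark path.
    """
    intensity = task.flops / max(task.bytes_io, 1.0)
    return "mpi" if intensity > threshold else "spark"


def dispatch(tasks):
    """Split a job's task list into the two execution paths."""
    plan = {"spark": [], "mpi": []}
    for t in tasks:
        plan[classify(t)].append(t.name)
    return plan


if __name__ == "__main__":
    tasks = [
        Task("etl_shuffle", flops=1e9, bytes_io=1e9),    # data-heavy
        Task("matrix_solve", flops=1e12, bytes_io=1e9),  # compute-heavy
    ]
    print(dispatch(tasks))
```

In practice the classification signal would come from profiling or operator annotations rather than static FLOP estimates, but the split-then-offload structure is the same.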

Cite this article:

ZHOU Mengbing, LI Qiuyan, WU Ou, et al. Integrated Data Analysis and Processing Platform Based on Spark and MPI [J]. Journal of Integration Technology.
分享
History
  • Received: 2024-12-03
  • Revised: 2025-04-10
  • Accepted: 2025-04-25
  • Published online: 2025-05-07