Fast Convolution Automatic Performance Optimization Based on Tensor Virtual Machine
Author:
Affiliation:

Author Biography:

Corresponding Author:

CLC Number:

TP 399

Fund Project:

This work is supported by the Key-Area Research and Development Program of Guangdong Province (2021B0101310002), the National Natural Science Foundation of China (62272449), the Shenzhen Basic Research Foundation (RCYX20200714114734194, KQTD20200820113106007, ZDSYS20220422103800001), and the Youth Innovation Promotion Association, CAS (Y2021101)



Abstract:

Convolutional neural networks (CNNs), a quintessential form of deep learning, are the most widely used neural networks in tasks such as computer vision. However, convolution operations typically account for more than 90% of a CNN's runtime, making them the network's performance bottleneck. Moreover, given the complexity of modern hardware and the diversity of workloads, the specialized optimizations of prior work often lack performance portability. To address this, the authors propose BlazerML, an open-source convolution library built on template-based code auto-generation in the Tensor Virtual Machine (TVM), which automatically generates high-performance convolution implementations for any input shape. BlazerML is built on the Winograd algorithm, the highest-performing of the fast convolution algorithms. Experimental results show that BlazerML significantly outperforms state-of-the-art open-source libraries: on x86 CPUs, forward inference of common deep learning networks runs 1.18-2.47x, 1.18-2.27x, and 1.01-1.66x faster than OnnxRuntime, MNN, and the TVM community version, respectively; on ARM CPUs, single-layer inference of common deep learning networks runs 1.26-6.11x and 1.04-4.28x faster than ACL and FastConv, respectively.

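The Winograd algorithm named in the abstract trades multiplications for cheap additions via input, filter, and output transforms. As an illustrative sketch only (not BlazerML's actual TVM-generated code, which uses tiled 2-D variants), the smallest 1-D case F(2,3) computes two outputs of a 3-tap filter with 4 multiplies instead of the direct method's 6:

```python
# Minimal 1-D Winograd F(2,3) sketch: 2 outputs of a 3-tap correlation
# per 4-element input tile, using 4 elementwise multiplies instead of 6.
import numpy as np

# Standard F(2,3) transform matrices (Winograd minimal filtering).
BT = np.array([[1,  0, -1,  0],
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)   # input transform
G = np.array([[1.0,  0.0, 0.0],
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]], dtype=np.float64)   # filter transform
AT = np.array([[1, 1,  1,  0],
               [0, 1, -1, -1]], dtype=np.float64)    # output transform

def winograd_f23(d, g):
    """Y = A^T [(G g) * (B^T d)] for a length-4 input tile d and
    length-3 filter g; equals the direct 'valid' correlation."""
    U = G @ g          # transformed filter (4 values)
    V = BT @ d         # transformed input tile (4 values)
    return AT @ (U * V)  # 4 multiplies, then inverse transform

d = np.array([1.0, 2.0, 3.0, 4.0])
g = np.array([1.0, 1.0, 1.0])
print(winograd_f23(d, g))                 # -> [6. 9.]
print(np.convolve(d, g[::-1], 'valid'))   # direct reference -> [6. 9.]
```

In a real 2-D convolution library the same idea is applied per tile (e.g. F(2x2, 3x3) needs 16 multiplies instead of 36), and the filter transform `G @ g` is precomputed once per layer; the transform choice and tiling are what BlazerML's auto-generated templates tune per input shape.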

Cite This Article

Citing format
CHEN Jiang, ZHU Honglin, MENG Jintao, et al. Fast Convolution Automatic Performance Optimization Based on Tensor Virtual Machine[J]. Journal of Integration Technology, 2024, 13(5): 3-18


History
  • Received: 2024-02-02
  • Revised: 2024-02-02
  • Accepted:
  • Published Online: 2024-09-24
  • Published: