Winograd Automatic Performance Optimization Based on TVM

Authors:

Affiliations:

1. Southern University of Science and Technology; 2. Tencent Computer Systems Co., Ltd., Shenzhen; 3. Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences

Corresponding author:

Funding:

Ethical statement:


Abstract:

Convolutional Neural Networks (CNNs), as a quintessential representative of deep learning, are the most widely used neural networks in tasks such as computer vision. However, convolution operations typically account for over 90% of CNN runtime, making them the main performance bottleneck. Moreover, given the complexity of modern hardware and the diversity of workloads, the hand-crafted optimizations of prior work often lack performance portability. To address this, we present BlazerML, an open-source convolution library built on TVM template-based automatic code generation, which can automatically produce high-performance convolution implementations for arbitrary input shapes. BlazerML is built on the Winograd algorithm, the highest-performing of the fast convolution algorithms. Experimental results show that BlazerML significantly outperforms current state-of-the-art open-source libraries. On x86 CPUs, forward inference of common deep learning networks runs 1.18×–2.47×, 1.18×–2.27×, and 1.01×–1.66× faster than OnnxRuntime, MNN, and the TVM community version, respectively. On ARM CPUs, single-layer inference of common deep learning networks runs 1.26×–6.11× and 1.04×–4.28× faster than ACL and FastConv, respectively.
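The Winograd idea behind BlazerML can be illustrated with its smallest 1-D instance, F(2,3), which produces 2 convolution outputs from a 4-sample input tile and a 3-tap filter using 4 multiplications instead of the direct method's 6. The sketch below uses the standard textbook transform matrices; it is only an illustration of the algorithm family the paper builds on, not BlazerML's actual implementation, and all function names here are ours.

```python
import numpy as np

# Winograd F(2,3): 2 outputs of a 1-D valid convolution with a 3-tap
# filter from a 4-sample input tile, using 4 element-wise multiplies.
# Standard transform matrices for F(2,3):
BT = np.array([[1,  0, -1,  0],     # input transform B^T
               [0,  1,  1,  0],
               [0, -1,  1,  0],
               [0,  1,  0, -1]], dtype=np.float64)
G = np.array([[1.0,  0.0, 0.0],     # filter transform G
              [0.5,  0.5, 0.5],
              [0.5, -0.5, 0.5],
              [0.0,  0.0, 1.0]])
AT = np.array([[1, 1,  1,  0],      # output transform A^T
               [0, 1, -1, -1]], dtype=np.float64)

def winograd_f23(d, g):
    """d: input tile of 4 samples, g: 3-tap filter -> 2 outputs."""
    U = G @ g        # transformed filter, shape (4,)
    V = BT @ d       # transformed input tile, shape (4,)
    M = U * V        # the 4 element-wise multiplications
    return AT @ M    # inverse transform -> 2 outputs

def direct_conv(d, g):
    """Reference: direct valid correlation, y[i] = sum_k g[k]*d[i+k]."""
    return np.array([np.dot(g, d[i:i + 3]) for i in range(2)])
```

In a 2-D layer the same structure becomes F(2×2, 3×3) applied per tile, where the multiplication stage turns into batched matrix multiplies; that stage is where TVM's template tuning has the most room to specialize for a given shape and CPU.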

Cite this article

Chen Jiang, Zhu Honglin, Meng Jintao. Winograd Automatic Performance Optimization Based on TVM [J]. Journal of Integration Technology.

History
  • Received: 2024-02-02
  • Revised: 2024-02-02
  • Accepted:
  • Published online: 2024-03-28
  • Published in print: