
Fast Convolution Automatic Performance Optimization Based on Tensor Virtual Machine


    Abstract: Convolutional neural networks (CNNs), a quintessential representative of deep learning, are the most commonly used neural networks in tasks such as computer vision. However, convolution operations typically account for over 90% of a CNN's runtime, making them the performance bottleneck. Moreover, owing to the complexity of modern hardware and the diversity of workloads, the hand-tuned optimizations of prior work often lack performance portability. To address this problem, the authors propose BlazerML, an open-source convolution computation library built on auto-generated code templates from the Tensor Virtual Machine (TVM), which can automatically generate high-performance convolution implementations for any input shape. BlazerML is implemented on top of the Winograd algorithm, the highest-performing of the fast convolution algorithms. Experimental results show that BlazerML significantly outperforms current state-of-the-art open-source libraries. For forward inference of common deep learning networks on x86 CPUs, it is 1.18–2.47x, 1.18–2.27x, and 1.01–1.66x faster than OnnxRuntime, MNN, and the TVM community version, respectively. For single-layer inference of common deep learning networks on ARM CPUs, it outperforms ACL and FastConv by 1.26–6.11x and 1.04–4.28x, respectively.
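The Winograd family of algorithms that BlazerML builds on speeds up convolution by trading multiplications for cheaper additions. As a minimal illustration (a sketch of the general technique, not BlazerML's actual implementation), the 1-D Winograd F(2,3) kernel below produces two outputs of a 3-tap convolution using four multiplications instead of the six required by the direct method:

```python
def winograd_f23(d, g):
    """1-D Winograd F(2,3): two outputs of a 3-tap correlation
    over a 4-element input tile, using only 4 multiplications."""
    d0, d1, d2, d3 = d
    g0, g1, g2 = g
    # Filter transform (precomputable whenever the filter is reused).
    G = (g0, (g0 + g1 + g2) / 2, (g0 - g1 + g2) / 2, g2)
    # Input transform (additions/subtractions only).
    D = (d0 - d2, d1 + d2, d2 - d1, d1 - d3)
    # Element-wise products: the only 4 multiplications.
    m = [Di * Gi for Di, Gi in zip(D, G)]
    # Output transform.
    return (m[0] + m[1] + m[2], m[1] - m[2] - m[3])

def direct_conv3(d, g):
    """Direct 3-tap correlation for comparison: y[i] = sum_k d[i+k]*g[k]."""
    return tuple(sum(d[i + k] * g[k] for k in range(3)) for i in range(2))
```

In practice, libraries such as BlazerML apply the 2-D analogue (e.g., F(2x2, 3x3) and larger tiles) to 3x3 convolutions, where the multiplication savings are correspondingly larger; the abstract's contribution is having TVM auto-generate and tune such kernels per input shape rather than hand-writing them.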

     
