Fast Convolution Automatic Performance Optimization Based on Tensor Virtual Machine

doi:10.12146/j.issn.2095-3135.20240202001

Volume 13, Issue 5, 2024, pp. 3-18

CSTR: 32239.14.2024.05.002
CLC Number: TP 399
Fund Project: This work is supported by the Key Research and Development Project of Guangdong Province (2021B0101310002), the National Natural Science Foundation of China (62272449), the Shenzhen Basic Research Foundation (RCYX20200714114734194, KQTD20200820113106007, ZDSYS20220422103800001), and the Youth Innovation Promotion Association, CAS (Y2021101).

Abstract:

Convolutional Neural Networks (CNNs), a quintessential representation of deep learning, are the most commonly used neural networks in tasks such as computer vision. However, convolution operations typically account for over 90% of the runtime in CNNs, making them a performance bottleneck. In addition, because current hardware is complex and workloads are diverse, the hand-tuned optimizations in previous work often lack performance portability. To address this problem, the authors introduce BlazerML, an open-source convolution library based on code templates auto-generated by TVM, which can automatically generate high-performance convolution implementations for any input shape. BlazerML is built on the Winograd algorithm, one of the highest-performing fast convolution algorithms. Experimental results show that BlazerML significantly outperforms current state-of-the-art open-source libraries. On x86 CPUs, for forward inference of common deep learning networks, it is 1.18 to 2.47 times, 1.18 to 2.27 times, and 1.01 to 1.66 times faster than OnnxRuntime, MNN, and the TVM community version, respectively. On ARM CPUs, for single-layer inference of common deep learning networks, it outperforms ACL and FastConv by factors of 1.26 to 6.11 and 1.04 to 4.28, respectively.
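To give a sense of why the Winograd algorithm counts as a fast convolution method, the sketch below works through the smallest textbook case, F(2,3): two outputs of a 3-tap filter computed with 4 multiplications instead of the 6 a direct sliding dot product needs. This is an illustration only, not BlazerML's code; the abstract does not state which tile sizes the library uses, and the function names `direct_conv_1d` and `winograd_f23` are hypothetical.

```python
def direct_conv_1d(d, g):
    """Reference: sliding dot product (the CNN-style convolution)."""
    taps = len(g)
    return [sum(d[i + k] * g[k] for k in range(taps))
            for i in range(len(d) - taps + 1)]


def winograd_f23(d, g):
    """Winograd F(2,3): 2 outputs from 4 inputs and a 3-tap filter.

    Uses 4 data-times-filter multiplications. The filter-side terms
    (u0..u3) depend only on g, so in a real library they would be
    precomputed once per filter rather than per tile.
    """
    d0, d1, d2, d3 = d
    g0, g1, g2 = g

    # Filter transform (precomputable).
    u0 = g0
    u1 = (g0 + g1 + g2) / 2
    u2 = (g0 - g1 + g2) / 2
    u3 = g2

    # Input transform followed by 4 element-wise multiplications.
    m1 = (d0 - d2) * u0
    m2 = (d1 + d2) * u1
    m3 = (d2 - d1) * u2
    m4 = (d1 - d3) * u3

    # Output transform.
    return [m1 + m2 + m3, m2 - m3 - m4]


if __name__ == "__main__":
    d = [1.0, 2.0, 3.0, 4.0]
    g = [2.0, 4.0, 6.0]
    print(direct_conv_1d(d, g))
    print(winograd_f23(d, g))
```

Both functions should agree up to floating-point rounding. Practical 2D implementations nest this transform over tiles of the input feature map, and the best tile size depends on the input shape and the hardware, which is the kind of choice an auto-tuning framework can search over.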

CHEN Jiang, ZHU Honglin, MENG Jintao, et al. Fast Convolution Automatic Performance Optimization Based on Tensor Virtual Machine[J]. Journal of Integration Technology, 2024, 13(5): 3-18.

History

Received: February 2, 2024
Revised: February 2, 2024
Online: September 24, 2024
