Abstract:
Lightweight convolutional neural networks designed for mobile devices offer fast inference but are constrained by their inherent locality: information can only be captured within a local window, which degrades performance. Introducing a self-attention mechanism can capture global information, but it reduces detection speed. To address these issues, this paper introduces a hardware-friendly MobileNetV4 backbone into YOLOv8, built on the universal inverted bottleneck (UIB) block, which unifies the inverted bottleneck, the ConvNeXt block, the feed-forward network, and a novel extra-depthwise convolution variant. In addition, a dynamic upsampling operator is introduced to improve the upsampling operation, reducing the model's GPU memory usage and latency. Furthermore, the YOLOv8 detection head is enhanced with a dynamic detection head that combines spatial awareness, scale awareness, and task awareness in a unified framework; it applies the attention mechanism effectively within the object detection head, improving both detection performance and efficiency. Experimental results show that, compared with the baseline YOLOv8n, the proposed YOLOv8n_M improves mean Average Precision (mAP@0.5:0.95) by 1.3%. In terms of model complexity, YOLOv8n_M compresses the model size by 36% (a reduction of 1 million parameters) and cuts computational cost by 26% (a reduction of 2.4 GFLOPs). The proposed YOLOv8n_M thus reduces the model's parameter count and inference time while improving object detection accuracy across various environments.