Abstract: Hand pose estimation from RGB images has broad applications in dynamic gesture recognition and human-computer interaction. However, the high self-similarity of hands and the dense distribution of keypoints make it difficult for existing methods to achieve high-precision predictions at low computational cost, limiting their performance in complex scenarios. To address these challenges, this paper proposes FAR-HandNet, a 2D hand pose estimation model built on the YOLOv8 network. The model integrates a focused linear attention module, a keypoint alignment strategy, and a regression residual fitting module, enhancing feature capture in small target regions (e.g., hands) while mitigating the adverse effect of self-similarity on keypoint localization accuracy. The regression residual fitting module leverages a flow-based generative model to fit the residual distribution of keypoints, significantly improving regression precision. Experiments were conducted on the Carnegie Mellon University Panoptic (CMU) dataset and the FreiHAND dataset. FAR-HandNet outperforms existing methods in the percentage of correct keypoints (PCK) across thresholds while requiring fewer parameters and less computation, and it achieves an inference time of only 32 ms. Ablation studies further validate the contribution of each module, confirming the effectiveness of FAR-HandNet for hand pose estimation.
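The abstract states only that a flow-based generative model fits the residual distribution of keypoints. The sketch below illustrates one common way such a module is formulated, assuming a residual log-likelihood setup in which the network predicts a per-keypoint mean and scale and a small coupling flow reshapes the normalized residual density; the class and function names (CouplingFlow, residual_flow_nll), the single coupling layer, and the hidden width are all illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn

class CouplingFlow(nn.Module):
    """One affine coupling layer over 2-D keypoint residuals (x, y).
    A hypothetical minimal flow; the paper's actual architecture is not specified."""
    def __init__(self, hidden=64):
        super().__init__()
        # conditioner: maps the first coordinate to a scale and shift for the second
        self.net = nn.Sequential(
            nn.Linear(1, hidden), nn.ReLU(),
            nn.Linear(hidden, 2),
        )

    def forward(self, u):
        u1, u2 = u[..., :1], u[..., 1:]          # split the residual into (x) and (y)
        s, t = self.net(u1).chunk(2, dim=-1)     # scale and shift conditioned on u1
        z2 = u2 * torch.exp(s) + t               # affine transform of the second coord
        log_det = s.squeeze(-1)                  # log|det Jacobian| of the transform
        return torch.cat([u1, z2], dim=-1), log_det

def residual_flow_nll(mu, sigma, gt, flow):
    """Negative log-likelihood of ground-truth keypoints under the
    flow-reshaped residual distribution (sketch of the general idea)."""
    u = (gt - mu) / sigma                        # normalized regression residual
    z, log_det = flow(u)                         # map residual to the base space
    base = torch.distributions.Normal(0.0, 1.0)
    log_pz = base.log_prob(z).sum(-1)            # log-density under the Gaussian base
    # change of variables: log p(u) = log p(z) + log|det|; subtract log sigma
    # to account for the normalization of the residual
    log_pu = log_pz + log_det - torch.log(sigma).sum(-1)
    return -log_pu.mean()

# Usage: minimizing this loss trains the regressor (mu, sigma) and the flow jointly.
flow = CouplingFlow()
mu, sigma = torch.zeros(8, 2), torch.ones(8, 2)  # placeholder network outputs
gt = torch.randn(8, 2)                           # placeholder ground-truth keypoints
loss = residual_flow_nll(mu, sigma, gt, flow)
loss.backward()
```

Fitting the residual density with a learned flow, rather than assuming a fixed Gaussian or Laplace error model, lets the loss adapt to the true error distribution of the keypoints, which is the usual motivation for this family of regression objectives.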