Abstract: Hand pose estimation from RGB images has important applications in dynamic gesture recognition and human-computer interaction. Existing methods, however, face significant challenges: the strong self-similarity of the hand and the dense distribution of its keypoints make high-precision prediction at low computational cost difficult, limiting performance in complex scenes. To address this, this paper proposes FAR-HandNet, a two-dimensional (2D) hand pose estimation model built on the YOLOv8 network. The model integrates a Focused Linear Attention module, a keypoint alignment strategy, and a regression residual fitting module, which together strengthen feature capture for small target regions such as the hand and mitigate the adverse effect of self-similarity on keypoint localization accuracy. Notably, the regression residual fitting module employs a flow-based generative model to fit the distribution of keypoint residuals, substantially improving regression accuracy. Experiments on the CMU and FreiHAND datasets show that FAR-HandNet offers clear advantages in parameter count and computational efficiency, and outperforms existing methods on PCK (Percentage of Correct Keypoints) at various thresholds, with an inference time of only 32 ms. Ablation studies further confirm the contribution of each module, verifying the effectiveness and superiority of FAR-HandNet for hand pose estimation.
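To make the linear-attention idea behind the Focused Linear Attention module concrete, the sketch below shows plain kernel-based linear attention in numpy: a non-negative feature map replaces the softmax so that keys and values can be aggregated first, reducing the cost from O(N²·d) to O(N·d²). This is a minimal illustration of the general mechanism, not the paper's module; the ReLU feature map and all shapes here are assumptions (the actual Focused Linear Attention uses a more elaborate focusing function).

```python
import numpy as np

def linear_attention(Q, K, V, eps=1e-6):
    """Softmax-free attention via a non-negative kernel feature map.

    Hypothetical simplification: phi(x) = ReLU(x) + eps stands in for
    the focused mapping used in Focused Linear Attention.
    """
    phi = lambda x: np.maximum(x, 0.0) + eps
    Qf, Kf = phi(Q), phi(K)
    # Aggregate keys/values first: O(N * d^2) instead of O(N^2 * d).
    KV = Kf.T @ V                 # (d, d_v) summary of all key-value pairs
    Z = Qf @ Kf.sum(axis=0)       # (N,) per-query normalizer
    return (Qf @ KV) / Z[:, None]

rng = np.random.default_rng(0)
N, d = 8, 4                       # toy sequence length and head dimension
Q, K, V = rng.standard_normal((3, N, d))
out = linear_attention(Q, K, V)
print(out.shape)                  # (8, 4)
```

Because the feature map is non-negative, each output row is still a normalized weighted average of the value rows, matching the naive quadratic computation exactly while touching each token only once.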