Alignment Regression Hand Pose Estimation Network Based on Focused Attention Mechanism

    Abstract: Hand pose estimation based on RGB images holds wide application prospects in dynamic gesture recognition and human-computer interaction. However, existing methods face challenges such as the high self-similarity of the hand and densely distributed keypoints, making it difficult to achieve high-precision predictions at low computational cost and limiting performance in complex scenes. To address these challenges, this paper proposes FAR-HandNet, a 2D hand pose estimation model based on the YOLOv8 network. The model integrates a focused linear attention module, a keypoint alignment strategy, and a regression residual fitting module, effectively enhancing feature capture for small target regions (e.g., hands) while mitigating the adverse effect of self-similarity on keypoint localization accuracy. In addition, the regression residual fitting module uses a flow-based generative model to fit the residual distribution of the keypoints, significantly improving regression precision. Experiments were conducted on the Carnegie Mellon University Panoptic (CMU) dataset and the FreiHAND dataset. The results show that FAR-HandNet has clear advantages in parameter count and computational efficiency, outperforming existing methods in the percentage of correct keypoints (PCK) under varying thresholds, with an inference time of only 32 ms. Ablation studies further confirm the contribution of each module, validating the effectiveness and superiority of FAR-HandNet for hand pose estimation.
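The abstract names a focused linear attention module as one of the model's components. The sketch below illustrates the general idea behind focused linear attention, not the paper's exact module (whose details are not given here): queries and keys are mapped through a non-negative "focusing" function that sharpens their direction while preserving their norm, after which attention is computed in linear rather than quadratic time. All function names and the choice of the power parameter `p` are illustrative assumptions.

```python
import numpy as np

def focusing(x, p=3.0, eps=1e-6):
    # Illustrative focusing function: take a non-negative feature map,
    # sharpen it with an elementwise power, and rescale so the row
    # norm is unchanged (only the direction becomes more "focused").
    x = np.maximum(x, 0.0) + eps
    xp = x ** p
    norm_x = np.linalg.norm(x, axis=-1, keepdims=True)
    norm_xp = np.linalg.norm(xp, axis=-1, keepdims=True)
    return xp * (norm_x / norm_xp)

def focused_linear_attention(Q, K, V, p=3.0):
    # Linear attention: compute phi(K)^T V once (d x d_v), so the cost is
    # O(N * d * d_v) instead of the O(N^2) of softmax attention.
    q, k = focusing(Q, p), focusing(K, p)
    kv = k.T @ V                               # summary of keys and values
    z = q @ k.sum(axis=0, keepdims=True).T     # per-query normalizer
    return (q @ kv) / (z + 1e-6)
```

Because the focused features are non-negative, each output row is a convex combination of the value rows, mirroring the behavior of softmax attention at a fraction of the cost for long sequences.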

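The abstract also states that the regression residual fitting module fits the keypoint residual distribution with a flow-based generative model and maximizes its likelihood. As a minimal, hypothetical stand-in for that idea (the paper's actual flow architecture is not described here), the sketch below fits a single invertible affine map z = (r - mu) / sigma to scalar residuals by gradient descent on the negative log-likelihood under a standard Gaussian base density; a real model would stack learned invertible layers instead.

```python
import numpy as np

def fit_residual_flow(residuals, steps=500, lr=0.1):
    # Simplified flow: one affine layer mapping residuals r to a base
    # Gaussian via z = (r - mu) / sigma. Per-sample NLL (up to a constant):
    #   0.5 * z**2 + log_sigma
    # We minimize the mean NLL with plain gradient descent.
    mu, log_sigma = 0.0, 0.0
    for _ in range(steps):
        sigma = np.exp(log_sigma)
        z = (residuals - mu) / sigma
        grad_mu = -np.mean(z) / sigma          # d(NLL)/d(mu)
        grad_log_sigma = 1.0 - np.mean(z ** 2)  # d(NLL)/d(log_sigma)
        mu -= lr * grad_mu
        log_sigma -= lr * grad_log_sigma
    return mu, np.exp(log_sigma)
```

At the optimum, mu and sigma recover the mean and standard deviation of the residuals, i.e. the maximum-likelihood fit of the residual distribution; richer flows extend the same training objective to non-Gaussian residual shapes.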