Fine-grained Action Recognition with Multimodal Large Language Models

Xiong Ting; Qiao Yu; Wang Yali

doi:10.12146/j.issn.2095-3135.20260514001

Abstract

Action recognition is a core task in video understanding, and fine-grained action recognition is more challenging than conventional action recognition. Methods built on multimodal large models perform well on video understanding tasks, but they pay insufficient attention to fine-grained information along the spatial and temporal dimensions of video, which leaves their action recognition results biased to some degree. To address this, we propose a fine-grained information alignment and feature fusion method that improves on the visual encoder of the video foundation model InternVideo2. The approach has three steps. First, region-of-interest feature alignment extracts and aligns fine-grained features from local regions of interest. Second, a separate visual encoding branch is added, and its encoder is fine-tuned on the extracted fine-grained features. Third, the encoded fine-grained features serve as supplementary information in the subsequent computation of action classification scores. Together, these steps improve the recognition accuracy of multimodal large models on video actions that contain fine-grained information. In fine-grained action recognition experiments, the method achieves Top-1 accuracy of 68.15% on Something-Something V1 and 73.12% on Something-Something V2, an improvement over the baseline model.
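The abstract does not give the formulas for region pooling or score fusion, so the following is only a minimal sketch of the general idea, not the authors' implementation. It assumes a plain average over a rectangular region as a stand-in for region-of-interest feature alignment, and a weighted sum of class probabilities as the fusion rule. The names `roi_average`, `fuse_scores`, and the weight `alpha` are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over one vector of class logits."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def roi_average(feature_map: Sequence[Sequence[float]],
                box: Tuple[int, int, int, int]) -> float:
    """Average a 2D feature map inside box = (top, left, bottom, right).

    Assumption: simple average pooling with end-exclusive bounds, used here
    as a simplified stand-in for region-of-interest feature alignment.
    """
    top, left, bottom, right = box
    cells = [feature_map[r][c]
             for r in range(top, bottom)
             for c in range(left, right)]
    if not cells:
        raise ValueError("box selects no cells")
    return sum(cells) / len(cells)


def fuse_scores(global_logits: Sequence[float],
                fine_logits: Sequence[float],
                alpha: float = 0.5) -> List[float]:
    """Blend class probabilities from the main and fine-grained branches.

    Assumption: alpha is a hypothetical fusion weight in [0, 1]; the source
    only states that fine-grained features supplement the score computation.
    """
    if len(global_logits) != len(fine_logits):
        raise ValueError("both branches must score the same classes")
    global_probs = softmax(global_logits)
    fine_probs = softmax(fine_logits)
    return [(1.0 - alpha) * g + alpha * f
            for g, f in zip(global_probs, fine_probs)]


if __name__ == "__main__":
    fused = fuse_scores([2.0, 1.0, 0.0], [0.0, 1.0, 3.0], alpha=0.5)
    print(max(range(len(fused)), key=fused.__getitem__))
```

With `alpha = 0` the fused scores reduce to the main branch alone, so the weight controls how much the fine-grained branch can change the prediction.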
