Abstract:
Action recognition is a critical task in video understanding, and fine-grained action recognition poses greater challenges than conventional action recognition. While existing multimodal large models have achieved notable success in video understanding tasks, they remain limited in capturing fine-grained spatial and temporal information, which leads to suboptimal performance. To address this issue, we propose an enhanced fine-grained information alignment and feature fusion method built on the visual encoder of the video foundation model InternVideo2. Our approach consists of three key steps: (1) Region-of-Interest Feature Alignment, which extracts and aligns fine-grained features from localized regions of interest; (2) an Auxiliary Encoding Branch, a dedicated visual encoding branch fine-tuned on the extracted fine-grained features; and (3) Feature Fusion for Classification, which incorporates the encoded fine-grained features as supplementary cues to improve action classification scoring. Extensive experiments on fine-grained action recognition benchmarks demonstrate the effectiveness of our method, which achieves 68.15% (Something-Something V1) and 73.12% (Something-Something V2) Top-1 accuracy, significantly outperforming the baseline model.
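
As a rough illustration of the third step, the sketch below shows one possible form of score-level fusion, in which class scores from the main encoder and from the auxiliary fine-grained branch are combined by a weighted sum. The abstract does not specify the fusion rule, so the weighted-sum form, the function names `fuse_scores` and `predict`, and the weight `alpha` are all assumptions made for illustration only, not the method's actual implementation.

```python
import math


def softmax(logits):
    """Convert raw class logits into probabilities (numerically stable)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def fuse_scores(global_logits, roi_logits, alpha=0.5):
    """Hypothetical fusion rule: weighted sum of the two branches' probabilities.

    global_logits: class logits from the main visual encoder (assumed input).
    roi_logits:    class logits from the auxiliary fine-grained branch (assumed input).
    alpha:         assumed weight in [0, 1] given to the auxiliary branch.
    """
    if len(global_logits) != len(roi_logits):
        raise ValueError("both branches must score the same set of classes")
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    p_global = softmax(global_logits)
    p_roi = softmax(roi_logits)
    return [(1.0 - alpha) * g + alpha * r for g, r in zip(p_global, p_roi)]


def predict(global_logits, roi_logits, alpha=0.5):
    """Return the index of the class with the highest fused score."""
    fused = fuse_scores(global_logits, roi_logits, alpha)
    return max(range(len(fused)), key=fused.__getitem__)
```

With `alpha=0` the prediction reduces to that of the main encoder alone; larger values let the fine-grained cues change the predicted class when the auxiliary branch is confident.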