Abstract: Effectively transferring knowledge from pre-trained models to downstream video understanding tasks is an important topic in computer vision research. Knowledge transfer becomes more challenging in the open world because of poor data conditions. Many recent multimodal pre-training models are inspired by natural language processing and perform transfer learning through prompt learning. In this paper, we propose an LLM-powered, domain context-assisted open-world action recognition method that leverages the open-world understanding capabilities of large language models. Our approach enriches action labels with contextual knowledge from a large language model and aligns visual representations with the resulting multi-level descriptions of human actions for robust classification. In fully supervised open-world action recognition experiments, we obtain a Top-1 accuracy of 71.86% on the ARID dataset and an mAP of 80.93% on the Tiny-VIRAT dataset. More importantly, our method achieves a Top-1 accuracy of 48.63% in source-free video domain adaptation and 54.36% in multi-source video domain adaptation.