Abstract: Effectively transferring knowledge from pre-trained models to downstream video understanding tasks is an important topic in computer vision research. Knowledge transfer becomes more challenging in open domains due to poor data conditions. Many recent multi-modal pre-training models draw inspiration from natural language processing and perform transfer learning through prompt learning. This paper leverages the comprehension ability of large language models over open domains and proposes a domain-context-assisted method for open-domain action recognition. By enriching action labels with contextual knowledge from a large language model, the approach aligns visual representations with multi-level descriptions of human actions for robust classification. In fully supervised open-domain action recognition experiments, the method obtains a Top-1 accuracy of 71.86% on the ARID dataset and a mean average precision of 80.93% on the Tiny-VARIT dataset. More importantly, it achieves a Top-1 accuracy of 48.63% in source-free video domain adaptation and 54.36% in multi-source video domain adaptation. The experimental results demonstrate the efficacy of domain-context assistance in a variety of open-domain environments.