Abstract: Existing video action recognition methods usually rely on high-level action information, such as the spatial features of individual frames, the temporal features across frames, or human-level skeleton features. However, these high-level features cannot effectively describe how human behavior is composed of finer actions, which reduces the ability of deep learning models to recognize easily confused behaviors. This work investigates a video action recognition method based on human body parts. By learning action representations of fine-grained body parts, the method builds the video representation of human action from the bottom up. Specifically, it consists of three modules: (1) a body part feature enhancement module, which enhances the image-based features of human body parts; (2) a body part feature fusion module, which fuses the features of the individual body parts into a feature for the whole person; and (3) a body feature enhancement module, which enhances the body features of all people in the video. Experiments were conducted on the widely used UCF101 and HMDB51 datasets. The results show that the body-part-based method is complementary to current methods and can effectively improve the accuracy of human action recognition.
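
The bottom-up flow through the three modules can be illustrated with a minimal sketch. The abstract does not specify how each module is implemented, so everything below is an assumption made for illustration: enhancement is modeled as residual dot-product self-attention, fusion as average pooling over parts, and the final video vector as the mean over people. All function names are hypothetical.

```python
import math
from typing import List

Vector = List[float]


def mean_vector(vectors: List[Vector]) -> Vector:
    """Element-wise mean of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def attend(vectors: List[Vector]) -> List[Vector]:
    """Residual dot-product self-attention (an assumed stand-in for 'enhancement')."""
    dim = len(vectors[0])
    enhanced = []
    for query in vectors:
        # Scaled similarity of this vector to every vector in the set.
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(dim)
                  for key in vectors]
        top = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - top) for s in scores]
        total = sum(weights)
        context = [sum(w * v[i] for w, v in zip(weights, vectors)) / total
                   for i in range(dim)]
        # Residual connection: original feature plus attended context.
        enhanced.append([x + c for x, c in zip(query, context)])
    return enhanced


def enhance_part_features(parts: List[Vector]) -> List[Vector]:
    """Module 1 (assumed form): enhance each body part feature using the other parts."""
    return attend(parts)


def fuse_part_features(parts: List[Vector]) -> Vector:
    """Module 2 (assumed form): fuse part features into one feature per person."""
    return mean_vector(parts)


def enhance_body_features(bodies: List[Vector]) -> List[Vector]:
    """Module 3 (assumed form): enhance each person's feature using all people."""
    return attend(bodies)


def video_representation(people: List[List[Vector]]) -> Vector:
    """Parts -> person -> video, following the bottom-up order in the abstract."""
    bodies = [fuse_part_features(enhance_part_features(parts)) for parts in people]
    return mean_vector(enhance_body_features(bodies))
```

Here `people` holds one list of part feature vectors per detected person, for example two people with three 2-dimensional part features each. In practice the part features would come from an image backbone and the modules would be learned layers; the sketch only shows the order in which information is aggregated.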