Abstract: Metaphor serves to inspire understanding and to persuade others. Metaphors increasingly appear in multimodal form, integrating text, images, and video, so identifying the metaphorical semantics contained in multimodal content has research value for Internet content security. Because multimodal metaphor datasets are scarce, it is difficult to build research models, and scholars have so far paid more attention to text-based metaphor detection. To address this gap, this paper first constructs a new multimodal metaphor dataset annotated from the perspectives of image-text, metaphor occurrence, emotion expression, and author intention. Kappa scores are then used to assess the consistency among the annotators of the dataset. Finally, to verify the quality and value of the dataset, a multimodal metaphor detection model is constructed that combines image attribute features, image entity features, and text features with the help of a pre-trained model and an attention mechanism. The experimental results show that annotating the metaphor dataset with emotion and intention improves the effectiveness of metaphor detection, and confirm that the interrelationship of multimodal information helps in understanding metaphor.
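As an illustration of the annotator-consistency check mentioned above, the following is a minimal sketch of Cohen's kappa for two annotators. The abstract does not state which kappa variant was used or how many annotators took part, so the choice of Cohen's kappa, the function name `cohen_kappa`, and the binary labels (1 = metaphorical, 0 = literal) are all assumptions made for illustration only, and the label lists are made-up toy data rather than the paper's annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items.

    Hypothetical helper for illustration; not the paper's actual procedure.
    """
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both annotators must label the same non-empty item set")

    n = len(labels_a)

    # Observed agreement: fraction of items given the same label by both.
    p_observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal frequencies, summed over all labels.
    counts_a = Counter(labels_a)
    counts_b = Counter(labels_b)
    p_chance = sum(
        (counts_a[label] / n) * (counts_b[label] / n)
        for label in set(counts_a) | set(counts_b)
    )

    # Degenerate case: both annotators used one identical label throughout.
    if p_chance == 1.0:
        return 1.0

    return (p_observed - p_chance) / (1.0 - p_chance)


# Toy annotations (assumed data): 1 = metaphorical, 0 = literal.
annotator_1 = [1, 1, 0, 0, 1, 0, 1, 1]
annotator_2 = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohen_kappa(annotator_1, annotator_2)
```

In this toy case the annotators agree on 7 of 8 items (observed agreement 0.875) and chance agreement is 0.5, giving a kappa of 0.75. With more than two annotators, a multi-rater statistic such as Fleiss' kappa would be the usual substitute.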