A Study of a Text Matching Method Based on a Siamese-Network Pre-trained Language Model

卢美情; 申妍燕

doi:10.12146/j.issn.2095-3135.20220817001

A Text Matching Method Based on a Pretraining Language Model: Sentence Embeddings Using Siamese BERT-Networks

Abstract

Sentence Embeddings using Siamese BERT-Networks (SBERT), a Siamese-network pre-trained language model, has two shortcomings at the representation layer for text matching: (1) after the two text queries are encoded into vectors by the BERT encoder, they are compared directly with a simple computation; (2) this computation ignores finer-grained representations of the two queries, so it is prone to semantic drift and makes it hard to measure the importance of individual words in context. Drawing on interaction-based methods, this paper proposes SBMAA, an improved SBERT model with a multi-head attention alignment mechanism. The model first obtains the hidden-layer vectors of the two text queries from the pre-trained SBERT. It then computes the similarity matrix between the two texts and uses the attention mechanism to re-encode the tokens of each text, yielding interaction features. Finally, it pools these features and feeds them to a fully connected layer for prediction. By introducing multi-head attention alignment, the method refines interaction-based text matching, strengthens the association between similar texts, and improves matching performance. The method is evaluated on the ATEC 2018 NLP dataset and the CCKS 2018 Webank Customer Question Matching dataset. Compared with five popular text similarity matching models (ESIM, ConSERT, BERT-whitening, SimCSE, and the baseline SBERT), the proposed SBMAA model reaches F1 scores of 84.7% and 90.4% on the two datasets, 18.6% and 8.7% higher than the baseline, respectively. It also performs well on accuracy and recall and shows a degree of robustness.
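The pipeline described in the abstract (similarity matrix, attention-based re-encoding of tokens, pooling, fully connected prediction) can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names `multi_head_align` and `match_probability`, the head split without learned projections, the scaled dot-product similarity, mean pooling, and the feature concatenation are all assumptions made for illustration, and random arrays stand in for SBERT hidden states.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def multi_head_align(h_a, h_b, num_heads):
    """Hypothetical multi-head soft alignment between two token sequences.

    h_a: (len_a, dim) token hidden states of text A
    h_b: (len_b, dim) token hidden states of text B
    Returns each text re-encoded as attention-weighted mixes of the other.
    """
    len_a, dim = h_a.shape
    len_b = h_b.shape[0]
    assert dim % num_heads == 0, "dim must be divisible by num_heads"
    d_head = dim // num_heads

    # Split the hidden dimension into heads: (heads, len, d_head).
    # Assumption: no learned query/key/value projections are applied.
    a = h_a.reshape(len_a, num_heads, d_head).transpose(1, 0, 2)
    b = h_b.reshape(len_b, num_heads, d_head).transpose(1, 0, 2)

    # Per-head similarity matrix between tokens: (heads, len_a, len_b).
    sim = a @ b.transpose(0, 2, 1) / np.sqrt(d_head)

    # Each A token attends over B tokens, and each B token over A tokens.
    a_aligned = softmax(sim, axis=2) @ b
    b_aligned = softmax(sim, axis=1).transpose(0, 2, 1) @ a

    # Merge heads back to (len, dim).
    a_aligned = a_aligned.transpose(1, 0, 2).reshape(len_a, dim)
    b_aligned = b_aligned.transpose(1, 0, 2).reshape(len_b, dim)
    return a_aligned, b_aligned


def match_probability(h_a, h_b, w, bias, num_heads=4):
    """Mean-pool original and aligned states, then apply one linear layer."""
    a_aligned, b_aligned = multi_head_align(h_a, h_b, num_heads)
    features = np.concatenate(
        [h_a.mean(axis=0), a_aligned.mean(axis=0),
         h_b.mean(axis=0), b_aligned.mean(axis=0)]
    )
    logit = features @ w + bias
    return 1.0 / (1.0 + np.exp(-logit))  # sigmoid


# Random stand-ins for SBERT hidden states and untrained weights.
rng = np.random.default_rng(0)
h_a = rng.normal(size=(5, 8))   # 5 tokens, hidden size 8
h_b = rng.normal(size=(7, 8))   # 7 tokens, hidden size 8
w = rng.normal(size=32)         # 4 pooled vectors of size 8
prob = match_probability(h_a, h_b, w, bias=0.0, num_heads=4)
```

In a real system the hidden states would come from the encoder and the linear layer would be trained on labeled pairs. The sketch only shows how token-level interaction features can be formed before pooling, in contrast to comparing two already-pooled sentence vectors.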
