《山东大学学报(理学版)》(Journal of Shandong University, Natural Science), 2026, Vol. 61, Issue (3): 86-95. DOI: 10.6040/j.issn.1671-9352.0.2024.230
CHEN Zhongyuan, LU Chong*
Abstract: To address the limitations of existing models in mining inter-modal correlations, fusing features, and updating labels, a center moment discrepancy multimodal sentiment analysis method based on a self-attention mechanism (SA-CMD) is proposed. First, an encoder encodes the extracted feature sequences, and a self-attention mechanism dynamically adjusts the weight of each modality's features to capture complex inter-modal dependencies. Second, a central moment discrepancy method is introduced to dynamically optimize the feature representations and label distributions, strengthening the model's robustness. During feature fusion, the distance differences between modality features and their positive and negative centers are computed to generate more accurate feature labels, further improving the quality of the fused features. Finally, a linear layer projects the fused features into a low-dimensional space for prediction. Experimental results show that SA-CMD outperforms existing baseline models on all evaluation metrics on the public CMU-MOSI and CMU-MOSEI datasets, with particularly strong results on the correlation coefficient, binary classification accuracy, and seven-class classification accuracy. These results further verify the key role of the self-attention mechanism and the central moment discrepancy method in improving model performance, and demonstrate the effectiveness and robustness of the SA-CMD model for multimodal sentiment analysis.
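The self-attention step described above, which re-weights each modality's features by their relevance to the others, can be sketched minimally as plain scaled dot-product attention. This is an illustrative sketch, not the paper's implementation; all names are hypothetical:

```python
import math

def self_attention(features):
    """Scaled dot-product self-attention over a small set of modality
    feature vectors (e.g. text, audio, video): each output vector is a
    softmax-weighted mixture of all modality vectors, so modalities
    that correlate strongly receive higher weight."""
    d = len(features[0])
    dot = lambda a, b: sum(p * q for p, q in zip(a, b))
    out = []
    for q in features:
        # similarity of this modality's query to every modality's key
        scores = [dot(q, k) / math.sqrt(d) for k in features]
        m = max(scores)
        exps = [math.exp(s - m) for s in scores]   # numerically stable softmax
        w = [e / sum(exps) for e in exps]          # attention weights, sum to 1
        out.append([sum(wi * v[j] for wi, v in zip(w, features))
                    for j in range(d)])
    return out
```

If all modality vectors agree, attention leaves them unchanged; disagreement pulls each representation toward the modalities it is most similar to.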
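Central moment discrepancy (Zellinger et al., 2017, reference [20]) measures how far apart two feature distributions are: the distance between their means plus distances between their higher-order central moments. A minimal sketch, omitting the per-order normalization over the feature range that the original formulation includes:

```python
import math

def central_moment_discrepancy(x, y, k_max=5):
    """CMD between two samples x, y, each a list of equal-length feature
    vectors; lower values indicate more similar distributions."""
    dim = len(x[0])
    mean = lambda s, j: sum(v[j] for v in s) / len(s)
    mx = [mean(x, j) for j in range(dim)]
    my = [mean(y, j) for j in range(dim)]
    # first-order term: Euclidean distance between the sample means
    cmd = math.dist(mx, my)
    # higher-order terms: distances between central moments of order 2..k_max
    for k in range(2, k_max + 1):
        cx = [sum((v[j] - mx[j]) ** k for v in x) / len(x) for j in range(dim)]
        cy = [sum((v[j] - my[j]) ** k for v in y) / len(y) for j in range(dim)]
        cmd += math.dist(cx, cy)
    return cmd
```

Minimizing this quantity between modality representations encourages them to match in distribution, which is how CMD-style terms regularize the fused features.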
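The positive/negative center distance used for label generation can be illustrated with a signed relative-distance score. This is a hypothetical form written for illustration; the paper's exact label-update rule may differ:

```python
import math

def center_distance_label(feat, pos_center, neg_center, eps=1e-8):
    """Signed relative distance of a modality feature to its positive
    and negative sentiment centers: values near +1 mean the feature
    lies close to the positive center, near -1 close to the negative
    center (hypothetical sketch, not the paper's exact rule)."""
    d_pos = math.dist(feat, pos_center)
    d_neg = math.dist(feat, neg_center)
    return (d_neg - d_pos) / (d_neg + d_pos + eps)  # in (-1, 1)
```

A score of this kind can then be used to shift a modality's pseudo-label toward the sentiment polarity its features actually express.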