JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2023, Vol. 58 ›› Issue (12): 31-40, 51. doi: 10.6040/j.issn.1671-9352.1.2022.421


Multimodal sentiment analysis based on text-guided hierarchical adaptive fusion

Chan LU1,2, Junjun GUO1,2,*, Kaiwen TAN1,2, Yan XIANG1,2, Zhengtao YU1,2

1. Faculty of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
    2. Yunnan Key Laboratory of Artificial Intelligence, Kunming University of Science and Technology, Kunming 650500, Yunnan, China
  • Received: 2022-09-29  Online: 2023-12-20  Published: 2023-12-19
  • Contact: Junjun GUO  E-mail: 904943362@qq.com; guojjgb@163.com

Abstract:

This paper proposes a multimodal hierarchical fusion method guided by the text modality, which uses textual information to drive the hierarchical, adaptive screening and fusion of multimodal information. First, a cross-modal attention mechanism produces importance-aware representations between each pair of modalities; then a multimodal adaptive gating mechanism performs hierarchical adaptive fusion of the important multimodal information; finally, the fused multimodal features are combined with the modal importance information to carry out multimodal sentiment analysis. Experimental results on the public MOSI and MOSEI datasets show that, compared with the baseline model, the accuracy and F1 score improve by 0.76% and 0.7%, respectively.
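To make the fusion pipeline described in the abstract concrete, the following is a minimal PyTorch-style sketch of text-guided cross-modal attention followed by adaptive gating. The module layout, dimensions, pooling, and regression head are illustrative assumptions, not the authors' released implementation.

import torch
import torch.nn as nn

# Illustrative sketch only: text queries attend over audio/visual features, and
# text-conditioned sigmoid gates decide how much of each attended modality to keep.
class TextGuidedFusion(nn.Module):
    def __init__(self, d_model: int = 128, n_heads: int = 4):
        super().__init__()
        self.text_to_audio = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.text_to_visual = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.audio_gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.visual_gate = nn.Sequential(nn.Linear(2 * d_model, d_model), nn.Sigmoid())
        self.head = nn.Linear(3 * d_model, 1)  # sentiment intensity regression

    def forward(self, text, audio, visual):
        # text / audio / visual: (batch, seq_len, d_model) feature sequences
        attended_audio, _ = self.text_to_audio(text, audio, audio)
        attended_visual, _ = self.text_to_visual(text, visual, visual)
        g_a = self.audio_gate(torch.cat([text, attended_audio], dim=-1))
        g_v = self.visual_gate(torch.cat([text, attended_visual], dim=-1))
        pooled = torch.cat([text.mean(1),
                            (g_a * attended_audio).mean(1),
                            (g_v * attended_visual).mean(1)], dim=-1)
        return self.head(pooled)

# Example usage with random features: the output shape is (batch, 1).
model = TextGuidedFusion()
t, a, v = (torch.randn(2, 20, 128) for _ in range(3))
print(model(t, a, v).shape)

In this sketch the text sequence serves as the attention query over the audio and visual sequences, and the gates, conditioned on the text, control how much of each attended modality enters the fused representation.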

Key words: multimodal sentiment analysis, multimodal fusion, attention mechanism, gating network

CLC Number: TP391

Fig.1

Example diagram of multimodal sentiment analysis

Fig.2

Example of the speech modality

Fig.3

Structure diagram of the multimodal hierarchical adaptive fusion model based on text modal guidance

Fig.4

Local cross-modal interaction module structure

Fig.5

Global multimodal feature interaction module structure

Table 1

Dataset splits

Dataset  Training set  Validation set  Test set  Total
MOSI  1,284  229  686  2,199
MOSEI  16,326  1,871  4,659  22,856

Fig.6

Distribution of sentiment in the CMU-MOSI dataset

Fig.7

Distribution of sentiment in the CMU-MOSEI dataset

Table 2

Experimental parameter settings

Parameter  MOSI  MOSEI
batch_size  32  32
learning rate  1×10⁻³  1×10⁻⁴
learning rate (BERT)  5×10⁻⁵  5×10⁻⁵
dropout  0.2  0.3
optimizer  Adam  Adam
activation function  ReLU  ReLU
dm (low-dimensional space dimension)  128  128
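For readability, the Table 2 settings can also be written as plain configuration dictionaries. This is only a hypothetical restatement of the table, not code from the paper.

# Hypothetical restatement of the Table 2 hyperparameters (not from the paper).
configs = {
    "MOSI":  {"batch_size": 32, "lr": 1e-3, "lr_bert": 5e-5, "dropout": 0.2,
              "optimizer": "Adam", "activation": "ReLU", "d_m": 128},
    "MOSEI": {"batch_size": 32, "lr": 1e-4, "lr_bert": 5e-5, "dropout": 0.3,
              "optimizer": "Adam", "activation": "ReLU", "d_m": 128},
}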

Table 3

Experimental results of different models on the CMU-MOSI dataset

Model  MAE  Corr  Acc_2  F1-score
TFN  0.901  0.698  80.81  80.74
LMF  0.917  0.695  82.52  82.42
MulT  0.861  0.711  84.10  83.90
MAG-BERT  0.773  0.770  84.88  84.83
MISA  0.797  0.755  83.55  84.08
ICCN  0.862  0.714  83.07  83.02
Self-MM  0.720  0.799  85.67  85.68
Ours  0.710  0.801  86.43  86.38

Table 4

Experimental results of different models on the CMU-MOSEI dataset

Model  MAE  Corr  Acc_2  F1-score
TFN  0.539  0.700  82.50  82.11
LMF  0.623  0.677  82.01  82.13
MulT  0.580  0.703  82.54  82.33
MAG-BERT  0.605  0.755  84.78  84.71
MISA  0.539  0.753  84.85  84.83
ICCN  0.565  0.713  84.18  84.15
Self-MM  0.522  0.770  85.28  85.04
Ours  0.531  0.762  85.36  85.29

Fig.8

Results of modality ablation experiments on the CMU-MOSI dataset

Fig.9

Results of the modality importance ablation experiment on the CMU-MOSI dataset

Table 5

Results of model ablation experiments on the CMU-MOSI dataset

Model  MAE  Corr  Acc_2  F1-score
Ours  0.710  0.801  86.43  86.38
(-) cross-modal attention  0.711  0.792  85.67  85.61
(-) gating unit  0.707  0.791  85.06  85.09
(-) text gate  0.708  0.800  85.52  84.48
(-) audio gate  0.730  0.787  85.21  85.19
(-) visual gate  0.730  0.787  85.21  85.19
Correlated-feature fusion  0.729  0.793  86.13  86.03
Modality-specific feature fusion  0.730  0.799  85.06  85.03

Table 6

CMU-MOSI dataset case study results

No.  Multimodal information  Video frame  True sentiment, true value  Predicted sentiment, predicted value
1  Text: "This movie frustrated me"; Audio: high pitch; Visual: frowning  (image)  Negative, -2.8  Negative, -2.754
2  Text: "And it is a really funny"; Audio: emphatic tone; Visual: smiling  (image)  Positive, 1.8  Positive, 1.785
3  Text: "I think that the movie did rely on you to kind of figure it out"; Audio: even pace, calm tone; Visual: expressionless  (image)  Neutral, 0  Neutral, 0.084
1 JIMING LIU, PEIXIANG Z, YING LIU, et al. Summary of multi-modal sentiment analysis technology[J]. Journal of Frontiers of Computer Science & Technology, 2021, 15(7): 1165.
2 SUN Z, SARMA P, SETHARES W, et al. Learning relationships between text, audio, and video via deep canonical correlation for multimodal language analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New York: AAAI Press, 2020: 8992-8999.
3 ZADEH A, CHEN M, PORIA S, et al. Tensor fusion network for multimodal sentiment analysis[C]//Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. Copenhagen: Association for Computational Linguistics, 2017: 1103-1114.
4 ZADEH A, LIANG P P, MAZUMDER N, et al. Memory fusion network for multi-view sequential learning[C]//Proceedings of the AAAI Conference on Artificial Intelligence. New Orleans: AAAI Press, 2018: 5634-5641.
5 TSAI Y H H, BAI S, LIANG P P, et al. Multimodal transformer for unaligned multimodal language sequences[C]//Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. Florence: Association for Computational Linguistics, 2019: 6558-6569.
6 YU W, XU H, YUAN Z, et al. Learning modality-specific representations with self-supervised multi-task learning for multimodal sentiment analysis[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 10790-10797.
7 WILLIAMS J, KLEINEGESSE S, COMANESCU R, et al. Recognizing emotions in video using multimodal DNN feature fusion[C]//Proceedings of Grand Challenge and Workshop on Human Multimodal Language. Melbourne: Association for Computational Linguistics, 2018: 11-19.
8 LIU Z, SHEN Y, LAKSHMINARASIMHAN V B, et al. Efficient low-rank multimodal fusion with modality-specific factors[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics. Melbourne: Association for Computational Linguistics, 2018: 2247-2256.
9 MAI S, HU H, XING S. Modality to modality translation: an adversarial representation learning and graph fusion network for multimodal fusion[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Palo Alto: AAAI Press, 2021: 164-172.
10 ZHOU S, JIA J, YIN Y, et al. Understanding the teaching styles by an attention based multi-task cross-media dimensional modeling[C]//Proceedings of the 27th ACM International Conference on Multimedia. New York: Association for Computing Machinery, 2019: 1322-1330.
11 CHEN M, LI X. Swafn: sentimental words aware fusion network for multimodal sentiment analysis[C]//Proceedings of the 28th International Conference on Computational Linguistics. Barcelona: International Committee on Computational Linguistics, 2020: 1067-1077.
12 HAN W, CHEN H, PORIA S. Improving multimodal fusion with hierarchical mutual information maximization for multimodal sentiment analysis[C]//Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing. Punta Cana: Association for Computational Linguistics, 2021: 9180-9192.
13 PHAM H, LIANG P P, MANZINI T, et al. Found in translation: learning robust joint representations by cyclic translations between modalities[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu: AAAI Press, 2019: 6892-6899.
14 VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need[J]. Advances in Neural Information Processing Systems, 2017, 30: 5998-6008.
15 HAZARIKA D, ZIMMERMANN R, PORIA S. MISA: modality-invariant and specific representations for multimodal sentiment analysis[C]//Proceedings of the 28th ACM International Conference on Multimedia. New York: Association for Computing Machinery, 2020: 1122-1131.
16 WANG Y, SHEN Y, LIU Z, et al. Words can shift: dynamically adjusting word representations using nonverbal behaviors[C]//Proceedings of the AAAI Conference on Artificial Intelligence. Honolulu: AAAI Press, 2019: 7216-7223.
17 RAHMAN W, HASAN M K, LEE S, et al. Integrating multimodal information in large pretrained transformers[C]//Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Stroudsburg: Association for Computational Linguistics, 2020: 2359-2369.
18 ANDREW G, ARORA R, BILMES J, et al. Deep canonical correlation analysis[C]//International Conference on Machine Learning. Atlanta: JMLR. org, 2013: 1247-1255.
19 DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics. Minneapolis: Association for Computational Linguistics, 2019: 4171-4186.
20 HOCHREITER S, SCHMIDHUBER J. Long short-term memory[J]. Neural Computation, 1997, 9(8): 1735-1780. doi: 10.1162/neco.1997.9.8.1735
21 ZADEH A, ZELLERS R, PINCUS E, et al. MOSI: multimodal corpus of sentiment intensity and subjectivity analysis in online opinion videos[J]. IEEE Intelligent Systems, 2016: 82-88.
22 ZADEH A, PU P. Multimodal language analysis in the wild: CMU-MOSEI dataset and interpretable dynamic fusion graph[C]//Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Long Papers). Melbourne: Association for Computational Linguistics, 2018: 2236-2246.