
Journal of Shandong University (Natural Science) ›› 2021, Vol. 56 ›› Issue (7): 82-90. doi: 10.6040/j.issn.1671-9352.1.2020.047


Extractive-abstractive automatic text summarization based on the BERT-SUMOPN model

TAN Jin-yuan1, DIAO Yu-feng1, YANG Liang1, QI Rui-hua2, LIN Hong-fei1

  1. Information Retrieval Laboratory, Dalian University of Technology, Dalian 116024, Liaoning, China;
  2. Language Intelligence Research Center, Dalian University of Foreign Languages, Dalian 116024, Liaoning, China
  • Published: 2021-07-19
  • About the authors: TAN Jin-yuan (1997— ), male, master's degree candidate; research interest: text summarization. E-mail: 1997tjy@mail.dlut.edu.cn. *Corresponding author: LIN Hong-fei (1962— ), male, PhD, professor; research interest: natural language processing. E-mail: hflin@dlut.edu.cn
  • Supported by: National Key Research and Development Program of China (2019YFC1200302); Key Program of the National Natural Science Foundation of China (61632011)



Abstract: Extractive summaries tend to suffer in readability and accuracy, while abstractive summaries fall short in coherence and logic; in addition, the traditional models behind both methods often produce insufficient and inaccurate vector representations of the text. To address these problems, this paper proposes an extractive-abstractive summarization method based on the BERT-SUMOPN model. The model obtains text vectors through the BERT pre-trained language model, extracts the key sentences of the text with an extractive structured summarization model, and then feeds those key sentences into a pointer-generator network. The whole model is trained end-to-end with the EAC loss function, and a coverage mechanism is applied to reduce repetition in the generated summary. Experimental results show that the BERT-SUMOPN model achieves strong results on the BIGPATENT patent dataset, improving ROUGE-1 and ROUGE-2 by 3.3% and 2.5%, respectively.
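To make the pipeline above concrete, here is a minimal sketch of the three stages the abstract describes: BERT sentence encoding, extractive selection of key sentences, and pointer-generator decoding with a coverage penalty. It is an illustrative reading of the abstract, not the authors' implementation: the checkpoint name bert-base-uncased, the mean-vector salience score standing in for the structured extractive model, and all function names are assumptions; since the abstract does not define the EAC loss, only the standard coverage term from the pointer-generator network of See et al. [19] is shown.

```python
# Illustrative sketch of an extract-then-abstract pipeline; names and
# scoring choices are assumptions, not the BERT-SUMOPN implementation.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
bert = BertModel.from_pretrained("bert-base-uncased")

def encode_sentences(sentences):
    """Stage 1: one BERT vector per sentence (the [CLS] embedding).
    In the full model, gradients would flow through BERT so the EAC
    loss can train everything end-to-end; no_grad is for the demo only."""
    batch = tokenizer(sentences, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        out = bert(**batch)
    return out.last_hidden_state[:, 0, :]          # (n_sents, hidden)

def extract_key_sentences(sentences, k=3):
    """Stage 2: extractive step. The paper uses a structured summarization
    model; a dot product against the mean document vector is a stand-in."""
    vecs = encode_sentences(sentences)
    scores = vecs @ vecs.mean(dim=0)               # salience per sentence
    top = scores.topk(min(k, len(sentences))).indices.sort().values
    return [sentences[int(i)] for i in top]        # keep document order

def pointer_generator_step(p_vocab, attn, p_gen, src_ids):
    """Stage 3 (one decoder step): mix generating from the vocabulary with
    copying source tokens, as in See et al. [19]:
        P(w) = p_gen * P_vocab(w) + (1 - p_gen) * sum_{i: src_i = w} attn_i
    """
    return (p_gen * p_vocab).scatter_add(0, src_ids, (1.0 - p_gen) * attn)

def coverage_loss(attn, coverage):
    """Coverage penalty sum_i min(attn_i, cov_i): attending again to an
    already-covered source position is penalized, which curbs repetition."""
    return torch.minimum(attn, coverage).sum()

# Usage: pick 2 key sentences from a toy patent-style document.
doc = ["The invention relates to a battery pack for electric vehicles.",
       "The pack encloses lithium-ion cells in a sealed housing.",
       "A cooling plate is arranged beneath the cells.",
       "Further embodiments are described with reference to the figures."]
print(extract_key_sentences(doc, k=2))
```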

Key words: BERT pre-trained language model, structured summarization model, pointer-generator network, EAC loss function

CLC number: TP391.1
References:

[1] WANG Kaixiang. Survey of query-oriented automatic summarization technology[J]. Computer Science, 2018, 45(11A): 12-16.
[2] HSU W T, LIN C K, LEE M Y, et al. A unified model for extractive and abstractive summarization using inconsistency loss[EB/OL]. (2018-05-16)[2020-05-18]. https://arxiv.org/abs/1805.06266.
[3] LUHN H P. The automatic creation of literature abstracts[J]. IBM Journal of Research and Development, 1958, 2(2): 159-165.
[4] ZHANG Ying, WANG Zhongqing, WANG Hongling. Single document extractive summarization with satellite and nuclear relations[J]. Journal of Chinese Information Processing, 2019, 33(8): 67-76.
[5] QI Yichen, WANG Senmiao, ZHAO Yahui. Application of Chinese extractive abstraction method based on deep learning[J]. The Guide of Science & Education, 2019(14): 67-70.
[6] AI Lisi, TANG Weihong, FU Yunbin, et al. An algorithm for natural language generation via text extracting[J]. Journal of East China Normal University (Natural Sciences), 2018, 2018(4): 70-79.
[7] NALLAPATI R, ZHAI F, ZHOU B. SummaRuNNer: a recurrent neural network based sequence model for extractive summarization of documents[C]//Thirty-First AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2017.
[8] FANG Changjian, MU Dejun, DENG Zhenghong, et al. Word-sentence co-ranking for automatic extractive text summarization[J]. Expert Systems with Applications, 2017, 72: 189-195.
[9] ZHANG Y, ER M J, ZHAO R, et al. Multiview convolutional neural networks for multidocument extractive summarization[J]. IEEE Transactions on Cybernetics, 2016, 47(10): 3230-3242.
[10] WU Renshou, ZHANG Yifei, WANG Hongling, et al. Abstractive summarization based on hierarchical structure[J]. Journal of Chinese Information Processing, 2019, 33(10): 90-98.
[11] TIAN Keke, ZHOU Ruiying, DONG Haoye, et al. An abstractive summarization method based on encoder-sharing and gated network[J]. Acta Scientiarum Naturalium Universitatis Pekinensis, 2020, 56(1): 61-67.
[12] ZHANG Min, ZENG Biqing, HAN Xuli, et al. DAPC: dual attention and pointer-coverage network based summarization model[J]. Computer Engineering and Applications, 2020, 5(8): 149-157.
[13] CAO Ziqiang, WEI Furu, LI Wenjie, et al. Faithful to the original: fact aware neural abstractive summarization[C]//Thirty-Second AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2018.
[14] LIU Linqing, LU Yao, YANG Min, et al. Generative adversarial network for abstractive text summarization[C]//Thirty-Second AAAI Conference on Artificial Intelligence. [S.l.]: AAAI, 2018.
[15] LYU Rui, WANG Tao, ZENG Biqing, et al. TSPT: three-stage compound text summarization model based on pre-training[J]. Application Research of Computers, 2020, 37(10): 2917-2921.
[16] QIU Jun. Hybrid text summarization model based on reinforcement learning[J]. Information Technology and Informatization, 2019(1): 67-70.
[17] DEVLIN J, CHANG M W, LEE K, et al. BERT: pre-training of deep bidirectional transformers for language understanding[EB/OL]. (2018-10-11)[2020-05-18]. https://arxiv.org/abs/1810.04805.
[18] LIU Y, TITOV I, LAPATA M. Single document summarization as tree induction[C]//Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long and Short Papers). Minneapolis: ACL, 2019: 1745-1755.
[19] SEE A, LIU P J, MANNING C D. Get to the point: summarization with pointer-generator networks[C]//Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). [S.l.]: ACL, 2017.
[20] SHARMA E, LI C, WANG L. BIGPATENT: a large-scale dataset for abstractive and coherent summarization[EB/OL]. (2019-06-10)[2020-05-18]. https://arxiv.org/abs/1906.03741v1.
[21] SUTSKEVER I, VINYALS O, LE Q V. Sequence to sequence learning with neural networks[EB/OL]. (2014-09-10)[2020-05-18]. https://arxiv.org/abs/1409.3215v3.
[22] BOUTKAN F, RANZIJN J, RAU D, et al. Point-less: more abstractive summarization with pointer-generator networks[EB/OL]. (2019-04-18)[2020-05-18]. https://arxiv.org/abs/1905.01975.
[23] GEHRMANN S, DENG Y, RUSH A M. Bottom-up abstractive summarization[EB/OL]. (2018-08-31)[2020-05-18]. https://arxiv.org/abs/1808.10792v2.
[24] EDUNOV S, BAEVSKI A, AULI M. Pre-trained language model representations for language generation[EB/OL]. (2019-03-22)[2020-05-18]. https://arxiv.org/abs/1903.09722v2.
[25] SUBRAMANIAN S, LI R, PILAULT J, et al. On extractive and abstractive neural document summarization with transformer language models[EB/OL]. (2019-09-07)[2020-05-18]. https://arxiv.org/abs/1909.03186.
[26] CHEN Y C, BANSAL M. Fast abstractive summarization with reinforce-selected sentence rewriting[EB/OL]. (2018-05-28)[2020-05-18]. https://arxiv.org/abs/1805.11080v1.