Journal of Shandong University (Natural Science) ›› 2018, Vol. 53 ›› Issue (9): 40-48. doi: 10.6040/j.issn.1671-9352.1.2017.060
龚双双,陈钰枫*,徐金安,张玉洁
GONG Shuang-shuang, CHEN Yu-feng*, XU Jin-an, ZHANG Yu-jie
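Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocational units in natural language. They occur especially often in web text, where they pose major challenges for word segmentation and downstream text understanding. This paper therefore proposes a two-layer extraction strategy for identifying MWEs in web text. In the first layer, an algorithm that combines left-right entropy with enhanced mutual information produces an initial list of MWE candidates. In the second layer, an SVM classifier built on context and word-vector features classifies the first-layer candidates as MWEs or non-MWEs, further filtering the list. In experiments on a corpus of 5,000 microblog posts, the first layer reaches an F-score of 84.92% and the second layer reaches 89.58%, a substantial improvement over the baseline system. The results indicate that the two-layer strategy extracts web MWEs effectively and also improves word segmentation results.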
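The first-layer idea (score adjacent token pairs by how strongly they cohere and by how freely their left and right contexts vary) can be illustrated with a minimal sketch. It is not the authors' algorithm: plain pointwise mutual information stands in for the paper's enhanced mutual information, which the abstract does not define, and the function names and the thresholds `min_pmi` and `min_entropy` are hypothetical choices made for this example.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(neighbor_counts):
    """Shannon entropy (in nats) of a neighbour-frequency distribution."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log(c / total)
                for c in neighbor_counts.values())


def score_bigrams(sentences):
    """Return {(w1, w2): (pmi, left_entropy, right_entropy)} for adjacent pairs.

    Assumption: plain PMI is used here in place of the paper's
    enhanced mutual information.
    """
    unigrams = Counter()
    bigrams = Counter()
    left = defaultdict(Counter)   # tokens seen immediately before each pair
    right = defaultdict(Counter)  # tokens seen immediately after each pair
    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left[pair][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[pair][tokens[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log(p_xy / (p_x * p_y))
        # A pair with no observed neighbour on one side gets entropy 0 there.
        h_left = branching_entropy(left[pair]) if left[pair] else 0.0
        h_right = branching_entropy(right[pair]) if right[pair] else 0.0
        scores[pair] = (pmi, h_left, h_right)
    return scores


def extract_candidates(sentences, min_pmi=1.0, min_entropy=0.5):
    """Keep pairs that cohere (high PMI) and have varied context on both sides.

    The threshold values are hypothetical, not taken from the paper.
    """
    return [pair
            for pair, (pmi, h_left, h_right) in score_bigrams(sentences).items()
            if pmi >= min_pmi and min(h_left, h_right) >= min_entropy]
```

Calling `extract_candidates` on a list of pre-tokenized microblog posts returns the pairs that pass both filters. In the pipeline the abstract describes, such candidates would then go to the second-layer SVM classifier for further filtering.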