
Journal of Shandong University (Natural Science) ›› 2018, Vol. 53 ›› Issue (9): 40-48. doi: 10.6040/j.issn.1671-9352.1.2017.060

Extraction of Chinese multiword expressions based on Web text

GONG Shuang-shuang, CHEN Yu-feng*, XU Jin-an, ZHANG Yu-jie

  1. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2017-12-12 Online: 2018-09-20 Published: 2018-09-10
  • About the authors: GONG Shuang-shuang (born 1990), female, master's student; research interests: natural language processing and information extraction. E-mail: 15120393@bjtu.edu.cn. *Corresponding author: CHEN Yu-feng (born 1981), female, Ph.D., associate professor; research interests: natural language processing and artificial intelligence. E-mail: chenyf@bjtu.edu.cn
  • Supported by:
    National Natural Science Foundation of China (61473294, 61370130); Beijing Natural Science Foundation (4172047); Fundamental Research Funds for the Central Universities (2015JBM033)

Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocations in natural language. They occur frequently in Web text, where they pose a major challenge for word segmentation and downstream text understanding. This paper therefore proposes a two-layer extraction strategy for recognizing MWEs in Web text. The first layer performs an initial extraction with an algorithm that combines left and right entropy with enhanced mutual information (LRE+EMI). The second layer takes the MWE candidate list produced by the first layer and filters it further, using an SVM classifier with context and word-vector features to separate MWEs from non-MWEs. In experiments on a corpus of 5,000 microblog posts, the first layer reached an F value of 84.92% and the second layer an F value of 89.58%, a substantial improvement over the baseline system. The results show that the two-layer strategy extracts MWEs from Web text effectively and also improves word segmentation results.

Key words: multiword expressions (MWEs), left and right entropy, enhanced mutual information, word segmentation, SVM

CLC number:

  • TP391
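
The first extraction layer described in the abstract can be illustrated with a small sketch. The abstract does not give the formulas, so this example substitutes plain pointwise mutual information (PMI) for the paper's enhanced mutual information (EMI) and uses the usual Shannon entropy over neighboring tokens for left and right entropy. The function names, the thresholds `min_pmi` and `min_entropy`, and the toy corpus are illustrative assumptions, not taken from the paper.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(neighbor_counts):
    """Shannon entropy (in bits) of a neighbor-token frequency distribution."""
    total = sum(neighbor_counts.values())
    return -sum((c / total) * math.log2(c / total) for c in neighbor_counts.values())


def score_bigrams(sentences):
    """Map each adjacent word pair to (pmi, left_entropy, right_entropy).

    `sentences` is a list of already-segmented token lists.
    """
    unigrams = Counter()
    bigrams = Counter()
    left = defaultdict(Counter)   # tokens seen immediately before each pair
    right = defaultdict(Counter)  # tokens seen immediately after each pair

    for tokens in sentences:
        # Boundary markers count as neighbors but never as part of a pair.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        unigrams.update(tokens)
        for i in range(1, len(padded) - 2):
            pair = (padded[i], padded[i + 1])
            bigrams[pair] += 1
            left[pair][padded[i - 1]] += 1
            right[pair][padded[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_pair = count / n_bi
        p_w1 = unigrams[pair[0]] / n_uni
        p_w2 = unigrams[pair[1]] / n_uni
        # Plain PMI, used here as a stand-in for the paper's EMI (assumption).
        pmi = math.log2(p_pair / (p_w1 * p_w2))
        scores[pair] = (pmi, branching_entropy(left[pair]), branching_entropy(right[pair]))
    return scores


def extract_candidates(sentences, min_pmi=1.0, min_entropy=0.5):
    """Keep pairs that are internally cohesive (PMI) and occur in varied contexts (entropy)."""
    scores = score_bigrams(sentences)
    return sorted(
        pair
        for pair, (pmi, left_h, right_h) in scores.items()
        if pmi >= min_pmi and min(left_h, right_h) >= min_entropy
    )


# Toy, hand-segmented corpus (hypothetical data for illustration only).
corpus = [
    ["我", "是", "吃瓜", "群众", "啊"],
    ["这些", "吃瓜", "群众", "真多"],
    ["吃瓜", "群众", "表示", "围观"],
    ["我", "是", "学生"],
]
print(extract_candidates(corpus))
```

On this toy corpus the pair ("我", "是") has a high PMI but always starts a sentence, so its left entropy is zero and it is filtered out, while ("吃瓜", "群众") appears in varied left and right contexts and is kept. In the paper's pipeline, a candidate list like this one would then be passed to the second-layer SVM classifier for further filtering.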