JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2018, Vol. 53, Issue (9): 40-48. doi: 10.6040/j.issn.1671-9352.1.2017.060

Extraction of Chinese multiword expressions based on Web text

GONG Shuang-shuang, CHEN Yu-feng*, XU Jin-an, ZHANG Yu-jie   

  1. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received:2017-12-12 Online:2018-09-20 Published:2018-09-10

Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocations in natural language. They appear especially often in Web text, which poses a major challenge for downstream word segmentation and text comprehension. This paper therefore proposes a two-layer extraction strategy for recognizing MWEs. The first layer uses the LRE+EMI algorithm (left and right entropy combined with enhanced mutual information) to extract an initial list of MWE candidates. The second layer uses an SVM classifier with context and word-vector features to separate MWEs from non-MWEs, further filtering the candidate list produced by the first layer. In the experiments, the F-value for MWEs reached 84.92% after the first layer and 89.58% after the second layer, a substantial improvement over the baseline system. The results show that the two-layer strategy extracts MWEs effectively and improves word segmentation results.

Key words: SVM, MWEs, left and right entropy, enhanced mutual information, word segmentation

CLC Number: TP391
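
As a rough illustration of the kind of statistics a first-layer extractor of this sort relies on, the sketch below scores adjacent word pairs in a pre-segmented corpus by left and right branching entropy and by plain pointwise mutual information. This is not the paper's LRE+EMI algorithm: the abstract does not give its formulas, so standard PMI stands in for enhanced mutual information here, and the names `branching_entropy` and `score_bigrams` are hypothetical. The SVM filtering of the second layer is not covered.

```python
import math
from collections import Counter, defaultdict


def branching_entropy(neighbor_counts):
    """Shannon entropy (bits) of the distribution of neighboring tokens."""
    total = sum(neighbor_counts.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total)
                for c in neighbor_counts.values())


def score_bigrams(sentences):
    """Return {(w1, w2): (pmi, left_entropy, right_entropy)}.

    `sentences` is a list of token lists (already segmented).
    Plain PMI is an assumption standing in for the paper's EMI.
    """
    unigrams = Counter()
    bigrams = Counter()
    left = defaultdict(Counter)   # tokens seen immediately before the pair
    right = defaultdict(Counter)  # tokens seen immediately after the pair

    for tokens in sentences:
        unigrams.update(tokens)
        for i in range(len(tokens) - 1):
            pair = (tokens[i], tokens[i + 1])
            bigrams[pair] += 1
            if i > 0:
                left[pair][tokens[i - 1]] += 1
            if i + 2 < len(tokens):
                right[pair][tokens[i + 2]] += 1

    n_uni = sum(unigrams.values())
    n_bi = sum(bigrams.values())
    scores = {}
    for pair, count in bigrams.items():
        p_xy = count / n_bi
        p_x = unigrams[pair[0]] / n_uni
        p_y = unigrams[pair[1]] / n_uni
        pmi = math.log2(p_xy / (p_x * p_y))
        scores[pair] = (pmi,
                        branching_entropy(left[pair]),
                        branching_entropy(right[pair]))
    return scores
```

A pair with high mutual information and high entropy on both sides tends to be internally cohesive yet free to appear in varied contexts, which is the usual intuition for keeping it as a candidate; any thresholds would need to be tuned on real data.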