JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE) ›› 2018, Vol. 53 ›› Issue (9): 40-48. doi: 10.6040/j.issn.1671-9352.1.2017.060

Extraction of Chinese multiword expressions based on Web text

GONG Shuang-shuang, CHEN Yu-feng*, XU Jin-an, ZHANG Yu-jie   

  1. College of Computer and Information Technology, Beijing Jiaotong University, Beijing 100044, China
  • Received: 2017-12-12 Online: 2018-09-20 Published: 2018-09-10

Abstract: Multiword expressions (MWEs) are fixed or semi-fixed collocations in natural language. They appear especially frequently in Web text, which poses a great challenge for subsequent word segmentation and text comprehension. In this paper, we therefore propose a double-layer extraction strategy for recognizing MWEs. In the first layer, we use the LRE+EMI algorithm (left and right entropy combined with enhanced mutual information) to obtain an initial list of MWE candidates. In the second layer, we construct context and word-vector features and use an SVM classifier to separate MWEs from non-MWEs, further filtering the candidate list produced by the first layer. In the experiments, the F-value for MWEs reached 84.92% after the first layer and 89.58% after the second layer, a substantial improvement over the baseline system. The experimental results show that the double-layer extraction strategy can extract MWEs effectively and can improve word segmentation results.

Key words: SVM, MWEs, left and right entropy, enhanced mutual information, word segmentation
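
The left and right entropy used in the first layer can be illustrated with a minimal sketch. The code below is not the authors' implementation: it assumes a pre-tokenized corpus and uses the common definition of branching entropy (the Shannon entropy of the tokens immediately to the left or right of a candidate). The function names `branching_entropy` and `left_right_entropy` are hypothetical, and the enhanced mutual information score and the SVM filter are not shown.

```python
import math
from collections import Counter


def branching_entropy(neighbors):
    """Shannon entropy (base 2) of a Counter of neighboring tokens."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in neighbors.values())


def left_right_entropy(tokens, candidate):
    """Return (left entropy, right entropy) of a candidate token sequence.

    tokens:    list of tokens from a segmented corpus (assumed input format)
    candidate: sequence of tokens forming the candidate expression
    """
    cand = tuple(candidate)
    n = len(cand)
    left, right = Counter(), Counter()
    for i in range(len(tokens) - n + 1):
        if tuple(tokens[i:i + n]) == cand:
            if i > 0:
                left[tokens[i - 1]] += 1      # token just before the match
            if i + n < len(tokens):
                right[tokens[i + n]] += 1     # token just after the match
    return branching_entropy(left), branching_entropy(right)


if __name__ == "__main__":
    corpus = "a x y b c x y d a x y b".split()
    print(left_right_entropy(corpus, ("x", "y")))
```

Under this definition, a candidate whose neighbors vary widely on both sides gets high entropy on both sides, which suggests that it behaves as an independent unit. A candidate that is almost always followed by the same token gets low right entropy, which suggests that it is a fragment of a longer expression.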

