Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering


TANG Liang1,2, LI Qian1, XU Hong-bo2, YI Mian-zhu1   

  1. Department of Language Engineering, Luoyang University of Foreign Language, Luoyang 471003, Henan, China;
  2. Institute of Computing Technology, Chinese Academy of Science, Beijing 100049, China
  Received: 2015-03-03; Revised: 2015-07-22; Online: 2015-09-20; Published: 2015-09-26

Abstract: In cross-language text analysis, a multi-word phrase is less ambiguous and more precise than a single word, which helps in understanding the text more accurately. Existing methods focus mainly on cross-language alignment of single words. This paper presents an extraction and alignment method for Chinese-Japanese multi-word phrases based on multi-strategy filtering, which combines multi-word phrase extraction with cross-language alignment. First, we obtain semantically complete multi-word phrases by filtering on repeated strings, left-right adjacent entropy, internal relationships, multi-word nesting, stop words, and other criteria. Second, we use a parallel corpus to compute the similarity of Chinese and Japanese multi-word phrases and thereby achieve cross-language alignment. Throughout the process, we take the rules and characteristics of the Japanese language into account and dynamically adjust the threshold according to the size and domain of the corpus, in order to improve the applicability of the extracted multi-word phrases. The experimental results show that the method is effective at extracting Chinese-Japanese multi-word phrases as alignment units, which makes the semantic expression more complete and gives the results greater practical value.
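
As a rough illustration of two of the steps named in the abstract, the following Python sketch filters repeated n-grams by left-right adjacent entropy and then scores Chinese-Japanese phrase pairs by co-occurrence in a parallel corpus. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names, the fixed thresholds, and the use of the Dice coefficient as the similarity measure are all assumptions, since the abstract does not specify them, and the other filters (internal relationship, nesting, stop words, dynamic thresholds) are omitted.

```python
import math
from collections import Counter, defaultdict


def adjacent_entropy(neighbors):
    """Shannon entropy (in nats) of a Counter of neighboring tokens."""
    total = sum(neighbors.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log(c / total) for c in neighbors.values())


def extract_candidates(sentences, n=2, min_freq=2, min_entropy=0.5):
    """Return repeated n-grams whose left and right adjacent entropy both
    reach min_entropy. Thresholds are illustrative, not from the paper."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for tokens in sentences:
        # Pad so that sentence boundaries count as neighbors too.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - n):
            gram = tuple(padded[i:i + n])
            freq[gram] += 1
            left[gram][padded[i - 1]] += 1
            right[gram][padded[i + n]] += 1
    return {
        gram for gram, count in freq.items()
        if count >= min_freq
        and adjacent_entropy(left[gram]) >= min_entropy
        and adjacent_entropy(right[gram]) >= min_entropy
    }


def dice_align(zh_phrase_sets, ja_phrase_sets, threshold=0.6):
    """Score phrase pairs by the Dice coefficient over aligned sentence pairs.

    zh_phrase_sets[k] and ja_phrase_sets[k] hold the phrases found in the
    k-th Chinese and Japanese sentence of the parallel corpus. Dice is an
    assumed similarity measure; the paper may use a different one.
    """
    zh_count, ja_count, joint = Counter(), Counter(), Counter()
    for zh_set, ja_set in zip(zh_phrase_sets, ja_phrase_sets):
        zh_count.update(zh_set)
        ja_count.update(ja_set)
        joint.update((z, j) for z in zh_set for j in ja_set)
    scores = {}
    for (z, j), both in joint.items():
        score = 2 * both / (zh_count[z] + ja_count[j])
        if score >= threshold:
            scores[(z, j)] = score
    return scores
```

A candidate with varied left and right neighbors is more likely to be a free-standing phrase, which is why both entropies must pass the threshold. A phrase that always follows the same token gets zero left entropy and is rejected, even if it is frequent.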

Key words: parallel corpus, multi-word phrase, word alignment

