Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation

YUAN Wei1, 2, YI Mian-zhu2*   

  1. Post-Doctoral Research Station of Shanghai International Studies University, Shanghai 200083, China;
    2. Language Engineering Department, PLA University of Foreign Languages, Luoyang 471003, Henan, China
  • Received: 2017-03-16 Online: 2017-09-20 Published: 2017-09-15

Abstract: Russian and Chinese corpus research urgently needs new breakthroughs in data sources, research perspectives and applications. Comparable corpora are one of the research hotspots in corpus linguistics and natural language processing, yet so far there has been no study of Russian-Chinese comparable corpora in China. This paper reviews existing work in this area, designs a method for constructing a Russian-Chinese comparable corpus based on Wikipedia, develops a system for automatically acquiring comparable texts, and builds a Russian-Chinese comparable corpus containing more than a million words. Finally, the comparability of the corpus is evaluated using cross-language similarity calculation methods. The results demonstrate that this method can effectively obtain Russian-Chinese comparable texts with high comparability, and that the corpus can be used for translation, discourse analysis and computational linguistics studies.

Key words: Russian, comparable corpora, Wikipedia

