JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2017, Vol. 52, Issue (9): 1-6. doi: 10.6040/j.issn.1671-9352.0.2017.095

Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation

YUAN Wei(1,2), YI Mian-zhu(2*)

  1. Post-Doctoral Research Station of Shanghai International Studies University, Shanghai 200083, China;
  2. Language Engineering Department, PLA University of Foreign Languages, Luoyang 471003, Henan, China
  • Received: 2017-03-16; Online: 2017-09-20; Published: 2017-09-15

Abstract: Russian and Chinese corpus research urgently needs new breakthroughs in data sources, research perspectives and applications. Comparable corpora are one of the research hotspots in corpus linguistics and natural language processing, yet so far there has been no study of Russian-Chinese comparable corpora in China. This paper reviews existing work in this area, designs a method for constructing a Russian-Chinese comparable corpus based on Wikipedia, develops a system for automatically acquiring comparable texts, and builds a Russian-Chinese comparable corpus containing more than a million words. Finally, the comparability of the corpus is evaluated using cross-language similarity calculation methods. The results demonstrate that this method can effectively obtain Russian-Chinese comparable texts with high comparability, and that the corpus can be used for studies in translation, discourse analysis and computational linguistics.

Key words: Russian, comparable corpora, Wikipedia

CLC Number: TP391