
Journal of Shandong University (Natural Science) ›› 2017, Vol. 52 ›› Issue (9): 1-6. doi: 10.6040/j.issn.1671-9352.0.2017.095


Construction of a Russian-Chinese comparable corpus based on Wikipedia and calculation of its comparability

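YUAN Wei1,2, YI Mian-zhu2*

  1. Postdoctoral Research Station, Shanghai International Studies University, Shanghai 200083;
    2. Department of Language Engineering, PLA University of Foreign Languages, Luoyang 471003, Henan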
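  • Received: 2017-03-16 Online: 2017-09-20 Published: 2017-09-15
  • Corresponding author: YI Mian-zhu (born 1964), male, Ph.D., professor; research interests: computational linguistics, Russian language and literature. E-mail: 13373781261@126.com; yw5811827@126.com
  • About the author: YUAN Wei (born 1981), male, Ph.D., associate professor; research interests: computational linguistics, corpus linguistics. E-mail: yw5811827@126.com
  • Supported by:
    National Social Science Fund of China (14CYY051); China Postdoctoral Science Foundation, General Program (2017M610268)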

Building a Russian-Chinese comparable corpus based on Wikipedia and its comparability calculation

YUAN Wei1, 2, YI Mian-zhu2*   

  1. Post-Doctoral Research Station of Shanghai International Studies University, Shanghai 200083, China;
    2. Language Engineering Department, PLA University of Foreign Languages, Luoyang 471003, Henan, China
  • Received:2017-03-16 Online:2017-09-20 Published:2017-09-15

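Abstract: Owing to their inherent advantages and wide range of uses, comparable corpora have gradually become one of the active directions in corpus research, yet no work on Russian-Chinese comparable corpora has so far been reported in China. After reviewing related work in China and abroad, we design an approach for building a Russian-Chinese comparable corpus from Wikipedia, develop a system for automatic text acquisition, and build a Russian-Chinese comparable corpus on the basis of document-level alignment, with a total size on the order of a million characters (words). Finally, the comparability of the Russian and Chinese texts is computed with a cross-language similarity method. The results show that the method can effectively obtain Russian-Chinese texts with relatively high comparability, and that the resulting corpus can be used in Russian-Chinese translation, discourse analysis, and computational linguistics research.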

Key words: comparable corpus, Russian, Wikipedia

Abstract: Russian and Chinese corpus research urgently needs new breakthroughs in data sources, research perspectives, and applications. Comparable corpora are one of the research hotspots in corpus linguistics and natural language processing, yet no study of Russian-Chinese comparable corpora has so far been reported in China. This paper reviews existing work in this area, designs a method for constructing a Russian-Chinese comparable corpus from Wikipedia, develops a system for automatically acquiring comparable texts, and builds a Russian-Chinese comparable corpus that contains more than a million words. Finally, the comparability of the corpus is evaluated using cross-language similarity calculation methods. The results show that the method can effectively obtain Russian-Chinese texts with high comparability, and that the corpus can be used for translation, discourse analysis, and computational linguistics studies.

Key words: Russian, comparable corpora, Wikipedia

CLC number: 

  • TP391
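
The abstract states that comparability was computed with a cross-language similarity method but does not say which one. As a minimal sketch only, assuming a dictionary-based approach (map Russian tokens into Chinese with a bilingual dictionary, then take the cosine similarity of the term-frequency vectors), the idea could look like the following. The function names, the toy dictionary, and the pre-tokenized input are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def translate_tokens(tokens, bilingual_dict):
    """Map source-language tokens to the target language.

    Tokens missing from the dictionary are dropped (an assumption of
    this sketch; a real system would need a strategy for them).
    """
    return [bilingual_dict[t] for t in tokens if t in bilingual_dict]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bags of words (term-frequency vectors)."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[t] * vec_b[t] for t in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as not comparable
    return dot / (norm_a * norm_b)


def comparability(ru_tokens, zh_tokens, ru_zh_dict):
    """Score a Russian-Chinese document pair in the range [0, 1]."""
    return cosine_similarity(translate_tokens(ru_tokens, ru_zh_dict), zh_tokens)


# Toy data, invented for illustration only.
toy_dict = {"корпус": "语料库", "язык": "语言", "перевод": "翻译"}
ru_doc = ["корпус", "язык", "корпус", "текст"]
zh_doc = ["语料库", "语言", "语料库"]

print(comparability(ru_doc, zh_doc, toy_dict))
print(comparability(ru_doc, ["天气", "城市"], toy_dict))
```

In this sketch both documents are assumed to be tokenized already (Chinese word segmentation and Russian lemmatization are separate steps), and each dictionary entry maps to a single translation; a score near 1 means the translated Russian vocabulary closely matches the Chinese document, and 0 means no overlap.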