Journal of Shandong University (Natural Science) [山东大学学报(理学版)], 2017, Vol. 52, Issue 9: 1-6. doi: 10.6040/j.issn.1671-9352.0.2017.095
YUAN Wei (原伟)1,2, YI Mian-zhu (易绵竹)2*
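Abstract: Owing to their inherent advantages and wide range of applications, comparable corpora have gradually become one of the active directions in corpus research, yet Russian-Chinese comparable corpora have not so far been studied by researchers in China. After reviewing related work in China and abroad, we design an approach for building a Russian-Chinese comparable corpus from Wikipedia, develop a system for automatic corpus acquisition, and build a Russian-Chinese comparable corpus on the basis of document-level alignment, with the total number of characters (words) reaching the million level. Finally, a cross-lingual similarity computation method is used to measure the comparability of the Russian and Chinese texts. The results show that the method can effectively acquire Russian-Chinese texts with a relatively high degree of comparability, and that the resulting corpus can be used in Russian-Chinese translation, discourse analysis, and computational linguistics research.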
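The abstract does not state which cross-lingual similarity measure is used to score comparability. As a minimal sketch only, assuming a dictionary-based approach (Russian tokens are mapped to Chinese through a bilingual word list, then compared with the Chinese document by cosine similarity over term counts), the idea might look as follows. The function names, the toy dictionary `ru_zh_dict`, and the pre-tokenized inputs are all hypothetical and are not taken from the paper.

```python
import math
from collections import Counter


def translate_tokens(tokens, bilingual_dict):
    """Map source-language tokens to the target language; drop out-of-dictionary words."""
    return [bilingual_dict[t] for t in tokens if t in bilingual_dict]


def cosine_similarity(tokens_a, tokens_b):
    """Cosine similarity between two bag-of-words term-count vectors."""
    vec_a, vec_b = Counter(tokens_a), Counter(tokens_b)
    dot = sum(vec_a[w] * vec_b[w] for w in vec_a.keys() & vec_b.keys())
    norm_a = math.sqrt(sum(c * c for c in vec_a.values()))
    norm_b = math.sqrt(sum(c * c for c in vec_b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # an empty document is treated as not comparable
    return dot / (norm_a * norm_b)


def comparability(ru_tokens, zh_tokens, ru_zh_dict):
    """Hypothetical comparability score for one aligned Russian-Chinese document pair."""
    return cosine_similarity(translate_tokens(ru_tokens, ru_zh_dict), zh_tokens)


# Toy, hand-made data (an assumption for illustration, not from the corpus).
ru_zh_dict = {"корпус": "语料库", "язык": "语言", "перевод": "翻译"}
ru_doc = ["корпус", "язык", "перевод", "неизвестно"]
zh_doc = ["语料库", "语言", "翻译"]
score = comparability(ru_doc, zh_doc, ru_zh_dict)
```

In this sketch a score near 1 indicates that the two articles share most of their translated vocabulary, while a score near 0 indicates little lexical overlap. A real pipeline would also need word segmentation for Chinese and lemmatization for Russian before counting terms.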