Journal of Shandong University (Natural Science), 2016, Vol. 51, Issue 7: 43-50. doi: 10.6040/j.issn.1671-9352.1.2015.116
姚亮,洪宇*,刘昊,刘乐,姚建民
YAO Liang, HONG Yu*, LIU Hao, LIU Le, YAO Jian-min
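Abstract: Statistical machine translation (SMT) systems are trained on large parallel corpora that mix many domains, so translation quality tends to be low when the domain distributions of the training data and the test data differ. To address this problem, we propose a domain adaptation method for translation models based on semantic distribution similarity. The method first obtains word vectors for the source-language side and the target-language side of the target domain and builds a mapping between the two. Using this mapping, it retrieves the semantic k-nearest-neighbor words of each source-language word on the target-language side. Then, based on the distribution of these k nearest neighbors in the general-domain semantic space, it computes the translation similarity of each bilingual phrase pair under the target domain and adds this similarity to the decoder as a new feature, improving the domain adaptability of the general translation model. Experimental results show that, compared with the baseline system, the translation system optimized with the proposed method gains 0.67 and 0.56 BLEU points on the news-domain and science-and-technology-domain test sets, respectively, of an English-Chinese translation task.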
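The pipeline described in the abstract can be illustrated with a small sketch. The abstract does not give the exact formulas, so everything below is an assumption made for illustration: the mapping is taken to be a linear map fitted by gradient descent on a seed dictionary, neighbors are ranked by cosine similarity, and the phrase score (the average, over source words, of the best cosine between a neighbor and a target-phrase word in the general-domain space) is a hypothetical stand-in for the paper's similarity feature. All function and parameter names are invented here.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def apply_map(W, x):
    """Project a source-side vector x into the target-side space: W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def fit_linear_map(pairs, dim_src, dim_tgt, lr=0.1, epochs=500):
    """Assumed mapping: fit W to reduce sum ||W x - z||^2 over seed pairs (x, z)
    with plain stochastic gradient descent."""
    W = [[0.0] * dim_src for _ in range(dim_tgt)]
    for _ in range(epochs):
        for x, z in pairs:
            pred = apply_map(W, x)
            for i in range(dim_tgt):
                err = pred[i] - z[i]
                for j in range(dim_src):
                    W[i][j] -= lr * err * x[j]
    return W


def k_nearest(query, vectors, k):
    """Return the k words in `vectors` whose embeddings are closest to `query`."""
    ranked = sorted(vectors, key=lambda w: cosine(query, vectors[w]), reverse=True)
    return ranked[:k]


def phrase_similarity(src_words, tgt_words, src_vecs, tgt_in_vecs,
                      tgt_general_vecs, W, k=2):
    """Hypothetical phrase-pair score, not the paper's exact feature.

    For each source word: map it to the in-domain target space, take its
    k nearest target words, then look those neighbors up in the
    general-domain space and keep the best cosine against any word of the
    target phrase. The phrase score is the mean over source words.
    """
    scores = []
    for s in src_words:
        if s not in src_vecs:
            continue  # out-of-vocabulary source words are skipped
        neighbours = k_nearest(apply_map(W, src_vecs[s]), tgt_in_vecs, k)
        best = max(
            (cosine(tgt_general_vecs[n], tgt_general_vecs[t])
             for n in neighbours for t in tgt_words
             if n in tgt_general_vecs and t in tgt_general_vecs),
            default=0.0,
        )
        scores.append(best)
    return sum(scores) / len(scores) if scores else 0.0
```

In a phrase-based system, a score of this kind would be computed for every entry of the phrase table and appended as an extra feature column, with its weight tuned together with the other decoder features.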