Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (7): 43-50. doi: 10.6040/j.issn.1671-9352.1.2015.116

Translation model adaptation based on semantic distribution similarity

YAO Liang, HONG Yu*, LIU Hao, LIU Le, YAO Jian-min

  1. Provincial Key Laboratory of Computer Information Processing Technology, Soochow University, Suzhou 215006, Jiangsu, China
  • Received: 2015-11-14 Online: 2016-07-20 Published: 2016-07-27
  • Corresponding author: HONG Yu (born 1978), male, associate professor; research interests: topic detection, information retrieval, and information extraction. E-mail: tianxianer@gmail.com
  • About the author: YAO Liang (born 1993), male, master's student; research interest: statistical machine translation. E-mail: yaoliang310@163.com
  • Supported by:
    National Natural Science Foundation of China (61373097, 61272259, 61272260)

Abstract: Statistical machine translation (SMT) systems are trained on large-scale, domain-mixed parallel corpora. When the training data and the test data do not follow the same domain distribution, translation quality tends to be low. To address this problem, we propose a translation model adaptation method based on the semantic distribution similarity of translation pairs. The method first obtains word vectors for the source-language and target-language sides of the target domain and builds a mapping between the two vector spaces. Using this mapping, it retrieves the semantic k-nearest neighbors of each source-language word in the target-language space. Based on the distribution of these k neighbors in the general-domain semantic space, it computes the translation similarity of bilingual phrases in the target domain and adds this similarity to the decoder as a new feature, improving the domain adaptation ability of the general translation model. Experiments on English-to-Chinese translation show that systems optimized with the proposed method outperform the baseline system by 0.67 and 0.56 BLEU points on the news and science-and-technology test sets, respectively.

Key words: translation model, vector mapping, domain adaptation, word representation, semantic distribution
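
The first two steps named in the abstract (linking the two vector spaces, then retrieving k-nearest neighbors on the target side) can be illustrated with a toy sketch. It rests on assumptions that are not taken from the paper: the mapping is a plain linear map fitted on seed word pairs by gradient descent, similarity is cosine, and all vectors, the word labels `tgt_a` to `tgt_d`, and the helper names are hypothetical. The abstract does not say how the phrase-level similarity feature is computed, so the sketch stops at the neighbor lookup.

```python
import math

def matvec(W, x):
    """Multiply matrix W (a list of rows) by vector x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def cosine(u, v):
    """Cosine similarity of two vectors; 0.0 if either has zero norm."""
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return sum(a * b for a, b in zip(u, v)) / (nu * nv)

def fit_linear_map(src_vecs, tgt_vecs, lr=0.1, epochs=500):
    """Fit W so that W x is close to z for each seed pair (x, z).

    Assumption: per-example gradient steps on the squared error ||W x - z||^2.
    """
    d_out, d_in = len(tgt_vecs[0]), len(src_vecs[0])
    W = [[0.0] * d_in for _ in range(d_out)]
    for _ in range(epochs):
        for x, z in zip(src_vecs, tgt_vecs):
            pred = matvec(W, x)
            for i in range(d_out):
                err = pred[i] - z[i]
                for j in range(d_in):
                    W[i][j] -= lr * err * x[j]
    return W

def k_nearest(query, tgt_vocab, k):
    """Return the k target words whose vectors are most cosine-similar to query."""
    ranked = sorted(tgt_vocab, key=lambda w: cosine(query, tgt_vocab[w]), reverse=True)
    return ranked[:k]

# Hypothetical seed dictionary: source-side vectors and their target-side vectors.
seed_src = [[1.0, 0.0], [0.0, 1.0]]
seed_tgt = [[0.0, 1.0], [1.0, 0.0]]
W = fit_linear_map(seed_src, seed_tgt)

# Hypothetical target-language vocabulary with toy 2-d word vectors.
tgt_vocab = {
    "tgt_a": [0.0, 1.0],
    "tgt_b": [1.0, 0.0],
    "tgt_c": [0.5, 0.5],
    "tgt_d": [-1.0, 0.0],
}

# Map one source word vector into the target space and look up its neighbors.
mapped = matvec(W, [0.9, 0.1])
neighbours = k_nearest(mapped, tgt_vocab, k=2)
print(neighbours)  # ['tgt_a', 'tgt_c']
```

With real embeddings, the seed pairs would come from a bilingual dictionary and the neighbor search would run over the full target vocabulary; the toy data here only makes the two steps concrete.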

CLC number:

  • TP393