A bilingual word alignment method of Vietnamese-Chinese based on deep neural network

doi:10.6040/j.issn.1671-9352.3.2014.289

Journal of Shandong University (Natural Science), 2016, Vol. 51, Issue (1): 77-83.

MO Yuan-yuan¹, GUO Jian-yi^1,2*, YU Zheng-tao^1,2, MAO Cun-li^1,2, NIU Yi-tong¹

1. School of Information Engineering and Automation, Kunming University of Science and Technology, Kunming 650051, Yunnan, China;
2. Intelligent Information Processing Key Laboratory, Kunming University of Science and Technology, Kunming 650051, Yunnan, China

Received: 2015-03-03 Online: 2016-01-16 Published: 2016-11-29
Corresponding author: GUO Jian-yi (b. 1964), female, master's degree, professor; research interests: natural language processing and information extraction. E-mail: gjade86@hotmail.com
About the author: MO Yuan-yuan (b. 1989), female, master's student; research interests: natural language processing and information extraction. E-mail: yuanyuan2013ly@163.com
Funding: National Natural Science Foundation of China (61262041); Major Special Project of the Yunnan Provincial Department of Education Fund (2013FA030)

Abstract

Abstract: Automatic word alignment between Vietnamese and Chinese is difficult because the two languages differ considerably in their linguistic characteristics. To address this problem, we propose a Vietnamese-Chinese word alignment method based on a deep neural network (DNN). The method first converts the words of both languages into word embeddings, which serve as the input to the DNN. It then adjusts and extends the HMM model and incorporates context information to build a DNN-HMM word alignment model. The experiments use the HMM model and the IBM4 model as baselines. Results on a large-scale Vietnamese-Chinese word alignment task show that the method clearly improves accuracy and recall over both baselines and substantially reduces the word alignment error rate.

Key words: word alignment, DNN, Vietnamese, Chinese
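
The abstract describes the model only at a high level, so the following is a minimal sketch of how a DNN-HMM aligner of this general kind can be decoded, not the authors' implementation. It assumes a single hidden layer with untrained random weights, a context window of one word on each side, and a linear jump penalty standing in for the HMM transition model. The names `lexical_scores`, `viterbi_align`, and `jump_penalty` are hypothetical.

```python
import numpy as np


def lexical_scores(src_vecs, tgt_vecs, W1, b1, w2, b2, window=1):
    """Score every (target j, source i) pair with a small feed-forward net.

    The input for a pair is the concatenation of the embeddings in a context
    window around each word, zero-padded at sentence boundaries.
    """
    def windows(vecs):
        dim = vecs.shape[1]
        pad = np.zeros((window, dim))
        padded = np.vstack([pad, vecs, pad])
        return np.stack([padded[k:k + 2 * window + 1].ravel()
                         for k in range(len(vecs))])

    src_win, tgt_win = windows(src_vecs), windows(tgt_vecs)
    scores = np.empty((len(tgt_vecs), len(src_vecs)))
    for j, t in enumerate(tgt_win):
        for i, s in enumerate(src_win):
            hidden = np.tanh(W1 @ np.concatenate([s, t]) + b1)
            scores[j, i] = w2 @ hidden + b2
    return scores


def viterbi_align(lex, jump_penalty=0.5):
    """HMM-style decoding: pick one source position per target word.

    lex[j, i] is the lexical score of aligning target word j to source word i.
    The transition score depends only on the jump width and is highest
    for a monotone step of +1 (an assumed, hand-set penalty).
    """
    n_tgt, n_src = lex.shape
    pos = np.arange(n_src)
    # trans[p, i]: score of moving from source position p to position i
    trans = -jump_penalty * np.abs(pos[None, :] - pos[:, None] - 1)
    best = lex[0].astype(float)
    back = np.zeros((n_tgt, n_src), dtype=int)
    for j in range(1, n_tgt):
        cand = best[:, None] + trans          # cand[p, i]
        back[j] = cand.argmax(axis=0)         # best predecessor for each i
        best = cand.max(axis=0) + lex[j]
    path = [int(best.argmax())]
    for j in range(n_tgt - 1, 0, -1):         # follow back-pointers
        path.append(int(back[j, path[-1]]))
    return path[::-1]


# Toy data: random vectors stand in for trained word embeddings.
rng = np.random.default_rng(0)
dim, hidden_units, window = 4, 8, 1
in_dim = 2 * (2 * window + 1) * dim
W1 = rng.normal(scale=0.1, size=(hidden_units, in_dim))
b1 = np.zeros(hidden_units)
w2 = rng.normal(scale=0.1, size=hidden_units)
b2 = 0.0
zh = rng.normal(size=(3, dim))   # 3 Chinese words (source side)
vi = rng.normal(size=(5, dim))   # 5 Vietnamese words (target side)

scores = lexical_scores(zh, vi, W1, b1, w2, b2, window)
alignment = viterbi_align(scores)
print(alignment)  # one source index per Vietnamese word
```

In a trained system the network weights and the transition scores would be learned from parallel data; here they are fixed by hand so that only the decoding logic is illustrated.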

CLC number: TP391
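
The abstract reports accuracy, recall, and word alignment error rate without stating how they are computed. The sketch below assumes the definitions commonly used in word alignment evaluation, with gold links split into "sure" and "possible" sets, and assumes that the reported accuracy corresponds to precision under those definitions. The function name and the toy links are illustrative, not taken from the paper.

```python
def alignment_metrics(predicted, sure, possible=None):
    """Return (precision, recall, AER) for one set of predicted links.

    Each link is a (source_index, target_index) pair. Sure links always
    count as possible links as well. Assumes both `predicted` and `sure`
    are non-empty.
    """
    predicted = set(predicted)
    sure = set(sure)
    possible = sure | set(possible or ())

    hit_sure = len(predicted & sure)
    hit_possible = len(predicted & possible)

    precision = hit_possible / len(predicted)
    recall = hit_sure / len(sure)
    aer = 1.0 - (hit_sure + hit_possible) / (len(predicted) + len(sure))
    return precision, recall, aer


# Toy example: four predicted links, three sure gold links, one extra possible link.
predicted_links = {(0, 0), (1, 2), (2, 1), (3, 3)}
sure_links = {(0, 0), (1, 2), (2, 1)}
possible_links = {(3, 4)}

precision, recall, aer = alignment_metrics(predicted_links, sure_links, possible_links)
print(precision, recall, aer)
```

A lower AER is better: it drops as more predicted links match the gold annotation, which is consistent with the abstract reporting higher accuracy and recall alongside a lower error rate.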

MO Yuan-yuan, GUO Jian-yi, YU Zheng-tao, MAO Cun-li, NIU Yi-tong. A bilingual word alignment method of Vietnamese-Chinese based on deep neural network[J]. Journal of Shandong University (Natural Science), 2016, 51(1): 77-83.
