Journal of Shandong University (Natural Science) ›› 2016, Vol. 51 ›› Issue (1): 77-83. doi: 10.6040/j.issn.1671-9352.3.2014.289
MO Yuan-yuan1, GUO Jian-yi1,2*, YU Zheng-tao1,2, MAO Cun-li1,2, NIU Yi-tong1
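Abstract: Automatic word alignment between Chinese and Vietnamese is difficult because the two languages differ substantially in their linguistic characteristics. To address this, a Chinese-Vietnamese word alignment method based on a deep neural network (DNN) is proposed. The method first converts Chinese and Vietnamese words into word vectors, which serve as the input to the DNN model. It then adjusts and extends the HMM model and incorporates context information to build a DNN-HMM word alignment model. The experiments use the HMM model and the IBM4 model as baselines. Results on a large-scale Chinese-Vietnamese word alignment task show that the method achieves clearly higher precision and recall than both baselines, and a substantially lower word alignment error rate.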
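The abstract reports precision, recall, and word alignment error rate but does not spell out how they are computed. The sketch below assumes the commonly used definitions based on sure and possible gold links, as in Och and Ney's comparison of statistical alignment models; whether the paper used exactly these definitions is an assumption. The function name `alignment_metrics` and the toy links are hypothetical and are not data from the paper.

```python
def alignment_metrics(predicted, sure, possible=None):
    """Return (precision, recall, aer) for one set of alignment links.

    Each link is a (source_index, target_index) pair.
    Assumed definitions (sure set S, possible set P, predicted set A):
        precision = |A & P| / |A|
        recall    = |A & S| / |S|
        AER       = 1 - (|A & S| + |A & P|) / (|A| + |S|)
    """
    a = set(predicted)
    s = set(sure)
    # Sure links always count as possible links as well.
    p = s | set(possible or ())
    if not a or not s:
        raise ValueError("predicted and sure link sets must be non-empty")
    precision = len(a & p) / len(a)
    recall = len(a & s) / len(s)
    aer = 1.0 - (len(a & s) + len(a & p)) / (len(a) + len(s))
    return precision, recall, aer


# Toy example with made-up links (hypothetical, not from the paper).
predicted = {(0, 0), (1, 2), (2, 1), (3, 3)}
sure = {(0, 0), (1, 2), (3, 4)}
possible = {(2, 1)}
precision, recall, aer = alignment_metrics(predicted, sure, possible)
```

Under these definitions a lower error rate goes together with higher precision and recall, which matches the direction of improvement the abstract describes.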