Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (09): 21-28. doi: 10.6040/j.issn.1671-9352.3.2014.016
TANG Liang (唐亮)1,2, LI Qian (李倩)1, XU Hong-bo (许洪波)2, YI Mian-zhu (易绵竹)1
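Abstract: In cross-lingual text analysis, multi-word phrases are less ambiguous than single words and express meaning more precisely, which helps improve the accuracy of text understanding. Existing methods focus mainly on cross-lingual alignment of single words. This paper integrates multi-word phrase extraction with cross-lingual alignment and proposes a method for extracting and aligning Chinese-Japanese multi-word phrases based on multi-strategy filtering. Starting from one language, the method extracts candidates and filters them using repeated strings, left and right adjacency entropy, internal association degree, multi-word nesting, and stop words, yielding multi-word phrases with complete semantics. It then uses a parallel corpus to compute the similarity between Chinese and Japanese multi-word phrases and thereby aligns them across languages. Throughout the process, Japanese linguistic rules and characteristics can be taken into account, and the filtering thresholds can be adjusted dynamically according to corpus size and domain, which improves the domain applicability of the extracted phrases. Experimental results show that the method effectively extracts Chinese-Japanese multi-word phrases and aligns them accurately; using multi-word phrases as the alignment unit gives more complete semantic expression and greater practical value.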
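The filtering and alignment steps named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes pre-tokenized sentences, covers only the repeated-string, adjacency-entropy, and stop-word filters (internal association and nesting checks are omitted), and uses a sentence-level Dice coefficient as a stand-in similarity measure, since the abstract does not say which measure the paper uses. All function names and threshold values are hypothetical.

```python
import math
from collections import Counter, defaultdict


def candidate_ngrams(sentences, min_n=2, max_n=4, min_freq=2):
    """Collect repeated token n-grams plus their left/right neighbor counts."""
    freq = Counter()
    left = defaultdict(Counter)
    right = defaultdict(Counter)
    for tokens in sentences:
        for n in range(min_n, max_n + 1):
            for i in range(len(tokens) - n + 1):
                gram = tuple(tokens[i:i + n])
                freq[gram] += 1
                # Sentence boundaries count as neighbors too.
                left[gram][tokens[i - 1] if i > 0 else "<s>"] += 1
                right[gram][tokens[i + n] if i + n < len(tokens) else "</s>"] += 1
    repeated = {g: c for g, c in freq.items() if c >= min_freq}
    return repeated, left, right


def entropy(counter):
    """Shannon entropy (bits) of a neighbor distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def filter_phrases(sentences, stopwords, min_entropy=0.5, min_freq=2):
    """Keep repeated n-grams with varied contexts and no stop word at either edge."""
    repeated, left, right = candidate_ngrams(sentences, min_freq=min_freq)
    kept = []
    for gram in repeated:
        if gram[0] in stopwords or gram[-1] in stopwords:
            continue
        # Low entropy on either side suggests the n-gram is a fragment
        # of a longer unit rather than a complete phrase.
        if min(entropy(left[gram]), entropy(right[gram])) < min_entropy:
            continue
        kept.append(gram)
    return kept


def contains(tokens, phrase):
    """True if `phrase` occurs as a contiguous token sequence in `tokens`."""
    n = len(phrase)
    return any(tuple(tokens[i:i + n]) == tuple(phrase)
               for i in range(len(tokens) - n + 1))


def dice(src_phrase, tgt_phrase, src_sents, tgt_sents):
    """Dice coefficient over sentence-level co-occurrence in a parallel corpus.

    Assumed similarity measure; the paper's actual measure may differ.
    """
    src_n = tgt_n = both = 0
    for s, t in zip(src_sents, tgt_sents):
        in_s, in_t = contains(s, src_phrase), contains(t, tgt_phrase)
        src_n += in_s
        tgt_n += in_t
        both += in_s and in_t
    return 2 * both / (src_n + tgt_n) if src_n + tgt_n else 0.0
```

In this sketch, `min_entropy` and `min_freq` play the role of the filtering thresholds that the abstract says can be tuned to corpus size and domain.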