
Journal of Shandong University (Natural Science), 2015, Vol. 50, Issue (09): 21-28. doi: 10.6040/j.issn.1671-9352.3.2014.016

• Articles •

Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering
(Chinese title: 基于多策略过滤的汉日多词短语抽取和对齐)

TANG Liang (唐亮)1,2, LI Qian (李倩)1, XU Hong-bo (许洪波)2, YI Mian-zhu (易绵竹)1

  1. Department of Language Engineering, Luoyang University of Foreign Languages, Luoyang 471003, Henan, China;
  2. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100049, China
  • Received: 2015-03-03 Revised: 2015-07-22 Online: 2015-09-20 Published: 2015-09-26
  • About the author: TANG Liang (born 1976), male, Ph.D., lecturer; research interests: language information processing and data mining. E-mail: tangliang@software.itc.ac.cn
  • Funding:
    National Basic Research Program of China (973 Program) (2014CB340400, 2012CB316303); Key Program of the National Natural Science Foundation of China (61232010); General Program of the National Natural Science Foundation of China (61173064); National Key Technology R&D Program of China (2012BAH39B04)

Abstract: In cross-language text analysis, multi-word phrases are less ambiguous and express meaning more precisely than single words, which helps improve the accuracy of text understanding. Existing methods mainly focus on cross-language alignment of single words. This paper combines multi-word phrase extraction with cross-language alignment and proposes a method for extracting and aligning Chinese-Japanese multi-word phrases based on multi-strategy filtering. First, starting from one language, multi-word phrases with complete semantics are extracted and filtered using repeated strings, left-right adjacent entropy, internal association strength, multi-word nesting, and stop words. Second, a parallel corpus is used to compute the similarity between Chinese and Japanese multi-word phrases, which yields the cross-language alignment. Throughout the process, the rules and characteristics of the Japanese language can be taken into account, and the filtering thresholds can be adjusted dynamically according to corpus size and domain, which improves the domain applicability of the extracted phrases. Experimental results show that the method extracts Chinese-Japanese multi-word phrases effectively and aligns them accurately; using multi-word phrases as the alignment unit gives more complete semantic expression and greater practical value.
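The monolingual filtering step named in the abstract (repeated strings, left-right adjacent entropy, internal association, stop words) can be illustrated with a small sketch. This is not the authors' implementation: the function names `score_candidates` and `filter_phrases`, the use of pointwise mutual information as the internal association measure, the threshold values, and the toy pre-segmented corpus are all assumptions made for illustration. The multi-word nesting filter is omitted.

```python
import math
from collections import Counter, defaultdict


def entropy(counter):
    """Shannon entropy (natural log) of a frequency distribution."""
    total = sum(counter.values())
    return -sum((c / total) * math.log(c / total) for c in counter.values())


def score_candidates(sentences, n=2, min_freq=2):
    """Score repeated n-grams by adjacent entropy and PMI (assumed measures)."""
    unigrams, ngrams = Counter(), Counter()
    left, right = defaultdict(Counter), defaultdict(Counter)
    for tokens in sentences:
        unigrams.update(tokens)
        # Sentence boundaries count as neighbours too.
        padded = ["<s>"] + list(tokens) + ["</s>"]
        for i in range(1, len(padded) - n):
            gram = tuple(padded[i:i + n])
            ngrams[gram] += 1
            left[gram][padded[i - 1]] += 1
            right[gram][padded[i + n]] += 1

    total_uni = sum(unigrams.values())
    total_ngr = sum(ngrams.values())
    scores = {}
    for gram, freq in ngrams.items():
        if freq < min_freq:  # repeated-string filter
            continue
        p_gram = freq / total_ngr
        p_indep = math.prod(unigrams[t] / total_uni for t in gram)
        scores[gram] = {
            "freq": freq,
            "left_entropy": entropy(left[gram]),
            "right_entropy": entropy(right[gram]),
            "pmi": math.log(p_gram / p_indep),
        }
    return scores


def filter_phrases(scores, min_entropy=0.5, min_pmi=1.0, stop_words=frozenset()):
    """Keep candidates that pass the stop-word, entropy, and PMI thresholds."""
    kept = []
    for gram, s in scores.items():
        if gram[0] in stop_words or gram[-1] in stop_words:
            continue
        if min(s["left_entropy"], s["right_entropy"]) < min_entropy:
            continue
        if s["pmi"] < min_pmi:
            continue
        kept.append(gram)
    return sorted(kept)


# Toy corpus, already segmented into words (hypothetical data).
corpus = [
    ["我们", "研究", "机器", "翻译", "方法"],
    ["他", "喜欢", "机器", "翻译", "技术"],
    ["机器", "翻译", "的", "质量", "提高"],
    ["研究", "的", "方法", "的", "质量"],
]
print(filter_phrases(score_candidates(corpus), stop_words={"的"}))
```

In this sketch the thresholds are plain keyword arguments, which mirrors the abstract's point that filtering thresholds can be tuned to corpus size and domain.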

Key words: parallel corpus, multi-word phrase, word alignment
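The alignment step described in the abstract computes a similarity between Chinese and Japanese multi-word phrases over a parallel corpus. The abstract does not say which similarity measure is used, so the sketch below assumes a Dice coefficient over the sets of sentence pairs in which each phrase occurs. The helper names (`occurrence_index`, `dice`, `align`), the threshold, and the four sentence pairs are hypothetical.

```python
from collections import defaultdict


def occurrence_index(sentences, phrases):
    """Map each phrase to the set of sentence indices that contain it."""
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for phrase in phrases:
            if phrase in sentence:
                index[phrase].add(i)
    return index


def dice(a, b):
    """Dice coefficient of two sets; 0.0 when both are empty."""
    if not a and not b:
        return 0.0
    return 2 * len(a & b) / (len(a) + len(b))


def align(zh_sents, ja_sents, zh_phrases, ja_phrases, threshold=0.5):
    """Pair each Chinese phrase with its best-scoring Japanese phrase."""
    zh_idx = occurrence_index(zh_sents, zh_phrases)
    ja_idx = occurrence_index(ja_sents, ja_phrases)
    aligned = {}
    for zp in zh_phrases:
        scored = [(dice(zh_idx[zp], ja_idx[jp]), jp) for jp in ja_phrases]
        if not scored:
            continue
        score, best = max(scored)
        if score >= threshold:  # assumed cut-off; tune per corpus and domain
            aligned[zp] = (best, score)
    return aligned


# Sentence i in zh_sents is the translation of sentence i in ja_sents.
zh_sents = ["机器翻译的质量很高", "我们研究机器翻译",
            "平行语料库很重要", "我们构建平行语料库"]
ja_sents = ["機械翻訳の品質は高い", "私たちは機械翻訳を研究する",
            "対訳コーパスは重要だ", "私たちは対訳コーパスを構築する"]

print(align(zh_sents, ja_sents,
            ["机器翻译", "平行语料库"], ["機械翻訳", "対訳コーパス"]))
```

Sentence-level co-occurrence is the simplest signal a parallel corpus offers; other association measures could be substituted for `dice` without changing the rest of the sketch.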

CLC number: 

  • TP393