Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering

doi:10.6040/j.issn.1671-9352.3.2014.016

Abstract

In cross-language text analysis, a multi-word phrase is less ambiguous than a single word and therefore supports a more accurate understanding of the text. Existing methods focus mainly on cross-language alignment of single words. This paper presents a method for extracting and aligning Chinese-Japanese multi-word phrases based on multi-strategy filtering, combining multi-word phrase extraction with cross-language alignment. First, multi-word phrases with complete semantics are obtained using repeated strings, left-right adjacent entropy, internal relationship, multi-word nesting, stop-word filtering, and other strategies. Second, a parallel corpus is used to compute the similarity between Chinese and Japanese multi-word phrases and thereby align them across languages. Throughout the process, the rules and characteristics of the Japanese language are taken into account, and thresholds are adjusted dynamically according to the size and domain of the corpus, to improve the applicability of the extracted multi-word phrases. Experimental results show that the method effectively extracts Chinese-Japanese multi-word phrases as alignment units, which makes the semantic expression more complete and gives the results more practical value.
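One of the filtering strategies named above, left-right adjacent entropy, can be illustrated with a minimal sketch. It assumes pre-tokenized sentences and treats sentence boundaries as ordinary neighbours via the hypothetical markers `<s>` and `</s>`; the function names, the boundary handling, and the use of base-2 logarithms are illustrative assumptions, not details taken from the paper.

```python
import math
from collections import Counter


def _entropy(counter):
    """Shannon entropy (in bits) of the distribution given by a Counter."""
    total = sum(counter.values())
    if total == 0:
        return 0.0
    return -sum((c / total) * math.log2(c / total) for c in counter.values())


def adjacent_entropy(sentences, phrase):
    """Return (left_entropy, right_entropy) for a candidate phrase.

    sentences: list of token lists (assumed already segmented).
    phrase:    tuple of tokens forming the candidate multi-word phrase.
    A high entropy on both sides suggests the phrase occurs in varied
    contexts and is more likely to be a complete unit.
    """
    n = len(phrase)
    left, right = Counter(), Counter()
    for tokens in sentences:
        for i in range(len(tokens) - n + 1):
            if tuple(tokens[i:i + n]) == phrase:
                # Assumption: sentence boundaries count as neighbours.
                left[tokens[i - 1] if i > 0 else "<s>"] += 1
                right[tokens[i + n] if i + n < len(tokens) else "</s>"] += 1
    return _entropy(left), _entropy(right)


# Toy corpus with placeholder tokens, for illustration only.
sentences = [
    ["a", "x", "y", "b"],
    ["c", "x", "y", "d"],
    ["a", "x", "y", "d"],
]
left_h, right_h = adjacent_entropy(sentences, ("x", "y"))
```

In a filter of this kind, a candidate would be kept only when both entropies exceed a threshold; the abstract states that thresholds are adjusted to corpus size and domain, so any fixed value here would be a placeholder.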

Keywords: parallel corpus, multi-word phrase, word alignment

CLC Number: TP393

TANG Liang, LI Qian, XU Hong-bo, YI Mian-zhu. Chinese-Japanese multi-word phrase extraction and alignment based on multi-strategy filtering[J]. Journal of Shandong University (Natural Science), 2015, 50(09): 21-28.

