
Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (01): 26-30. doi: 10.6040/j.issn.1671-9352.3.2014.014

• Articles •

Finding potential errors in Chinese treebank based on inconsistencies

TAN Hong-ye, ZHAO Jian, CHEN Qian

  1. College of Computer and Information Technology, Shanxi University, Taiyuan 030006, Shanxi, China
  • Received:2014-09-19 Revised:2014-11-25 Online:2015-01-20 Published:2015-01-24
  • About the author: TAN Hong-ye (1971-), female, Ph.D., associate professor; research interest: natural language processing. E-mail: hytan_2006@126.com
  • Supported by:
    National Natural Science Foundation of China, Young Scientists Fund (61100138, 61403238); Natural Science Foundation of Shanxi Province (2011011016-2, 2012021012-1); Shanxi Province Research Project for Returned Overseas Scholars (2013-022); Shanxi Province Higher Education Science and Technology Development Project (20121117); Shanxi Province 2012 Merit-based Science and Technology Activities Program for Returned Overseas Scholars

Abstract: Corpora are fundamental to natural language processing (NLP), and their annotation quality affects the performance of NLP systems built on supervised machine learning. For Chinese syntactic treebanks, an approach is proposed that finds potential annotation errors by detecting inconsistencies. Inconsistencies are detected in two ways: the first compares the internal composition of similar phrases and combines it with a suspicion degree; the second starts from the annotation guideline and flags part-of-speech tags, phrase labels, and other annotation symbols that do not conform to the guideline's definitions. Experimental results show that a certain number of the detected inconsistencies are genuine annotation errors.

Key words: Chinese treebank, inconsistencies, potential errors, natural language processing
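The two detection strategies summarized in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: this page does not give the paper's definition of the suspicion degree, so the score used here (one minus the share of a parent label among phrases with the same child-label sequence) is an assumption, as are the function names, the threshold, and the toy labels.

```python
from collections import Counter, defaultdict


def find_structure_inconsistencies(phrases, threshold=0.2):
    """Flag parent labels that are rare among phrases sharing one structure.

    phrases: iterable of (parent_label, child_labels) pairs.
    The suspicion score (1 - label share) is an assumed stand-in for the
    paper's suspicion degree, which is not defined on this page.
    """
    by_structure = defaultdict(Counter)
    for parent, children in phrases:
        by_structure[tuple(children)][parent] += 1

    suspects = []
    for children, counts in by_structure.items():
        if len(counts) < 2:
            continue  # only one parent label was ever used: consistent
        total = sum(counts.values())
        for parent, n in counts.items():
            share = n / total
            if share < threshold:
                suspects.append((children, parent, 1 - share))
    # Most suspicious candidates first.
    return sorted(suspects, key=lambda s: s[2], reverse=True)


def find_guideline_violations(labels, allowed_labels):
    """Return labels used in the corpus but absent from the guideline set."""
    return sorted(set(labels) - set(allowed_labels))


if __name__ == "__main__":
    # Hypothetical toy data: "a n" is labeled NP nine times and VP once.
    toy = [("NP", ("a", "n"))] * 9 + [("VP", ("a", "n"))] + [("VP", ("v", "n"))] * 5
    print(find_structure_inconsistencies(toy))
    print(find_guideline_violations(["NP", "VP", "XP"], {"NP", "VP"}))
```

Flagged items are only candidates for manual review; as the abstract notes, only some of the detected inconsistencies turn out to be annotation errors.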

CLC Number: 

  • TP391