Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (01): 26-30. doi: 10.6040/j.issn.1671-9352.3.2014.014
谭红叶, 赵健, 陈千
TAN Hong-ye, ZHAO Jian, CHEN Qian
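Abstract: Corpora are the foundation of natural language processing (NLP), and their annotation quality affects the performance of NLP systems built on supervised machine learning. For Chinese syntactic treebanks, this paper proposes a method that finds potential annotation errors by detecting inconsistencies. The method detects inconsistencies in two ways: first, by comparing the internal structure of similar phrases and combining it with a suspicion degree; second, by starting from the annotation guideline and checking for part-of-speech tags, phrase tags, and other labels that do not conform to the guideline's definitions. Experimental results show that a certain number of the detected inconsistencies are actual corpus annotation errors.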
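The two checks described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the abstract does not define how phrase similarity or the suspicion degree is computed, so the sketch assumes that "similar" means an identical sequence of child tags and that suspicion is one minus the relative frequency of a label for that sequence. The tag inventory `GUIDELINE_TAGS`, the function names, and the `threshold` parameter are all hypothetical.

```python
from collections import Counter, defaultdict

# Hypothetical tag inventory standing in for the annotation guideline.
GUIDELINE_TAGS = {"NP", "VP", "PP", "n", "v", "p", "a"}


def find_inconsistencies(phrases, threshold=0.5):
    """Flag phrase labels that are rare for their child-tag sequence.

    phrases: iterable of (label, children) pairs, where children is a
    sequence of tags. Grouping by identical child sequences is a
    simplifying assumption for "similar internal structure".
    """
    by_children = defaultdict(Counter)
    for label, children in phrases:
        by_children[tuple(children)][label] += 1

    flagged = []
    for children, counts in by_children.items():
        if len(counts) < 2:
            continue  # only one label ever used, so nothing is inconsistent
        total = sum(counts.values())
        for label, n in counts.items():
            # Assumed suspicion score: the rarer the label, the higher.
            suspicion = 1 - n / total
            if suspicion > threshold:
                flagged.append((children, label, suspicion))
    return flagged


def find_guideline_violations(phrases, allowed=GUIDELINE_TAGS):
    """Return every tag that the (hypothetical) guideline does not define."""
    bad = set()
    for label, children in phrases:
        for tag in (label, *children):
            if tag not in allowed:
                bad.add(tag)
    return sorted(bad)


# Toy data: ("a", "n") is labeled NP three times and VP once.
sample = [
    ("NP", ("a", "n")),
    ("NP", ("a", "n")),
    ("NP", ("a", "n")),
    ("VP", ("a", "n")),
    ("VP", ("v", "n")),
    ("XP", ("p", "n")),
]
flagged = find_inconsistencies(sample)
violations = find_guideline_violations(sample)
```

On the toy data, the minority label VP for the child sequence ("a", "n") is flagged as a candidate for manual review, and XP is reported as a tag missing from the guideline. As the abstract notes, such inconsistencies are only potential errors, so flagged items still need human inspection.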