
Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (03): 11-19. doi: 10.6040/j.issn.1671-9352.3.2014.122

• Article •

Scientific literature information extraction based on semantic pattern and reference distribution

YANG Zhong-guo1,2, LI Hong-qi1,2, ZHU Li-ping1,2, LIU Qiang1,2

  1. Beijing Key Lab of Petroleum Data Mining, China University of Petroleum (Beijing), Beijing 102249, China;
  2. College of Geophysics and Information Engineering, China University of Petroleum (Beijing), Beijing 102249, China
  • Received: 2014-09-19 Revised: 2014-12-31 Online: 2015-03-20 Published: 2015-03-13
  • Corresponding author: LI Hong-qi (1960- ), male, professor; research interests: intelligent information processing and resource software engineering. E-mail: hq.li@cup.edu.cn
  • About the author: YANG Zhong-guo (1987- ), male, Ph.D. candidate; research interests: natural language understanding and data mining. E-mail: yangzhongguo1234@163.com
  • Supported by:
    China University of Petroleum (Beijing) Foundation Project (KYJJ2012-05-25); National Science and Technology Major Project (2011ZX05023-005-006, 2011ZX0520-007-007)

Abstract: In scientific and technical literature, passages that review previous research, analyze existing problems, and propose solutions are part of a paper's innovation information. The logical reasoning behind problem analysis information in paper writing, and the discourse relations through which it appears in the text, were analyzed. Reference distribution features, discourse relation features, and negative emotion features were combined to construct general-purpose semantic patterns for information extraction. Problem analysis information was extracted from the original text of a paper by matching these predefined semantic patterns. In addition, guide word features and semantic similarity computation were used to extract the paper's main work information. Taking scientific literature in the data mining field as an example, the proposed method was evaluated against manual extraction results. The results show that the method extracts the corresponding information fairly accurately and can provide a basic data source for clustering and recommending scientific papers.

Key words: reference distribution, textual relations, negative emotion, guide words, semantic pattern

CLC number: 

  • TP391
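
The pattern matching described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the actual patterns or lexicons, so the English cue lists, the citation regular expression, and the function name `extract_problem_analysis` below are all assumptions made for illustration. The toy pattern flags a sentence as problem analysis information when a citation marker appears in it or in the preceding sentence (reference distribution), and the sentence also contains a contrast connective (discourse relation) and a negative cue (negative emotion).

```python
import re

# Hypothetical cue lists and citation format, chosen for illustration only;
# the paper's real patterns and lexicons are not stated in the abstract.
CITATION = re.compile(r"\[\d+(?:\s*[-,]\s*\d+)*\]")
CONTRAST_WORDS = {"however", "but", "although", "nevertheless", "yet"}
NEGATIVE_WORDS = {"cannot", "fail", "fails", "lack", "lacks",
                  "limited", "difficult", "ignore", "ignores"}


def tokens(sentence):
    """Return the set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def extract_problem_analysis(sentences):
    """Return the sentences that match the toy semantic pattern.

    A sentence matches when all three conditions hold:
      1. a citation marker occurs in it or in the previous sentence,
      2. it contains a contrast connective,
      3. it contains a negative cue word.
    """
    matches = []
    for i, sentence in enumerate(sentences):
        words = tokens(sentence)
        cited_here = bool(CITATION.search(sentence))
        cited_before = i > 0 and bool(CITATION.search(sentences[i - 1]))
        has_contrast = bool(words & CONTRAST_WORDS)
        has_negative = bool(words & NEGATIVE_WORDS)
        if (cited_here or cited_before) and has_contrast and has_negative:
            matches.append(sentence)
    return matches


if __name__ == "__main__":
    intro = [
        "Clustering has been widely studied [1, 2].",
        "However, these methods cannot handle noisy data.",
        "This paper proposes a new algorithm.",
    ]
    for hit in extract_problem_analysis(intro):
        print(hit)
```

Requiring all three features together is a deliberate choice in this sketch: a contrast word or a negative word on its own is common in ordinary prose, while their co-occurrence near a citation is a stronger signal that the sentence criticizes prior work.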