
Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (07): 23-30. doi: 10.6040/j.issn.1671-9352.3.2014.307

• Articles •

An information extraction method for scientific literature introduction

ZHU Li-ping1,2, LI Hong-qi1,2, YANG Zhong-guo1,2, LIU Qiang1,2

  1. Beijing Key Lab of Petroleum Data Mining, China University of Petroleum (Beijing), Beijing 102249, China;
  2. College of Geophysics and Information Engineering, China University of Petroleum (Beijing), Beijing 102249, China
  • Received: 2014-09-19 Online: 2015-07-20 Published: 2015-07-31
  • Corresponding author: LI Hong-qi (born 1960), male, professor; research interests: intelligent information processing and resource software engineering. E-mail: hq.li@cup.edu.cn
  • About the author: ZHU Li-ping (born 1973), female, Ph.D. candidate, associate professor; research interests: natural language understanding and data mining. E-mail: zhuliping@cup.edu.cn
  • Funding:
    China University of Petroleum (Beijing) Fund Project (KYJJ2012-05-25); National Science and Technology Major Project (2011ZX05023-005-06, 2011ZX05020-007-007)

Abstract: Based on an analysis of the writing model of introductions, the sentences of a scientific paper's introduction are classified into three categories: background knowledge, problem analysis, and work description. For the sentences in each category, statistics are gathered on guide words, sentence patterns, clue words, and sentence position, and a corresponding rule base is built from these features. After word segmentation and part-of-speech tagging, each sentence is matched against the rules to determine its category, so that the information in the three parts can be extracted. In experiments on scientific literature from petroleum exploration and development and from data mining, the automatically extracted results were compared with manual classification. The results show that the method extracts all three types of information accurately.

Key words: scientific literature, clue words, information extraction, background knowledge
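
The rule-matching step described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the rule bank `RULES`, the weights, the `position_prior` heuristic, and all function names are hypothetical, the example patterns are English rather than Chinese, and the word segmentation and part-of-speech tagging steps are omitted for brevity. It only shows how guide words, clue words, and sentence position might be combined to assign each sentence to one of the three categories.

```python
import re
from dataclasses import dataclass

CATEGORIES = ("background", "problem", "work")


@dataclass(frozen=True)
class Rule:
    category: str          # category the rule votes for
    pattern: re.Pattern    # guide word, clue word, or sentence pattern
    weight: float = 1.0    # hypothetical strength of the vote


# Hypothetical rule bank; the paper's actual rules are not reproduced here.
RULES = [
    Rule("background", re.compile(
        r"\b(in recent years|has been widely|plays an important role)\b", re.I)),
    Rule("problem", re.compile(
        r"\b(however|but|limitation|difficult|lack of|problem)\b", re.I)),
    Rule("work", re.compile(
        r"\b(this paper|we propose|we present|in this (paper|study))\b", re.I), 2.0),
]


def position_prior(index, total):
    """Small per-category bonus based on the sentence's relative position."""
    rel = index / max(total - 1, 1)  # 0.0 = first sentence, 1.0 = last
    return {
        "background": 0.5 * (1.0 - rel),               # favors early sentences
        "problem": 0.5 * (1.0 - abs(rel - 0.5) * 2),   # favors the middle
        "work": 0.5 * rel,                             # favors late sentences
    }


def classify(sentence, index, total):
    """Score the sentence against every rule and return the best category."""
    scores = position_prior(index, total)
    for rule in RULES:
        if rule.pattern.search(sentence):
            scores[rule.category] += rule.weight
    return max(CATEGORIES, key=lambda c: scores[c])


def extract(sentences):
    """Group the sentences of one introduction by predicted category."""
    result = {c: [] for c in CATEGORIES}
    for i, sentence in enumerate(sentences):
        result[classify(sentence, i, len(sentences))].append(sentence)
    return result
```

Calling `extract` on the list of sentences from one introduction returns a dictionary with the keys `background`, `problem`, and `work`. Additive scoring is one possible design choice; a system closer to the abstract's description might instead apply the first matching rule from an ordered rule base.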

CLC number:

  • TP391