《山东大学学报(理学版)》 ›› 2019, Vol. 54 ›› Issue (3): 67-74.doi: 10.6040/j.issn.1671-9352.2.2018.212
Xue-mei WANG1(),Xing-shu CHEN1,2,Hai-zhou WANG2,Wen-xian WANG3,*()
摘要:
针对抽取新闻关键信息需要人工构造或训练生成模板的问题,提出了基于标签和分块特征的新闻关键信息自动抽取方法。该方法首先通过计算新闻网页相关特征来定位新闻正文标签块,然后通过编辑距离定位新闻标题标签块,最后根据正文块和标题块定位新闻发布时间和来源标签块,并通过抽取各块的文本获得目标新闻关键信息。在该方法的基础上提出了针对新闻站点的目标新闻自动抽取框架,并用该框架对10个新闻站点的30个新闻栏目进行了新闻抽取。对抽取到的1597条新闻随机选择了1000条进行了实验。实验结果表明,该方法对新闻标题、发布时间、来源、正文均表现出良好的抽取效果,且优于实验对比对象。
中图分类号:
1 |
梅雪, 程学旗, 郭岩, 等. 一种全自动生成网页信息抽取Wrapper的方法[J]. 中文信息学报, 2008, 22 (1): 22- 29.
doi: 10.3969/j.issn.1003-0077.2008.01.004 |
MEI Xue , CHENG Xueqi , GUO Yan , et al. Fully automatic Wrapper generation for web information extraction[J]. Journal of Chinese Information Processing, 2008, 22 (1): 22- 29.
doi: 10.3969/j.issn.1003-0077.2008.01.004 |
|
2 |
顾韵华, 高原, 高宝, 等. 基于模板和领域本体的Deep Web信息抽取研究[J]. 计算机工程与设计, 2014, 35 (1): 327- 332.
doi: 10.3969/j.issn.1000-7024.2014.01.061 |
GU Yunhua , GAO Yuan , GAO Bao , et al. Research on Deep Web information extraction based on template and domain ontology[J]. Computer Engineering and Design, 2014, 35 (1): 327- 332.
doi: 10.3969/j.issn.1000-7024.2014.01.061 |
|
3 |
郭少华, 郭岩, 李海燕, 等. 可扩展的网页关键信息抽取研究[J]. 中文信息学报, 2015, 29 (1): 97- 103.
doi: 10.3969/j.issn.1003-0077.2015.01.013 |
GUO Shaohua , GUO Yan , LI Haiyan , et al. Research on extensible web key information extraction[J]. Journal of Chinese Information Processing, 2015, 29 (1): 97- 103.
doi: 10.3969/j.issn.1003-0077.2015.01.013 |
|
4 | WENINGER T, HSU W H, HAN J. CETR: content extraction via tag ratios[C]// Proc of the 19th International Conference on World Wide Web. New York: ACM, 2010: 971-980. |
5 | SUN Fei, SONG Dandan, LIAO Lejian. Dom based content extraction via text density[C]// Proc of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2011: 245-254. |
6 |
SONG Dandan , SUN Fei , LIAO Lejian . A hybrid approach for content extraction with text density and visual importance of DOM nodes[J]. Knowledge and Information Systems, 2015, 42 (1): 75- 96.
doi: 10.1007/s10115-013-0687-x |
7 | MEHTA B, NARVEKAR M. DOM tree based approach for web content extraction[C]// 2015 International Conference on Communication, Information & Computing Technology. Mumbai: IEEE, 2015: 1-6. |
8 | 吴共庆, 胡骏, 李莉, 等. 基于标签路径特征融合的在线Web新闻内容抽取[J]. 软件学报, 2016, 27 (3): 714- 735. |
WU Gongqing , HU Jun , LI Li , et al. Online Web news extraction via tag path feature fusion[J]. Journal of Software, 2016, 27 (3): 714- 735. | |
9 |
FANG Yixiang , XIE Xiaoqin , ZHANG Xiaofeng , et al. STEM: a suffix tree-based method for web data records extraction[J]. Knowledge and Information Systems, 2018, 55 (2): 305- 331.
doi: 10.1007/s10115-017-1062-0 |
10 | CAI Deng, YU Shipeng, WEN Jirong, et al. VIPS: a vision-based page segmentation algorithm[J/OL]. (2003-11-01). https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/. |
11 | 吴秦, 胡丽娟, 梁久祯. 基于分块重要度和二维条件随机场的Web信息抽取[J]. 南京大学学报(自然科学版), 2014, (1): 79- 85. |
WU Qin , HU Lijuan , LIANG Jiuzhen . Web information extraction based on block importance model and 2D conditional random fields[J]. Journal of Nanjing University (Natural Sciences), 2014, (1): 79- 85. | |
12 | ZELENY J , BURGET R , ZENDULKA J . Box clustering segmentation: a new method for vision-based web page preprocessing[J]. Information Processing & Management, 2017, 53 (3): 735- 750. |
13 | PU Jiachen, LIU Jin, WANG Jin. A vision-based approach for deep web form extraction[M]// Advanced Multimedia and Ubiquitous Engineering. Singapore: Springer, 2017: 696-702. |
14 | JONATHAN H. Jsoup[DB/OL]. (2010-01-17)[2018-05-08]. https://github.com/jhy/jsoup. |
15 | YANG Lucasou. Newspaper[DB/OL]. (2013-11-25)[2018-05-08]. https://github.com/codelucas/newspaper. |
16 | HU Jun. Webcollector[DB/OL]. (2014-07-12)[2018-05-08]. https://github.com/CrawlScript/WebCollector. |
[1] | 李智恒,杨志豪,林鸿飞. 基于语义的疾病相关蛋白质知识抽取[J]. 山东大学学报(理学版), 2016, 51(3): 104-110. |
[2] | 苏丰龙,谢庆华,黄清泉,邱继远,岳振军. 基于直推式学习的半监督属性抽取[J]. 山东大学学报(理学版), 2016, 51(3): 111-115. |
[3] | 朱丽萍, 李洪奇, 杨中国, 刘蔷. 一种面向科技文献引言的信息抽取方法[J]. 山东大学学报(理学版), 2015, 50(07): 23-30. |
[4] | 王辉, 陈光. 基于Bootstrapping的英文产品评论属性词抽取方法[J]. 山东大学学报(理学版), 2014, 49(12): 23-29. |
[5] | 关冕,马军. 针对Web论坛的一种结构化数据自动抽取方法[J]. J4, 2010, 45(5): 42-47. |
[6] | 王 静,姚 勇,刘志镜 . 基于广义隐马尔可夫模型的网页信息抽取方法[J]. J4, 2007, 42(11): 49-52 . |
[7] | 王 雷,陈治平,李志成 . 基于文本分块的多模板隐马尔可夫模型的文本信息抽取[J]. J4, 2006, 41(3): 19-24 . |
|