JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2019, Vol. 54, Issue (3): 67-74. doi: 10.6040/j.issn.1671-9352.2.2018.212


Automatic extraction of key information for news web pages based on tag and block features

Xue-mei WANG1, Xing-shu CHEN1,2, Hai-zhou WANG2, Wen-xian WANG3,*

    1. College of Software Engineering, Sichuan University, Chengdu 610065, Sichuan, China
    2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China
    3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
  • Received: 2018-09-27 Online: 2019-03-01 Published: 2019-03-19
  • Contact: Wen-xian WANG E-mail: ewellwang@163.com; catean@scu.edu.cn
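  • Supported by:
    National Natural Science Foundation of China (61802270); National Natural Science Foundation of China (61802271); International R&D and Transformation Platform for Transformative Technologies of the National "Mass Entrepreneurship and Innovation" Demonstration Base (C700011); Sichuan Province Key Research and Development Program (2018G20100); Sichuan Province Science and Technology Support Program (2016GZ0038); Fundamental Research Funds for the Central Universities (2017SCU11065)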

Abstract:

To address the problem that extracting key information from news pages requires templates that are either built manually or generated through training, this paper proposes an automatic extraction method for news key information based on tag and block features. The method first locates the news body block by computing features of the text blocks, then locates the news title block using edit distance, then locates the tag blocks that hold the publish time and source according to the body and title blocks, and finally obtains the target key information by extracting the text of each block. Building on this method, an automatic extraction framework for news websites is proposed and used to extract news from 30 news columns of 10 news sites. A total of 1597 news items were collected, of which 1000 were randomly selected for the experiment. The experimental results show that the method extracts the news title, publish time, source, and body text well and outperforms the compared tools.
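
The title-location step can be illustrated with a minimal Python sketch. The abstract does not say which string the candidate blocks are compared against, so the sketch assumes it is the text of the page's `<title>` element. The function names `edit_distance` and `locate_title`, the length normalisation, and the sample strings are all hypothetical and are not taken from the paper.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance computed with a rolling one-row table."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute ca with cb
        prev = curr
    return prev[-1]


def locate_title(candidates: list[str], page_title: str) -> str:
    """Pick the candidate block closest to page_title (assumed reference string)."""
    def score(text: str) -> float:
        # Normalise by the longer length so short blocks are not favoured.
        longest = max(len(text), len(page_title), 1)
        return edit_distance(text, page_title) / longest
    return min(candidates, key=score)


# Hypothetical text blocks taken from one news page.
blocks = ["Home | World | Sports", "City opens new metro line", "2019-03-01 10:00"]
best = locate_title(blocks, "City opens new metro line - Daily News")
print(best)
```

Here the second block wins because it differs from the assumed `<title>` text only by the appended site name, while the navigation and timestamp blocks share almost nothing with it.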

Key words: tag and block features, key information, information extraction, website

CLC Number: 

  • TP391

Fig.1 Extensible key information extraction framework of news website

Table 1 Comparison of experimental results (%)

Key information   Newspaper[15]         Webcollector[16]      Proposed method
                  P      R      F       P      R      F       P      R      F
Title             100    93.5   96.6    99.1   90.8   94.8    100    97.3   98.6
Publish time      59.7   55.9   57.7    88.7   80.6   84.5    100    97.3   98.6
Body text         98.2   88.3   93.0    96.3   90.4   93.3    96.7   95.9   96.3
Source            -      -      -       -      -      -       99.3   76.9   87.7
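
The P, R, and F columns of Table 1 can be related with a short Python sketch. This page does not define F, so the sketch assumes the standard F-measure, the harmonic mean of precision and recall. The function name `f_measure` is hypothetical. The inputs are the rightmost P and R values of the title, publish time, and body text rows.

```python
def f_measure(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (assumed definition of F)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# (P, R) pairs in percent, copied from the rightmost columns of Table 1.
rows = {
    "Title": (100.0, 97.3),
    "Publish time": (100.0, 97.3),
    "Body text": (96.7, 95.9),
}
for name, (p, r) in rows.items():
    print(f"{name}: F = {f_measure(p, r):.1f}")
```

Under this assumption the three rows give 98.6, 98.6, and 96.3, matching the F values listed for them.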
References:

[1] MEI Xue, CHENG Xueqi, GUO Yan, et al. Fully automatic Wrapper generation for web information extraction[J]. Journal of Chinese Information Processing, 2008, 22(1): 22-29. doi: 10.3969/j.issn.1003-0077.2008.01.004
[2] GU Yunhua, GAO Yuan, GAO Bao, et al. Research on Deep Web information extraction based on template and domain ontology[J]. Computer Engineering and Design, 2014, 35(1): 327-332. doi: 10.3969/j.issn.1000-7024.2014.01.061
[3] GUO Shaohua, GUO Yan, LI Haiyan, et al. Research on extensible web key information extraction[J]. Journal of Chinese Information Processing, 2015, 29(1): 97-103. doi: 10.3969/j.issn.1003-0077.2015.01.013
[4] WENINGER T, HSU W H, HAN J. CETR: content extraction via tag ratios[C]// Proc of the 19th International Conference on World Wide Web. New York: ACM, 2010: 971-980.
[5] SUN Fei, SONG Dandan, LIAO Lejian. Dom based content extraction via text density[C]// Proc of the 34th International ACM SIGIR Conference on Research and Development in Information Retrieval. New York: ACM, 2011: 245-254.
[6] SONG Dandan, SUN Fei, LIAO Lejian. A hybrid approach for content extraction with text density and visual importance of DOM nodes[J]. Knowledge and Information Systems, 2015, 42(1): 75-96. doi: 10.1007/s10115-013-0687-x
[7] MEHTA B, NARVEKAR M. DOM tree based approach for web content extraction[C]// 2015 International Conference on Communication, Information & Computing Technology. Mumbai: IEEE, 2015: 1-6.
[8] WU Gongqing, HU Jun, LI Li, et al. Online Web news extraction via tag path feature fusion[J]. Journal of Software, 2016, 27(3): 714-735.
[9] FANG Yixiang, XIE Xiaoqin, ZHANG Xiaofeng, et al. STEM: a suffix tree-based method for web data records extraction[J]. Knowledge and Information Systems, 2018, 55(2): 305-331. doi: 10.1007/s10115-017-1062-0
[10] CAI Deng, YU Shipeng, WEN Jirong, et al. VIPS: a vision-based page segmentation algorithm[J/OL]. (2003-11-01). https://www.microsoft.com/en-us/research/publication/vips-a-vision-based-page-segmentation-algorithm/.
[11] WU Qin, HU Lijuan, LIANG Jiuzhen. Web information extraction based on block importance model and 2D conditional random fields[J]. Journal of Nanjing University (Natural Sciences), 2014, (1): 79-85.
[12] ZELENY J, BURGET R, ZENDULKA J. Box clustering segmentation: a new method for vision-based web page preprocessing[J]. Information Processing & Management, 2017, 53(3): 735-750.
[13] PU Jiachen, LIU Jin, WANG Jin. A vision-based approach for deep web form extraction[M]// Advanced Multimedia and Ubiquitous Engineering. Singapore: Springer, 2017: 696-702.
[14] JONATHAN H. Jsoup[DB/OL]. (2010-01-17)[2018-05-08]. https://github.com/jhy/jsoup.
[15] YANG Lucasou. Newspaper[DB/OL]. (2013-11-25)[2018-05-08]. https://github.com/codelucas/newspaper.
[16] HU Jun. Webcollector[DB/OL]. (2014-07-12)[2018-05-08]. https://github.com/CrawlScript/WebCollector.