
Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (07): 1-8. doi: 10.6040/j.issn.1671-9352.3.2014.270

• Articles •

A multi-level page clustering method based on page segmentation

FAN Yi-xing1,2, GUO Yan1, LI Xi-peng1,2, ZHAO Ling1, LIU Yue1, YU Xiao-ming1, CHENG Xue-qi1

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Received:2014-09-05 Online:2015-07-20 Published:2015-07-31
  • About the author: FAN Yi-xing (1990-), male, master's student; research interest: Web page information extraction. E-mail: fanyixing@software.ict.ac.cn
  • Supported by:
    National Basic Research Program of China ("973" Program) (2012CB316303); National High Technology Research and Development Program of China ("863" Program) (2012AA011003); National Key Technology R&D Program of China (2012BAH39B02); National Natural Science Foundation of China (61232010, 61202058)

Abstract: A multi-level Web page clustering method based on the structural features of pages is proposed. The method first segments each page into blocks and then clusters pages using their block features. By adjusting the similarity threshold during clustering, it provides three levels of clustering: pages from the same website; pages from the same website with the same structure; and pages from the same website with the same structure that were generated from the same template. Compared with existing page clustering methods, the method provides multi-level clustering results that meet different clustering needs, and it substantially improves clustering accuracy and efficiency.

Key words: page segmentation, page clustering, DOM
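
The threshold-controlled, three-level idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the abstract does not specify the block features, the similarity measure, the clustering procedure, or the threshold values. The sketch assumes that each page has already been segmented into a set of block labels, uses Jaccard similarity between those sets, and applies a greedy single-pass clustering within each site. The names `jaccard`, `cluster`, `multi_level`, `PAGES`, and `LEVELS`, the example URLs, and the thresholds 0.0, 0.7, and 0.9 are all hypothetical.

```python
from urllib.parse import urlparse


def jaccard(a, b):
    """Jaccard similarity between two sets of block labels (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster(pages, threshold):
    """Greedy single-pass clustering (assumed procedure).

    A page joins the first cluster whose representative (its first member)
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []
    for url, blocks in pages:
        for members in clusters:
            if jaccard(blocks, members[0][1]) >= threshold:
                members.append((url, blocks))
                break
        else:
            clusters.append([(url, blocks)])
    return [[url for url, _ in members] for members in clusters]


def multi_level(pages, threshold):
    """Group pages by site, then cluster the pages of each site."""
    by_site = {}
    for url, blocks in pages:
        by_site.setdefault(urlparse(url).netloc, []).append((url, blocks))
    return {site: cluster(members, threshold) for site, members in by_site.items()}


# Hypothetical pages, each already segmented into labelled blocks.
PAGES = [
    ("http://a.example/news/1", frozenset({"header", "nav", "article", "footer"})),
    ("http://a.example/news/2", frozenset({"header", "nav", "article", "footer"})),
    ("http://a.example/news/3",
     frozenset({"header", "nav", "article", "comments", "footer"})),
    ("http://a.example/list/1", frozenset({"header", "nav", "list", "footer"})),
    ("http://b.example/p/1", frozenset({"banner", "post"})),
]

# Hypothetical thresholds, one per clustering level.
LEVELS = {"same site": 0.0, "same structure": 0.7, "same template": 0.9}

for name, threshold in LEVELS.items():
    print(name, multi_level(PAGES, threshold))
```

In this sketch a threshold of 0.0 accepts every page of a site into one cluster, and raising the threshold splits each site into progressively finer groups, which mirrors the three levels described in the abstract.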

CLC Number: 

  • TP391