JOURNAL OF SHANDONG UNIVERSITY (NATURAL SCIENCE), 2015, Vol. 50, Issue (07): 1-8. doi: 10.6040/j.issn.1671-9352.3.2014.270


A multi-level page clustering method based on page segmentation

FAN Yi-xing1,2, GUO Yan1, LI Xi-peng1,2, ZHAO Ling1, LIU Yue1, YU Xiao-ming1, CHENG Xue-qi1   

  1. Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
  2. University of Chinese Academy of Sciences, Beijing 100190, China
  • Received: 2014-09-05 Online: 2015-07-20 Published: 2015-07-31

Abstract: This paper proposes a multi-level page clustering method based on page segmentation. Each page is divided into several blocks, and pages are then clustered by their block features. Adjusting the threshold on the similarity between pages yields three levels of clustering: the first level groups pages from the same website, the second level groups pages from the same website that share the same structure, and the third level groups pages from the same website that were produced with the same template. Compared with traditional methods, this method provides multi-level clustering and also clusters pages effectively.
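The thresholding idea in the abstract can be illustrated with a minimal sketch. The abstract does not specify the block features, the similarity measure, or the threshold values, so everything below is an assumption made for illustration: each page is represented by a hypothetical set of block signatures, similarity is taken to be the Jaccard coefficient, clustering is a simple greedy pass, and the thresholds 0.0, 0.5, and 0.9 are arbitrary. It is not the authors' algorithm.

```python
from urllib.parse import urlparse


def jaccard(a, b):
    """Jaccard similarity of two sets (assumed measure, not from the paper)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold):
    """Greedy single-pass clustering (illustrative only).

    A page joins the first cluster whose representative (its first member)
    is on the same site and at least `threshold` similar; otherwise it
    starts a new cluster.
    """
    clusters = []
    for url, blocks in pages:
        site = urlparse(url).netloc
        for cluster in clusters:
            rep_url, rep_blocks = cluster[0]
            same_site = urlparse(rep_url).netloc == site
            if same_site and jaccard(blocks, rep_blocks) >= threshold:
                cluster.append((url, blocks))
                break
        else:
            clusters.append([(url, blocks)])
    return clusters


# Hypothetical pages: (URL, set of block signatures from page segmentation).
pages = [
    ("http://a.example/news/1", frozenset({"header", "nav", "article", "footer"})),
    ("http://a.example/news/2", frozenset({"header", "nav", "article", "footer", "comments"})),
    ("http://a.example/news/3", frozenset({"header", "nav", "article", "footer"})),
    ("http://a.example/list", frozenset({"header", "nav", "list"})),
    ("http://b.example/x", frozenset({"header", "nav", "article", "footer"})),
]

# Illustrative thresholds for the three levels described in the abstract.
site_level = cluster_pages(pages, 0.0)       # same website
structure_level = cluster_pages(pages, 0.5)  # same website, similar structure
template_level = cluster_pages(pages, 0.9)   # same website, same template

for name, level in [("site", site_level),
                    ("structure", structure_level),
                    ("template", template_level)]:
    print(name, [[url for url, _ in cluster] for cluster in level])
```

Raising the threshold only ever splits clusters in this sketch, which is what makes the three levels nest: every template-level cluster lies inside a structure-level cluster, which in turn lies inside a site-level cluster.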

Key words: DOM, page segmentation, page clustering

CLC Number: TP391