Journal of Shandong University (Natural Science) ›› 2015, Vol. 50 ›› Issue (07): 1-8. doi: 10.6040/j.issn.1671-9352.3.2014.270
• Articles •
范意兴1,2, 郭岩1, 李希鹏1,2, 赵岭1, 刘悦1, 俞晓明1, 程学旗1
FAN Yi-xing1,2, GUO Yan1, LI Xi-peng1,2, ZHAO Ling1, LIU Yue1, YU Xiao-ming1, CHENG Xue-qi1
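Abstract: A multi-level web page clustering method based on the structural features of web pages is proposed. The method first segments each web page into blocks and then clusters the pages using their block features. By adjusting the threshold during clustering, it provides three levels of clustering: pages from the same site; pages from the same site with the same structure; and pages from the same site with the same structure and the same template. Compared with existing web page clustering methods, the method provides multi-level clustering results that meet different clustering needs, and it fundamentally improves both the accuracy and the efficiency of clustering.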
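The threshold-controlled, multi-level idea can be illustrated with a minimal sketch. The abstract does not specify the block features, the similarity measure, or the threshold values, so everything below is an assumption made for illustration: each page is represented by a hypothetical set of block signatures, similarity is taken to be the Jaccard coefficient, clustering is a simple greedy first-match pass within each site, and the three thresholds are arbitrary. It is not the authors' algorithm.

```python
from urllib.parse import urlparse


def jaccard(a, b):
    """Jaccard similarity of two sets of block signatures (assumed measure)."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def cluster_pages(pages, threshold):
    """Greedy single-pass clustering, restricted to pages of the same site.

    pages: list of (url, blocks), where blocks is a hypothetical set of
    block signatures produced by some page-segmentation step.
    A page joins the first cluster of its site whose representative page
    is at least `threshold` similar; otherwise it starts a new cluster.
    """
    clusters = []  # each cluster: {"site": str, "rep": set, "urls": [str]}
    for url, blocks in pages:
        site = urlparse(url).netloc
        for c in clusters:
            if c["site"] == site and jaccard(c["rep"], blocks) >= threshold:
                c["urls"].append(url)
                break
        else:
            clusters.append({"site": site, "rep": blocks, "urls": [url]})
    return [c["urls"] for c in clusters]


# Toy input: URLs and block signatures are invented for this example.
pages = [
    ("http://a.example/news/1", {"header", "nav", "article", "footer"}),
    ("http://a.example/news/2", {"header", "nav", "article", "footer"}),
    ("http://a.example/list", {"header", "nav", "list", "footer"}),
    ("http://a.example/about", {"header", "about", "footer"}),
    ("http://b.example/item/9", {"top", "item", "bottom"}),
]

# Illustrative thresholds only; the paper's actual values are not given here.
LEVELS = {"same site": 0.0, "same structure": 0.5, "same template": 0.9}
for name, t in LEVELS.items():
    print(name, cluster_pages(pages, t))
```

In this sketch a higher threshold yields finer clusters, so one pass per threshold gives the three levels named in the abstract; a real implementation would derive the block signatures from an actual page-segmentation step.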