Journal of Shandong University (Natural Science)


Improved Shark-Search algorithm based on page segmentation

CHEN Jun1, CHEN Zhu-min2

  1. Network Center, Shandong University, Jinan 250100, Shandong; 2. School of Computer Science and Technology, Shandong University, Jinan 250061, Shandong
  • Online: 2006-10-24 Published: 2006-10-24
  • Contact: CHEN Jun

Abstract: Shark-Search is a classical algorithm for focused crawling. However, it performs poorly when crawling Web pages that contain many noisy links. To address this, an improved Shark-Search algorithm based on page segmentation is proposed. It selects and filters links more effectively by evaluating relevance at three granularities: page, block, and individual link. Experiments show that the improved algorithm achieves markedly higher precision and sum of information than the traditional Shark-Search algorithm.

Key words: Shark-Search algorithm, focused crawling, relevance computation, page segmentation
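
To make the three granularities in the abstract concrete, the following is a minimal sketch of how a crawler might combine page-level, block-level, and link-level relevance when ranking outgoing links. It is not the authors' implementation: the bag-of-words cosine measure, the function names `cosine` and `score_links`, the weights, the block threshold, and the example URLs are all assumptions made for illustration, and the page is assumed to have already been split into blocks.

```python
import math
import re
from collections import Counter


def cosine(a: str, b: str) -> float:
    """Cosine similarity between the bag-of-words term counts of two strings."""
    va = Counter(re.findall(r"\w+", a.lower()))
    vb = Counter(re.findall(r"\w+", b.lower()))
    dot = sum(va[t] * vb[t] for t in va)
    norm = (math.sqrt(sum(c * c for c in va.values()))
            * math.sqrt(sum(c * c for c in vb.values())))
    return dot / norm if norm else 0.0


def score_links(topic, page_text, blocks,
                w_page=0.2, w_block=0.4, w_link=0.4, block_threshold=0.0):
    """Rank links by a weighted mix of page, block, and anchor-text relevance.

    blocks: list of (block_text, [(anchor_text, url), ...]) pairs, i.e. the
    output of some page-segmentation step that is not shown here.
    The weights and threshold are hypothetical values, not taken from the paper.
    """
    page_rel = cosine(topic, page_text)            # page granularity
    scored = []
    for block_text, links in blocks:
        block_rel = cosine(topic, block_text)      # block granularity
        if block_rel <= block_threshold:
            continue                               # drop every link in a noisy block
        for anchor_text, url in links:
            link_rel = cosine(topic, anchor_text)  # link granularity
            score = w_page * page_rel + w_block * block_rel + w_link * link_rel
            scored.append((score, url))
    return sorted(scored, reverse=True)            # highest score first


if __name__ == "__main__":
    blocks = [
        ("home login contact us", [("login", "http://example.com/login")]),
        ("survey of focused crawling algorithms",
         [("crawling survey", "http://example.com/survey"),
          ("more", "http://example.com/more")]),
    ]
    for score, url in score_links("focused crawling",
                                  "a page about focused crawling", blocks):
        print(f"{score:.3f}  {url}")
```

In this sketch the navigation block shares no terms with the topic, so its links are filtered out as a group, while links in the relevant block are ordered by how well their anchor text matches the topic.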

CLC number:

  • TP391