J4
ZHENG Jian-zhen1, LIN Kun-hui1, ZHOU Chang-le2, KANG Kai1
Abstract: A focused crawler can quickly gather large amounts of topic-specific information from the Web, which is of great value to specialized search engines and data mining applications. To address the shortcomings of the widely used keyword-based topic filtering strategy, this paper proposes a topic filtering strategy based on ontology semantics, inspired by the idea of concept aggregation. Because information at different positions in a Web page differs in importance, the paper also proposes an improved weighted formula for computing feature-term weights, which enables real-time semantic filtering of Web pages. To further improve crawler efficiency, a link relevance prediction algorithm is proposed. Comparative experiments show that the strategy is feasible.
Key words: focused crawler, topic filtering, link analysis, ontology semantics
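As a rough illustration of the position-weighted, concept-level filtering that the abstract describes, the Python sketch below maps page terms to concepts, weights each occurrence by where it appears, and compares the result with a topic vector. The abstract does not give the paper's formula, so everything here is an assumption made for illustration: the position weights, the toy term-to-concept map standing in for an ontology, the use of cosine similarity, and the threshold value. It should not be read as the authors' method.

```python
import math
from collections import Counter

# Hypothetical position weights. The abstract states only that different
# page positions carry different importance; these values are illustrative.
POSITION_WEIGHTS = {"title": 3.0, "anchor": 2.0, "body": 1.0}

# Hypothetical toy ontology: surface terms mapped to concepts.
TERM_TO_CONCEPT = {
    "crawler": "web_crawling",
    "spider": "web_crawling",
    "ontology": "semantics",
    "semantic": "semantics",
}


def concept_vector(page):
    """Sum position-weighted term occurrences, aggregated by concept.

    `page` maps a position name (e.g. "title") to the text found there.
    Terms that are not in the toy ontology are ignored.
    """
    vec = Counter()
    for position, text in page.items():
        weight = POSITION_WEIGHTS.get(position, 1.0)
        for term in text.lower().split():
            concept = TERM_TO_CONCEPT.get(term)
            if concept is not None:
                vec[concept] += weight
    return vec


def cosine(a, b):
    """Cosine similarity between two sparse concept vectors."""
    dot = sum(a[k] * b[k] for k in a if k in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def is_relevant(page, topic, threshold=0.5):
    """Keep a page if its concept vector is close enough to the topic."""
    return cosine(concept_vector(page), topic) >= threshold
```

In this sketch, "spider" in a title and "crawler" in the body both contribute to the same concept, which a plain keyword filter would treat as unrelated terms; the title occurrence simply counts for more.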
ZHENG Jian-zhen, LIN Kun-hui, ZHOU Chang-le, KANG Kai. Ontology based on focused crawler[J]. J4, 2006, 41(3): 90-94.
Link to this article: http://lxbwk.njournal.sdu.edu.cn/CN/
http://lxbwk.njournal.sdu.edu.cn/CN/Y2006/V41/I3/90