Focused Crawler Based on Ontology Semantics

Original title: 基于本体语义的定题爬虫 (published English title: "Ontology based on focused crawler")

ZHENG Jian-zhen¹, LIN Kun-hui¹, ZHOU Chang-le², KANG Kai¹

Software School, Xiamen University, Xiamen 361005, Fujian, China

Received: 2006-03-29 Revised: 1900-01-01 Online: 2006-10-24 Published: 2006-10-24
Corresponding author: ZHENG Jian-zhen

Abstract

Abstract: A focused crawler can quickly gather large amounts of topic-specific information from the Web, which makes it valuable for both specialized search engines and data mining applications. To address the shortcomings of the keyword-based topic filtering strategies in common use, this paper proposes a topic filtering strategy based on ontology semantics, inspired by the idea of concept aggregation. Because information at different positions in a Web page differs in importance, the paper also proposes an improved weighted formula for computing feature-term weights, which enables real-time, semantics-based filtering of Web pages. To further improve the crawler's efficiency, a link relevance prediction algorithm is proposed. Comparative experiments show that the strategy is feasible.

Key words: focused crawler, topic filtering, hyperlink analysis, ontology semantic analysis

ZHENG Jian-zhen, LIN Kun-hui, ZHOU Chang-le, KANG Kai. Ontology based on focused crawler[J]. J4, 2006, 41(3): 90-94.
