Automatic extracting topic page links from Hub page

J4 ›› 2012, Vol. 47 ›› Issue (5): 25-31.

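XIA Tian^1,2

1. Key Laboratory of Data Engineering and Knowledge Engineering, Ministry of Education (Renmin University of China), Beijing 100872, China;
2. School of Information Resource Management, Renmin University of China, Beijing 100872, China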

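Received: 2011-11-30 Online: 2012-05-20 Published: 2012-06-01
About the author: XIA Tian (1978- ), male, Ph.D., associate professor; main research interest: Web data mining. Email: xiatian1119@gmail.com
Funding:
Supported by the National Social Science Foundation of China (09CTQ027)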

Abstract

Based on the extended label tree, a method is proposed for automatically extracting topic page links from hub pages. First, an ordered list of links is built and a link prefix tree is used to learn deny rules for topic page links, so that the type of each link can be pre-determined. Second, through group splitting and re-merging of similar groups, the links on the page are assigned to different groups; the type of each group and the group representing the hub page's core region are then identified, and finally every link is placed into one of three link collections. Experimental results show that the method achieves high-precision topic link extraction without training.

Key words: link extraction; extended label tree; link prefix tree
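
The prefix-tree step mentioned in the abstract can be illustrated with a minimal sketch. The abstract does not say how deny rules are derived, so everything below is an assumption made for illustration: links are split into host and path segments, and a hypothetical heuristic follows the most populated branch while it still covers at least `min_ratio` of all links, turning each sibling branch left behind into a deny rule. The names `PrefixNode`, `build_prefix_tree`, `learn_deny_rules`, `is_denied` and the `example.com` URLs are not from the paper.

```python
from urllib.parse import urlparse


class PrefixNode:
    """One node of a link prefix tree; count = number of links passing through it."""

    def __init__(self):
        self.children = {}
        self.count = 0


def path_segments(url):
    """Split a URL into its host followed by the non-empty path segments."""
    parts = urlparse(url)
    return [parts.netloc] + [s for s in parts.path.split("/") if s]


def build_prefix_tree(urls):
    """Insert every link into the tree in page order, counting links per prefix."""
    root = PrefixNode()
    for url in urls:
        node = root
        node.count += 1
        for seg in path_segments(url):
            node = node.children.setdefault(seg, PrefixNode())
            node.count += 1
    return root


def learn_deny_rules(root, min_ratio=0.5):
    """Hypothetical heuristic (an assumption, not the paper's rule).

    Follow the most populated branch while it still covers `min_ratio` of all
    links; every sibling branch left behind becomes a deny rule.
    Returns (dominant_prefix, sorted_deny_rules).
    """
    total = root.count
    rules = []
    prefix, node = (), root
    while node.children:
        seg, best = max(node.children.items(), key=lambda kv: kv[1].count)
        if best.count < min_ratio * total:
            break
        rules.extend(prefix + (s,) for s in node.children if s != seg)
        prefix, node = prefix + (seg,), best
    return prefix, sorted(rules)


def is_denied(url, rules):
    """A link is pre-classified as a non-topic link if any deny rule is a prefix of it."""
    segs = tuple(path_segments(url))
    return any(segs[:len(rule)] == rule for rule in rules)


if __name__ == "__main__":
    links = [
        "http://example.com/news/2012/a.html",
        "http://example.com/news/2012/b.html",
        "http://example.com/news/2012/c.html",
        "http://example.com/about.html",
        "http://example.com/contact.html",
    ]
    dominant, rules = learn_deny_rules(build_prefix_tree(links))
    print(dominant)
    print(rules)
```

In this sketch no labelled training data is needed: the rules come only from the links found on the page itself, which is in the spirit of the training-free claim above, though the actual rule discovery in the paper may differ.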

XIA Tian. Automatic extracting topic page links from Hub page[J]. J4, 2012, 47(5): 25-31.
