Journal of Shandong University (Natural Science), 2011, Vol. 46, Issue (5): 34-38.

• SEWM 2011 Conference •

Deep directional collection of Web data

XIA Tian1,2

  1. Key Laboratory of Data Engineering and Knowledge Engineering, MOE, Beijing 100872, China;
    2. School of Information Resource Management, Renmin University of China, Beijing 100872, China
  • Received: 2010-12-06 Published: 2011-05-25
  • About the author: XIA Tian (1978- ), male, Ph.D., lecturer; main research interest: Web data mining. Email: iamxiatian@gmail.com
  • Supported by:

    National Social Science Foundation of China (09CTQ027)

Abstract:

By simulating how a human browses web pages, the method extracts a set of directed-crawl subpages that restricts the crawler's crawling direction. It introduces an inheritance relationship between pages and, through property inheritance between crawl items, associates the data of compound objects that span multiple pages. A general crawl process supporting deep directional collection is then designed and implemented. Public-opinion collection experiments on hot posts from the Tianya site show that the method can collect complex objects without changing the overall processing flow, and that it achieves high collection efficiency.

Key words: deep collection; directional web crawler; public web opinion
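
The abstract does not give the crawl procedure in detail, so the following is only a minimal sketch of the two ideas it names: restricting the crawl direction to an extracted set of subpages, and letting each crawl item inherit properties from its parent page. The in-memory `SITE`, the `RULES` patterns, and the names `CrawlItem`, `allowed_subpages`, and `crawl` are all hypothetical and are not taken from the paper.

```python
import re
from collections import deque
from dataclasses import dataclass, field

# Hypothetical in-memory site: URL -> (fields found on the page, outgoing links).
SITE = {
    "/hot": ({}, ["/post/1", "/post/2", "/about"]),
    "/post/1": ({"title": "First post"}, ["/post/1/p2", "/user/a"]),
    "/post/1/p2": ({"reply": "a reply on page 2"}, []),
    "/post/2": ({"title": "Second post"}, []),
    "/about": ({}, []),
    "/user/a": ({}, []),
}

# Hypothetical direction rules: (parent page pattern, allowed subpage patterns).
RULES = [
    (re.compile(r"^/hot$"), [re.compile(r"^/post/\d+$")]),
    (re.compile(r"^/post/\d+$"), [re.compile(r"^/post/\d+/p\d+$")]),
]


@dataclass
class CrawlItem:
    url: str
    inherited: dict = field(default_factory=dict)  # properties from ancestor pages


def allowed_subpages(url, links):
    """Keep only the links that the rule for this page type permits."""
    for page_pattern, child_patterns in RULES:
        if page_pattern.match(url):
            return [l for l in links if any(p.match(l) for p in child_patterns)]
    return []  # no rule for this page type: do not crawl any further


def crawl(seed):
    """Breadth-first crawl from seed, following only allowed subpages."""
    queue = deque([CrawlItem(seed)])
    seen = {seed}
    records = []
    while queue:
        item = queue.popleft()
        fields, links = SITE[item.url]
        # Own fields are merged over the inherited ones, so a reply page
        # still carries the title extracted from its parent post page.
        props = {**item.inherited, **fields}
        if fields:
            records.append({"url": item.url, **props})
        for link in allowed_subpages(item.url, links):
            if link not in seen:
                seen.add(link)
                queue.append(CrawlItem(link, props))
    return records
```

Calling `crawl("/hot")` on this toy site never visits `/about` or `/user/a`, and the record for `/post/1/p2` carries the `title` inherited from `/post/1`. This is how one compound object spread over several pages can stay linked.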
