J4 ›› 2011, Vol. 46 ›› Issue (5): 34-38.
• SEWM 2011 Conference •
XIA Tian (夏天)1,2
Funding: Project supported by the National Social Science Fund of China (09CTQ027)
Abstract:
By simulating how people browse Web pages, the crawler's direction is restricted to a set of extracted directional-crawl subpages. A page inheritance relationship is introduced, and the data associations within cross-page compound objects are realized through property inheritance between crawl items. A general crawl process that supports deep directional collection is then designed and implemented. Experiments on public-opinion collection from hot posts on the Tianya site show that the method can collect complex objects without changing the overall processing flow, and that it achieves high collection efficiency.
Key words: deep collection; directional web crawler; public web opinion
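The two ideas in the abstract, restricting crawl direction to extracted subpages and passing properties from a parent crawl item to its children, can be illustrated with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper: the `CrawlItem` class, the per-level `RULES` format, the URL patterns, and the in-memory `PAGES` dictionary that stands in for real HTTP fetches.

```python
import re
from collections import deque
from dataclasses import dataclass, field


@dataclass
class CrawlItem:
    """A queued page plus the properties inherited from its ancestors (hypothetical)."""
    url: str
    level: int = 0
    props: dict = field(default_factory=dict)


# Hypothetical rules, one per page level: which links may be followed
# (the crawl direction) and which fields are extracted on that level.
RULES = [
    {"follow": re.compile(r'href="(/post/\d+)"'), "extract": {}},
    {"follow": re.compile(r'href="(/post/\d+/page/\d+)"'),
     "extract": {"title": re.compile(r"<h1>(.*?)</h1>")}},
    {"follow": None, "extract": {"reply": re.compile(r"<p>(.*?)</p>")}},
]


def crawl(seed, fetch, rules):
    """Breadth-first crawl that only follows links allowed by the level's rule."""
    queue = deque([CrawlItem(seed)])
    seen = {seed}
    records = []
    while queue:
        item = queue.popleft()
        html = fetch(item.url)
        rule = rules[item.level]
        # Properties extracted on this page, merged over the inherited ones.
        own = {name: pat.findall(html) for name, pat in rule["extract"].items()}
        props = {**item.props, **own}
        if rule["follow"] is None:
            # Leaf level: emit one record carrying all inherited properties.
            records.append(props)
            continue
        for link in rule["follow"].findall(html):
            if link not in seen:
                seen.add(link)
                queue.append(CrawlItem(link, item.level + 1, props))
    return records


# Toy site held in memory; a real crawler would fetch over HTTP instead.
PAGES = {
    "/hot": '<a href="/post/1">A</a> <a href="/ads">ad</a>',
    "/post/1": '<h1>Hot topic</h1> <a href="/post/1/page/1">1</a> '
               '<a href="/post/1/page/2">2</a> <a href="/user/9">u</a>',
    "/post/1/page/1": "<p>first</p>",
    "/post/1/page/2": "<p>second</p><p>third</p>",
}

records = crawl("/hot", PAGES.__getitem__, RULES)
for record in records:
    print(record)
```

In this sketch the post title is extracted once on the post page and inherited by every reply page, so each leaf record is a complete compound object even though its fields come from different pages. Links such as `/ads` and `/user/9` match no rule and are never fetched.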
XIA Tian. Deep directional collection of Web data[J]. J4, 2011, 46(5): 34-38.
Link to this article: http://lxbwk.njournal.sdu.edu.cn/CN/
http://lxbwk.njournal.sdu.edu.cn/CN/Y2011/V46/I5/34