Journal of Shandong University (Natural Science), 2019, Vol. 54, Issue 5: 28-36, 43. doi: 10.6040/j.issn.1671-9352.2.2018.058
Pei-ming WANG, Xing-shu CHEN, Hai-zhou WANG, Wen-xian WANG
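Abstract:
Microblogs are becoming a major social medium for disseminating public information, and collecting microblog data efficiently matters for online public opinion analysis. Taking Sina Weibo as the object of study, this paper examines three data collection schemes, based respectively on the Weibo API, on simulated login, and on constructed visitor cookies, and proposes a multi-strategy collection scheme that combines them. For the simulated-login scheme, an adaptive concurrent collection algorithm is designed and implemented, which keeps data collection relatively stable and efficient. For the visitor-cookie scheme, a high-availability proxy pool module is designed and implemented, which further improves collection efficiency. Experimental results show that the scheme combining the simulated-login adaptive concurrent collection strategy with constructed visitor cookies can acquire Weibo data efficiently, comprehensively, and stably.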
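The abstract does not say how the adaptive concurrent collection algorithm adjusts its concurrency. As a rough illustration only, the sketch below assumes an additive-increase/multiplicative-decrease (AIMD) rule, a common congestion-avoidance policy: the worker limit grows by one after a round with no throttling and is halved when the server throttles requests. The class name `AimdController`, its parameters, and the `simulate` helper are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class AimdController:
    """Hypothetical AIMD controller for the number of concurrent fetch workers."""

    limit: int = 4         # current concurrency limit (assumed starting value)
    min_limit: int = 1     # never drop below one worker
    max_limit: int = 64    # assumed upper bound on workers
    step: int = 1          # additive increase after a clean round
    backoff: float = 0.5   # multiplicative decrease after a throttled round

    def update(self, throttled: bool) -> int:
        """Adjust the limit after one round of requests and return the new limit."""
        if throttled:
            # Back off sharply when the server signals throttling.
            self.limit = max(self.min_limit, int(self.limit * self.backoff))
        else:
            # Probe for more capacity slowly while requests succeed.
            self.limit = min(self.max_limit, self.limit + self.step)
        return self.limit


def simulate(outcomes):
    """Feed a sequence of per-round throttling flags to a fresh controller."""
    controller = AimdController()
    return [controller.update(throttled) for throttled in outcomes]


if __name__ == "__main__":
    # Three clean rounds, one throttled round, then one clean round.
    print(simulate([False, False, False, True, False]))
```

In a real collector, the `throttled` flag would come from whatever signal the target site gives (for example an HTTP error status or a forced re-login), and the returned limit would size the worker pool for the next round. Both of those choices are assumptions here.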