JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2019, Vol. 54, Issue (5): 28-36, 43. doi: 10.6040/j.issn.1671-9352.2.2018.058

Research on microblog data collection based on multiple hybrid strategy

Pei-ming WANG1, Xing-shu CHEN1,2, Hai-zhou WANG2, Wen-xian WANG3,*

  1. College of Computer Science, Sichuan University, Chengdu 610065, Sichuan, China
  2. College of Cybersecurity, Sichuan University, Chengdu 610065, Sichuan, China
  3. Cybersecurity Research Institute, Sichuan University, Chengdu 610065, Sichuan, China
  • Received:2018-10-17 Online:2019-05-20 Published:2019-05-09
  • Contact: Wen-xian WANG E-mail:resolvewang@foxmail.com;catean@scu.edu.cn
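  • Supported by:
    National Natural Science Foundation of China (61802270); National Natural Science Foundation of China (61802271); National "Mass Entrepreneurship and Innovation" Demonstration Base project, International R&D and Transformation Platform for Transformative Technologies (C700011); Sichuan Province Key Research and Development Project (2018G20100); Sichuan Province Science and Technology Support Program (2016GZ0038); Fundamental Research Funds for the Central Universities (2017SCU11065)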

Abstract:

Microblogs are becoming the main social medium for spreading public information, so efficient acquisition of microblog data is important for the analysis of online public opinion. Taking microblogs as the research object, three data collection strategies are studied: the microblog API, simulated login, and visitor cookies. A microblog data collection method based on a fusion of these strategies is proposed. An adaptive concurrent data acquisition algorithm is designed and implemented for the web crawler based on simulated login. A highly available IP proxy pool is designed to accelerate data acquisition for the web crawler based on visitor cookies. Experimental results show that the fusion strategy is more effective, complete, and stable in microblog data collection.

Key words: API, simulated login, visitor Cookie, fusion strategy, adaptive, IP proxy pool

CLC Number: TP393
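
The abstract names an adaptive concurrent acquisition algorithm for the crawler based on simulated login, but it does not state the update rule. As a rough illustration only, the sketch below assumes an additive-increase/multiplicative-decrease (AIMD) rule: the worker count grows by a fixed step while requests succeed and is cut by a factor when throttling is observed. The class `AimdConcurrency`, its parameters, and the `blocked` signal are hypothetical and are not taken from the paper.

```python
class AimdConcurrency:
    """Hypothetical controller: additive increase, multiplicative decrease."""

    def __init__(self, start=4, floor=1, ceiling=64, step=1, backoff=0.5):
        self.floor = floor        # never drop below this many workers
        self.ceiling = ceiling    # never exceed this many workers
        self.step = step          # workers added after a clean batch
        self.backoff = backoff    # factor applied after a throttled batch
        self.workers = start

    def update(self, blocked: bool) -> int:
        """Return the worker count to use for the next batch.

        blocked=True means the previous batch showed a sign of throttling
        (for example a captcha page or an error response); deciding what
        counts as such a sign is left to the caller.
        """
        if blocked:
            self.workers = max(self.floor, int(self.workers * self.backoff))
        else:
            self.workers = min(self.ceiling, self.workers + self.step)
        return self.workers


ctl = AimdConcurrency(start=4)
history = [ctl.update(blocked) for blocked in (False, False, False, True, False)]
print(history)
```

With these assumed defaults, three clean batches raise the worker count from 4 to 7, one throttled batch halves it to 3, and the next clean batch brings it back to 4.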

Table 1

Request limit on the microblog API per IP

Authorization level    Requests per hour (h-1)
Test                   1 000
Ordinary               10 000
Intermediate           20 000
Advanced               30 000
Partner                40 000

Table 2

Request limit on the microblog API per account

Authorization level    Total requests (h-1)    Microblog posts (h-1)    Comments (h-1)
Test                   150                     30                       60
Ordinary               1 000                   30                       60
Intermediate           1 500                   60                       120
Advanced               2 000                   90                       180
Partner                4 000                   120                      240
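
The two tables can be combined into a simple request budget. The sketch below is only an illustration, under the assumption that the per-IP and per-account limits apply at the same time and that the quotas of several accounts used from one IP add up; the function names are hypothetical, and the English tier names are renderings of the five authorization levels listed in the tables.

```python
# Hourly quotas copied from Table 1 (per IP) and Table 2 (per account, total requests).
PER_IP_LIMIT = {
    "test": 1_000, "ordinary": 10_000, "intermediate": 20_000,
    "advanced": 30_000, "partner": 40_000,
}
PER_ACCOUNT_LIMIT = {
    "test": 150, "ordinary": 1_000, "intermediate": 1_500,
    "advanced": 2_000, "partner": 4_000,
}


def hourly_budget(level: str, accounts: int) -> int:
    """Requests per hour that one IP can issue with `accounts` accounts.

    Assumption: account quotas are additive and the per-IP cap bounds the sum.
    """
    return min(PER_IP_LIMIT[level], accounts * PER_ACCOUNT_LIMIT[level])


def min_interval_seconds(level: str, accounts: int) -> float:
    """Even spacing between requests that stays inside the hourly budget."""
    return 3600 / hourly_budget(level, accounts)


def accounts_to_saturate_ip(level: str) -> int:
    """Smallest number of accounts whose combined quota reaches the per-IP cap."""
    return -(-PER_IP_LIMIT[level] // PER_ACCOUNT_LIMIT[level])  # ceiling division


print(hourly_budget("ordinary", 3))
print(min_interval_seconds("ordinary", 3))
print(accounts_to_saturate_ip("test"))
```

Under these assumptions, three ordinary-level accounts on one IP give a budget of 3 000 requests per hour, which corresponds to one request every 1.2 seconds, and seven test-level accounts are enough to reach the 1 000 requests per hour allowed per IP.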

Fig.1

Data collection process based on simulated login

Fig.2

Microblog data collection process based on simulated login and visitor cookie fusion

Fig.3

Crawling and validating process of proxy IP

Fig.4

Architecture diagram of microblog data collection

Fig.5

Diagram of user relationship collection performance comparison

Fig.6

Diagram of user information collection performance comparison

Fig.7

Diagram of microblog information collection performance comparison

Fig.8

Diagram of comment information collection performance comparison

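Table 3

Comparison of microblog data collection with different strategies

Collection method                                  User profiles    Microblogs    User relationships
Method combining a collector with a web crawler    1 874            26 972        37 862
Collection method based on user influence          1 932            30 400        42 300
Method proposed in this paper                      7 547            98 240        42 105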