JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE) ›› 2016, Vol. 51 ›› Issue (9): 106-112.doi: 10.6040/j.issn.1671-9352.1.2015.057

Previous Articles     Next Articles

K-Canopy:a fast data segmentation algorithm for the topic detection

CHEN Qiang1,2, DU Pan1, CHEN Hai-qiang3, BAO Xiu-guo4, LIU Yue1, CHENG Xue-qi1*   

  1. 1.CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100190, China;
    3. China Information Technology Security Evaluation Center, Beijing 100085, China;
    4. National Computer Network and Information Security Management Center, Beijing 100029, China
  • Received:2015-09-25 Online:2016-09-20 Published:2016-09-23

Abstract: This paper presented a pre-clustering algorithm for tasks of topic detection on big data. To support the parallelization of the successive topic detection task,the proposed algorithm was designed to segment the dataset according to the semantic association among data points as evenly and efficiently as possible. The experimental result shows that our proposed algorithm is effective at segmenting dataset while preserving semantic association inside data blocks, and is helpful for improving the efficiency and effectiveness of topic detection.

Key words: topic detection, balance, big data, data segmentation

CLC Number: 

  • TP391
[1] 黄波.基于向量空间模型和LDA模型相结合的微博客话题发现算法研究[M].成都:西南交通大学出版社,2012. HUANG Bo. Microblog topic detection algorithm based on combination of vector space model and LDA [D]. Chengdu: Southwest Jiaotong University Press, 2012.
[2] 魏明川,朱俊杰,张瑾,等.基于吸收马尔可夫链的子话题发现方法[J]. 中文信息学报,2014,28(1):41-46. WEI Mingchuan, ZHU Junjie, ZHANG Jin, et al. Sub-topic detection method based on absorbing markov chain [J].Journal of Chinese Information Processing, 2014, 28(1):41-46.
[3] 袁继鹏.网络舆情话题演化及话题重要度分析[D].北京:中国科学院计算技术研究所,2012. YUAN Jipeng. Public opinion evolvement and topical importance analysis on web[D]. Beijing: Institute of Computing Technology, CAS, 2012.
[4] MACQUEEN J. Some methods for classification and analysis of multivariate observations[C] //Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability.[S.l.] :[s.n.] , 1967:281-297.
[5] 黄康平.个性化话题推荐[D].北京:中国科学院计算技术研究所,2013. HUANG Kangping. Personalized topic recommendation[D]. Beijing: Institute of Computing Technology, CAS, 2013.
[6] CHAN P K, SCHLAG D F, ZIEN J Y. Spectral K-way ratio-cut partitioning and clustering [J]. IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems, 1994, 13(9):1088-1096.
[7] SHI J, MALIK J. Normalized cuts and image segmentation[J]. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000, 22(8):888-905.
[8] DING C H Q, HE X, ZHA H, et al. A min-max cut algorithm for graph partitioning and data clustering[C] // Proceedings of the IEEE International Conference on Data Mining. New York: IEEE, 2001.
[9] PAPKA R, ALLAN J. On-line new event detection using single pass clustering TITLE2[R]. Amherst, MA: University of Massachusetts, 1998.
[10] MCCALLUM A, NIGAM K, UNGAR L H. Efficient clustering of high-dimensional data sets with application to reference matching[C] //Proceedings of the 6th ACM SIGKDD International Conference on Knowledge Discovery and Data mining. New York: ACM, 2000:169-178.
[11] MANKU G S, JAIN A, SARMA A D. Detecting near-duplicates for web crawling[C] //Proceedings of the 16th International Conference on World Wide Web. New York: ACM, 2007:141-150.
[1] YAN Yan, HAO Xiao-hong. Differential privacy partitioning algorithm based on adaptive density grids [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2018, 53(9): 12-22.
[2] . Design and implementation of topic detection in Russian news based on ontology [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2018, 53(9): 49-54.
[3] LI Jin-hai, WU Wei-zhi. Granular computing approach for formal concept analysis and its research outlooks [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2017, 52(7): 1-12.
[4] LI Run-chuan, ZAN Hong-ying, SHEN Sheng-ya, BI Yin-long, ZHANG Zhong-jun. Spam messages identification based on multi-feature fusion [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2017, 52(7): 73-79.
[5] DU Hong-le, ZHANG Yan, ZHANG Lin. Intrusion detection on imbalanced dataset [J]. JOURNAL OF SHANDONG UNIVERSITY(NATURAL SCIENCE), 2016, 51(11): 50-57.
[6] SUN Lin. The application of some special hyper-graphs in the perfect graphs [J]. J4, 2011, 46(8): 92-94.
[7] LIU Xie-jin1, MIAO Bai-qi2. The superiority of the Bayes linear unbiased minimum Variance estimator under balanced loss function [J]. J4, 2011, 46(11): 89-95.
[8] TIAN Shuang-liang. Star total colorings of Mycielski’s graphs of the balanced general join of graphs [J]. J4, 2010, 45(6): 23-26.
[9] LU Jian-li, CAI Wen-juan. The number of vertex-disjoint 6-cycles containing specified vertices in a balance bipartite graph [J]. J4, 2010, 45(12): 5-11.
[10] GENG Jian-yan,YAN Jin,LI Feng . Vertex-disjoint 4-cycle in a bipartite graph [J]. J4, 2008, 43(5): 87-92 .
[11] ZHANG Yan-gang,RONG Xiao-xia,WANG Feng . An endogenous growth model of knowledge-based R&D economy [J]. J4, 2008, 43(10): 56-59 .
[12] GAO Yun-shu and LI Guo-jun . On 2-factors with large cycles in biparitite graph [J]. J4, 2007, 42(4): 28-31 .
Viewed
Full text


Abstract

Cited

  Shared   
  Discussed   
No Suggested Reading articles found!