K-Canopy: a fast data segmentation algorithm for topic detection

CHEN Qiang1,2, DU Pan1, CHEN Hai-qiang3, BAO Xiu-guo4, LIU Yue1, CHENG Xue-qi1*   

    1. CAS Key Lab of Network Data Science and Technology, Institute of Computing Technology, Chinese Academy of Sciences, Beijing 100190, China;
    2. University of Chinese Academy of Sciences, Beijing 100190, China;
    3. China Information Technology Security Evaluation Center, Beijing 100085, China;
    4. National Computer Network and Information Security Management Center, Beijing 100029, China
  • Received: 2015-09-25 Online: 2016-09-20 Published: 2016-09-23

Abstract: This paper presents a pre-clustering algorithm for topic detection on big data. To support parallelization of the subsequent topic detection task, the proposed algorithm segments the dataset according to the semantic associations among data points, as evenly and efficiently as possible. Experimental results show that the proposed algorithm segments the dataset effectively while preserving semantic associations within data blocks, and that it helps improve both the efficiency and the effectiveness of topic detection.

Key words: topic detection, balance, big data, data segmentation

