Clustering an Evolving Data Stream 对不断发展的数据流进行聚类

Data stream clustering can be used for network intrusion detection, analysing web clickstreams, and stock market analysis. Efficient stream clustering algorithms should have at least one of following properties:

  • Compute and store summaries of past data 计算并储存历史数据的总结
  • Apply a divide and conquer strategy 应用分而治之的策略
    • Divide data streams into chunks based on order of arrival, compute summaries for these chunks, and then merge the summaries with the previous ones. 根据到达顺序将数据流分为数据块,计算这些数据块的摘要,并与之前的数据块合并。
  • Incremental clustering of incoming data streams 传入数据流的增量聚类
  • Perform, micro-clustering as well as macro-clustering 执行微观集群和宏观集群
    • Two step approaches 两步方法
    • Compute and store summaries at the micro-cluster level 在微集群级别计算和存储摘要
    • Cluster micro-clusters to obtain macro-clusters 将微集群聚集成宏观集群
  • Explore multiple time granularity for the analysis of cluster evolution 探索多时间粒度来分析集群演化
    • Give more weight to the recent data for clustering 更重视最近的数据以便进行聚类。
  • Divide stream clustering into online and offline processes 将流聚类操作分为在线和离线流程

Many algorithms have been developed for clustering data streams. We introduce the efficient STREAM divide-and-conquer algorithm here.

STREAM: K-medians based stream clustering algorithm.

Choose k , the number of clusters 选择k为聚类数量
Choose m the main-memory-bound bucket size of points arriving in the stream, m>>k 选择m是到达流中的点的主存限制桶的大小。 m远大于k
Intialise Megacluster = null
Repeat
Wait until we obtain m data points from a data stream.
When m data points are collected, find k/2 clusters from the collected set using k-medoids clustering
Keep the centroids for each cluster c{ij} together with the number of points assigned to each cluster w{ij}, but discard the m data points.
// Now we have k/2 weighted cluster centres as well as a summarised megacluster from earlier in the stream, and we compress back to only one weighted set of k cluster centres (i.e. a new megacluster) by finding k cluster centres from all the previous weighted centres using k-medoid clustering.

  1. If w{ij} is the weight of cluster centre c{ij}, then notionally duplicate the cluster centre w{ij} times, and rerun k-medoid algorithm on the \Sigma ^{m}{i=1}\Sigma ^{(k/2)}{j=1}(w{ij}) notional points from the k/2 weighted cluster centres as well as the accumulated k weighted cluster centres of Megacluster (notionally representing all the points seen before this round).
  2. Set Megacluster to this new weighted clustering of k clusters.

Until stream is exhausted // Loop back to the repeat for another round.

STREAM only retains summary information with fixed upper bounds on size so it can effectively reduce the required storage space while running in something like O(nk log n) time. STREAM只保留大小有固定上限的摘要信息,因此它可以有效地减少所需的存储空间,同时以类似O(nk log n)的时间运行。