Text mining problems - Document Clustering - Machine Learning

Motivation
Clustering process

Motivation

• Automatically group related documents based on their contents
• No predetermined training sets or taxonomies
• Generate a taxonomy at runtime

Clustering process

• Data preprocessing: lexical analysis, stop-word removal, stemming, feature extraction, etc.
• Document vectors are very high-dimensional, so they need to be projected to a lower-dimensional space using spectral clustering, mixture model clustering, Latent Semantic Indexing or Locality Preserving Indexing.
• Hierarchical clustering: compute similarities between documents, then apply a clustering algorithm; or
• Model-based clustering (neural network approach): clusters are represented by “exemplars” (for example, Self-Organising Maps, SOM)
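
The preprocessing and hierarchical clustering steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration under stated assumptions, not a reference implementation: the stop-word list, the TF-IDF weighting, the cosine similarity measure, the average-link merge rule and all function names are choices made for this example only. Stemming and the dimensionality-reduction step are left out to keep it short.

```python
import math
import re
from collections import Counter

# Tiny illustrative stop-word list (an assumption; real lists are much longer).
STOP_WORDS = {"the", "a", "an", "of", "and", "in", "on", "is", "are", "to", "for"}

def tokenize(text):
    """Lexical analysis plus stop-word removal (stemming is omitted here)."""
    words = re.findall(r"[a-z]+", text.lower())
    return [w for w in words if w not in STOP_WORDS]

def tfidf_vectors(docs):
    """Turn each document into a sparse TF-IDF vector (dict: term -> weight)."""
    token_lists = [tokenize(d) for d in docs]
    doc_freq = Counter(term for tokens in token_lists for term in set(tokens))
    n_docs = len(docs)
    vectors = []
    for tokens in token_lists:
        counts = Counter(tokens)
        vectors.append({
            # Smoothed IDF, so a term found in every document keeps a non-zero weight.
            term: (count / len(tokens))
            * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
            for term, count in counts.items()
        })
    return vectors

def cosine(u, v):
    """Cosine similarity between two sparse vectors."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def agglomerative(vectors, n_clusters):
    """Average-link hierarchical clustering, stopped at n_clusters groups."""
    clusters = [[i] for i in range(len(vectors))]
    sim = [[cosine(u, v) for v in vectors] for u in vectors]
    while len(clusters) > n_clusters:
        best = None
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                pairs = [sim[i][j] for i in clusters[a] for j in clusters[b]]
                score = sum(pairs) / len(pairs)
                if best is None or score > best[0]:
                    best = (score, a, b)
        _, a, b = best
        clusters[a] = clusters[a] + clusters[b]  # merge the most similar pair
        del clusters[b]
    return clusters

docs = [
    "The cat sat on the mat",
    "A cat and a kitten play on the mat",
    "Stock markets fell as investors sold shares",
    "Investors bought shares as stock markets rose",
]
clusters = agglomerative(tfidf_vectors(docs), 2)
print(clusters)  # lists of document indices, one list per cluster
```

On this toy corpus the two pet-related documents share no terms with the two finance-related ones, so the cross-group similarity is zero and each pair ends up in its own cluster. No training set or taxonomy is supplied; the grouping emerges from the document contents alone.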