Text mining problems - Text Classification - 《机器学习》

Motivation

• Automatic classification of the large number of online text documents (Web pages, e-mails, corporate intranet documents, etc.)

Classification process

• Data pre-processing
• Definition of the training and test sets
• Creation of the classification model using the selected classification algorithm
• Validation of the classification model
• Classification of new/unknown text documents
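
The steps above can be sketched end to end in a few lines of Python. This is a minimal illustration, not a reference implementation: the toy corpus, the hand-made train/test split, the helper names (`tokenize`, `train_naive_bayes`, `classify`, `accuracy`) and the choice of a multinomial Naïve Bayes model with Laplace smoothing are all assumptions made for the example.

```python
import math
import re
from collections import Counter, defaultdict


def tokenize(text):
    """Pre-processing: lowercase the text and split it into word tokens."""
    return re.findall(r"[a-z]+", text.lower())


def train_naive_bayes(docs):
    """Model creation: collect per-class counts from (text, label) pairs."""
    class_docs = Counter()              # number of documents per class
    word_counts = defaultdict(Counter)  # word frequencies per class
    vocab = set()
    for text, label in docs:
        class_docs[label] += 1
        for tok in tokenize(text):
            word_counts[label][tok] += 1
            vocab.add(tok)
    return class_docs, word_counts, vocab


def classify(model, text):
    """Return the class with the highest log-posterior (Laplace smoothing)."""
    class_docs, word_counts, vocab = model
    total_docs = sum(class_docs.values())
    best_label, best_score = None, -math.inf
    for label in sorted(class_docs):
        score = math.log(class_docs[label] / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for tok in tokenize(text):
            if tok in vocab:  # ignore words never seen in training
                score += math.log((word_counts[label][tok] + 1) / denom)
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def accuracy(model, docs):
    """Validation: fraction of held-out documents classified correctly."""
    hits = sum(classify(model, text) == label for text, label in docs)
    return hits / len(docs)


# Hypothetical toy corpus, split by hand into training and test sets.
train_set = [
    ("cheap pills buy now", "spam"),
    ("win money now", "spam"),
    ("meeting agenda for monday", "ham"),
    ("project report attached", "ham"),
]
test_set = [
    ("buy cheap money", "spam"),
    ("monday project meeting", "ham"),
]

model = train_naive_bayes(train_set)             # create the model
test_accuracy = accuracy(model, test_set)        # validate it
new_label = classify(model, "win cheap pills")   # classify a new document
```

On a corpus this small the accuracy figure carries no statistical weight; the point is only the order of the steps. In practice the test set would be much larger and drawn at random.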

Text document classification differs from the classification of relational data

• Document databases are not structured according to attribute-value pairs
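
One common way to bridge this gap, shown here only as an assumed illustration, is a bag-of-words representation: each distinct term becomes an attribute and each document becomes a row of term counts. The function name `bag_of_words` and the two sample sentences are hypothetical.

```python
import re
from collections import Counter


def bag_of_words(documents):
    """Map free-text documents to fixed-length term-count vectors."""
    tokenized = [re.findall(r"[a-z]+", doc.lower()) for doc in documents]
    # The sorted vocabulary plays the role of the attribute list
    # that a relational table would provide up front.
    vocabulary = sorted({tok for toks in tokenized for tok in toks})
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)
        vectors.append([counts[term] for term in vocabulary])
    return vocabulary, vectors


docs = ["The cat sat.", "The dog sat on the cat."]
vocabulary, vectors = bag_of_words(docs)
# vocabulary -> ['cat', 'dog', 'on', 'sat', 'the']
# vectors    -> [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

The resulting vectors discard word order and grow with the vocabulary, which is one reason text classification is usually treated separately from classification over a fixed relational schema.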

Class labels (categories) may be developed by hand

• Given: predefined classes (categories) and labeled documents (examples)
• Categories may form a hierarchy (taxonomy)
• Task: classify new documents
• This is a standard classification (supervised learning) problem

Classification algorithms that are used:

• Support vector machines
• K-nearest neighbors
• Naïve Bayes
• Neural networks
• Decision trees
• Association rule-based classifiers
• Boosting
• and others
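
To make one entry of the list concrete, here is a rough sketch of k-nearest-neighbor classification over term-count vectors. The use of cosine similarity, the default `k=3`, the helper names and the six sample documents are assumptions chosen for the example, not something prescribed by the notes above.

```python
import math
import re
from collections import Counter


def vectorize(text):
    """Represent a document as a sparse term-count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def knn_classify(labeled_docs, text, k=3):
    """Majority vote among the k training documents most similar to text."""
    query = vectorize(text)
    ranked = sorted(
        labeled_docs,
        key=lambda pair: cosine(vectorize(pair[0]), query),
        reverse=True,
    )
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Hypothetical labeled examples.
labeled_docs = [
    ("stock market prices fall", "finance"),
    ("bank raises interest rates", "finance"),
    ("market rates and bank stock", "finance"),
    ("team wins the final match", "sports"),
    ("coach praises the team", "sports"),
    ("match ends in a draw", "sports"),
]
predicted = knn_classify(labeled_docs, "the team lost the match")
```

Unlike the model-based methods in the list, k-nearest neighbors has no training step: all the work happens at classification time, so cost grows with the number of labeled documents.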

Here are some methods from the literature used for such classification.