Classification builds models that describe interesting classes of data. The models are called classifiers because, once built, they can be used to classify unseen data. Sometimes the model itself is more important than its use in ongoing classification, because it provides a compact summary of the data that is explanatory for humans.
Most commonly classification is binary, that is, each object is determined either to belong to a class or not. For example, taxpayers are classified as fraudulent or not. However, the generalisation to classifying data into more than two classes is important.
Classification is often regarded as a machine-learning problem due to its origins in AI research, although data mining research has developed the scalability to handle large disk-stored data sets.
Nowadays it is widely applied to problems in science, marketing, fraud detection, performance prediction, medical diagnosis, and fault diagnosis.

Classification

  • Used to predict categorical class labels (discrete or nominal) from unlabelled data.
  • Constructs (or learns) a classifier (or model) from training data that includes, for each example in the data, data values as well as a pre-determined class label.
  • Uses the model to predict the class label for new, unseen, unlabelled data.
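The two steps above (learn a classifier from labelled training data, then use it to label unseen data) can be illustrated with a minimal sketch. It assumes a 1-nearest-neighbour classifier, chosen here only because it is short; the notes do not prescribe any particular method. The toy data, the feature values, and the names `learn` and `predict` are all hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

# Hypothetical training data: each example pairs data values
# with a pre-determined class label.
training_data = [
    ((1.0, 1.2), "not_fraud"),
    ((0.8, 1.0), "not_fraud"),
    ((1.1, 0.9), "not_fraud"),
    ((4.0, 4.2), "fraud"),
    ((4.3, 3.9), "fraud"),
    ((3.8, 4.1), "fraud"),
]

def learn(examples):
    """Build the model. For 1-nearest-neighbour (an assumption of this
    sketch) the model is simply the stored labelled examples."""
    return list(examples)

def predict(model, x):
    """Predict the class label of x as the label of its closest stored example."""
    _, label = min(model, key=lambda example: dist(example[0], x))
    return label

model = learn(training_data)

# New, unseen, unlabelled data.
unseen = [(0.9, 1.1), (4.1, 4.0)]
predictions = [predict(model, x) for x in unseen]
print(predictions)
```

Other classifiers differ in what `learn` produces (a tree, a set of rules, a set of weights), but the overall shape of training followed by prediction stays the same.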


Classification vs Prediction

Although classifiers predict the values of unknown class labels, classification is usually distinguished from the problem of numerical prediction (commonly called simply prediction) that builds models of continuous-valued functions and so predicts unknown or missing numeric values. We will also study some popular prediction techniques.
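As a rough illustration of the distinction, the sketch below fits two models to the same hypothetical input: a least-squares line that predicts a continuous value, and a one-feature threshold rule that predicts a categorical label. The data (hours studied, marks, pass/fail outcomes) and all function names are assumptions made for this example, and neither technique is claimed to be the one studied later.

```python
# Hypothetical data: one input feature, one numeric target, one class label.
hours = [1.0, 2.0, 3.0, 4.0, 5.0]
marks = [35.0, 45.0, 55.0, 65.0, 75.0]                # continuous-valued
outcomes = ["fail", "fail", "pass", "pass", "pass"]   # categorical

def fit_line(xs, ys):
    """Numerical prediction: ordinary least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return slope, mean_y - slope * mean_x

def fit_threshold(xs, labels, positive):
    """Classification: choose the cut-point that misclassifies the fewest
    training examples under the rule 'positive if x > cut'."""
    pts = sorted(xs)
    cuts = [(a + b) / 2 for a, b in zip(pts, pts[1:])]
    def errors(cut):
        return sum((x > cut) != (lab == positive) for x, lab in zip(xs, labels))
    return min(cuts, key=errors)

slope, intercept = fit_line(hours, marks)
cut = fit_threshold(hours, outcomes, positive="pass")

def predict_mark(h):
    """Returns a number."""
    return slope * h + intercept

def classify(h):
    """Returns a class label."""
    return "pass" if h > cut else "fail"

print(predict_mark(3.5))  # a numeric estimate
print(classify(3.5))      # a discrete label
```

For this toy data the line gives a mark of 60.0 for 3.5 hours, while the classifier only says "pass": the first model answers "how much", the second "which class".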


Supervised Learning vs Unsupervised Learning

Again, we see the AI influence in the language here. Supervised learning refers to classification as we have defined it: the training data (observations, measurements, etc.) is accompanied by labels indicating the class of each observation, and new data is classified based on the training set. In this AI-oriented view of classification we often talk about batch vs incremental learning. Batch learning is usually an unstated assumption in data mining. In incremental learning, the labelled data becomes available to the learning algorithm in a sequence, and a working classifier developed initially from a small amount of data must be continually updated to account for new data.
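The batch vs incremental distinction can be sketched with a nearest-centroid classifier, used here as an assumption because its model (one mean vector per class) is easy to update one example at a time. The class name `IncrementalCentroidClassifier`, the helper `batch_centroids`, and the toy stream are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

class IncrementalCentroidClassifier:
    """Keeps a running mean (centroid) per class, updated per labelled example."""

    def __init__(self):
        self.counts = {}
        self.centroids = {}

    def update(self, x, label):
        """Incremental step: fold one new labelled example into its class mean."""
        n = self.counts.get(label, 0) + 1
        old = self.centroids.get(label, [0.0] * len(x))
        self.centroids[label] = [m + (xi - m) / n for m, xi in zip(old, x)]
        self.counts[label] = n

    def predict(self, x):
        """Assign x to the class whose centroid is closest."""
        return min(self.centroids, key=lambda lab: dist(self.centroids[lab], x))

def batch_centroids(examples):
    """Batch learning: compute the class means from the whole data set at once."""
    grouped = {}
    for x, label in examples:
        grouped.setdefault(label, []).append(x)
    return {lab: [sum(col) / len(xs) for col in zip(*xs)]
            for lab, xs in grouped.items()}

# Hypothetical labelled data arriving in a sequence.
stream = [((1.0, 1.0), "a"), ((5.0, 5.0), "b"), ((2.0, 1.0), "a"),
          ((6.0, 5.0), "b"), ((1.0, 2.0), "a")]

clf = IncrementalCentroidClassifier()
early_prediction = None
for i, (x, label) in enumerate(stream):
    clf.update(x, label)
    if i == 1:
        # A working classifier already exists after only two examples.
        early_prediction = clf.predict((1.5, 1.5))

final_prediction = clf.predict((5.0, 6.0))
batch = batch_centroids(stream)
print(early_prediction, final_prediction)
```

For this particular model the incremental updates arrive at the same centroids as the batch computation (up to floating-point rounding). That is not true of incremental learners in general, where the order of arrival can affect the final model.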
On the other hand, in unsupervised learning there are no class labels in the training data, and the learning algorithm must find some interesting classes, or classifications, with which to classify new data. This is commonly called clustering. We will study some popular clustering techniques later.
So classification can also be defined as supervised learning of categorical variables.
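For contrast with the supervised case, the unsupervised setting described above can be sketched with k-means, one common clustering technique, assumed here for illustration rather than taken from these notes. The points carry no labels; the algorithm finds the groups itself, and new data is then assigned to the nearest group. The data, the fixed starting centroids (real implementations usually choose them randomly), and the names `kmeans` and `cluster_of` are hypothetical.

```python
from math import dist  # Euclidean distance, Python 3.8+

def kmeans(points, centroids, max_iter=100):
    """Plain k-means: alternate between assigning points to the nearest
    centroid and moving each centroid to the mean of its points."""
    centroids = [tuple(c) for c in centroids]
    assignments = []
    for _ in range(max_iter):
        assignments = [
            min(range(len(centroids)), key=lambda j: dist(p, centroids[j]))
            for p in points
        ]
        new_centroids = []
        for j, old in enumerate(centroids):
            members = [p for p, a in zip(points, assignments) if a == j]
            if members:
                new_centroids.append(
                    tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(old)  # keep an empty cluster's centroid
        if new_centroids == centroids:
            break  # converged: centroids no longer move
        centroids = new_centroids
    return centroids, assignments

# Hypothetical unlabelled training data.
points = [(1.0, 1.0), (1.2, 0.8), (0.8, 1.1),
          (5.0, 5.0), (5.2, 4.9), (4.8, 5.1)]

# Starting centroids fixed for reproducibility (an assumption of this sketch).
centroids, assignments = kmeans(points, centroids=[points[0], points[3]])

def cluster_of(x):
    """Classify new data by the index of its nearest discovered centroid."""
    return min(range(len(centroids)), key=lambda j: dist(x, centroids[j]))

print(assignments)
print(cluster_of((4.9, 5.0)))
```

The cluster indices 0 and 1 are arbitrary names invented by the algorithm. Unlike the class labels in supervised learning, they carry no pre-determined meaning.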