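1. What is anomaly detection
Anomaly detection (also called outlier detection), as the name suggests, identifies data that differ from normal data, that is, data that deviate strongly from expected behavior. Typical problems include credit card fraud, abnormal events in industrial production, and anomalies in network traffic (network intrusion). In all of these, the target is a small minority of events.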
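1.1 Types of anomalies
1.2 Categories of anomaly detection tasks
1.3 Anomaly detection scenarios
- Fault detection
- IoT anomaly detection
- Fraud detection
- Industrial anomaly detection
- Time series anomaly detection
- Video anomaly detection
- Log anomaly detection
- Medical anomaly detection
- Network intrusion detection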
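2. Common anomaly detection methods
PCA: a worked mathematical example
PCA (Principal Component Analysis) is a commonly used data analysis method. It applies a linear transformation that maps the original data to a representation whose dimensions are linearly uncorrelated. This can be used to extract the principal feature components of the data, and it is often used to reduce the dimensionality of high-dimensional data. Source of the detailed worked example: http://blog.codinglabs.org/articles/pca-tutorial.html
An application example of PCA: https://blog.csdn.net/qq_15111861/article/details/94185363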
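One common way to use PCA for anomaly detection is to project the data onto the top principal components, reconstruct it, and treat a large reconstruction error as a sign of abnormality. The following is a minimal sketch under that assumption, not the method of the linked articles. The function name `pca_reconstruction_error` and the synthetic data are illustrative.

```python
import numpy as np


def pca_reconstruction_error(X, n_components):
    """Squared reconstruction error of each row after projecting
    onto the top `n_components` principal directions."""
    Xc = X - X.mean(axis=0)                    # center each feature
    # Rows of Vt are the principal directions of the centered data.
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    W = Vt[:n_components].T                    # shape (n_features, n_components)
    X_rec = (Xc @ W) @ W.T                     # project, then map back
    return np.sum((Xc - X_rec) ** 2, axis=1)


rng = np.random.default_rng(0)
t = rng.normal(size=200)
# 200 points close to the line y = 2x ...
X = np.column_stack([t, 2 * t + 0.05 * rng.normal(size=200)])
# ... plus one point far away from that line (index 200).
X = np.vstack([X, [0.0, 5.0]])

errors = pca_reconstruction_error(X, n_components=1)
outlier_index = int(np.argmax(errors))
print("index with the largest reconstruction error:", outlier_index)
```

Points that lie along the dominant direction are reconstructed almost perfectly, so only the point off the line gets a large error.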
Similarities and differences between DBSCAN and LOF: http://blog.sina.com.cn/s/blog_ab089a840102ylin.html
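To make the comparison concrete, here is a small sketch that runs both methods on the same data with scikit-learn. It is an illustration under assumed parameters (`eps=0.5`, `min_samples=5`, `n_neighbors=20`, synthetic data), not a summary of the linked article. DBSCAN is a clustering algorithm that gives points it cannot assign to any cluster the noise label -1, while LOF gives every point a continuous outlier score.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.neighbors import LocalOutlierFactor

rng = np.random.default_rng(0)
cluster = rng.normal(loc=0.0, scale=0.3, size=(100, 2))  # one dense cluster
far_point = np.array([[5.0, 5.0]])                       # isolated point, index 100
X = np.vstack([cluster, far_point])

# DBSCAN: points that belong to no cluster are labeled -1 (noise).
db_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X)

# LOF: fit_predict returns -1 for outliers and 1 for inliers;
# negative_outlier_factor_ holds the negated LOF score of each point.
lof = LocalOutlierFactor(n_neighbors=20)
lof_labels = lof.fit_predict(X)
lof_scores = -lof.negative_outlier_factor_   # larger = more abnormal

print("DBSCAN label of the far point:", db_labels[-1])
print("LOF label of the far point:", lof_labels[-1])
print("index with the highest LOF score:", int(np.argmax(lof_scores)))
```

Both methods are density based, but DBSCAN only produces a hard noise/cluster decision, whereas the LOF scores can be ranked or thresholded.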
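3. Common open-source libraries for anomaly detection
Scikit-learn
Scikit-learn is an open-source machine learning library for Python. It provides a wide range of classification, regression, and clustering algorithms, and it also includes some anomaly detection algorithms, such as LOF (Local Outlier Factor) and Isolation Forest.
Website: https://scikit-learn.org/stable/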
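As a minimal sketch of what this looks like in practice, the following fits scikit-learn's `IsolationForest` on synthetic data. The data and the parameter values are assumptions chosen for illustration. `LocalOutlierFactor` in `sklearn.neighbors` follows the same convention of returning -1 for outliers and 1 for inliers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(0)
inliers = rng.normal(loc=0.0, scale=0.3, size=(200, 2))  # dense normal data
far_point = np.array([[6.0, 6.0]])                       # isolated point, index 200
X = np.vstack([inliers, far_point])

forest = IsolationForest(n_estimators=100, random_state=0)
forest.fit(X)

labels = forest.predict(X)        # 1 = inlier, -1 = outlier
scores = forest.score_samples(X)  # lower = more abnormal
most_abnormal = int(np.argmin(scores))
print("most abnormal index:", most_abnormal)
print("its predicted label:", labels[most_abnormal])
```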
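PyOD
Python Outlier Detection (PyOD) is currently the most popular Python toolkit for anomaly detection. Its main highlights include:
- Nearly 20 common anomaly detection algorithms, from the classic LOF/LOCI/ABOD to more recent deep-learning approaches such as generative adversarial networks (GAN), as well as outlier ensembles.
- Support for different Python versions, including 2.7 and 3.5+, and for multiple operating systems: Windows, macOS, and Linux.
- A simple, consistent API: anomaly detection takes only a few lines of code, which makes it easy to evaluate a large number of algorithms.
- Optimization with JIT compilation and parallelization, which speeds up the algorithms and improves scalability, so that large amounts of data can be processed.
Website: https://pyod.readthedocs.io/en/latest/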
Exercise
Learn the basic operations of the pyod library.
from __future__ import division
from __future__ import print_function

import os
import sys

# temporary solution for relative imports in case pyod is not installed
# if pyod is installed, no need to use the following line
sys.path.append(
    os.path.abspath(os.path.join(os.path.dirname("__file__"), '..')))

from pyod.models.lof import LOF
from pyod.utils.data import generate_data
from pyod.utils.data import evaluate_print
from pyod.utils.example import visualize

if __name__ == "__main__":
    contamination = 0.1  # fraction of outliers
    n_train = 200  # number of training points
    n_test = 100  # number of testing points

    # Generate sample data
    # Note: recent PyOD releases return (X_train, X_test, y_train, y_test);
    # older releases returned (X_train, y_train, X_test, y_test).
    # Check the order against the installed version.
    X_train, X_test, y_train, y_test = \
        generate_data(n_train=n_train,
                      n_test=n_test,
                      n_features=2,
                      contamination=contamination,
                      random_state=42)

    # train LOF detector
    clf_name = 'LOF'
    clf = LOF()
    clf.fit(X_train)

    # get the prediction labels and outlier scores of the training data
    y_train_pred = clf.labels_  # binary labels (0: inliers, 1: outliers)
    y_train_scores = clf.decision_scores_  # raw outlier scores

    # get the prediction on the test data
    y_test_pred = clf.predict(X_test)  # outlier labels (0 or 1)
    y_test_scores = clf.decision_function(X_test)  # outlier scores

    # evaluate and print the results
    print("\nOn Training Data:")
    evaluate_print(clf_name, y_train, y_train_scores)
    print("\nOn Test Data:")
    evaluate_print(clf_name, y_test, y_test_scores)

    # visualize the results
    visualize(clf_name, X_train, y_train, X_test, y_test, y_train_pred,
              y_test_pred, show_figure=True, save_figure=False)
Illustration of normalization (mean centering)
import numpy as np
import matplotlib.pyplot as plt

# 20 random integers in [1, 100) as sample data
y = np.random.randint(1, 100, 20)
x = np.arange(1, 21)
mean = np.mean(y)

plt.figure(figsize=(10, 5))
data, = plt.plot(x, y)                                  # original data
meanLine, = plt.plot(x, [mean] * 20, linestyle='--')    # mean of the data
translationLine, = plt.plot(x, y - mean)                # data shifted by its mean
centringLine, = plt.plot(x, [np.mean(y - mean)] * 20)   # mean after centering (about 0)
plt.legend([data, meanLine, translationLine, centringLine],
           ["data", "mean", "y-mean", "centring"], loc='upper left')
plt.show()
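The plot above shows centering only: subtracting the mean moves the curve so that its average is zero. A small numeric check makes this explicit. The division by the standard deviation (z-score standardization) is added here as a common follow-up step and is an assumption, not part of the plotted example.

```python
import numpy as np

y = np.array([2.0, 4.0, 6.0, 8.0])

centered = y - y.mean()            # subtract the mean (5.0): mean becomes 0
standardized = centered / y.std()  # also divide by the std: variance becomes 1

print("centered:", centered)
print("mean after centering:", centered.mean())
print("std after standardizing:", standardized.std())
```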