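Similarity measures
- Euclidean distance
$(x-y)^2$ (the squared difference; the distance itself is $\sqrt{\sum_i (x_i - y_i)^2}$)
- Cosine similarity
The more similar two sample points are, the closer the similarity coefficient is to 1; the less similar they are, the closer it is to 0 (for non-negative feature vectors; in general the value lies in $[-1, 1]$).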
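As a quick illustration of the two measures, here is a minimal sketch in plain Python. The helper names `euclidean_distance` and `cosine_similarity` and the sample vectors are illustrative assumptions, not part of the original notes.

```python
import math

def euclidean_distance(x, y):
    """Square root of the summed squared differences."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_similarity(x, y):
    """Dot product divided by the product of the vector norms."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)

# Hypothetical sample vectors
u = [1.0, 2.0, 3.0]
v = [2.0, 4.0, 6.0]   # same direction as u, twice the length
w = [3.0, 0.0, -1.0]  # orthogonal to u

print("Euclidean distance u-v:", euclidean_distance(u, v))
print("Cosine similarity u-v:", cosine_similarity(u, v))
print("Cosine similarity u-w:", cosine_similarity(u, w))
```

Here `u` and `v` are clearly apart in Euclidean terms, yet their cosine similarity is 1 (up to rounding), because cosine similarity ignores vector length and looks only at direction.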
Evaluation metrics
Model evaluation (clustering evaluation)
(1) Silhouette coefficient: $s(i) = \frac{b(i) - a(i)}{\max\{a(i), b(i)\}}$
The silhouette coefficient is one way to assess clustering quality. Here $a(i)$ is the mean distance from sample $i$ to the other samples in its own cluster, and $b(i)$ is the mean distance from sample $i$ to the samples of the nearest other cluster. The best value is 1 and the worst is -1. Values near 0 indicate overlapping clusters. Negative values generally indicate that a sample has been assigned to the wrong cluster, because a different cluster is more similar.
(2) Entropy: $Entropy(S) = Entropy(p_1, \ldots, p_n) = -\sum_{i=1}^{n} p_i \log_2(p_i)$
The lower the entropy, the purer the data; the higher the entropy, the more mixed the data.
Note: _when the entropy is 0, all samples have the same value of the target attribute._
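(3) Purity: $Purity = \sum_i \frac{m_i}{m} P_i$, where $P_i = \max_j(P_{ij})$. Here $P_{ij}$ is the fraction of cluster $i$ that belongs to category $j$, $m_i$ is the size of cluster $i$, and $m$ is the total number of samples. (The first formula is missing from the source; this size-weighted form matches the total purity computed in the code below.)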
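To make the silhouette definition in (1) concrete, the sketch below computes s(i) by hand for a tiny one-dimensional data set, taking a(i) as the mean distance to the other points in the same cluster and b(i) as the mean distance to the nearest other cluster. The data, the labels, and the helper `silhouette_of` are hypothetical.

```python
def silhouette_of(i, points, labels):
    """Silhouette value s(i) = (b - a) / max(a, b) for sample i (1-D points)."""
    own = labels[i]
    # a(i): mean distance to the other members of the same cluster
    same = [abs(points[i] - p)
            for k, (p, l) in enumerate(zip(points, labels))
            if l == own and k != i]
    if not same:  # singleton cluster: s(i) is conventionally 0
        return 0.0
    a = sum(same) / len(same)
    # b(i): smallest mean distance to the members of any other cluster
    b = min(
        sum(abs(points[i] - p) for p, l in zip(points, labels) if l == other)
        / labels.count(other)
        for other in set(labels) if other != own
    )
    return (b - a) / max(a, b)

# Hypothetical toy data: two well-separated clusters on a line
points = [1.0, 2.0, 10.0, 11.0]
labels = [0, 0, 1, 1]

scores = [silhouette_of(i, points, labels) for i in range(len(points))]
print("s(i) per sample:", scores)
print("mean silhouette:", sum(scores) / len(scores))
```

Because the two clusters are far apart relative to their internal spread, every s(i) is close to 1.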
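# Silhouette coefficient
from sklearn.metrics import silhouette_score
# weight (feature matrix) and KM_model (fitted clustering model) are assumed to be defined elsewhere
score = silhouette_score(weight, KM_model.labels_, metric='euclidean')
print("Silhouette coefficient:", score)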
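# Purity
# result: cluster-by-category count table (rows = clusters), assumed to be built elsewhere
true = result.max(axis=1)    # size of the majority category in each cluster
total = result.sum(axis=1)   # size of each cluster
accuracy = true / total
print("Purity of each cluster:\n", accuracy)
# print("Total purity: %.2f" % (accuracy.sum(axis=0) / 3))
print("Total purity: %.2f" % (true.sum() / total.sum()))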
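# Entropy
import math
import numpy as np
# number of true categories (renamed so the built-in len is not shadowed)
n_categories = len(df_news['category'].unique())
result = np.array(result)
n_clusters = result.shape[0]   # the original hard-coded 27 clusters here
e_i = [0.0] * n_clusters       # entropy of each cluster
m = result.sum()               # total number of samples
print("Entropy of each cluster:")
for i in range(n_clusters):
    for j in range(n_categories):
        # the small constant avoids log(0) for empty cells
        p_i_j = result[i][j] / result[i].sum() + 0.00000001
        e_i[i] -= p_i_j * math.log(p_i_j, 2)
    print("Cluster", i, ":", e_i[i])
# overall entropy: cluster entropies weighted by cluster size
e = 0.0
for i in range(n_clusters):
    e += result[i].sum() / m * e_i[i]
print("Overall entropy: %.6f" % e)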
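The purity and entropy snippets above rely on objects built elsewhere (`result`, `df_news`). As a self-contained check, the sketch below applies the same formulas to a small hypothetical cluster-by-category count table. The table values and the names `table` and `cluster_entropy` are assumptions for illustration, and empty cells are skipped rather than padded with a small constant.

```python
import math

# Hypothetical count table: rows are clusters, columns are true categories
table = [
    [8, 2, 0],
    [1, 9, 0],
    [0, 0, 10],
]

def cluster_entropy(row):
    """Entropy (base 2) of one cluster's category distribution."""
    total = sum(row)
    # empty cells are skipped, since p * log2(p) tends to 0 as p tends to 0
    return sum(-(c / total) * math.log2(c / total) for c in row if c > 0)

sizes = [sum(row) for row in table]
m = sum(sizes)  # total number of samples

# Purity: share of samples that belong to the majority category of their cluster
purity_per_cluster = [max(row) / size for row, size in zip(table, sizes)]
total_purity = sum(max(row) for row in table) / m

# Overall entropy: cluster entropies weighted by cluster size
entropies = [cluster_entropy(row) for row in table]
total_entropy = sum(size / m * e for size, e in zip(sizes, entropies))

print("purity per cluster:", purity_per_cluster)
print("total purity: %.2f" % total_purity)
print("entropy per cluster:", entropies)
print("overall entropy: %.6f" % total_entropy)
```

The third cluster contains a single category, so its purity is 1 and its entropy is 0, which matches the note that zero entropy means all samples share the same target value.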
1 Classification metric calculation examples
## accuracy
import numpy as np
from sklearn.metrics import accuracy_score
y_pred = [0, 1, 0, 1]
y_true = [0, 1, 1, 1]
print('ACC:',accuracy_score(y_true, y_pred))
## Precision,Recall,F1-score
from sklearn import metrics
y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]
print('Precision',metrics.precision_score(y_true, y_pred))
print('Recall',metrics.recall_score(y_true, y_pred))
print('F1-score:',metrics.f1_score(y_true, y_pred))
## AUC
import numpy as np
from sklearn.metrics import roc_auc_score
y_true = np.array([0, 0, 1, 1])
y_scores = np.array([0.1, 0.4, 0.35, 0.8])
print('AUC score:',roc_auc_score(y_true, y_scores))
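To show what the library calls in this section compute, the following sketch derives accuracy, precision, recall and F1 from raw confusion counts, using the same labels as the Precision/Recall example. The helper `confusion_counts` is a hypothetical name, and class 1 is assumed to be the positive class.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Return (tp, fp, fn, tn) for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn

y_pred = [0, 1, 0, 0]
y_true = [0, 1, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)

# This sketch assumes at least one predicted positive and one actual positive,
# otherwise the divisions below would fail.
accuracy = (tp + tn) / (tp + fp + fn + tn)
precision = tp / (tp + fp)   # of the predicted positives, how many are correct
recall = tp / (tp + fn)      # of the actual positives, how many were found
f1 = 2 * precision * recall / (precision + recall)

print("ACC:", accuracy)
print("Precision:", precision)
print("Recall:", recall)
print("F1-score:", f1)
```

With one true positive, no false positives and one false negative, precision is perfect while recall is only one half, and F1 (their harmonic mean) sits between the two.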
2 Regression metric calculation examples
# coding=utf-8
import numpy as np
from sklearn import metrics
# MAPE implemented by hand (scikit-learn 0.24+ also provides metrics.mean_absolute_percentage_error)
def mape(y_true, y_pred):
    return np.mean(np.abs((y_pred - y_true) / y_true))
y_true = np.array([1.0, 5.0, 4.0, 3.0, 2.0, 5.0, -3.0])
y_pred = np.array([1.0, 4.5, 3.8, 3.2, 3.0, 4.8, -2.2])
# MSE
print('MSE:',metrics.mean_squared_error(y_true, y_pred))
# RMSE
print('RMSE:',np.sqrt(metrics.mean_squared_error(y_true, y_pred)))
# MAE (the metric used in this competition)
print('MAE:',metrics.mean_absolute_error(y_true, y_pred))
# MAPE
print('MAPE:',mape(y_true, y_pred))
## R2-score
from sklearn.metrics import r2_score
y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]
print('R2-score:',r2_score(y_true, y_pred))
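For reference, the regression metrics in this section can also be written out directly from their definitions. The sketch below is a plain-Python illustration using the same `y_true` and `y_pred` as the R2 example; the function names `mse`, `mae` and `r2` are hypothetical.

```python
import math

def mse(y_true, y_pred):
    """Mean squared error."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)

def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)

def r2(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1 - ss_res / ss_tot

y_true = [3, -0.5, 2, 7]
y_pred = [2.5, 0.0, 2, 8]

print("MSE:", mse(y_true, y_pred))
print("RMSE:", math.sqrt(mse(y_true, y_pred)))
print("MAE:", mae(y_true, y_pred))
print("R2-score:", r2(y_true, y_pred))
```

R2 compares the residual sum of squares with the variance of the targets around their mean, so a model that always predicts the mean scores 0 and a perfect model scores 1.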
Loss functions
Squared loss: $L(x) = \frac{1}{N} \sum_{i=1}^{N} (y_i - t_i)^2$
Cross-entropy loss: $H(p, q) = -\sum_{x} p(x) \log(q(x))$
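A minimal sketch of the two losses in plain Python, assuming `y` holds predictions, `t` holds targets, and `p`, `q` are discrete probability distributions over the same outcomes. The function names, the sample values, and the choice of the natural logarithm are illustrative assumptions.

```python
import math

def squared_loss(y, t):
    """Mean of the squared differences between predictions y and targets t."""
    return sum((yi - ti) ** 2 for yi, ti in zip(y, t)) / len(y)

def cross_entropy(p, q):
    """H(p, q) = -sum_x p(x) * log(q(x)); terms with p(x) == 0 contribute 0."""
    return -sum(px * math.log(qx) for px, qx in zip(p, q) if px > 0)

# Hypothetical values
y = [2.5, 0.0, 2.0]
t = [3.0, -0.5, 2.0]
print("squared loss:", squared_loss(y, t))

p = [1.0, 0.0, 0.0]   # one-hot true distribution
q = [0.7, 0.2, 0.1]   # predicted distribution
print("cross-entropy:", cross_entropy(p, q))
```

With a one-hot `p`, the cross-entropy reduces to the negative log of the probability that `q` assigns to the true class, so it shrinks toward 0 as that probability approaches 1.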