1. Prerequisites and Preparation

Anaconda3 installation tutorial:

https://zhuanlan.zhihu.com/p/75717350

How to resolve conflicts between a newly installed Anaconda3 and an existing Python installation:

https://blog.csdn.net/u013880679/article/details/80318826

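Installing xgboost
Download page: https://www.lfd.uci.edu/~gohlke/pythonlibs/#xgboost
1. Download the xgboost wheel that matches your Python version. Mine is 3.6.3, so I chose the cp36 build. Installing a wheel built for a different version fails with an error such as: xgboost-0.90-cp37-cp37m-win_amd64.whl is not a supported wheel on this platform.

2. Install it from cmd with pip install, followed by the path to the downloaded .whl file.
3. Run conda list to check that the installation succeeded.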
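The version check in the steps above can also be done from Python. The sketch below derives the CPython wheel tag (such as cp36) for an interpreter version, reads the Python tag out of a wheel filename, and checks whether a package is importable. The helper names are made up for illustration, and the parsing assumes the standard wheel naming scheme; it compares only the Python tag, not the ABI or platform tags.

```python
import importlib.util
import sys


def current_python_tag(version_info=None):
    """Return the CPython wheel tag for an interpreter version, e.g. 'cp36'."""
    if version_info is None:
        version_info = sys.version_info
    return "cp{}{}".format(version_info[0], version_info[1])


def wheel_python_tag(filename):
    """Extract the Python tag from a wheel filename.

    Wheel names end in <python tag>-<abi tag>-<platform tag>.whl,
    so the Python tag is the third field from the end.
    """
    stem = filename[:-len(".whl")] if filename.endswith(".whl") else filename
    return stem.split("-")[-3]


def is_installed(package_name):
    """Return True if the import system can locate the package."""
    return importlib.util.find_spec(package_name) is not None


if __name__ == "__main__":
    wheel = "xgboost-0.90-cp37-cp37m-win_amd64.whl"
    print("interpreter tag:", current_python_tag())
    print("wheel tag:", wheel_python_tag(wheel))
    print("xgboost installed:", is_installed("xgboost"))
```

If the two tags differ, the wheel was built for another Python version and pip will reject it with the "not a supported wheel" error quoted above.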

2. Machine Learning Concepts and Python Machine Learning Libraries

Data-related concepts

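The training set is like the practice problems you work through while preparing for the college entrance exam (gaokao), the validation set is like the mock exams, and the test set is the real exam paper.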
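To make the three roles concrete, here is a minimal sketch that shuffles a dataset and splits it into training, validation, and test parts. The function name, the 60/20/20 ratios, and the fixed seed are assumptions chosen for illustration; real projects often use a library helper instead.

```python
import random


def split_dataset(samples, val_ratio=0.2, test_ratio=0.2, seed=0):
    """Shuffle samples and split them into train, validation and test lists."""
    shuffled = list(samples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_ratio)
    n_val = int(len(shuffled) * val_ratio)
    test = shuffled[:n_test]
    val = shuffled[n_test:n_test + n_val]
    train = shuffled[n_test + n_val:]
    return train, val, test


if __name__ == "__main__":
    train, val, test = split_dataset(range(100))
    print(len(train), len(val), len(test))
```

The model is fitted on the training part, tuned against the validation part, and scored once on the test part, which it must never see during training.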

Method-related concepts


Evaluation metrics

Exercise: computing F1 and AUC in Python
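For reference, the quantities this exercise computes can be written as follows. These are the standard definitions, stated here as an aid rather than taken from the original figures: TP, FP, and FN are confusion-matrix counts at the chosen threshold, M and N are the numbers of positive and negative samples, and rank_i is the rank of sample i when scores are sorted in ascending order (tied scores share their average rank).

```latex
P = \frac{TP}{TP + FP}, \qquad R = \frac{TP}{TP + FN}

F_\beta = \frac{(1 + \beta^2)\, P R}{\beta^2 P + R}

\mathrm{AUC} = \frac{\sum_{i \in \text{positives}} \mathrm{rank}_i - \frac{M(M+1)}{2}}{M N}
```

With beta = 1 the second formula reduces to the usual F1 score, 2PR / (P + R).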

# -*- coding: utf-8 -*-
import pandas as pd


class Score():
    def __init__(self, pre_score, rel_label, threshold, beta):
        # Confusion-matrix counters
        self.tn = 0
        self.fn = 0
        self.fp = 0
        self.tp = 0
        self.pre_score = pre_score    # predicted scores, expected in [0, 1]
        self.rel_label = rel_label    # true labels (0 or 1)
        self.threshold = threshold    # scores >= threshold count as positive
        self.beta = beta              # weight of recall in F-beta (1 gives F1)
        # Fill the counters once, at construction time
        for pre, rel in zip(self.pre_score, self.rel_label):
            self.__getCM_count(pre, rel)

    def __getCM(self, pre, rel):
        # Classify a single prediction as TN, FN, FP or TP
        if pre < self.threshold:
            if rel == 0: return 'TN'
            if rel == 1: return 'FN'
        if pre >= self.threshold:
            if rel == 0: return 'FP'
            if rel == 1: return 'TP'

    def get_cm(self):
        # Per-sample confusion-matrix labels
        return [self.__getCM(pre, rel)
                for pre, rel in zip(self.pre_score, self.rel_label)]

    def __getCM_count(self, pre, rel):
        if pre < self.threshold:
            if rel == 0: self.tn += 1
            if rel == 1: self.fn += 1
        if pre >= self.threshold:
            if rel == 0: self.fp += 1
            if rel == 1: self.tp += 1

    def get_f1(self):
        # With no true positives, precision and recall are 0 or undefined;
        # return 0.0 instead of dividing by zero.
        if self.tp == 0:
            return 0.0
        P = self.tp / (self.tp + self.fp)
        R = self.tp / (self.tp + self.fn)
        return (self.beta * self.beta + 1) * P * R / (self.beta * self.beta * P + R)

    # Method 2: histogram counting; precision is the number of score bins
    def get_auc_by_count(self, precision=100):
        # Number of positive samples
        postive_len = sum(self.rel_label)
        # Number of negative samples
        negative_len = len(self.rel_label) - postive_len
        # Total number of (positive, negative) pairs to compare
        total_case = postive_len * negative_len
        # Per-bin counters for positive scores, initialised to 0
        pos_histogram = [0 for _ in range(precision + 1)]
        # Per-bin counters for negative scores, initialised to 0
        neg_histogram = [0 for _ in range(precision + 1)]
        # Map each score to a bin index
        bin_width = 1.0 / precision
        for i in range(len(self.rel_label)):
            nth_bin = int(self.pre_score[i] / bin_width)
            if self.rel_label[i] == 1:
                pos_histogram[nth_bin] += 1
            else:
                neg_histogram[nth_bin] += 1
        accumulated_neg = 0
        satisfied_pair = 0
        for i in range(precision + 1):
            # Positives beat every negative in a lower bin; pairs that fall
            # in the same bin count as half.
            satisfied_pair += (pos_histogram[i] * accumulated_neg
                               + pos_histogram[i] * neg_histogram[i] * 0.5)
            accumulated_neg += neg_histogram[i]
        return satisfied_pair / float(total_case)

    # Method 3: rank-based formula
    def get_auc_by_rank(self):
        # Combine scores and labels, then sort by score in descending order
        df = pd.DataFrame({'pre_score': self.pre_score, 'rel_label': self.rel_label})
        df = df.sort_values(by='pre_score', ascending=False).reset_index(drop=True)
        # n: total samples, M: positives, N: negatives
        n = len(df)
        M = len(df[df['rel_label'] == 1])
        N = n - M
        # rank accumulates the ranks of positives; rank_tmp, count_all and
        # count_p track a run of tied scores.
        rank = 0.0
        rank_tmp, count_all, count_p = 0.0, 0, 0
        # Append a sentinel row so that i + 1 never runs out of bounds. Its
        # score is -inf so it cannot tie with a real score (the original used
        # 0, which would tie with a real score of 0).
        df.loc[n] = [float('-inf'), 0]
        # Single pass over the real rows
        for i in range(n):
            # If row i + 1 has a different score, row i is either a lone
            # value or the last row of a run of ties.
            if df['pre_score'][i + 1] != df['pre_score'][i]:
                if df['rel_label'][i] == 1:
                    # Positive sample. A non-zero count means a run of ties
                    # has just ended.
                    if count_all != 0:
                        # Add the run's rank sum, weighted by its share of
                        # positives; row i itself was not yet counted, hence
                        # the + (n - i) and the + 1 terms.
                        rank += (rank_tmp + n - i) * (count_p + 1) / (count_all + 1)
                        rank_tmp, count_all, count_p = 0.0, 0, 0
                        continue
                    rank += (n - i)
                else:
                    if count_all != 0:
                        rank += (rank_tmp + n - i) * count_p / (count_all + 1)
                        rank_tmp, count_all, count_p = 0.0, 0, 0
                        continue
            else:
                # Row i ties with row i + 1: keep accumulating the run
                rank_tmp += (n - i)
                count_all += 1
                if df['rel_label'][i] == 1:
                    count_p += 1
        return (rank - M * (1 + M) / 2) / (M * N)


if __name__ == '__main__':
    learn_data_L2 = [0.2, 0.3, 0.4, 0.35, 0.6, 0.55, 0.2, 0.57, 0.3, 0.15,
                     0.77, 0.33, 0.9, 0.49, 0.45, 0.41, 0.66, 0.43, 0.7, 0.4]
    learn_data_R2 = [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1]
    # Put the data into a DataFrame, a table similar to an Excel sheet
    learn_data2 = pd.DataFrame({'Learn': learn_data_L2, 'Real': learn_data_R2})
    score2 = Score(learn_data2['Learn'], learn_data2['Real'], 0.5, 1)
    print(score2.get_cm())
    print(score2.get_f1())
    print(score2.get_auc_by_count())
    print(score2.get_auc_by_rank())
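A simple way to cross-check the two AUC methods above is to count pairs directly. The sketch below is not part of the original exercise: it uses the pairwise interpretation of AUC (the fraction of positive/negative pairs in which the positive sample scores higher, with ties counted as half), and the function name is made up for illustration. It is quadratic in the number of samples, so it only suits small inputs.

```python
def auc_by_pairs(scores, labels):
    """AUC as the fraction of (positive, negative) pairs ranked correctly.

    A tie between a positive and a negative score counts as half a pair.
    """
    positives = [s for s, y in zip(scores, labels) if y == 1]
    negatives = [s for s, y in zip(scores, labels) if y == 0]
    if not positives or not negatives:
        raise ValueError("AUC needs at least one positive and one negative sample")
    satisfied = 0.0
    for p in positives:
        for q in negatives:
            if p > q:
                satisfied += 1.0
            elif p == q:
                satisfied += 0.5
    return satisfied / (len(positives) * len(negatives))


if __name__ == "__main__":
    scores = [0.1, 0.4, 0.35, 0.8]
    labels = [0, 0, 1, 1]
    # 3 of the 4 positive/negative pairs are ordered correctly.
    print(auc_by_pairs(scores, labels))  # 0.75
```

On data with few distinct scores, the rank-based method should agree with this count exactly, while the histogram method can differ slightly because it groups nearby scores into the same bin.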

Common machine learning algorithms

References:

https://www.cnblogs.com/molieren/articles/10664954.html

https://blog.csdn.net/jiaoyangwm/article/details/79525237


Defining and calling functions: positional and keyword arguments

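As a small illustration of this topic, the sketch below defines one function and calls it with positional arguments, with keyword arguments, and with defaults. The function and its parameter names are hypothetical and chosen only to fit the machine learning setting of these notes.

```python
def describe_model(name, n_estimators=100, learning_rate=0.1):
    """Return a one-line description built from the arguments."""
    return "{}: n_estimators={}, learning_rate={}".format(
        name, n_estimators, learning_rate)


if __name__ == "__main__":
    # Positional arguments are matched to parameters by order.
    print(describe_model("xgboost", 200, 0.05))
    # Keyword arguments are matched by name, so their order does not matter.
    print(describe_model("xgboost", learning_rate=0.05, n_estimators=200))
    # Omitted arguments fall back to their default values.
    print(describe_model("xgboost"))
```

Positional arguments must come before keyword arguments in a call; mixing them the other way round is a syntax error.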

Lists, dictionaries, and sets

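The three containers can be seen side by side in a short example. This is a minimal sketch with a made-up helper name: it counts label frequencies with a dictionary, finds the distinct labels with a set, and returns them in order as a list.

```python
def summarize_labels(labels):
    """Count how often each label occurs and collect the distinct labels."""
    counts = {}                      # dict: maps a label to its count
    for label in labels:
        counts[label] = counts.get(label, 0) + 1
    distinct = set(labels)           # set: unordered, duplicates dropped
    ordered = sorted(distinct)       # list: ordered and indexable
    return counts, distinct, ordered


if __name__ == "__main__":
    counts, distinct, ordered = summarize_labels([0, 1, 1, 0, 1])
    print(counts)      # {0: 2, 1: 3}
    print(distinct)    # {0, 1}
    print(ordered[0])  # 0
```

A list keeps order and allows duplicates, a dictionary looks values up by key, and a set answers membership questions without keeping duplicates or order.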