Resources
Books
- 周志华《机器学习》第四章:决策树
- 李航《统计学习方法》第五章:决策树
- 《机器学习实战》第三章:决策树
documents
- A series of decision tree resourceshttps://blog.csdn.net/cindy407/article/details/92850901
- Sklearn resourceshttps://scikit-learn.org/stable/modules/tree.html#classification
- ID3、C4.5、CART algorithmhttps://zhuanlan.zhihu.com/p/85731206
Decision tree algorithm
ID3
2 C4.5
3 CART
2 The process of decision tree
- collect data
- prepare data
- analyse data
- train algorithm
- test algorithm
-
First example: Judje fish or non-fish
1 Code analyse
1 Prepare data
1 Argument analyse
2 Process analyse
3 Code implementationdef createDataSet():# dataSet 前两列是特征,最后一列对应的是每条数据对应的分类标签dataSet = [[1, 1, 'yes'],[1, 1, 'yes'],[1, 0, 'no'],[0, 1, 'no'],[0, 1, 'no']]# dataSet = [['yes'],# ['yes'],# ['no'],# ['no'],# ['no']]# labels 露出水面 脚蹼,注意: 这里的labels是写的 dataSet 中特征的含义,并不是对应的分类标签或者说目标变量labels = ['no surfacing', 'flippers']# 返回return dataSet, labels
2 Train algorithm
1 Calculate Shannon entropy
Arguments analyse:
- data
- Process analyse
- Calculate data length
- Initialize labelCount and shannonEntropy
- Calculate labelCount(for cycle to achieve)
- Calculate shannonEntropy(for cycle to achieve)
Code implements
def CalculateShannonEntpry(data):dataNum = len(data)# Calculate the number of the labellabelCount = {}for line in data:currentLabel = data[-1]if currentLabel not in labelCount.key():labelCount[currentLabel] = 0labelCount += 1shannonEntropy = 0.0for key in labelCount:prob = float(labelCount[key])/dataNumshannonEntropy += probreturn shannonEntropy
2 Devision data and choose the best way of devision data
Arguments analyse:
- Process analyse:
Code implements:
def DevisionData(dataSet, index, value):retDataSet = []for featVec in dataSet:# index列为value的数据集【该数据集需要排除index列】# 判断index列的值是否为valueif featVec[index] == value:# chop out index used for splitting# [:index]表示前index行,即若 index 为2,就是取 featVec 的前 index 行reducedFeatVec = featVec[:index]'''extend和append的区别:list.append(object) 向列表中添加一个对象objectlist.extend(sequence) 把一个序列seq的内容添加到列表中1、使用append的时候,是将new_media看作一个对象,整体打包添加到music_media对象中。2、使用extend的时候,是将new_media看作一个序列,将这个序列和music_media序列合并,并放在其后面。result = []result.extend([1,2,3])print(result)result.append([4,5,6])print(result)result.extend([7,8,9])print(result)结果:[1, 2, 3][1, 2, 3, [4, 5, 6]][1, 2, 3, [4, 5, 6], 7, 8, 9]'''reducedFeatVec.extend(featVec[index+1:])# [index+1:]表示从跳过 index 的 index+1行,取接下来的数据# 收集结果值 index列为value的行【该行需要排除index列】retDataSet.append(reducedFeatVec)return retDataSet
3 Recursive build decision tree
Arguments analyse:
- Process analyse:
- Code implements:
classList = [example[-1] for example in dataSet]# first termination condition: only a classif classList.count(classList[0]) == len(classList):return classList[0]# second termination condition: Use all features,butif len(dataSet[0]) == 1:return majorityCnt(classList)
