Resources
Books
- 周志华《机器学习》第四章:决策树
- 李航《统计学习方法》第五章:决策树
- 《机器学习实战》第三章:决策树
documents
- A series of decision tree resourceshttps://blog.csdn.net/cindy407/article/details/92850901
- Sklearn resourceshttps://scikit-learn.org/stable/modules/tree.html#classification
- ID3、C4.5、CART algorithmhttps://zhuanlan.zhihu.com/p/85731206
Decision tree algorithm
ID3
2 C4.5
3 CART
2 The process of decision tree
- collect data
- prepare data
- analyse data
- train algorithm
- test algorithm
-
First example: Judje fish or non-fish
1 Code analyse
1 Prepare data
1 Argument analyse
2 Process analyse
3 Code implementationdef createDataSet():
# dataSet 前两列是特征,最后一列对应的是每条数据对应的分类标签
dataSet = [[1, 1, 'yes'],
[1, 1, 'yes'],
[1, 0, 'no'],
[0, 1, 'no'],
[0, 1, 'no']]
# dataSet = [['yes'],
# ['yes'],
# ['no'],
# ['no'],
# ['no']]
# labels 露出水面 脚蹼,注意: 这里的labels是写的 dataSet 中特征的含义,并不是对应的分类标签或者说目标变量
labels = ['no surfacing', 'flippers']
# 返回
return dataSet, labels
2 Train algorithm
1 Calculate Shannon entropy
Arguments analyse:
- data
- Process analyse
- Calculate data length
- Initialize labelCount and shannonEntropy
- Calculate labelCount(for cycle to achieve)
- Calculate shannonEntropy(for cycle to achieve)
Code implements
def CalculateShannonEntpry(data):
dataNum = len(data)
# Calculate the number of the label
labelCount = {}
for line in data:
currentLabel = data[-1]
if currentLabel not in labelCount.key():
labelCount[currentLabel] = 0
labelCount += 1
shannonEntropy = 0.0
for key in labelCount:
prob = float(labelCount[key])/dataNum
shannonEntropy += prob
return shannonEntropy
2 Devision data and choose the best way of devision data
Arguments analyse:
- Process analyse:
Code implements:
def DevisionData(dataSet, index, value):
retDataSet = []
for featVec in dataSet:
# index列为value的数据集【该数据集需要排除index列】
# 判断index列的值是否为value
if featVec[index] == value:
# chop out index used for splitting
# [:index]表示前index行,即若 index 为2,就是取 featVec 的前 index 行
reducedFeatVec = featVec[:index]
'''
extend和append的区别:
list.append(object) 向列表中添加一个对象object
list.extend(sequence) 把一个序列seq的内容添加到列表中
1、使用append的时候,是将new_media看作一个对象,整体打包添加到music_media对象中。
2、使用extend的时候,是将new_media看作一个序列,将这个序列和music_media序列合并,并放在其后面。
result = []
result.extend([1,2,3])
print(result)
result.append([4,5,6])
print(result)
result.extend([7,8,9])
print(result)
结果:
[1, 2, 3]
[1, 2, 3, [4, 5, 6]]
[1, 2, 3, [4, 5, 6], 7, 8, 9]
'''
reducedFeatVec.extend(featVec[index+1:])
# [index+1:]表示从跳过 index 的 index+1行,取接下来的数据
# 收集结果值 index列为value的行【该行需要排除index列】
retDataSet.append(reducedFeatVec)
return retDataSet
3 Recursive build decision tree
Arguments analyse:
- Process analyse:
- Code implements:
classList = [example[-1] for example in dataSet]
# first termination condition: only a class
if classList.count(classList[0]) == len(classList):
return classList[0]
# second termination condition: Use all features,but
if len(dataSet[0]) == 1:
return majorityCnt(classList)