Resources

Books

  • Zhou Zhihua, Machine Learning (《机器学习》), Chapter 4: Decision Trees
  • Li Hang, Statistical Learning Methods (《统计学习方法》), Chapter 5: Decision Trees
  • Machine Learning in Action (《机器学习实战》), Chapter 3: Decision Trees

Documents

2 The decision tree workflow

  • Collect data
  • Prepare data
  • Analyse data
  • Train the algorithm
  • Test the algorithm
  • Use the algorithm

    First example: judging fish vs. non-fish

    1 Code analysis

    1 Prepare data

    1 Argument analysis
    2 Process analysis
    3 Code implementation

    def createDataSet():
        # The first two columns of dataSet are features; the last column
        # is the class label of each sample
        dataSet = [[1, 1, 'yes'],
                   [1, 1, 'yes'],
                   [1, 0, 'no'],
                   [0, 1, 'no'],
                   [0, 1, 'no']]
        # labels: 'no surfacing' and 'flippers'. Note: these are the names
        # (meanings) of the feature columns in dataSet, not the class
        # labels / target variable
        labels = ['no surfacing', 'flippers']
        return dataSet, labels
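To make the comment's distinction concrete — `labels` names the feature columns, while the last element of each row is the target — here is a quick sanity check that mirrors the data returned by `createDataSet` (the variable names below are just for illustration):

```python
# Toy dataset mirroring createDataSet above
dataSet = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
           [0, 1, 'no'], [0, 1, 'no']]
labels = ['no surfacing', 'flippers']

# Pair each feature value with its feature name; the class label is row[-1]
for row in dataSet:
    features = dict(zip(labels, row[:-1]))
    print(features, '->', row[-1])
```

The first line printed is `{'no surfacing': 1, 'flippers': 1} -> yes`, showing that `labels` only describes the feature columns.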

    2 Train the algorithm

    1 Calculate Shannon entropy

  • Argument analysis:

    • data
  • Process analysis
    • Calculate the data length
    • Initialize labelCount and shannonEntropy
    • Count label occurrences into labelCount (with a for loop)
    • Accumulate shannonEntropy (with a for loop)
  • Code implementation

    from math import log

    def CalculateShannonEntropy(data):
        dataNum = len(data)
        # Count how many times each class label appears
        labelCount = {}
        for line in data:
            currentLabel = line[-1]
            if currentLabel not in labelCount:
                labelCount[currentLabel] = 0
            labelCount[currentLabel] += 1
        # H = -sum(p * log2(p)) over all class labels
        shannonEntropy = 0.0
        for key in labelCount:
            prob = float(labelCount[key]) / dataNum
            shannonEntropy -= prob * log(prob, 2)
        return shannonEntropy
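For the fish dataset above, the label counts are yes: 2 and no: 3, so the entropy is -(2/5)·log2(2/5) - (3/5)·log2(3/5) ≈ 0.971. A minimal standalone restatement of the routine verifies this (the names `calc_shannon_entropy` and `fish_data` are illustrative, not from the original code):

```python
from math import log2

def calc_shannon_entropy(data):
    # Standalone restatement of CalculateShannonEntropy, for verification
    counts = {}
    for row in data:
        counts[row[-1]] = counts.get(row[-1], 0) + 1
    n = len(data)
    return -sum((c / n) * log2(c / n) for c in counts.values())

fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
             [0, 1, 'no'], [0, 1, 'no']]
print(round(calc_shannon_entropy(fish_data), 4))  # 0.971
```

A perfectly pure dataset (all one class) would give entropy 0, and a 50/50 split would give 1, which is why lower entropy after a split means a better split.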

    2 Split the data and choose the best split

  • Argument analysis:

  • Process analysis:
  • Code implementation:

    def DevisionData(dataSet, index, value):
        retDataSet = []
        for featVec in dataSet:
            # Keep the rows whose value in column `index` equals `value`,
            # and drop column `index` from each kept row
            if featVec[index] == value:
                # featVec[:index] takes the first `index` elements, i.e.
                # the features before the splitting column
                reducedFeatVec = featVec[:index]
                '''
                Difference between extend and append:
                list.append(object)   adds object to the list as a single element
                list.extend(sequence) adds each element of sequence to the list
                result = []
                result.extend([1, 2, 3])   # result == [1, 2, 3]
                result.append([4, 5, 6])   # result == [1, 2, 3, [4, 5, 6]]
                result.extend([7, 8, 9])   # result == [1, 2, 3, [4, 5, 6], 7, 8, 9]
                '''
                # featVec[index+1:] skips column `index` and takes the
                # remaining elements
                reducedFeatVec.extend(featVec[index + 1:])
                # collect the reduced row (column `index` removed)
                retDataSet.append(reducedFeatVec)
        return retDataSet
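Splitting the fish dataset on the first feature ('no surfacing') with value 1 should keep the three rows whose first column is 1 and drop that column. A standalone restatement checks this (the names `split_data` and `fish_data` are illustrative):

```python
def split_data(data_set, index, value):
    # Standalone restatement of DevisionData, for a quick check
    ret = []
    for feat_vec in data_set:
        if feat_vec[index] == value:
            reduced = feat_vec[:index]        # features before column `index`
            reduced.extend(feat_vec[index + 1:])  # features after it
            ret.append(reduced)
    return ret

fish_data = [[1, 1, 'yes'], [1, 1, 'yes'], [1, 0, 'no'],
             [0, 1, 'no'], [0, 1, 'no']]
print(split_data(fish_data, 0, 1))  # [[1, 'yes'], [1, 'yes'], [0, 'no']]
```

Note that slicing (`feat_vec[:index]`) builds a new list, so the original rows are not mutated.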

    3 Recursively build the decision tree

  • Argument analysis:

  • Process analysis:
  • Code implementation:
    classList = [example[-1] for example in dataSet]
    # First termination condition: all samples belong to the same class
    if classList.count(classList[0]) == len(classList):
        return classList[0]
    # Second termination condition: all features have been used, but the
    # classes are still mixed; fall back to the majority class
    if len(dataSet[0]) == 1:
        return majorityCnt(classList)
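The fragment above calls majorityCnt, which is not shown in these notes. A minimal sketch of what such a helper could look like (the exact tie-breaking behavior here is an assumption):

```python
def majorityCnt(classList):
    # Count each label's votes, then return the most frequent label.
    # Ties are broken by whichever label max() encounters first;
    # this tie-breaking rule is an assumption, not from the notes.
    counts = {}
    for vote in classList:
        counts[vote] = counts.get(vote, 0) + 1
    return max(counts, key=counts.get)

print(majorityCnt(['yes', 'no', 'no']))  # prints: no
```

This helper is only reached when every feature has been consumed but the remaining samples still disagree, so the leaf takes the most common class.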