Machine Learning in Action - Information Entropy and Decision Trees - 《DeepLearning》

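Shannon entropy, also called information entropy, relates the amount of information in a message to its uncertainty. It is a measure of information, expressed in bits.
For a given event:
the greater the uncertainty, the greater the entropy, and the more information is needed to determine the outcome;
the smaller the uncertainty, the smaller the entropy, and the less information is needed to determine the outcome.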
Suppose an event $X$ has $n$ possible outcomes, and the $i$-th outcome occurs with probability $P(X_i)$. The Shannon entropy is then:
$$H(X) = -\sum_{i=1}^{n} P(X_i)\log_2 P(X_i)$$
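
As a quick check of the formula, consider two coins (made-up examples, not from the original text): a fair coin, and a biased coin that lands heads with probability 0.9.

$$H_{\text{fair}} = -\left(\tfrac{1}{2}\log_2\tfrac{1}{2} + \tfrac{1}{2}\log_2\tfrac{1}{2}\right) = 1 \text{ bit}$$

$$H_{\text{biased}} = -\left(0.9\log_2 0.9 + 0.1\log_2 0.1\right) \approx 0.469 \text{ bits}$$

The biased coin is more predictable, so its entropy is lower, which matches the relationship between uncertainty and entropy described above.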

Computing the Shannon entropy of a given dataset (Python module):

import math

def calcShannonEnt(dataSet):
    """Return the Shannon entropy (in bits) of the class labels in dataSet."""
    numEntries = len(dataSet)
    labelCounts = {}
    for dataVec in dataSet:
        label = dataVec[-1]  # the class label is the last column of each row
        if label not in labelCounts:
            labelCounts[label] = 0
        labelCounts[label] += 1
    shannonEnt = 0.0
    for key in labelCounts:
        prob = float(labelCounts[key]) / numEntries  # relative frequency of this label
        shannonEnt -= prob * math.log(prob, 2)
    return shannonEnt

if __name__ == "__main__":
    print("Code Run As A Program")
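
To see how the entropy reacts to the label mix, here is a small self-contained sketch. The helper name `label_entropy` and the toy rows are made up for illustration; the helper is a compact variant of the function above built on `collections.Counter`, not the original code.

```python
import math
from collections import Counter

def label_entropy(rows):
    """Shannon entropy (bits) of the labels stored in the last column of rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

# Made-up toy rows: two binary features followed by a class label.
toy = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]

mixed = label_entropy(toy)                    # 2 "yes" vs 3 "no": about 0.971 bits
pure = label_entropy([[1, "no"], [0, "no"]])  # a single class: zero entropy
print(round(mixed, 3))
```

A dataset with only one class has zero entropy, and the value grows as the classes become more evenly mixed.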

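Splitting a dataset organizes the data further and can turn disordered data into ordered data.
Entropy measures how disordered the data is.
The more uniform the data, the lower its entropy.
The procedure is as follows: compute the entropy of the original dataset, try a split on each feature, and pick the split that gives the largest reduction in entropy: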

def chooseBestFeatureToSplit(dataSet):
    numFeatures = len(dataSet[0]) - 1  # the last column holds the labels
    baseEntropy = calcShannonEnt(dataSet)
    bestInfoGain = 0.0
    bestFeature = -1
    for i in range(numFeatures):  # iterate over all the features
        featList = [example[i] for example in dataSet]  # all values of this feature
        uniqueVals = set(featList)  # the distinct values of this feature
        newEntropy = 0.0
        for value in uniqueVals:
            subDataSet = splitDataSet(dataSet, i, value)
            prob = len(subDataSet) / float(len(dataSet))
            newEntropy += prob * calcShannonEnt(subDataSet)  # weighted entropy after the split
        infoGain = baseEntropy - newEntropy  # information gain, i.e. the reduction in entropy
        if infoGain > bestInfoGain:  # keep the best gain seen so far
            bestInfoGain = infoGain
            bestFeature = i
    return bestFeature  # index of the best feature, or -1 if no split reduces entropy
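
The function above calls `splitDataSet`, which this post does not show. Below is a minimal sketch under one assumption: it returns the rows whose feature at index `axis` equals `value`, with that feature column removed. The toy rows are made up for illustration.

```python
def splitDataSet(dataSet, axis, value):
    """Assumed behavior: keep rows where row[axis] == value and drop that column."""
    retDataSet = []
    for featVec in dataSet:
        if featVec[axis] == value:
            # copy the row without the feature that was just tested
            retDataSet.append(featVec[:axis] + featVec[axis + 1:])
    return retDataSet

# Made-up toy rows: two binary features followed by a class label.
toy = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]

left = splitDataSet(toy, 0, 1)   # rows where feature 0 equals 1
right = splitDataSet(toy, 0, 0)  # rows where feature 0 equals 0
print(left)   # [[1, 'yes'], [1, 'yes'], [0, 'no']]
print(right)  # [[1, 'no'], [1, 'no']]
```

Dropping the tested column means a feature is not considered again further down the same branch, and the returned rows are new lists, so the original dataset is left unchanged.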

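The idea behind a decision tree is to split the data using well-chosen features.
A good feature is one that separates the classes cleanly.
In other words, data that was disordered becomes more ordered after the split, which means its entropy decreases.
This decrease in entropy is the information gain, that is, the information obtained from the split.
So at each step of building the tree, choose the best split, make it a node at the current level, and recurse on each resulting subset.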
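
The recursive construction described above can be sketched as follows. This is only an illustration under stated assumptions: the names `entropy`, `split`, `best_feature`, and `build_tree`, the nested-dict tree representation, the majority-vote fallback, and the toy data are all hypothetical and do not come from the original post.

```python
import math
from collections import Counter

def entropy(rows):
    """Shannon entropy (bits) of the labels in the last column of rows."""
    counts = Counter(row[-1] for row in rows)
    total = len(rows)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())

def split(rows, axis, value):
    """Rows whose feature `axis` equals `value`, with that column removed."""
    return [row[:axis] + row[axis + 1:] for row in rows if row[axis] == value]

def best_feature(rows):
    """Index of the feature with the largest information gain, or -1 if none helps."""
    base = entropy(rows)
    best_gain, best_idx = 0.0, -1
    for i in range(len(rows[0]) - 1):
        remainder = 0.0
        for value in set(row[i] for row in rows):
            subset = split(rows, i, value)
            remainder += len(subset) / len(rows) * entropy(subset)
        gain = base - remainder
        if gain > best_gain:
            best_gain, best_idx = gain, i
    return best_idx

def build_tree(rows, names):
    """Recursively build a nested-dict tree; leaves are class labels."""
    labels = [row[-1] for row in rows]
    if len(set(labels)) == 1:   # pure node: stop splitting
        return labels[0]
    idx = best_feature(rows)
    if idx == -1:               # no useful split left: fall back to majority vote
        return Counter(labels).most_common(1)[0][0]
    rest = names[:idx] + names[idx + 1:]
    return {names[idx]: {value: build_tree(split(rows, idx, value), rest)
                         for value in set(row[idx] for row in rows)}}

# Made-up toy rows: two binary features (called "x0" and "x1" here) and a label.
data = [[1, 1, "yes"], [1, 1, "yes"], [1, 0, "no"], [0, 1, "no"], [0, 1, "no"]]
tree = build_tree(data, ["x0", "x1"])
print(tree)
```

On this toy data the first split is on `x0`, because it yields the larger information gain; the `x0 == 1` branch is then split again on `x1`, while the `x0 == 0` branch is already pure and becomes a leaf.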