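As described in the previous article, ID3 has several main drawbacks: first, it splits on information gain, which is less accurate than the gain ratio; second, it cannot handle continuous data directly and only works once the data has been discretized; third, it does no pruning, so the resulting tree can become overly complex and overfit.
C4.5 improves on ID3 in these three respects:
a) C4.5 uses the information gain ratio as the criterion for splitting a node;
b) it can handle continuous data;
c) it prunes the fully grown tree, which reduces the effect of overfitting to some extent.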
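1. Using the information gain ratio as the splitting criterion
The information gain ratio is defined as:
$$\mathrm{GainRatio}(A) = \frac{\mathrm{Gain}(A)}{\mathrm{SplitInfo}(A)}$$
where $\mathrm{Gain}(A)$ is the information gain obtained by splitting on attribute $A$, and $\mathrm{SplitInfo}(A)$ is the split information, which measures how the records are distributed over the child nodes:
$$\mathrm{SplitInfo}(A) = -\sum_{i=1}^{m} \frac{N_i}{N}\log_2\frac{N_i}{N}$$
where $m$ is the number of child nodes, $N_i$ is the number of records in the $i$-th child node and $N$ is the number of records in the parent node. Put plainly, $\mathrm{SplitInfo}(A)$ is the entropy of the split itself.
The larger the gain ratio, the better the split.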
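The two formulas can be turned into code fairly directly. The following is a minimal sketch, not taken from the article's own program: the class `GainRatioSketch` and its methods are hypothetical names, and the input is assumed to be a matrix of class counts with one row per child node.

```csharp
using System;

// Hypothetical helper for illustration; not part of the article's code base.
public static class GainRatioSketch
{
    // Shannon entropy (in bits) of a vector of counts.
    public static double Entropy(double[] counts)
    {
        double total = 0;
        foreach (double c in counts) total += c;
        if (total == 0) return 0;

        double h = 0;
        foreach (double c in counts)
        {
            if (c == 0) continue;          // 0 * log(0) is treated as 0
            double p = c / total;
            h -= p * Math.Log(p, 2);
        }
        return h;
    }

    // childCounts[i][k] = number of records of class k that fall into child i.
    public static double GainRatio(double[][] childCounts)
    {
        int classes = childCounts[0].Length;
        double[] parent = new double[classes];            // class counts of the parent
        double[] sizes = new double[childCounts.Length];  // N_i for each child
        for (int i = 0; i < childCounts.Length; i++)
        {
            for (int k = 0; k < classes; k++)
            {
                parent[k] += childCounts[i][k];
                sizes[i] += childCounts[i][k];
            }
        }

        double n = 0;
        foreach (double s in sizes) n += s;

        // Gain = H(parent) - sum_i (N_i / N) * H(child_i)
        double gain = Entropy(parent);
        for (int i = 0; i < childCounts.Length; i++)
        {
            gain -= sizes[i] / n * Entropy(childCounts[i]);
        }

        // SplitInfo is the entropy of the child sizes themselves.
        double splitInfo = Entropy(sizes);
        return splitInfo == 0 ? 0 : gain / splitInfo;
    }
}
```

A split that separates two equally sized classes perfectly, such as `GainRatio(new[] { new double[] { 2, 0 }, new double[] { 0, 2 } })`, has a gain of 1 bit and a split information of 1 bit, so its ratio is 1.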
A concrete example shows how C4.5 uses the gain ratio to choose the splitting attribute:
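Table 1 Original data
| Weather today | Temperature | Humidity | Day type | Go shopping |
|---|---|---|---|---|
| Sunny | 25 | 50 | Weekday | No |
| Sunny | 21 | 48 | Weekday | Yes |
| Sunny | 18 | 70 | Weekend | Yes |
| Sunny | 28 | 41 | Weekend | Yes |
| Overcast | 8 | 65 | Weekday | Yes |
| Overcast | 18 | 43 | Weekday | No |
| Overcast | 24 | 56 | Weekend | Yes |
| Overcast | 18 | 76 | Weekend | No |
| Rainy | 31 | 61 | Weekend | No |
| Rainy | 6 | 43 | Weekend | Yes |
| Rainy | 15 | 55 | Weekday | No |
| Rainy | 4 | 58 | Weekday | No |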
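Take "weather today" as an example:
It has three values (sunny, overcast and rainy), so the node is split into three child nodes.
Its gain ratio then follows from the formulas above.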
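Working the numbers from Table 1 (these figures are recomputed here from the table, not copied from the original illustration): the 12 records contain 6 "Yes" and 6 "No", and each weather value covers 4 records, with Yes/No counts of 3/1 for sunny, 2/2 for overcast and 1/3 for rainy. Writing $H$ for entropy:

```latex
\begin{aligned}
H(\text{parent}) &= -\tfrac{6}{12}\log_2\tfrac{6}{12} - \tfrac{6}{12}\log_2\tfrac{6}{12} = 1 \\
H(\text{sunny}) = H(\text{rainy}) &= -\tfrac{3}{4}\log_2\tfrac{3}{4} - \tfrac{1}{4}\log_2\tfrac{1}{4} \approx 0.811,
\qquad H(\text{overcast}) = 1 \\
\mathrm{Gain} &= 1 - \tfrac{4}{12}\,(0.811 + 1 + 0.811) \approx 0.126 \\
\mathrm{SplitInfo} &= -3\cdot\tfrac{4}{12}\log_2\tfrac{4}{12} = \log_2 3 \approx 1.585 \\
\mathrm{GainRatio} &= \frac{0.126}{1.585} \approx 0.079
\end{aligned}
```

The same procedure is repeated for every other attribute, and the attribute with the largest ratio is used for the split.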
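2. Handling continuous attributes
C4.5 treats discrete attributes in the same way as ID3 and adds support for continuous attributes. The records are first sorted by the continuous attribute, and the data is then cut in two at a single threshold.
How is the cut point chosen? Simply compute the information gain of every candidate cut point and pick the one that gives the best split.
Take temperature as an example:
In theory, N records give N-1 candidate cut points. To find the best cut point, the information gain has to be computed for every one of them, which is expensive. Is there a shortcut? There is: some cut points can obviously be ruled out. For example, after sorting by temperature, the 2nd and 3rd records (6 and 8 degrees) both have the class label "Yes". Cutting between them would separate two records that belong to the same class, which cannot be better than keeping them together. Cut points between two neighbouring records with the same class label can therefore be dropped to reduce the amount of computation.

In this example, the number of candidate cut points drops from the original 11 to 6, which lowers the computational cost.
Once the candidate cut points are fixed, the next step is to choose the best one. Note that, within a continuous attribute, the cut point is chosen by information gain rather than by gain ratio: the gain ratio tends to favour the cut point that makes the two child nodes as unequal in size as possible. After the best cut point has been chosen, its gain ratio is computed and compared with the other attributes to determine the best splitting attribute.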
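The cut-point search can be sketched as follows. This is an illustrative version under stated assumptions, not the article's implementation: `ContinuousSplitSketch` and `BestCut` are hypothetical names, the class label is assumed to be binary, and a cut is only allowed between two neighbouring records whose attribute values differ. Records with tied values but mixed labels need more care than this sketch gives them.

```csharp
using System;
using System.Linq;

// Hypothetical helper for illustration; not part of the article's code base.
public static class ContinuousSplitSketch
{
    // Entropy (in bits) of a two-class node.
    static double Entropy(int yes, int no)
    {
        double h = 0;
        int total = yes + no;
        foreach (int c in new[] { yes, no })
        {
            if (c == 0) continue;
            double p = (double)c / total;
            h -= p * Math.Log(p, 2);
        }
        return h;
    }

    // Returns the threshold with the highest information gain (NaN if no cut helps).
    // bestGain receives that gain, candidates the number of cuts actually evaluated.
    public static double BestCut(double[] values, bool[] labels,
                                 out double bestGain, out int candidates)
    {
        // Sort record indices by attribute value (OrderBy is a stable sort).
        int[] order = Enumerable.Range(0, values.Length)
                                .OrderBy(i => values[i]).ToArray();
        int totalYes = labels.Count(l => l);
        int totalNo = labels.Length - totalYes;
        double parent = Entropy(totalYes, totalNo);
        double n = order.Length;

        int leftYes = 0, leftNo = 0;
        double bestThreshold = double.NaN;
        bestGain = 0;
        candidates = 0;
        for (int j = 1; j < order.Length; j++)
        {
            int prev = order[j - 1], cur = order[j];
            // The previous record moves to the left side of the cut.
            if (labels[prev]) leftYes++; else leftNo++;

            // Skip cuts between equal values or between records of the same class.
            if (values[prev] == values[cur] || labels[prev] == labels[cur]) continue;
            candidates++;

            double gain = parent
                - j / n * Entropy(leftYes, leftNo)
                - (order.Length - j) / n * Entropy(totalYes - leftYes, totalNo - leftNo);
            if (gain > bestGain)
            {
                bestGain = gain;
                bestThreshold = (values[prev] + values[cur]) / 2;   // midpoint
            }
        }
        return bestThreshold;
    }
}
```

On the temperature column of Table 1 the best cut found this way has an information gain of roughly 0.089. The exact number of candidates depends on how records with tied values (the three records at 18 degrees) are ordered and treated.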
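3. Pruning
As mentioned in the earlier decision-tree article, pruning starts from a fully grown tree and cuts back subtrees that classify poorly, which reduces the complexity of the tree and the effect of overfitting.
C4.5 uses pessimistic error pruning (PEP). The idea is to prune a subtree whenever doing so does not hurt the accuracy of the tree. What counts as "not hurting"? If the error after pruning is below the upper bound of the error before pruning, the pruned tree is considered at least as good, and the subtree is replaced by a leaf.
The condition for pruning is:
$$E_{\text{leaf}} < E_{\text{subtree}} + \sigma$$
where:
$E_{\text{subtree}}$ is the error of the subtree;
$E_{\text{leaf}}$ is the error of the node once it has been turned into a leaf.
Assume that the error of the subtree follows a binomial distribution. By the properties of the binomial distribution,
$$E_{\text{subtree}} = N e,$$
$$\sigma = \sqrt{N e (1 - e)},$$ where
$$e = \frac{\sum_{i=1}^{L} E_i + 0.5\,L}{N},$$ $N$ is the number of records in the subtree, $L$ is the number of its leaves and $E_i$ is the number of misclassified records in the $i$-th leaf. Likewise, the error of the leaf is
$$E_{\text{leaf}} = E + 0.5,$$ where $E$ is the number of records the node misclassifies when it predicts its majority class.
In these formulas 0.5 is a correction factor. Splitting a node always classifies the training data at least as well as the unsplit node, so in theory the error of a parent is never smaller than the total error of its children, and an uncorrected comparison would never favour pruning. Adding 0.5 to every node corrects for this: once the correction is included, the total error of the children is no longer guaranteed to be lower than that of the parent.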
Worked example:
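A minimal sketch of the pruning test, using hypothetical counts rather than the article's own figures. The class `PepSketch` and the method `ShouldPrune` are illustrative names; the arithmetic follows the corrected-error comparison described above.

```csharp
using System;

// Hypothetical helper for illustration; not part of the article's code base.
public static class PepSketch
{
    // leafErrors: misclassified records summed over all leaves of the subtree
    // leafCount:  number of leaves in the subtree
    // nodeErrors: misclassified records if the node predicted its majority class
    // n:          number of records that reach the node
    public static bool ShouldPrune(int leafErrors, int leafCount, int nodeErrors, int n)
    {
        // Corrected subtree error: 0.5 is added for every leaf.
        double subtreeError = leafErrors + 0.5 * leafCount;
        // Standard deviation of the subtree error under the binomial assumption.
        double sigma = Math.Sqrt(subtreeError * (1 - subtreeError / n));
        // Corrected error of the single leaf that would replace the subtree.
        double leafError = nodeErrors + 0.5;
        return leafError < subtreeError + sigma;
    }
}
```

Suppose a node holds 20 records, its subtree has 3 leaves that misclassify 3 records in total, and the node alone would misclassify 5. The corrected subtree error is 3 + 1.5 = 4.5, its standard deviation is about 1.87, and the corrected leaf error is 5.5. Since 5.5 < 4.5 + 1.87, `ShouldPrune(3, 3, 5, 20)` returns true. If the node alone misclassified 8 records instead, 8.5 would exceed 6.37 and the subtree would be kept.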

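Program design and source code (C# version)
(1) Data format
The raw data is converted to numbers and stored as a two-dimensional array. Each row is one record, the first n-1 columns are the attributes and the last column is the class label.
Table 2 shows data in this numeric format (note that it uses a "season" attribute and a three-valued "tomorrow's weather" label rather than the columns of Table 1):
Table 2 Data after initialization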
| Weather today | Temperature | Humidity | Season | Tomorrow's weather |
|---|---|---|---|---|
| 1 | 25 | 50 | 1 | 1 |
| 2 | 21 | 48 | 1 | 2 |
| 2 | 18 | 70 | 1 | 3 |
| 1 | 28 | 41 | 2 | 1 |
| 3 | 8 | 65 | 3 | 2 |
| 1 | 18 | 43 | 2 | 1 |
| 2 | 24 | 56 | 4 | 1 |
| 3 | 18 | 76 | 4 | 2 |
| 3 | 31 | 61 | 2 | 1 |
| 2 | 6 | 43 | 3 | 3 |
| 1 | 15 | 55 | 4 | 2 |
| 3 | 4 | 58 | 3 | 3 |
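Here, for the "weather today" attribute, the numbers {1,2,3} stand for {sunny, overcast, rainy}; for the "season" attribute, {1,2,3,4} stand for {spring, summer, winter, autumn}; for the class label "tomorrow's weather", {1,2,3} stand for {sunny, overcast, rainy}.
The data is declared as follows:

    static double[][] allData;              // training data
    static List<String>[] featureValues;    // discrete values per attribute (element type String is an assumption)

featureValues is an array of lists. Its length equals the number of attributes, and each element is the list of discrete values of the corresponding attribute.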
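One possible way to produce such a numeric array from raw text rows is sketched below. It is not the article's loader: `EncodingSketch.Encode` is a hypothetical helper that numbers the distinct values of each discrete column 1, 2, 3, ... in order of first appearance and parses continuous columns as numbers.

```csharp
using System;
using System.Collections.Generic;
using System.Globalization;

// Hypothetical helper for illustration; not part of the article's code base.
public static class EncodingSketch
{
    // rows[r][c] is the raw text of column c in record r.
    // isContinuous[c] tells whether column c holds numbers or category names.
    public static double[][] Encode(string[][] rows, bool[] isContinuous)
    {
        int cols = rows[0].Length;
        // One value-to-code dictionary per column.
        var codes = new Dictionary<string, int>[cols];
        for (int c = 0; c < cols; c++) codes[c] = new Dictionary<string, int>();

        var data = new double[rows.Length][];
        for (int r = 0; r < rows.Length; r++)
        {
            data[r] = new double[cols];
            for (int c = 0; c < cols; c++)
            {
                if (isContinuous[c])
                {
                    data[r][c] = double.Parse(rows[r][c], CultureInfo.InvariantCulture);
                }
                else
                {
                    int code;
                    if (!codes[c].TryGetValue(rows[r][c], out code))
                    {
                        code = codes[c].Count + 1;     // codes start at 1
                        codes[c][rows[r][c]] = code;
                    }
                    data[r][c] = code;
                }
            }
        }
        return data;
    }
}
```

Codes start at 1 because the listings in this article index their count arrays with `value - 1`.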
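(2) Two classes: the node class and the split information class
a) The node class Node
This class represents a node. Its properties include the splitting attribute chosen for the node, the class the node outputs, its child nodes, its depth and so on. Note that, compared with the ID3 version, two properties have been added: leafWrong and leafNode_Count hold the total classification error of the leaves below the node and the number of those leaves. They exist mainly to make pruning easier.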

    class Node
    {
        /// <summary>Attribute values corresponding to each child node.</summary>
        public List<String> features { get; set; }

        /// <summary>Node type: "result" (leaf), "离散" (discrete split) or "连续" (continuous split).</summary>
        public String feature_Type { get; set; }

        /// <summary>Column index of the splitting attribute.</summary>
        public String SplitFeature { get; set; }

        /// <summary>Number of records of each class.</summary>
        public double[] ClassCount { get; set; }

        /// <summary>Number of records in this node.</summary>
        public int rowCount { get; set; }

        /// <summary>Child nodes.</summary>
        public List<Node> childNodes { get; set; }

        /// <summary>Parent node.</summary>
        public Node Parent { get; set; }

        /// <summary>The class with the largest share in this node.</summary>
        public String finalResult { get; set; }

        /// <summary>Depth of the node in the tree.</summary>
        public int deep { get; set; }

        /// <summary>Zero-based index of the class with the largest share.</summary>
        public int result { get; set; }

        /// <summary>Number of misclassified records over the leaves below this node.</summary>
        public int leafWrong { get; set; }

        /// <summary>Number of leaves below this node.</summary>
        public int leafNode_Count { get; set; }

        /// <summary>Errors made if this node predicts its majority class.</summary>
        public double getErrorCount()
        {
            return rowCount - ClassCount[result];
        }

        /// <summary>Stores the class counts and records the index of the largest one.</summary>
        public void setClassCount(double[] count)
        {
            this.ClassCount = count;
            double max = ClassCount[0];
            int result = 0;
            for (int i = 1; i < ClassCount.Length; i++)
            {
                if (max < ClassCount[i])
                {
                    max = ClassCount[i];
                    result = i;
                }
            }
            this.result = result;
        }
    }

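b) The split information class
This class stores the information needed to split a node: the row indices of each child node, the number of records of each class in each child node, the attribute the node is split on, the type of that attribute and so on.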

    class SplitInfo
    {
        /// <summary>Column index of the splitting attribute.</summary>
        public int splitIndex { get; set; }

        /// <summary>Attribute type (0: discrete, 1: continuous).</summary>
        public int type { get; set; }

        /// <summary>Values of the splitting attribute.</summary>
        public List<String> features { get; set; }

        /// <summary>Row indices assigned to each child node.</summary>
        public List<int>[] temp { get; set; }

        /// <summary>Number of records of each class in each child node.</summary>
        public double[][] class_Count { get; set; }
    }

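(3) Main method findBestSplit(Node node, List<int> nums, int[] isUsed)
where:
- node is the node about to be split;
- nums is the list of row indices of the records in the node;
- isUsed records which attributes have already been used on the way down to this node.
findBestSplit consists of the following parts:
1) Deciding whether to stop splitting
Splitting stops when the maximum depth is reached, the node is pure, every attribute has been used, or the node holds too few records. The source code is: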

    public static bool ifEnd(Node node, double entropy, int[] isUsed)
    {
        try
        {
            double[] count = node.ClassCount;
            int rowCount = node.rowCount;
            int maxResult = 0;

            // Stop when the tree has reached the maximum depth.
            int deep = node.deep;
            if (deep >= maxDeep)
            {
                maxResult = node.result + 1;
                node.feature_Type = "result";
                node.features = new List<String>() { maxResult + "" };
                node.leafWrong = rowCount - Convert.ToInt32(count[maxResult - 1]);
                node.leafNode_Count = 1;
                return true;
            }

            // Stop when the node is pure (entropy is 0).
            if (entropy == 0)
            {
                maxResult = node.result + 1;
                node.feature_Type = "result";
                node.features = new List<String> { maxResult + "" };
                node.leafWrong = rowCount - Convert.ToInt32(count[maxResult - 1]);
                node.leafNode_Count = 1;
                return true;
            }

            // Stop when every attribute has already been used.
            bool flag = true;
            for (int i = 0; i < isUsed.Length - 1; i++)
            {
                if (isUsed[i] == 0)
                {
                    flag = false;
                    break;
                }
            }
            if (flag)
            {
                maxResult = node.result + 1;
                node.feature_Type = "result";
                node.features = new List<String> { "" + maxResult };
                node.leafWrong = rowCount - Convert.ToInt32(count[maxResult - 1]);
                node.leafNode_Count = 1;
                return true;
            }

            // Stop when the node holds fewer than Limit_Node records.
            if (rowCount < Limit_Node)
            {
                maxResult = node.result + 1;
                node.feature_Type = "result";
                node.features = new List<String> { "" + maxResult };
                node.leafWrong = rowCount - Convert.ToInt32(count[maxResult - 1]);
                node.leafNode_Count = 1;
                return true;
            }
            return false;
        }
        catch (Exception)
        {
            return false;
        }
    }

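2) Finding the best splitting attribute
Finding the best splitting attribute requires the gain ratio of every candidate attribute, using the formulas given above. The entropy is computed as follows: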
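
    // Returns the sum of p * log2(p) over all classes, that is, the NEGATIVE of the
    // Shannon entropy (0 for a pure node). findBestSplit relies on this sign.
    public static double CalEntropy(double[] counts, int countAll)
    {
        try
        {
            double allShang = 0;
            for (int i = 0; i < counts.Length; i++)
            {
                if (counts[i] == 0)
                {
                    continue;   // 0 * log(0) is treated as 0
                }
                double rate = counts[i] / countAll;
                allShang = allShang + rate * Math.Log(rate, 2);
            }
            return allShang;
        }
        catch (Exception)
        {
            return 0;
        }
    }
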
3) Splitting and processing the child nodes
This is simply a recursive process: findBestSplit is called on every child node to split it in turn.
Source code of findBestSplit:

    // Relies on static fields not shown in this article: type[i] (0 = discrete,
    // otherwise continuous), allNum[i] (number of values of discrete attribute i),
    // classCount (number of classes) and lieshu (number of columns, label last).
    public static Node findBestSplit(Node node, List<int> nums, int[] isUsed)
    {
        try
        {
            // Decide whether to keep splitting. CalEntropy returns the negative entropy.
            double totalShang = CalEntropy(node.ClassCount, node.rowCount);
            if (ifEnd(node, totalShang, isUsed))
            {
                return node;
            }

            SplitInfo info = new SplitInfo();
            info.splitIndex = -1;        // -1 means "no usable split found yet"
            int RowCount = nums.Count;   // number of records in this node
            double jubuMax = 0;          // best score found so far

            for (int i = 0; i < isUsed.Length - 1; i++)
            {
                if (isUsed[i] == 1)
                {
                    continue;
                }

                if (type[i] == 0)
                {
                    // Discrete attribute: one child per attribute value.
                    double[][] allCount = new double[allNum[i]][];   // class counts per value
                    for (int j = 0; j < allCount.Length; j++)
                    {
                        allCount[j] = new double[classCount];
                    }
                    int[] countAllFeature = new int[allNum[i]];      // record count per value
                    List<int>[] temp = new List<int>[allNum[i]];     // row indices per value
                    for (int j = 0; j < temp.Length; j++)
                    {
                        temp[j] = new List<int>();
                    }
                    for (int j = 0; j < nums.Count; j++)
                    {
                        int index = Convert.ToInt32(allData[nums[j]][i]);
                        temp[index - 1].Add(nums[j]);
                        countAllFeature[index - 1]++;
                        allCount[index - 1][Convert.ToInt32(allData[nums[j]][lieshu - 1]) - 1]++;
                    }

                    double allShang = 0;   // weighted negative entropy of the children
                    double chushu = 0;     // negated split information
                    for (int j = 0; j < allCount.Length; j++)
                    {
                        allShang = allShang + CalEntropy(allCount[j], countAllFeature[j]) * countAllFeature[j] / RowCount;
                        if (countAllFeature[j] > 0)
                        {
                            double rate = countAllFeature[j] / Convert.ToDouble(RowCount);
                            chushu = chushu + rate * Math.Log(rate, 2);
                        }
                    }
                    // Information gain = H(parent) - weighted H(children).
                    // NOTE: chushu is computed but never applied, so attributes are ranked
                    // by information gain here. Dividing allShang by -chushu (when non-zero)
                    // would give the gain ratio described in section 1.
                    allShang = (-totalShang + allShang);
                    if (allShang > jubuMax)
                    {
                        info.features = new List<string>();
                        info.type = 0;
                        info.temp = temp;
                        info.splitIndex = i;
                        info.class_Count = allCount;
                        jubuMax = allShang;
                    }
                }
                else
                {
                    // Continuous attribute: sort the records and look for the best binary cut.
                    double[] leftCount = new double[classCount];            // class counts left of the best cut
                    double[] rightCount = new double[classCount];           // class counts right of the best cut
                    double[] count1 = new double[classCount];               // running class counts, left part
                    double[] count2 = new double[node.ClassCount.Length];   // running class counts, right part
                    for (int j = 0; j < node.ClassCount.Length; j++)
                    {
                        count2[j] = node.ClassCount[j];
                    }
                    int all1 = 0;              // records in the left part
                    int all2 = nums.Count;     // records in the right part
                    double lastValue = 0;      // class label of the previous record
                    double currentValue = 0;   // class label of the current record
                    double lastPoint = 0;      // attribute value of the previous record
                    double currentPoint = 0;   // attribute value of the current record
                    int splitPoint = 0;
                    double splitValue = 0;

                    double[] values = new double[nums.Count];
                    for (int j = 0; j < values.Length; j++)
                    {
                        values[j] = allData[nums[j]][i];
                    }
                    QSort(values, nums, 0, nums.Count - 1);   // sorts values and reorders nums to match

                    double chushu = 0;
                    double lianxuMax = 0;   // best information gain for this attribute
                    // NOTE: the loop bound means the cut just before the last record is not evaluated.
                    for (int j = 0; j < nums.Count - 1; j++)
                    {
                        currentValue = allData[nums[j]][lieshu - 1];
                        currentPoint = allData[nums[j]][i];
                        if (j == 0)
                        {
                            lastValue = currentValue;
                            lastPoint = currentPoint;
                        }
                        // Only evaluate a cut where the class label changes.
                        if (currentValue != lastValue)
                        {
                            double shang1 = CalEntropy(count1, all1);
                            double shang2 = CalEntropy(count2, all2);
                            double allShang = shang1 * all1 / (all1 + all2) + shang2 * all2 / (all1 + all2);
                            allShang = (-totalShang + allShang);
                            if (lianxuMax < allShang)
                            {
                                lianxuMax = allShang;
                                for (int k = 0; k < count1.Length; k++)
                                {
                                    leftCount[k] = count1[k];
                                    rightCount[k] = count2[k];
                                }
                                splitPoint = j;
                                splitValue = (currentPoint + lastPoint) / 2;
                            }
                        }
                        // Move the current record from the right part to the left part.
                        all1++;
                        count1[Convert.ToInt32(currentValue) - 1]++;
                        count2[Convert.ToInt32(currentValue) - 1]--;
                        all2--;
                        lastValue = currentValue;
                        lastPoint = currentPoint;
                    }

                    // Negated split information of the best cut (assumes two classes).
                    // As in the discrete branch, it is computed but not applied.
                    double rate1 = Convert.ToDouble(leftCount[0] + leftCount[1]) / (leftCount[0] + leftCount[1] + rightCount[0] + rightCount[1]);
                    chushu = 0;
                    if (rate1 > 0)
                    {
                        chushu = chushu + rate1 * Math.Log(rate1, 2);
                    }
                    double rate2 = Convert.ToDouble(rightCount[0] + rightCount[1]) / (leftCount[0] + leftCount[1] + rightCount[0] + rightCount[1]);
                    if (rate2 > 0)
                    {
                        chushu = chushu + rate2 * Math.Log(rate2, 2);
                    }

                    if (lianxuMax > jubuMax)
                    {
                        info.splitIndex = i;
                        info.features = new List<String> { splitValue + "" };
                        info.type = 1;
                        jubuMax = lianxuMax;
                        List<int>[] allInt = new List<int>[2];
                        allInt[0] = new List<int>();
                        allInt[1] = new List<int>();
                        for (int k = 0; k < splitPoint; k++)
                        {
                            allInt[0].Add(nums[k]);
                        }
                        for (int k = splitPoint; k < nums.Count; k++)
                        {
                            allInt[1].Add(nums[k]);
                        }
                        info.temp = allInt;
                        double[][] alls = new double[2][];
                        alls[0] = new double[leftCount.Length];
                        alls[1] = new double[leftCount.Length];
                        for (int k = 0; k < leftCount.Length; k++)
                        {
                            alls[0][k] = leftCount[k];
                            alls[1][k] = rightCount[k];
                        }
                        info.class_Count = alls;
                    }
                }
            }

            // If no useful split was found, turn the node into a leaf.
            if (info.splitIndex == -1)
            {
                double[] finalCount = node.ClassCount;
                double max = finalCount[0];
                int result = 1;
                for (int i = 1; i < finalCount.Length; i++)
                {
                    if (finalCount[i] > max)
                    {
                        max = finalCount[i];
                        result = (i + 1);
                    }
                }
                node.feature_Type = "result";
                node.features = new List<String> { "" + result };
                // Record the leaf statistics that pruning reads.
                node.leafWrong = node.rowCount - Convert.ToInt32(max);
                node.leafNode_Count = 1;
                return node;
            }

            // Split the node and recurse into each child.
            int deep = node.deep;
            node.SplitFeature = "" + info.splitIndex;
            List<Node> childNode = new List<Node>();
            int[] used = new int[isUsed.Length];
            for (int i = 0; i < used.Length; i++)
            {
                used[i] = isUsed[i];
            }
            if (info.type == 0)
            {
                used[info.splitIndex] = 1;   // a discrete attribute is used only once per path
                node.feature_Type = "离散";
            }
            else
            {
                used[info.splitIndex] = 0;   // a continuous attribute may be cut again further down
                node.feature_Type = "连续";
            }
            int sumLeaf = 0;
            int sumWrong = 0;
            List<int>[] rowIndex = info.temp;
            List<String> features = info.features;
            for (int j = 0; j < rowIndex.Length; j++)
            {
                if (rowIndex[j].Count == 0)
                {
                    continue;
                }
                if (info.type == 0)
                    features.Add("" + (j + 1));
                Node node1 = new Node();
                node1.setClassCount(info.class_Count[j]);
                node1.deep = deep + 1;
                node1.rowCount = info.temp[j].Count;
                node1 = findBestSplit(node1, info.temp[j], used);
                sumLeaf += node1.leafNode_Count;
                sumWrong += node1.leafWrong;
                childNode.Add(node1);
            }
            node.leafNode_Count = sumLeaf;
            node.leafWrong = sumWrong;
            node.features = features;
            node.childNodes = childNode;
            return node;
        }
        catch (Exception e)
        {
            Console.WriteLine(e.StackTrace);
            return node;
        }
    }

(4) Pruning
Pessimistic error pruning (PEP):
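
    public static void prune(Node node)
    {
        if (node.feature_Type == "result")
            return;
        // Corrected error if this node were replaced by a single leaf.
        double treeWrong = node.getErrorCount() + 0.5;
        // Corrected error of the subtree: leaf errors plus 0.5 per leaf.
        double leafError = node.leafWrong + 0.5 * node.leafNode_Count;
        // Standard deviation of the subtree error under the binomial assumption.
        double var = Math.Sqrt(leafError * (1 - Convert.ToDouble(leafError) / node.rowCount));
        double panbie = leafError + var - treeWrong;
        if (panbie > 0)
        {
            // The single leaf stays below the pessimistic subtree error: prune.
            node.feature_Type = "result";
            node.childNodes = null;
            int result = node.result + 1;
            node.features = new List<String>() { "" + result };
        }
        else
        {
            List<Node> childNodes = node.childNodes;
            for (int i = 0; i < childNodes.Count; i++)
            {
                prune(childNodes[i]);
            }
        }
    }
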
Remaining source code of the C4.5 core algorithm (ifEnd, findBestSplit, CalEntropy and prune are the same as listed above):

    /// <summary>Classifies one record by walking the tree from the given node.</summary>
    public static String findResult(Node node, String[] data)
    {
        List<String> featrues = node.features;
        String type = node.feature_Type;
        if (type == "result")
        {
            return featrues[0];
        }
        int split = Convert.ToInt32(node.SplitFeature);
        List<Node> childNodes = node.childNodes;
        double[] resultCount = node.ClassCount;
        if (type == "连续")
        {
            // Continuous split: left child if the value is at or below the threshold.
            double value = Convert.ToDouble(featrues[0]);
            if (Convert.ToDouble(data[split]) <= value)
            {
                return findResult(childNodes[0], data);
            }
            else
            {
                return findResult(childNodes[1], data);
            }
        }
        else
        {
            // Discrete split: follow the child whose attribute value matches.
            for (int i = 0; i < featrues.Count; i++)
            {
                if (data[split] == featrues[i])
                {
                    return findResult(childNodes[i], data);
                }
            }
            // Unseen attribute value: fall back to the majority class of this node.
            double count = resultCount[0];
            int maxInt = 0;
            for (int j = 1; j < resultCount.Length; j++)
            {
                if (count < resultCount[j])
                {
                    count = resultCount[j];
                    maxInt = j;
                }
            }
            return "" + (maxInt + 1);
        }
    }

    // Insertion sort on values[StartIndex..endIndex]; arr (row indices) is permuted in step.
    public static void InsertSort(double[] values, List<int> arr, int StartIndex, int endIndex)
    {
        for (int i = StartIndex + 1; i <= endIndex; i++)
        {
            int key = arr[i];
            double init = values[i];
            int j = i - 1;
            while (j >= StartIndex && values[j] > init)
            {
                arr[j + 1] = arr[j];
                values[j + 1] = values[j];
                j--;
            }
            arr[j + 1] = key;
            values[j + 1] = init;
        }
    }

    // Median-of-three pivot selection: leaves the median of low/mid/high at index low.
    static int SelectPivotMedianOfThree(double[] values, List<int> arr, int low, int high)
    {
        int mid = low + ((high - low) >> 1);   // index of the middle element
        if (values[mid] > values[high])        // ensure values[mid] <= values[high]
        {
            swap(values, arr, mid, high);
        }
        if (values[low] > values[high])        // ensure values[low] <= values[high]
        {
            swap(values, arr, low, high);
        }
        if (values[mid] > values[low])         // ensure values[mid] <= values[low]
        {
            swap(values, arr, mid, low);
        }
        // Now values[mid] <= values[low] <= values[high], so low holds the median
        // and the partition step can use it as the pivot directly.
        return low;
    }

    static void swap(double[] values, List<int> arr, int t1, int t2)
    {
        double temp = values[t1];
        values[t1] = values[t2];
        values[t2] = temp;
        int key = arr[t1];
        arr[t1] = arr[t2];
        arr[t2] = key;
    }

    // Quicksort with median-of-three pivot, insertion sort for short ranges,
    // and gathering of elements equal to the pivot.
    static void QSort(double[] values, List<int> arr, int low, int high)
    {
        int first = low;
        int last = high;
        int left = low;
        int right = high;
        int leftLen = 0;
        int rightLen = 0;
        if (high - low + 1 < 10)
        {
            InsertSort(values, arr, low, high);
            return;
        }

        // One partition pass.
        int key = SelectPivotMedianOfThree(values, arr, low, high);
        double inti = values[key];    // pivot value
        int currentKey = arr[key];    // row index stored with the pivot
        while (low < high)
        {
            while (high > low && values[high] >= inti)
            {
                if (values[high] == inti)   // park elements equal to the pivot at the right end
                {
                    swap(values, arr, right, high);
                    right--;
                    rightLen++;
                }
                high--;
            }
            arr[low] = arr[high];
            values[low] = values[high];
            while (high > low && values[low] <= inti)
            {
                if (values[low] == inti)    // park elements equal to the pivot at the left end
                {
                    swap(values, arr, left, low);
                    left++;
                    leftLen++;
                }
                low++;
            }
            arr[high] = arr[low];
            values[high] = values[low];
        }
        arr[low] = currentKey;
        values[low] = inti;   // put the pivot value into its final position

        // Move the elements equal to the pivot next to its final position.
        int i = low - 1;
        int j = first;
        while (j < left && values[i] != inti)
        {
            swap(values, arr, i, j);
            i--;
            j++;
        }
        i = low + 1;
        j = last;
        while (j > right && values[i] != inti)
        {
            swap(values, arr, i, j);
            i++;
            j--;
        }
        QSort(values, arr, first, low - 1 - leftLen);
        QSort(values, arr, low + 1 + rightLen, last);
    }

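Summary
C4.5 is one of the most important classification-tree algorithms. The idea behind it is simple, yet its classification accuracy is high. C4.5 can be seen as an upgraded and strengthened version of ID3 that addresses the problems ID3 left open. The key points to remember are:
1. C4.5 chooses the splitting attribute by information gain ratio, which removes the bias ID3 shows when selecting attributes;
2. C4.5 can handle continuous data: it cuts a continuous attribute in two at a single threshold and uses information gain to choose the cut point;
3. C4.5 uses pessimistic pruning, which reduces the effect of overfitting to some extent.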