The Gini index is used in the CART and IBM IntelligentMiner decision tree learners
- All attributes are assumed nominal
If a data set $D$ contains examples from $n$ classes, the gini index, $gini(D)$, measures the impurity of $D$ and is defined as

$$gini(D) = 1 - \sum_{j=1}^{n} p_j^2$$

where $p_j$ is the relative frequency of class $j$ in $D$.
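As a concrete illustration, here is a minimal Python sketch of this formula; the function name `gini` and the list-of-labels input format are my own choices, not from the slides:

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a data set D, given the list of its class labels:
    gini(D) = 1 - sum_j p_j^2, where p_j is the relative frequency
    of class j in D."""
    n = len(labels)
    return 1 - sum((count / n) ** 2 for count in Counter(labels).values())

# A pure node has impurity 0; an even two-class split has impurity 0.5.
print(gini(["yes"] * 10))              # 0.0
print(gini(["yes"] * 5 + ["no"] * 5))  # 0.5
```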
If a data set $D$ is split on attribute $A$ into two subsets $D_1$ and $D_2$, the gini index $gini_A(D)$ is defined as the size-weighted sum of the impurity of each partition:

$$gini_A(D) = \frac{|D_1|}{|D|}\, gini(D_1) + \frac{|D_2|}{|D|}\, gini(D_2)$$
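Continuing the sketch (and reusing the assumed `gini` helper above), the size-weighted sum for a binary split might look like:

```python
def gini_split(d1_labels, d2_labels):
    """gini_A(D) for a split of D into D1 and D2:
    (|D1|/|D|) * gini(D1) + (|D2|/|D|) * gini(D2)."""
    n = len(d1_labels) + len(d2_labels)
    return (len(d1_labels) / n * gini(d1_labels)
            + len(d2_labels) / n * gini(d2_labels))
```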
To split a node in the tree:
- Enumerate all the possible ways of splitting on each attribute
- The attribute split that provides the smallest $gini_A(D)$ (i.e. the greatest purity) is chosen to split the node, as in the sketch below
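A hedged sketch of this search for one nominal attribute, assuming the `gini_split` helper above and rows represented as plain dicts (both assumptions of mine, not the slides'): every non-empty proper subset of the attribute's values, taken up to complement, defines one candidate binary split.

```python
from itertools import combinations

def best_binary_split(rows, attr, label_key):
    """Enumerate all binary partitions of attr's values and return
    (weighted gini, value subset for D1) with the smallest gini,
    i.e. the greatest purity."""
    values = sorted({row[attr] for row in rows})
    best = None
    for r in range(1, len(values)):
        for subset in combinations(values, r):
            if values[0] not in subset:
                continue  # pin the first value so each split is tried once
            left = [row[label_key] for row in rows if row[attr] in subset]
            right = [row[label_key] for row in rows if row[attr] not in subset]
            g = gini_split(left, right)
            if best is None or g < best[0]:
                best = (g, set(subset))
    return best
```

For an attribute with $v$ distinct values this considers $2^{v-1} - 1$ candidate binary splits.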
Example (continued from previous)
$D$ has 9 tuples in class buys_computer = “yes” and 5 in “no”. Then

$$gini(D) = 1 - \left(\frac{9}{14}\right)^2 - \left(\frac{5}{14}\right)^2 = 0.459$$
Now consider the attribute income. Partition $D$ into 10 objects in $D_1$ with income in {low, medium} and 4 objects in $D_2$ with income in {high}.
We have

$$gini_{income \in \{low, medium\}}(D) = \frac{10}{14}\, gini(D_1) + \frac{4}{14}\, gini(D_2) = 0.443$$
Similarly, $gini_{income \in \{low, high\}}(D)$ is 0.458 and $gini_{income \in \{medium, high\}}(D)$ is 0.450.
Thus, we split on income $\in$ {low, medium} (with the other partition being {high}), since this split has the lowest Gini index.
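The arithmetic above can be checked with the helpers sketched earlier. Note that the per-class counts inside the partitions (7 “yes”/3 “no” in $D_1$, 2 “yes”/2 “no” in $D_2$) are my assumption from the standard AllElectronics example, not stated on this slide:

```python
d = ["yes"] * 9 + ["no"] * 5
print(round(gini(d), 3))             # 0.459

# Assumed class counts: D1 (income in {low, medium}) has 7 "yes" / 3 "no";
# D2 (income in {high}) has 2 "yes" / 2 "no".
d1 = ["yes"] * 7 + ["no"] * 3
d2 = ["yes"] * 2 + ["no"] * 2
print(round(gini_split(d1, d2), 3))  # 0.443
```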
When attributes are continuous or ordinal, the method for selecting the midpoint between each pair of adjacent values (described earlier) may be used.
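As a sketch of that idea, the candidate thresholds are the midpoints between adjacent distinct sorted values; each threshold $t$ then induces a binary split $D_1 = \{v \le t\}$, $D_2 = \{v > t\}$ that can be scored with `gini_split` as above:

```python
def candidate_split_points(values):
    """Midpoints between each pair of adjacent distinct sorted values."""
    v = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(v, v[1:])]

print(candidate_split_points([60, 70, 75, 85]))  # [65.0, 72.5, 80.0]
```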