Attribute selection methods are also called splitting rules.

    • Techniques to choose a splitting criterion consisting of a splitting attribute and either a split point or a splitting subset
    • Aim to make the partitions at each branch as pure as possible, i.e. all examples at each sub-node belong to the same class (illustrated by the entropy sketch below).
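
    As a minimal illustration of purity, the sketch below (the `entropy` helper and the toy label lists are ours, not from the source) computes the entropy of a partition's class labels: 0.0 means every example belongs to the same class, i.e. the partition is pure.

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a list of class labels: 0.0 means the partition is pure."""
    n = len(labels)
    # (c/n) * log2(n/c) is the standard -p * log2(p) term, written to stay non-negative.
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

print(entropy(["yes", "yes", "yes"]))       # 0.0 -> pure partition
print(entropy(["yes", "no", "yes", "no"]))  # 1.0 -> maximally impure for two classes
```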

    Example
    This figure shows three possibilities for partitioning tuples based on the splitting criterion, each with examples. Let A be the splitting attribute. (a) If A is nominal-valued, then one branch is grown for each known value of A. (b) If A is continuous-valued or ordinal, then two branches are grown, corresponding to A ≤ split_point and A > split_point. (c) If A is nominal and a binary tree must be produced, then the test is of the form A ∈ S, where S is the splitting subset for A.

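    Read as predicates, the three test forms look like the following minimal sketch (the attribute values, split point, and subset S here are hypothetical, chosen only for illustration):

```python
# (a) Nominal A: multiway split, one branch per known value of A.
branches = {value: [] for value in ("red", "green", "blue")}  # hypothetical values of A

# (b) Continuous or ordinal A: binary split on a split point.
split_point = 30  # hypothetical split point
def test_continuous(a):
    return a <= split_point  # True -> "A <= split_point" branch, False -> "A > split_point"

# (c) Nominal A forced into a binary tree: membership test against a splitting subset S.
S = {"red", "green"}  # hypothetical splitting subset for A
def test_subset(a):
    return a in S  # True -> "A in S" branch, False -> "A not in S"
```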

    Heuristics (or attribute selection measures) are used to choose the best splitting criterion.

    • Information gain, gain ratio, and the Gini index are the most popular; a computational sketch of all three follows this list.
    • Information gain:
      • biased towards multivalued attributes
    • Gain ratio:
      • tends to prefer unbalanced splits in which one partition is much smaller than the others
    • Gini index:
      • biased towards multivalued attributes
      • has difficulty when the number of classes is large
      • tends to favour tests that result in equal-sized partitions with purity in both partitions
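
    To make the three measures concrete, here is a minimal Python sketch (the function names and the toy labels are ours, not from the source) implementing the standard textbook definitions: information gain and gain ratio are maximised, while the Gini index of a split is minimised.

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return sum((c / n) * log2(n / c) for c in Counter(labels).values())

def gini(labels):
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def info_gain(parent, partitions):
    """Gain(A) = Info(D) - sum_j (|D_j|/|D|) * Info(D_j)."""
    n = len(parent)
    remainder = sum(len(p) / n * entropy(p) for p in partitions)
    return entropy(parent) - remainder

def gain_ratio(parent, partitions):
    """GainRatio(A) = Gain(A) / SplitInfo(A); SplitInfo penalises many-valued splits."""
    n = len(parent)
    split_info = sum((len(p) / n) * log2(n / len(p)) for p in partitions)
    return info_gain(parent, partitions) / split_info if split_info else 0.0

def gini_index(parent, partitions):
    """Gini_A(D) = sum_j (|D_j|/|D|) * Gini(D_j); lower is better."""
    n = len(parent)
    return sum(len(p) / n * gini(p) for p in partitions)

# Toy split: 10 examples partitioned into two branches (labels are hypothetical).
parent = ["yes"] * 6 + ["no"] * 4
partitions = [["yes"] * 5 + ["no"] * 1, ["yes"] * 1 + ["no"] * 3]

print(info_gain(parent, partitions))   # higher is better
print(gain_ratio(parent, partitions))  # higher is better
print(gini_index(parent, partitions))  # lower is better (impurity after the split)
```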