Nearest Neighbor Algorithm

    example  a1  a2  a3  class
    1        1   3   1   yes
    2        3   5   2   yes
    3        3   2   2   no
    4        5   2   3   no

    What will be the prediction of the Nearest Neighbor algorithm using the Euclidean distance for the following new example: a1=2, a2=4, a3=2?
    Ans:
    D(new, ex1) = sqrt((2-1)^2 + (4-3)^2 + (2-1)^2) = sqrt(3)  ≈ 1.73   (class: yes)
    D(new, ex2) = sqrt((2-3)^2 + (4-5)^2 + (2-2)^2) = sqrt(2)  ≈ 1.41   (class: yes)
    D(new, ex3) = sqrt((2-3)^2 + (4-2)^2 + (2-2)^2) = sqrt(5)  ≈ 2.24   (class: no)
    D(new, ex4) = sqrt((2-5)^2 + (4-2)^2 + (2-3)^2) = sqrt(14) ≈ 3.74   (class: no)
    ∵ D(new, ex2) is the smallest, so ex2 is the closest example.
    ∴ The prediction for the new example is yes according to the 1NN algorithm.
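    The same computation can be reproduced in a few lines of Python; a minimal sketch (the data comes from the table above, the function and variable names are illustrative):

    import math

    # training examples from the table above: (a1, a2, a3) -> class
    train = [
        ((1, 3, 1), "yes"),
        ((3, 5, 2), "yes"),
        ((3, 2, 2), "no"),
        ((5, 2, 3), "no"),
    ]

    def euclidean(x, y):
        """Euclidean distance between two attribute vectors."""
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    new = (2, 4, 2)

    # 1NN: predict the class of the single closest training example
    nearest_attrs, nearest_class = min(train, key=lambda ex: euclidean(ex[0], new))
    print(nearest_attrs, nearest_class)   # (3, 5, 2) yes  -> prediction is "yes"
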
    Computational complexity

    • Training is fast – no model is built; just storing training examples
    • Classification (prediction for a new example)
      • Compare each unseen example with each training example
      • With m training examples of dimensionality n, the lookup for 1 unseen example takes m*n computations, i.e. O(mn)
    • Memory requirements – you need to remember each training example, O(mn)
    • Impractical for large data due to time and memory requirements
    • However, there are variations that allow finding the nearest neighbours more efficiently
      • e.g. by using more efficient data structures such as KD-trees and ball trees, see Witten p.136-141
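    As a sketch of that point, scikit-learn (assuming it is installed) can build such structures for you; the snippet below asks for a KD-tree explicitly instead of the brute-force O(mn) scan.

    from sklearn.neighbors import KNeighborsClassifier

    X = [[1, 3, 1], [3, 5, 2], [3, 2, 2], [5, 2, 3]]
    y = ["yes", "yes", "no", "no"]

    # algorithm="kd_tree" builds a KD-tree index over the training examples
    clf = KNeighborsClassifier(n_neighbors=1, algorithm="kd_tree")
    clf.fit(X, y)
    print(clf.predict([[2, 4, 2]]))   # ['yes']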

    KNN
    • K-Nearest Neighbor is very sensitive to the value of k
    • rule of thumb: k <= sqrt(#training_examples)
    • commercial packages typically use k=10
    • Using more nearest neighbors increases the robustness to noisy examples
    • K-Nearest Neighbor can be used not only for classification, but also for regression
    • The prediction will be the average value of the (numerical) class values of the k nearest neighbors
    Following the earlier example, according to the 3NN algorithm the 3 nearest neighbors are ex1 (yes), ex2 (yes), and ex3 (no); the majority class is yes. ∴ 3-Nearest Neighbor predicts class = yes.
    KNN with nominal attributes: if the two attribute values are different, the squared difference is 1; otherwise, it is 0.
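    The sketch below illustrates these points on the toy dataset: a per-attribute difference that treats nominal values as 0/1, majority voting for classification, and averaging for regression (function names are illustrative):

    import math
    from collections import Counter

    train = [((1, 3, 1), "yes"), ((3, 5, 2), "yes"), ((3, 2, 2), "no"), ((5, 2, 3), "no")]

    def diff_sq(a, b):
        """Squared difference for one attribute: numeric -> (a-b)^2,
        nominal -> 1 if the values differ, 0 otherwise."""
        if isinstance(a, (int, float)) and isinstance(b, (int, float)):
            return (a - b) ** 2
        return 0 if a == b else 1

    def distance(x, y):
        return math.sqrt(sum(diff_sq(a, b) for a, b in zip(x, y)))

    def knn_predict(query, k=3):
        neighbours = sorted(train, key=lambda ex: distance(ex[0], query))[:k]
        labels = [cls for _, cls in neighbours]
        if all(isinstance(l, (int, float)) for l in labels):
            return sum(labels) / k                      # regression: average of the k class values
        return Counter(labels).most_common(1)[0][0]     # classification: majority vote

    print(knn_predict((2, 4, 2), k=3))   # neighbours ex2 (yes), ex1 (yes), ex3 (no) -> "yes"
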
    Weighted nearest neighbor:
    Idea: Closer neighbors should count more than distant neighbors.
    • Distance-weighted nearest-neighbor algorithm
    • Find the k nearest neighbors
    • Weight their contribution based on their distance to the new example
    • bigger weight if they are closer
    • smaller weight if they are further
    • e.g. the vote can be weighted according to the distance – weight
    w = 1/d²
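    A sketch of the distance-weighted vote with w = 1/d², reusing the toy dataset (the small constant added to d² is my own guard against division by zero when the query coincides with a training example):

    import math
    from collections import defaultdict

    train = [((1, 3, 1), "yes"), ((3, 5, 2), "yes"), ((3, 2, 2), "no"), ((5, 2, 3), "no")]

    def euclidean(x, y):
        return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

    def weighted_knn(query, k=3):
        neighbours = sorted(train, key=lambda ex: euclidean(ex[0], query))[:k]
        votes = defaultdict(float)
        for x, cls in neighbours:
            d2 = euclidean(x, query) ** 2
            votes[cls] += 1.0 / (d2 + 1e-9)   # weight w = 1/d^2: closer neighbours count more
        return max(votes, key=votes.get)

    print(weighted_knn((2, 4, 2), k=3))   # yes: 1/2 + 1/3  vs  no: 1/5  -> "yes"
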

    Decision boundary of 1-nearest neighbor
    • Nearest neighbor classifiers produce decision boundaries with an
    arbitrary shape
    • The 1-nearest neighbor boundary is formed by the edges of the Voronoi diagram that separate the
    points of the two classes
    • Voronoi region:
    • Each training example has an associated Voronoi region; it contains the
    data points for which this is the closest example
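    One way to see this boundary is to classify every point of a grid with 1NN: the predicted label changes exactly along the Voronoi edges between the two classes. A small 2-D sketch with made-up points (not from the lecture data):

    import math

    # two tiny 2-D classes (illustrative points)
    train = [((1, 1), "A"), ((2, 4), "A"), ((5, 2), "B"), ((6, 5), "B")]

    def nn_label(p):
        return min(train, key=lambda ex: math.dist(ex[0], p))[1]

    # print a coarse grid of labels; the character changes along the 1NN decision boundary
    for yv in range(6, -1, -1):
        print("".join(nn_label((xv, yv)) for xv in range(0, 8)))
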
    KNN - discussion
    • Often very accurate
    • Has been used in statistics since the 1950s
    • Slow for big datasets
    • requires efficient data structures such as KD-trees
    • Distance-based - requires normalization of the attributes (see the sketch after this list)
    • Produces arbitrarily shaped decision boundary defined by a subset of the Voronoi edges
    • Not effective for high-dimensional data (data with many features)
    • The notion of “nearness” becomes less meaningful in high-dimensional data
    • Solution – dimensionality reduction and feature selection
    • Sensitive to the value of k
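    The normalization mentioned above prevents attributes on large scales from dominating the distance; a minimal min-max sketch that rescales each column to [0, 1] (assuming purely numeric data):

    def min_max_normalize(rows):
        """Rescale each column of a list of numeric rows to the range [0, 1]."""
        cols = list(zip(*rows))
        lo = [min(c) for c in cols]
        hi = [max(c) for c in cols]
        return [[(v - l) / (h - l) if h > l else 0.0
                 for v, l, h in zip(row, lo, hi)]
                for row in rows]

    print(min_max_normalize([[1, 3, 1], [3, 5, 2], [3, 2, 2], [5, 2, 3]]))
    # [[0.0, 0.333..., 0.0], [0.5, 1.0, 0.5], [0.5, 0.0, 0.5], [1.0, 0.0, 1.0]]
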
    Rule-based Algorithms

    • 1R
      • 1R stands for “1-rule”
      • Generates 1 rule that tests the value of a single attribute
        • e.g. a rule that tests the value of outlook

    if outlook=sunny then class=no
    elseif outlook=overcast then class=yes
    elseif outlook=rainy then class=yes

    • The rule can be represented as a 1-level decision tree
      • At the root, test the attribute value
      • Each branch corresponds to a value
      • Each leaf corresponds to a class
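    A compact sketch of 1R for nominal attributes (the outlook/windy rows below are illustrative, not the full weather dataset): for every attribute, predict the majority class of each value, count the errors, and keep the attribute with the fewest errors.

    from collections import Counter, defaultdict

    def one_r(examples, attributes, class_index):
        """1R: build a one-level rule per attribute and keep the most accurate one."""
        best = None
        for a in attributes:
            counts = defaultdict(Counter)            # attribute value -> class counts
            for row in examples:
                counts[row[a]][row[class_index]] += 1
            rule = {v: c.most_common(1)[0][0] for v, c in counts.items()}
            errors = sum(sum(c.values()) - c.most_common(1)[0][1] for c in counts.values())
            if best is None or errors < best[2]:
                best = (a, rule, errors)
        return best

    # illustrative rows: (outlook, windy, class)
    data = [("sunny", "false", "no"), ("sunny", "true", "no"),
            ("overcast", "false", "yes"), ("rainy", "false", "yes"),
            ("rainy", "true", "yes")]
    print(one_r(data, attributes=[0, 1], class_index=2))
    # (0, {'sunny': 'no', 'overcast': 'yes', 'rainy': 'yes'}, 0) -> the outlook rule above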

    1R discussion:
    • Simple and computationally efficient algorithm
    • Produces rules that are easy to understand
    • In contrast to “black-box” models such as neural networks
    • Sometimes produces surprisingly good results
    • It is useful to try the simple algorithms first and compare their performance with more sophisticated algorithms
    • Numerical datasets require discretization
    • 1R has an in-built procedure to do this
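    1R's built-in discretization essentially sorts the numeric values and places a breakpoint where the majority class changes, subject to a minimum bucket size; the sketch below is a simplified version of that idea (the values and the min_bucket parameter are illustrative, not the exact textbook procedure):

    def simple_discretize(values, classes, min_bucket=3):
        """Start a new interval when the class changes, but only after the current
        interval holds at least min_bucket examples (a simplified 1R-style rule)."""
        pairs = sorted(zip(values, classes))
        breakpoints = []
        count, prev_class = 0, pairs[0][1]
        for value, cls in pairs:
            if cls != prev_class and count >= min_bucket:
                breakpoints.append(value)
                count = 0
            prev_class = cls
            count += 1
        return breakpoints

    temps   = [64, 65, 68, 69, 70, 71, 72, 75, 80, 81, 83, 85]
    classes = ["yes", "no", "yes", "yes", "yes", "no", "no", "yes", "no", "yes", "yes", "no"]
    print(simple_discretize(temps, classes))   # [71, 80, 85] for these illustrative values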

    • PRISM (Rule-based Algorithm)
      • PRISM is an example of a rule-based covering algorithm
      • This type of algorithm:
        • consider each class in turn and
        • construct a set of if-then rules that cover all examples from this class and do not cover any examples from the other classes.
      • Called covering algorithms because:
        • At each stage of the algorithm, a rule is identified that “covers” some of the examples
      • The if-part of the rule:
        • contains conditions that test the value of an attribute
        • initially empty, tests are added incrementally in order to create a rule with maximum accuracy

    if ? then class a
    if x1>2 then class a
    if x1>2 & x2<5 then class a
    if x1>2 & x2<5 & x3<3 then class a

    • Idea: Generate a rule by adding tests that maximize the rule’s accuracy
    • For each class:
      • Start with an empty rule and gradually add conditions
      • Each condition tests the value of 1 attribute
      • By adding a new test, the rule coverage is reduced (i.e. the rule becomes more specific)
    • Finish when p/t = 1 (i.e. the rule covers only examples from the same class, i.e. is “perfect”)
    • Which test to add at each step? The one that maximizes accuracy p/t:
      • t: total number of examples (from all classes) covered by the rule (t comes from total)
      • p: examples from the class under consideration, covered by the rule (p comes from positive)
      • t-p: number of errors made by the rule
      • Select the test that maximises the accuracy p/t
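    A sketch of growing one PRISM rule for a single class (the helper name grow_rule is mine): keep adding the attribute = value test with the highest p/t, breaking ties by the larger p (i.e. greater coverage), until the rule is perfect.

    def grow_rule(examples, attributes, class_index, target_class):
        """Grow one PRISM rule: repeatedly add the test with the highest accuracy p/t
        (ties broken by greater coverage) until the rule covers only target_class."""
        conditions = {}                # attribute index -> required value
        covered = examples
        while True:
            p = sum(1 for row in covered if row[class_index] == target_class)
            if p == len(covered):      # p/t = 1, the rule is "perfect"
                return conditions, covered
            best = None                # (p/t, p, attribute, value)
            for a in attributes:
                if a in conditions:
                    continue
                for v in {row[a] for row in covered}:
                    subset = [row for row in covered if row[a] == v]
                    pv = sum(1 for row in subset if row[class_index] == target_class)
                    cand = (pv / len(subset), pv, a, v)
                    if best is None or cand > best:
                        best = cand
            _, _, a, v = best
            conditions[a] = v
            covered = [row for row in covered if row[a] == v]

    # tiny illustrative data: (x1, x2, class)
    data = [("a", "p", "yes"), ("a", "q", "yes"), ("a", "q", "no"), ("b", "p", "no")]
    print(grow_rule(data, attributes=[0, 1], class_index=2, target_class="yes"))
    # ({0: 'a', 1: 'p'}, [('a', 'p', 'yes')])  i.e. if x1=a and x2=p then class = yes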

    Example:
    [Table: the contact lenses dataset used in this example]

    1. Start with an empty rule: if ? then recommendation = hard
    2. [Tables: the accuracy p/t of each candidate test]

    3. Best test (highest accuracy): astigmatism = yes

    Note that there is a tie: both astigmatism = yes and tear production rate = normal have the same accuracy 4/12; the tie is broken arbitrarily and the first one is chosen

    4. Current rule: if astigmatism = yes then recommendation = hard
    5. Not “perfect” - covers 12 examples but only 4 of them are from class hard => refinement is needed

    [Table: the 12 examples covered by the current rule]

    6. Further refinement by adding tests: if astigmatism = yes and ? then recommendation = hard
    7. [Tables: the accuracy p/t of each candidate test, on the 12 covered examples]

    8. Best test: tear production rate = normal
    9. Current rule: if astigmatism = yes and tear production rate = normal then recommendation = hard

    [Table: the examples covered by the current rule]

    10. The rule is again not “perfect” – 2 of the covered examples belong to class none => further refinement is needed

    11. Further refinement: if astigmatism = yes and tear production rate = normal and ? then recommendation = hard

    12. [Tables: the accuracy p/t of each candidate test, on the remaining covered examples]

    13. Best test: tie between the 1st and 4th candidate; choose the one with the greater coverage (the 4th)
    14. New rule: if astigmatism = yes & tear production rate = normal & spectacle prescription = myope then recommendation = hard
