Density-based outlier detection aims to detect local outliers, that is data points that are outliers compared to their local neighbourhoods, as well as global outliers that differ from the global data distribution.image.png
In the diagram, C1 is a dense cluster and C2 is sparse. o1 and o2 are local outliers to C1, o3 is a global outlier that can be detected by a distance-based method. 在图中,C1是一个密集的星团,而C2是稀疏的。o1和o2是C1的局部异常值,o3是可以通过基于距离的方法检测到的全局异常值。
Proximity-based clustering cannot find that o1 and o2 are outliers (as they are closer to objects in C1 than the average distance apart of the objects in C2. o4 is not an outlier because the density of objects nearby is also low. 基于邻近度的聚类无法发现o1和o2是异常值(因为它们比C2物体的平均距离更接近C1物体。o4不是异常值,因为附近物体的密度也很低。
For local outlier detection we want to capture the idea that the density of objects near an outlier is significantly lower than the density of objects near its neighbouring objects.

Method:

Use the relative density of an object against its neighbours as the indicator of the degree of the object being outliers. 使用对象相对于其邻居的相对密度作为对象异常程度的指标。

First we identify the distance from each object that defines a neighbourhood, by choosing a user-defined Density-based Outlier Detection - 图2 and calculating the distance to the object’s kth nearest neighbour.
Define
k-distance** of an object o,
Density-based Outlier Detection - 图3 where Density-based Outlier Detection - 图4 is an object such that

  • There are at least Density-based Outlier Detection - 图5 objects Density-based Outlier Detection - 图6 such that Density-based Outlier Detection - 图7, and
  • There are at most Density-based Outlier Detection - 图8 objects Density-based Outlier Detection - 图9 such that Density-based Outlier Detection - 图10.

That is, Density-based Outlier Detection - 图11 is the distance between o and its k-th nearest neighbour (which is p in the definition above).
distk 是第k近的邻居到对象o的距离
Now, the k-distance neighbourhood of o is the set of objects that are closer than (or as close as) the k-distance of o. That is,

Density-based Outlier Detection - 图12
While Density-based Outlier Detection - 图13 is usually of size Density-based Outlier Detection - 图14, it could hold more than Density-based Outlier Detection - 图15 objects since multiple objects may be same distance from o. 虽然其包含的邻居数量通常为k, 但也会超过k个对象,因为有可能存在不同的对象离对象o有相同的距离。

Now we have a set of objects that are in the neighbourhod of o, but we need to translate that to a notion of density around o.我们现在有了对象o周围的一组对象,但是我们需要把这些转化为对象o的密度。For this, we start with asymmetric reachability distance amongst pairs of objects, from Density-based Outlier Detection - 图16 to Density-based Outlier Detection - 图17:
Density-based Outlier Detection - 图18
Density
So the density function for an object becomes the local reachability density, defined as

Density-based Outlier Detection - 图19

See how this is the number of objects in the k-distance neighbourhood of o per unit of space. That space is the sum of the reachability distances from o to each of those objects. Let one of those objects be o’. The reachability distance here will often be simply the distance from the object o’ to o. _However, where it is higher, the k-distance neigbourhood of that other object _o’ will be used instead of the simple distance between the two. Such an object o’ falls inside the k-distance neighbourhood of o, but it has a bigger sparser neighbourhood itself. The effect of using the k-distance neigbourhood of_ o’ _here, then, is to decrease the density otherwise attributed to o to account for neighbour objects like o’ that are themselves more locally sparsely-packed.
这是每单位空间中o的k距离邻域内的物体数。该空间是从o到每个对象的可达性距离之和。让那些物体中的一个成为o。这里的可达性距离通常是从物体o’到o’的简单距离。然而,在可达性距离较高的地方,将使用另一个物体o’的k距离邻域,而不是两者之间的简单距离。这样一个物体o ‘落在o的k距离邻域内,但它本身有一个更大更稀疏的邻域。因此,在这里使用o ‘的k-距离邻域的效果是,减少了原本归于o ‘的密度,以考虑到像o ‘这样本身在局部更稀疏的邻近物体。

Local outlier factor
Now we compare the density of an object o to the density of its neighbours. The local outlier factor Density-based Outlier Detection - 图20 is defined as:

Density-based Outlier Detection - 图21

That is, the local outlier factor is the ratio of the average local reachability density of the k-nearest neighbours of o to the local reachability density of o itself.
N.B. The text page 566 has errors in the LOF formula 12.14 and its textual interpretation below the the fomula. These errors are corrected here.
**

Properties

可以使用用户定义的阈值来选择LOF最高的离群值。

ACTION: Do this exercise to work through a tiny example on paper