Data cleaning - 19. Record linkage evaluation - Machine Learning

Evaluating the record linkage process

● Different techniques are available for each step of the record linkage process (cleaning and standardisation, blocking, comparison, and classification)
● When employing a record linkage system, one wants to get the best possible results within operational constraints (linkage time, computational resources, minimum linkage quality, available software and human expertise, etc.)
● Measures are required to evaluate the two main aspects of record linkage
– Linkage quality (effectiveness)
– Linkage complexity (efficiency)

Measuring linkage quality

● Achieving high linkage quality is a main goal of most record linkage projects / applications

● Ground truth data are needed to measure linkage quality
– A set of true matching record pairs
– A set of true non-matching record pairs
● How to obtain such ground truth data?
– Results of a previous linkage
– Manual clerical review (more in the next lecture) or manually classified (sampled) record pairs
– Contact all individuals in the databases and ask them?
● How confident can one be that such ground truth data are always correct?

● Various difficulties with manually prepared ground truth data
– It is easy to classify record pairs that have totally different attribute values as non-matches
– It is (generally) easy to classify record pairs that are very similar as matches (but what about twins, or if not enough information is available?)
– It is difficult to classify record pairs where some attribute values are the same/similar while others are different
– Studies have shown that manual classification of record pairs is never 100% correct
– Domain expertise is often required (such as knowledge about names and their origins / cultures)
– If randomly sampled, most record pairs will be non-matches

Measuring linkage quality with ground truth

● Assuming ground truth data are available, the classification of record pairs into matches and non-matches has four possible outcomes:
– True positives (TP): True matches correctly classified as matches (correct matches)
– False negatives (FN): True matches incorrectly classified as non-matches (false non-matches)
– True negatives (TN): True non-matches correctly classified as non-matches (correct non-matches)
– False positives (FP): True non-matches incorrectly classified as matches (false matches)

● Due to the quadratic comparison space, the number of true non-matches is usually much larger than the number of true matches
– As the number of records in the databases to be linked increases, the number of true matches increases linearly while the number of possible record pairs increases quadratically
● This holds even after blocking
● Question: Assuming no duplicates in two databases with 1 and 5 million records, respectively, what is the maximum number of true matches between these two databases?
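
One way to work through the question above, as a minimal sketch: if neither database contains duplicates, each record in the smaller database can match at most one record in the larger one, so the number of true matches is bounded by the size of the smaller database. The function name `comparison_space` is illustrative and not part of the lecture.

```python
def comparison_space(size_a: int, size_b: int) -> dict:
    """Pair counts for linking two duplicate-free databases.

    Assumes one-to-one matching, which follows from the
    "no duplicates" condition stated in the question.
    """
    total_pairs = size_a * size_b        # full pair-wise comparison space
    max_matches = min(size_a, size_b)    # each record matches at most once
    return {
        "total_pairs": total_pairs,
        "max_matches": max_matches,
        "min_non_matches": total_pairs - max_matches,
    }


sizes = comparison_space(1_000_000, 5_000_000)
print(sizes["total_pairs"])   # 5000000000000
print(sizes["max_matches"])   # 1000000
```

Even in the best case, true matches make up only one in five million of all possible record pairs here, which illustrates the class imbalance described above.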

Error or confusion matrix

● Based on the values in the four cells of the error/confusion matrix, different linkage quality measures can be defined
● These measures are binary classification measures as also used in other domains
– Machine learning and data mining
– Information retrieval (Web search)
– Medical tests
– Security (airport screening), etc.
● There is often a trade-off between the number of false positives and false negatives (as one goes down the other goes up)
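
The four cells of the matrix can be counted directly when record pairs are held as sets. The following is a minimal sketch under that assumption; the function `confusion_counts` and the toy record identifiers are hypothetical.

```python
from itertools import product


def confusion_counts(all_pairs, true_matches, classified_matches):
    """Return (TP, FP, FN, TN) for a set-based representation of record pairs."""
    tp = len(true_matches & classified_matches)   # correct matches
    fp = len(classified_matches - true_matches)   # false matches
    fn = len(true_matches - classified_matches)   # false non-matches
    tn = len(all_pairs) - tp - fp - fn            # correct non-matches
    return tp, fp, fn, tn


# Toy data: 3 x 4 = 12 possible record pairs (made up for illustration).
ids_a = ["a1", "a2", "a3"]
ids_b = ["b1", "b2", "b3", "b4"]
all_pairs = set(product(ids_a, ids_b))
true_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}
classified_matches = {("a1", "b1"), ("a2", "b2"), ("a2", "b4")}

tp, fp, fn, tn = confusion_counts(all_pairs, true_matches, classified_matches)
print(tp, fp, fn, tn)   # 2 1 1 8
```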

Accuracy

● Widely used in machine learning and data mining
● Considers both true positives and true negatives
acc = (TP + TN) / (TP + FP + FN + TN)
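
Because true negatives usually dominate in record linkage, accuracy can look high even for a useless classifier. The sketch below uses made-up counts (1,000 true matches among 1,000,000 record pairs) and a hypothetical classifier that labels every pair as a non-match.

```python
def accuracy(tp: int, fp: int, fn: int, tn: int) -> float:
    """acc = (TP + TN) / (TP + FP + FN + TN)"""
    return (tp + tn) / (tp + fp + fn + tn)


# Hypothetical counts: every pair is classified as a non-match,
# so all 1,000 true matches are missed.
acc_all_non_match = accuracy(tp=0, fp=0, fn=1_000, tn=999_000)
print(acc_all_non_match)   # 0.999
```

Under these assumed counts the classifier finds no matches at all, yet its accuracy is 99.9%.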

Precision (or positive predictive value)

● Widely used in information retrieval (Web search) to assess the quality of search results (how many documents retrieved for a query are relevant?)
● Considers only true positives
prec = TP / (TP + FP)
● For record linkage, it measures how many of the classified matches are true matches

Recall (or true positive rate, sensitivity)

● Widely used in information retrieval to assess the quality of search results (how many of all relevant documents have been retrieved for a query?)
● Considers only true positives
reca = TP / (TP + FN)
● For record linkage, it measures how many of all true matches have been classified as matches

F-measure: Combining precision and recall

● Precision and recall are often combined into the F-measure (or F-score):
fmeas = 2 × (prec × reca) / (prec + reca)
● It is the harmonic mean of precision and recall
● As precision goes up (e.g. when raising a similarity threshold), recall usually goes down, and vice versa
● But: Recent research has shown that comparing F-measure results can be misleading (different weights given to precision and recall) (Hand and Christen, Statistics and Computing, 2018)
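
The three formulas can be sketched in a few lines. The counts are made up for illustration, and returning 0.0 for an empty denominator is an assumed convention, not something the lecture specifies.

```python
def precision(tp: int, fp: int) -> float:
    """prec = TP / (TP + FP); 0.0 if nothing was classified as a match."""
    return tp / (tp + fp) if (tp + fp) else 0.0


def recall(tp: int, fn: int) -> float:
    """reca = TP / (TP + FN); 0.0 if there are no true matches."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def f_measure(prec: float, reca: float) -> float:
    """Harmonic mean of precision and recall."""
    return 2 * prec * reca / (prec + reca) if (prec + reca) else 0.0


# Hypothetical counts: 80 correct matches, 20 false matches, 120 missed matches.
prec = precision(tp=80, fp=20)    # 0.8
reca = recall(tp=80, fn=120)      # 0.4
fmeas = f_measure(prec, reca)     # roughly 0.533
print(prec, reca, round(fmeas, 3))
```

The harmonic mean (about 0.533) lies below the arithmetic mean of 0.6, since it is pulled towards the lower of the two values.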

Visualising linkage quality results

● Each of the presented measures is calculated based on a specific error / confusion matrix
● Each classifier, and each change in a classifier parameter, will produce a different error matrix
– Lowering a classification threshold, t, will usually increase the numbers of TP but also FP, and lower the numbers of TN and FN
– Raising a classification threshold leads to the opposite
● To better understand classifiers and to compare them, plots are useful tools (for example for different classification thresholds)

● F-measure graph shows precision, recall and F-measure for different classifier thresholds

● Precision-recall graph shows precision versus recall for different classifier thresholds

● ROC (receiver operating characteristic) graph shows true positive rate versus false positive rate for different classifier thresholds
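
The points plotted in these graphs can be obtained by sweeping the threshold over scored record pairs. This is a minimal sketch assuming each pair has a similarity score and a known true match status; the function `sweep_thresholds`, the toy scores, and the convention of reporting precision as 1.0 when no pair is classified as a match are all assumptions.

```python
def sweep_thresholds(scored_pairs, thresholds):
    """Return (threshold, precision, recall, false positive rate) per threshold.

    scored_pairs: list of (similarity, is_true_match) tuples.
    A pair is classified as a match if similarity >= threshold.
    """
    rows = []
    for t in thresholds:
        tp = sum(1 for s, m in scored_pairs if s >= t and m)
        fp = sum(1 for s, m in scored_pairs if s >= t and not m)
        fn = sum(1 for s, m in scored_pairs if s < t and m)
        tn = sum(1 for s, m in scored_pairs if s < t and not m)
        prec = tp / (tp + fp) if (tp + fp) else 1.0   # assumed convention
        reca = tp / (tp + fn) if (tp + fn) else 0.0   # true positive rate
        fpr = fp / (fp + tn) if (fp + tn) else 0.0    # false positive rate
        rows.append((t, prec, reca, fpr))
    return rows


# Toy similarity scores with made-up ground truth labels.
scored_pairs = [(0.95, True), (0.90, True), (0.80, False),
                (0.70, True), (0.40, False), (0.20, False)]
rows = sweep_thresholds(scored_pairs, [0.3, 0.6, 0.85])
for t, prec, reca, fpr in rows:
    print(f"t={t:.2f} prec={prec:.2f} reca={reca:.2f} fpr={fpr:.2f}")
```

On this toy data, raising the threshold from 0.3 to 0.85 lifts precision from 0.60 to 1.00 while recall drops from 1.00 to about 0.67. Plotting the (recall, precision) pairs gives a precision-recall graph, and the (fpr, recall) pairs give a ROC graph.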

Measuring linkage complexity

● We can easily measure run-time and memory consumption of a linkage program / system (but is this meaningful?)
● Generally, platform-independent measures are better
– They allow the performance of systems to be compared even when not run on the same computing platform (but with the same data sets and the same parameter settings)
● Linkage complexity is generally measured by the number of record pairs that need to be compared
– The number of candidate record pairs generated by blocking

Reduction ratio

● Measures by how much a blocking technique is able to reduce the comparison space
– Compared to the full pair-wise comparison of all record pairs
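
The formula itself is not shown in these notes. A commonly used definition is one minus the number of candidate record pairs divided by the number of all possible record pairs. The sketch below assumes that definition and a linkage between two databases; the function name and the example numbers are hypothetical.

```python
def reduction_ratio(num_candidate_pairs: int, size_a: int, size_b: int) -> float:
    """rr = 1 - (candidate pairs) / (all pairs), for linking two databases."""
    total_pairs = size_a * size_b   # full pair-wise comparison space
    return 1.0 - num_candidate_pairs / total_pairs


# Hypothetical example: blocking keeps 50,000 of 1,000 x 5,000 possible pairs.
rr = reduction_ratio(50_000, 1_000, 5_000)
print(round(rr, 4))   # 0.99
```

A value close to 1 means blocking removed most of the comparison space. The measure says nothing about whether true matches were removed along with the non-matches.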

Pairs completeness

● Measures how many true matches 'pass' through a blocking process
● It corresponds to the recall of blocking

● It requires the true match status of all record pairs (as with the linkage quality measures)
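
Read as the recall of blocking, pairs completeness is the number of true matches among the candidate pairs divided by the number of all true matches. The sketch below assumes that definition and a set-based representation of record pairs; the function name and toy identifiers are hypothetical.

```python
def pairs_completeness(candidate_pairs: set, true_matches: set) -> float:
    """pc = (true matches among candidate pairs) / (all true matches)."""
    if not true_matches:
        return 0.0   # assumed convention when there are no true matches
    return len(candidate_pairs & true_matches) / len(true_matches)


# Toy data: blocking produced 4 candidate pairs; ("a3", "b3") was lost.
candidate_pairs = {("a1", "b1"), ("a2", "b2"), ("a2", "b4"), ("a3", "b4")}
true_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

pc = pairs_completeness(candidate_pairs, true_matches)
print(round(pc, 3))   # 0.667
```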

Pairs quality

● Measures how many candidate record pairs generated by blocking are true matches
● It corresponds to the precision of blocking

● It requires the true match status of all record pairs (as with the linkage quality measures)
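
Read as the precision of blocking, pairs quality is the number of true matches among the candidate pairs divided by the number of candidate pairs. A minimal sketch under that definition, with a hypothetical function name and toy data:

```python
def pairs_quality(candidate_pairs: set, true_matches: set) -> float:
    """pq = (true matches among candidate pairs) / (all candidate pairs)."""
    if not candidate_pairs:
        return 0.0   # assumed convention when blocking yields no pairs
    return len(candidate_pairs & true_matches) / len(candidate_pairs)


# Toy data: 2 of the 4 candidate pairs are true matches.
blocked_pairs = {("a1", "b1"), ("a2", "b2"), ("a2", "b4"), ("a3", "b4")}
known_matches = {("a1", "b1"), ("a2", "b2"), ("a3", "b3")}

pq = pairs_quality(blocked_pairs, known_matches)
print(pq)   # 0.5
```

A low value means that the comparison and classification steps will spend most of their effort on non-matching pairs.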