Outline

● What is record linkage?
● Record linkage applications
● A short history of record linkage
● The record linkage process
● Record linkage techniques and challenges

What is record linkage?

● The process of linking records that represent the same entity in one or more databases (patients, customers, businesses, products, publications, etc.)
● Also known as data linkage, data matching, entity resolution, duplicate detection, object identification, etc.
● The major challenge is that unique entity identifiers are not available in the databases to be linked (or, if available, they are not consistent or not stable)
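
Without a shared identifier, records have to be compared on the attributes they do have, such as names and addresses. The following is a minimal sketch of that idea, not a definitive method: it uses Python's standard `difflib.SequenceMatcher` as a stand-in for a proper approximate string comparator. The function names, the field names (`name`, `city`), the sample records, and the 0.8 threshold are all assumptions made for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Approximate string similarity in [0, 1], ignoring case."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def is_likely_match(rec_a, rec_b, fields=("name", "city"), threshold=0.8):
    """Average the per-field similarities and compare to a threshold.

    The field list and the threshold are illustrative assumptions,
    not values taken from any particular linkage system.
    """
    scores = [similarity(rec_a[f], rec_b[f]) for f in fields]
    return sum(scores) / len(scores) >= threshold


# Made-up records: no shared identifier, only descriptive attributes.
rec_1 = {"name": "Gail Smith", "city": "Canberra"}
rec_2 = {"name": "Gayle Smith", "city": "Canberra"}
rec_3 = {"name": "Peter Miller", "city": "Sydney"}

print(is_likely_match(rec_1, rec_2))
print(is_likely_match(rec_1, rec_3))
```

A fixed threshold on an unweighted average is the simplest possible decision rule; real applications usually weight attributes differently and tune the threshold to the data.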

Record linkage challenges

● No unique entity identifiers available
● Real-world data are dirty (typographical errors and variations, missing and out-of-date values, different coding schemes, etc.)
● Scalability
– Naïve comparison of all record pairs has quadratic complexity
– Remove likely non-matches as efficiently as possible
● No training data in many linkage applications
– No record pairs with known true match status
● Privacy and confidentiality (because personal information, like names and addresses, is commonly required for linking)
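
The scalability point above can be made concrete with a small sketch. Comparing every record with every other record in a database of n records needs n(n-1)/2 comparisons. One common way to remove likely non-matches cheaply is blocking: only records that share a key value are compared. The code below is an illustration under stated assumptions, not a definitive implementation; the function names, the sample records, and the choice of postcode as the blocking key are all hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def naive_pairs(records):
    """All unordered record pairs: n * (n - 1) / 2 candidate comparisons."""
    return list(combinations(range(len(records)), 2))


def blocked_pairs(records, key):
    """Only pair records that share the same blocking key value."""
    blocks = defaultdict(list)
    for idx, rec in enumerate(records):
        blocks[key(rec)].append(idx)
    pairs = []
    for members in blocks.values():
        pairs.extend(combinations(members, 2))
    return pairs


def postcode_key(rec):
    """Hypothetical blocking key: the postcode attribute."""
    return rec["postcode"]


# Made-up sample data for illustration only.
records = [
    {"name": "Gail Smith", "postcode": "2600"},
    {"name": "Gayle Smith", "postcode": "2600"},
    {"name": "Peter Miller", "postcode": "2617"},
    {"name": "Pete Miller", "postcode": "2617"},
    {"name": "Anna Jones", "postcode": "2912"},
]

print(len(naive_pairs(records)))             # 10 pairs for 5 records
print(blocked_pairs(records, postcode_key))  # [(0, 1), (2, 3)]
```

The trade-off is that blocking on a dirty attribute can miss true matches: two records for the same person with different postcodes would never be compared under this key.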