Duplicate matching

Managed element
Last edit 6 June 2018 16:04:00
Support contact 
Sebastian Kaczmarski

Duplicate matching compares all records of a given set of custom databases to each other, identifies similar records, and groups the "best matches" into matching groups. Getting a duplicate report takes three steps: (1) select the custom databases to be analyzed and a matching configuration that configures the matching algorithm, (2) start a matching job and wait for the result (i.e. record links with similarity scores), and (3) generate a duplicate report with (optionally) cleansed golden records for each matching group.

Duplicate matching approach

Matching process

The duplicate analysis algorithm compares all records of all given custom databases on the basis of the given matching configuration. However, for performance reasons, only records of the same country (i.e. with the same country shortname) are compared. Hence, country information is a 'hard pre-condition' for duplicate matching: a record must have a country shortname, and there will never be a duplicate link between two records with different country shortnames.
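
The country restriction can be pictured as a blocking step: records are bucketed by country shortname and candidate pairs are formed only inside a bucket. The following is a minimal Python sketch of that idea, not the actual matching implementation. The record layout, the `country` key and the function name `candidate_pairs` are assumptions made for illustration, as is the choice to silently skip records without a country shortname.

```python
from collections import defaultdict
from itertools import combinations


def candidate_pairs(records):
    """Yield pairs of records that share the same country shortname."""
    blocks = defaultdict(list)
    for record in records:
        country = record.get("country")
        if not country:
            # Assumption: a record without a country shortname is skipped,
            # since it can never be linked to another record.
            continue
        blocks[country].append(record)
    # Pairs are only built inside a country block, never across blocks.
    for block in blocks.values():
        yield from combinations(block, 2)


records = [
    {"id": "R1", "country": "DE"},
    {"id": "R2", "country": "DE"},
    {"id": "R3", "country": "FR"},
    {"id": "R4", "country": None},
]
pairs = [(a["id"], b["id"]) for a, b in candidate_pairs(records)]
print(pairs)
```

With this input only R1 and R2 form a candidate pair: R3 is alone in its country block and R4 has no country shortname.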

Matching result and report creation

The matching process "only" provides a set of linked records, i.e. a set of pairs of similar records, each with a similarity score. To get a comprehensive duplicate report, the report creation process compares these links and compiles groups of similar records. The following examples show several characteristics of the grouping algorithm (e.g. prioritization of scores, sorting of records by score, sorting of groups by size).

Example 1

R1 -> R2 (90%)
R3 -> R4 (90%)
R3 -> R5 (90%)

[R3, R4, R5]
[R1, R2]
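
The grouping in Example 1 can be reproduced with a simple connected-components pass over the links, followed by sorting the groups by size. The Python sketch below is an illustration under that assumption only: it ignores the similarity score, so it does not model the score prioritization that the grouping algorithm also applies. The function name `group_links` and the tuple layout of a link are hypothetical.

```python
def group_links(links):
    """Group linked records into connected components, largest group first."""
    parent = {}

    def find(x):
        # Union-find lookup with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b, _score in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for record in list(parent):
        groups.setdefault(find(record), []).append(record)
    # Sort records inside a group by name and groups by size (descending).
    return sorted((sorted(g) for g in groups.values()), key=len, reverse=True)


links = [("R1", "R2", 0.90), ("R3", "R4", 0.90), ("R3", "R5", 0.90)]
groups = group_links(links)
print(groups)
```

For the three links of Example 1 this yields the group of three records first, then the group of two.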

Example 2

R1 -> R3 (90%)
R2 -> R3 (90%)
R2 -> R3 (90%)
R2 -> R4 (90%)
R3 -> R1 (90%)
R3 -> R2 (90%)
R3 -> R4 (90%)
R4 -> R1 (90%)
R4 -> R2 (90%)
R4 -> R3 (90%)

[R1, R3]
[R2, R4]

Golden record creation

There is an option to create "golden records" for the matching groups that have been identified during the duplicate matching. By default, the golden record takes for each attribute the value that appears most frequently among the records of the group, i.e. a simple majority vote. If the duplicate matching is performed on at least two sources (custom databases), the matching configuration can name a source whose attribute values are preferred.
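
The default majority rule can be sketched in a few lines of Python. This is an illustration, not the actual implementation: the function names, the treatment of missing values and the `CITY` attribute are assumptions. When two values are equally frequent, the sketch returns `None` to mark that no majority could be determined.

```python
from collections import Counter


def majority_value(values):
    """Return the most frequent value, or None if there is no unique winner."""
    counts = Counter(v for v in values if v is not None).most_common(2)
    if not counts:
        return None
    if len(counts) == 2 and counts[0][1] == counts[1][1]:
        return None  # tie: no majority can be determined
    return counts[0][0]


def golden_record(group, attributes):
    """Build one golden record for a matching group, attribute by attribute."""
    return {
        attr: majority_value(record.get(attr) for record in group)
        for attr in attributes
    }


group = [
    {"NAME_LOCAL": "ACME GmbH", "CITY": "Berlin"},
    {"NAME_LOCAL": "ACME GmbH", "CITY": "Munich"},
    {"NAME_LOCAL": "Acme", "CITY": None},
]
golden = golden_record(group, ["NAME_LOCAL", "CITY"])
print(golden)
```

Here `NAME_LOCAL` resolves to "ACME GmbH" (two of three records), while `CITY` stays undetermined because "Berlin" and "Munich" each occur once.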

Default source

Defines a default source to use in case no majority can be determined. This default applies to all attributes/properties in the matching configuration.

<duke>
   <defaultGoldenRecordSource>customDBName</defaultGoldenRecordSource>
   <schema>
   ...

Attribute source

Defines the source for an attribute/property in case no majority can be determined. This source applies only to a single attribute/property in the matching configuration.

<duke>
   <schema>
   ...
      <property lookup="true">
         <name>NAME_LOCAL</name>
         <comparator>NameComparator</comparator>
         <low>0.1</low>
         <high>0.9</high>
         <propertyGoldenRecordSource>customDBName</propertyGoldenRecordSource>
      </property>
      ...
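
How the two settings might combine can be sketched by reading them from a configuration and resolving one source per property. The Python example below uses a small, complete configuration assembled for illustration from the element names shown above. The database names `customDB_A` and `customDB_B`, the `CITY` property and the function name are hypothetical, and the rule that a property-level source takes precedence over the default source is an assumption, not a documented behavior.

```python
import xml.etree.ElementTree as ET

# Hypothetical, complete configuration built from the fragments above.
CONFIG = """
<duke>
   <defaultGoldenRecordSource>customDB_A</defaultGoldenRecordSource>
   <schema>
      <property lookup="true">
         <name>NAME_LOCAL</name>
         <comparator>NameComparator</comparator>
         <low>0.1</low>
         <high>0.9</high>
         <propertyGoldenRecordSource>customDB_B</propertyGoldenRecordSource>
      </property>
      <property>
         <name>CITY</name>
      </property>
   </schema>
</duke>
"""


def golden_record_sources(config_xml):
    """Map each property name to the source used when no majority exists."""
    root = ET.fromstring(config_xml)
    default = root.findtext("defaultGoldenRecordSource")
    sources = {}
    for prop in root.iter("property"):
        name = prop.findtext("name")
        # Assumption: the property-level source overrides the default source.
        sources[name] = prop.findtext("propertyGoldenRecordSource") or default
    return sources


sources = golden_record_sources(CONFIG)
print(sources)
```

In this sketch `NAME_LOCAL` falls back to `customDB_B`, and `CITY`, which has no source of its own, falls back to the default `customDB_A`.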