Last edit: 5 April 2017 20:01:08
Support contact: Sebastian Kaczmarski
The algorithm used for duplicate matching can be configured according to your needs. With cleaners, you can remove characters that are not needed (e.g. non-digits, whitespace) or normalize your data (e.g. all lower case, no accents). With comparators, you define the algorithm used for data comparison per attribute (e.g. exact comparison or one of several fuzzy algorithms). And with several thresholds, you configure the formula for calculating the similarity score.
The following listing shows a standard configuration that can be used to begin matching:
<duke>
  <object class="no.priv.garshol.duke.comparators.QGramComparator" name="NameComparator">
    <param name="formula" value="DICE" />
    <param name="q" value="3" />
  </object>
  <schema>
    <threshold>0.85</threshold>
    <property type="id">
      <name>EXTERNAL_ID</name>
    </property>
    <property lookup="required">
      <name>COUNTRY_SHORTNAME</name>
      <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
      <low>0.0</low>
      <high>0.5</high>
    </property>
    <property lookup="true">
      <name>NAME_LOCAL</name>
      <comparator>NameComparator</comparator>
      <low>0.1</low>
      <high>0.9</high>
    </property>
    <property lookup="false">
      <name>POST_CODE_VALUE</name>
      <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
      <low>0.4</low>
      <high>0.7</high>
    </property>
    <property lookup="false">
      <name>LOCALITY_VALUE</name>
      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
      <low>0.1</low>
      <high>0.6</high>
    </property>
    <property lookup="false">
      <name>THOROUGHFARE_VALUE</name>
      <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
      <low>0.1</low>
      <high>0.75</high>
    </property>
  </schema>
  <database class="no.priv.garshol.duke.databases.LuceneDatabase">
    <param name="max-search-hits" value="30" />
    <param name="min-relevance" value="0.9" />
    <param name="fuzzy-search" value="true" />
    <param name="boost-mode" value="INDEX" />
  </database>
  <data-source class="cdq.cdl.matching.batch.MatchingDataSource">
    <column name="EXTERNAL_ID" />
    <column name="NAME_LOCAL" cleaner="cdq.cdl.matching.cleaners.LegalFormCleaner" configProperty="COUNTRY_CODE" />
    <column name="COUNTRY_SHORTNAME" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
    <column name="POST_CODE_VALUE" cleaner="no.priv.garshol.duke.cleaners.DigitsOnlyCleaner" />
    <column name="LOCALITY_VALUE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
    <column name="street" property="THOROUGHFARE_VALUE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
  </data-source>
</duke>
The link between your data and the matching configuration is defined by columns, their names, and properties. A column is an attribute in your data, e.g. a column in an Excel file. The name refers either to an attribute name which you have defined in the mapping, or -- for attributes without a mapping -- to the column name in your data. For example, if you have mapped an attribute country to the CDL data model attribute COUNTRY_SHORTNAME, you have to use COUNTRY_SHORTNAME as the column name. The property value assigns a matching property (at the top of the example configuration) to the column's data. If no property is defined, the name is used to find the appropriate matching property.
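For example, the data-source section of the configuration above maps a source column named street to the matching property THOROUGHFARE_VALUE, while the other columns rely on name-based resolution:

```xml
<column name="street"
        property="THOROUGHFARE_VALUE"
        cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
```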
For each field, a cleaner can be used. A cleaner transforms or normalizes the field's data before it is compared. Thus, cleaners help to increase match scores by removing characters or accents which are not meaningful in a given context. The CDL matching engine can use the following cleaners:
- Country cleaner -- cdq.cdl.matching.cleaners.CountryCleaner: Removes country names from the source string.
- Digits only cleaner -- no.priv.garshol.duke.cleaners.DigitsOnlyCleaner: Removes everything that is not a digit, e.g. to compare post codes.
- Legal form cleaner -- cdq.cdl.matching.cleaners.LegalFormCleaner: Special cleaner for business partner names. The cleaner identifies a legal form in the input string and cuts the part before the legal form. To recognize legal forms reliably, the cleaner needs country information.
- Locality in name cleaner -- cdq.cdl.matching.cleaners.LocalityInNameCleaner: Removes locality information. Used especially for the local and international name.
- Lower case normalize cleaner -- no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner: The most widely used cleaner. It lowercases all letters, removes whitespace at the beginning and end, and normalizes whitespace between tokens. It also removes accents, e.g. turning é into e, and so on.
- Non-character cleaner -- : Removes any characters that are not Latin characters, including numbers.
- Phone number cleaner -- no.priv.garshol.duke.cleaners.PhoneNumberCleaner: Cleaner for international phone numbers. It assumes that it can get the same phone number in forms like +47 (0) 55301400.
- Replace cleaner -- cdq.cdl.matching.cleaners.ReplaceCleaner: Replaces strings by other strings. Patterns may also comprise regular expressions, and character case can be ignored. To use this cleaner, you have to define it as a separate object; the following snippet provides an example. Patterns and replacements have to be provided as a JSON string.
- Strip nontext characters cleaner -- no.priv.garshol.duke.cleaners.StripNontextCharacters: Removes non-text characters. Specifically, it strips control characters (0-0x1F, 0x7F-0x9F) and special symbols in the range 0xA1-0xBF.
- Trim cleaner -- no.priv.garshol.duke.cleaners.TrimCleaner: Trims whitespace at the beginning and end of the input string.
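A ReplaceCleaner must be defined as a separate object in the configuration, in the same way as the NameComparator object at the top of the example. The following is only a hedged sketch: the object name is arbitrary, and the parameter name and JSON shape are assumptions based on the description above, not verified against the CDQ implementation.

```xml
<!-- Hypothetical ReplaceCleaner object; parameter name "replacements" is an assumption -->
<object class="cdq.cdl.matching.cleaners.ReplaceCleaner" name="StreetReplaceCleaner">
  <!-- Patterns and replacements provided as a JSON string -->
  <param name="replacements" value='{"str.": "strasse", "rd.": "road"}' />
</object>
```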
Comparators compare two string values and produce a similarity measure between 0 (meaning completely different) and 1 (meaning exactly equal). To compare different kinds of values differently, the CDL matching engine can use the following comparators:
- Exact comparator -- no.priv.garshol.duke.comparators.ExactComparator: Simply reports 0.0 if the values are not equal and 1.0 if they are.
- Geoposition comparator -- no.priv.garshol.duke.comparators.GeopositionComparator: Compares two geographic positions given by coordinates, using the distance between them along the earth's surface. It assumes the parameters are of the form 59.917516,10.757933, where the numbers are degrees latitude and longitude, respectively. The computation simply assumes a sphere with a radius of 6371 kilometers, so no particular geodetic model is assumed. WGS84 coordinates will work fine, while UTM coordinates will not. See the Duke documentation for details, e.g. on available parameters.
- Jaro-Winkler -- no.priv.garshol.duke.comparators.JaroWinkler: Jaro-Winkler distance, which has been found to be among the best general string comparators for deduplication. Use it for short strings like given names and family names; it is not so good for longer, general strings.
- Levenshtein -- no.priv.garshol.duke.comparators.Levenshtein: The most widely used fuzzy comparator. It uses the Levenshtein distance to compute the similarity; basically, it measures the number of edit operations needed to get from string 1 to string 2.
- Longest common substring comparator -- no.priv.garshol.duke.comparators.LongestCommonSubstring: This comparator does not merely find the longest common substring, but does so repeatedly down to a minimal substring length. See the Duke documentation for details.
- Metaphone -- no.priv.garshol.duke.comparators.MetaphoneComparator: Compares field values using Metaphone, a phonetic algorithm.
- Q-gram -- no.priv.garshol.duke.comparators.QGramComparator: Uses n-grams of field values to calculate their similarity. It behaves similarly to Levenshtein, but is a bit more eager to consider strings the same and does not care about token order. So for strings consisting of tokens that may be reordered (e.g. "Hotel Lindner Hamburg" and "Lindner Hotel Hamburg"), it may be a better alternative than Levenshtein. It can be further configured by the q parameter, which specifies the size of the n-grams. The default size is 3, which is fine for most business partner use cases.
- Soundex -- no.priv.garshol.duke.comparators.SoundexComparator: Compares field values using Soundex, a phonetic algorithm.
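To illustrate what a fuzzy comparator computes, the following is a self-contained re-implementation of a Levenshtein-based similarity: the edit distance is normalized into a score between 0 and 1. This is a sketch of the general technique, not Duke's exact code.

```java
// Illustrative Levenshtein similarity: 1 minus the edit distance
// divided by the length of the longer string.
public class LevenshteinSimilarity {

    // Classic dynamic-programming edit distance, using two rolling rows.
    static int distance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) prev[j] = j;
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = (a.charAt(i - 1) == b.charAt(j - 1)) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev; prev = curr; curr = tmp;
        }
        return prev[b.length()];
    }

    // Normalize the edit distance into a 0..1 similarity score.
    static double similarity(String a, String b) {
        int max = Math.max(a.length(), b.length());
        if (max == 0) return 1.0; // two empty strings are identical
        return 1.0 - (double) distance(a, b) / max;
    }

    public static void main(String[] args) {
        System.out.println(similarity("lindner", "lidner")); // one deletion apart
        System.out.println(similarity("hamburg", "hamburg")); // identical -> 1.0
    }
}
```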
For each comparator, two parameters low and high have to be specified. These probabilities model the "risk" of false-negative and false-positive matches. For example, <low>0.1</low> means that a comparison result of 0.0 (i.e. completely unlike) is still considered a 10% match, and <high>0.9</high> means that a comparison result of 1.0 (i.e. an exact match) is considered only a 90% match. With these parameters, match scores can be controlled at a very detailed level, and the influence of each field on the match score can be defined individually.
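The effect of low and high can be sketched as a rescaling of the raw comparator similarity into a per-field match probability. The linear interpolation below is an illustration of the idea described above (similarity 0.0 maps to low, similarity 1.0 maps to high); Duke's exact internal mapping may differ in detail.

```java
// Hedged sketch: rescale a raw comparator similarity (0..1) into a
// per-field probability bounded by the configured low and high values.
public class LowHighMapping {

    static double fieldProbability(double similarity, double low, double high) {
        // similarity 0.0 -> low, similarity 1.0 -> high, linear in between
        return low + similarity * (high - low);
    }

    public static void main(String[] args) {
        // With <low>0.1</low> and <high>0.9</high>:
        System.out.println(fieldProbability(0.0, 0.1, 0.9)); // a 10% match
        System.out.println(fieldProbability(1.0, 0.1, 0.9)); // a 90% match
    }
}
```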
The threshold tag within the matching schema defines the threshold for the similarity score above which two records are considered a match after comparison. The default is 0.7. Only records with a match score above this value are compiled into matching groups; remaining records are listed as singles. The match score is calculated from the fields' comparison results using Bayes' theorem. See the Duke documentation for more information.
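The Bayesian combination of per-field probabilities can be sketched with the standard naive-Bayes formula, where evidence for and against a match multiplies across fields. This is an illustration of the technique, not Duke's source code.

```java
// Sketch of combining per-field match probabilities with Bayes' theorem:
// combined = prod(p_i) / (prod(p_i) + prod(1 - p_i)).
public class BayesScore {

    static double combine(double[] fieldProbs) {
        double forMatch = 1.0, againstMatch = 1.0;
        for (double p : fieldProbs) {
            forMatch *= p;           // evidence that the records match
            againstMatch *= 1.0 - p; // evidence that they do not
        }
        return forMatch / (forMatch + againstMatch);
    }

    public static void main(String[] args) {
        // Two strong field matches push the score well above a 0.7 threshold ...
        System.out.println(combine(new double[]{0.9, 0.75}));
        // ... while one strong and one weak field can pull it below.
        System.out.println(combine(new double[]{0.9, 0.2}));
    }
}
```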