Matching configuration

Managed element
Last edit: 5 April 2017 20:01:08
Support contact: Sebastian Kaczmarski

The algorithm used for duplicate matching can be configured according to your needs. Cleaners remove characters that are not needed (e.g. non-digits, whitespace) or normalize your data (e.g. all lower case, no accents). Comparators define the algorithm used to compare each attribute (e.g. exact comparison or one of several fuzzy algorithms). Several thresholds control how the similarity score is calculated.

Basic configuration

The following listing shows a standard configuration that can be used to begin matching:

<duke>
    <object class="no.priv.garshol.duke.comparators.QGramComparator" name="NameComparator">
        <param name="formula" value="DICE" />
        <param name="q" value="3" />
    </object>
    <schema>
        <threshold>0.85</threshold>
        <property type="id">
            <name>EXTERNAL_ID</name>
        </property>
        <property lookup="required">
            <name>COUNTRY_SHORTNAME</name>
            <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
            <low>0.0</low>
            <high>0.5</high>
        </property>
        <property lookup="true">
            <name>NAME_LOCAL</name>
            <comparator>NameComparator</comparator>
            <low>0.1</low>
            <high>0.9</high>
        </property>
        <property lookup="false">
            <name>POST_CODE_VALUE</name>
            <comparator>no.priv.garshol.duke.comparators.ExactComparator</comparator>
            <low>0.4</low>
            <high>0.7</high>
        </property>
        <property lookup="false">
            <name>LOCALITY_VALUE</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.1</low>
            <high>0.6</high>
        </property>
        <property lookup="false">
            <name>THOROUGHFARE_VALUE</name>
            <comparator>no.priv.garshol.duke.comparators.Levenshtein</comparator>
            <low>0.1</low>
            <high>0.75</high>
        </property>
    </schema>
    <database class="no.priv.garshol.duke.databases.LuceneDatabase">
        <param name="max-search-hits" value="30" />
        <param name="min-relevance" value="0.9" />
        <param name="fuzzy-search" value="true" />
        <param name="boost-mode" value="INDEX" />
    </database>
    <data-source class="cdq.cdl.matching.batch.MatchingDataSource">
        <column name="EXTERNAL_ID" />
        <column name="NAME_LOCAL" cleaner="cdq.cdl.matching.cleaners.LegalFormCleaner" configProperty="COUNTRY_CODE" />
        <column name="COUNTRY_SHORTNAME" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
        <column name="POST_CODE_VALUE" cleaner="no.priv.garshol.duke.cleaners.DigitsOnlyCleaner" />
        <column name="LOCALITY_VALUE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
        <column name="street" property="THOROUGHFARE_VALUE" cleaner="no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner" />
    </data-source>
</duke>

The link between your data and the matching configuration is defined by columns, their names, and their properties. A column is an attribute in your data, e.g. a column in an Excel file. The name refers either to an attribute name that you have defined in the mapping or, for attributes without a mapping, to the column name in your data. For example, if you have mapped an attribute country to the CDL data model attribute COUNTRY_SHORTNAME, you have to use COUNTRY_SHORTNAME as the column name. The property value assigns the column's data to a matching property (defined in the schema at the top of the example configuration). In the example, the column street is assigned to the matching property THOROUGHFARE_VALUE. If no property is defined, the name is used to find the appropriate matching property.
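The fallback rule described above can be sketched in a few lines of Python. This is only an illustration of the rule as stated in the text, not the engine's actual code; the function name column_to_property is hypothetical, and the XML fragment is a shortened copy of the data-source section of the example configuration.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the data-source section from the example configuration
# (cleaner attributes omitted for brevity).
DATA_SOURCE = """
<data-source class="cdq.cdl.matching.batch.MatchingDataSource">
    <column name="EXTERNAL_ID" />
    <column name="COUNTRY_SHORTNAME" />
    <column name="street" property="THOROUGHFARE_VALUE" />
</data-source>
"""


def column_to_property(xml_text):
    """Map each column name to the matching property it feeds (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    mapping = {}
    for column in root.findall("column"):
        name = column.get("name")
        # Fall back to the column name when no property attribute is given.
        mapping[name] = column.get("property", name)
    return mapping


mapping = column_to_property(DATA_SOURCE)
for name, prop in mapping.items():
    print(f"{name} -> {prop}")
```

Only the street column carries an explicit property attribute, so it is the only column whose matching property differs from its name.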

Cleaners

A cleaner can be assigned to each field. It transforms or normalizes the field's data before the data is compared. Cleaners thus help to increase match scores by removing characters or accents that are not meaningful in a given context. The CDL matching engine can use the following cleaners:

  • Country cleaner -- cdq.cdl.matching.cleaners.CountryCleaner: Removes country names from source.
  • Digits only cleaner -- no.priv.garshol.duke.cleaners.DigitsOnlyCleaner: Removes everything which is not a digit, e.g. to compare post codes.
  • Legal form cleaner -- cdq.cdl.matching.cleaners.LegalFormCleaner: Special cleaner for business partner names. The cleaner identifies a legal form in the input string and cuts the part BEFORE the legal form. To recognize legal forms reliably, the cleaner needs country information.
  • Locality in name cleaner -- cdq.cdl.matching.cleaners.LocalityInNameCleaner: Removes locality information. Used especially for name local and name international.
  • Lower case normalize cleaner -- no.priv.garshol.duke.cleaners.LowerCaseNormalizeCleaner: Most widely used cleaner. It lowercases all letters, removes whitespace characters at beginning and end, and normalizes whitespace characters in between tokens. It also removes accents, e.g. turning é into e, and so on.
  • Non-character cleaner -- : Removes any characters that are not Latin characters, including numbers.
  • Phone number cleaner -- no.priv.garshol.duke.cleaners.PhoneNumberCleaner: Cleaner for international phone numbers. It assumes that it can get the same phone number in forms like 0047 55301400, +47 55301400, 47-55-301400, or +47 (0) 55301400.
  • Replace cleaner -- cdq.cdl.matching.cleaners.ReplaceCleaner: Replaces strings by other strings. Patterns may also comprise regular expressions, and character case can be ignored. To use this cleaner, you have to define it as a separate object. Patterns and replacements have to be provided as a JSON string.
  • Strip nontext characters cleaner -- no.priv.garshol.duke.cleaners.StripNontextCharacters: Removes non-text characters. Specifically it strips control characters (0-0x1F, 0x7F-0x9F) and special symbols in the range 0xA1-0xBF.
  • Trim cleaner -- no.priv.garshol.duke.cleaners.TrimCleaner: Trims whitespace characters at the beginning and end of the input string.
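To make the effect of cleaning concrete, here is a minimal Python sketch that approximates two of the cleaners listed above, based only on their descriptions in this list. The functions lower_case_normalize and digits_only are hypothetical stand-ins; the real Java classes may differ in edge cases.

```python
import re
import unicodedata


def lower_case_normalize(value):
    """Approximate LowerCaseNormalizeCleaner: lowercase, strip accents, tidy whitespace."""
    # Decompose accented letters (e.g. é becomes e plus a combining accent),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFD", value)
    without_accents = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    # split() trims both ends and collapses whitespace runs between tokens.
    return " ".join(without_accents.lower().split())


def digits_only(value):
    """Approximate DigitsOnlyCleaner: keep ASCII digits, drop everything else."""
    return re.sub(r"[^0-9]", "", value)


print(lower_case_normalize("  Café   de  Genève "))  # cafe de geneve
print(digits_only("CH-9000"))  # 9000
```

After cleaning, "Café de Genève" and "cafe de geneve" become identical, so even an exact comparator would report them as equal.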

Comparators

Comparators compare two string values and produce a similarity measure between 0 (meaning completely different) and 1 (meaning exactly equal). To compare different kinds of values differently, the CDL matching engine can use the following comparators:

  • Exact comparator -- no.priv.garshol.duke.comparators.ExactComparator: Just reports 0.0 if the values are not equal and 1.0 if they are.
  • Geoposition comparator -- no.priv.garshol.duke.comparators.GeopositionComparator: Compares two geographic positions, given as coordinates, by the distance between them along the earth's surface. It assumes the values are of the form 59.917516,10.757933, where the numbers are degrees latitude and longitude, respectively. The computation simply assumes a sphere with a radius of 6371 kilometers, so no particular geodetic model is assumed. WGS84 coordinates will work fine, while UTM coordinates will not work. See the Duke documentation for details, e.g. how to use the max-distance parameter.
  • Jaro Winkler -- no.priv.garshol.duke.comparators.JaroWinkler: Jaro–Winkler distance, which has been found to be the best available general string comparator for deduplication. Use this for short strings like given names and family names. Not so good for longer, general strings.
  • Levenshtein -- no.priv.garshol.duke.comparators.Levenshtein: Most widely used fuzzy comparator. Uses Levenshtein distance to compute the similarity. Basically, it measures the number of edit operations needed to get from string 1 to string 2.
  • Longest common substring comparator -- no.priv.garshol.duke.comparators.LongestCommonSubstring: This comparator does not merely find the longest common substring, but does so repeatedly down to a minimal substring length. See [1] for details.
  • Metaphone -- no.priv.garshol.duke.comparators.MetaphoneComparator: Compares field values using Metaphone.
  • Q-gram -- no.priv.garshol.duke.comparators.QGramComparator: Uses n-grams of field values to calculate their similarity. It seems to be similar to Levenshtein, but a bit more eager to consider strings the same, and doesn't care what order tokens are in. So for strings consisting of tokens that may be reordered (e.g. "Hotel Lindner Hamburg" and "Lindner Hotel Hamburg") it may be a better alternative than Levenshtein. It can be further configured by the q parameter, which specifies the size of the n-grams. The default size is 3, which is fine for most business partner use cases.
  • Soundex -- no.priv.garshol.duke.comparators.SoundexComparator: Compares field values using Soundex.
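The difference between Levenshtein and Q-gram on reordered tokens can be illustrated with a small Python sketch. Both functions are simplified, hypothetical re-implementations for illustration only: the Levenshtein similarity is normalized by the length of the longer string, and the q-gram similarity uses sets of character trigrams with the Dice formula (the formula selected for NameComparator in the example configuration). The engine's own normalization and tokenization may differ.

```python
def levenshtein_similarity(a, b):
    """1 - (edit distance / length of the longer string); an assumed normalization."""
    if not a and not b:
        return 1.0
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return 1.0 - previous[-1] / max(len(a), len(b))


def qgram_dice_similarity(a, b, q=3):
    """Dice coefficient over the sets of character q-grams of both strings."""
    grams_a = {a[i:i + q] for i in range(len(a) - q + 1)}
    grams_b = {b[i:i + q] for i in range(len(b) - q + 1)}
    if not grams_a or not grams_b:
        # Strings shorter than q have no q-grams; fall back to exact comparison.
        return 1.0 if a == b else 0.0
    return 2.0 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


name_1 = "hotel lindner hamburg"
name_2 = "lindner hotel hamburg"
lev = levenshtein_similarity(name_1, name_2)
dice = qgram_dice_similarity(name_1, name_2)
print(f"Levenshtein: {lev:.2f}, Q-gram (Dice): {dice:.2f}")
```

Swapping two tokens changes only the few trigrams that span the token boundaries, so the q-gram score stays high, while the edit distance has to account for moving a whole token.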

For each comparator, two parameters low and high have to be specified. These probabilities model the "risk" of false-negative and false-positive matches. For example, <low>0.1</low> means that a comparison result of 0.0 (i.e. completely unlike) is considered a 10% match. And <high>0.9</high> means that a comparison result of 1.0 (i.e. exact match) is considered just a 90% match. By these parameters, match scores can be controlled on a very detailed level, and the influence of different fields on the match score can be defined individually.
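The following Python sketch shows how low and high turn a comparator result into a probability. The text above only fixes the two end points (0.0 maps to low, 1.0 maps to high). How values in between are treated is an assumption here: the sketch uses a quadratic interpolation above 0.5 and falls back to low below it, modelled on the approach used in Duke. Check the behavior of the version in use before relying on it. The function name property_probability is hypothetical.

```python
def property_probability(similarity, low, high):
    """Map a comparator result in [0, 1] to a match probability.

    Assumed rule: results of 0.5 or more are interpolated quadratically
    between 0.5 and high; anything below 0.5 counts as low.
    """
    if similarity >= 0.5:
        return (high - 0.5) * similarity ** 2 + 0.5
    return low


# NAME_LOCAL in the example configuration uses low=0.1 and high=0.9.
for similarity in (0.0, 0.5, 0.8, 1.0):
    probability = property_probability(similarity, low=0.1, high=0.9)
    print(f"similarity {similarity:.1f} -> probability {probability:.3f}")
```

A probability of 0.5 is neutral. A property with high set to 0.5, such as COUNTRY_SHORTNAME in the example, therefore adds no evidence when its values are equal and only counts against a match when they differ.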

Threshold

The threshold tag within the matching schema defines the similarity score that two records must exceed to be considered a match after comparison. The default is 0.7. Only records with a match score above this value are compiled into matching groups; the remaining records are listed as singles. The match score is calculated from the fields' comparison results using Bayes' theorem. See the Duke documentation for more information.
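As a worked example, the Python sketch below combines per-field probabilities into a match score and compares it with the threshold. It assumes the usual naive Bayes combination of two probabilities, applied field by field and starting from a neutral 0.5. The function names are hypothetical, and the input values are the high and low settings from the example configuration, i.e. the probabilities for fields that agree or disagree completely.

```python
from functools import reduce


def combine(p1, p2):
    """Combine two probabilities with Bayes' theorem (assumed formula)."""
    denominator = p1 * p2 + (1.0 - p1) * (1.0 - p2)
    if denominator == 0.0:
        # Contradictory certainties (0 and 1): treat as neutral in this sketch.
        return 0.5
    return (p1 * p2) / denominator


def match_score(probabilities):
    """Fold all field probabilities into one score, starting from a neutral 0.5."""
    return reduce(combine, probabilities, 0.5)


THRESHOLD = 0.85  # value of the threshold tag in the example configuration

# All five fields agree: country, name, post code, locality, thoroughfare (high values).
all_equal = match_score([0.5, 0.9, 0.7, 0.6, 0.75])
# Same country and name, but post code, locality and thoroughfare differ (low values).
name_only = match_score([0.5, 0.9, 0.4, 0.1, 0.1])

print(f"all fields equal: {all_equal:.3f} match={all_equal > THRESHOLD}")
print(f"only name equal:  {name_only:.3f} match={name_only > THRESHOLD}")
```

Under this formula, a single probability of 0.0 forces the combined score to 0, so a property with low set to 0.0 (COUNTRY_SHORTNAME in the example) acts as a veto when its values differ.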