Address curation

Managed element
Last edit 20 February 2019 14:16:55
Support contact (a member of the CDL Team who is responsible for a specific element of the CDL infrastructure): Martin Ofner

Address curation covers standardization, enrichment, cleansing, translation, and geo-coding of addresses. This functionality is key to enabling cross-corporate data management because it helps to deal with differences between addresses, e.g. different languages, abbreviations, and writing rules.

Address curation approach

CDL address curation is implemented as a chain of several phases with each phase transforming a given address record in a specific way. The following listing gives a brief overview of these address curation phases:

  • Parsing: Identifies specific information in given address data (e.g. Post code, Premise) and puts this information in the appropriate attributes of the data model.
  • Cleansing: Compares given address data to reference data from e.g. GoogleMaps or Open Street Map and replaces 'wrong' data (e.g. due to typos or missing accents) with reference data.
  • Translation: Translates given address data to a given language. By default, all address data is translated to English.
  • Enrichment: Adds missing information (e.g. Locality, Post code or Administrative area) to given address records. Geo-coding is also performed in this phase.
  • Abbreviation: Adds abbreviations and codes (e.g. for Country, Administrative area, Thoroughfare) to given address data. The cleansing phase provides full names for all attributes, so after this phase, all known full names and abbreviations are available.
  • Normalization: The CDL has defined certain standards for address data, e.g. "only Latin characters and no accents in base language address data". Such rules are applied in this phase.
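The chain-of-phases idea can be illustrated with a minimal Python sketch. It is not the CDL implementation: the function names (parse, normalize, curate), the flat dictionary used as an address record, and the attribute name thoroughfareNumber are assumptions made for illustration, and only two of the six phases are mimicked in a very simplified form.

```python
import unicodedata
from typing import Callable, Dict, List

Address = Dict[str, str]                 # hypothetical flat address record
Phase = Callable[[Address], Address]     # every phase maps a record to a new record


def parse(address: Address) -> Address:
    """Hypothetical parsing phase: move a trailing house number to its own attribute."""
    result = dict(address)
    street, _, number = result.get("thoroughfare", "").rpartition(" ")
    if street and number[:1].isdigit():
        result["thoroughfare"] = street
        result["thoroughfareNumber"] = number
    return result


def normalize(address: Address) -> Address:
    """Hypothetical normalization phase: remove accents from all values."""
    def strip_accents(text: str) -> str:
        decomposed = unicodedata.normalize("NFKD", text)
        return "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return {key: strip_accents(value) for key, value in address.items()}


def curate(address: Address, phases: List[Phase]) -> Address:
    """Apply the phases one after another; each phase sees the previous result."""
    for phase in phases:
        address = phase(address)
    return address


curated = curate({"thoroughfare": "Lukasstrasse 4", "locality": "Zürich"},
                 [parse, normalize])
print(curated)
```

Because every phase has the same signature, phases can be added, removed, or reordered without touching the others, which is the main appeal of a chained design.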

Address curation quality

The quality of an address curation result depends mainly on the volume and quality of the reference data used in the curation process. For each curated address, the CDL Address Curation Engine provides a score which indicates how "similar" the result is to the given data, i.e. to the address data that was used as input. A further indicator of the "expected quality" is provided by metrics on the reference data used at country level.

Comparison of request and results

For each address which is processed by the CDL Address Curation Engine, input data is compared to the curation result. Similar values indicate that the engine has found well-fitting reference data and has really improved an address. Low similarity (e.g. due to a completely different locality) indicates that the engine may have used wrong reference data or may have misunderstood given data.

The following snippets show a request for a Swiss address and the corresponding response with request similarity scores, a curation level, an accuracy indicator, and a change log.

Request

{
  "address": {
    "country": {
      "shortname": "CH"
    },
    "thoroughfare": {
      "value": "Lukasstrasse 4"
    },
    "locality": {
      "value": "St. Gallen"
    },
    "postCode": {
      "value": "9008"
    }
  }
}

Response

{
  "addressCurationResult": {
    "curatedAddress": {
      "administrativeArea": {
        "value": "Canton of St. Gallen",
        "shortname": "SG",
        "administrativeArea": {
          "value": "Sankt Gallen",
          "shortname": "St. Gallen"
        }
      },
      "country": {
        "value": "Switzerland",
        "shortname": "CH"
      },
      "geographicCoordinates": {
        "latitude": 47.43951029999999,
        "longitude": 9.395268699999999
      },
      "locality": {
        "value": "St. Gallen"
      },
      "postCode": {
        "value": "9008",
        "type": "Regular"
      },
      "thoroughfare": {
        "value": "Lukasstrasse",
        "shortname": "Lukasstr.",
        "number": "4"
      },
      "metadata": {...}
    },
    "originalAddress": {...},
    "requestSimilarity": {
      "locality": 1.0,
      "postCode": 1.0,
      "thoroughfare": 1.0,
      "geoCoordinates": true,
      "overall": 1.0
    },
    "curationLevel": "CL_6",
    "accuracyIndicator": 4,
    "changes": [
      {
        "action": "ADDED",
        "field": "ADMINISTRATIVE_AREA_LEVEL_2_SHORTNAME",
        "message": "'St. Gallen'"
      },
      {
        "action": "ADDED",
        "field": "ADMINISTRATIVE_AREA_LEVEL_2_VALUE",
        "message": "'Sankt Gallen'"
      },
      {
        "action": "ADDED",
        "field": "ADMINISTRATIVE_AREA_SHORTNAME",
        "message": "'SG'"
      },
      {
        "action": "ADDED",
        "field": "ADMINISTRATIVE_AREA_VALUE",
        "message": "'Canton of St. Gallen'"
      },
      {
        "action": "ADDED",
        "field": "COUNTRY_VALUE",
        "message": "'Switzerland'"
      },
      {
        "action": "ADDED",
        "field": "POST_CODE_TYPE",
        "message": "'Regular'"
      },
      {
        "action": "ADDED",
        "field": "THOROUGHFARE_NUMBER",
        "message": "'4'"
      },
      {
        "action": "ADDED",
        "field": "THOROUGHFARE_SHORTNAME",
        "message": "'Lukasstr.'"
      },
      {
        "action": "CHANGED",
        "field": "LATITUDE",
        "message": "'0.0' -> '47.43951029999999' (approximated)"
      },
      {
        "action": "CHANGED",
        "field": "LONGITUDE",
        "message": "'0.0' -> '9.395268699999999' (approximated)"
      },
      {
        "action": "CHANGED",
        "field": "THOROUGHFARE_VALUE",
        "message": "'Lukasstrasse 4' -> 'Lukasstrasse'"
      }
    ],
    "languageCode": "EN"
  }
}
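A client will typically read the quality indicators and the change log from such a response. The following Python sketch shows one way to do that. The response dictionary is a trimmed copy of the example above (only a few change entries are kept), and the helper name summarize is hypothetical, not part of any CDL client library.

```python
from collections import defaultdict

# Trimmed copy of the example response; only the fields used below are kept.
response = {
    "addressCurationResult": {
        "requestSimilarity": {
            "locality": 1.0, "postCode": 1.0, "thoroughfare": 1.0,
            "geoCoordinates": True, "overall": 1.0,
        },
        "curationLevel": "CL_6",
        "accuracyIndicator": 4,
        "changes": [
            {"action": "ADDED", "field": "COUNTRY_VALUE", "message": "'Switzerland'"},
            {"action": "ADDED", "field": "THOROUGHFARE_NUMBER", "message": "'4'"},
            {"action": "CHANGED", "field": "THOROUGHFARE_VALUE",
             "message": "'Lukasstrasse 4' -> 'Lukasstrasse'"},
        ],
    }
}


def summarize(response: dict) -> dict:
    """Collect the quality indicators and group changed fields by action."""
    result = response["addressCurationResult"]
    fields_by_action = defaultdict(list)
    for change in result.get("changes", []):
        fields_by_action[change["action"]].append(change["field"])
    return {
        "overall": result["requestSimilarity"]["overall"],
        "curationLevel": result["curationLevel"],
        "accuracyIndicator": result["accuracyIndicator"],
        "fieldsByAction": dict(fields_by_action),
    }


summary = summarize(response)
print(summary)
```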

Request similarity

The similarity metric compares Locality, Post code, and Thoroughfare data from the request to the response data. For Locality, only the top-level locality from the request is compared to both the top level and the first sub-level of the response. For Thoroughfare, request data is compared to the concatenation of Thoroughfare value and Thoroughfare number. The comparison is based on Q-grams. To account for "included" terms (e.g. "Gallen" and "Saint Gallen"), the score algorithm also looks for the longest common substring, and the higher of the two scores (Q-gram vs. substring) is used. The overall similarity is the average of the other scores. However, if geoCoordinates is false (i.e. no Geographic coordinates could be found), the overall score is capped at 0.5.
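The scoring described above can be sketched in Python as follows. This is an illustration under stated assumptions, not the engine's actual algorithm: the Q-gram size (q = 2), the Dice-style normalization of the Q-gram score, and the normalization of the substring score by the length of the shorter string are all assumptions, as are the function names.

```python
from collections import Counter
from difflib import SequenceMatcher
from typing import List


def qgrams(text: str, q: int = 2) -> List[str]:
    """All overlapping substrings of length q (case-insensitive)."""
    text = text.lower()
    return [text[i:i + q] for i in range(len(text) - q + 1)]


def qgram_score(a: str, b: str, q: int = 2) -> float:
    """Assumed normalization: Dice coefficient over the Q-gram multisets."""
    grams_a, grams_b = Counter(qgrams(a, q)), Counter(qgrams(b, q))
    total = sum(grams_a.values()) + sum(grams_b.values())
    if total == 0:  # both strings are shorter than q
        return 1.0 if a.lower() == b.lower() else 0.0
    shared = sum((grams_a & grams_b).values())
    return 2.0 * shared / total


def substring_score(a: str, b: str) -> float:
    """Assumed normalization: longest common substring / length of shorter string."""
    a, b = a.lower(), b.lower()
    shorter = min(len(a), len(b))
    if shorter == 0:
        return 0.0
    matcher = SequenceMatcher(None, a, b, autojunk=False)
    match = matcher.find_longest_match(0, len(a), 0, len(b))
    return match.size / shorter


def field_score(request_value: str, response_value: str) -> float:
    """The higher of the Q-gram score and the substring score is used."""
    return max(qgram_score(request_value, response_value),
               substring_score(request_value, response_value))


def overall_score(field_scores: List[float], has_geo_coordinates: bool) -> float:
    """Average of the field scores, capped at 0.5 without geographic coordinates."""
    average = sum(field_scores) / len(field_scores)
    return average if has_geo_coordinates else min(average, 0.5)


# "Gallen" is fully contained in "Saint Gallen", so the substring score wins.
print(qgram_score("Gallen", "Saint Gallen"), field_score("Gallen", "Saint Gallen"))
# Thoroughfare: the request is compared to value and number joined together.
print(field_score("Lukasstrasse 4", "Lukasstrasse" + " " + "4"))
```

With these assumptions, the Q-gram score alone would penalize "Gallen" against "Saint Gallen" for the extra "Saint ", which is the case the substring comparison is meant to cover.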

TODO: Overall score logic

Curation level

For compatibility with the legacy data model, address curation responses also provide a curationLevel. This informal "level" is simply derived from the overall request similarity score:

Identifier | Name | Overall score | Description
CL_6 | Validated | (0.9, 1.0] | The address was found in the shared CDL data pool. This means another company uses the same address, which is a very reliable indicator that the address is correct (currently only available in an alpha version, not in the stable environment).
CL_5 | Reliable match | (0.8, 0.9] | The address was found by the CDL, but no major changes have been made as the address was correct (e.g. only a trailing whitespace was removed).
CL_4 | High confidence match | (0.7, 0.8] | The address was found by the CDL. There were only changes in less critical fields such as the Premise or Thoroughfare number.
CL_3 | Medium confidence match | (0.6, 0.7] | The address was found and there are minor changes in highly important fields (i.e. Locality, Post code, Country, Thoroughfare).
CL_2 | Low confidence match | (0.4, 0.6] | The address was found, but there were significant changes in critical fields.
CL_1 | Not found | (0.2, 0.4] | The address was not found by the CDL in the employed external data sources (i.e. Google Maps, Open Street Map or geonames.org).
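The mapping from overall score to curation level amounts to a lookup over half-open intervals. The following Python sketch uses the interval bounds listed above; the function name curation_level is hypothetical, and returning None for scores of 0.2 or below is an assumption, since no level is listed for that range.

```python
from typing import Optional

# (exclusive lower bound, inclusive upper bound, identifier), as listed above
LEVELS = [
    (0.9, 1.0, "CL_6"),
    (0.8, 0.9, "CL_5"),
    (0.7, 0.8, "CL_4"),
    (0.6, 0.7, "CL_3"),
    (0.4, 0.6, "CL_2"),
    (0.2, 0.4, "CL_1"),
]


def curation_level(overall: float) -> Optional[str]:
    """Return the level whose interval (lower, upper] contains the overall score."""
    for lower, upper, identifier in LEVELS:
        if lower < overall <= upper:
            return identifier
    return None  # assumption: no level is defined for scores of 0.2 or below


print(curation_level(1.0), curation_level(0.9), curation_level(0.5))
```

Note that a score exactly on a boundary falls into the lower level, e.g. 0.9 maps to CL_5, because the lower bound of each interval is exclusive.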

Accuracy indicator

Value Criteria (or)
5 Name and address are matched in the CDL database with match score > 0.9
4
  • Address has geo coordinates, locality, thoroughfare and post code
  • Address is matched in the CDL address index with a match score > 0.9
  • Locality, thoroughfare, and post code are given in the request, the requestSimilarity scores for locality and thoroughfare are > 0.9 and for post code > 0.95, and geographic coordinates are given
  • GoogleMaps provides only one result for the request
3
  • Address has geo coordinates, locality, thoroughfare and post code
  • When GoogleMaps provides only one result for the request:
    • NOT (Locality, thoroughfare, and post code are given in the request and the requestSimilarity scores for locality and thoroughfare are > 0.9 and for post code > 0.95)
    • requestSimilarity.overall >= 0.5
  • When GoogleMaps provides multiple results for the request:
    • requestSimilarity.overall >= 0.7
2
  • Address has geo coordinates, locality, thoroughfare and post code
  • Curated with Google
  • When GoogleMaps provides only one result for the request:
    • requestSimilarity.overall < 0.5
  • When GoogleMaps provides multiple results for the request:
    • requestSimilarity.overall in [0.5, 0.7)
1 Otherwise (none of the criteria above apply)

Reference data metrics

The following metrics indicate the quality which can be expected from CDL address curation. However, because curation depends on several reference data sources, quality may vary in particular cases.

Post codes management level: Quality indicator, i.e. the information source, for post codes which are used for data validation and address curation:

  • NO_DATA: No data is available for the given country.
  • GEONAMES: Data is taken from http://www.geonames.org.
  • OSM: Data is taken from Open Street Map.
  • NATIONAL_AUTHORITY: A national authority provides complete and up-to-date information.

Administrative area management level: Quality indicator, i.e. the information source, for administrative areas which are used for data validation and address curation:

  • NO_DATA: No data is available for the given country.
  • GEONAMES: Data is taken from http://www.geonames.org.
  • OSM: Data is taken from Open Street Map.
  • NATIONAL_AUTHORITY: A national authority provides complete and up-to-date information.

Number of address business rules: Counts business rules for address data for a given country.

Number of address reference concepts: Counts terms which are managed e.g. for premise types for a given country.

Number of address reference sources: Counts reference data sources for address data for a given country.

Address details level: Indicates the level of detail down to which address curation provides address information for a given country:

  • LOCALITY: Address curation only provides address information on Country, Administrative area, Post code, and Locality level for a given country.
  • THOROUGHFARE: In addition to LOCALITY level, address curation provides address information on Thoroughfare level.
  • PREMISE: Address curation provides address information on THOROUGHFARE level and also identifies and structures Premise information in address data.

Address curation quality: Quality level [1..5] to indicate the (subjective) quality of address curation.

Use the following widget to understand the address curation quality of particular countries:

Country | Has post codes management level | Has administrative area management level | Has number of address business rules | Has number of address reference concepts | Has number of address reference sources | Has address details level | Has address curation quality
Afghanistan
Aland Islands
Albania
Algeria
American Samoa
Andorra
Angola
Anguilla
Antarctica
Antigua and Barbuda
Argentina
Armenia
Aruba
Australia
Austria
Azerbaijan
Bahamas
Bahrain
Bangladesh
Barbados
Belarus
Belgium
Belize
Benin
Bermuda
Bhutan
Bolivia
Bosnia and Herzegovina
Botswana
Bouvet Island
Brazil
British Indian Ocean Territory
British Virgin Islands
Brunei
Bulgaria
Burkina Faso
Burundi
Cambodia
Cameroon
Canada
Canary Islands
Cape Verde
Caribbean Netherlands
Cayman Islands
Central African Republic
Ceuta
Chad
Chile
China
Christmas Island
Cocos (Keeling) Islands
Colombia
Comoros
Congo
Cook Islands
Costa Rica
Croatia
Cuba
Curacao
Cyprus
Czech Republic
Democratic Republic of the Congo
Denmark
Djibouti
Dominica
Dominican Republic
East Timor
Ecuador
Egypt
El Salvador
Equatorial Guinea
Eritrea
Estonia
Ethiopia
Falkland Islands
Faroe Islands
Fiji
Finland
France
French Guiana
French Polynesia
French Southern Territories
Gabon
Gambia
Georgia
Germany
Ghana
Gibraltar
Greece
Greenland
Grenada
Guadeloupe
Guam
Guatemala
Guernsey
Guinea
Guinea-Bissau
Guyana
Haiti | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Heard Island and McDonald Islands
Honduras
Hong Kong
Hungary
Iceland
India
Indonesia
Iran
Iraq
Ireland
Isle of Man
Israel
Italy
Ivory Coast
Jamaica
Japan
Jersey
Jordan
Kazakhstan
Kenya
Kiribati
Kosovo
Kuwait
Kyrgyzstan
Laos
Latvia
Lebanon
Lesotho
Liberia
Libya
Liechtenstein
Lithuania
Luxembourg
Macao | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Macedonia
Madagascar
Malawi
Malaysia
Maldives
Mali | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Malta
Marshall Islands
Martinique
Mauritania
Mauritius
Mayotte
Melilla
Mexico
Micronesia
Moldova
Monaco
Mongolia
Montenegro
Montserrat
Morocco
Mozambique
Myanmar
Namibia
Nauru
Nepal
Netherlands Antilles
New Caledonia
New Zealand
Nicaragua
Niger
Nigeria
Niue
Norfolk Island
North Korea
Northern Mariana Islands
Norway
Oman
Pakistan
Palau
Palestine
Panama
Papua New Guinea
Paraguay
Peru
Philippines
Pitcairn Islands
Poland
Portugal
Puerto Rico
Qatar
Reunion
Romania
Russia
Rwanda
Saint Barthelemy
Saint Helena, Ascension and Tristan da Cunha
Saint Kitts and Nevis
Saint Lucia
Saint Martin
Saint Pierre and Miquelon
Saint Vincent and the Grenadines
Samoa
San Marino
Sao Tome and Principe
Saudi Arabia
Senegal
Serbia
Serbia (XS)
Seychelles
Sierra Leone
Singapore
Sint Maarten
Slovakia
Slovenia
Solomon Islands
Somalia
South Africa
South Georgia and the South Sandwich Islands
South Korea
South Sudan
Spain
Sri Lanka
Sudan
Suriname
Svalbard
Swaziland
Sweden
Switzerland
Syria
Taiwan
Tajikistan
Tanzania
Thailand
The Netherlands
Togo
Tokelau
Tonga
Trinidad and Tobago
Tunisia
Turkey
Turkmenistan
Turks and Caicos Islands
Tuvalu
Uganda
Ukraine
United Arab Emirates
United Kingdom
United States
United States Minor Outlying Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
United States Virgin Islands
Uruguay
Uzbekistan
Vanuatu
Vatican City
Venezuela
Vietnam
Wallis and Futuna
Western Sahara
World
Yemen | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Yugoslavia
Zaire
Zambia
Zimbabwe