Address curation

Managed element
Last edit: 6 June 2018 13:32:14
Support contact (a member of the CDL Team who is responsible for a specific element of the CDL infrastructure): Martin Ofner

Address curation covers standardization, enrichment, cleansing, translation, and geo-coding of addresses. This functionality is key to cross-corporate data management because it helps to handle, for example, different languages, abbreviations, and writing rules of addresses.

Address curation approach

CDL address curation is implemented as a chain of several phases with each phase transforming a given address record in a specific way. The following listing gives a brief overview of these address curation phases:

  • Parsing: Identifies specific information in the given address data (e.g. Post code, Premise) and puts this information into the appropriate attributes of the data model.
  • Cleansing: Compares the given address data to reference data, e.g. from Google Maps or Open Street Map, and replaces 'wrong' data (e.g. due to typos or missing accents) with reference data.
  • Translation: Translates the given address data into a given language. By default, all address data is translated into English.
  • Enrichment: Adds missing information (e.g. Locality, Post code, or Administrative area) to the given address records. Geo-coding is also performed in this phase.
  • Abbreviation: Adds abbreviations and codes (e.g. for Country, Administrative area, Thoroughfare) to the given address data. The cleansing phase provides full names for all attributes, so after this phase all known full names and abbreviations are available.
  • Normalization: The CDL has defined certain standards for address data, e.g. "only Latin characters and no accents in base language address data". Such rules are applied in this phase.
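
The chain of phases can be pictured as a list of functions that are applied to an address record one after another. The following Python sketch is only an illustration: the record layout, the function names, the `countryShortName` key, and the tiny lookup table are assumptions, not part of the CDL implementation, and only two of the six phases are mocked up.

```python
import unicodedata
from typing import Callable, Dict, List

Address = Dict[str, str]
Phase = Callable[[Address], Address]

# Hypothetical lookup table; a real engine would use managed reference data.
COUNTRY_CODES = {"Switzerland": "CH"}


def abbreviate(address: Address) -> Address:
    """Abbreviation phase (sketch): add a country code next to the full name."""
    result = dict(address)
    code = COUNTRY_CODES.get(address.get("country", ""))
    if code:
        result["countryShortName"] = code  # assumed attribute name
    return result


def strip_accents(text: str) -> str:
    """Decompose characters and drop the combining accent marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize(address: Address) -> Address:
    """Normalization phase (sketch): apply the 'no accents' rule to all values."""
    return {key: strip_accents(value) for key, value in address.items()}


def run_chain(address: Address, phases: List[Phase]) -> Address:
    """Apply each phase in order; every phase returns a transformed copy."""
    for phase in phases:
        address = phase(address)
    return address


curated = run_chain(
    {"locality": "Zürich", "country": "Switzerland"},
    [abbreviate, normalize],
)
print(curated)
```

Keeping every phase as a function from record to record means phases can be reordered or tested in isolation, which matches the "chain" description above.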

Address curation quality

The quality of an address curation result depends mainly on the volume and quality of the reference data used in the curation process. For each curated address, the CDL Address Curation Engine provides a score which indicates how "similar" the result is to the given data, i.e. to the address data used as input. Another indicator of the "expected quality" is provided by metrics for the reference data used at country level.

Request similarity and curation level

For each address processed by the CDL Address Curation Engine, the input data is compared to the curation result. High similarity indicates that the engine has found well-fitting reference data and has actually improved the address. Low similarity (e.g. due to a completely different locality) indicates that the engine may have used wrong reference data or may have misunderstood the given data. The following snippet shows an example requestSimilarity provided within the response of the curation engine.

<cdl:responseAddressCuration xmlns:cdl="http://ws.cdl.cdq.ch">
  <cdl:curatedAddress>
    ...
  </cdl:curatedAddress>
  <cdl:requestSimilarity>
    <geoCoordinates>true</geoCoordinates>
    <locality>0.8571428571428571</locality>
    <overall>0.813492063</overall>
    <postCode>0.75</postCode>
    <thoroughfare>0.8333333333333334</thoroughfare>
  </cdl:requestSimilarity>
  <cdl:curationLevel>CL_5</cdl:curationLevel>
  <cdl:curationInformation>
    <cdl:log>[added] ADMINISTRATIVE_AREA_LEVEL_2_VALUE: 'Canton of St. Gallen'</cdl:log>
    <cdl:log>[added] ADMINISTRATIVE_AREA_VALUE: 'Canton of St. Gallen'</cdl:log>
    <cdl:log>[added] COUNTRY_VALUE: 'Switzerland'</cdl:log>
    <cdl:log>[added] POST_CODE_TYPE: 'ADDRESS_POST_CODE_TYPE_REGULAR'</cdl:log>
    <cdl:log>[added] THOROUGHFARE_NUMBER: '4'</cdl:log>
    <cdl:log>[added] THOROUGHFARE_SHORTNAME: 'Lukasstr.'</cdl:log>
    <cdl:log>[changed] LATITUDE: '0.0' -> '47.4394863'</cdl:log>
    <cdl:log>[changed] LOCALITY_VALUE: 'Gallen ST' -> 'St. Gallen'</cdl:log>
    <cdl:log>[changed] LONGITUDE: '0.0' -> '9.3951915'</cdl:log>
    <cdl:log>[changed] POST_CODE_VALUE: '9000' -> '9008'</cdl:log>
    <cdl:log>[changed] THOROUGHFARE_VALUE: 'Lukast 4' -> 'Lukasstrasse'</cdl:log>
  </cdl:curationInformation>
</cdl:responseAddressCuration>
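
A client can read the similarity scores and the curation level from such a response with a standard XML parser. The following Python sketch uses a shortened copy of the response above; the function name `read_similarity` is hypothetical, and the code assumes, as in the snippet, that the children of `requestSimilarity` carry no namespace prefix.

```python
import xml.etree.ElementTree as ET

# Shortened copy of the example response shown above.
RESPONSE = """<cdl:responseAddressCuration xmlns:cdl="http://ws.cdl.cdq.ch">
  <cdl:requestSimilarity>
    <geoCoordinates>true</geoCoordinates>
    <locality>0.8571428571428571</locality>
    <overall>0.813492063</overall>
    <postCode>0.75</postCode>
    <thoroughfare>0.8333333333333334</thoroughfare>
  </cdl:requestSimilarity>
  <cdl:curationLevel>CL_5</cdl:curationLevel>
</cdl:responseAddressCuration>"""

NS = {"cdl": "http://ws.cdl.cdq.ch"}


def read_similarity(xml_text):
    """Return (scores, curation_level) parsed from a curation response."""
    root = ET.fromstring(xml_text)
    similarity = root.find("cdl:requestSimilarity", NS)
    if similarity is None:
        raise ValueError("response contains no requestSimilarity element")
    scores = {}
    for child in similarity:
        if child.tag == "geoCoordinates":
            # Boolean flag: were geographic coordinates found?
            scores[child.tag] = child.text.strip().lower() == "true"
        else:
            # All other children are similarity scores.
            scores[child.tag] = float(child.text)
    level = root.findtext("cdl:curationLevel", namespaces=NS)
    return scores, level


scores, level = read_similarity(RESPONSE)
print(level, scores["overall"], scores["geoCoordinates"])
```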

The similarity metric compares Locality, Post code, and Thoroughfare data of the request with the response data. For Locality, only the top-level locality of the request is compared to both the top level and the first sub-level of the response. For Thoroughfare, request data is compared to the concatenated Thoroughfare value and Thoroughfare number. The comparison is based on q-grams. To account for "included" terms (e.g. "Gallen" and "Saint Gallen"), the score algorithm also looks for the longest common substring, and the higher of the two values is used. The overall similarity is the average of the other scores; in the example above, (0.857 + 0.75 + 0.833) / 3 ≈ 0.813. However, if geoCoordinates is false (i.e. no geographic coordinates could be found), the overall score is capped at 0.5.
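
The scoring idea can be sketched in Python as follows. The exact formulas used by the engine are not documented here, so the details are assumptions: q-gram similarity is computed as a Dice coefficient over bigram sets, the substring score is the length of the longest common substring divided by the length of the shorter string, and all function names are hypothetical. The sketch is therefore not expected to reproduce the field scores of the example response; only the averaging and the 0.5 cap follow directly from the description.

```python
from statistics import mean


def qgrams(text, q=2):
    """Return the set of q-grams of a lower-cased string."""
    text = text.lower()
    if len(text) < q:
        return {text} if text else set()
    return {text[i:i + q] for i in range(len(text) - q + 1)}


def qgram_similarity(a, b, q=2):
    """Dice coefficient over q-gram sets (assumed formula)."""
    grams_a, grams_b = qgrams(a, q), qgrams(b, q)
    if not grams_a or not grams_b:
        return 0.0
    return 2 * len(grams_a & grams_b) / (len(grams_a) + len(grams_b))


def longest_common_substring(a, b):
    """Length of the longest common substring (dynamic programming)."""
    a, b = a.lower(), b.lower()
    best = 0
    previous = [0] * (len(b) + 1)
    for char_a in a:
        current = [0] * (len(b) + 1)
        for j, char_b in enumerate(b, start=1):
            if char_a == char_b:
                current[j] = previous[j - 1] + 1
                best = max(best, current[j])
        previous = current
    return best


def substring_similarity(a, b):
    """Longest common substring relative to the shorter string (assumed)."""
    shorter = min(len(a), len(b))
    return longest_common_substring(a, b) / shorter if shorter else 0.0


def field_similarity(a, b):
    """Use the higher of the q-gram and the substring score."""
    return max(qgram_similarity(a, b), substring_similarity(a, b))


def overall_similarity(field_scores, geo_coordinates):
    """Average the field scores; cap at 0.5 if no coordinates were found."""
    score = mean(field_scores)
    return score if geo_coordinates else min(score, 0.5)


example_scores = [0.8571428571428571, 0.75, 0.8333333333333334]
print(field_similarity("Gallen", "Saint Gallen"))
print(overall_similarity(example_scores, geo_coordinates=True))
print(overall_similarity(example_scores, geo_coordinates=False))
```

With these assumed formulas, "Gallen" is fully contained in "Saint Gallen", so the substring score wins over the lower q-gram score, which is the effect the "included terms" rule is meant to achieve.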

For compatibility with the legacy data model, address curation responses also provide a curationLevel. This informal "level" is derived solely from the overall request similarity score:

Identifier | Name | Overall score | Description
CL_6 | Validated | (0.9, 1.0] | The address was found in the shared CDL data pool. This means another company uses the same address, which is a very reliable indicator that the address is correct (currently only available in an alpha version, not in the stable environment).
CL_5 | Reliable match | (0.8, 0.9] | The address was found by the CDL, but no major changes have been made as the address was correct (e.g. only a trailing whitespace was removed).
CL_4 | High confidence match | (0.7, 0.8] | The address was found by the CDL. There were only changes in less critical fields such as the Premise or Thoroughfare number.
CL_3 | Medium confidence match | (0.6, 0.7] | The address was found and there are minor changes in highly important fields (i.e. Locality, Post code, Country, Thoroughfare).
CL_2 | Low confidence match | (0.4, 0.6] | The address was found, but there were significant changes in critical fields.
CL_1 | Not found | (0.2, 0.4] | The address was not found by the CDL in the employed external data sources (i.e. Google Maps, Open Street Map or geonames.org).
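
Because the level is derived from the overall score, the mapping can be written as a simple threshold lookup. The following is a minimal Python sketch, assuming the half-open score intervals listed above; the function name is hypothetical, and returning `None` for scores of 0.2 or below is an assumption, because no level is defined for that range.

```python
# Lower bounds (exclusive) of the score intervals, highest level first.
LEVEL_THRESHOLDS = [
    (0.9, "CL_6"),
    (0.8, "CL_5"),
    (0.7, "CL_4"),
    (0.6, "CL_3"),
    (0.4, "CL_2"),
    (0.2, "CL_1"),
]


def curation_level(overall_score):
    """Map an overall similarity score in [0, 1] to a curation level."""
    if not 0.0 <= overall_score <= 1.0:
        raise ValueError("overall score must be between 0 and 1")
    for lower_bound, level in LEVEL_THRESHOLDS:
        # Intervals are open at the lower and closed at the upper bound.
        if overall_score > lower_bound:
            return level
    return None  # assumption: no level is defined for scores <= 0.2


print(curation_level(0.813492063))
```

For the example response above, the overall score of 0.813492063 falls into (0.8, 0.9] and therefore yields CL_5, which matches the curationLevel in the snippet.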

Reference data metrics

The following metrics indicate the quality that can be expected from CDL address curation. However, because curation depends on several reference data sources, quality may vary in particular cases.

Post codes management level: Quality indicator, i.e. the information source, for post codes which are used for data validation and address curation:

  • NO_DATA: No data is available for the given country.
  • GEONAMES: Data is taken from http://www.geonames.org.
  • OSM: Data is taken from Open Street Map.
  • NATIONAL_AUTHORITY: A national authority provides complete and up-to-date information.

Administrative area management level: Quality indicator, i.e. the information source, for administrative areas which are used for data validation and address curation:

  • NO_DATA: No data is available for the given country.
  • GEONAMES: Data is taken from http://www.geonames.org.
  • OSM: Data is taken from Open Street Map.
  • NATIONAL_AUTHORITY: A national authority provides complete and up-to-date information.

Number of address business rules: Counts business rules for address data for a given country.

Number of address reference concepts: Counts terms which are managed e.g. for premise types for a given country.

Number of address reference sources: Counts reference data sources for address data for a given country.

Address details level: The most detailed address level for which address curation provides information for a given country:

  • LOCALITY: Address curation only provides address information on Country, Administrative area, Post code, and Locality level for a given country.
  • THOROUGHFARE: In addition to LOCALITY level, address curation provides address information on Thoroughfare level.
  • PREMISE: Address curation provides address information on THOROUGHFARE level and also identifies and structures Premise information in address data.

Address curation quality: Quality level [1..5] to indicate the (subjective) quality of address curation.
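
For programmatic use, the metrics of one country can be held in a small record type. The following Python sketch is an assumption about how a client might model them; the class and field names are hypothetical, and the two sample records use values from the country overview on this page.

```python
from dataclasses import dataclass
from enum import Enum, IntEnum


class ManagementLevel(Enum):
    """Information source for post codes or administrative areas."""
    NO_DATA = "NO_DATA"
    GEONAMES = "GEONAMES"
    OSM = "OSM"
    NATIONAL_AUTHORITY = "NATIONAL_AUTHORITY"


class DetailsLevel(IntEnum):
    """Ordered so that a higher value includes the lower levels."""
    LOCALITY = 1
    THOROUGHFARE = 2
    PREMISE = 3


@dataclass(frozen=True)
class CountryMetrics:
    country: str
    post_codes: ManagementLevel
    administrative_areas: ManagementLevel
    business_rules: int
    reference_concepts: int
    reference_sources: int
    details_level: DetailsLevel
    curation_quality: int  # quality level in the range 1..5

    def __post_init__(self):
        if not 1 <= self.curation_quality <= 5:
            raise ValueError("curation quality must be in the range 1..5")

    def supports(self, level):
        """True if curation provides information at the given level."""
        return self.details_level >= level


argentina = CountryMetrics(
    "Argentina", ManagementLevel.GEONAMES, ManagementLevel.GEONAMES,
    0, 2, 0, DetailsLevel.PREMISE, 5,
)
north_korea = CountryMetrics(
    "North Korea", ManagementLevel.NO_DATA, ManagementLevel.GEONAMES,
    0, 0, 0, DetailsLevel.THOROUGHFARE, 3,
)
print(argentina.supports(DetailsLevel.THOROUGHFARE))
print(north_korea.supports(DetailsLevel.PREMISE))
```

Modelling the details level as an ordered enumeration reflects that THOROUGHFARE includes LOCALITY and PREMISE includes THOROUGHFARE.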

Use the following table to understand the address curation quality of particular countries:

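Country | Post codes management level | Administrative area management level | Number of address business rules | Number of address reference concepts | Number of address reference sources | Address details level | Address curation quality
Afghanistan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Aland Islands | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Albania | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Algeria | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
American Samoa | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Andorra | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Angola | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Anguilla | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Antarctica | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Antigua and Barbuda | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Argentina | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 5
Armenia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Aruba | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Australia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 5
Austria | GEONAMES | GEONAMES | 0 | 30 | 0 | PREMISE | 5
Azerbaijan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Bahamas | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Bahrain | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Bangladesh | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Barbados | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Belarus | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Belgium | GEONAMES | GEONAMES | 0 | 20 | 0 | PREMISE | 5
Belize | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Benin | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Bermuda | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Bhutan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Bolivia | NO_DATA | GEONAMES | 0 | 3 | 0 | PREMISE | 4
Bosnia and Herzegovina | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Botswana | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Bouvet Island | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Brazil | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
British Indian Ocean Territory | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
British Virgin Islands | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Brunei | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Bulgaria | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Burkina Faso | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Burundi | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Cambodia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Cameroon | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Canada | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Canary Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Cape Verde | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Caribbean Netherlands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Cayman Islands | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Central African Republic | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Ceuta | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Chad | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Chile | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
China | NO_DATA | GEONAMES | 0 | 5 | 0 | PREMISE | 4
Christmas Island | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Cocos (Keeling) Islands | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Colombia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Comoros | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Congo | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Cook Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Costa Rica | GEONAMES | GEONAMES | 0 | 3 | 0 | PREMISE | 4
Croatia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Cuba | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Curacao | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Cyprus | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Czech Republic | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 5
Democratic Republic of the Congo | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Denmark | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 5
Djibouti | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Dominica | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Dominican Republic | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
East Timor | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Ecuador | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Egypt | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
El Salvador | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Equatorial Guinea | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Eritrea | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Estonia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Ethiopia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Falkland Islands | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Faroe Islands | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Fiji | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Finland | GEONAMES | GEONAMES | 0 | 3 | 0 | PREMISE | 5
France | GEONAMES | GEONAMES | 0 | 14 | 0 | PREMISE | 5
French Guiana | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
French Polynesia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
French Southern Territories | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Gabon | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Gambia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Georgia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Germany | GEONAMES | GEONAMES | 0 | 31 | 0 | PREMISE | 5
Ghana | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Gibraltar | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Greece | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Greenland | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Grenada | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Guadeloupe | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Guam | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Guatemala | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Guernsey | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Guinea | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Guinea-Bissau | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Guyana | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Haiti | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Heard Island and McDonald Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Honduras | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Hong Kong | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Hungary | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Iceland | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
India | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Indonesia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Iran | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Iraq | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Ireland | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Isle of Man | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Israel | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Italy | GEONAMES | GEONAMES | 0 | 3 | 0 | PREMISE | 5
Ivory Coast | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Jamaica | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Japan | GEONAMES | GEONAMES | 0 | 4 | 0 | PREMISE | 5
Jersey | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Jordan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Kazakhstan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Kenya | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Kiribati | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Kosovo | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Kuwait | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Kyrgyzstan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Laos | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Latvia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Lebanon | NO_DATA | GEONAMES | 0 | 3 | 0 | PREMISE | 4
Lesotho | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Liberia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Libya | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Liechtenstein | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Lithuania | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Luxembourg | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Macao | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Macedonia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Madagascar | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Malawi | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Malaysia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Maldives | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Mali | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Malta | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Marshall Islands | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Martinique | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Mauritania | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Mauritius | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Mayotte | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Melillia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Mexico | GEONAMES | GEONAMES | 0 | 3 | 0 | PREMISE | 5
Micronesia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Moldova | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Monaco | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Mongolia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Montenegro | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Montserrat | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Morocco | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Mozambique | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Myanmar | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Namibia | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Nauru | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Nepal | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Netherlands Antilles | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
New Caledonia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
New Zealand | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Nicaragua | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Niger | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Nigeria | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Niue | NO_DATA | GEONAMES | 0 | 1 | 0 | LOCALITY | 4
Norfolk Island | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
North Korea | NO_DATA | GEONAMES | 0 | 0 | 0 | THOROUGHFARE | 3
Northern Mariana Islands | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Norway | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Oman | NO_DATA | GEONAMES | 0 | 3 | 0 | PREMISE | 4
Pakistan | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Palau | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Palestine | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Panama | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Papua New Guinea | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Paraguay | NO_DATA | GEONAMES | 0 | 3 | 0 | PREMISE | 4
Peru | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Philippines | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Pitcairn Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Poland | GEONAMES | GEONAMES | 0 | 5 | 0 | PREMISE | 5
Portugal | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Puerto Rico | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Qatar | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Reunion | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Romania | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Russia | GEONAMES | GEONAMES | 0 | 3 | 0 | PREMISE | 4
Rwanda | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Saint Barthelemy | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Saint Helena, Ascension and Tristan da Cunha | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Saint Kitts and Nevis | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Saint Lucia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Saint Martin | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Saint Pierre and Miquelon | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Saint Vincent and the Grenadines | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Samoa | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
San Marino | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Sao Tome and Principe | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Saudi Arabia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Senegal | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Serbia | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Serbia (XS) | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Seychelles | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Sierra Leone | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Singapore | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Sint Maarten | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Slovakia | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Slovenia | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Solomon Islands | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Somalia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
South Africa | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
South Georgia and the South Sandwich Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
South Korea | NO_DATA | GEONAMES | 0 | 4 | 0 | PREMISE | 4
South Sudan | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Spain | GEONAMES | GEONAMES | 0 | 7 | 0 | PREMISE | 5
Sri Lanka | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Sudan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Suriname | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Svalbard | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Swaziland | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Sweden | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Switzerland | GEONAMES | GEONAMES | 0 | 35 | 0 | PREMISE | 5
Syria | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Taiwan | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Tajikistan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Tanzania | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Thailand | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
The Netherlands | GEONAMES | GEONAMES | 0 | 6 | 0 | PREMISE | 5
Togo | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Tokelau | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Tonga | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Trinidad and Tobago | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Tunisia | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Turkey | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Turkmenistan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Turks and Caicos Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Tuvalu | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Uganda | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Ukraine | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
United Arab Emirates | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
United Kingdom | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 4
United States | GEONAMES | GEONAMES | 0 | 2 | 0 | PREMISE | 5
United States Minor Outlying Islands | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
United States Virgin Islands | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Uruguay | NO_DATA | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Uzbekistan | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Vanuatu | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Vatican City | GEONAMES | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Venezuela | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Vietnam | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Wallis and Futuna | GEONAMES | GEONAMES | 0 | 1 | 0 | PREMISE | 4
Western Sahara | NO_DATA | GEONAMES | 0 | 0 | 0 | LOCALITY | 4
World | GEONAMES | GEONAMES | 0 | 33 | 0 | PREMISE | 5
Yemen | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Yugoslavia | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Zaire | NO_DATA | GEONAMES | 0 | 0 | 0 | PREMISE | 4
Zambia | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4
Zimbabwe | NO_DATA | GEONAMES | 0 | 2 | 0 | PREMISE | 4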