Data curation report

Managed element
Last edit: 12 May 2019 18:10:42
Support contact: Kai Hüner (a support contact is a member of the CDL Team who is responsible for a specific element of the CDL infrastructure)
The Data Curation Report presents the result of a data curation job. Data curation is a CDQ Cloud Service that enriches, corrects, and translates a set of business partner and address data. A report comprises details per record (e.g. added translations of names, corrected post codes) and charts that summarize the overall results.

A report is provided as an Excel file with the following sheets:

  • documentation: Link to this page, to provide up-to-date documentation to report users.
  • charts: Visual summaries of certain aspects of the data curation results. Particular charts are documented below.
  • curation results: Detailed results of the data curation per record with comparison of results with input data. Details are documented below.
  • country quality (chart base): Request similarity statistics per country, numerical basis of the Address Curation Quality chart.
  • changes (chart base): Statistics on data changes (e.g. added data, changed data, confirmed data) per address field, numerical basis of the Changed Address Fields chart.

Charts

The charts of the Data Curation Report summarize certain statistics of the data curation result, e.g. changes per address field or address curation quality per country.

Address Curation Quality

Address curation quality chart
The Address Curation Quality chart shows a "confidence" metric for the "top 10" countries of the given data set. The country bars are sorted by number of records per country in descending order, i.e. the chart shows bars for the 10 countries with the most records. The "confidence" metric groups the overall request similarity into 3 categories (low, some, high).

The "confidence" classification indicates the probability that a given result represents the requested address. For results in the "high confidence" range (0.7,1.0], result data is most likely better than the requested data: The high request similarity indicates that the response "fits" the requested data (e.g. same post code, similar thoroughfare), and missing elements (e.g. administrative area, geographic coordinates) are added from consistent and complete reference data.

"Some confidence" results (i.e. request similarity in (0.5,0.7]) also mostly provide correct and "better" results. However, some results may not fit the requested address and represent another address instead, e.g. one with a similar but different thoroughfare, or even a different locality (with a thoroughfare of the same name). By filtering results and checking questionable columns, correct and wrong results can be distinguished quite well.

Results in the "low confidence" range [0.0,0.5] most likely represent a different address. There may be cases with correct results but low request similarity due to local language versions and translation issues. However, most of these results are likely wrong and should not be copied automatically into operational data.
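The three ranges above can be sketched as a small classification function, together with a per-country count for the 10 countries with the most records. This is only an illustration under stated assumptions: the function names and the input shape (pairs of country code and request similarity as a fraction between 0 and 1) are hypothetical and not part of the report or the CDQ service.

```python
from collections import Counter


def confidence_category(similarity: float) -> str:
    """Map a request similarity in [0.0, 1.0] to a confidence category.

    Ranges as documented: (0.7, 1.0] high, (0.5, 0.7] some, [0.0, 0.5] low.
    """
    if not 0.0 <= similarity <= 1.0:
        raise ValueError(f"similarity out of range: {similarity}")
    if similarity > 0.7:
        return "high"
    if similarity > 0.5:
        return "some"
    return "low"


def top_country_confidence(records, top_n=10):
    """Count confidence categories for the top_n countries with most records.

    records: iterable of (country, similarity) pairs (hypothetical shape).
    Returns a dict ordered by record count, descending.
    """
    records = list(records)
    per_country = Counter(country for country, _ in records)
    top = [country for country, _ in per_country.most_common(top_n)]
    counts = {country: Counter() for country in top}
    for country, similarity in records:
        if country in counts:
            counts[country][confidence_category(similarity)] += 1
    return counts


# Made-up sample data for illustration only.
sample = [("CH", 0.92), ("CH", 0.70), ("CH", 0.31), ("DE", 0.71), ("DE", 0.50)]
result = top_country_confidence(sample)
```

Note that the boundary values belong to the lower category: a similarity of exactly 0.7 is "some confidence", and exactly 0.5 is "low confidence".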

Changed Address Fields

Summary of changed address fields
The Changed Address Fields chart shows to what extent the major address fields (administrative area, locality, post code, thoroughfare, thoroughfare number, and geographic coordinates) were changed by data curation. Change is classified by the following categories:
  • Missing: Data was missing in the request, and data curation could not add the missing information.
  • Confirmed: Data curation found reference data for the given field which is exactly the same as the given data.
  • Enriched: Data was missing in the request, and data curation could add the missing information.
  • Changed: Data curation found reference data for the given field which is different from the given data.

Statistics are calculated from the "change" columns in "curation results", e.g. "Locality value change", see example below. The "Changed" category combines the MINOR CHANGE and MAJOR CHANGE tags, which distinguish case-only changes from changes to the characters themselves.
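As a rough sketch of how such statistics could be derived from the "change" columns, the following maps the change tags to the four chart categories and counts them for one field. The column names follow the Locality example on this page. The mapping to "Missing" is an assumption: the page does not state how missing data is tagged, so the sketch treats any record with an empty curated value (or an unknown tag) as "Missing". The function names are hypothetical.

```python
from collections import Counter

# Tags that the "Changed" category combines, as described above.
CHANGED_TAGS = {"MINOR CHANGE", "MAJOR CHANGE"}


def change_category(curated_value: str, change_tag: str) -> str:
    """Map a change tag to one of the four chart categories."""
    if change_tag == "ADDED":
        return "Enriched"
    if change_tag in CHANGED_TAGS:
        return "Changed"
    if change_tag == "NO CHANGE" and curated_value.strip():
        return "Confirmed"
    # Assumption: nothing was found or added for this field.
    return "Missing"


def field_statistics(rows, field):
    """Count chart categories for one address field, e.g. "Locality".

    rows: list of dicts keyed by report column name (hypothetical shape).
    """
    stats = Counter()
    for row in rows:
        curated = row.get(f"{field} value curated") or ""
        tag = row.get(f"{field} value change") or ""
        stats[change_category(curated, tag)] += 1
    return stats


# Made-up sample rows for illustration only.
rows = [
    {"Locality value": "St Gulen", "Locality value curated": "St. Gallen",
     "Locality value change": "MAJOR CHANGE"},
    {"Locality value": "ZURICH", "Locality value curated": "Zurich",
     "Locality value change": "MINOR CHANGE"},
    {"Locality value": "", "Locality value curated": "Bern",
     "Locality value change": "ADDED"},
    {"Locality value": "Basel", "Locality value curated": "Basel",
     "Locality value change": "NO CHANGE"},
    {"Locality value": "", "Locality value curated": "",
     "Locality value change": "NO CHANGE"},
]
stats = field_statistics(rows, "Locality")
```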

Curation result details

Details on request and result comparison per record
The Data Curation Report provides data curation results for many fields. For clarity, the same set of columns is used for each field. See the following example for Locality:
  • Locality value: Data from input set, St Gulen in the right-hand example.
  • Locality value curated: Result from data curation for the given value, St. Gallen in the example.
  • Locality value change: Classification of what has changed. There may be NO CHANGE (i.e. input and result are equal), or data may be ADDED (i.e. input was empty, and data curation enriched the missing data). If data was changed, the report distinguishes MINOR CHANGE (i.e. only the character case changed) and MAJOR CHANGE (i.e. characters were changed, so the data is really different).
  • Locality: Provides the request similarity for the particular field, 85.71% in the example for St Gulen and St. Gallen.
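The change classification described above can be sketched as a comparison of the input value with the curated value. This is an illustration under assumptions, not the service's implementation: the function name is hypothetical, values are compared exactly (no trimming or normalization), and the case of a non-empty input with an empty result is not covered on this page, so the sketch simply lets it fall through to MAJOR CHANGE. The request similarity is not computed here because the page does not specify the underlying metric.

```python
def classify_change(value: str, curated: str) -> str:
    """Return the change tag for an input value and its curated result."""
    value = value or ""
    curated = curated or ""
    if value == curated:
        return "NO CHANGE"
    if not value:
        # Input was empty and curation provided data.
        return "ADDED"
    if value.casefold() == curated.casefold():
        # Only the character case differs.
        return "MINOR CHANGE"
    # Characters differ (assumption: also used for cases not documented).
    return "MAJOR CHANGE"


# The Locality example from this page: the characters differ.
tag = classify_change("St Gulen", "St. Gallen")
```

For the example values St Gulen and St. Gallen, the sketch yields MAJOR CHANGE, since the difference goes beyond character case.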