Data Quality is a somewhat underestimated concept in small to mid-scale projects, where it is often addressed far too late, during testing or bug fixing. There are plenty of articles that elaborate on why Data Quality should be a concern not just for project managers but also for solution architects and people in governance roles. However, it is quite difficult to find a simple framework you can use to assure data quality, particularly on small projects. The methods typically available are meant for large programmes, where you have to contract someone to build a specific solution or an enterprise-focused Data Quality Assurance practice, which is not what this blog is about. You'd have to reach into your wallet for that!
In this blog, I have attempted to put forward a simple, basic framework with which Data Quality assurance can be managed, monitored and measured, particularly to aid in test planning. It can be scaled up or down to fit needs in application development, solution architecture and enterprise-level architecture.
The methodology focuses on measuring and monitoring data quality levels for the key data items or entities that are used or produced by the product, domain or enterprise. We will use a hypothetical scenario in which we are trying to assess the data quality of a system (or collection of systems) that deals with product sales. To measure data quality quantitatively, the following key metrics may be used.
1. Uniqueness - achieved by ensuring there is no duplicate data within the same domain. For example, the sale of a product by the same vendor has not been reported differently for the same time period across different reports. A typical area to invest time in is ensuring each key data entity has a means of uniquely identifying its records, usually enforced using unique key constraints. The concept of a 'single source of truth' comes into play here (a sketch of how such checks might be automated follows this list).
2. Completeness - achieved by ensuring that a complete set of data is available for each entity. For example, the match rates obtained by consolidating sales reports against master product data records are of acceptable quality. This is sometimes difficult to assess, as you never know what is incomplete unless you have a truly complete set to compare with.
3. Accuracy - the correctness of the data, achieved by ensuring that the data is a true reflection of the actual circumstances. For example, the Tracked Sales Report produced matches the Individual Sales Reports received from the various points of sale, at both aggregated and granular levels. This often requires business users to eyeball samples to validate that the data is accurate and represents the best knowledge held by the business.
4. Timeliness - the data is up to date and delivered on time. For example, the sales period used in the Master Data is the same as that used by the reporting points of sale. This often gets complex when you have multiple systems processing the same data sets, i.e. high-velocity data processing.
5. Consistency - the data is the same in its definition, business rules, format and value across systems at any point in time. For example, the Master Data provided has the expected data structure, values and data types, unless changed through a known, validated change. Consistency also implies that the data is a true reflection of the real-world scenario, so it may overlap with accuracy.
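To make a couple of these metrics concrete, here is a minimal sketch (using pandas) of how a uniqueness and a completeness check might be automated. The column names and the tiny data frames are invented purely for illustration; your own sales and master data will look different.

```python
import pandas as pd

# Hypothetical sales report: one row per vendor/product/period (column names are illustrative).
sales = pd.DataFrame({
    "vendor_id":    ["V1", "V1", "V2", "V2"],
    "product_id":   ["P1", "P1", "P1", "P2"],
    "sales_period": ["2012-03", "2012-03", "2012-03", "2012-03"],
    "units_sold":   [10, 12, 7, 3],
})

# Uniqueness: no vendor should report the same product/period more than once.
key = ["vendor_id", "product_id", "sales_period"]
duplicates = sales[sales.duplicated(subset=key, keep=False)]
uniqueness_pct = 100 * (1 - len(duplicates) / len(sales))

# Completeness: match rate of reported products against the master product data.
master_products = pd.DataFrame({"product_id": ["P1", "P3"]})
matched = sales["product_id"].isin(master_products["product_id"])
completeness_pct = 100 * matched.mean()

print(f"Uniqueness: {uniqueness_pct:.0f}%, Completeness: {completeness_pct:.0f}%")
```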
Data Quality at each completed and iterative stage of development is monitored by aggregating percentage scores against set targets for each metric. Each data set needs to have its own DQ metrics assessed, and each DQ metric has a set of test scripts / quality measures which focus on the data sets involved and need to be contextually relevant. For theoretical correctness, I have considered Data Integrity to be covered by Completeness, Uniqueness and Accuracy, and therefore have not treated it as a separate metric.
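As a rough illustration of this aggregation, the snippet below rolls hypothetical per-test results for one data set up into a score per metric and a gap against target. The result rows and the simple mean are assumptions; you may prefer a weighted or worst-case aggregation.

```python
import pandas as pd

# Hypothetical per-test results for one data set (scores are % of records passing).
results = pd.DataFrame({
    "metric": ["Uniqueness", "Completeness", "Completeness", "Accuracy", "Timeliness", "Consistency"],
    "score":  [100.0,         80.0,           95.0,           98.0,       100.0,        90.0],
    "target": [100.0,        100.0,          100.0,          100.0,      100.0,        100.0],
})

# Aggregate to one score per metric and compare with the target (simple mean used here).
summary = results.groupby("metric").agg(score=("score", "mean"), target=("target", "mean"))
summary["gap"] = summary["target"] - summary["score"]
print(summary)
```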
The DQ Assurance artifact can be split by data set (product data, reference data and sales data), each of which has a list of data quality tests that need to be performed, categorised by the metric that the test assures.
The tests under each data set must also record a Test Phase, a Data State and other relevant attributes such as Platform and Users where these are directly involved in testing and assurance; otherwise those attributes can be addressed separately in a Data Definition artifact. The Test Phase correlates with the overall Test Plan and indicates in which phase of testing the test activity can be executed; typical test phases are Migration Testing, Functional Testing, UAT/Business Acceptance and post Go-Live Change Management. The Data State is a more technical indicator of where the data sits in the end-to-end process the application or solution deals with. It correlates with the business and data process proposed and agreed for the project, and says at which stage of the process the test needs to be conducted; typical data states are Raw Data, Ingested Data, Transformed Data and Processed Data.
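One lightweight way to capture this artifact in code is as a small catalogue of test records carrying exactly these attributes. This is only a sketch: the class and field names (DQTest, target_pct, etc.) are my own invention, and the example values mirror the sample row in the table further below.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class DQTest:
    """One entry in the DQ Assurance artifact; field names follow the attributes above."""
    test_id: int
    data_set: str           # e.g. "product data", "reference data", "sales data"
    description: str
    metric: str             # Uniqueness / Completeness / Accuracy / Timeliness / Consistency
    test_phase: str         # e.g. "Migration", "Functional", "UAT", "post Go-Live"
    data_state: str         # e.g. "Raw Data", "Ingested Data", "Transformed Data", "Processed Data"
    target_pct: float       # usually 100
    actual_pct: Optional[float] = None  # filled in once the test has been executed
    resource: str = ""      # who performs the test

# Example entry, mirroring the sample row in the table further below.
example = DQTest(
    test_id=2,
    data_set="sales data",
    description="Does the data include products licensed in the period being reported on?",
    metric="Completeness",
    test_phase="Migration",
    data_state="Raw Data",
    target_pct=100.0,
    actual_pct=80.0,
    resource="Data/QA analyst",  # hypothetical value; the artifact only asks "who does the test?"
)
```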
Measuring Results
Each test activity will ideally have a target of 100%, unless another figure is a more appropriate measure. The target values are calculated as the percentage of records that pass the test case over the total population of records in that data set / entity. Example:
| ID | Test | Metric | Test Phase | Data State | Target | Actual | Resource |
|----|------|--------|------------|------------|--------|--------|----------|
| 2 | Does the data include products licensed in the period being reported on? | Completeness | Migration | Raw Data | 100% | 80% - this could mean that of the 100 products the POS reported sales for in March 2012, only 80 were made available in March 2012, i.e. 20 were possibly made available in 2011. | Who does the test? |
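To show how a figure like the 80% above might be produced, here is a minimal sketch of that completeness test. The column names, dates and availability logic are assumptions made for illustration only.

```python
import pandas as pd

# Hypothetical raw POS data: 100 products with reported sales for March 2012,
# joined to the date each product was made available (licensed).
reported = pd.DataFrame({
    "product_id":     [f"P{i}" for i in range(1, 101)],           # 100 products
    "available_from": ["2012-03-01"] * 80 + ["2011-11-01"] * 20,  # 20 pre-date the period
})
reported["available_from"] = pd.to_datetime(reported["available_from"])

# Pass if the product was made available within the reported sales period (March 2012).
period_start, period_end = pd.Timestamp("2012-03-01"), pd.Timestamp("2012-03-31")
passed = reported["available_from"].between(period_start, period_end)

actual_pct = 100 * passed.mean()   # passing records / total records
print(f"Actual: {actual_pct:.0f}% against a 100% target")   # -> 80%
```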
The summary results of the Data Quality Assurance tests are compared against the targets and can be represented visually to identify where Data Quality can be improved.
The more blue you have in the radar graph, the poorer your Data Quality is. Each data set should have one of these radar graphs plotted to give an executive summary of the quality of data in that domain. The numeric results are then used, either during testing or during normal operation of the solution, to identify areas where data quality is poor and the root causes behind the gaps.
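If you want to produce a radar graph like the one described, a minimal matplotlib sketch is below. The per-metric scores are made up, and the shaded region stands for the gap between actual and target.

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical per-metric results for one data set (actual vs. target, in %).
metrics = ["Uniqueness", "Completeness", "Accuracy", "Timeliness", "Consistency"]
target  = [100, 100, 100, 100, 100]
actual  = [100, 80, 98, 100, 90]

# One angle per metric; repeat the first point to close the polygons.
angles = np.linspace(0, 2 * np.pi, len(metrics), endpoint=False).tolist()
angles += angles[:1]
target_c, actual_c = target + target[:1], actual + actual[:1]

ax = plt.subplot(polar=True)
ax.plot(angles, target_c, label="Target")
ax.plot(angles, actual_c, label="Actual")
ax.fill_between(angles, actual_c, target_c, alpha=0.3)  # shaded area = gap to target
ax.set_xticks(angles[:-1])
ax.set_xticklabels(metrics)
ax.legend(loc="lower right")
plt.show()
```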
I hope this blog was useful; I'd welcome queries or discussion if you're interested in this field.