
Friday, 29 June 2012

Data Quality Assurance

Data Quality is a somewhat underestimated concept in small to mid-scale projects, where it is often addressed far too late, during testing or bug fixing. There are plenty of articles which elaborate on why Data Quality should be a concern, not just for project managers but also for solution architects and folks in governance roles. However, it's quite difficult to find a simple framework you could use to assure data quality, particularly for small projects. The methods typically available are meant for large programmes, where you have to contract someone to build a specific solution or an enterprise-focused Data Quality Assurance practice, which is not what this blog is about. You'd have to reach into your wallets for that!

In this blog, I have made an attempt at putting forward a simple and basic framework with which Data Quality assurance can be managed, monitored and measured, particularly to aid in Test Planning. This may be toned up or down to fit needs in Application Development, Solution Architecture and Enterprise-level Architecture.
The methodology focuses on measuring and monitoring data quality levels for the key data items or entities that are used or produced by the product, domain or enterprise. We will use a hypothetical scenario where we are trying to assess the data quality of a system (or collection of systems) that deals with product sales. To measure data quality quantitatively, the following key metrics may be used (a rough sketch of SQL checks for some of them follows the list).

1. Uniqueness - achieved by ensuring there is no duplicate data within the same domain. For example, the sale of a product by the same vendor has not been reported differently for the same time period across different reports. The typical area to invest time in is ensuring each key data entity has a means of uniquely identifying its records, usually enforced using unique key constraints. The concept of a 'single source of truth' comes into play here.

2. Completeness - achieved by ensuring that a complete set of data is available for each entity. For example, the match rates obtained by consolidating sales reports against master product data records are of acceptable quality. This is sometimes difficult to assess, as you never know what is incomplete unless you have a truly complete set to compare with.

3. Accuracy - the correctness of the data, achieved by ensuring that the data is a true reflection of the actual circumstances. For example, the Tracked Sales Report produced matches the Individual Sales Reports received from the various Points of Sale, at both aggregated and granular levels. This often requires Business Users to eyeball samples to validate that the data is accurate and represents the best knowledge held by the business.

4. Timeliness - the data is up to date and delivery of the data is timely. For example, the sales period used in the Master Data is the same as that used by the reporting Points of Sale. This often gets complex when you have multiple systems processing the same data sets, i.e. high-velocity data processing.

5. Consistency - the data is the same in its definition, business rules, format and value across systems at any point in time. For example, the Master Data provided has the expected data structure, values and data types, unless changed through a known, validated change. Consistency also implies that the data is a true reflection of the real-world scenario, hence it may overlap with accuracy.
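
To make this a little more concrete, here is a minimal sketch of how the first two metrics could be checked in SQL. The SalesReport and MasterProduct tables and their columns are illustrative assumptions, not part of any particular system:

-- Uniqueness: vendor / product / period combinations reported more than once
SELECT VendorID, ProductID, SalesPeriod, COUNT(*) AS Occurrences
FROM SalesReport
GROUP BY VendorID, ProductID, SalesPeriod
HAVING COUNT(*) > 1;

-- Completeness: sales records that cannot be matched to a master product record
SELECT SR.ProductID
FROM SalesReport SR
    LEFT JOIN MasterProduct MP ON SR.ProductID = MP.ProductID
WHERE MP.ProductID IS NULL;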

The Data Quality at each completed and iterative stage of development will be monitored by aggregating percentage scores against set targets for each metric. Each data set needs to have its own DQ metrics assessed. Each DQ metric has a set of test scripts / quality measures which focus on the data sets involved and need to be contextually relevant. For theoretical correctness, I have considered Data Integrity to be covered by Completeness, Uniqueness and Accuracy, and therefore haven't treated it as a separate metric.

The DQ Assurance artifact can be split by data set (product data, reference data and sales data), with each data set having a list of data quality tests that need to be performed, categorised by the metric that each test assures.

The tests under each data set must also record the Test Phase, Data State and other relevant attributes such as Platform and Users where these are directly involved in testing and assurance; otherwise they can be addressed separately in a Data Definition artifact. The Test Phase correlates with the overall Test Plan and indicates in which phase of testing the test activity can be executed. Typical Test Phases are Migration Testing, Functional Testing, UAT/Business Acceptance and post-Go-Live Change Management. The Data State is a more technical indicator of the state of the data within the end-to-end process that the application / solution deals with. It correlates with the business and data process proposed and agreed for the project, and says at which stage of the process the test needs to be conducted. Typical Data States are: Raw Data, Ingested Data, Transformed Data and Processed Data.

Measuring Results
Each test activity will ideally have a target of 100%, unless a different value is a more appropriate measure. Values are calculated as the percentage of records that pass the test case out of the total population of records in that data set / entity. Example (a SQL sketch of such a calculation follows the example record):
ID: 2
Test: Does the data include products licensed in the period being reported on?
Metric: Completeness
Test Phase: Migration
Data State: Raw Data
Target: 100%
Actual: 80% - this could mean that out of the 100 products the POS has reported sales for in March 2012, only 80 were made available in March 2012, i.e. 20 were possibly made available in 2011
Resource: who performs the test
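
As a rough sketch of how such a percentage could be produced, assuming hypothetical SalesReport and MasterProduct tables with SalesPeriod, ProductID and AvailableFrom columns (these names are illustrative, not from any real system):

-- Percentage of products in the March 2012 sales report that were made available in March 2012
-- (assumes one SalesReport row per product per period)
SELECT 100.0 * SUM(CASE WHEN MP.AvailableFrom BETWEEN '20120301' AND '20120331'
                        THEN 1 ELSE 0 END) / COUNT(*) AS ActualPct
FROM SalesReport SR
    INNER JOIN MasterProduct MP ON SR.ProductID = MP.ProductID
WHERE SR.SalesPeriod = '2012-03';

The resulting percentage is then compared against the 100% target set for that test.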

The summary results of the Data Quality Assurance Tests will be compared against targets and can be visually represented to identify where the Data Quality can be improved.



The more blue you have in the radar graph, the poorer your Data Quality is. Each data set should have one of these radar graphs plotted to give an executive summary of the quality of data in that domain. The numeric results are then used, either during testing or during normal operation of the solution, to identify areas where data quality is poor and the root causes behind the gaps.
Hope this blog was useful; I'd welcome queries or discussion if you're interested in this field.

Thursday, 9 February 2012

Basic SQL for Data Reconciliation


Excel's VLOOKUP function is probably the most used function for basic data analysis in a spreadsheet environment. Below are some SQL code snippets that do what VLOOKUP does in principle, but can be used when the data sits in a SQL database for whatever reason.

The following tables will be used to describe some very common scenarios. Let's assume we have two independent data stores which are not integrated and hence require regular reconciliation. Table A contains the Products Inventory. Table B contains the Online Catalog of those products. Table C is a mapping table across the two platforms. Table D contains the sales figures made online. (If you want to run the snippets against real tables, a sketch of the table definitions is given after Table D.)


Product_ID  | ProductName           | Price  | Category | OnlineCode | TotalSale
ZJU897JHG   | JB Raincoat           | 110.00 | Clothing | NULL       | NULL
JOK23MK30   | Autograph Luxury Col  | 180.00 | Clothing | NULL       | NULL
UIO0-SDNK   | Animal Leather Wallet | 20.00  | Clothing | NULL       | NULL
PO0IKJU-90  | Austin Reed B/W Shirt | 60.00  | Clothing | NULL       | NULL
78H-09-POI  | Nike Sports Jacket    | 80.00  | Clothing | NULL       | NULL
Table A: ProductDetails

OnlineCode  | CreationDate | Stock
OL-FAS-002  | 12.03.2011   | 5
OL-FOD-091  | 23.06.2011   | 25
EX-COM-09   | 13.05.2011   | 60
OL-EXC-178  | 30.11.2011   | 14
EX-FAS-450  | 05.10.2011   | 10
Table B: OnlineProductCatalog

Product_ID  | OnlineCode
ZJU897JHG   | OL-FAS-002
II9080JM77  | OL-FOD-091
JOK23MK30   | OL-EXC-178
UIO0-SDNK   | EX-FAS-450
Table C: ProductOnlineMapping

OnlineCode  | Date       | Qty
OL-FAS-002  | 14.03.2011 | 3
OL-FAS-002  | 15.03.2011 | 6
OL-FOD-091  | 23.06.2011 | 12
OL-FOD-091  | 25.06.2011 | 9
EX-COM-09   | 13.05.2011 | 20
EX-COM-09   | 13.05.2011 | 13
OL-EXC-178  | 30.11.2011 | 5
OL-EXC-178  | 30.11.2011 | 5
OL-EXC-178  | 30.11.2011 | 3
EX-FAS-450  | 05.10.2011 | 2
EX-FAS-450  | 05.10.2011 | 7
Table D: OnlineSales
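
If you want to try the snippets below against real tables, definitions along these lines would do. This is only a sketch; the data types and lengths are assumptions, since the tables above only show sample values:

CREATE TABLE ProductDetails (
    Product_ID  varchar(20),
    ProductName varchar(100),
    Price       decimal(10, 2),
    Category    varchar(50),
    OnlineCode  varchar(20) NULL,
    TotalSale   int NULL
);

CREATE TABLE OnlineProductCatalog (
    OnlineCode   varchar(20),
    CreationDate date,
    Stock        int
);

CREATE TABLE ProductOnlineMapping (
    Product_ID varchar(20),
    OnlineCode varchar(20)
);

CREATE TABLE OnlineSales (
    OnlineCode varchar(20),
    [Date]     date,
    Qty        int
);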




Case 1:   Identify products that are not available in the Online Store
SQL Code A: Nested query

SELECT DISTINCT Product_ID
FROM ProductDetails
WHERE Product_ID NOT IN
(
    SELECT DISTINCT Product_ID
    FROM ProductOnlineMapping
);
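
Note: if ProductOnlineMapping.Product_ID could ever contain NULLs, the NOT IN form above would return no rows at all. A NOT EXISTS version (a minor variation, not from the original post) avoids that, as does the LEFT JOIN form below:

SELECT DISTINCT P.Product_ID
FROM ProductDetails P
WHERE NOT EXISTS
(
    SELECT 1
    FROM ProductOnlineMapping PM
    WHERE PM.Product_ID = P.Product_ID
);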

SQL Code B: LEFT JOIN

SELECT DISTINCT P.Product_ID
FROM ProductDetails P
    LEFT JOIN ProductOnlineMapping PM ON P.Product_ID = PM.Product_ID
    LEFT JOIN OnlineProductCatalog OC ON PM.OnlineCode = OC.OnlineCode
WHERE OC.OnlineCode IS NULL;


Case 2:   Update Product Details table so it contains each item’s Online Code
SQL Code: UPDATE script with INNER JOIN

-- Update via the alias so the join drives which rows get updated
UPDATE P
SET OnlineCode = PM.OnlineCode
FROM ProductDetails P
    INNER JOIN ProductOnlineMapping PM ON P.Product_ID = PM.Product_ID;


Case 3:   Update ProductDetails table so it contains total online sale per item
SQL Code:             

-- Declare a table variable to store the results of the aggregate query
DECLARE @ProductTotalSale TABLE (
    Product_ID varchar(20),
    TotalSale  int
);

-- Store the results of the aggregate query
INSERT INTO @ProductTotalSale (Product_ID, TotalSale)
SELECT P.Product_ID, SUM(OS.Qty)
FROM ProductDetails P
    INNER JOIN ProductOnlineMapping PO ON P.Product_ID = PO.Product_ID
    INNER JOIN OnlineSales OS ON PO.OnlineCode = OS.OnlineCode
GROUP BY P.Product_ID;

-- Update the ProductDetails table using the stored values
UPDATE P
SET TotalSale = PTS.TotalSale
FROM ProductDetails P
    INNER JOIN @ProductTotalSale PTS ON P.Product_ID = PTS.Product_ID;
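
As an alternative sketch, the same update can be done in a single statement by joining onto a derived table, which avoids the table variable altogether:

UPDATE P
SET TotalSale = PTS.TotalSale
FROM ProductDetails P
    INNER JOIN (
        SELECT PO.Product_ID, SUM(OS.Qty) AS TotalSale
        FROM ProductOnlineMapping PO
            INNER JOIN OnlineSales OS ON PO.OnlineCode = OS.OnlineCode
        GROUP BY PO.Product_ID
    ) PTS ON P.Product_ID = PTS.Product_ID;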

Feel free to leave a comment if you have suggestions / queries.