
Friday, 29 June 2012

Data Quality Assurance

Data Quality is a somewhat underestimated concept in small to mid-scale projects, where it is often addressed far too late, during testing or bug fixing. There are plenty of articles which elaborate on why Data Quality should be a concern, not just for project managers but also for solution architects and folks in governance roles. However, it's quite difficult to find a simple framework you could use to assure data quality, particularly for small projects. The methods typically available are meant for large programmes, where you have to contract someone to build a specific solution or an enterprise-focused Data Quality Assurance practice, which is not what this blog is about. You'd have to reach into your wallets for that!

In this blog, I have made an attempt at putting forward a simple and basic framework with which Data Quality assurance can be managed, monitored and measured, particularly to aid in Test Planning. This may be toned up or down to fit needs in Application Development, Solution Architecture and Enterprise-level Architecture.
The methodology focuses on measuring and monitoring data quality levels for the key data items or entities that are used or produced by the product, domain or enterprise. We will use a hypothetical scenario where we are trying to assess the data quality of a system (or collection of systems) that deals with product sales. To measure data quality quantitatively, the following key metrics may be used (a rough sketch of SQL checks for some of them follows the list).

1. Uniqueness - achieved by ensuring there is no duplicate data within the same domain. For example, the sale of a product by the same vendor has not been reported differently for the same time period across different reports. The typical area to invest time in is ensuring each key data entity has a means of uniquely identifying its records, usually enforced using unique key constraints. The concept of a 'single source of truth' comes into play here.

2. Completeness - achieved by ensuring that a complete set of data is available for each entity. For example, the match rates obtained by consolidating sales reports against master product data records are of acceptable quality. This is sometimes difficult to assess, as you never know what is incomplete unless you have a truly complete set to compare with.

3. Accuracy - the correctness of the data, achieved by ensuring that the data is a true reflection of the actual circumstances. For example, the Tracked Sales Report produced matches the Individual Sales Reports received from the various Points of Sale, at both aggregated and granular levels. This often requires Business Users to eyeball samples to validate that the data is accurate and represents the best knowledge held by the business.

4. Timeliness - the data is up to date and delivery of the data is timely. For example, the sales period used in the Master Data is the same as that used by the reporting Points of Sale. This often gets complex when you have multiple systems processing the same data sets, i.e. high-velocity data processing.

5. Consistency - the data is the same in its definition, business rules, format and value across systems at any point in time. For example, the Master Data provided has the expected data structure, values and data types, unless changed through a known, validated change. Consistency also implies that the data is a true reflection of the real-world scenario, hence it may overlap with accuracy.
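
To make this a little more concrete, here is a minimal sketch of how the first two metrics could be checked in SQL. The SalesReport and MasterProduct tables and their columns are illustrative assumptions, not part of any particular system:

-- Uniqueness: vendor / product / period combinations reported more than once
SELECT VendorID, ProductID, SalesPeriod, COUNT(*) AS Occurrences
FROM SalesReport
GROUP BY VendorID, ProductID, SalesPeriod
HAVING COUNT(*) > 1;

-- Completeness: sales records that cannot be matched to a master product record
SELECT SR.ProductID
FROM SalesReport SR
    LEFT JOIN MasterProduct MP ON SR.ProductID = MP.ProductID
WHERE MP.ProductID IS NULL;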

The Data Quality at each completed and iterative stage of development will be monitored by aggregating percentage scores against set targets for each metric. Each data set needs to have its own DQ metrics assessed. Each DQ metric has a set of test scripts / quality measures which focus on the data sets involved and need to be contextually relevant. For theoretical correctness, I have considered Data Integrity to be covered by Completeness, Uniqueness and Accuracy, and therefore haven't treated it as a separate metric.

The DQ Assurance artifact can be split by data set (product data, reference data and sales data), with each data set having a list of data quality tests that need to be performed, categorised by the metric that each test assures.

The tests under each data set must also record the Test Phase, Data State and other relevant attributes such as Platform and Users where these are directly involved in testing and assurance; otherwise they can be addressed separately in a Data Definition artifact. The Test Phase correlates with the overall Test Plan and indicates in which phase of testing the test activity can be executed. Typical Test Phases are Migration Testing, Functional Testing, UAT/Business Acceptance and post-Go-Live Change Management. The Data State is a more technical indicator of the state of the data within the end-to-end process that the application / solution deals with. It correlates with the business and data process proposed and agreed for the project, and says at which stage of the process the test needs to be conducted. Typical Data States are: Raw Data, Ingested Data, Transformed Data and Processed Data.

Measuring Results
Each test activity will ideally have a target of 100%, unless a different value is a more appropriate measure. Values are calculated as the percentage of records that pass the test case out of the total population of records in that data set / entity. Example (a SQL sketch of such a calculation follows the example record):
ID: 2
Test: Does the data include products licensed in the period being reported on?
Metric: Completeness
Test Phase: Migration
Data State: Raw Data
Target: 100%
Actual: 80% - this could mean that out of the 100 products the POS has reported sales for in March 2012, only 80 were made available in March 2012, i.e. 20 were possibly made available in 2011
Resource: who performs the test
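
As a rough sketch of how such a percentage could be produced, assuming hypothetical SalesReport and MasterProduct tables with SalesPeriod, ProductID and AvailableFrom columns (these names are illustrative, not from any real system):

-- Percentage of products in the March 2012 sales report that were made available in March 2012
-- (assumes one SalesReport row per product per period)
SELECT 100.0 * SUM(CASE WHEN MP.AvailableFrom BETWEEN '20120301' AND '20120331'
                        THEN 1 ELSE 0 END) / COUNT(*) AS ActualPct
FROM SalesReport SR
    INNER JOIN MasterProduct MP ON SR.ProductID = MP.ProductID
WHERE SR.SalesPeriod = '2012-03';

The resulting percentage is then compared against the 100% target set for that test.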

The summary results of the Data Quality Assurance Tests will be compared against targets and can be visually represented to identify where the Data Quality can be improved.



The more blue you have in the radar graph, the poorer your Data Quality is. Each data set should have one of these radar graphs plotted to give an executive summary of the quality of data in that domain. The numeric results are then used, either during testing or during normal operation of the solution, to identify areas where data quality is poor and the root causes behind the gaps.
Hope this blog was useful; I'd welcome queries or discussion if you're interested in this field.

Thursday, 9 February 2012

Basic SQL for Data Reconciliation


Excel's VLOOKUP function is probably the most used function for basic data analysis in a spreadsheet environment. Below are some SQL code snippets that do what VLOOKUP does in principle, but can be used when the data sits in a SQL database for whatever reason.

The following tables will be used to describe some very common scenarios. Let's assume we have two independent data stores which are not integrated and hence require regular reconciliation. Table A contains the Products Inventory. Table B contains the Online Catalog of those products. Table C is a mapping table across the two platforms. Table D contains the sales figures made online. (If you want to run the snippets against real tables, a sketch of the table definitions is given after Table D.)


Product_ID  | ProductName           | Price  | Category | OnlineCode | TotalSale
ZJU897JHG   | JB Raincoat           | 110.00 | Clothing | NULL       | NULL
JOK23MK30   | Autograph Luxury Col  | 180.00 | Clothing | NULL       | NULL
UIO0-SDNK   | Animal Leather Wallet | 20.00  | Clothing | NULL       | NULL
PO0IKJU-90  | Austin Reed B/W Shirt | 60.00  | Clothing | NULL       | NULL
78H-09-POI  | Nike Sports Jacket    | 80.00  | Clothing | NULL       | NULL
Table A: ProductDetails

OnlineCode  | CreationDate | Stock
OL-FAS-002  | 12.03.2011   | 5
OL-FOD-091  | 23.06.2011   | 25
EX-COM-09   | 13.05.2011   | 60
OL-EXC-178  | 30.11.2011   | 14
EX-FAS-450  | 05.10.2011   | 10
Table B: OnlineProductCatalog

Product_ID  | OnlineCode
ZJU897JHG   | OL-FAS-002
II9080JM77  | OL-FOD-091
JOK23MK30   | OL-EXC-178
UIO0-SDNK   | EX-FAS-450
Table C: ProductOnlineMapping

OnlineCode  | Date       | Qty
OL-FAS-002  | 14.03.2011 | 3
OL-FAS-002  | 15.03.2011 | 6
OL-FOD-091  | 23.06.2011 | 12
OL-FOD-091  | 25.06.2011 | 9
EX-COM-09   | 13.05.2011 | 20
EX-COM-09   | 13.05.2011 | 13
OL-EXC-178  | 30.11.2011 | 5
OL-EXC-178  | 30.11.2011 | 5
OL-EXC-178  | 30.11.2011 | 3
EX-FAS-450  | 05.10.2011 | 2
EX-FAS-450  | 05.10.2011 | 7
Table D: OnlineSales
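
If you want to try the snippets below against real tables, definitions along these lines would do. This is only a sketch; the data types and lengths are assumptions, since the tables above only show sample values:

CREATE TABLE ProductDetails (
    Product_ID  varchar(20),
    ProductName varchar(100),
    Price       decimal(10, 2),
    Category    varchar(50),
    OnlineCode  varchar(20) NULL,
    TotalSale   int NULL
);

CREATE TABLE OnlineProductCatalog (
    OnlineCode   varchar(20),
    CreationDate date,
    Stock        int
);

CREATE TABLE ProductOnlineMapping (
    Product_ID varchar(20),
    OnlineCode varchar(20)
);

CREATE TABLE OnlineSales (
    OnlineCode varchar(20),
    [Date]     date,
    Qty        int
);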




Case 1:   Identify products that are not available in the Online Store
SQL Code A: Nested query

SELECT DISTINCT Product_ID
FROM ProductDetails
WHERE Product_ID NOT IN
(
    SELECT DISTINCT Product_ID
    FROM ProductOnlineMapping
);
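
Note: if ProductOnlineMapping.Product_ID could ever contain NULLs, the NOT IN form above would return no rows at all. A NOT EXISTS version (a minor variation, not from the original post) avoids that, as does the LEFT JOIN form below:

SELECT DISTINCT P.Product_ID
FROM ProductDetails P
WHERE NOT EXISTS
(
    SELECT 1
    FROM ProductOnlineMapping PM
    WHERE PM.Product_ID = P.Product_ID
);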

SQL Code B: LEFT JOIN

SELECT DISTINCT P.Product_ID
FROM ProductDetails P
    LEFT JOIN ProductOnlineMapping PM ON P.Product_ID = PM.Product_ID
    LEFT JOIN OnlineProductCatalog OC ON PM.OnlineCode = OC.OnlineCode
WHERE OC.OnlineCode IS NULL;


Case 2:   Update Product Details table so it contains each item’s Online Code
SQL Code: UPDATE script with INNER JOIN

-- Update via the alias so the join drives which rows get updated
UPDATE P
SET OnlineCode = PM.OnlineCode
FROM ProductDetails P
    INNER JOIN ProductOnlineMapping PM ON P.Product_ID = PM.Product_ID;


Case 3:   Update ProductDetails table so it contains total online sale per item
SQL Code:             

-- Declare a table variable to store the results of the aggregate query
DECLARE @ProductTotalSale TABLE (
    Product_ID varchar(20),
    TotalSale  int
);

-- Store the results of the aggregate query
INSERT INTO @ProductTotalSale (Product_ID, TotalSale)
SELECT P.Product_ID, SUM(OS.Qty)
FROM ProductDetails P
    INNER JOIN ProductOnlineMapping PO ON P.Product_ID = PO.Product_ID
    INNER JOIN OnlineSales OS ON PO.OnlineCode = OS.OnlineCode
GROUP BY P.Product_ID;

-- Update the ProductDetails table using the stored values
UPDATE P
SET TotalSale = PTS.TotalSale
FROM ProductDetails P
    INNER JOIN @ProductTotalSale PTS ON P.Product_ID = PTS.Product_ID;
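
As an alternative sketch, the same update can be done in a single statement by joining onto a derived table, which avoids the table variable altogether:

UPDATE P
SET TotalSale = PTS.TotalSale
FROM ProductDetails P
    INNER JOIN (
        SELECT PO.Product_ID, SUM(OS.Qty) AS TotalSale
        FROM ProductOnlineMapping PO
            INNER JOIN OnlineSales OS ON PO.OnlineCode = OS.OnlineCode
        GROUP BY PO.Product_ID
    ) PTS ON P.Product_ID = PTS.Product_ID;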

Feel free to leave a comment if you have suggestions / queries.