Preparing Data for Data Analytics

Author: Sanjay Ankolikar

Data preparation is the act of transforming (or pre-processing) raw data from disparate sources into refined information assets that can be used effectively for various business purposes. The data preparation step covers all activities needed to construct the final dataset for modeling from the raw data. Tasks include database, table, record, and field selection, as well as cleaning, aggregation, and transformation of data.

Since a variety of data sources are involved (e.g. business applications, databases), the data is usually not ready for immediate analysis, for the following reasons:

  • Data might not be clean and therefore not suitable for further analysis. Data might be incomplete, wrong, or inconsistent.
  • Data might be distributed in many tables, and values might be recorded at an inconvenient granularity for the business purpose at hand.

This data reorganization task is often called data preparation.

Data preparation steps result in:

  • Valid data
  • Complete data
  • Consistent data
  • Uniform data
  • Accurate data

In this post we share some of the data preparation issues we have experienced and possible ways to deal with them, without going into statistical techniques. Some of the major data issues we have noticed are:

  • Missing values
  • Duplicate data
  • Inconsistent data
  • Invalid data
  • Noise
  • Outliers

Missing Values

Several variables used in business applications may not have values. For example, a customer's age may be known while the income is not shared.

Name  | Age | Income
Brian | 48  |
Sara  |     | 150K
Brad  | 35  | 80K

There are a variety of reasons for missing values.

  1. The source system allows optional fields, so users may not provide values for every variable.
  2. In some cases, missing data is caused by failure of the sensors that provide the data or of the systems that collect it.
  3. Changes to user data are not maintained. For example, the system may allow users to add or modify age, income, and other values, and these changes may happen over a period of time.

The severity of the problem increases as more records in the dataset have missing values.

Actions suggested:

  1. Remove missing data – use this technique if the records with missing values are few compared to the available data. All leading databases allow filtering of such data, and you can choose to remove it before further use.
  2. Ignore the attribute (or the affected records) when the missing values make it unusable.
  3. Replace the missing values – you can replace a missing value with another value based on business understanding of the data, as in the sketch after this list. For example:
  • Mean/median – use the mean or median of the available samples in place of the missing values.
  • Most frequent value – use the mode of the attribute.
  • Customized value – for example, replace the missing income of users younger than 18 with 0.
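
For illustration, here is a minimal sketch in pandas of the options above, using the hypothetical Name/Age/Income table from earlier; the under-18 income rule is an assumed business rule, not a prescription.

```python
import pandas as pd

# Hypothetical data mirroring the table above; None marks a missing value.
df = pd.DataFrame({
    "Name": ["Brian", "Sara", "Brad"],
    "Age": [48, None, 35],
    "Income": [None, 150_000, 80_000],
})

# How widespread is the problem? Fraction of missing values per column.
print(df.isna().mean())

# Option 1: remove records with missing values (only if they are few).
removed = df.dropna()

# Option 3: replace missing values.
# Customized value (assumed business rule): income 0 for customers under 18.
df.loc[df["Age"].lt(18) & df["Income"].isna(), "Income"] = 0
# Mean/median (or mode) imputation for the remaining gaps.
df["Age"] = df["Age"].fillna(df["Age"].median())
df["Income"] = df["Income"].fillna(df["Income"].mean())
```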

Duplicate Data

The cause of duplicate data errors ranges from human error to customers providing slightly different information at different points in time (e.g. a consumer lists his name as Jonathan Bradley on one form and Jon Bradley on another). Whatever the cause, it is essential that businesses remove duplicates from their databases to achieve their business goals.

For example, the same customer may have multiple addresses. This can happen when a customer's address changes and the system allows additional address entries, or records a new address instead of updating the existing one. In either case, identifying the current address becomes confusing.

Name  | Address
Brian | 9 Willedrop Road Apartment 29-Bloomington IL 61701
Brian | 1220 Ponds Road Apart 29-Bloomington IL 61704

Duplicates can also occur when merging data from multiple sources. For example, two customers may show the same address when one of them has moved and the new address has not been updated. There are also scenarios where a common address is valid for multiple customers residing at the same location.

Actions suggested:

  1. Delete the older record – you can use the latest address entry as the current address and delete the older record.
  2. Merge the duplicate records – merging also requires understanding of the data; for example, records whose addresses differ only in abbreviations such as Road vs. RD or Street vs. ST can be merged, as in the sketch after this list.
  3. Replace the data with the most probable value based on business understanding.
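
A minimal sketch in pandas of merging duplicates by normalizing address abbreviations; the synonym table and the "keep the latest entry" rule are illustrative assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Brian", "Brian"],
    "Address": [
        "9 Willedrop Road Apartment 29-Bloomington IL 61701",
        "9 Willedrop RD Apart 29-Bloomington IL 61701",
    ],
})

# Normalize common abbreviations (Road vs RD, Street vs ST) before comparing.
SYNONYMS = {"RD": "ROAD", "ST": "STREET", "APT": "APARTMENT", "APART": "APARTMENT"}

def normalize(address: str) -> str:
    tokens = address.upper().replace("-", " ").replace(",", " ").split()
    return " ".join(SYNONYMS.get(t, t) for t in tokens)

df["addr_key"] = df["Address"].map(normalize)

# Keep the most recent record for each (Name, normalized address) pair.
deduped = df.drop_duplicates(subset=["Name", "addr_key"], keep="last")
```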

Inconsistent Data

Data inconsistency occurs when similar data is kept in different formats in two different files, or when data must be matched between files. It can also happen when data is received from multiple business applications.

Some of the examples of inconsistent data are:

  • Age = "42" and Birthday = "01/01/2012"
  • Earlier the rating scale was 1, 2, 3 and now it is A, B, C

Such issues arise due to incorrect implementation of the business rules across the source systems.

Actions suggested:

  1. Ensure that the business rules are correctly implemented across systems (a sketch reconciling the examples above follows this list).
  2. If possible, common data shared between different business applications can be centralized.
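
A minimal pandas sketch of reconciling the two examples above; the rating mapping, the reference date, and the one-year tolerance for the age/birthday check are assumptions.

```python
import pandas as pd

df = pd.DataFrame({
    "Rating": [1, 3, "B"],              # old 1/2/3 scale mixed with new A/B/C
    "Age": [42, 30, 25],
    "Birthday": ["2012-01-01", "1994-02-10", "1999-07-21"],
})

# Bring the old numeric rating scale onto the new letter scale.
df["Rating"] = df["Rating"].replace({1: "A", 2: "B", 3: "C"})

# Flag rows where the recorded age disagrees with the birthday.
reference_date = pd.Timestamp("2024-01-01")   # assumed "as of" date
derived_age = (reference_date - pd.to_datetime(df["Birthday"])).dt.days // 365
df["age_mismatch"] = (df["Age"] - derived_age).abs() > 1   # flags the 42 vs 2012 row
```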

Invalid Data

Invalid data enters the dataset when the system allows nonstandard values or through data entry errors. Common examples: a zip code can only be 5 digits, but the system allows more than 5 to be typed; age can't be negative, but the system allows a negative number to be recorded.

Name  | Zip Code
Brian | 61701
Brian | 617041

Actions suggested:

  1. Use an external/authoritative source to get the correct value – for zip code issues, an external source can be used to determine the exact zip code for the city within the state.
  2. Apply reasonable logic to replace or clear the invalid entry, as in the sketch below.
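
A minimal pandas sketch of flagging invalid entries using the rules described above (5-digit zip code, non-negative age); blanking out the bad values so they can be re-sourced is one example of "reasonable logic", not the only option.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Brian", "Brian", "Sara"],
    "Zip Code": ["61701", "617041", "6170"],
    "Age": [48, 48, -3],
})

# Flag values that violate simple validity rules.
bad_zip = ~df["Zip Code"].str.fullmatch(r"\d{5}")
bad_age = df["Age"] < 0

# Blank out invalid entries so they can be corrected later, e.g. from an
# external postal-service lookup for the zip code.
df.loc[bad_zip, "Zip Code"] = None
df.loc[bad_age, "Age"] = None
```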

Noise

Noise (unwanted characters or values) can be introduced when the system allows special characters or extra characters in place of a specific data value. It can also occur in audio recordings because of background noise, or in images where incorrect light exposure distorts the data used for image processing.

Name   | Zip Code
Brian  | 61701
Bri*an | 617^41

Actions suggested:

  1. Filter out the noisy data – take care when filtering, as it can also remove valid data (see the sketch after this list).
  2. Fix the source
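
A minimal pandas sketch of stripping characters that can never be valid in a field; the patterns are assumptions, and as noted above, aggressive filtering can also remove valid data.

```python
import pandas as pd

df = pd.DataFrame({
    "Name": ["Brian", "Bri*an"],
    "Zip Code": ["61701", "617^41"],
})

# Keep only characters that are plausible for each field.
df["Name"] = df["Name"].str.replace(r"[^A-Za-z' \-]", "", regex=True)
df["Zip Code"] = df["Zip Code"].str.replace(r"\D", "", regex=True)
```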

Outliers

An outlier is a data sample that is not in the same range as the rest of the dataset. It can be created by a malfunctioning sensor reporting values that are too high or too low. In many cases people tend to ignore such samples; however, in some scenarios it is important to understand them. For example, in fraud detection, an IT security firm may want to investigate the outliers more closely than the normal data.

Actions suggested:

  1. Remove outliers if they are not the focus of the analysis.
  2. Examine the outliers closely and use business knowledge to deal with them, as in the sketch below.
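
A minimal pandas sketch using the common 1.5 × IQR rule to separate outliers; the threshold is a convention, not a rule from this post, and whether you drop or investigate the outliers depends on the analysis.

```python
import pandas as pd

income = pd.Series([48_000, 52_000, 50_500, 49_000, 51_000, 400_000])

q1, q3 = income.quantile([0.25, 0.75])
iqr = q3 - q1
is_outlier = (income < q1 - 1.5 * iqr) | (income > q3 + 1.5 * iqr)

cleaned = income[~is_outlier]    # action 1: remove when out of scope
suspects = income[is_outlier]    # action 2: examine closely (e.g. fraud detection)
```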

Summary

Data preparation results in faster, more efficient data analysis. To reduce the effort of preparing data, your source systems may need to be modified. Where a large amount of legacy data exists, data cleaning can be done in staging areas before final consumption. If you ignore these symptoms, meaningful data analysis is nearly impossible.

The following tips will help you get data preparation right the first time:

  1. Understand the business requirements and the results needed.
  2. Break business inquiries down into facts, dimensions, filters, and order.
  3. Standardize common data used across business applications – contact information, for example, typically appears in several business applications, and standardizing it helps avoid data cleansing issues.
  4. Define the level of matching – data administrators must define the level of matching and the tolerance levels used to identify duplicate records (see the sketch at the end of this list).
  5. Utilize business rules to identify duplicates – make sure the software you use identifies and restricts duplicate entries.
  6. Use just the data you need, rather than cleaning all the data available.
  7. Define metric calculations in advance to prevent end-user disagreements.
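
As an illustration of tip 4, here is a minimal sketch of a similarity threshold using Python's standard-library difflib; the 0.8 tolerance is an assumed value that a data administrator would tune for their own data.

```python
from difflib import SequenceMatcher

def is_probable_duplicate(a: str, b: str, tolerance: float = 0.8) -> bool:
    """Treat two strings as the same entity if they are similar enough."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio() >= tolerance

# "Jonathan Bradley" vs "Jon Bradley" scores about 0.81, so it is flagged here.
print(is_probable_duplicate("Jonathan Bradley", "Jon Bradley"))
```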