When data is inaccurate, it can produce information that is of no use to anyone. Below is an explanation of how an organisation can improve the quality of its business information.
Purpose and Context of Data
Collecting data isn’t free: it takes resources to gather and maintain. A business should always take the time to understand why each piece of data is being collected and to make sure there is a use for it. Every piece of data collected should serve a business or mission purpose. Collecting irrelevant data wastes time and resources, and a business needs to question why it is collecting data it has no use for.
A data dictionary is a valuable tool for documenting data. It contains the rules regarding data quality. The following are some essential elements a good data dictionary should include to improve the quality of data:
- Identification of ALL data elements – Many data elements accumulate over time, mostly in informal data systems such as spreadsheets, but a lot of time and money can be saved on training and projects by keeping an up-to-date list of every data element.
- Data element definitions – If nobody can understand what an element is, it can be misinterpreted, which reduces the quality of data. Comprehensive definitions should be provided so each data element can be understood.
- Validation rules – Data is usually stored in a database of some type, and databases provide validation rules that ensure all data entered is valid and invalid data cannot be entered. Validation rules should be specified so that each piece of data holds a valid value. Examples of validation rules are:
- Data type – number, integer, text, date etc.
- Minimums and maximums – e.g. can sales be negative in a given month?
- Acceptable and unacceptable values – is ‘IRE’ a valid code for Ireland?
- Codification rules – What does a column of numbers represent? Are invoice numbers randomly generated, or does each segment mean something?
- Classification – The rules regarding which group a piece of data belongs to
- Units – The rules that keep measurement units consistent at all times, e.g. mm, cm, inches
- Authority identification – In large, complex systems the ultimate source of a data element is critical to data quality; it may not be essential in simpler systems.
- Frequency of updates – Some data elements are static and rarely change, while dynamic data is driven by business events or transactions.
- Proper use/access description – Not every data element needs to be equally accessible to all users of a system; user roles may be used to restrict access.
- Specify scope – A data dictionary should state what is included in and excluded from each data element, depending on the scope of a query, although it does not have to include any underlying data (such as what may be used for calculations).
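The elements above can be sketched as a single data-dictionary entry with a small validation check. This is a minimal, hypothetical example in Python: the field names, the country codes and the `validate` helper are illustrative assumptions, not from any real system.

```python
# A hypothetical data-dictionary entry combining the elements above:
# definition, validation rules (type, accepted values), authority,
# update frequency and access roles. All values are illustrative.
country_code = {
    "name": "country_code",
    "definition": "ISO 3166-1 alpha-3 code of the customer's country",
    "type": str,
    "accepted_values": {"IRL", "GBR", "USA", "FRA"},  # 'IRE' would be rejected
    "authority": "ISO 3166 register",                 # ultimate source
    "update_frequency": "static",                     # rarely changes
    "access_roles": {"sales", "finance"},             # who may read it
}

def validate(entry, value):
    """Return True only if the value satisfies the entry's validation rules."""
    if not isinstance(value, entry["type"]):
        return False
    if "accepted_values" in entry and value not in entry["accepted_values"]:
        return False
    return True
```

With rules recorded like this, `validate(country_code, "IRL")` passes while the invalid code `"IRE"` is rejected before it ever reaches the database.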
Snapshots of Static Data
Data quality issues arise when data changes over time between snapshots of static data. Data in a snapshot may be factually correct yet difficult to use or manipulate in analytical systems. When using analytics, data quality is important; here are some common issues to look at to keep data clean:
- Caps – The incorrect use of capital letters can cause filtered searches to return incorrect results
- Zip codes – Making sure that zip codes are correct
- Dates – Making sure that date information does not transfer as a number instead of a date, and that it is in the right order (US and European dates can differ in how they are displayed and interpreted)
- Abbreviations – Rd for Road, Ave for Avenue, etc. All abbreviations should be consistent or converted into the same form
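The cleaning issues above can be sketched with a couple of small helpers. This is an illustrative example only: the abbreviation table and the five-digit US-style zip format are assumptions, and a real cleaning pipeline would need far more rules.

```python
import re

# Hypothetical cleaning helpers for the issues listed above; the
# abbreviation table and zip format are illustrative assumptions.
ABBREVIATIONS = {"rd": "Road", "ave": "Avenue", "st": "Street"}

def clean_address(text):
    """Normalise capitalisation and expand inconsistent abbreviations."""
    words = []
    for word in text.split():
        key = word.rstrip(".").lower()
        words.append(ABBREVIATIONS.get(key, word.capitalize()))
    return " ".join(words)

def valid_zip(code):
    """Check a US-style 5-digit zip code (an assumed format)."""
    return re.fullmatch(r"\d{5}", code) is not None
```

For example, `clean_address("42 mAiN rd")` returns `"42 Main Road"`, fixing both the stray capitals and the inconsistent abbreviation in one pass.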
Classifying data in a database can be difficult when the categories are subjective, so when entering values it is important to use classification rules that are as objective as possible. Human bias and misinterpretation can damage data quality and render data useless. For example, if a product rating scale ranges from ‘good’ to ‘excellent’, it implies bias on the part of the company, as the minimum rating a product can receive is ‘good’.
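One way to keep a classification objective is to derive the category from a measurement rather than a judgement. The sketch below is a hypothetical example: the use of return rate as the measure, the thresholds and the four-level scale (which includes ‘poor’, so the minimum is not artificially ‘good’) are all illustrative assumptions.

```python
# A sketch of an objective classification rule: the rating is derived from
# a measured return rate, not a subjective label. Thresholds are illustrative.
def rate_product(return_rate):
    """Classify a product from its measured return rate (0.0 - 1.0)."""
    if return_rate < 0.01:
        return "excellent"
    if return_rate < 0.05:
        return "good"
    if return_rate < 0.15:
        return "fair"
    return "poor"
```

Because the rule is a fixed function of a measured value, two different people classifying the same product will always get the same answer.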
Collect Data from One Source
It is similar to the game Chinese whispers: if you tell one person in a circle something and it is passed around, it comes back to you as something completely different. The same can happen to data, and it affects data quality. Locally replicating, maintaining and redistributing copies of data increases errors, and generating the same data from different sources likewise creates discrepancies. Data should only ever be generated once, though it can be used by many systems after that.
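The idea of generating data once can be sketched as downstream systems reading a single master record by reference instead of keeping their own copies. This is a toy illustration with made-up names, not a prescription for any particular architecture:

```python
# 'Generate once, use many times': both consumers read the master record
# by reference, so a correction in one place is seen everywhere.
# All names and values are illustrative.
master_customers = {101: {"name": "Acme Ltd", "country": "IRL"}}

def customer_for_billing(customer_id):
    return master_customers[customer_id]   # reads the single source

def customer_for_shipping(customer_id):
    return master_customers[customer_id]   # same source, no local copy

# One correction in the master data...
master_customers[101]["country"] = "FRA"
# ...and billing and shipping automatically agree.
```

Had each system kept its own copy, the correction would have had to be replicated everywhere, and any copy missed becomes a silent error.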
Remove Computed or Derived Data
There should be no need to store a piece of information more than once. For example, if a business stores the profit per product, the cost per product and the profit margin per product, it is being inefficient, as the profit margin can be derived from the cost and the profit. The business only needs to store two values, not three. Extrapolated over a large database, the redundancy increases storage and maintenance costs. To solve issues like this, an algorithm that calculates the result can be stored; the result itself does not need to be.
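The profit-margin example above can be sketched as follows. Only cost and profit are stored; the margin is a derived property computed on demand (here defined, as an assumption, as profit divided by selling price, i.e. cost plus profit):

```python
from dataclasses import dataclass

@dataclass
class Product:
    cost: float    # stored
    profit: float  # stored

    @property
    def margin(self):
        """Derived, never stored: profit as a fraction of selling price.

        Assumes selling price = cost + profit; adjust if your margin
        definition differs.
        """
        return self.profit / (self.cost + self.profit)
```

For instance, `Product(cost=80, profit=20).margin` evaluates to `0.2`, and because the margin is recomputed each time, it can never drift out of sync with the stored cost and profit.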
Awareness of Missing Data
Electricity companies sometimes issue estimated bills for the current month based on your electricity use in previous years. This happens if they don’t get around to manually reading the meter. However, if the electricity company wanted to compute the overall average of your actual meter readings, it could not use the estimated readings, as this would produce an unsound result. It is important to be aware of the difference between estimated and real data when analysing data and drawing conclusions from it.
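Keeping estimated and actual values distinguishable can be as simple as a flag on each record, as in this illustrative sketch (the field names and figures are assumptions):

```python
# Each reading carries a flag so averages can use real measurements only.
readings = [
    {"kwh": 300, "estimated": False},
    {"kwh": 320, "estimated": True},   # estimated bill - excluded below
    {"kwh": 340, "estimated": False},
]

actual = [r["kwh"] for r in readings if not r["estimated"]]
average_actual = sum(actual) / len(actual)   # averages 300 and 340 only
```

Without the flag, the estimate would silently blend into the average and the distinction between real and guessed data would be lost for good.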
Reviews to Find Anomalies
When working with data, sometimes you have to just jump in and look at it in order to develop an understanding of what ‘normal’ looks like. Once you have that understanding, you can spot when something is just not right and take action to correct it. Without a basic understanding of the data, there is no context or sense of what ‘normal’ data may look like. Only through the experience of looking at data does one acquire this skill.
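That intuition for ‘normal’ can be partially automated once a baseline exists. The sketch below flags values far from the mean for manual review; the three-standard-deviation threshold is a common rule of thumb, not a universal standard, and real anomaly detection would be tuned to the data at hand.

```python
import statistics

# Flag values that are 'just not right': anything more than `threshold`
# standard deviations from the mean is queued for manual review.
def find_anomalies(values, threshold=3.0):
    mean = statistics.mean(values)
    stdev = statistics.stdev(values)
    return [v for v in values if abs(v - mean) > threshold * stdev]
```

For example, in a list of twenty readings around 10 with one reading of 1000, the outlier is returned for review while the typical values pass. The automated check complements, rather than replaces, the human sense of ‘normal’ described above.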