In his previous blog post, Richard Neale looked at the three Vs that many use to define big data and the impact they can have on data and information governance. In this post, I’d like to follow up on that and look at how poor quality data can be introduced into a big data project and what strategies need to be employed to prevent it from happening.
Poor data quality can be traced back to three root causes. That may sound low, but it’s not the number of sources that matters so much as the types of data and the processes used to load that data into the big data repository. The challenge most organisations face is that these three are often managed separately, making consistency difficult to achieve.
Root Cause #1: The Initial Data Load
In almost all big data projects, there will be a go-live for the system when it moves from a pilot or proof of concept into a production system. This is the moment it becomes generally available as an analytical source for the business. It’s often the case that a dedicated project team or competency centre will have been set up to manage this process, and of course it will have worked hard to ensure that the data is of the highest quality possible at the moment of launch.
This is an extremely valuable exercise because it’s often one of the few times when IT and business people work together to define what good data looks like and to set the rules that will be applied to the data load. As data is loaded, it can be tested against these rules and, if rejected, routed to the appropriate data steward for confirmation and remediation, preferably at the source. In addition to the six classic data quality dimensions of completeness, conformity, consistency, accuracy, duplication and integrity, a key test is business relevance. This rule is about identifying and loading only data that is deemed “active” by the business. It’s a common misconception of big data that you analyse everything and somehow insights will follow. Loading irrelevant data just creates noise that makes those analytical insights even harder to achieve.
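To make the load-time checks concrete, here is a minimal sketch of how rule-based validation with steward routing might look. The rule names map to the dimensions above; the field names (`customer_id`, `postcode`, `status`) and the rules themselves are hypothetical examples, not part of any specific product.

```python
# Hypothetical sketch of a rule-driven initial load: each rule covers one
# data quality dimension, and records that fail any rule are routed to a
# steward queue for remediation instead of being loaded.

RULES = {
    "completeness": lambda r: bool(r.get("customer_id")),
    "conformity":   lambda r: str(r.get("postcode", "")).replace(" ", "").isalnum(),
    "relevance":    lambda r: r.get("status") == "active",  # load only "active" data
}

def validate(record):
    """Return the names of the rules this record fails."""
    return [name for name, rule in RULES.items() if not rule(record)]

def load_batch(records):
    """Split a batch into loadable records and a steward review queue."""
    loaded, steward_queue = [], []
    for record in records:
        failures = validate(record)
        if failures:
            steward_queue.append({"record": record, "failed": failures})
        else:
            loaded.append(record)
    return loaded, steward_queue
```

The important design point is that the rules live in one shared structure, separate from the loading code, so the same definitions can be reviewed by the business and reused for later load cycles.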
Another common mistake made after the initial load is to throw away the rules that have been collected, on the grounds that they won’t be needed again. Most organisations will go through many bulk load cycles: merger and acquisition activity will require the integration of new data sets, as will further consolidation of analytic systems. It’s therefore important to put processes in place so the rules can be reused, with a review to ensure they still reflect current business conditions.
Root Cause #2: Application Integration
In the world of big data, this is the root cause that will probably create the most challenges. Here we are talking about the regular, possibly real-time, data feeds that will be populating your big data repository. Inevitably there will be a wide variety of sources, some more structured than others. There will also be a mixture of both internal and external sources of information. The external sources are particularly problematic because you have no control over their quality.
The approach to take here is to set up a data quality firewall which, just like an internet firewall, prevents attacks on your big data by applying the same rules you developed for the initial load. Consistent and rigorous application of your data quality standards is key to preventing bad data from entering your big data environment, and the use of a technology platform like SAP Data Services will make it easier to do this.
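The firewall idea can be sketched as a filter sitting between a feed and the repository, reusing the load-time rules on every incoming record. This is an illustrative sketch only; the rule set and the `on_reject` callback are assumptions, and in a real deployment this role would be played by the integration platform rather than hand-written code.

```python
# Hypothetical "data quality firewall": the same rules defined for the
# initial load are reapplied to every record arriving from a feed, so
# bad data is stopped at the boundary rather than inside the repository.

RULES = {
    "completeness": lambda r: bool(r.get("customer_id")),
    "relevance":    lambda r: r.get("status") == "active",
}

def firewall(feed, on_reject):
    """Yield only records that pass every rule.

    Records that fail are handed to on_reject(record, failed_rules),
    which in practice might notify a data steward or the external supplier.
    """
    for record in feed:
        failed = [name for name, rule in RULES.items() if not rule(record)]
        if failed:
            on_reject(record, failed)
        else:
            yield record
```

Because the firewall is a generator over the feed, the same code path works whether the feed is a batch file or a near-real-time stream.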
Incidentally, rather than giving up on controlling external data quality, why not work in partnership with your external data suppliers to improve their data quality? There’s no reason why you can’t extend your own data stewardship processes out beyond the walls of the organisation if you are using a data governance platform like SAP Information Steward and the SAP Information Steward Accelerator by BackOffice Associates.
Root Cause #3: Data Maintenance
It’s an unfortunate fact of life that humans and mistakes go together, so you won’t be surprised to learn that the third root cause is business as usual: data entry by the people who use your source systems every day. This includes employees, but also, increasingly, external users such as suppliers and customers. It’s inevitable that typos will occur, decimal points will appear in the wrong place, or incorrect or default options will be selected from menus.
The problem, of course, is that these simple errors can easily propagate across multiple operational and analytical systems, with the danger that critical business decisions will then be made based on faulty data. Many budgeting processes have been derailed by non-existent revenue sources. Before you set up that new office in your fastest-growing region, it might be worth checking whether an alphabetical list is driving an incorrect analysis.
Much has been written elsewhere about methods for incentivising employees and changing behaviours to improve the accuracy of their data entry, but it’s also important to apply the data quality firewall approach to this data source. It’s always more cost effective to apply the rules at the source, so the same data governance platform should clearly be used to validate all data entry, and a robust master data process should be put in place to set the right checks and balances on updates to critical master and reference data.
The key to fixing the three root causes of poor data quality is the consistent application of data standards across all of your data sources, and the development of a firewall mentality to protect your big data. If your next question is “how do I get started?”, I’ll cover that in my next blog post.