My recent interview with Timo Elliott (@timoelliott) for his business analytics blog, on the need for big data governance, received some nice feedback and almost 200 tweets, so I thought it would be a great idea to expand on some of the points I made in the interview through a series of blog posts. In this post, I want to outline what I believe is a worrying trend by throwing out some of the hard lessons we have learnt in recent years as information governance has matured as a discipline within organisations.
In the next few blogs we’ll also look at some of the things you need to worry about beyond just data quality, and finally we’ll explore what sort of processes and strategies need to be in place in order to achieve data governance on big data.
First, I have a confession to make. As I alluded to in Timo’s interview, I personally find the term big data rather meaningless. If you remove the word “big” from almost any discussion or article on big data, you will find that it still makes perfect sense because big data is about treating all data as a true business asset and, sadly, it’s the case today that not many organisations do that. However, if calling a project a “big data project” is what it takes for management to invest in data and information governance solutions, and to start using it as an asset to run their organisation by facts, then I’m all for it. Let’s just not get too hung up on the name.
Much has been written about what exactly big data is, how you define it, and what makes it different from ordinary data, or “small” data as it is sometimes called. By far the most popular definition is the three Vs: Volume, Velocity and Variety, which were first described by Gartner’s Doug Laney back in 2001. Of course, there have been many debates on the merits of that definition; some have tried to add further terms like veracity and value. Whatever your views on the definition, it’s true that those three original Vs are the cause of data governance headaches.
The V Most Clearly Associated with Big Data
Volume is the obvious place to begin. This is the V that most clearly associates itself with “big” data. Matt Aslett of 451 Research makes the case in an excellent blog, “Big data reconsidered: it’s the economics, stupid”, that there is no such thing as data that is big; rather, it’s all about the new approaches to, and economics of, storing data. Historically, the cost and technical complexity of storing data were the limiting factors on how much you could store and analyse. If you read too much into the hype, you could quite literally believe that by just dumping all of your data into a large bucket and analysing it, a whole new set of insights will somehow magically appear. The reality is that in the race to store everything we may not be applying the same data quality criteria to that big data. Worse still, the quality of the data will vary depending on the information governance that surrounds the source. It might be that your core ERP has a mature and effective information governance program, but that might not be the case for some of the other data sources, particularly the newer and less structured ones. And when those multiple sources are mashed together quickly, the poor-quality sources will pollute the entire big data set, resulting in a set of untrustworthy data.
The V That Is the Very Spice of Life
Managing quality across many sources of data is something I will return to in a later blog but it’s clear that the variety attribute is another dimension that makes data governance in a big data scenario challenging. In the past, the economics of storing data limited the sources available. Now those economics have changed simultaneously with explosive growth in the amount of unstructured data that organisations are generating via web and social media channels. As we will see later in this series, it’s imperative that these two worlds of data – the traditional structured types that we are most familiar with and the new unstructured ones – come together to start setting organisation-wide data quality and governance rules.
The V Most to Blame
However, it’s my firm belief that the velocity dimension is most to blame for the abandonment of data governance processes. Just as the ability to analyse larger and more complete data sets promises organisations new insights into their business performance, so does real-time analytics. There are many business processes that could be transformed by the inclusion of real-time insight, especially when dealing with time-sensitive information, whether it’s financial risk or retail sales data. The mistake here is to believe that anything that gets in the way of data flowing from the source to the business user’s dashboard is a bad thing. I’ve seen many discussions about removing all latency from the data flows, perhaps using technology like data replication tools to move the data with as little processing as possible, especially if the target system is an in-memory database with its inherent ability to process billions of items of data with sub-second response times. I believe that you still need to apply robust data quality rules to data before it is presented to that user. Otherwise you are only creating, as Don Loden so succinctly put it in his Information Week article The Biggest Big Data Myths of 2013, “fast trash”: information that is quickly available to the business but of very little value. The question we have to ask ourselves is this: would a few additional seconds of processing to consistently apply data governance rules to real-time big data really reduce its value by a significant amount, compared to the danger of presenting the business with data that is just plain incorrect?
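To make the idea of a lightweight quality gate a little more concrete, here is a minimal Python sketch of what checking records in flight might look like. It is only an illustration under stated assumptions: the field names (customer_id, amount, currency), the three rules and the quarantine list are all hypothetical, and a real deployment would use whatever rules your governance programme has already agreed.

```python
from typing import Callable, Dict, Iterable, Iterator, List, Tuple

Record = Dict[str, object]
Rule = Callable[[Record], bool]

# Hypothetical data quality rules, invented for this illustration.
RULES: Dict[str, Rule] = {
    "has_customer_id": lambda r: bool(r.get("customer_id")),
    "amount_is_non_negative": lambda r: isinstance(r.get("amount"), (int, float))
    and r["amount"] >= 0,
    "currency_is_known": lambda r: r.get("currency") in {"GBP", "EUR", "USD"},
}


def failed_rules(record: Record) -> List[str]:
    """Return the names of every rule the record breaks."""
    return [name for name, rule in RULES.items() if not rule(record)]


def govern(
    stream: Iterable[Record],
    quarantine: List[Tuple[Record, List[str]]],
) -> Iterator[Record]:
    """Pass clean records downstream; set aside the rest with the reasons."""
    for record in stream:
        failures = failed_rules(record)
        if failures:
            quarantine.append((record, failures))
        else:
            yield record


# A tiny stand-in for a real-time feed.
incoming = [
    {"customer_id": "C001", "amount": 19.99, "currency": "GBP"},
    {"customer_id": "", "amount": 5.00, "currency": "GBP"},
    {"customer_id": "C002", "amount": -3.50, "currency": "XXX"},
]

quarantined: List[Tuple[Record, List[str]]] = []
trusted = list(govern(incoming, quarantined))
```

Only the first record reaches the trusted list; the other two are held back along with the names of the rules they broke, so the business sees nothing that has not passed the agreed checks. The per-record cost here is a handful of simple comparisons, which is the kind of overhead the question above is asking you to weigh against showing incorrect data.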
In summary, I think it’s difficult to justify why big data should be exempt from the normal rules of data governance. If you take a shortcut through this important process, you’ll end up with a very large mess. Next time I want to look at some other issues of big data governance beyond the veracity dimension. In the meantime, if you want to get a really quick view of the quality of your current data, why not talk to us about running a Data Quality Assessment on your data, big or small?
Originally contributed by Richard Neale, Marketing Manager at ENTOTA, a BackOffice Associates company.