Big Data: Separating Hype from Reality

An interview with Matt Kuperholz about why so many people get big data wrong


One of the things I like most about being a part of Future Crunch is that I get to meet a lot of very smart people. There's plenty of them around in our hometown of Melbourne. Its unique mix of creative communities, economic prosperity and liveability makes it a pretty attractive destination for people who want to work on things that are new and interesting. One of those people is Matt Kuperholz, who I was introduced to through my housemate towards the end of last year. Matt is a partner at PwC Australia, and in charge of their data analytics team. It didn't take long for the two of us to start talking about digital culture; at the time I was reading Jaron Lanier's Who Owns the Future?, and was doing a lot of thinking about open-source movements and digital currencies. I was immediately impressed by Matt's ability to get to the heart of these issues. He's someone with the rare ability to talk about the inner workings of the internet in a way that's accessible to a non-technical person like myself, without dumbing it down or losing the complexity.

At Future Crunch our aim is to provide clear, critical and intelligent analysis of big future trends. We like to get behind the buzzwords and try to really understand what it's all about. Yes, we're optimistic, but we're not naive (hence the 'critical' part). One of the biggest buzzwords out there right now is big data, and when I thought about doing an article on it, Matt was the first person who came to mind. His area of specialisation is the application of machine learning technologies to data analysis for business, and he's been doing this for top tier companies in Australia for more than 20 years. He's also one of the nicest people you'll meet, and has a passion for conscious, human experiences that's inspiring. When I asked him how he'd feel about an interview he happily agreed. We sat down for a cup of coffee a few weeks ago, and here's what went down…

I wonder if you could tell me a little more about your background?

I was born in 1972 in South Africa, and have been into technology for as long as I can remember. I was one of those children who took a toaster to bed instead of a teddy bear. I've been into computers from an early age, so my decision to work in the field was a deliberate choice. Computers weren't everywhere in the 1970s like they are now, but I knew they were going to matter. I got a scholarship to become an actuary (because that's what you did back then if you were good at maths) and studied at the University of Melbourne. But computer science was my passion, and it didn't take me long to see that all the actuarial work I was doing could be made better with a higher degree of integration with technology.

I think if actuaries had seen the writing on the wall they would have become the pre-eminent data mining cohort. But they didn't; the industry was too established, complacent and resistant to change. That's why I jumped into an Artificial Intelligence software startup in the late 1990s. The pressures of that startup meant that I had to build data mining processes. That's an important point to remember — thanks to advances in UX and usability, most consumers now think you can just throw in data and get results. But no software works like that. You don't just throw in data; it needs to be massaged. I ended up inventing a process around the use of software that was distinguished by its ability to model spaces with thousands of dimensions or co-variables. That was really different from conventional processes, which used at most a few dozen variables.

So you were in the startup world — how did you end up working for larger businesses?

The beauty of being in a startup was that I was exposed to lots of different industries. Right from day one we were talking to potential clients about the value of an asset they all had — data. And of course the value of that asset changes depending on what you do with it, how you run it, how you twist it. I took those lessons and launched my own consulting business to teach companies how to manage their data. I did that from 2001 to 2011, and my main client was a top tier global consulting firm. They were ahead of most competitors because they took a bet on data early, and I helped them build a data mining practice that was the first of its kind in Australia and went on to become a global leader. Right now I'm working with PwC, helping them do something similar but better.

What is big data?

A working definition is that it's more information than you can handle with traditional approaches, which means you have to do something different. It's not necessarily new though. For example, I helped a large retailer look at every advertisement in their catalogues back in the 1990s and align them all with their advertising revenues, and then with every single line item of every single sale over time. Does that mean we were doing big data 20 years ago? Another example — what about airlines in the 1970s and 1980s? They had to take data from 40 disparate systems to describe the entire customer journey, from their marketing to ticket purchases, to check-in and then the flights themselves. That counts as big data by almost any definition you care to use. This stuff is not new.

What is new is that thanks to Moore's Law we are on an exponential curve. That curve is now so steep on measures not only of data volume but also data velocity that it's on everyone's radar. That's why it's a buzzword. But it's been around for a long time. The other thing that's going on is that we're getting a much greater variety of data, thanks to the incredible rise in our ability to store data on multiple devices. And then of course there's connectedness. We're in a world where about three billion people are online, and two billion of them are connecting via smartphones. Those are the first three Vs of big data (by another common working definition): volume, velocity and variety. And if you think the growth curve is steep now, wait until the Internet of Things (IoT) is in full swing.

There are two other Vs as well. The first is veracity — how good is your data quality? Researchers know this: it's not enough to just take measurements, you need to make sure those measurements are being taken properly. Admittedly this is contentious; there's a branch of data analytics that suggests you can overcome these problems even with poor quality data. But of course the better your data quality is, the easier it is to manipulate. And finally, there's value. In the early days you had to convince people of the value of analytics. That's no longer the case. The world has changed. Today your decision making has to be evidence based. In the old days that was only really for scientists and geeks. Today it's everyone; even the marketers, who are now pointing to big data to drum up sales.

Does that mean there’s a gap between what people say big data is and what it actually is?

Big data is mostly a marketer's term. I'd love it if people dropped the term big data and just used analytics. People are drowning in data and aren't sure what to do with it, and this has been the case for most of this century. For example, let's say you're a big manufacturing company with thousands of sensors installed in your factories taking readings. That's easy these days because sensors are cheap. But even if you crunch all the data and use it to produce reports, you're still not using it to extract maximum value — you're just looking backwards. One of data's big selling points is that you can use it to produce real time information. But you have to ask yourself "how many organisations actually need that?" Sure it's useful for border patrol security, or air traffic controllers, or Amazon. But most ASX500 companies don't need that much real time information. They just need enough information to make decisions about what they're going to do in a week or a month's time.

That's why big data is so over-hyped. Everyone says they're doing it but very few are. And those that are doing it aren't doing it very well. Information brokers or companies at the front of the information economy are probably the best at it — Google, Amazon, Facebook, Uber, Alibaba. They're doing it properly because it's built right into their business model. Everyone else has room to improve. And banks, telcos and airlines are comparatively behind pure information economy companies. Sure they've got all the parts, such as the hardware and the people to do it properly. But often they're looking at a bowl full of quality ingredients and saying "where's my awesome cake?" I don't think that's going to get sorted out until the need to get it right is embedded into the actual structure of a company. This is probably going to take another generation, and requires a very different mindset.

It’s not just companies that are doing big data though right?

Once you get out of the private sector it changes a little. Some parts of government are actually pretty good at big data because they need to be. The USA and Israel for example have a huge security apparatus that depends on getting their data analytics right. In some of the more classified areas it’s really bleeding edge stuff — things like real time facial recognition, telephone conversation scanning or other intelligence flows. The reality is that private sector companies don’t need to track millions of entities across a wide variety of sectors like the government does. They just need to analyse one sector. And right now they don’t need to be perfect, they just need to be one step ahead of their competitors. Governments have a duty to protect and serve their citizens whereas the private sector just needs to make sure it’s profitable.

The reason we all get hyped about big data is because we get glimpses of possibilities that are way out on a curve. The gap between the hype and the reality isn't because we can't do it technically, but because there just isn't a need to do it yet. For example, take the big banks in Australia. They're making solid profits, and really leading edge data analytics isn't necessary for their survival yet. That's different to something like microtrading on the stock market. When it started a few years ago there were only a few people doing it, but it gave them such a competitive edge that almost immediately everyone had to follow. It was a question of survival. Big data isn't there yet for the majority of real world business problems. That's why when you hear someone hyping big data solutions, keep in mind that it needs to start with a real business problem. That's the first and most important step. Until that problem exists (and it needs to be compelling), business won't do big data for data's sake.

Where do you see big data (sorry, analytics) headed in the future?

The IoT is going to be massive. Connected devices are going to create an even steeper curve. And they might force us into federated data models, whereby you don't own the data, you just have access to it. In my mind that's really interesting for the greater good of society. Right now data is regarded as an asset, so if you're in a commercial environment you don't share it. The concept of federated data is that it's semi private; all the owners contribute it to a common pool where everyone gets to use it. A nice example of this is Sense-T in Tasmania, which is being used to monitor everything from oysters to air quality.

In the US they're a little further down the road than we are; government has partially adopted federated data because it's allowed. There's a difference between this and the open data movement, because you still recognise that the data is yours but you allow people to use it. Once you've got the IoT and you've got all the data out there, as well as the cloud, individuals or organisations never need to invest in hardware; you can just have access on demand. That opens up a world where people can use data analytics to really start pushing the envelope. It's not just cancer, weather or infrastructure. We're talking about societal change at large. It's possibly the next really big leap in our technological evolution, where we end up creating something that didn't exist before.

We've done this before, remember — think about what we did with copper cabling. Our wires hit their theoretical speed limits a long time before we actually moved on. We pushed the physical limits far beyond what was possible using mathematical optimisation, which is what happens when you take your current model where it interacts with the real world (which is messy) and minimise the negative externalities. And greater efficiency has societal impacts: the holy grail of 'win-win', of creating something from nothing. For example, when UPS and FedEx used data analytics to do route optimisation, not only did it save them money, but it helped citizens because it meant fewer carbon emissions and better customer service. Right now the planet is faced with huge inefficiencies. We're wasting food, destroying the environment, and creating unnecessary pollution. I think we should be saying "I want to do what I do better." Data analytics allows us to do that.

What’s the flipside?

As we're now all too aware, we might end up throwing civil liberties out the window. Facial recognition is a technology that's going to be ubiquitous in the future. We're going to end up in a world where you really cannot hide. Also, are we aware of what all this electromagnetic radiation is doing? Will optimisation increase the disparity between rich and poor? What about warfare? History says we invent stuff and then make weapons out of it. And what about machines becoming self-aware? Like anything, this new technology has the power to bring people together but also to drive people apart. What if data is the currency that's going to drive a wedge between us all?

Why should people care?

Any discussion of the future of big data reminds me of Hofstadter's Law: everything always takes longer than you think, even when you factor in Hofstadter's Law. In the very long term I think data is key to our evolution. Remember, we've already transcended space and time. We can create physical objects at a scale of nanometres; we can't see anything at that level with the naked eye, and yet there are things down there that we've created with intentional design. And thanks to our machines we can now think in gigahertz. We're able to process thoughts or questions far faster than was physically possible just a few generations ago. Our brains operate in a space from millimetres to kilometres, and in time from a fraction of a second to a lifetime. But our machines create things at the atomic level, and operate in microseconds.

For the first time in our history we are evolving with a purpose that goes beyond the Darwinian scale. After all, if you break it down, a person is a program encapsulated in DNA, which runs in place and accumulates materials from its surrounding environment to eat, walk, procreate and think. Data is a key not only to our evolution but also to our survival. You can't get off the planet without it, and it can overcome the biggest problems facing us. We're getting close. The future of analytics and data is so bright we need to wear augmented-reality shades. But it's not here yet.