The Virtuous Circle of Machine Learning and Data Quality

You’ve made the decision to engage in a machine learning (ML) initiative to augment your analytics. That’s great; it’s the wave of the future. I’m a huge proponent of ML because I think it gets close to the approach that humans take when they learn: processing all relevant experiences and learning from the outcomes of those experiences. That’s a natural approach that can yield deeper and more prescient insights than you can get with analytics technology alone.

However, there’s one mistake that I’ve seen companies make that often sours them on ML—and artificial intelligence in general—even though it really has nothing to do with ML technology, or the algorithms used in it: they use low-quality data, so they get erroneous results. The problem is that they often don’t realize that their data quality is poor, so they blame the algorithm for the bad results.

But how do you know you can trust your data? It’s a process. You can’t rely on your gut. You have to ask hard questions. The first question is, “Where did this data come from?” The second is, “Can I see the data quality stats?” If your data scientist can tell you that the data came from your internal data warehouse, that it’s been cleansed and formatted, and that the data quality stats are acceptable according to your company’s data quality policies, then you can use it.

If the data is from a public source, or from a data gathering effort and it hasn’t been scrubbed and validated, you shouldn’t trust it. Before you use it, insist that it be validated and measured against your data quality policies. If it doesn’t pass the test, insist that it be scrubbed before you use it. You’ll get pushback, but in the end, the cost to clean the data will be worth the insights it provides.
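To make "measured against your data quality policies" concrete, here is a minimal Python sketch of what such a check might look like. Everything in it is an assumption for illustration: the two statistics (completeness and duplicate rate), the threshold values, and the names `QualityPolicy`, `quality_stats`, and `passes_policy` are hypothetical, and a real policy would likely cover more than this.

```python
from dataclasses import dataclass


@dataclass
class QualityPolicy:
    # Hypothetical thresholds; real values come from your own data quality policy.
    min_completeness: float = 0.95
    max_duplicate_rate: float = 0.02


def is_filled(value):
    """A field counts as filled if it is not None and not blank."""
    return value is not None and str(value).strip() != ""


def quality_stats(rows, required_fields):
    """Compute two simple statistics over a list of dict records."""
    total = len(rows)
    if total == 0:
        return {"completeness": 0.0, "duplicate_rate": 0.0}
    # Share of rows in which every required field is filled.
    complete = sum(
        1 for row in rows if all(is_filled(row.get(f)) for f in required_fields)
    )
    # Share of rows that exactly repeat an earlier row on the required fields.
    distinct = len({tuple(row.get(f) for f in required_fields) for row in rows})
    return {
        "completeness": complete / total,
        "duplicate_rate": (total - distinct) / total,
    }


def passes_policy(stats, policy):
    """True only if every statistic is within the policy's thresholds."""
    return (
        stats["completeness"] >= policy.min_completeness
        and stats["duplicate_rate"] <= policy.max_duplicate_rate
    )


# Illustrative records, not real data.
rows = [
    {"name": "Eastern Community Bank", "exposure": "1200000"},
    {"name": "Eastern Community Bank", "exposure": "1200000"},
    {"name": "", "exposure": "50000"},
    {"name": "Northside Credit Union", "exposure": "300000"},
]
stats = quality_stats(rows, ["name", "exposure"])
ok_to_use = passes_policy(stats, QualityPolicy())
```

In this toy data set one row is missing a name and one row is an exact repeat, so the data fails the assumed policy and would go back for scrubbing before anyone trains a model on it.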

Speaking of quality data, in a virtuous circle, ML can help with your data quality efforts. Specifically, it can help with one of the biggest headaches you face when trying to improve your data quality: matching and de-duplicating data. For example, let’s say you’re a large financial institution and one smaller bank you deal with is Eastern Community Bank. Across your different business units, this bank may be referred to as ECB, Eastern Bank, or Community Bank; you get the point. Whether it’s in systems or spreadsheets, humans are entering the data, and they take shortcuts or make mistakes.

Machine learning can help you catch those types of errors and clean up your data, and keep it cleaner going forward. Normally, the process to find and clean this data would be arduous—even with the excellent data quality tools available on the market. However, with ML, the process is simplified, because with the correct data matching algorithms, the machine learns to match data and clean it up as it goes.
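To give a feel for what a matching algorithm looks at, here is a minimal Python sketch that compares name variants. Note the assumptions: this is hand-written, rule-based matching built on the standard library's `difflib.SequenceMatcher`, not a learned model, and the function names, the three signals, and the 0.85 threshold are all hypothetical. An ML-based matcher would typically learn how to weigh signals like these from pairs that people have already labeled as matches or non-matches.

```python
from difflib import SequenceMatcher


def normalize(name):
    """Lowercase, drop periods, and collapse whitespace."""
    return " ".join(name.lower().replace(".", "").split())


def acronym_match(short, long):
    """True if `short` (normalized) is the initials of `long` (normalized)."""
    compact = short.replace(" ", "")
    initials = "".join(word[0] for word in long.split())
    return len(compact) >= 2 and compact == initials


def match_features(a, b):
    """Signals a matcher might use to compare two entity names."""
    na, nb = normalize(a), normalize(b)
    ta, tb = set(na.split()), set(nb.split())
    return {
        # Character-level similarity in [0, 1].
        "similarity": SequenceMatcher(None, na, nb).ratio(),
        # One name's words are all contained in the other's.
        "token_subset": bool(ta) and bool(tb) and (ta <= tb or tb <= ta),
        # One name is the initials of the other.
        "acronym": acronym_match(na, nb) or acronym_match(nb, na),
    }


def is_candidate_match(a, b, threshold=0.85):
    """Flag a pair for review; the threshold is an assumed value."""
    f = match_features(a, b)
    return f["acronym"] or f["token_subset"] or f["similarity"] >= threshold


canonical = "Eastern Community Bank"
variants = ["ECB", "Eastern Bank", "Community Bank"]
flagged = [v for v in variants if is_candidate_match(canonical, v)]
```

Rules this loose will also flag false positives (any other "Community Bank" would match, for instance), which is why the output is best treated as a list of candidates for review. Reducing that kind of error is where a model trained on your own confirmed matches would be expected to help.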

Suddenly, the matching process that took weeks can be delivered in days. What does that do? Well, let’s say that Eastern Community is failing. What’s your exposure? Instead of waiting weeks for the answers as your analysts pore over spreadsheets and tease out possible matches in different, unconnected systems across your organization, you can have the information more quickly, and you can trust that it will be accurate and complete. You can evaluate your exposure and formulate a plan quickly, probably more quickly than your competitor who also has exposure to ECB. How big of an advantage is that?

What’s more, with ML, the algorithm learns and gets smarter, so when you feed it the next set of data, it leverages its experience and applies it to that data set. The process repeats itself with each set of data, so it gets quicker over time.

Data matching is only one example of how ML can help with your data quality. There are many more, such as error detection and correction, continuous monitoring of data formatting and quality, and automatic enrichment of data without human input, to name but a few. The developments are coming at such a rapid pace that new uses for ML vis-à-vis data quality are appearing almost every day.

It really is a virtuous circle. Machine learning algorithms are virtually useless without clean data. It’s the old “garbage in, garbage out” axiom. However, if you feed them clean data, they can enhance your insights far beyond what your non-ML-augmented analytics can achieve. And ML algorithms can also help themselves by helping you clean your data. That’s a win-win.
