
AI needs clean, high-quality data – here’s why

With AI becoming more and more popular, its use as a technology, and as a buzzword, is growing rapidly. Even so, the title of this blog contains that very buzzword, merely to illustrate a point of course. This blog, however, will elaborate on Machine Learning and discuss why it requires clean, high-quality data. The pitfall with exciting new technologies such as Machine Learning is not putting enough thought into their usage. Crises such as the Dutch childcare benefits scandal make it even more apparent that new technologies require caution. Companies collect more and more data, which also means there is more data to take care of.

In Computer Science, Garbage In, Garbage Out (GIGO) is a common concept: biased or poor-quality input produces similarly poor output. The same applies to Data Science and Analytics: Machine Learning is only as good as the data it is given. In this blog, Mr. Clean will take you by the hand, guide you past the buzz and hype, and help you apply Machine Learning with more certainty. Mr. Clean will discuss some examples of unclean data and explain which brooms, sponges and brushes can be used to clean it. Next, some examples will be given of how things can go wrong when Machine Learning is deployed without sufficient cleanliness. To end on a high note, successful Machine Learning and its possibilities will be discussed.


Examples of unclean data

Machine Learning makes informed decisions based on training data. If that training data is of low quality, it will make similarly bad decisions. Some of the most common examples of unclean, low-quality data can be found below. They are often caused by human error, but can be the fault of machines as well.

  • Missing data
  • Redundant data
  • Typos
  • Inconsistent data
  • Biased data

Unclean data: causes and solutions

Missing data

The first filth in our unclean data is missing data: as the name suggests, data in which values are absent. Something might go wrong in the collection or tracking of data, for example, omitting values from a dataset. A way to tackle this is by filling the gaps with default values, or by using the mean, median or mode as a replacement. One should be careful, however, that the share of missing data is not too large when using such numbers as replacements. Otherwise it might be wiser to remove the incomplete data points altogether, as they are less fit for use. Another, more complicated option is to use a statistical method such as imputation, which predicts missing values from known ones, to complete the data.
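As a minimal sketch of the first two approaches, the snippet below uses Python with pandas on a hypothetical dataset with an "age" column; the column name and the 20% threshold are illustrative assumptions, not a fixed recipe.

```python
import numpy as np
import pandas as pd

# Hypothetical dataset with missing values (NaN) in the "age" column.
df = pd.DataFrame({
    "customer_id": [1, 2, 3, 4, 5],
    "age": [34, np.nan, 45, np.nan, 29],
})

# How large is the share of missing values?
missing_share = df["age"].isna().mean()

if missing_share < 0.2:
    # Small share: replace missing values with the median of the known values.
    df["age"] = df["age"].fillna(df["age"].median())
else:
    # Large share: removing the incomplete rows may be the safer choice.
    df = df.dropna(subset=["age"])
```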

Redundant data

Another smudge in your data can come in the form of redundant data: the same data existing more than once in a dataset. This can, for instance, be caused by a data connection that appends to a dataset whenever information is overwritten, leaving a user with multiple rows of very similar data. Thoroughly going through the dataset and, for example, only keeping the most recent row for each user can be a good solution. Be careful, though: some data might seem redundant when it is not. A product that is ordered multiple times might create distinct rows in a dataset, but should definitely not be considered redundant.
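A minimal sketch of the "keep the most recent row per user" approach, assuming a pandas DataFrame with hypothetical user_id and updated_at columns:

```python
import pandas as pd

# Hypothetical user table where updates were appended instead of overwritten.
df = pd.DataFrame({
    "user_id": [101, 101, 102, 103, 103],
    "city": ["Utrecht", "Amsterdam", "Rotterdam", "Den Haag", "Den Haag"],
    "updated_at": pd.to_datetime(
        ["2024-01-03", "2024-02-10", "2024-01-15", "2024-01-20", "2024-03-01"]
    ),
})

# Keep only the most recent row per user; older rows are treated as redundant.
deduplicated = (
    df.sort_values("updated_at")
      .drop_duplicates(subset="user_id", keep="last")
)
print(deduplicated)
```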

Typos

Likewise, typos can stain the dataset we want to use for training a Machine Learning model, posing many potential problems. For instance, a typo might create a formatting mistake and cause a value in a date field to be left out. Similarly, a typo in a customer's name could cause some of their orders to be tracked incorrectly, affecting their profile in undesirable ways. As typos often occur through human error, it is advisable to carefully validate and verify data before using it.
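One way to catch such problems early is to validate fields before use. The sketch below, assuming pandas and a hypothetical order_date column, flags date values that do not parse so they can be checked manually:

```python
import pandas as pd

# Hypothetical order data where a typo ("2024-13-07") broke one date value.
df = pd.DataFrame({
    "order_id": [1, 2, 3],
    "order_date": ["2024-05-01", "2024-13-07", "2024-06-15"],
})

# errors="coerce" turns unparseable dates into NaT instead of raising an error,
# so the affected rows can be isolated and verified by hand.
parsed = pd.to_datetime(df["order_date"], errors="coerce")
invalid_rows = df[parsed.isna()]
print(invalid_rows)
```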

Inconsistent data

Very similar is inconsistent data, for example capital versus lowercase characters or inconsistent punctuation. Unlike typos, this can also be caused by inconsistent data connections: a connection to multiple sources might mean that each source has its own data format. Pay attention to consistency when cleaning data, for example by unifying these distinct formats. This ensures that, for instance, two instances of the same name are not considered distinct just because of their letter case.
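For example, a simple normalisation step (sketched below with pandas; the name column is hypothetical) can unify letter case, punctuation and spacing before the data is used:

```python
import pandas as pd

# Hypothetical names collected from sources with different formatting conventions.
df = pd.DataFrame({"name": ["Jan Jansen", "jan jansen", "JAN  JANSEN."]})

# Lowercase everything, strip punctuation and collapse repeated whitespace,
# so the same name is no longer counted as three distinct values.
df["name_clean"] = (
    df["name"]
      .str.lower()
      .str.replace(r"[^\w\s]", "", regex=True)
      .str.replace(r"\s+", " ", regex=True)
      .str.strip()
)
print(df["name_clean"].nunique())  # 1 instead of 3
```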

Biased data

Lastly, a less obvious but equally (if not more) problematic source of unclean data is biased data. You might have already come across the term “biased data” in the media. For those who haven't: it is data that is limited in some way and can therefore not give an accurate representation. For example, a bank might have less historical data on loan approvals for minorities than for others. To avoid using biased data, validate and verify data extra carefully. As bias can be challenging to notice, especially on your own, it can be useful to ask more people for their expertise. If possible, you can modify your data to make it fairer. To maintain the quality of the model, evaluate it continuously, and ensure that people understand what the model does to avoid unexpected and unwanted decision-making.
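A first, admittedly crude check is to compare how well each group is represented and how outcomes differ between groups. The sketch below uses pandas with entirely illustrative group labels and loan outcomes; real bias analysis requires far more care and domain knowledge.

```python
import pandas as pd

# Entirely illustrative historical loan data; labels and values are made up.
df = pd.DataFrame({
    "group":    ["A", "A", "A", "A", "A", "B", "B"],
    "approved": [1,   1,   0,   1,   1,   0,   0],
})

# Compare record counts and approval rates per group.
# Large gaps in either number are a signal to investigate before training a model.
summary = df.groupby("group")["approved"].agg(records="count", approval_rate="mean")
print(summary)
```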

Dystopia: when Machine Learning goes wrong

With an abundance of possible types of unclean data, there is a lot that can cause Machine Learning to make the wrong decisions. Most of the above examples of smudges that ought to be cleaned cause the same kind of problem: data being interpreted incorrectly or not at all. Stories about failures go in all directions, but the basis remains plain and simple: the calculations are incorrect. For instance, if a user has made more purchases according to the data than he actually has, he might receive the wrong recommendations from the model. In the example of the bank with biased data, fewer loans will likely be approved for minorities, simply because the model does not know any better. A similar disaster occurred at the Dutch tax authorities with the childcare benefits scandal: parents were wrongly marked as fraudulent by a model trained on biased data. More caution could have prevented this.

Image source: NOS

Utopia: when it does go right

The other side of the story is brighter, nearly as bright as Mr. Clean’s smile. Just as the doomsday scenarios seem endless, the possibilities of Machine Learning, when used carefully and trained with clean data, are limitless as well! Well-informed decision-making can go a long way and often beats humans by miles. Imagine how effective the right recommendations would be for your customers, and how impactful the right business investment would be when based on clean, high-quality data. What’s more, using and maintaining clean data facilitates understanding, which is crucial in any situation but especially with Machine Learning. After cleaning your data, finding the right Machine Learning model is the next challenge, requiring thorough insight into the question at hand. One of our favourite tools in the cleaning cupboard is Dataform: it helps us maintain clean, high-quality data. Read our other blogs to learn more about it.

Conclusion

Machine Learning can be extremely useful when used correctly. Don’t get caught up in the buzz and hype around AI; instead, listen to Mr. Clean and clean your missing, redundant, typo-ridden, inconsistent or biased data using the right cleaning gear. Things can go horribly wrong otherwise. For fans of Oppenheimer: the so-called ‘father of modern AI’, John von Neumann, was involved in the Manhattan Project, performing immense computer calculations. Imagine what could have gone wrong had he used unclean data; here’s to cleaning our data and hoping our mistakes have less catastrophic effects! Need help cleaning and maintaining your data? Want to know what Machine Learning possibilities lie in your data? Feel free to reach out to us!

Oppenheimer and von Neumann in front of one of the computers used for immense calculations in the Manhattan Project
