Data Quality is not difficult – how automation simplifies

When companies start collecting more data, data quality eventually becomes a topic of discussion. With more plants in your garden, maintenance becomes a larger responsibility. It is good to be cautious, but the problem quickly grows out of proportion. Once companies decide to start measuring data quality, they might appoint a whole team to take care of it. When they discover that they do not have a company-wide data quality standard, they will try to come up with one. Eventually, an entire data governance team might spring up and start exploring a wide range of tools.

At some point, people might start wondering “what actually is data quality?”, followed by countless meetings to define it. You get the gist: companies overcomplicate data quality, which costs money, time and other resources. However, data quality is actually very straightforward. Sometimes there is no need for a full team of gardeners to maintain a high level of data quality.

This blog will demonstrate why data quality doesn’t have to be so difficult: it will show how automation can help guarantee the condition of the majority of your data. First, we discuss the reasons to strive for data quality. Next, we cover its key ingredients. Using these, we go over an algorithm to automate data quality checks.

Why data quality?

There are numerous reasons to strive for a high level of data quality. For example, high-quality data is essential for Artificial Intelligence (AI) and Machine Learning (ML) models. Check out our latest blog to learn more about the reasons for this. According to Gartner, 85% of AI or ML projects fail to move past preliminary stages due to poor data quality.

Projects that do see the light of day can still be costly when they rely on low-quality data: IBM estimated the annual cost of “bad data” to be $3.1 trillion in the US alone. Similarly, in the MIT Sloan Management Review of 2017, the authors stated that “the cost of bad data is an astonishing 15% to 25% of revenue for most companies”. It seems like common sense to value data quality: it saves time, reduces cost and improves project outcomes. As stated in our previous blog, Garbage In Garbage Out definitely applies to data in business processes. The wrong crops in your garden will naturally yield the wrong produce. Now that we understand why we should maintain our garden, let’s see what requirements there are.

What defines data quality?

Data quality can be measured along six dimensions:

completeness
timeliness
validity
integrity
uniqueness
consistency

Although most of them are quite straightforward, this section briefly elaborates on each. Completeness indicates whether data is sufficient for delivering meaningful information. Missing values are an example of incomplete data. Timeliness measures how up-to-date your data is. If customer data, for example, stems from 2020, there will likely be an issue regarding timeliness.
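To make completeness and timeliness concrete, here is a minimal Python sketch. The function names, the customer records and the one-year threshold are all illustrative assumptions, not part of our own tooling.

```python
from datetime import date, timedelta

def completeness(rows, column):
    """Share of rows in which `column` holds a non-missing value."""
    if not rows:
        return 1.0
    filled = sum(1 for row in rows if row.get(column) not in (None, ""))
    return filled / len(rows)

def stale_rows(rows, column, today, max_age_days):
    """Rows whose date in `column` is older than `max_age_days`."""
    cutoff = today - timedelta(days=max_age_days)
    return [row for row in rows if row[column] < cutoff]

# Hypothetical customer records, for illustration only.
customers = [
    {"name": "Ada", "email": "ada@example.com", "updated": date(2024, 5, 1)},
    {"name": "Bob", "email": None, "updated": date(2020, 3, 15)},
    {"name": "Cleo", "email": "", "updated": date(2024, 1, 10)},
    {"name": "Dev", "email": "dev@example.com", "updated": date(2023, 11, 2)},
]

# Two of the four rows have an email address.
email_completeness = completeness(customers, "email")

# Assumed threshold: anything not updated within a year counts as stale.
outdated = stale_rows(customers, "updated", today=date(2024, 6, 1), max_age_days=365)
```

What counts as "missing" or "too old" depends on the business, so both are parameters here instead of being hard-coded.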

Validity refers to how well the data adheres to a set of rules, such as a format or process. A customer’s birth date might, for instance, have to be in a date format to be considered valid. The integrity of data is determined by how reliable and trustworthy the information is. A user who has already deleted their account but still shows up in a list of customers is an example of a lack of integrity. Uniqueness, as the name suggests, measures how unique data is. Duplicate customer data is a case of data that does not satisfy the uniqueness condition.
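The three checks above could look roughly like this in Python. This is a sketch under assumptions: the expected date format, the record layout and the set of active accounts are made up for the example.

```python
from collections import Counter
from datetime import datetime

def is_valid_date(value, fmt="%Y-%m-%d"):
    """Validity: True if `value` parses as a date in the expected format."""
    try:
        datetime.strptime(value, fmt)
        return True
    except (TypeError, ValueError):
        return False

def duplicates(rows, key):
    """Uniqueness: key values that occur more than once."""
    counts = Counter(row[key] for row in rows)
    return sorted(value for value, n in counts.items() if n > 1)

def orphans(rows, key, active_keys):
    """Integrity: rows that reference a key that no longer exists."""
    return [row for row in rows if row[key] not in active_keys]

# Hypothetical customer list and set of accounts that still exist.
customers = [
    {"id": 1, "birth_date": "1990-07-12"},
    {"id": 2, "birth_date": "12/07/1990"},  # wrong date format
    {"id": 2, "birth_date": "1985-01-30"},  # duplicate id
    {"id": 4, "birth_date": "1979-11-05"},  # account 4 was deleted
]
active_accounts = {1, 2}

invalid_ids = [c["id"] for c in customers if not is_valid_date(c["birth_date"])]
duplicate_ids = duplicates(customers, "id")
deleted_but_listed = orphans(customers, "id", active_accounts)
```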

Lastly, consistency specifies whether the same value is stored in the same way everywhere it appears. A product whose name is spelled both with and without a capital letter or a space is an example of inconsistent data. Now that we know what to look out for in our garden, how do we automate its maintenance?
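The product-name example of inconsistency can be detected with a small normalisation step. The sketch below is one possible approach, with hypothetical product names. It treats two spellings as the same product when they differ only in capitalisation or spacing.

```python
from collections import defaultdict

def inconsistent_spellings(values):
    """Group values that differ only in capitalisation or spacing."""
    groups = defaultdict(set)
    for value in values:
        # Remove all whitespace and lowercase to get a comparison key.
        normalised = "".join(value.split()).lower()
        groups[normalised].add(value)
    # Keep only the keys that were written in more than one way.
    return {
        key: sorted(variants)
        for key, variants in groups.items()
        if len(variants) > 1
    }

# Hypothetical product column.
products = ["Lawn Mower", "lawn mower", "Lawnmower", "Rake", "Rake"]
result = inconsistent_spellings(products)
```

Here `result` holds a single entry, keyed by `lawnmower`, listing the three spellings found. `Rake` is not reported because it is always written the same way.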

Solution: automation

Automation comes with a list of advantages: it is more efficient, scalable, accurate and reliable, especially because there is less room for human error. Much like a spelling checker in a text editor, an automated check can easily flag human mistakes so that they can be corrected. What’s more, it is much easier to get a real-time overview of data quality. Clearly, our garden would benefit from an automatic lawn mower. Let’s discuss how it works to see what can be built.

In essence, an algorithm ‘reads’ our data to identify its characteristics, such as number ranges. Based on these characteristics, the algorithm then creates a set of data rules, which are enforced each time new data is collected. If irregularities are detected, the algorithm gives a warning. This process is illustrated in the diagram below. Warnings can be handled in different ways. In the example, the value 74 might prompt us to widen the number range in the rules. A number in a yes/no column, however, requires us to change the value.
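The idea of reading the data, deriving rules and checking new data against them can be sketched in a few lines of Python. This is not our R implementation but a simplified illustration. The rule format, the function names `infer_rules` and `check`, and the split between "warning" and "error" are assumptions made for the example.

```python
def is_number(value):
    """True for ints and floats, but not for booleans."""
    return isinstance(value, (int, float)) and not isinstance(value, bool)

def infer_rules(rows):
    """Profile reference rows and derive one rule per column."""
    rules = {}
    for column in rows[0]:
        values = [row[column] for row in rows]
        if all(is_number(v) for v in values):
            # Numeric column: remember the observed range.
            rules[column] = {"type": "range", "min": min(values), "max": max(values)}
        else:
            # Any other column: remember the set of observed values.
            rules[column] = {"type": "allowed", "values": set(values)}
    return rules

def check(row, rules):
    """Return a list of (level, message) findings for one new row."""
    findings = []
    for column, rule in rules.items():
        value = row.get(column)
        if rule["type"] == "range":
            if not is_number(value):
                findings.append(("error", f"{column}: {value!r} is not a number"))
            elif not rule["min"] <= value <= rule["max"]:
                findings.append(
                    ("warning", f"{column}: {value!r} outside {rule['min']}-{rule['max']}")
                )
        elif value not in rule["values"]:
            findings.append(
                ("error", f"{column}: {value!r} is not one of {sorted(rule['values'])}")
            )
    return findings

# Hypothetical historical data used to derive the rules.
history = [
    {"age": 23, "newsletter": "yes"},
    {"age": 41, "newsletter": "no"},
    {"age": 65, "newsletter": "yes"},
]
rules = infer_rules(history)

out_of_range = check({"age": 74, "newsletter": "yes"}, rules)
wrong_type = check({"age": 30, "newsletter": 1}, rules)
```

In this sketch an out-of-range number such as 74 only produces a warning, because the rule itself may need widening, while a number in the yes/no column is an error, because the value itself has to be corrected.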

The warnings, along with other information, are displayed in a dashboard to give a clear overview of data conditions. The current version of our algorithm, constructed by Nino, is written in R and linked to Looker Studio, although almost any combination of programming languages and data visualisation tools can be used. Python, SQL, PowerBI, you name it: the point is that there are many possibilities. This flexibility makes it easier to create and implement an automated data quality check, which covers about 80% of problems. A preview of such a dashboard is displayed below. It gives an overview of data quality over time, including the status quo, measured using the number of errors and warnings, among other things.
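A dashboard that shows errors and warnings over time needs one summary row per day. The sketch below shows how such a table could be built. The record layout with `day` and `level` fields is an assumption for the example, not the format our dashboard uses.

```python
from collections import Counter

def daily_summary(findings):
    """Count findings per day and level, returning one row per day."""
    counts = Counter((f["day"], f["level"]) for f in findings)
    days = sorted({f["day"] for f in findings})
    return [
        {
            "day": day,
            "errors": counts[(day, "error")],
            "warnings": counts[(day, "warning")],
        }
        for day in days
    ]

# Hypothetical findings collected by the automated checks.
findings = [
    {"day": "2024-06-01", "level": "warning"},
    {"day": "2024-06-01", "level": "error"},
    {"day": "2024-06-02", "level": "warning"},
    {"day": "2024-06-02", "level": "warning"},
]
summary = daily_summary(findings)
```

The resulting rows can be exported as a table and used as the data source for whichever visualisation tool is in use.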

Conclusion

All in all, maintaining data quality has numerous advantages. A high level of data quality saves time, reduces cost and means more projects will see the light of day. Unfortunately, too many companies overcomplicate the process and thereby waste their resources. Because data quality is fairly straightforward, it is not at all difficult to automate. On top of that, automation will provide more efficiency, scalability and, most of all, accuracy and reliability. Since automation covers about 80% of data quality issues, it remains important to continue the manual, tailored maintenance of data quality for the rest.

An automatic lawn mower to take care of our garden does not mean that we never have to take a look at our garden again. Manually taking care of the plants that are specific to your business continues to be of great importance. We wish you the best of luck with your garden! Do you want to learn more about our dashboards? Or do you want to ask some further questions? Feel free to reach out. We are more than happy to help!

Need some help?

Jorian Faber

“Data and creativity go hand in hand: my heart beats faster when I come up with innovative solutions and discover patterns that are not obvious.”

