Unsupervised learning is a form of machine learning that identifies patterns and structures in data without relying on labelled examples or predefined outcomes. That is both its greatest strength and its biggest challenge. It can uncover valuable insights even when we do not yet know what we are looking for. Without labels, however, how do you know whether the model is performing well? Traditional metrics such as accuracy cannot provide the answer.
That does not mean model quality cannot be evaluated. It simply means we need to approach the problem differently. In this blog, we explain how we tackled that challenge in a project for one of the largest banks in the Netherlands.
How can you evaluate unsupervised models in practice?
In this project, our goal was to improve transaction monitoring using an isolation forest: an unsupervised learning algorithm designed to detect anomalies. In this context, those anomalies may point to potentially fraudulent transactions.
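To make the setup concrete, here is a minimal sketch of anomaly detection with an isolation forest. It uses scikit-learn and synthetic data purely for illustration; the post does not specify which library or features were used in the actual project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic "transactions": mostly typical points, plus a few extreme outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = np.vstack([normal, outliers])

# contamination is the expected share of anomalies (an illustrative choice here).
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # +1 = normal, -1 = anomalous
```

The model isolates points that are easy to separate from the rest of the data; the five extreme points end up flagged with the label `-1`.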
The challenge was clear. Since we did not know in advance which transactions were genuinely fraudulent and which were not, we could not evaluate the model in the traditional sense. We had no direct way to measure whether the model was correctly separating suspicious transactions from legitimate ones.
To address this, we shifted our focus from accuracy to consistency.
Our reasoning was straightforward: if the same transaction is repeatedly flagged as anomalous across different versions of the data, it is more likely to be a genuinely unusual case. In other words, if the model consistently identifies the same transactions as anomalous across multiple datasets, this increases our confidence that it is producing meaningful results.
To put this into practice, we built an evaluation framework based on cross-validation. Across many iterations, the data was split into training and test sets. In each iteration, the isolation forest was trained on the training set and then used to evaluate the test set.
For every transaction, we recorded:
- how often it appeared in the test set,
- how often it was classified as anomalous,
- and how often it was classified as non-anomalous.
This gave us a way to assess how consistently each transaction was categorized.
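The bookkeeping above can be sketched as a repeated-split loop. This is an illustrative reconstruction, not the project's actual code: it assumes scikit-learn's `ShuffleSplit` for the splits and uses row indices as transaction identifiers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
# Illustrative data: typical transactions plus a few clear outliers.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 0.5, (5, 2))])

n = len(X)
appeared = np.zeros(n, dtype=int)     # times each transaction was in a test set
flagged = np.zeros(n, dtype=int)      # times it was classified as anomalous
not_flagged = np.zeros(n, dtype=int)  # times it was classified as non-anomalous

splitter = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X):
    # Retrain on each training split, then classify the held-out transactions.
    model = IsolationForest(contamination=0.05).fit(X[train_idx])
    preds = model.predict(X[test_idx])  # +1 = normal, -1 = anomalous
    appeared[test_idx] += 1
    flagged[test_idx] += (preds == -1)
    not_flagged[test_idx] += (preds == 1)
```

After the loop, the three counters hold exactly the per-transaction tallies listed above.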
A transaction with low consistency would be classified as anomalous about as often as it was classified as non-anomalous. A transaction with high consistency would be assigned to the same category almost every time it appeared in the test data.
We quantified this by taking, for each transaction, the higher of the two classification counts and dividing it by the number of times that transaction appeared in the test set. This produces a consistency score between 0.5 and 1.0, where a higher score indicates more stable classification.
By averaging these scores across all transactions, we obtain a global consistency score. This provides a practical way to compare different isolation forest implementations and determine which one produces more reliable results.
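The per-transaction and global scores described above reduce to a few lines. The count arrays below are hypothetical inputs standing in for the tallies collected during the repeated splits.

```python
import numpy as np

def consistency_scores(flagged, not_flagged):
    """Per-transaction consistency: majority classification count divided by
    test-set appearances. The global score is the mean across transactions."""
    appeared = flagged + not_flagged
    per_transaction = np.maximum(flagged, not_flagged) / appeared
    return per_transaction, per_transaction.mean()

# Hypothetical tallies for three transactions over ten test-set appearances each.
flagged = np.array([10, 5, 0])       # times classified as anomalous
not_flagged = np.array([0, 5, 10])   # times classified as non-anomalous
per_tx, global_score = consistency_scores(flagged, not_flagged)
# per_tx -> [1.0, 0.5, 1.0]: the middle transaction is maximally inconsistent.
```

A transaction always placed in the same class scores 1.0; a coin-flip transaction scores 0.5, matching the 0.5-to-1.0 range described above.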
Don’t let complexity hold you back
This project shows that valuable solutions in data science do not always come from standard methods or off-the-shelf metrics. In this case, we developed a practical way to compare models when conventional evaluation was not possible, allowing us to provide the most reliable model for the client with confidence.
More broadly, data science is full of situations where the real challenge lies not just in building models, but in framing problems correctly and designing approaches that work in practice. That is where data science delivers its real value: creativity, critical thinking, and domain expertise make it possible to develop solutions that truly fit the context at hand.
Final words
Are you looking for new ways to get more out of your data? Don’t hesitate to reach out to our team of experts.