Unsupervised learning is a form of machine learning that identifies patterns and structures in data without relying on labelled examples or predefined outcomes. That is both its greatest strength and its biggest challenge. It can uncover valuable insights even when we do not yet know what we are looking for. Without labels, however, how do you know whether the model is performing well? Traditional metrics such as accuracy cannot provide the answer.
That does not mean model quality cannot be evaluated. It simply means we need to approach the problem differently. In this blog, we explain how we tackled that challenge in a project for one of the largest banks in the Netherlands.
How can you evaluate unsupervised models in practice?
In this project, our goal was to improve transaction monitoring using an isolation forest: an unsupervised learning algorithm designed to detect anomalies. In this context, those anomalies may point to potentially fraudulent transactions.
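To make the setup concrete, here is a minimal sketch of anomaly detection with an isolation forest. It uses scikit-learn and synthetic data purely for illustration; the post does not specify which library or features were used in the actual project.

```python
import numpy as np
from sklearn.ensemble import IsolationForest

rng = np.random.default_rng(42)
# Synthetic "transactions": mostly typical points, plus a few extreme outliers.
normal = rng.normal(loc=0.0, scale=1.0, size=(200, 2))
outliers = rng.normal(loc=8.0, scale=0.5, size=(5, 2))
X = np.vstack([normal, outliers])

# contamination is the expected share of anomalies (an illustrative choice here).
model = IsolationForest(contamination=0.05, random_state=0)
labels = model.fit_predict(X)  # +1 = normal, -1 = anomalous
```

The model isolates points that are easy to separate from the rest of the data; the five extreme points end up flagged with the label `-1`.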
The challenge was clear. Since we did not know in advance which transactions were genuinely fraudulent and which were not, we could not evaluate the model in the traditional sense. We had no direct way to measure whether the model was correctly separating suspicious transactions from legitimate ones.
To address this, we shifted our focus from accuracy to consistency.
Our reasoning was straightforward: if the same transaction is repeatedly flagged as anomalous across different versions of the data, it is more likely to be a genuinely unusual case. In other words, if the model consistently identifies the same transactions as anomalous across multiple datasets, this increases our confidence that it is producing meaningful results.
To put this into practice, we built an evaluation framework based on cross-validation. Across many iterations, the data was split into training and test sets. In each iteration, the isolation forest was trained on the training set and then used to evaluate the test set.
For every transaction, we recorded:
- how often it appeared in the test set,
- how often it was classified as anomalous,
- and how often it was classified as non-anomalous.
This gave us a way to assess how consistently each transaction was categorized.
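The bookkeeping above can be sketched as a repeated-split loop. This is an illustrative reconstruction, not the project's actual code: it assumes scikit-learn's `ShuffleSplit` for the splits and uses row indices as transaction identifiers.

```python
import numpy as np
from sklearn.ensemble import IsolationForest
from sklearn.model_selection import ShuffleSplit

rng = np.random.default_rng(0)
# Illustrative data: typical transactions plus a few clear outliers.
X = np.vstack([rng.normal(0, 1, (200, 2)), rng.normal(8, 0.5, (5, 2))])

n = len(X)
appeared = np.zeros(n, dtype=int)     # times each transaction was in a test set
flagged = np.zeros(n, dtype=int)      # times it was classified as anomalous
not_flagged = np.zeros(n, dtype=int)  # times it was classified as non-anomalous

splitter = ShuffleSplit(n_splits=50, test_size=0.3, random_state=0)
for train_idx, test_idx in splitter.split(X):
    # Retrain on each training split, then classify the held-out transactions.
    model = IsolationForest(contamination=0.05).fit(X[train_idx])
    preds = model.predict(X[test_idx])  # +1 = normal, -1 = anomalous
    appeared[test_idx] += 1
    flagged[test_idx] += (preds == -1)
    not_flagged[test_idx] += (preds == 1)
```

After the loop, the three counters hold exactly the per-transaction tallies listed above.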
A transaction with low consistency would be classified as anomalous about as often as it was classified as non-anomalous. A transaction with high consistency would be assigned to the same category almost every time it appeared in the test data.
We quantified this by taking, for each transaction, the higher of the two classification counts and dividing it by the number of times that transaction appeared in the test set. This produces a consistency score between 0.5 and 1.0, where a higher score indicates more stable classification.
By averaging these scores across all transactions, we obtain a global consistency score. This provides a practical way to compare different isolation forest implementations and determine which one produces more reliable results.
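The per-transaction and global scores described above reduce to a few lines. The count arrays below are hypothetical inputs standing in for the tallies collected during the repeated splits.

```python
import numpy as np

def consistency_scores(flagged, not_flagged):
    """Per-transaction consistency: majority classification count divided by
    test-set appearances. The global score is the mean across transactions."""
    appeared = flagged + not_flagged
    per_transaction = np.maximum(flagged, not_flagged) / appeared
    return per_transaction, per_transaction.mean()

# Hypothetical tallies for three transactions over ten test-set appearances each.
flagged = np.array([10, 5, 0])       # times classified as anomalous
not_flagged = np.array([0, 5, 10])   # times classified as non-anomalous
per_tx, global_score = consistency_scores(flagged, not_flagged)
# per_tx -> [1.0, 0.5, 1.0]: the middle transaction is maximally inconsistent.
```

A transaction always placed in the same class scores 1.0; a coin-flip transaction scores 0.5, matching the 0.5-to-1.0 range described above.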
Don’t let complexity hold you back
This project shows that valuable solutions in data science do not always come from standard methods or off-the-shelf metrics. In this case, we developed a practical way to compare models when conventional evaluation was not possible, allowing us to provide the most reliable model for the client with confidence.
More broadly, data science is full of situations where the real challenge lies not just in building models, but in framing problems correctly and designing approaches that work in practice. That is where data science delivers its real value: creativity, critical thinking, and domain expertise make it possible to develop solutions that truly fit the context at hand.
Final words
Are you looking for new ways to get more out of your data? Don’t hesitate to reach out to our team of experts.