
An algorithm for automatic data analysis, explained with a hint of orange

Over the past few months, The Data Story has had its first intern! Dorian is now finishing his Master’s in Data Science with a graduation project, also at The Data Story. Afterwards, he will continue to be a part of our team! It’s about time you get to know him better. During Dorian’s internship, he created an algorithm to help The Data Story show more people how valuable Data Science can be.

The algorithm automatically generates four valuable analyses based on GA4 data: customer segmentation, lead scoring, seasonality analysis and a webpage recommender system. It provides insights into performance metrics that can help refine marketing strategies. For example, understanding more about customer types makes it possible to target them more effectively. Similarly, by finding patterns in past revenue and sales and trying to predict future trends, one can anticipate those trends and adjust marketing strategies accordingly.

The algorithm performs all analyses automatically, enabling fast and easy generation of graphs and plots without much manual intervention. It thereby paves the way for us to show many more people what the world of Data Science has to offer! The Dutch streets are turning bright orange and diagnoses of orange fever are reaching an all-time high, with the Euros in full swing and the Summer Olympics in sight (let’s be sports-inclusive here). What better way to explain these analyses than with a little hint of orange? This blog goes over the algorithm’s four analyses and adds some orange here and there to help explain their value.

Customer segmentation

Customer segmentation is an attempt at dividing customers into subgroups, based on the data we have about their behaviour. If we hear someone singing a national anthem, for instance, we have a good shot at placing them in a subgroup by nationality. Segmentation is performed by clustering customers: taking customers that display similar behaviour and putting them in the same group. Clustering’s biggest challenge is deciding how many clusters to define. Here’s how Dorian tackles the problem. The algorithm takes a random sample of customers and performs clustering for every number of clusters from 2 to 20. It then uses metrics such as the Dunn index, the Davies-Bouldin index and intra-cluster distances to assess the validity and quality of each number of clusters.

Once the algorithm has found the optimal number of clusters, it clusters the entire dataset. For each cluster, it reports which data points and corresponding values characterise it. This is what actually makes the results useful, since there is no use in defining customer types without knowing their (unique) properties. Imagine an e-commerce store that sells sports products. Merely knowing that there are 6 different customer types does not provide any insight.

Knowing what characterises these customer types, however, does open the door to new or different marketing strategies. The sports store, for example, might find it very useful to know that one group is very interested in ball sports and another in sports jerseys. It allows for more effective and personal marketing strategies.
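To make the "how many clusters?" step a bit more tangible, here is a minimal sketch of the idea, not Dorian's actual implementation. It assumes a plain k-means with a deterministic seeding, scores each candidate number of clusters with the Davies-Bouldin index only (lower is better), and tries 2 to 5 clusters instead of 2 to 20. The `customers` list and its two behavioural features are made up for illustration.

```python
import math

def assign(points, centroids):
    """Label each point with the index of its nearest centroid."""
    return [min(range(len(centroids)), key=lambda j: math.dist(p, centroids[j]))
            for p in points]

def kmeans(points, k, iters=20):
    """Plain Lloyd's k-means with deterministic farthest-first seeding (an assumption)."""
    centroids = [points[0]]
    while len(centroids) < k:
        # Next seed: the point farthest away from all seeds chosen so far.
        centroids.append(max(points, key=lambda p: min(math.dist(p, c) for c in centroids)))
    for _ in range(iters):
        labels = assign(points, centroids)
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # keep the old centroid if a cluster empties out
                centroids[j] = tuple(sum(axis) / len(members) for axis in zip(*members))
    return assign(points, centroids), centroids

def davies_bouldin(points, labels, centroids):
    """Mean over clusters of the worst (scatter_i + scatter_j) / centroid distance."""
    k = len(centroids)
    scatter = []
    for j in range(k):
        dists = [math.dist(p, centroids[j]) for p, lab in zip(points, labels) if lab == j]
        scatter.append(sum(dists) / len(dists) if dists else 0.0)
    return sum(
        max((scatter[i] + scatter[j]) / math.dist(centroids[i], centroids[j])
            for j in range(k) if j != i)
        for i in range(k)
    ) / k

# Hypothetical sample: two behavioural features per customer, three visible groups.
customers = [(1, 1), (1, 2), (2, 1), (8, 8), (8, 9), (9, 8), (1, 8), (2, 9), (1, 9)]

# Cluster for every candidate k and keep the k with the lowest Davies-Bouldin index.
scores = {k: davies_bouldin(customers, *kmeans(customers, k)) for k in range(2, 6)}
best_k = min(scores, key=scores.get)
print(best_k, scores)
```

On this toy sample the lowest score is found at three clusters, which matches the three groups in the data. In a real setting one would probably combine several validity metrics, as the algorithm described above does, rather than rely on a single index.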

Lead scoring

Just as we would love to know which of our football players will score or win (convert), it would be extremely interesting to know this about customers. Just as we would treat such a player differently by lining them up, we can treat customers differently as well. To achieve this, the algorithm evaluates the GA4 data and determines the likelihood that a customer will convert, based on their behaviour. First, however, this requires some data cleaning, as is often the case with ML. Variables whose values are always higher than 0 when someone converts are removed. For example, if shipping info is present in the data, it is obvious that someone has converted. This ensures that the model, when trained, focuses on general customer behaviour rather than on variables directly related to a purchase.
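The cleaning rule above can be sketched in a few lines. This is an illustration under assumptions rather than the actual code: the rows, the `converted` label and the feature names are hypothetical, and the rule is implemented literally as described (drop every variable that is higher than 0 for all converters).

```python
def drop_leaky_features(rows, target="converted"):
    """Split feature names into (kept, leaky).

    A feature counts as leaky when it is > 0 for every row that converted.
    """
    converters = [row for row in rows if row[target]]
    features = [name for name in rows[0] if name != target]
    leaky = {f for f in features
             if converters and all(row[f] > 0 for row in converters)}
    kept = [f for f in features if f not in leaky]
    return kept, sorted(leaky)

# Hypothetical per-user event counts with a conversion label.
rows = [
    {"search_events": 0, "scroll_events": 3, "add_shipping_info": 1, "converted": 1},
    {"search_events": 2, "scroll_events": 0, "add_shipping_info": 1, "converted": 1},
    {"search_events": 1, "scroll_events": 4, "add_shipping_info": 0, "converted": 0},
    {"search_events": 0, "scroll_events": 1, "add_shipping_info": 0, "converted": 0},
]

kept, leaky = drop_leaky_features(rows)
print("kept:", kept)
print("removed:", leaky)
```

Applied literally, this rule would also remove a generic variable that happens to be positive for every converter, such as a page view count. A stricter variant could additionally require the variable to be 0 for everyone who did not convert; whether the original algorithm does so is not stated here.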

For those who, understandably, get more excited by ML than by sports metaphors: Random Forest, Decision Tree and both Ridge- and Lasso-regularised Logistic Regression models showed better predictive accuracy than Support Vector Machines or K-Nearest Neighbour models. The model with the highest predictive accuracy is selected and used to make probabilistic predictions of conversion for each individual. Back to sports terms: the model with the highest credibility gets to predict whether a player will score, or whether an athlete will win a medal.

Seasonality analysis

For those of you who let their entrepreneurial side get a little too enthusiastic during these times and start buying orange decorations in bulk: it might be good to try a little seasonality analysis every now and then. A seasonality analysis tries to identify patterns in data and thereby helps anticipate trends. Even without complex analysis, though, many of us will conclude that the demand for orange decorations is highly seasonal.

Less obvious, but definitely present, are the recurring yearly patterns in a company’s revenue. The algorithm described in this blog identifies patterns in historical data of both revenue and number of purchases. Based on these patterns, it tries to predict future trends. This makes it possible to anticipate these trends and adjust marketing strategies accordingly. For now: create a reminder for the next seasonal demand for orange decorations, for the World Cup in 2026.
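This blog does not go into which forecasting method the algorithm uses, so the sketch below swaps in two deliberately simple stand-ins: a seasonal index (how far each point in the yearly cycle sits above or below the overall average) and a seasonal-naive forecast (expect the same value as one cycle earlier). The quarterly revenue figures are invented for illustration.

```python
from statistics import mean

def seasonal_indices(values, period):
    """Average of each position in the cycle, divided by the overall average."""
    overall = mean(values)
    return [mean(values[i::period]) / overall for i in range(period)]

def seasonal_naive_forecast(values, period, horizon):
    """Forecast by repeating the last observed cycle."""
    last_cycle = values[-period:]
    return [last_cycle[i % period] for i in range(horizon)]

# Hypothetical quarterly revenue for two years (Q1..Q4, Q1..Q4).
revenue = [40, 90, 190, 45, 60, 110, 210, 55]

indices = seasonal_indices(revenue, period=4)
forecast = seasonal_naive_forecast(revenue, period=4, horizon=4)
print("seasonal indices:", indices)
print("next four quarters:", forecast)
```

An index above 1 marks a quarter that tends to outperform the yearly average. In this made-up series the third quarter stands at twice the average, which is the kind of pattern an orange decoration seller would want to see coming.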

Page recommender system

We all know how effective recommendations can be. From a Netflix show you might like, to a related product on an e-commerce website, to the one in the previous paragraph. Page recommender systems make customers more likely to purchase more products and more likely to return. For that reason, the algorithm provides a personalised product recommender system, based on GA4 user data. It is based on a very cool algorithm that became widely known when it was used in a $1,000,000 Netflix contest: matrix factorisation.

All distinct webpages are listed, and placed on both rows and columns. Each time a combination of webpages occurs in a session, the place where their row and column meet is incremented by 1. The algorithm thereby creates a matrix that indicates the frequency at which webpages occur together in a session.

The image below demonstrates how this would work in Netflix, with rows of users and columns of items. We now have a matrix with frequencies: this means that for each webpage, the algorithm can create a sorted list of webpages that most often occur in combination with it. This provides us with a page recommender system, offering valuable insights into customer interests.
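The counting step described above is small enough to sketch directly. The example below is an illustration under assumptions: the page paths and sessions are hypothetical, each pair of pages is counted at most once per session, and only the co-occurrence counting and sorting are shown (no factorisation step).

```python
from collections import Counter, defaultdict
from itertools import combinations

def cooccurrence(sessions):
    """Count, for every pair of distinct pages, the sessions in which both occur."""
    counts = defaultdict(Counter)
    for pages in sessions:
        # set() so that repeated views within one session count only once.
        for a, b in combinations(sorted(set(pages)), 2):
            counts[a][b] += 1  # row a, column b
            counts[b][a] += 1  # keep the matrix symmetric
    return counts

def recommend(counts, page, top_n=3):
    """Pages that most often share a session with `page`, most frequent first."""
    return [other for other, _ in counts[page].most_common(top_n)]

# Hypothetical sessions, each a list of visited page paths.
sessions = [
    ["/home", "/jerseys", "/footballs"],
    ["/home", "/jerseys"],
    ["/jerseys", "/footballs", "/checkout"],
    ["/home", "/jerseys", "/flags"],
]

counts = cooccurrence(sessions)
print(recommend(counts, "/jerseys", top_n=2))
```

A nested dictionary of counters is used here instead of a full matrix because most pairs of pages never occur together, so most cells would stay at 0.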

Despite the challenges Dorian faced, such as managing virtual machines to handle the large datasets, he did an outstanding job in creating the algorithm. It helps us at The Data Story automatically generate four analyses on GA4 data, to show more people how valuable their data can be. It provides useful insights and helps refine marketing strategies. For his Master’s thesis, Dorian is now researching how marketing attribution models behave in different situations. This should provide insight into which marketing attribution models best fit certain marketing strategies. We’ll keep you updated! Has this blog piqued your interest and do you want to find out what value lies in your data? Feel free to reach out, we are more than happy to help!

Need some help?

Jorian Faber

“Data and creativity go hand in hand: my heart beats faster when I come up with innovative solutions and discover patterns that are not obvious.”
