An algorithm for automatic data analysis, explained with a hint of orange

Over the past few months, The Data Story has had its first intern! Dorian is now finishing his Master’s in Data Science with a graduation project, also at The Data Story. Afterwards, he will continue to be a part of our team! It’s about time you get to know him better. During Dorian’s internship, he created an algorithm to help The Data Story show more people how valuable Data Science can be.

The algorithm automatically generates four valuable analyses based on GA4 data: customer segmentation, lead scoring, seasonality analysis and a webpage recommender system. It provides insights into performance metrics that can help refine marketing strategies. For example, understanding more about customer types makes it possible to target them more effectively. Similarly, finding patterns in past revenue and sales, and using them to predict future trends, makes it possible to anticipate those trends and adjust marketing strategies accordingly.

The algorithm performs all analyses automatically, enabling fast and easy generation of graphs and plots without much manual intervention. It thereby paves the way for us to show many more people what the world of Data Science has to offer! The Dutch streets are turning bright orange and diagnoses of orange fever are reaching an all-time high, with the Euros in full swing and the Summer Olympics in sight (let’s be sports-inclusive here). What better way to explain these analyses than with a little hint of orange? This blog goes over the algorithm’s four analyses and adds some orange here and there to help explain their value.

Customer segmentation

Customer segmentation is an attempt at dividing customers into subgroups, based on the data we have about their behaviour. If we hear someone singing a national anthem, for instance, we have a good shot at placing them in a subgroup by nationality. Segmentation is performed by clustering customers: taking customers that display similar behaviour and putting them in the same group. Clustering’s biggest challenge is deciding how many clusters to define. Here’s how Dorian tackles the problem. The algorithm takes a random sample of customers and performs clustering for every number of clusters from 2 to 20. It then uses metrics such as the Dunn index, the Davies-Bouldin index and intra-cluster distances to assess the validity and quality of each number of clusters.
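
As a rough illustration of the selection step described above, the sketch below clusters a dataset for a range of cluster counts and keeps the count with the lowest Davies-Bouldin index (lower means tighter, better separated clusters). It is a minimal sketch under assumptions, not Dorian’s implementation: the use of k-means, the farthest-point seeding, all function names and the toy data are made up for illustration, and only one of the metrics mentioned above is used.

```python
import math
import random


def farthest_point_init(points, k):
    """Deterministic seeding (an assumption): start at the first point, then
    keep adding the point that lies farthest from the centroids chosen so far."""
    centroids = [points[0]]
    while len(centroids) < k:
        centroids.append(
            max(points, key=lambda p: min(math.dist(p, c) for c in centroids))
        )
    return centroids


def kmeans_labels(points, k, max_iter=100):
    """Plain k-means (an assumed clustering method); returns one label per point."""
    centroids = farthest_point_init(points, k)
    labels = []
    for _ in range(max_iter):
        # Assign every point to its nearest centroid.
        labels = [
            min(range(k), key=lambda c: math.dist(p, centroids[c])) for p in points
        ]
        # Move every centroid to the mean of its members.
        updated = []
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                updated.append(tuple(sum(axis) / len(members) for axis in zip(*members)))
            else:
                updated.append(centroids[c])  # leave an empty cluster where it is
        if updated == centroids:
            break
        centroids = updated
    return labels


def davies_bouldin(points, labels):
    """Davies-Bouldin index: average over clusters of the worst
    (scatter_i + scatter_j) / distance(centre_i, centre_j) ratio."""
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        return math.inf
    centre = {lab: tuple(sum(axis) / len(m) for axis in zip(*m))
              for lab, m in clusters.items()}
    scatter = {lab: sum(math.dist(p, centre[lab]) for p in m) / len(m)
               for lab, m in clusters.items()}
    worst = [
        max((scatter[i] + scatter[j]) / math.dist(centre[i], centre[j])
            for j in clusters if j != i)
        for i in clusters
    ]
    return sum(worst) / len(worst)


def best_cluster_count(points, k_values=range(2, 21), sample_size=None, seed=0):
    """Score each candidate k on an optional random sample; lowest index wins."""
    if sample_size is not None and sample_size < len(points):
        points = random.Random(seed).sample(points, sample_size)
    scores = {k: davies_bouldin(points, kmeans_labels(points, k))
              for k in k_values if k < len(points)}
    return min(scores, key=scores.get), scores


# Toy data (hypothetical): three tight groups of four "customers" each.
blobs = [(0, 0), (0, 1), (1, 0), (1, 1),
         (10, 10), (10, 11), (11, 10), (11, 11),
         (20, 0), (20, 1), (21, 0), (21, 1)]
best_k, scores = best_cluster_count(blobs, k_values=range(2, 7))
print(best_k, scores)
```

The toy run only tries 2 to 6 clusters because it has just twelve points. With as many clusters as points, every scatter term is zero and the index trivially drops to 0, which is one reason to combine several metrics rather than rely on a single one.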

Once the algorithm has found the optimal number of clusters, it clusters the entire dataset. For each cluster, it reports which data points and corresponding values characterise it. This is what actually makes the results useful, since there is no use in defining customer types without knowing their (unique) properties. Imagine an e-commerce store that sells sports products. Merely knowing that there are 6 different customer types does not provide any insight.

Knowing what characterises these customer types, however, does open the door to new or different marketing strategies. The sports store, for example, might find it very useful to know that one group is very interested in ball sports and another in sports jerseys. It allows for more effective and more personal marketing.

Lead scoring

Just as we would love to know which of our football players will score or win (convert), it would be extremely interesting to know this about customers. And just as we would treat such a player differently by putting them in the line-up, we can treat customers differently as well. To achieve this, the algorithm evaluates the GA4 data and determines the likelihood that a customer will convert, based on their behaviour. First, however, this requires some data cleaning, as is often the case with ML. Data points (variables) whose values are always higher than 0 when someone converts are removed. For example, if shipping info is present in the data, it is obvious that someone has converted. Removing such variables ensures that the model, when trained, focuses on general customer behaviour rather than on variables directly related to a purchase.
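
The cleaning rule above can be sketched as follows. This is a minimal sketch under assumptions, not the actual pipeline: the function name, the `converted` column and the example rows are hypothetical. It also applies a slightly narrower reading of the rule than the text states, namely that a variable is only dropped if it is above 0 for every converter and exactly 0 for every non-converter, so that ordinary behaviour such as page views is not removed by accident.

```python
def find_leaky_features(rows, target="converted"):
    """Return the variables that give the outcome away.

    Assumed rule for this sketch: a variable is leaky when it is above 0
    for every converter and exactly 0 for every non-converter.
    """
    features = [name for name in rows[0] if name != target]
    converters = [row for row in rows if row[target]]
    others = [row for row in rows if not row[target]]
    leaky = []
    for name in features:
        always_positive = all(row[name] > 0 for row in converters)
        never_positive = all(row[name] == 0 for row in others)
        if converters and always_positive and never_positive:
            leaky.append(name)
    return leaky


def drop_features(rows, names):
    """Return copies of the rows without the given variables."""
    return [{k: v for k, v in row.items() if k not in names} for row in rows]


# Hypothetical per-user aggregates; the column names are illustrative only.
users = [
    {"page_views": 12, "add_to_cart": 2, "shipping_info": 1, "converted": 1},
    {"page_views": 7, "add_to_cart": 1, "shipping_info": 2, "converted": 1},
    {"page_views": 9, "add_to_cart": 1, "shipping_info": 0, "converted": 0},
    {"page_views": 3, "add_to_cart": 0, "shipping_info": 0, "converted": 0},
]
leaky = find_leaky_features(users)
cleaned = drop_features(users, leaky)
print(leaky)
```

In this toy data only `shipping_info` is flagged: `add_to_cart` is also positive for every converter, but one non-converter has it too, so it still carries genuine behavioural signal.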

For those who, understandably, get more enthusiastic about ML than about sports metaphors: Random Forest, Decision Tree and both Ridge- and Lasso-regularised Logistic Regression models showed better predictive accuracy than Support Vector Machines or K-Nearest Neighbour models. The model with the highest predictive accuracy is selected and used to make probabilistic predictions of conversion for each individual. Back to sports terms: the model with the highest credibility gets to predict whether a player will score, or whether an athlete will win a medal.
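
The selection step can be illustrated with a small sketch. It assumes each candidate model has already produced conversion probabilities for the same held-out users. The model names, the 0.5 threshold and all numbers are hypothetical, and the source does not describe how accuracy is actually measured.

```python
def accuracy(actual, probabilities, threshold=0.5):
    """Share of users whose thresholded prediction matches the real outcome."""
    predicted = [p >= threshold for p in probabilities]
    hits = sum(pred == bool(real) for pred, real in zip(predicted, actual))
    return hits / len(actual)


def select_best_model(actual, probabilities_by_model):
    """Return the name of the most accurate model plus all accuracy scores."""
    scores = {
        name: accuracy(actual, probs)
        for name, probs in probabilities_by_model.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical held-out outcomes (1 = converted) and model outputs.
holdout_outcomes = [1, 0, 0, 1, 0, 1]
holdout_probabilities = {
    "model_a": [0.9, 0.2, 0.4, 0.7, 0.1, 0.6],
    "model_b": [0.8, 0.3, 0.6, 0.7, 0.2, 0.4],
    "model_c": [0.6, 0.6, 0.4, 0.4, 0.6, 0.6],
}
best_name, model_scores = select_best_model(holdout_outcomes, holdout_probabilities)
lead_scores = holdout_probabilities[best_name]  # probabilities double as lead scores
print(best_name, model_scores)
```

The winning model’s probabilities are kept rather than its yes/no predictions, since a lead score is more useful as a ranking than as a hard label.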


Seasonality analysis

For those of you who let your entrepreneurial side get a little too enthusiastic during these times and start buying orange decorations in bulk: it might be good to try a little seasonality analysis every now and then. A seasonality analysis tries to identify patterns in data and thereby helps anticipate trends. Even without complex analysis, though, many of us will conclude that the demand for orange decorations is highly seasonal.

Less obvious, but definitely present, are the recurring yearly patterns in a company’s revenue. The algorithm described in this blog identifies patterns in historical data for both revenue and number of purchases. Based on these patterns, it tries to predict future trends. This makes it possible to anticipate these trends and adjust marketing strategies accordingly. For now: create a reminder for the next seasonal peak in demand for orange decorations, the World Cup in 2026.
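
The source does not say which forecasting method the algorithm uses. As one simple way to picture the idea, the sketch below estimates a seasonal index per period (how far each quarter typically sits above or below its year’s average) and projects the next cycle from the most recent year’s level. The function names and the revenue figures are hypothetical.

```python
def seasonal_indices(history, period):
    """Average ratio of each position in the cycle to its cycle mean."""
    if len(history) < period or len(history) % period != 0:
        raise ValueError("history must contain one or more complete cycles")
    cycles = [history[i:i + period] for i in range(0, len(history), period)]
    ratios = [[value / (sum(cycle) / period) for value in cycle] for cycle in cycles]
    return [sum(r[pos] for r in ratios) / len(ratios) for pos in range(period)]


def forecast_next_cycle(history, period):
    """Project the next cycle: latest cycle's mean level times each seasonal index."""
    indices = seasonal_indices(history, period)
    latest_level = sum(history[-period:]) / period
    return [latest_level * index for index in indices]


# Hypothetical quarterly revenue for two years, with a recurring Q3 peak.
quarterly_revenue = [100, 100, 300, 100,
                     200, 200, 600, 200]
indices = seasonal_indices(quarterly_revenue, period=4)
forecast = forecast_next_cycle(quarterly_revenue, period=4)
print(indices, forecast)
```

This deliberately ignores year-on-year growth and one-off events (a football tournament does not happen every summer), which a production model would need to handle.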


Page recommender system

We all know how effective recommendations can be: from a Netflix show you might like, to a related product on an e-commerce website, to the one in the previous paragraph. Page recommender systems make customers more likely to purchase more products and more likely to return. For that reason, the algorithm provides a personalised page recommender system, based on GA4 user data. It is based on a very cool technique that became widely known when it was used in a $1,000,000 Netflix contest: matrix factorisation.

All distinct webpages are listed, and placed on both rows and columns. Each time a combination of webpages occurs in a session, the place where their row and column meet is incremented by 1. The algorithm thereby creates a matrix that indicates the frequency at which webpages occur together in a session.
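
The counting step described above can be sketched in a few lines. This is a minimal sketch that covers only the co-occurrence counting, not a factorisation step. The function name and the example sessions are hypothetical, and it assumes that a pair of pages counts once per session no matter how often each page was visited.

```python
from collections import Counter
from itertools import combinations


def cooccurrence_counts(sessions):
    """Count, for every ordered pair of pages, the sessions containing both.

    The result behaves like a sparse symmetric matrix: counts[(row, column)].
    """
    counts = Counter()
    for pages in sessions:
        # set() makes repeat visits within one session count only once (assumption).
        for first, second in combinations(sorted(set(pages)), 2):
            counts[(first, second)] += 1
            counts[(second, first)] += 1
    return counts


# Hypothetical sessions, each a list of visited page paths.
sessions = [
    ["/home", "/jerseys", "/balls"],
    ["/home", "/jerseys", "/home"],
    ["/balls", "/home"],
]
counts = cooccurrence_counts(sessions)
print(counts[("/home", "/jerseys")], counts[("/balls", "/jerseys")])
```

Storing only the non-zero cells in a `Counter` keeps memory use manageable when a site has many pages but each session touches only a few of them.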

We now have a matrix of frequencies, which means that for each webpage the algorithm can create a sorted list of the webpages that most often occur in combination with it. This provides us with a page recommender system, offering valuable insights into customer interests. The image below demonstrates how this would work at Netflix, with rows of users and columns of items.

[Image: a Netflix-style matrix with rows of users and columns of items]
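
Turning the frequency matrix into recommendations then comes down to sorting one row of it. The sketch below assumes the frequencies are stored as a dictionary keyed by (page, other page). The function name, the tie-breaking rule and the numbers are hypothetical.

```python
def recommend_pages(frequencies, page, top_n=3):
    """Return the pages that most often share a session with `page`."""
    related = [
        (other, count)
        for (current, other), count in frequencies.items()
        if current == page and count > 0
    ]
    # Highest frequency first; ties are broken alphabetically for stable output.
    related.sort(key=lambda item: (-item[1], item[0]))
    return [other for other, _ in related[:top_n]]


# Hypothetical frequencies for a sports store (only the rows needed here).
frequencies = {
    ("/home", "/balls"): 3,
    ("/home", "/jerseys"): 2,
    ("/home", "/shoes"): 1,
    ("/jerseys", "/home"): 2,
    ("/jerseys", "/shoes"): 2,
}
top_for_home = recommend_pages(frequencies, "/home", top_n=2)
top_for_jerseys = recommend_pages(frequencies, "/jerseys")
print(top_for_home, top_for_jerseys)
```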

Despite the challenges Dorian faced, such as managing virtual machines to handle large datasets, he did an outstanding job in creating the algorithm. It helps us at The Data Story automatically generate four analyses on GA4 data, to show more people how valuable their data can be. It provides useful insights and helps refine marketing strategies. For his Master’s thesis, Dorian is now researching how marketing attribution models behave in different situations. This should provide insight into which marketing attribution models best fit certain marketing strategies. We’ll keep you updated! Has this blog piqued your interest, and do you want to find out what value lies in your data? Feel free to reach out, we are more than happy to help!

Need some help?

Jorian Faber, The Data Story

“Data and creativity go hand in hand: my heart beats faster when I am coming up with innovative solutions and discovering patterns that are not self-evident.”
