
You should filter out useless traffic: here’s how!

Useless traffic, be it bot or developer traffic, can be a major issue for data analysts. Certain traffic can be considered useless if it is not representative of actual user behaviour. If so, it should be excluded from any reporting to make results more reliable. With nearly half of the global internet traffic coming from bots, handling it carefully has become increasingly important.

Bots are used for many purposes: some examples are search engines, testing services and web scrapers. The main appeal of bots lies in their speed and ease of automation. By nature, these bots do not represent human behaviour, and neither do developers. Therefore, these sources may pollute a website’s analytics data. This blog will lay out different ways to identify as well as filter out useless traffic. Let’s put our analytics traffic light on red for bots and developers!


GA4’s Data Dilemma

Currently, GA4 offers the possibility to filter out certain data. By default, Google automatically excludes known bot traffic from its data. This traffic is identified using a combination of Google’s research and the International Spiders and Bots list. Unfortunately, this means that unknown bots are still not filtered out. Apart from this default filter, Google provides two data filters that can be activated. These filters are, however, very limited, as they only allow you to exclude internal and developer traffic. They also only affect data from the point at which they are activated, meaning they do not affect historical data. The options offered by GA4 to filter out useless traffic are therefore insufficient.

To avoid being identified as a bot or developer, one can use a headless browser. What makes a headless browser different from regular web browsers is that it does not have a graphical user interface. Similar to Mike the Headless Chicken, there is not much to “see” when operating headlessly. Instead of navigating a GUI as one would usually do, a headless browser allows a web page to be controlled automatically from the command line or through network communication.

Mike the headless chicken

Popular use cases for headless browsers are test automation, taking screenshots of web pages, automating web page interactions and web scraping. Headless browsers are used by bots as well as in development. In many ways, a headless browser can be considered a browser like any other, which is why GA4 does not treat it any differently. This poses a problem, since the majority of headless browser use does not represent regular human behaviour. The data generated by headless browsers is useless for data analysis and should therefore be excluded from any reporting.

How to identify useless traffic manually?

There are several indicators that can help identify useless traffic. Bots are often used for automation tasks, partly because of their speed. One can imagine that in these use cases, there is no benefit in hiding this speed by slowing the bot down. Therefore, engagement time is the first metric to consider when identifying useless traffic. An engagement time of 0 seconds or only a few milliseconds is very likely a sign of useless traffic.
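As a rough sketch of this first check, the function below flags sessions whose total engagement time falls under a cut-off. The row shape, the field name `engagementMsec` (for instance, the sum of GA4's `engagement_time_msec` parameter per session) and the 100 ms threshold are assumptions made for illustration, not values prescribed by GA4.

```javascript
// Hypothetical cut-off: sessions below this many milliseconds are treated as suspicious.
var SUSPICIOUS_ENGAGEMENT_MSEC = 100;

// Returns true when a session's total engagement time is zero or only a few milliseconds.
function hasSuspiciousEngagement(session, thresholdMsec) {
  var limit = typeof thresholdMsec === "number" ? thresholdMsec : SUSPICIOUS_ENGAGEMENT_MSEC;
  return Number(session.engagementMsec) < limit;
}

// Made-up session rows for illustration.
var sessionRows = [
  { sessionId: "a", engagementMsec: 0 },
  { sessionId: "b", engagementMsec: 12 },
  { sessionId: "c", engagementMsec: 8450 }
];

var flaggedSessions = sessionRows.filter(function (row) {
  return hasSuspiciousEngagement(row);
}).map(function (row) {
  return row.sessionId;
});

console.log(flaggedSessions); // sessions "a" and "b" are flagged
```

A fixed threshold is the simplest possible rule; what counts as "too fast" will differ per website.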

Another indicator is an extreme bounce rate. Except for something such as a “thank you” page, web pages normally do not have a bounce rate of 100%. Especially if a bounce rate of 100% is far out of line with the rest of the website’s bounce rate data, the traffic presumably originates from a bot or developer.
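To illustrate the bounce rate check, the sketch below groups sessions by traffic source and lists the sources where every session bounced. It assumes a bounce means a session that was not engaged; the field names `source` and `engaged`, the function names and the minimum session count are hypothetical.

```javascript
// Counts sessions and bounces (non-engaged sessions) per traffic source.
function summariseBySource(sessions) {
  var summary = {};
  sessions.forEach(function (session) {
    var entry = summary[session.source] || (summary[session.source] = { sessions: 0, bounces: 0 });
    entry.sessions += 1;
    if (!session.engaged) {
      entry.bounces += 1;
    }
  });
  return summary;
}

// Returns the sources with a 100% bounce rate and at least `minSessions` sessions.
function sourcesWithTotalBounce(sessions, minSessions) {
  var summary = summariseBySource(sessions);
  return Object.keys(summary).filter(function (source) {
    var entry = summary[source];
    return entry.sessions >= minSessions && entry.bounces === entry.sessions;
  });
}

// Made-up sessions for illustration.
var sampleSessions = [
  { source: "newsletter", engaged: true },
  { source: "newsletter", engaged: false },
  { source: "unknown-referrer", engaged: false },
  { source: "unknown-referrer", engaged: false },
  { source: "unknown-referrer", engaged: false }
];

console.log(sourcesWithTotalBounce(sampleSessions, 3)); // only "unknown-referrer" qualifies
```

The minimum session count keeps a single bounced visit from being flagged on its own.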

Headless browsers are identified in a different way: by comparing the user’s screen height with the jQuery window height. An easy way to flag events as originating from a headless browser is to create a custom dimension in GA4, fed by a custom variable in Google Tag Manager. First, create a custom variable in GTM containing the JavaScript code below (source). This code checks whether the user’s screen height and jQuery window height are equal. If so, it returns true. If not, it returns false.

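function() {
  // GTM custom variable; assumes jQuery is loaded on the page.
  // Returns true when the screen height equals the jQuery window height,
  // which is treated here as the sign of a headless browser.
  return window.screen.height - jQuery(window).height() === 0;
}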
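If jQuery is not available on the page, a similar check can be sketched without it. The version below takes a window-like object as an argument so that it can be tried outside the browser. It uses `document.documentElement.clientHeight` as a stand-in for `jQuery(window).height()`, which is an assumption on our part, and the function name `looksHeadless` is hypothetical.

```javascript
// Returns true when the screen height equals the viewport height.
function looksHeadless(win) {
  var screenHeight = win.screen.height;
  var viewportHeight = win.document.documentElement.clientHeight;
  return screenHeight === viewportHeight;
}

// Stand-in objects mimicking the parts of `window` that the check reads.
var regularWindow = {
  screen: { height: 1080 },
  document: { documentElement: { clientHeight: 937 } }
};
var headlessWindow = {
  screen: { height: 600 },
  document: { documentElement: { clientHeight: 600 } }
};

console.log(looksHeadless(regularWindow));  // false
console.log(looksHeadless(headlessWindow)); // true
```

In the browser, the same function would be called with the real `window` object.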

Next, as explained in the GA4 Docs, head over to the admin section in Google Analytics. Navigate to Custom Definitions > Custom Dimensions > New Custom Dimension. Give the dimension a name such as “headless” and change the scope to “event”. When creating the dimension, note down the dimension index, as it will be required for the next steps.

Tip: to make sure you have followed all steps correctly, also create custom dimensions for window screen height and jQuery window height. This way, you can later test in BigQuery that the headless values are returned as expected.
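The test mentioned in this tip boils down to comparing the two heights with the flag for each row. The sketch below shows that comparison on a few made-up rows; the field names `screenHeight`, `windowHeight` and `headless` are hypothetical, and the values are assumed to arrive as strings.

```javascript
// Returns the rows where the headless flag disagrees with the two recorded heights.
function findInconsistentRows(rows) {
  return rows.filter(function (row) {
    var expected = Number(row.screenHeight) === Number(row.windowHeight);
    var reported = String(row.headless) === "true";
    return expected !== reported;
  });
}

// Made-up rows for illustration; the last one is inconsistent on purpose.
var exportedRows = [
  { screenHeight: "1080", windowHeight: "937", headless: "false" },
  { screenHeight: "600", windowHeight: "600", headless: "true" },
  { screenHeight: "768", windowHeight: "768", headless: "false" }
];

console.log(findInconsistentRows(exportedRows).length); // 1
```

An empty result suggests the variable and the dimensions were wired up as intended.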

Switch back to Google Tag Manager and look for your Google Analytics Settings Variable. Navigate to More Settings > Custom Dimensions > Add Custom Dimension. Under “index”, fill in the index from the previous step. Under “dimension value”, enter the variable you created with the JavaScript code in the first step. Save your changes and publish your GTM container.
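Once the flag arrives in your exported event data, excluding headless traffic comes down to a simple filter. The sketch below mimics that step in JavaScript on rows shaped like the GA4 BigQuery export, where parameters sit in an `event_params` list of key and value pairs. It assumes the flag is stored as a string under the key `headless`; the function names are hypothetical.

```javascript
// Looks up a string event parameter by key in a row shaped like the GA4 BigQuery export.
function getStringParam(event, key) {
  var params = event.event_params || [];
  for (var i = 0; i < params.length; i++) {
    if (params[i].key === key) {
      return params[i].value.string_value;
    }
  }
  return undefined;
}

// Keeps every event that is not explicitly flagged as headless.
function dropHeadlessEvents(events) {
  return events.filter(function (event) {
    return getStringParam(event, "headless") !== "true";
  });
}

// Made-up events for illustration.
var exportedEvents = [
  { event_name: "page_view", event_params: [{ key: "headless", value: { string_value: "false" } }] },
  { event_name: "page_view", event_params: [{ key: "headless", value: { string_value: "true" } }] },
  { event_name: "scroll", event_params: [] }
];

console.log(dropHeadlessEvents(exportedEvents).length); // 2
```

Events without the flag are kept here, on the assumption that a missing value should not count as headless.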

Now that the custom dimension has been created, it should show up as a value in the BigQuery tables, where it can be queried. We can now filter out headless traffic using a “WHERE” clause, so our data will no longer be polluted by event data originating from headless browsers!

We have tackled traffic from headless browsers, but that does not cover all useless traffic. As explained, other metrics such as engagement time can also indicate useless traffic. In part two of this blog, we will dive into how Machine Learning can detect this traffic using the previously mentioned indicators. Until next time!



Andrei Fekete

“My passion is learning from new technologies, simplifying and automating processes, and creating truly usable solutions.”
