You should filter out useless traffic: here’s how!

Useless traffic, be it bot or developer traffic, can be a major issue for data analysts. Certain traffic can be considered useless if it is not representative of actual user behaviour. If so, it should be excluded from any reporting to make results more reliable. With nearly half of the global internet traffic coming from bots, handling it carefully has become increasingly important.

Bots are used for several reasons: search engines, testing services and web scrapers are some examples. Their main appeal is their speed and ease of automation. By nature, these bots do not represent human behaviour, and neither do developers. Both sources may therefore pollute a website’s analytics data. This blog lays out different ways to identify and filter out useless traffic. Let’s put our analytics traffic light on red for bots and developers!

GA4’s Data Dilemma

GA4 currently offers a few ways to filter out data. By default, Google automatically excludes known bot traffic, which it identifies using a combination of its own research and the International Spiders and Bots list. Unfortunately, this means that unknown bots are still not filtered out. Apart from this default filter, Google provides two data filters that can be activated. These are very limited, however, as they only allow you to exclude internal traffic and developer traffic. They also apply only from the moment they are activated, so historical data is not affected. In short, the options GA4 offers for filtering out useless traffic are too limited to be sufficient.

To avoid being identified as a bot or developer, one can use a headless browser. What makes a headless browser different from a regular web browser is that it does not have a graphical user interface (GUI). Similar to Mike the Headless Chicken, there is not much to “see” when operating headlessly. Instead of navigating a GUI as one usually would, a headless browser lets you automate control of a web page from the command line or through network communication.

Image: Mike the headless chicken

Popular use cases for headless browsers are test automation, taking screenshots of web pages, automating web page interactions and web scraping. They are used both as bots and in development. In many ways, a headless browser is a browser like any other, which is why GA4 does not identify it as anything different. This poses a problem, since most uses of headless browsers do not represent regular human behaviour. The data they generate is useless for data analysis and should therefore be excluded from any reporting.

How to identify useless traffic manually?

There are several indicators that can help identify useless traffic. Bots are often used for automation tasks, partly because of their speed. In these use cases, there is little benefit in hiding that speed by slowing the bot down. Engagement time is therefore the first metric to consider when identifying useless traffic. An engagement time of 0 seconds, or of only a few milliseconds, is very likely a sign of useless traffic.
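The engagement-time check can be sketched in a few lines of JavaScript. This is a minimal illustration, not a GA4 feature: the session objects, the `engagementTimeMsec` field name and the 100 ms threshold are all assumptions made for the example and should be tuned against your own data.

```javascript
// Hypothetical threshold: sessions with less engagement than this are flagged.
// The value is an assumption for illustration, not a GA4 setting.
var MIN_ENGAGEMENT_MSEC = 100;

// Returns true when a session's engagement time is suspiciously short.
function hasSuspiciousEngagement(session, minMsec) {
  var threshold = minMsec === undefined ? MIN_ENGAGEMENT_MSEC : minMsec;
  return session.engagementTimeMsec < threshold;
}

// Splits sessions into a suspicious group and a plausible group.
function splitByEngagement(sessions, minMsec) {
  var suspicious = [];
  var plausible = [];
  sessions.forEach(function (session) {
    if (hasSuspiciousEngagement(session, minMsec)) {
      suspicious.push(session);
    } else {
      plausible.push(session);
    }
  });
  return { suspicious: suspicious, plausible: plausible };
}

// Example data (made up for illustration).
var exampleSessions = [
  { id: "a", engagementTimeMsec: 0 },
  { id: "b", engagementTimeMsec: 35 },
  { id: "c", engagementTimeMsec: 48200 }
];
var engagementSplit = splitByEngagement(exampleSessions);
console.log(engagementSplit.suspicious.map(function (s) { return s.id; }));
```

With the example data, sessions "a" and "b" end up in the suspicious group and "c" in the plausible one. A short engagement time alone is only an indicator, so it is best combined with the other signals described here.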

Another indicator is an extreme bounce rate. Except for something such as a “thank you” page, web pages normally do not have a bounce rate of 100%. If a bounce rate of 100% stands out from the rest of the website’s bounce rate data, the traffic presumably originates from a bot or developer.
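The bounce-rate comparison could look roughly like the sketch below. It assumes traffic has already been aggregated per source into plain objects with hypothetical `sessions` and `bounces` counts. The `maxTypicalRate` cut-off is likewise an assumption, used only to decide whether a 100% bounce rate is unusual for the site.

```javascript
// Bounce rate as a fraction between 0 and 1.
function bounceRate(row) {
  return row.sessions === 0 ? 0 : row.bounces / row.sessions;
}

// Median of a numeric array (NaN for an empty array).
function median(values) {
  var sorted = values.slice().sort(function (a, b) { return a - b; });
  var mid = Math.floor(sorted.length / 2);
  return sorted.length % 2 === 1
    ? sorted[mid]
    : (sorted[mid - 1] + sorted[mid]) / 2;
}

// Flags rows with a 100% bounce rate when the site-wide median is much lower.
// maxTypicalRate is an assumed cut-off for illustration, not a GA4 setting.
function flagExtremeBounce(rows, maxTypicalRate) {
  var cutOff = maxTypicalRate === undefined ? 0.8 : maxTypicalRate;
  var typical = median(rows.map(bounceRate));
  if (typical > cutOff) {
    return []; // a 100% bounce rate is not unusual for this site
  }
  return rows.filter(function (row) {
    return row.sessions > 0 && bounceRate(row) === 1;
  });
}

// Example data (made up for illustration).
var trafficBySource = [
  { source: "google", sessions: 200, bounces: 90 },
  { source: "newsletter", sessions: 50, bounces: 20 },
  { source: "unknown-referrer", sessions: 40, bounces: 40 }
];
var flaggedSources = flagExtremeBounce(trafficBySource);
console.log(flaggedSources.map(function (row) { return row.source; }));
```

Here only "unknown-referrer" is flagged, because its bounce rate of 100% is far from the median of the other sources. Pages where a bounce is expected, such as a “thank you” page, would need to be excluded before running a check like this.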

Headless browsers are identified in a different way: by comparing the user’s screen height with the jQuery window height. An easy way to flag events as originating from a headless browser is to create a custom dimension in GA4, fed by a custom variable in Google Tag Manager (GTM). First, create a custom variable in GTM with the JavaScript code below (source). This code checks whether the user’s screen height and jQuery window height are equal. If so, it returns true. If not, it returns false.

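function() {
  // GTM Custom JavaScript variable: an anonymous function that returns a value.
  // Requires jQuery to be loaded on the page.
  // Returns true when the screen height equals the jQuery window height,
  // which this post treats as the sign of a headless browser.
  return window.screen.height - jQuery(window).height() === 0;
}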
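To try the same comparison outside GTM, for example in a unit test, it can be split into a pure helper and a thin wrapper. This is only a sketch: the function names `isLikelyHeadless` and `detectHeadless` are hypothetical, and the fallback to `innerHeight` when jQuery is missing is an assumption. It may differ slightly from jQuery’s value, for instance when a horizontal scrollbar is present.

```javascript
// Pure helper: takes both heights as arguments so it can run without a browser.
function isLikelyHeadless(screenHeight, viewportHeight) {
  return screenHeight - viewportHeight === 0;
}

// Wrapper around a window-like object.
// Uses jQuery when it is available; otherwise falls back to innerHeight (assumption).
function detectHeadless(win) {
  var viewportHeight = typeof win.jQuery === "function"
    ? win.jQuery(win).height()
    : win.innerHeight;
  return isLikelyHeadless(win.screen.height, viewportHeight);
}

// Fake window objects for illustration; in a browser you would pass `window`.
var fullHeightWindow = { screen: { height: 1080 }, innerHeight: 1080 };
var regularWindow = { screen: { height: 1080 }, innerHeight: 937 };

console.log(detectHeadless(fullHeightWindow)); // true
console.log(detectHeadless(regularWindow));    // false
```

Separating the comparison from the DOM access keeps the logic testable, while the GTM variable itself can stay as the single anonymous function it needs to be.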

Next, as explained in the GA4 Docs, head over to the admin section in Google Analytics. Navigate to Custom Definitions > Custom Dimensions > New Custom Dimension. Name the dimension something such as “headless” and change the scope to “event”. When creating the dimension, remember the dimension index as it will be required for the next steps.

Tip: to make sure you have followed all steps correctly, also create custom dimensions for window screen height and jQuery window height. This way, you can later test in BigQuery that the headless values are returned as expected.
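The check described in this tip can be sketched as follows, assuming the rows have already been pulled out of BigQuery into flat objects. The field names `headless`, `screen_height` and `window_height` are hypothetical and depend on how you named your custom dimensions. Values are normalised because custom dimension values are often stored as strings.

```javascript
// Returns the rows where the stored headless flag disagrees with the two heights.
// Field names are hypothetical; adjust them to your own custom dimension names.
function findInconsistentRows(rows) {
  return rows.filter(function (row) {
    var storedFlag = String(row.headless) === "true";
    var expectedFlag =
      Number(row.screen_height) - Number(row.window_height) === 0;
    return storedFlag !== expectedFlag;
  });
}

// Example rows (made up for illustration).
var exportedRows = [
  { event: "page_view", headless: "true", screen_height: "1080", window_height: "1080" },
  { event: "page_view", headless: "false", screen_height: "1080", window_height: "937" },
  { event: "scroll", headless: "true", screen_height: "900", window_height: "820" }
];
var inconsistentRows = findInconsistentRows(exportedRows);
console.log(inconsistentRows.length);
```

In the example, only the third row is reported, because its flag says headless while the two heights differ. An empty result suggests the variable and the dimensions were wired up as intended.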

Switch back to Google Tag Manager and look for your Google Analytics Settings Variable. Navigate to More Settings > Custom Dimensions > Add Custom Dimension. Under “index”, fill in the index from the previous step. Under “dimension value”, enter the variable you created with the JavaScript code in the first step. Save your changes and publish your GTM container.

Now that the custom dimension has been created, it should show up as a value in your BigQuery tables, where it can be queried. We can now filter out headless traffic using a “WHERE” clause, so our data will no longer be polluted by event data originating from headless browsers! We have tackled traffic from headless browsers, but that does not cover all useless traffic. As explained, other metrics such as engagement time can also indicate useless traffic. In part two of this blog, we will dive into how machine learning can detect this traffic using the indicators mentioned earlier. Until next time!
