You should filter out useless traffic, here’s how to!

Useless traffic, be it bot or developer traffic, can be a major issue for data analysts. Traffic can be considered useless if it is not representative of actual user behaviour. Such traffic should be excluded from any reporting to make results more reliable. With nearly half of global internet traffic coming from bots, handling it carefully has become increasingly important.

Bots are used for several reasons: some examples are search engines, testing services and web scrapers. The main attraction of bots is their speed and ease of automation. By nature, bots do not represent human behaviour, and neither do developers. Therefore, both sources may pollute a website’s analytics data. This blog lays out different ways to identify and filter out useless traffic. Let’s put our analytics traffic light on red for bots and developers!

GA4 currently offers a few ways to filter out certain data. By default, Google automatically excludes known bot traffic from its data. This traffic is identified using a combination of Google’s research and the International Spiders and Bots List. Unfortunately, this means that unknown bots are still not filtered out. Apart from this default filter, Google provides two data filters that can be activated. These filters are, however, very limited, as they only allow you to exclude internal and developer traffic. They also only affect data from the point at which they are activated, meaning they do not affect historical data. The options offered by GA4 to filter out useless traffic are therefore insufficient.

GA4’s Data Dilemma

To avoid being identified as a bot or developer, one can use a headless browser. What makes a headless browser different from a regular web browser is that it does not have a graphical user interface. Similar to Mike the Headless Chicken, there is not much to “see” when operating headlessly. Instead of navigating a GUI as one would usually do, a headless browser allows a web page to be controlled automatically from the command line or through network communication.

Popular use cases are test automation, taking screenshots of web pages, automating web page interactions and web scraping. Headless browsers are used as bots as well as in development. In many ways, a headless browser can be considered a browser like any other, which is why GA4 does not identify it as any different. This poses a problem, since the majority of headless browser uses do not represent regular human behaviour. The data generated by headless browsers is useless for data analysis and should therefore be excluded from any reporting.

How to identify useless traffic manually?

There are several indicators that can help identify useless traffic. Bots are often used for automation tasks, partly because of their speed. In these use cases, there is no benefit in hiding this speed by slowing the bot down. Therefore, engagement time is the first metric to consider when identifying useless traffic. An engagement time of 0 seconds, or of only a few milliseconds, is very likely a sign of useless traffic.
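The engagement-time check described above can be sketched as a small filter. This is only an illustration, not GA4's own logic: the shape of the session objects, the `engagementTimeMsec` field name and the 100 ms threshold are all assumptions, and the threshold should be tuned against your own data.

```javascript
// Hypothetical threshold: sessions engaged for less than this look suspicious.
const MIN_ENGAGEMENT_MSEC = 100;

// Returns the sessions whose engagement time is below the threshold.
// `sessions` is assumed to be an array of { id, engagementTimeMsec } objects;
// a missing engagement time is treated as 0.
function flagLowEngagement(sessions, thresholdMsec = MIN_ENGAGEMENT_MSEC) {
  return sessions.filter((s) => (s.engagementTimeMsec || 0) < thresholdMsec);
}

// Made-up sample data for illustration only.
const sampleSessions = [
  { id: 'a', engagementTimeMsec: 0 },
  { id: 'b', engagementTimeMsec: 12 },
  { id: 'c', engagementTimeMsec: 48000 },
];

// Logs the ids of sessions "a" and "b".
console.log(flagLowEngagement(sampleSessions).map((s) => s.id));
```

A flagged session is a candidate for closer inspection rather than proof of bot traffic, which is why the sketch returns the sessions instead of deleting them.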

Another indicator is an extreme bounce rate. Except for something such as a “thank you” page, users normally do not have a bounce rate of 100% on web pages. Especially if the bounce rate of 100% is not comparable to any of the website’s other bounce rate data, the traffic presumably originates from a bot or developer.
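The bounce-rate check can be sketched in the same way. Again, this is a minimal sketch under stated assumptions: the `{ path, sessions, bounces }` page shape, the `/thank-you` allow-list entry and the 0.8 site-wide cut-off are hypothetical and chosen purely for illustration.

```javascript
// Bounce rate of a single page; pages without sessions count as 0.
function bounceRate(page) {
  return page.sessions > 0 ? page.bounces / page.sessions : 0;
}

// Flags pages with a 100% bounce rate that stands out from the site as a whole.
// `pages`: array of { path, sessions, bounces } (assumed shape).
// `expectedPaths`: pages where bouncing is normal, e.g. a "thank you" page.
// `maxSiteRate`: hypothetical cut-off; above it, 100% no longer stands out.
function flagExtremeBounce(pages, expectedPaths = ['/thank-you'], maxSiteRate = 0.8) {
  const sessions = pages.reduce((sum, p) => sum + p.sessions, 0);
  const bounces = pages.reduce((sum, p) => sum + p.bounces, 0);
  const siteRate = sessions > 0 ? bounces / sessions : 0;
  if (siteRate > maxSiteRate) return [];
  return pages
    .filter((p) => bounceRate(p) === 1 && !expectedPaths.includes(p.path))
    .map((p) => p.path);
}

// Made-up sample data for illustration only.
const samplePages = [
  { path: '/', sessions: 1000, bounces: 400 },
  { path: '/pricing', sessions: 200, bounces: 200 },
  { path: '/thank-you', sessions: 50, bounces: 50 },
];

// Logs only "/pricing": the thank-you page is expected to bounce.
console.log(flagExtremeBounce(samplePages));
```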

Headless browsers are identified in a different way: by comparing the user’s screen height with the jQuery window height. An easy way to flag events as originating from a headless browser is to create a custom dimension in GA4 that is fed by a custom variable in Google Tag Manager. First, create a custom variable in GTM with the JavaScript code below. This code checks whether the user’s screen height and the jQuery window height are equal. If so, it returns true. If not, it returns false.

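function() {
  // Requires jQuery to be loaded on the page.
  // Returns true when the screen height equals the jQuery window height, false otherwise.
  return window.screen.height - jQuery(window).height() === 0;
}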

Next, as explained in the GA4 Docs, head over to the admin section in Google Analytics. Navigate to Custom Definitions > Custom Dimensions > New Custom Dimension. Name the dimension something such as “headless” and change the scope to “event”. When creating the dimension, remember the dimension index as it will be required for the next steps.

Tip: to make sure you have followed all steps correctly, also create custom dimensions for window screen height and jQuery window height. This way, you can later test in BigQuery that the headless values are returned as expected.
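The consistency check from the tip above could look roughly like the sketch below, run over rows you have already pulled out of BigQuery. The row shape `{ headless, screenHeight, windowHeight }` is hypothetical, as is the assumption that values may arrive as strings; adapt both to how your dimensions are actually exported.

```javascript
// Recomputes the headless flag from the two recorded heights.
function expectedHeadless(screenHeight, windowHeight) {
  return Number(screenHeight) === Number(windowHeight);
}

// Returns the rows whose stored flag disagrees with the recomputed one.
// Each row is assumed to look like { headless, screenHeight, windowHeight };
// the flag may arrive as a boolean or as the string "true"/"false".
function findMismatches(rows) {
  return rows.filter((row) => {
    const stored = String(row.headless) === 'true';
    return stored !== expectedHeadless(row.screenHeight, row.windowHeight);
  });
}

// Made-up sample rows for illustration only.
const sampleRows = [
  { headless: 'true', screenHeight: '600', windowHeight: '600' },
  { headless: 'false', screenHeight: '1080', windowHeight: '937' },
  { headless: 'false', screenHeight: '800', windowHeight: '800' }, // inconsistent
];

// Logs 1: only the last row has a flag that contradicts its heights.
console.log(findMismatches(sampleRows).length);
```

An empty result suggests the GTM variable and the custom dimensions are wired up consistently.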

Switch back to Google Tag Manager and look for your Google Analytics Settings Variable. Navigate to More Settings > Custom Dimensions > Add Custom Dimension. Under “index”, fill in the index from the previous step. Under “dimension value”, enter the variable you created with the JavaScript code in the first step. Save your changes and publish them to your GTM container.

Now that the custom dimension has been created, it should show up as a value in the BigQuery tables, where it can be queried. We can then filter out headless traffic using a “WHERE” clause, so our data is no longer polluted by event data originating from headless browsers! We have tackled traffic from headless browsers, but that does not cover all useless traffic. As explained, other metrics such as engagement time can also indicate useless traffic. In part two of this blog, we will dive into how machine learning can be used to detect this traffic using the previously mentioned indicators. Until next time!
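For readers who prefer to see the filtering step outside SQL, the “WHERE” clause mentioned above behaves like the filter sketched below, applied to rows shaped like the GA4 BigQuery export, where `event_params` is a list of `{ key, value }` records. Two things here are assumptions rather than facts from this post: that the custom dimension is exported under the key `headless`, and that the flag arrives as the string "true".

```javascript
// Reads one event parameter from a row shaped like the GA4 BigQuery export,
// where event_params is a list of { key, value: { string_value, ... } }.
function getParam(event, key) {
  const param = (event.event_params || []).find((p) => p.key === key);
  return param ? param.value.string_value : undefined;
}

// Keeps only events that are not flagged as headless.
// Assumes the parameter key is "headless" and the flag is the string "true".
function excludeHeadless(events) {
  return events.filter((event) => getParam(event, 'headless') !== 'true');
}

// Made-up sample events for illustration only.
const sampleEvents = [
  { event_name: 'page_view', event_params: [{ key: 'headless', value: { string_value: 'true' } }] },
  { event_name: 'page_view', event_params: [{ key: 'headless', value: { string_value: 'false' } }] },
  { event_name: 'purchase', event_params: [] },
];

// Logs 2: the flagged page_view is dropped, the other two events are kept.
console.log(excludeHeadless(sampleEvents).length);
```

Events without the parameter are kept in this sketch; whether to keep or drop them is a design choice that depends on how much traffic predates the custom dimension.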

Feel like reading more data stories? Then take a look at our blog page. Always want to stay up to date? Be sure to follow us on LinkedIn!

Need some help?

Andrei Fekete

“I am passionate about learning new technologies, simplifying processes, automation, and creating really usable solutions.”

Let's connect

The Data Story