Simplifying Machine Learning: Less is More – PART 2

In part 1 on simplifying machine learning, we talked about the pitfalls of complex models and why opting for simpler models is more often than not the right thing to do. This blog goes over the advantages of both and puts them side by side for comparison.

Resisting trendy models

It is beyond doubt that the latest developments in machine learning, particularly in areas like deep learning, have brought unprecedented advancement in many fields. Technologies such as image and speech recognition, natural language processing, and game playing have made major leaps. As cool as they may be, it is essential not to let the prestige of these cutting-edge, state-of-the-art techniques blind us to the power and value of simpler, more traditional methods.

The tendency to chase this novelty often manifests in the rush to apply the latest machine learning techniques without fully considering their appropriateness for the task at hand. Often, simpler models like logistic regression, decision trees, or gradient boosting can yield similar, if not better, results with less computational overhead and greater interpretability.

Simple facts

In fact, there is a wealth of research supporting this assertion. For example, Ribeiro, Singh, and Guestrin, in their 2016 study “Why Should I Trust You?”, revealed that simple models like linear regression often perform on par with complex models in terms of prediction accuracy. Other sources, such as Caruana et al. (2006), show that for many datasets, simpler models perform as well as more complex models such as neural networks and random forests. What’s more, simpler models are often more robust to the curse of dimensionality, making them better suited for high-dimensional data (Hastie, Tibshirani & Friedman, 2017). The advantages of simple models extend beyond their predictive power: they offer faster computation times, easier implementation, and, most importantly, interpretability.

However, these empirical findings do not imply that simpler models are always better or that complex models are unnecessary. Instead, they suggest that the utility of a model should not be judged purely on its complexity or on being “state-of-the-art”. It should be judged by its suitability for a given task and by the trade-offs it offers between predictive accuracy, interpretability, and computational efficiency. This underscores the need for empirical testing and cross-validation, rather than blind reliance on the newest, most complex techniques.
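As a rough illustration of what such empirical testing could look like, the sketch below compares a simple and a more complex model with 5-fold cross-validation in scikit-learn. The dataset is synthetic (generated with `make_classification`) and the model settings are assumptions chosen for the example, so the scores say nothing about any real business problem.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for a tabular business dataset (an assumption, not real data).
X, y = make_classification(
    n_samples=500, n_features=10, n_informative=5, random_state=0
)

# Candidate models: one simple, one more complex.
candidates = {
    "logistic_regression": make_pipeline(
        StandardScaler(), LogisticRegression(max_iter=1000)
    ),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

# Mean and standard deviation of 5-fold cross-validated accuracy per model.
cv_results = {}
for name, model in candidates.items():
    scores = cross_val_score(model, X, y, cv=5, scoring="accuracy")
    cv_results[name] = (scores.mean(), scores.std())
    print(f"{name}: {scores.mean():.3f} +/- {scores.std():.3f}")
```

Running both candidates through the same folds and the same metric keeps the comparison fair. If the gap between them is small relative to the spread across folds, the simpler model is usually the easier one to justify.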

Advantages of simpler models

Having established the empirical case for simplicity, it is also important to look at the practical advantages of using simpler models. What makes them powerful in many business scenarios? What are the concrete benefits they yield?

Interpretability: Unlike complex models, which often act as ‘black boxes’, simple models provide transparent, understandable rules. Take a logistic regression model used for customer churn prediction as an example. The coefficient associated with each feature (such as customer age, tenure, or total charges) can be interpreted as the effect of that feature on the likelihood that a customer churns (more precisely, on the log-odds of churning, holding the other features fixed). This is invaluable for business stakeholders, as it provides insights that can guide strategic decision-making. Furthermore, interpretable models such as K-means clustering can be used to segment customers based on their behaviour. These segments can then be targeted with personalised marketing campaigns. Marketing teams also often need to predict how customers will respond to a campaign or offer. Decision trees are a popular choice for this task, as they provide clear, interpretable rules (e.g. “If a customer has made a purchase in the last month and visited the website more than five times in the last week, they are likely to respond to the campaign”). What all these cases have in common is interpretability: understanding the “why” behind predictions helps you act strategically.
Less prone to overfitting: Fewer parameters make models less prone to overfitting. Overfitting is a common problem in machine learning, where a model “learns” the training data too well, including its “noise”, leading to poor performance on unseen data. Predictions then partly rest on this noise, information that should play no role in decision-making. With fewer parameters, simple models have less opportunity to fit the noise in the data, which often results in more generalisable models.
Computational efficiency: Simple models are typically more computationally efficient than complex ones. They require less processing power and memory, making them faster to train and predict. This is a critical factor in business scenarios where quick decision-making is required, or resources are constrained.

Easy to implement and maintain: Simple models are easier to implement and maintain. They require less expertise to set up and fine-tune, and the troubleshooting process is often more straightforward. This can save businesses time and resources, and reduce the risk of implementation errors.
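To make the interpretability point more concrete, here is a minimal sketch of reading logistic regression coefficients as odds ratios for churn. Everything in it is an assumption made for illustration: the data is simulated, and the feature names (`tenure_months`, `monthly_charges`, `age`) and effect sizes are hypothetical.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
n_customers = 2000

# Simulated customer features (hypothetical names and ranges).
tenure_months = rng.uniform(0, 72, n_customers)
monthly_charges = rng.uniform(20, 120, n_customers)
age = rng.uniform(18, 80, n_customers)

# Simulated ground truth: longer tenure lowers churn, higher charges raise it,
# and age has no effect. These effect sizes are made up for the example.
log_odds = -0.5 - 0.05 * (tenure_months - 36) + 0.03 * (monthly_charges - 70)
churned = (rng.random(n_customers) < 1 / (1 + np.exp(-log_odds))).astype(int)

feature_names = ["tenure_months", "monthly_charges", "age"]
X = np.column_stack([tenure_months, monthly_charges, age])

# Standardise so each coefficient is the effect of one standard deviation.
X_scaled = StandardScaler().fit_transform(X)
model = LogisticRegression().fit(X_scaled, churned)

# exp(coefficient) is the multiplicative change in the odds of churning.
odds_ratios = dict(zip(feature_names, np.exp(model.coef_[0])))
for name, ratio in odds_ratios.items():
    print(f"{name}: odds ratio per standard deviation = {ratio:.2f}")
```

An odds ratio below 1 means that higher values of the feature go together with lower odds of churning, holding the other features fixed, while a ratio above 1 means the opposite. A value close to 1 suggests the feature carries little signal.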

Considerations, caveats, and when to go beyond simplicity

While the benefits of simple models are numerous, it is important to recognise their limitations as well. No single model or approach is universally the best, and different tasks require different tools. Therefore, as much as we advocate for simplicity, we also recognise that there are scenarios where complex models are the right choice:

Complex interactions and non-linearity: Simple models often assume independence among features or linearity in their relationship with the outcome. In cases where these assumptions are violated, complex models like Random Forests or Neural Networks, which can capture intricate interactions and non-linear relationships, may be more suitable. However, it’s worth noting that, in many real-world business scenarios, these complexities are the exception rather than the norm. More often than not, relationships are simple and linear, and simple models are quite adept at capturing these.
High-dimensional data: Even though some simple models can handle high-dimensional data, others may struggle. For instance, linear regression can become unstable with high-dimensional data, especially when there is multicollinearity among features, and ordinary least squares has no unique solution when there are more features than observations. In most business scenarios, however, data is rarely high-dimensional. Moreover, there are numerous feature selection and dimensionality reduction techniques available that can help manage high-dimensional data effectively. These techniques often pair well with simple models, enabling you to capture the essence of your data without the need for complex models.
Large volumes of data: Simple models may not scale well to very large datasets. In such cases, more complex models that are designed for scalability, like distributed versions of Random Forests or Gradient Boosting Machines, may be a better fit. But remember, “large” is relative. The vast majority of businesses, particularly small to medium-sized ones, will rarely encounter data volumes that necessitate highly scalable models. Simple models can comfortably handle most typical business datasets.
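As one possible way of pairing dimensionality reduction with a simple model, the sketch below chains standardisation, PCA, and logistic regression in a scikit-learn pipeline. The dataset is synthetic and deliberately full of redundant (collinear) features, and the 95% variance threshold is an arbitrary assumption rather than a recommendation.

```python
from sklearn.datasets import make_classification
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: 50 features, of which 30 are linear combinations of the
# 10 informative ones, to mimic strong multicollinearity.
X, y = make_classification(
    n_samples=400,
    n_features=50,
    n_informative=10,
    n_redundant=30,
    random_state=0,
)

pipeline = make_pipeline(
    StandardScaler(),
    PCA(n_components=0.95),  # keep enough components to explain 95% of variance
    LogisticRegression(max_iter=1000),
)

# Cross-validate the whole pipeline so PCA is refit on each training fold.
scores = cross_val_score(pipeline, X, y, cv=5, scoring="accuracy")

# Fit once on all data to inspect how many components were kept.
pipeline.fit(X, y)
n_components_kept = pipeline.named_steps["pca"].n_components_
print(f"components kept: {n_components_kept} of {X.shape[1]} features")
print(f"cross-validated accuracy: {scores.mean():.3f}")
```

Putting the reduction step inside the pipeline means it is learned only from the training folds, which avoids leaking information from the validation data. The trade-off is that coefficients now refer to principal components rather than the original features, so a feature selection step may be preferable when interpretability matters most.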

Takeaways

Remember, the key is not to pit simple against complex, but to appreciate the value each brings and understand their appropriate applications. It is about finding the right tool for the task, the right model for your specific business needs and constraints. Most importantly, it is about being pragmatic, empirical, and informed in your approach. The message is not that you should never use complex models, but rather that you should not feel compelled to use them just because they are trendy or your competitors are using them. Start with simpler models, understand your needs and constraints, and then move on to higher complexity, if need be.

To conclude, while simple models do have limitations, these limitations do not pose significant hurdles in most typical business scenarios. In the vast majority of business situations, the benefits of simple models (their interpretability, ease of implementation, lower risk of overfitting, and computational efficiency) far outweigh their limitations. Simplicity in machine learning is not a step backward, but rather a practical, effective, and often overlooked strategy. Remember, as the saying often attributed to Einstein goes, “Everything should be made as simple as possible, but not simpler.”

Nino Weerman

“As a passionate data enthusiast, I am fascinated by the incredible potential of data to uncover hidden insights and predict future trends. I am constantly energised by the countless ways data can be used to achieve success, and I love turning complex business objectives into clear, actionable data analysis plans.”

The Data Story