How to safely use linear regression and stepwise elimination on real-world datasets?
- Amanda Duim Ferreira
- Jan 27
- 4 min read
I was recently preparing a class on Data Science and AI when I stumbled upon an article with a rather alarming title: "If you use the stepwise elimination algorithm, YOU ARE EVIL!"
As a researcher who uses these tools, I had to laugh, but it also got me thinking. Stepwise elimination is controversial, hated by some statisticians, and loved by others.
The truth is, the algorithm isn't "evil." It’s just a machine learning tool. The problem, and the source of the controversy, lies in how we apply it. In this post, we will walk through how to use linear regression and stepwise elimination safely, specifically when dealing with messy, real-world environmental data.
Back to Basics: What is Linear Regression?
At its core, linear regression is about finding relationships. Imagine we want to predict house prices based on house size. We collect data, plot it, and draw a line that best fits those points.
This is Simple Linear Regression. We have a response variable (price) and a predictor (size). As the size increases, the price increases.
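The line-fitting step described above can be sketched in a few lines. The post's companion code is in R, but here is a minimal Python sketch with invented size/price numbers, just to show the mechanics:

```python
import numpy as np

# Hypothetical data: house sizes (sq ft) and prices (in $1000s).
size = np.array([800, 1000, 1200, 1500, 1800, 2200], dtype=float)
price = np.array([150, 190, 220, 270, 320, 390], dtype=float)

# Fit price = intercept + slope * size by ordinary least squares.
slope, intercept = np.polyfit(size, price, deg=1)

print(f"price ~ {intercept:.1f} + {slope:.3f} * size")
# A positive slope reflects the pattern: as size increases, price increases.
```

`np.polyfit` with `deg=1` is exactly a simple linear regression fit; the returned coefficients are ordered highest degree first, which is why `slope` comes before `intercept`.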

But we know the real world is more complicated. House prices aren't just about size. They depend on other variables such as the number of bathrooms, the presence of a pool, or the location. When we add these extra variables, we move to Multiple Linear Regression.
By analyzing multiple predictors simultaneously, we can uncover patterns that a simple line might miss.
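A minimal sketch of multiple regression on simulated data, using Python's statsmodels (the variable names and coefficients are invented for illustration):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(42)
n = 200

# Simulated predictors: size (sq ft), number of bathrooms, pool (0/1).
df = pd.DataFrame({
    "size": rng.uniform(800, 3000, n),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "pool": rng.integers(0, 2, n).astype(float),
})
# True data-generating process (known here because we simulated it).
df["price"] = (50 + 0.1 * df["size"] + 15 * df["bathrooms"]
               + 30 * df["pool"] + rng.normal(0, 10, n))

# Multiple linear regression: all predictors analyzed simultaneously.
model = smf.ols("price ~ size + bathrooms + pool", data=df).fit()
print(model.params)
```

Because the data were simulated, we can check that the fitted coefficients recover the true values (0.1 per sq ft, +15 per bathroom, +30 for a pool), which is the kind of sanity check real data never allows.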

Leveling Up: Mixed-Effects Models
Things get tricky when our data has groups. For example, a 1000 sq ft house in a city center costs way more than a 1000 sq ft house in a rural village.
If we just lump all the data together, we lose that context.
This is where Mixed-Effects Models come in. They separate our predictors into two types:
Fixed Effects: The things you want to study (e.g., House Size, Bathrooms).
Random Effects: The grouping factors that create variation (e.g., Neighborhood, Site, or Time).
In environmental science, this is crucial. If you take soil samples from the same site over several years, those measurements are related (autocorrelated). Treating them as independent is a statistical sin! Random effects allow us to tell the model: "Hey, these data points belong to the same group, so account for that structure."
Mathematically speaking, a linear mixed-effects model takes the form:

y = Xβ + Zu + ε

where Xβ is the fixed-effects part, Zu is the random-effects part (with u ~ N(0, G)), and ε is the residual error.
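To make the fixed/random split concrete, here is a hedged Python sketch with statsmodels' `MixedLM`, on simulated houses grouped by neighborhood (the neighborhood names and price levels are invented):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

rng = np.random.default_rng(0)
base_price = {"center": 300, "suburb": 200, "rural": 100}  # group-level shifts

# Simulate 40 houses per neighborhood: same size effect, different baselines.
rows = []
for nb, base in base_price.items():
    size = rng.uniform(800, 2500, 40)
    price = base + 0.1 * size + rng.normal(0, 10, 40)
    rows.append(pd.DataFrame({"neighborhood": nb, "size": size, "price": price}))
df = pd.concat(rows, ignore_index=True)

# Fixed effect: size. Random effect: a per-neighborhood intercept.
mixed = smf.mixedlm("price ~ size", data=df, groups=df["neighborhood"]).fit()
print(mixed.fe_params)
```

Lumping all neighborhoods into one plain regression would blur the size effect into the group differences; the random intercept absorbs the group-level variation so the fixed effect of size is estimated cleanly.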
The "Controversial" Method: Stepwise Elimination
So, what happens when you have too many predictors? In practice we often deal with dozens of parameters, and dumping them all into one model creates a mess. Stepwise elimination is an algorithm designed to simplify this.
In the Backwards method:
Start with the full model (all predictors included).
Remove the least significant predictor (often based on p-values > 0.05).
Check the AIC (Akaike Information Criterion). The AIC balances model fit vs. complexity—lower is better.
Repeat until removing variables no longer lowers the AIC.
The result is a lean model that keeps only the most important drivers of your system.
In the Forwards method:
Start with a null model (no predictors, only the intercept is included).
Add the most significant predictor (the one that best explains the remaining variance, often with the lowest p-value).
Check the AIC. If the new model has a lower AIC than the previous one, keep the variable.
Repeat until adding more variables no longer lowers the AIC.
In the Bidirectional method:
Start with a null model (usually, though you can start with the full model).
Add a significant predictor (forward step) OR remove a predictor that has become non-significant after adding others (backward step).
Check the AIC after every addition or removal to ensure the model is improving.
Repeat until no variables can be added or removed to lower the AIC.
The result in all cases is a model that attempts to find the best balance between complexity and explanatory power, though they may arrive at slightly different final sets of variables.
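R users typically reach for `step()` or `MASS::stepAIC()` for this. To show what the backward variant actually does under the hood, here is a bare-bones Python sketch on simulated data (predictor names are invented, and one predictor is pure noise on purpose):

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def backward_eliminate(df, response, predictors):
    """Backward stepwise: drop one predictor at a time while AIC improves."""
    current = list(predictors)
    best_aic = smf.ols(f"{response} ~ {' + '.join(current)}", data=df).fit().aic
    improved = True
    while improved and len(current) > 1:
        improved = False
        for p in list(current):
            trial = [q for q in current if q != p]
            aic = smf.ols(f"{response} ~ {' + '.join(trial)}", data=df).fit().aic
            if aic < best_aic:   # removing p gives a better (lower) AIC
                best_aic, current = aic, trial
                improved = True
                break
    return current, best_aic

rng = np.random.default_rng(1)
n = 300
df = pd.DataFrame({
    "size": rng.uniform(800, 3000, n),
    "bathrooms": rng.integers(1, 4, n).astype(float),
    "junk": rng.normal(0, 1, n),  # unrelated to price by construction
})
df["price"] = 50 + 0.1 * df["size"] + 15 * df["bathrooms"] + rng.normal(0, 10, n)

kept, aic = backward_eliminate(df, "price", ["size", "bathrooms", "junk"])
print(kept)  # "junk" should usually be dropped; the real drivers remain
```

The AIC's complexity penalty is what lets the loop discard the noise predictor: dropping a useless variable barely hurts the fit but removes a parameter, so the AIC goes down.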
Why is it called "Evil"?
Here is where the "evil" label comes from. Linear models rely on strict assumptions:
Linearity: The relationship must actually be linear.
Normality: Errors should be normally distributed.
Homoscedasticity: Variance should be constant.
Independence: Data points shouldn't influence each other.
Stepwise algorithms do not check these assumptions. They will happily run on bad data and hand you a "statistically significant" result that is completely wrong. In addition, because the method looks at the data multiple times to select variables, the p-values in the final model are often artificially low; you end up essentially "fishing" for significance, a practice known as p-hacking. Another pitfall is that the coefficients of the selected variables tend to be inflated (biased away from zero), because the algorithm specifically picked them for looking "strong", noise included.
How to Use It Safely: A Real-World Example
To show how we can use this safely, let’s look at a real-world dataset on soil biogeochemistry. We wanted to see how climatic and soil properties affect metal availability.
We didn't just run the algorithm blindly. We followed a strict safety protocol:
Clean the Data: We checked the predictors for collinearity and removed highly correlated variables.
Check Normality: Our data had many zeros and outliers. Instead of complex transformations that make interpretation hard, we removed the outliers to achieve approximate normality.
Test Random Effects: We tried to use "time" as a random effect. However, the model showed that the variance attributed to time was near zero. This meant a mixed model wasn't justified, so we safely proceeded with a standard linear model.
Run Stepwise: Only after ensuring the data met the assumptions did we run the backward elimination.
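The collinearity check in step 1 is commonly done with variance inflation factors (VIFs). A Python sketch on simulated soil-style predictors (the variable names are invented, and `silt` is made collinear with `clay` on purpose):

```python
import numpy as np
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor

rng = np.random.default_rng(3)
n = 200
clay = rng.uniform(10, 40, n)
df = pd.DataFrame({
    "clay": clay,
    "silt": 0.8 * clay + rng.normal(0, 1, n),  # deliberately collinear with clay
    "ph": rng.uniform(4, 8, n),
})

X = df.assign(const=1.0)  # VIF regressions need an intercept column
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)  # clay and silt show high VIFs; ph stays near 1
```

A common rule of thumb flags predictors with VIF above roughly 5-10; here you would drop either `clay` or `silt` (not both) before running any stepwise procedure, because collinear predictors make the selection unstable.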
The result? The algorithm helped us identify that climate was the dominant control on metal bioavailability, a finding that completely reshaped our research questions regarding climate change scenarios. You can read more about it in the paper that I published (Ferreira et al., 2024).
Try It Yourself
Stepwise elimination isn't evil. It's just a tool that requires a skilled operator. If you ensure your data structure is right and assumptions are met, it is a powerful way to interpret complex environmental processes.
We made a dataset and the R code used in this analysis available for download.
I encourage you to grab them, break the models, and see these principles in action!