In an idealized version of data science, the only thing that matters is big computers: Google is successful simply because it can throw hundreds, if not thousands, of computers at its problems. Of course, the reality is not so simple. In practice, there are methods that often have a far greater effect than simply throwing more computing power at a problem. Two of the most important are feature engineering and feature selection. In this blog, I will focus on the second of these, and try to explain how feature selection can produce models that perform well not only on paper, but also in practice.
How it works
The idea of feature selection pretty much follows from the name: you select some features of your data to use, and toss the rest away. This might sound counter-intuitive: why would I ever want to throw away data? As it turns out, there are quite a few good reasons for doing so. Let's rapid-fire through a few of them in a hypothetical scenario.
Suppose we are trying to predict the volume of drinking water that will be used in the next few hours.
- We have 50,000 data points: the rainfall in every 1 km × 1 km square. Most of this data is not useful. Instead, we could use larger squares, such as 50 km × 50 km, or only use squares that lie over large urban centers.
- We have three data points: the temperature in Celsius, Fahrenheit, and Kelvin. There is no benefit to using all three, so we can select one and get rid of the others.
- One of our data points is the number of cars on the road. While this may be obtainable in hindsight, it is a very hard number to obtain in real time, which means our real-time predictions will suffer. While we might have kept it for theoretical reasons, we can choose to get rid of it for practical reasons.
- One of our data points is the biggest movie release of the week. That is probably not very useful, so we can get rid of it.
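The temperature example above can be automated: redundant features show up as near-perfect correlations, so we can drop any column that is almost perfectly correlated with an earlier one. Here is a minimal sketch with pandas; the dataset and column names are hypothetical:

```python
import numpy as np
import pandas as pd

# Hypothetical dataset: three temperature columns that encode the same signal.
rng = np.random.default_rng(0)
celsius = rng.uniform(-5, 35, size=200)
df = pd.DataFrame({
    "temp_celsius": celsius,
    "temp_fahrenheit": celsius * 9 / 5 + 32,
    "temp_kelvin": celsius + 273.15,
})

# Look only at the upper triangle of the correlation matrix, so each
# redundant pair is counted once, and drop near-perfect duplicates.
corr = df.corr().abs()
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
redundant = [col for col in upper.columns if (upper[col] > 0.99).any()]
df_selected = df.drop(columns=redundant)

print(df_selected.columns.tolist())  # only 'temp_celsius' survives
```

The 0.99 threshold is a judgment call: in real data, two features are rarely perfectly correlated, so you would tune it to your problem.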
Reasons for feature selection
In general, the reasons for removing data can be summarized as follows, each corresponding to an item in the list above.
- Less data means you need less computing power, which allows faster modelling iterations.
- Fewer features make the results more interpretable: the effect of the data is not smeared out across many redundant data points.
- Using less data makes the model easier to deploy, and easier to keep running in practice.
- The last reason I want to mention is an ever-present, often unseen risk in data science: the risk that your model only works in the present, and does not generalize to the future. Every feature you use carries some inherent risk: if rainfall suddenly triples in 2020, data about rainfall in 2019 won't be nearly as useful. The fewer features you use, the less you expose yourself to this risk. For this reason, even seemingly benign information like a movie release carries some risk. What if the model learns that Christmas movies have a negative effect on water consumption? After all, people watch Christmas movies in the winter, and that is when people drink less water. While this indirect effect may be real, it also means our model won't generalize well: in Australia, Christmas is in the summer, and perhaps, in the future, Christmas movies will go out of style. All in all, this is an unnecessary risk, when we could instead use a feature like temperature, which has a much more direct and generalizable effect on water consumption.
Highly predictive & understandable models
In this blog, I covered the main reasons for considering feature selection. It turns out that, contrary to what we may intuitively have thought, getting rid of unnecessary data is actually one of the most useful things you can do when modelling a problem in practice. There are many methods for selecting features, and which one to use depends on which of the four reasons listed above is the most important to you. Combined with feature engineering, feature selection is one of the best ways to achieve a model that not only has high predictive power, but is also understandable and generalizes well.