Regression to the mean is a statistical phenomenon that can be hard to grasp. In this post, I will give a simple explanation, with some intuitive and non-technical examples.
Key message
- Regression to the mean (or shrinkage) is the phenomenon whereby extreme observations (or data points) tend to be followed by observations that are closer to the average.
- Regression to the mean occurs in many areas because it is caused by randomness.
A simple explanation of regression to the mean
Regression to the mean (or shrinkage) is the phenomenon whereby extreme observations (or data points) tend to be followed by observations that are closer to the average. Regression to the mean is observed in areas such as sports performance, stock market returns, and academic performance. In sports, a player who has an exceptional performance in one game is likely to have a more moderate performance in the next game. In the stock market, a stock that has a high return one year is likely to have a more moderate return the next year. And in academics, a student who gets an exceptionally high score on a test is likely to get a more moderate score on the next test. But why does regression to the mean occur?
An intuitive example of regression to the mean
In his book Thinking, Fast and Slow, Daniel Kahneman gives some intuitive examples of regression to the mean. When he was giving a lecture to the Israeli military, one of the officers shared an interesting experience. Whenever he praised his cadets after a good performance in training, they tended to perform much worse afterward. Similarly, he observed that cadets who were yelled at after a poor performance tended to perform much better on the next run. The officer argued that he should yell at his cadets more, because that seemed to improve their performance.
However, Kahneman suggested that the observed effect could be explained entirely by randomness, or ‘luck’. He pointed out that the performance of the cadets depended not only on their skill, but also on whether they happened to be lucky. A cadet who has an exceptionally bad first run is likely to have a more moderate (better) performance on the next run: the cadet was probably unlucky on the first run, and his luck is unlikely to be as bad the next time. Hence, the improved performance of the cadets on the second run could be due to randomness, and not to being yelled at by the officer.
Another example
Let’s look at another example. Suppose I ask 12 students to participate in a little game. The goal of the game is to throw a crayon at the blackboard while blindfolded. The students are awarded points based on how close the crayon lands to the middle of the blackboard. On the first throw, four students scored 1 point, four scored 2 points, and four scored 3 points.
On average, the students scored 2 points, but 4 students scored higher (3 points) than the overall average (2 points). Because this exercise involves a great deal of chance, these students were probably lucky, so they are expected to perform worse on their second try. Similarly, the 4 students who scored lower (1 point) than the overall average are expected to perform better on the second try. In other words, even though a student performed better than average on the first try, her expected performance on the next try is still close to the average of all students, because the game involves a great deal of luck.
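To see this in action, here is a minimal simulation of the game. It is a sketch under the assumption that every student has the same skill, so each throw scores 1, 2, or 3 points purely by luck:

```python
# Minimal simulation of the crayon game. Assumption: all students are
# equally skilled, so each throw scores 1, 2, or 3 points purely by luck.
import random

random.seed(1)
n_students = 12

throw_1 = [random.choice([1, 2, 3]) for _ in range(n_students)]
throw_2 = [random.choice([1, 2, 3]) for _ in range(n_students)]

# Pick the students who scored above the mean (3 points) on the first throw.
lucky = [i for i, score in enumerate(throw_1) if score == 3]

if lucky:
    first = sum(throw_1[i] for i in lucky) / len(lucky)   # 3.0 by construction
    second = sum(throw_2[i] for i in lucky) / len(lucky)  # tends toward 2
    print(f"Top scorers on throw 1: {first:.2f}, same students on throw 2: {second:.2f}")
```

On most runs, the second-throw average of the first-throw top scorers lands near 2 rather than near 3, even though nothing about the students has changed.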
It’s important to note that regression to the mean doesn’t mean that the next observation will always be closer to the mean; it means that the next observation is more likely to be close to the mean than the previous extreme value was.
Regression to the mean in statistical models
As we saw in the example of the cadets, ignoring regression to the mean can lead to false conclusions about cause-and-effect relationships. It is therefore important to be aware of regression to the mean, and possibly account for it when analyzing data.
Statistical models are often used for prediction. One problem in training prediction models is so-called overfitting, which occurs when the model has learned the random noise in the data rather than the underlying relationship. Overfitting can be mitigated by shrinking estimates toward the mean.
Let’s go back to our example of throwing crayons at the blackboard. Suppose I am interested in predicting the students’ performance on their second try, based on the data collected from their first try. We know that the data from the first try contain a great deal of noise, because luck was involved. Thus, if I want to predict the performance of a particular student on her second try, I had better shrink her score on the first try (3) toward the mean (2). In other words, my predictions will probably improve when I shrink my estimates, because I know that there was a great deal of randomness in the data.
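As a rough sketch, the shrunk prediction is just a weighted pull toward the mean. The shrinkage factor of 0.3 below is an arbitrary illustrative choice, not an estimated value:

```python
# Sketch of shrinking a single observed score toward the group mean.
# factor = 1 keeps the raw score; factor = 0 predicts the mean for everyone.
def shrink(observed: float, mean: float, factor: float) -> float:
    return mean + factor * (observed - mean)

# A student scored 3 on the first try; the group mean was 2.
print(shrink(observed=3, mean=2, factor=0.3))  # 2.3 -- closer to the mean than to 3
```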
The amount of shrinkage should depend on the amount of uncertainty in the data. For example, if I only have data on the first try, I do not have a lot of information on a student’s ability. In that case, I should shrink that first observation heavily toward the mean, because there is not a lot of evidence of the student’s ability. However, if I have data on a student for 20 subsequent tries, that gives me a pretty good picture of the student’s ability, and less shrinkage is needed.
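One common way to formalize this (a sketch, not the only option) is to weight a student’s own average by n / (n + k), where n is the number of observed tries and k reflects how noisy a single try is. The value k = 5 below is an illustrative assumption, not a fitted constant:

```python
# Data-dependent shrinkage: the more tries we observe (n), the more we
# trust the student's own average; k = 5 is an illustrative assumption.
def shrinkage_factor(n_tries: int, k: float = 5.0) -> float:
    return n_tries / (n_tries + k)

def predict(student_avg: float, n_tries: int, grand_mean: float) -> float:
    w = shrinkage_factor(n_tries)
    return w * student_avg + (1 - w) * grand_mean

print(predict(student_avg=3.0, n_tries=1, grand_mean=2.0))   # ~2.17, heavy shrinkage
print(predict(student_avg=3.0, n_tries=20, grand_mean=2.0))  # 2.8, light shrinkage
```

With a single try, the prediction stays close to the overall mean; with 20 tries, it stays close to the student’s own average.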
Conclusion
In conclusion, regression to the mean is the statistical phenomenon whereby an extreme value of a variable tends to be followed by a more moderate value. It can be observed in a variety of areas and can lead to confusion and misinterpretation if it is not taken into account. It’s important to be aware of this phenomenon when interpreting data and drawing conclusions about cause-and-effect relationships.