I'm always interested in improving my model for predicting game outcomes. My logistic regression model is based on straightforward variables: offensive and defensive passing and running efficiencies, turnover rates, and penalty rates. In this post, I'll question some of my own assumptions and begin to look at which variables really belong in a prediction model.
Explanatory vs. Predictive Models
As I updated the data for the predictions each week over the recent regular season, I noticed that some of the variables were more consistent than others. Turnover rates were particularly erratic. A team with a very good interception rate in the first half of the season would very often have a below-average interception rate in the second half.
Any basic regression model that attempts prediction is based on an assumption that the variables used as predictors, the 'to-date' variables, are indicative of what the same variables will be in the future. In football terms, when I include each team's interception rates from weeks 1-8 in the model to predict outcomes for week 9, I'm assuming that previous team interception rates are representative of future team interception rates.
But what if interceptions were completely random? Weeks 1-8 would not be predictive of week 9. Even though interceptions would still explain a large part of previous outcomes, past interceptions would not predict future outcomes at all. Just like mutual funds, past turnover performance does not guarantee future returns.
What if interception rates were just 'mostly' random? Should they still be included in a prediction model? Perhaps variables with lots of random noise, such as interceptions or fumbles, should not be included in predictive models even though they explain a large part of past outcomes. The question becomes 'what is the critical signal-to-noise ratio that makes a variable appropriate for inclusion as a predictor?'
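One way to make the signal-to-noise question concrete is a small simulation. The sketch below is not part of my model; it assumes a deliberately simple setup in which each team's stat in a half-season equals a fixed 'skill' value plus fresh random noise, both normally distributed. The names `skill_sd` and `noise_sd` are hypothetical parameters chosen for illustration. Under that assumption, the correlation between the two halves should approach skill variance divided by total variance.

```python
import random
import statistics


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def split_half_correlation(skill_sd, noise_sd, n_teams=20000, seed=1):
    """Simulate a stat as skill + noise in each half-season and
    return the correlation between the two halves."""
    rng = random.Random(seed)
    first, second = [], []
    for _ in range(n_teams):
        skill = rng.gauss(0.0, skill_sd)                 # persists all season
        first.append(skill + rng.gauss(0.0, noise_sd))   # noise in weeks 1-8
        second.append(skill + rng.gauss(0.0, noise_sd))  # new noise in weeks 9-17
    return pearson(first, second)


# Assumed (skill_sd, noise_sd) pairs, from mostly signal to pure noise.
for skill_sd, noise_sd in [(1.0, 0.5), (1.0, 1.0), (1.0, 3.0), (0.0, 1.0)]:
    expected = skill_sd**2 / (skill_sd**2 + noise_sd**2)
    simulated = split_half_correlation(skill_sd, noise_sd)
    print(f"skill_sd={skill_sd} noise_sd={noise_sd} "
          f"expected r={expected:.2f} simulated r={simulated:.2f}")
```

In this toy setup, a purely random stat (`skill_sd=0.0`) shows essentially no correlation between halves, no matter how much it mattered in the games already played.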
"Interceptions are very random, and they are 'thrown' by an offense much more than they are 'taken' by a defense."
There is so much statistical data available for football teams that it is tempting to dump it all into regression software. Doing that would produce a very high r-squared, but the result would include so much noise, so many non-repeating circumstantial conditions, that it would not be an effective prediction model. I believe this is why so many other models out there do so poorly. A system like DVOA may be very good at quantifying how well teams have done to date, which is something we already know, but not as good at telling us which teams are likely to do well in the future.
Team Stat Self-Correlations
To test which variables should remain in a prediction model, I tested how well each variable predicted itself from the first half of a season to the second half. This is known as longitudinal auto-correlation. This method tests how enduring and repeatable each variable is.
I tested how well team efficiency stats from weeks 1-8 predicted the same stats in weeks 9-17. For example, I tested how well offensive passing efficiency from the first half of the season predicted offensive passing efficiency in the second half. Both offensive and defensive stats were tested. I used data from the 2006 and 2007 regular seasons for all 32 teams (n=64), with two exceptions: mid-season 3rd down conversion rate and penalty rates were not available for 2006.
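For readers who want to try this on their own data, here is a minimal sketch of the split-half calculation. It is an illustration, not my actual code. The record layout, the team labels, and every number in `GAMES` are made up to keep the example runnable, and computing each half's rate as totals divided by totals (rather than averaging weekly rates) is an assumption of the sketch.

```python
from collections import defaultdict


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)


def split_half_pairs(games, last_first_half_week=8):
    """Return one (first_half_rate, second_half_rate) pair per team-season.

    Each game is (season, team, week, numerator, denominator), for example
    passing yards and pass attempts.
    """
    # Per team-season: [num_first, den_first, num_second, den_second]
    totals = defaultdict(lambda: [0.0, 0.0, 0.0, 0.0])
    for season, team, week, num, den in games:
        offset = 0 if week <= last_first_half_week else 2
        totals[(season, team)][offset] += num
        totals[(season, team)][offset + 1] += den
    return [(n1 / d1, n2 / d2)
            for n1, d1, n2, d2 in totals.values()
            if d1 > 0 and d2 > 0]


# Made-up placeholder data: (season, team, week, pass_yards, pass_attempts).
GAMES = [
    (2006, "AAA", 3, 240, 30), (2006, "AAA", 12, 210, 30),
    (2006, "BBB", 3, 150, 30), (2006, "BBB", 12, 180, 30),
    (2007, "AAA", 5, 270, 30), (2007, "AAA", 14, 240, 30),
    (2007, "BBB", 5, 180, 30), (2007, "BBB", 14, 150, 30),
]

pairs = split_half_pairs(GAMES)
first_half = [p[0] for p in pairs]
second_half = [p[1] for p in pairs]
r = pearson(first_half, second_half)
print(f"team-seasons: {len(pairs)}, split-half correlation: {r:.2f}")
```

With real data, each of the 64 team-seasons would contribute one pair, and the same function could be reused for any rate stat by changing the numerator and denominator.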
The correlation coefficients between team stats from weeks 1-8 and the same stats from weeks 9-17 are listed in the table below.
|Team Stat|Correlation (weeks 1-8 vs. weeks 9-17)|
|D Int Rate|0.08|
|D Sack Rate|0.24|
|O 3D Rate|0.43|
|O Fumble Rate|0.48|
|O Int Rate|0.27|
|O Sack Rate|0.26|
The longitudinal correlations range from as high as 0.60 for defensive pass efficiency and 0.58 for offensive pass efficiency, to as low as 0.08 for defensive interception rate.
The defensive interception rate stands out as the least enduring, least consistent team stat. In contrast, offensive interception rates correlate noticeably better, with a coefficient of 0.27.
This indicates there is a lot of randomness in interceptions, which is no surprise. But producing defensive interceptions does not appear to be an enduring, repeatable ability of a team. Instead, it appears that defensive interceptions are more a function of 1) randomness, and 2) the opponents' tendency to throw interceptions. In other words, interceptions are very random, and they are 'thrown' by an offense much more than they are 'taken' by a defense.
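With only 64 team-seasons, it is worth asking how precisely these two correlations are pinned down. The sketch below is a rough check rather than part of the original analysis. It uses the standard Fisher z-transformation to put an approximate 95% interval around each coefficient, and it assumes n=64 for both interception rates and treats the team-seasons as independent, which is only approximately true since the same 32 teams appear in both years.

```python
import math


def fisher_ci(r, n, z_crit=1.96):
    """Approximate 95% confidence interval for a Pearson correlation,
    using the Fisher z-transformation."""
    z = math.atanh(r)                # transform r to an approximately normal scale
    se = 1.0 / math.sqrt(n - 3)      # standard error on the transformed scale
    return math.tanh(z - z_crit * se), math.tanh(z + z_crit * se)


# Coefficients taken from the table above; n=64 is assumed for both.
for label, r in [("D Int Rate", 0.08), ("O Int Rate", 0.27)]:
    low, high = fisher_ci(r, n=64)
    print(f"{label}: r={r:.2f}, approx 95% CI [{low:.2f}, {high:.2f}]")
```

Under these assumptions, the interval for defensive interception rate spans zero, while the interval for offensive interception rate sits just above it. The two intervals overlap, so the size of the gap between them should be read with some caution.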
In following posts, I'll demonstrate that some of the more random team stats can be more accurately predicted by using other, less noisy stats instead of the to-date stats themselves. This may have large implications for an improved game prediction model.