Up to now, I have used linear regression to predict a team's total season wins from its efficiency stats. It was, and still is, very useful for understanding how much each phase of the game contributes to winning. But the method has its limits, notably in predicting the outcomes of individual games and in accounting for the strength of future opponents.
A different technique lets us keep the same stats that we've established as the best measures of a team's performance and strength in the 4 primary dimensions of the sport, plus turnovers.
That technique is a form of non-linear regression called "logit" (logistic) regression, which calculates the probability of a dichotomous outcome, i.e. that one team will beat another in an individual game. The independent variables remain the same:
Off Pass Eff
Off Run Eff
Def Pass Eff
Def Run Eff
The model has 2 sets of variables: the efficiency stats for Team A and those for Team B. The outcome variable is which team won: AWon = 1 if Team A won, and AWon = 0 if Team B won.
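For readers who want the model written out, the standard logit form is sketched below. The symbols are notation assumed here, not taken from the regression software: $x_k^{A}$ and $x_k^{B}$ are the efficiency stats for Team A and Team B, and the $\beta$ terms are the coefficients the regression estimates.

```latex
\[
P(\mathrm{AWon} = 1) = \frac{1}{1 + e^{-z}},
\qquad
z = \beta_0 + \sum_{k} \beta_k^{A}\, x_k^{A} + \sum_{k} \beta_k^{B}\, x_k^{B}
\]
```

The linear score $z$ can take any value, but the logistic function squeezes it into a probability between 0 and 1. That is why this form suits a win/lose outcome better than ordinary linear regression.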
The new model needs a database of games to analyze, so I prepared a data set with the outcome of every regular season game in 2005, along with each team's corresponding efficiency stats. Each game is a "case," statistically speaking. There are 256 regular season games each year in the NFL, so I was confident that one year of data would be enough to establish the significance of each variable. This also assumes that NFL football doesn't change drastically in nature from year to year, which I would come to learn is not a good assumption.
I also added home field advantage to the model. If Team A was the home team, the variable AHome = 1, otherwise it was 0.
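As a minimal sketch of how a fitted model of this kind turns two teams' stats into a win probability, the Python below applies the logistic function to a weighted sum of the inputs. Everything numeric in it is hypothetical: the coefficient values, the team stat values, and the names `win_probability`, `COEF_A`, `COEF_B` and `HOME_COEF` are made up for illustration and are not the fitted values from this model.

```python
import math

# Efficiency stats used as inputs (names follow the list in the text).
FEATURES = ["off_pass_eff", "off_run_eff", "def_pass_eff", "def_run_eff"]

# HYPOTHETICAL coefficients, for illustration only. The real values come
# from running the regression and are not reproduced here.
COEF_A = {"off_pass_eff": 0.50, "off_run_eff": 0.30,
          "def_pass_eff": -0.50, "def_run_eff": -0.30}
# For simplicity, assume Team B's stats count equally but in the opposite direction.
COEF_B = {name: -value for name, value in COEF_A.items()}
HOME_COEF = 0.40   # hypothetical weight on the AHome dummy
INTERCEPT = 0.0    # hypothetical

def win_probability(team_a, team_b, a_home):
    """Return the modeled probability that Team A beats Team B.

    team_a, team_b: dicts mapping each name in FEATURES to a stat value.
    a_home: 1 if Team A is the home team, otherwise 0.
    """
    z = INTERCEPT + HOME_COEF * a_home
    for name in FEATURES:
        z += COEF_A[name] * team_a[name] + COEF_B[name] * team_b[name]
    # The logistic function maps the linear score z into the range (0, 1).
    return 1.0 / (1.0 + math.exp(-z))

# Made-up stat lines for two evenly matched teams.
average_team = {"off_pass_eff": 6.0, "off_run_eff": 4.0,
                "def_pass_eff": 6.0, "def_run_eff": 4.0}

p_neutral = win_probability(average_team, average_team, a_home=0)
p_at_home = win_probability(average_team, average_team, a_home=1)
print(f"Even matchup, A on the road: {p_neutral:.3f}")
print(f"Even matchup, A at home:     {p_at_home:.3f}")
```

With these made-up weights, two identical teams come out even when Team A is on the road, and the home dummy alone pushes Team A above 50%. The real model does the same kind of calculation with its own estimated coefficients.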
After running the regression, I was amazed at how well the model predicted winners. For the 2005 season, it predicted 74% of all regular season games correctly. Adding variables such as penalties or sacks improved the model only marginally, to about 75% correct. Because I had to rerun the numbers each week during the season, every additional variable added effort to the task without enough benefit to be worthwhile.
The regression results look like this:
Mean of A_Won = 0.500
Number of cases 'correctly predicted' = 380 (74.2%)
Log-likelihood = -262.404
Likelihood ratio test: Chi-square(11) = 184.974 (p-value 0.000000)
           Predicted
             0    1
  Actual 0  190   66
         1   66  190
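The summary figures in this output can be cross-checked against each other. The short Python sketch below recomputes the percent correct from the four counts in the table and the likelihood ratio statistic from the log-likelihood. It assumes the standard definition of the likelihood ratio test, which compares the model to one that predicts only the overall mean of A_Won. The variable names are mine.

```python
import math

# Counts copied from the 2x2 table in the regression output
# (rows are the actual outcome; the table is symmetric).
table = [[190, 66],
         [66, 190]]

n_cases = sum(sum(row) for row in table)
n_correct = table[0][0] + table[1][1]   # diagonal = correct predictions
accuracy = n_correct / n_cases

# Values copied from the regression output.
log_likelihood = -262.404
mean_a_won = 0.500

# Log-likelihood of a "null" model that predicts the mean of A_Won for
# every case (assumption: this is the baseline the software uses).
null_log_likelihood = n_cases * (
    mean_a_won * math.log(mean_a_won)
    + (1 - mean_a_won) * math.log(1 - mean_a_won)
)

# Likelihood ratio statistic: twice the gain in log-likelihood.
lr_statistic = 2 * (log_likelihood - null_log_likelihood)

print(f"cases: {n_cases}, correct: {n_correct}, accuracy: {accuracy:.1%}")
print(f"likelihood ratio statistic: {lr_statistic:.3f}")
```

Both numbers agree with the output: 380 correct out of 512 cases is 74.2%, and the statistic comes to about 184.97. The table totals 512 cases, twice the 256 games played. One reading, which is an assumption and not stated above, is that each game was entered once from each team's point of view. That would also explain why the mean of A_Won is exactly 0.500 and why the table is symmetric.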