either a solid bargain or a disastrous ripoff depending on how we analyze the data. Merely by flipping the x and y axes of a scatterplot, we can reach completely opposite conclusions about the value of a QB relative to what we'd expect for a given salary, or for a given level of performance. Much of this post is derived from the many insightful comments on the original. Please take the time to read them, especially those from Peter, X, Phil and Steve.
By regressing salary on performance (adjusted salary cap hit on the vertical (y) axis and Expected Points Added per Game (EPA/G) on the horizontal (x) axis), Rodgers' deal is insanely expensive by conventional standards. But by regressing performance on salary, his new contract is a bargain.
Which one is correct? That depends on several considerations. First, there are generally two types of analyses. The one I do most often is normative analysis--what should a team do? The second type is descriptive analysis--what do teams actually do? The right analytic tool can depend on which question we are trying to answer.
The reason we saw two different results by swapping the axes is that Ordinary Least Squares (OLS) regression chooses a best-fit line by minimizing the squared errors between the estimate and the actual data of the y variable only. Because only vertical errors count, the fitted slope is attenuated toward zero with respect to the x axis whenever the fit is imperfect. When we swap axes, OLS minimizes an entirely different set of errors, so the two fitted lines don't agree--the procedure is not symmetrical.
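The asymmetry is easy to verify numerically. Here's a minimal sketch using made-up salary and performance numbers (illustrative only, not the post's actual dataset). The product of the two OLS slopes equals r-squared, so the two fitted lines can only agree when the correlation is perfect:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: performance (EPA/G) and salary ($M), loosely correlated.
epa = rng.normal(5, 3, 50)
salary = 2 + 1.5 * epa + rng.normal(0, 4, 50)

# OLS slope of y on x: cov(x, y) / var(x)
def ols_slope(x, y):
    return np.cov(x, y)[0, 1] / np.var(x, ddof=1)

b_salary_on_epa = ols_slope(epa, salary)   # regress salary on performance
b_epa_on_salary = ols_slope(salary, epa)   # regress performance on salary

# If OLS were symmetric, the second slope would be the reciprocal of the
# first. Instead their product equals r^2, which is < 1 for any imperfect fit.
r = np.corrcoef(epa, salary)[0, 1]
print(b_salary_on_epa * b_epa_on_salary, r**2)
```

Plotting both lines on the same scatter makes the disagreement obvious: the salary-on-performance line and the performance-on-salary line straddle the cloud of points, and they pinch together only as the correlation approaches 1.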
If we choose an error-minimization criterion other than least squares, we get different estimates. One of the simplest is Least Absolute Deviation (LAD), which is similar to OLS except that it minimizes the absolute value of the error rather than the square of the error. Another method, mentioned in a comment to the original post by Peter, is called Reduced Major Axis (RMA) regression, which fits a line by accounting for deviations in both x and y. RMA is useful when there is error in both variables.
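A sketch of how the RMA slope relates to the two OLS slopes, using simulated data with noise in both variables (the case RMA is designed for; the numbers are illustrative, not the post's dataset). The RMA slope is simply sign(r) times the ratio of the standard deviations, which works out to the geometric mean of the y-on-x slope and the reciprocal of the x-on-y slope:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data where both variables are noisy measurements of a latent value.
true = rng.normal(0, 1, 200)
x = true + rng.normal(0, 0.5, 200)
y = 2 * true + rng.normal(0, 0.5, 200)

r = np.corrcoef(x, y)[0, 1]
sx, sy = np.std(x, ddof=1), np.std(y, ddof=1)

slope_ols = r * sy / sx             # ordinary y-on-x OLS slope
slope_xy  = r * sx / sy             # x-on-y OLS slope (in x-per-y units)
slope_rma = np.sign(r) * sy / sx    # reduced major axis slope

# RMA is the geometric mean of slope_ols and 1/slope_xy.
print(slope_ols, slope_rma)
```

Because |r| is at most 1, the RMA slope is always at least as steep as the ordinary y-on-x slope, which is why the red line in the chart below sits between the two OLS lines.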
The chart below illustrates how the results of each method compare. For now, performance (in terms of EPA/G) is on the x axis while salary (adjusted cap hit) is on the y axis. The black line is the OLS estimate where salary is the regressed variable--Rodgers is a ripoff (Brian 2's perspective in the original post). The purple line is the OLS estimate where performance is the regressed variable--Rodgers is a bargain (Brian 1's perspective). The green line is the LAD estimate, and the red line is the RMA estimate. Note that Rodgers' new contract would theoretically put him below the letter 'e' in 'Game' in the title.
We're typically taught that x and y axis choices should be based on cause and effect. The x, or independent, variable "causes" the y, or dependent, variable. So the choice of which variable goes on the x axis and which gets regressed on the y axis should be simple. Does salary cause performance, or does performance cause salary? I think the answer is neither.
I think it really is a matter of perspective... For example, from the player's perspective: if he reliably performs around 11 EPA/G (independent/cause), how much money can he expect in return on the FA market (dependent/effect)? But from the team's perspective: if they buy $21M worth of QB on the FA market (independent/cause), how much performance can they expect (dependent/effect)?
You might say (as I think someone above did), arbitrarily paying a person a lot of money does not "cause" him to play well at QB, as the Jets proved with Mark Sanchez (zing!). Case in point--if you paid me $20M to be an NFL QB, I'd average -100 EPA/G.
But I've left an important systematic linkage out of the discussion: The Market. Paying someone $20M to play QB doesn't cause him to be skilled, but purchasing a $20M asset in a competitively priced market provides a systematic linkage from pay to performance. It's not unlike buying a race car. All other things being equal, paying $100k rather than $50k for a car in a competitive market means I should expect a faster car. Money does "cause" performance, indirectly via a competitive market process.
From what I understand, and according to several of the comments in the original post, the choice of which variable should be regressed onto the other should be based on something other than our conception of causation.
Uncertainty and Error
In a bivariate regression, the regressed variable should be the one subject to statistical variance. In other words, the y variable should be the one with a component of random error, while the x variable is the one we know with certainty.
In this case, we know salary with absolute certainty. We know Peyton Manning has been paid an average of $17M per year. It's not as if we have only a rough idea of his salary with some error built in. Tragically, Mark Sanchez has been paid $8M per season since his extension. That's an exact amount with no uncertainty. Now, one could say that amount was in error, because Sanchez never came close to living up to his contract. But that's not the kind of error we're talking about. Statistical error is not a mistake. It's the difference between what we would expect based on a model and what is actually observed.
For pure OLS regressions to be fully justified, the errors in the y variable should follow a normal distribution. In fact, the least squares method is not an arbitrary choice. It is directly derived from the formula for the Gaussian (normal) distribution: minimizing squared errors is exactly what maximizes the likelihood of the data when the errors are Gaussian.
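That connection can be checked numerically: maximizing the Gaussian log-likelihood of the residuals recovers the same line as least squares. A sketch with simulated data, using scipy.optimize.minimize as a generic optimizer (the data and starting values are arbitrary illustrations):

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(3)
x = rng.normal(0, 1, 100)
y = 1.0 + 2.0 * x + rng.normal(0, 1, 100)

# Closed-form least-squares fit: returns (slope, intercept).
b_ls, a_ls = np.polyfit(x, y, 1)

# Negative Gaussian log-likelihood of the residuals.
# log_sigma is optimized instead of sigma to keep sigma positive.
def neg_log_lik(params):
    a, b, log_sigma = params
    resid = y - (a + b * x)
    sigma2 = np.exp(2 * log_sigma)
    return 0.5 * np.sum(resid**2 / sigma2 + np.log(2 * np.pi * sigma2))

fit = minimize(neg_log_lik, x0=[0.0, 0.0, 0.0])
a_ml, b_ml, _ = fit.x

# The maximum-likelihood line matches the least-squares line.
print(b_ls, b_ml)
```

The squared residual inside the likelihood comes straight from the exponent of the Gaussian density, which is why changing the assumed error distribution (say, to Laplace) changes the criterion (to absolute deviations, i.e., LAD).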
Here are the distributions of the two variables. The first is salary.
You can see that it's not normal, at least for the range of our sample. It's more like a power-law distribution, where there are lots of players with relatively low pay and fewer players as salary increases. This is a near-universal salary distribution found in almost any context and every type of job. But I suspect it's not really a power law at play--it's probably the extreme right tail of a normal distribution of all athletes in the general population who could conceivably be QBs. After all, if scouts and coaches are doing their jobs, that's where NFL QBs will be found.
In contrast, the histogram of EPA/G appears very normal (bell-shaped), which suggests there is a random error component at play. It's not that the normality of distributions should decide which variable gets regressed. Rather, the distribution betrays an uncertainty in the variable. In this case, the uncertainty surrounds the "true" value of a QB. EPA/G is a good stat, but it's only a sample of a player's "true" ability. There are many other factors beyond true ability that determine a QB's EPA/G, including teammates, coaches, opponents, and sampling error. Ideally, a QB's pay is in exchange for his true talent, but that can never be known. It can only be estimated. EPA/G is really just a crude approximation of a player's underlying ability.
[As an aside, one might wonder how the right tail of a normal distribution can produce a complete normal distribution in performance. Shouldn't EPA/G's distribution also look like the right tail of a bell curve? No, because when two right tails compete (offenses composed of right-tail talent vs. defenses composed of right-tail talent), the outcome will be normal.]
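This is easy to simulate. A sketch assuming a standard-normal talent population and a top-2% cutoff (both values are arbitrary illustrations): the truncated tail is heavily right-skewed, but game outcomes--one tail draw minus another--come out symmetric and bell-shaped:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)

# Take only the right tail (top 2%) of a large normal "talent" population,
# as a stand-in for the NFL-caliber subset of all possible players.
pop = rng.normal(0, 1, 1_000_000)
cut = np.quantile(pop, 0.98)
tail = pop[pop > cut]  # heavily right-skewed, tail-shaped

# Game outcomes: right-tail offensive talent minus right-tail defensive talent.
off = rng.choice(tail, 10_000)
dfn = rng.choice(tail, 10_000)
outcome = off - dfn

# The tail is strongly skewed; the head-to-head outcomes are not.
print(stats.skew(tail), stats.skew(outcome))
```

Because each outcome is the difference of two identically distributed draws, the skew cancels exactly, and the resulting distribution is symmetric and unimodal--a bell curve, even though neither input looks anything like one.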
So we know a player's salary with absolute certainty, and we can only estimate his true talent. EPA/G, the stat I chose to best approximate talent, is clouded by sampling error and unaddressed external factors, like surrounding team talent.
Ultimately, what I've learned from this exercise is that the selection of x and y variables in a regression doesn't have to do with cause and effect, or independent vs. dependent. It's about which variable you know without (statistical) error, and which variable contains uncertainty.