Lineups and Pitchers, Observations on Markov Chains
by Alex Nelson, 17 August 2015
Executive Summary. We discuss the impact of the lineup and pitcher on predicting game outcomes.
- Lineup Matters
- Including the Pitcher is Hard, but is it Necessary?
So, last time the Markov chain model was introduced and we made some predictions. Unfortunately, those predictions were off. What happened?! Well, it turns out that the lineup matters significantly.
Furthermore, when batting, roughly 40% of the outcome is due to the pitcher, and 60% due to the batter. Adam Sugano discovered this relationship statistically for the 2001-2006 seasons (and confirmed it with the 1987 season) in section 3.3.1 et seq. of his doctoral thesis. So neglecting the pitcher ignores nearly half the battle.
Moral: If the reader has been perusing the previous posts on modeling baseball, we hope the moral comes across quite clear: the first few models will be terrible. But the point is to figure out why they’re wrong, then improve. Do not be discouraged with initial failures. (End of Moral)
Consider the following 9 players: David DeJesus, Kole Calhoun, Mike Trout, Albert Pujols, David Murphy, Erick Aybar, Conor Gillaspie, Johnny Giavotella, and Chris Iannetta. There are many superstitions about how to order them into a lineup, but would it really matter for a Markov model?
Lets consider all possible lineups (that is 9! = 362,880 possibilities). What are the expected number of runs in the Markov model? It would take 3 weeks of solid computing to find out, so instead I considered permutations of the first 6 or so positions.
Permuting the first 4 positions produces an expected number of runs between 4.55 and 4.58, and a random sample of 120 lineups (of all 362k possibilities) has the lowest scoring lineup be 4.45 runs scored by Giavotella, Murphy, DeJesus, Gillaspie, Aybar, Iannetta, Trout, Calhoun, Pujols; the highest scoring lineup be 4.58 runs scored by Pujols, Iannetta, Calhoun, Trout, Murphy, DeJesus, Gillaspie, Aybar, Giavotella.
If we started pulling from the inactive roster, things get horribly worse. But that’s because we’re using skewed probabilities in the Markov chain. So we shouldn’t be surprised when someone inactive is placed in a Markov chain, and the results are garbage.
Determining the Lineup
There is no definitive way to generate the correct lineup, but we may guess based on the recent lineups for a given team. These may be found on the ESPN page for each team, found under their roster. Strong regularities may be found (e.g., J Rollins always starts for the LA Dodgers, followed by E Hernandez).
Since we only have 2014 data (and pre-2014 data), we have some ambiguity when fresh rookies are in the lineup. We use 2014 league average as a proxy for the fresh rookies’ statistics (if our job depended on it, we could enter the 2015 data into the prediction and use that instead).
Cleveland Indians vs Anaheim Angels (August 3, 2015)
Prediction for tomorrow's game: .@Indians 3.7844 runs VS .@Angels 4.7573. Angels win with home-team advantage.— Alex Nelson (@anelson_unfold) August 2, 2015
The outcome Angels 5, Indians 4. We see that Garrett Richards performed superbly, with the probability a hitter would hit the ball being 21.05% in this game. Since this was so close to the league average, the naive Markov model worked well.
So, the problem was just the lineup. Huzzah, huzzah, I’ll just throw back my legs and pollute my britches with delight…
Cleveland Indians vs Minnesota Twins (August 8, 2015)
Prediction for tomorrow's game: .@Indians 3.6933 vs .@Twins 3.83678, Minnesota will win...if it could be called a victory...— Alex Nelson (@anelson_unfold) August 8, 2015
Final score Cleveland 17, Minnesota 4. D’oh! What happened? Did Cleveland draft Mutant Atomic Supermen? No, it appears that Minnesota’s pitchers were not up to scratch. Santana had a hit probability of 61.5% (his career stats average suggest 23% more likely), for example, and Graham had a 50% probability.
Baltimore Orioles vs Anaheim Angels (August 8, 2015)
Prediction for tomorrow's game: .@Angels 4.5614 vs .@Orioles 3.7763659— Alex Nelson (@anelson_unfold) August 8, 2015
But really, the score was Orioles 5, Angels 0. Jimenez was on fire, his career stats suggest an OBPA of about 26.655%, but in the game a mere 7.4% of batters hit the ball.
Including the Pitcher is Hard, but is it Necessary?
The Markov model has a serious flaw, in that it does not take into account the pitcher. (The Cleveland vs Minnesota game should convince us how important a factor the pitcher is in a game!) Sadly, we cannot simply compose the transition matrix with a “pitcher modification” matrix under most circumstances because the derivation does not allow for it.
(One trick is to rewrite this in a geometric series, Taylor expand to a couple orders, and you get some factor correcting the probabilities. Take the denominator HP/L + (1 - H)(1 - P)/(1 - L) add and subtract by (1 - H)P/L, rewriting it as P/L + (1 - H)((1 - P)/(1 - L) - P/L), divide the top and bottom by P/L, and voila! you have the denominator look like 1 - X. Observe X is of the order 0.01, which means a quadratic approximation is quite good. That’s one possibility to consider. This may cause problems for P > 6/7 or P < 1/7, only the latter has happened in this millenium…but there’s still time.)
Addendum: Bayesian Testing Pitcher Performance
I have written a post giving a more intuitive approach to testing pitcher performance, using Bayesian inference while treating the pitcher’s success or failure like a “coin flip”. Bayesian inference provides a rigorous way to bound the “bias” of the coin (or, for us, the successfulness of the pitcher) based on previous trials.
Case Study: Orioles Pitcher for August 7, 2015
Does this even matter? Well, if we ran the Markov model on the August 7th game for Anaheim Angels vs Baltimore Orioles, then we should expect the score to be Angels 4.5614 vs Orioles 3.7763659 for the Angels lineup (i) David DeJesus, (ii) Kole Calhoun, (iii) Mike Trout, (iv) Albert Pujols, (v) David Murphy, (vi) Erick Aybar, (vii) Conor Gillaspie, (viii) Johnny Giavotella, (ix) Chris Iannetta. But when this happened the score was Angels 8, Orioles 4.
What happened?! Orioles pitcher Gausman was having an off-day. Could we have predicted this? Lets see, we’ll compare the 2014 league average to Gausman’s career stats:
|Kevin Gausman (Career)||214.1||213||22||65||181|
|2014 League Avg||21798.7||20962||2151||7017||18588|
This produces the relevant probabilities:
|Signal||Gausman (Career)||Gausman (2015)||2014 League Average|
|Probability of Hit||23.144627%||22.737405%||22.449264%|
|Probability of Homerun||2.3905247%||3.1208204%||2.3036145%|
|Probability of Walk||7.062914%||6.241641%||7.51486%|
Observe Kevin Gausman’s career statistics are close to the league average, as are his 2015 statistics. (Note: the statistics were obtained on 8 August 2015, at 12:04PM (PST), so they are probably out of date already.)
Problem: Clearly the pitcher’s performance was responsible for the Angels scoring an abnormal number of runs, and the pitcher’s previous average performance does not allow for this. In particular, for the game in question, Gausman faced 17+9+2=28 batters and had a probability of getting a hit about 32.142857%, much higher than the league average. (His other statistics for that game were within 1% or less of the league average.)
Confidence Intervals. Perhaps it could be explained statistically? Well, if we use the Wilson confidence interval, we can determine what would be statistically feasible. The idea is to consider the batter hitting the ball as a “failure” for the pitcher, and the batter struck out as a “success” for the pitcher. Hey! It’s a “coin-flip” (Bernoulli trial).
We can apply the central limit theorem when we have 30 or more data points for the pitcher. We happen to have several hundred, so we’re golden. Then we may approximate the probability of failure as being normally distributed within some range. (For 30 or fewer data points, the Wilson score is a great approximation — and it converges to the binomial score as the number of data points increases.)
Intuitive Picture. The confidence interval tells us, with 95% confidence (or more if desired), the “mean” and the margin of error for the sample size.
When can we use the confidence inteval? The rule of thumb is np > 5 and n(1 - p) > 5. For us, n is about 920, and p is about 23%, we’re golden. (More generally, if 15% < p < 85%, then n > 40 is the heuristic…but any n > 33 works as well for such p.)
How do we use it? Well, we want to suggest that Gausman sucked with statistical significance. How we do this, using confidence intervals, is construct one interval for his performance during the game. Then we construct another based on his career (or his 2015 stats, whichever). If the two intervals do not overlap, then we may say “Gausman did indeed suck with statistical significance specified by the intervals”.
If the two intervals overlap, well, then Gausman may or may not suck with statistical significance…the test tells us little, in this case. Or more precisely, it tells us “Perform more tests!”
What does it say? Well, the wilson confidence interval tells us we should expect the probability a batter will hit the ball (when Gausman pitches based on his 2015 data) 23.1255% ± 5.4438486% with 95% confidence. To be more precise, when we drop a Gaussian about 23.1255%, 95% of the Gaussian lies within 5.44% of 23.1255%.
But the Wilson score for the game has p = 9/28. It produces an interval centered at 34.2972% ± 16.363955%, since the number of trials is so small and our confidence is so “large”. (The reader should confirm that we can perform a confidence interval using the data from the game.)
As odd as it sounds, we cannot “reject the null hypothesis”, i.e., although Gausman did not perform up to par (by our subjective/intuitive standards, it “feels” like he sucked), statistically he did not. I am not happy about this: I want to flip the table over, and blame one person squarely for screwing up my predictions. But that’s emotion. The cold hard facts indicate the Gausman really did just slightly below his usual best.
(Emotionally, there are two ways to respond: (a) reject statistics as baloney, (b) blame the other pitchers for screwing up my predictions! I choose (b).)
Exercise for the Reader. Perform a similar analysis for Brian Matusz and Brad Brach, since they are responsible for 3 runs.
Another Exercise for the Reader. Show my emotional response is invalid, i.e., the other pitchers also did adequately given their past performance history. Then find someone who we can blame for the outcome.
Moral of the Story. The moral for this particular case study: random variations cause serious and observable differences in outcomes. And, in this example, the random variations were not surprising (they were within a standard deviation of past behavior).
Puzzle. In these “unsurprising” scenarios, where “routine” randomness completely changes the outcome, how can we (a) determine who is most probable to win? And (b) communicate that the randomness can completely change the outcome?
One step towards a solution may be to consider using Fuzzy Markov Chains.
Case Study: Minnesota Pitcher for August 8, 2015
We find for Ervin Santana his pitching stats are:
|2014 League Avg||21798.2||20962||2151||7017||18588|
Using Santana’s career stats, we see that Santana has faced (3×1924)+1+1871+605=8249 batters, and has had 22.681538% of batters hit his pitches. The corresponding confidence interval is 22.694254% ± 0.9035812%, for his entire career.
Confidence Interval for the Game. We see, for this game, Santana faced 17 batters, of which 10 successfully hit the ball. So we have a confidence interval 57.197193% ± 21.19175% (using a 95% confidence). But look: the lower end point for this interval is 36% (rounding down), whereas the upper interval for Santana’s entire career is 24% (rounding up). Hence we may state with 95% confidence, Santana sucked at pitching with statistically significance in this game compared to what his career suggests.
Compared to this year’s performance. We can likewise construct the confidence interval for Santana’s performance this year. From simple arithmetic, we find it 24.31708% ± 6.0958176%. This is puzzling: the upper end point is 31% (rounding up), which again implies with 95% confidence, Santana sucked at pitching with statistical significance in this game compared to his career this year.
What should we have expected? Well, if Santana were well rested (etc.), we should have expected about a quarter as many runs and no home runs. That is to say, we should have expected 2.1067379 runs when facing 18 batters…which would have put us somewhere in the 5th inning.
Why is this happening? Well, Santana used performance enhancing drugs and was suspended for 80 games, and since he has returned his performance has been statistically significantly worse (as we have seen). There appears to be some correlation, unsurprisingly negative, between these two events.
We cannot say one caused the other. It could be lurking factors, like Santa wasn’t practicing during the 80-game period, or his wife has cancer and he’s distracted & worried, or he was in a car accident, etc. Experts have some plausible explanations.
What could we do? Even if we factored in the pitcher’s behaviour to our Markov model, we still could not have predicted the 17-4 outcome. Why? Because we have 2014 data for the pitchers, but it appears (if we learn one thing from Santana’s contribution to the outcome) we must use the most recent data for the starting pitcher.
We could consider using a weighted or exponential moving average, weighing the more recent performance more than historical performance.
- James Buckley and Esfandiar Eslami, Fuzzy Markov Chains: Uncertain Probabilities.
- Adam Sugano, A Player Based Approach to Baseball Simulation, Ph.D. Thesis.
- Frank Firke, Uncertainty and Pitching Statistics
- Pizza Cutter, On the reliability of pitching stats
- Graham MacAree(?), Sample Size
- Sky Kalkman, Optimizing Your Lineup By The Book