Next up in the wonderful world of expectation: expected shooting percentage. The reader may be wondering how this is different from an expected goal model and, in essence, it isn’t. The reason I’m calling this expected shooting percentage is because I intend to use it to estimate shooter talent. Shooter talent will then be incorporated into the final model. This follows a similar train of thought to Harry Shomer’s model from a few years back.
There are some sections where I try to explain the techniques being used. I avoid exact details, instead hoping to give some intuition to the uninitiated as to what’s going on. Since this is hockey analysis, I felt it appropriate to label these as “Explanation Attempts”. For those already acquainted with the concepts (or simply uninterested), these can be skipped.
For a primer on the subject, EvolvingWild had a good recap in their article from back when. There have been plenty of other models added to the public realm in recent years – it’s starting to feel like a rite of passage – but that recap should still prove sufficient. For some more recent models, the reader can also check out Alex Novet’s models incorporating pre-shot movement and/or Patrick Bacon’s recent work.
What You Looking At?
Let’s start with the inputs to the model. We will only be considering unblocked shot attempts; as much as we’d all enjoy including blocks, the NHL records these at the location of the block, leaving us without coordinates for the shot itself. So we make do without.
From these unblocked shot attempts, the model considers the following:
- Distance to the center of the net
- Shots from below the goal line are all given the same value, as are shots from outside the zone. The thinking is that the success of these shots has little to do with the exact distance.
- Rink bias (as discussed in my previous post)
- Absolute angle relative to the center of the net
- Shots from below the goal line are, again, all given the same value, as are shots from outside the zone.
- Shooter’s offwing or strong side
- Shot from goalie’s glove hand or blocker side
- Shot type (wrist shot, slap shot, etc)
- Shooter’s primary position in given game (F or D)
- Seconds elapsed in period
- Score state (-3 to 3, relative to the shooting team)
- Strength state (5×5, 5×4, etc relative to the shooting team)
- Empty net shot
- Shooting team’s goalie is pulled (for the occasions when a team pulls their goalie while killing a penalty and the strength state is, therefore, 5×5)
- Playoff or regular season game
- Home or away team shooting
- Changes to goalie equipment
Credit to Patrick Bacon for the idea here. The NHL reduced the size of equipment in recent years which has led to a modest increase in total goals. These reflect those changes.
- Shrunk pants
- Shrunk pads
Rebounds are defined as any shot within 4 seconds of the previous. This was based on the following chart:
Shots that aren’t rebounds are all given the same values.
- Seconds since previous shot
- Distance from previous shot
- Absolute angle change relative to previous shot
- Shot event of previous shot (shot on net, miss, block)
Misses and blocked shots are included here for the same reason Jim Corsi started tracking shot attempts as a measure of a goalie’s workload: just because a shot didn’t get to the net doesn’t mean the goalie didn’t have to react to it, and any such attempt could just as easily leave the goalie out of position for the rebound. Including blocks, however, does come with the disadvantage that the distance and angle values are inconsistent between the event types. I’m hoping the difference in distance between the block and the shot is insignificant enough (and rebounds off of blocks rare enough) for this to be a non-factor, but I’ll allow that it could be poor form.
- Type of previous shot (wrist shot, slap shot, etc)
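The rebound features above can be derived from the shot stream itself. A minimal sketch, assuming a pandas DataFrame of unblocked attempts sorted by game and time (the column names `game_id`, `seconds`, `x`, `y`, `event`, and `shot_type` are illustrative, not the NHL feed’s):

```python
# Sketch of the rebound features: previous-shot context for any attempt
# within 4 seconds of the last one; everything else gets neutral values.
# Angle change is omitted here for brevity.
import numpy as np
import pandas as pd

def add_rebound_features(shots: pd.DataFrame) -> pd.DataFrame:
    shots = shots.sort_values(["game_id", "seconds"]).copy()
    grp = shots.groupby("game_id")
    shots["secs_since_prev"] = shots["seconds"] - grp["seconds"].shift()
    shots["dist_from_prev"] = np.hypot(
        shots["x"] - grp["x"].shift(), shots["y"] - grp["y"].shift()
    )
    shots["prev_event"] = grp["event"].shift()         # shot on net, miss, block
    shots["prev_shot_type"] = grp["shot_type"].shift()
    # rebounds: any shot within 4 seconds of the previous attempt
    is_rebound = shots["secs_since_prev"] <= 4
    # non-rebounds (and each game's first shot) all share the same values
    for col in ["secs_since_prev", "dist_from_prev", "prev_event", "prev_shot_type"]:
        shots.loc[~is_rebound, col] = None
    return shots
```

Grouping by game before shifting keeps the first shot of each game from inheriting context from the previous game.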
The hope here is to add some more context to the shot – a shot immediately after an offensive zone faceoff win is likely to be a completely different shot than one taken seconds after a defensive zone win.
- Seconds since most recent faceoff
- Whether the shooting team won or lost the faceoff
- Zone the faceoff was in relative to the shooting team
- Most recent event from outside the zone
This is similar to including a “rush” variable. Instead of giving a binary yes/no, we’re letting the model decide how to piece these together.
- Seconds since event
- Type of event (giveaway, hit, etc)
- Zone of event (neutral zone or defensive zone)
As mentioned in the rink bias post, a gradient boosting machine (GBM) will be doing the work for us in figuring out how best to combine all these variables.
Explanation Attempt: Boosting
The basic idea with boosting is to combine a bunch of weak learners into a single stronger learner.
- Find a simple model to make predictions on the data. Decision trees are commonly used (if the shot is less than 10 feet from the net, follow this path to find the likelihood of a goal, otherwise follow this other path). This model doesn’t need to be all that great at making predictions on its own, it just needs to provide some usefulness.
This is like asking somebody, let’s say Joe, how likely he thinks it is that a shot will go in. Joe appreciates that long distance shots are low percentage plays, but he also thinks that rebounds are guaranteed goals. Listening to Joe is better than guessing at random, but it leaves something to be desired.
- Train another model having it focus more on (weight more heavily) the shots the previous model whiffed on. We want this model to do well (or at least better) in the areas the current model is erring.
Now that we know what Joe thinks, let’s ask Jane. Jane recognizes that mindlessly slapping at a goalie’s pads is no guarantee of success. On the other hand, she also thinks the only way a team is scoring while shorthanded is if there’s an empty net.
- Combine the models to make predictions, each model getting its say on the chances of a shot being a goal.
We give all our esteemed colleagues a vote and follow the wisdom of the crowds.
- Return to step 2).
Sally studies shorthanded situations exclusively. Let’s ask her what she thinks…
None of the models has to be all that effective on its own (hence “weak learners”), but they continually prop up the others in areas where they struggle (like characters in a bad TV drama).
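A toy version of that loop, using AdaBoost-style reweighting with depth-1 decision trees. (The author’s model is a gradient-boosted variant of the same idea; this sketch just shows the reweighting, with each stump playing the part of Joe or Jane.)

```python
# Boosting in miniature: each new stump focuses on the shots the
# ensemble so far got wrong, then every stump gets a weighted vote.
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def boost(X, y, n_rounds=10):
    n = len(y)
    w = np.full(n, 1.0 / n)                 # start with equal weight on every shot
    stumps, alphas = [], []
    for _ in range(n_rounds):
        stump = DecisionTreeClassifier(max_depth=1)   # a weak learner
        stump.fit(X, y, sample_weight=w)
        pred = stump.predict(X)
        err = np.clip(w[pred != y].sum(), 1e-10, 1 - 1e-10)
        alpha = 0.5 * np.log((1 - err) / err)         # this stump's "say"
        w *= np.exp(alpha * np.where(pred != y, 1, -1))  # upweight the misses
        w /= w.sum()
        stumps.append(stump)
        alphas.append(alpha)
    return stumps, alphas

def predict(stumps, alphas, X):
    # weighted vote of all the weak learners (classes encoded as -1/+1)
    votes = sum(a * np.where(s.predict(X) == 1, 1, -1)
                for s, a in zip(stumps, alphas))
    return (votes > 0).astype(int)
```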
Explanation Attempt: Gradients
If you’ve done any calculus (did I just cause people to run away?), you’ll be familiar with derivatives. If not, you probably learned about the slope of a line in high school math (rise over run). The steeper the line, the larger the slope. A derivative is a way of calculating the slope of a function (any function, not just straight lines) at any given point.
Basically, we’re looking up and down a hill and the derivative tells us where it’s steepest.
So what’s a gradient? It’s essentially a multi-dimensional derivative. Seeing as we’re not just predicting shots based on a single variable, we need to know what the slope is with respect to each of our input variables.
We’re still looking down a hill, but now we can also look left and right and all around to find the steepest slope.
In step two of the boosting explanation, I mentioned that the model will focus on shots it has previously struggled with. These shots are identified using gradients.
If we want to get to the bottom of the hill (and we do), the fastest way is via the steepest slopes.
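As a tiny worked example of walking down that hill (nothing to do with hockey data, just the mechanics):

```python
# Gradient descent on a simple 2-D "hill": repeatedly step against the
# gradient, i.e. down the steepest slope, until we reach the bottom.
import numpy as np

def f(p):               # f(x, y) = x^2 + 3y^2, minimum at the origin
    x, y = p
    return x**2 + 3 * y**2

def grad(p):            # the gradient: the slope in each direction
    x, y = p
    return np.array([2 * x, 6 * y])

p = np.array([4.0, -2.0])      # start partway up the hill
for _ in range(200):
    p -= 0.1 * grad(p)         # step downhill

# p ends up essentially at the minimum (0, 0)
```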
LightGBM was my booster juice of choice. The most common choice among expected goal models (and really anything featuring tabular data) has been XGBoost (it’s even well named for the purpose). After a bit of research, it seems LightGBM offers the following advantages:
- It’s faster (I’m impatient)
- It’s more memory efficient (I know mine’s waning)
- It allows for easier handling of categorical features (I’m the only dummy I need)
All that with (supposedly) comparable results to XGBoost made it the winning choice.
The most common metric for evaluating expected goal models has been area under the receiver operating characteristic curve (AUC or AUROC). AUC compares the rate of true positives (goals correctly predicted to be goals) to the rate of false positives (saves and misses predicted to be goals). The more true positives / fewer false positives, the better.
For this model, we’ll be using Matthews Correlation Coefficient (MCC). It’s also called the phi coefficient, but Matthews seems more appropriate for present purposes. The formula is:

MCC = (TP × TN − FP × FN) / √((TP + FP)(TP + FN)(TN + FP)(TN + FN))

where TP, TN, FP, and FN are the counts of true positives, true negatives, false positives, and false negatives.
Goal data is inherently imbalanced – that’s to say, there are a lot more shots that don’t go in than shots that do. With AUC, this makes keeping the false positive rate low much easier (it doesn’t take much to figure out that most shots don’t go in). MCC is a more balanced measure that isn’t swayed by the class imbalance. As a result, MCC typically outperforms AUC as an evaluation metric on such imbalanced data. And, when I tried it, it gave me the better model. So I went with it.
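LightGBM has no built-in MCC, so a custom eval function of the sort described later might look like this. (A sketch of the approach, not the author’s code; the 0.5 threshold for calling a prediction a goal is my assumption.)

```python
# A custom eval in the (preds, Dataset) -> (name, value, is_higher_better)
# shape LightGBM expects. For the built-in binary objective, the predictions
# arrive as probabilities.
import numpy as np
from sklearn.metrics import matthews_corrcoef

def mcc_eval(y_pred, dataset):
    y_true = dataset.get_label()
    y_hat = (np.asarray(y_pred) >= 0.5).astype(int)   # assumed threshold
    return "mcc", matthews_corrcoef(y_true, y_hat), True
```

This would be passed to training via the `feval` argument of `lgb.train` (or `eval_metric` on the sklearn wrapper).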
Looking for Holdouts
Explanation Attempt: Validation and Test Sets
One of the drawbacks of GBMs is that they’re prone to overfitting. An overfit model will do a tremendous job of predicting the results of shots it has seen, but give it a shot from the upcoming season and there’s no guarantee it will fare nearly so well. This is why validation and test sets are withheld from the model. To evaluate the performance of the model, we look at how it does not on the data it was trained on but on the validation set we didn’t let it see. When we’re done, we compare this to its predictions for the test set (which the model has also never seen). If the model continues its admirable performance on the test set, we can be more confident that its usefulness extends beyond the shots it knows.
Explanation Attempt: Holdout vs K-Fold Cross Validation
Let’s go with an analogy for this. Let’s say we’re coaching a group of youth hockey players and we want to use them to advance our own purposes instead of focusing on what’s best for the kids (as is the case with far too many coaches … did that get a little dark?). To that end, we want to see which move in a shootout is most likely to get someone mistaken for TJ Oshie.
One way we could do this is to strap the pads on some poor kid and have all the others try out the different approaches on her/him. Whichever move works best for the kids, that’s our new go-to. This would be akin to holdout validation. It’s entirely possible we’ve just found a high end shootout tactic. It’s also entirely possible we’ve found a tactic that this particular goaltender happens to be susceptible to.
What we could do instead is have every kid take a turn in net. That way, when we look at the results of all the shots, we can see what works not only against one goalie but against an array of goalies. This is akin to k-fold validation. We don’t just measure our model on a single validation set, we give all the data a turn as part of the validation set.
K-Fold is the preferred validation method, but holdout was used for the model being unveiled today. The reason for this is, quite simply, time. If we were using k-fold, the model might still be training. This is mainly due to the choice of MCC as an evaluation metric – LightGBM doesn’t have a built-in method for it, so I had to write the function myself. That slowed things down considerably. The options were to use k-fold with no real hyperparameter tuning or holdout with tuning. I opted for the latter.
With over 1.4 million shots to work with, using holdout shouldn’t prove problematic. Approximately one season’s worth of shots were withheld for each of the validation and test sets.
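One plausible way to carve out those holdouts is by season, with everything earlier used for training. (The `season` column and the chronological split are my assumptions, not the author’s stated recipe.)

```python
# Sketch of a season-based holdout: roughly one season's worth of shots
# each for validation and test.
import pandas as pd

def holdout_split(shots: pd.DataFrame, valid_season: int, test_season: int):
    train = shots[shots["season"] < valid_season]
    valid = shots[shots["season"] == valid_season]
    test = shots[shots["season"] == test_season]
    return train, valid, test
```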
OK, enough straining at analogies. Time for what you’ve been scrolling for.
With MCC, scores range from -1 to 1 with random guesses receiving a score of 0 and perfect predictions achieving a score of 1. The model achieved the following:
- Training set: 0.298
- Validation set: 0.261
- Test set: 0.254
Looks like we’re doing fine as far as avoiding overfitting, but it’s hard to benchmark this with no other models using MCC as a metric.
So let’s calculate AUC on our test set for some comparisons. With AUC, scores range from 0 to 1, with random guesses receiving a score of 0.5 and perfect predictions achieving a score of 1. Probably the best benchmark for our purposes is courtesy of Harry Shomer’s model. The Younggren twins have three separate models, making comparison somewhat more difficult, but we’ll compare to their even strength numbers.
The Bucketless: 0.785
Seems we’re on the right track.
All the coming numbers will be score and venue adjusted. I chose to calculate score and venue adjustments on the Bucketless numbers based on the results of the prior three seasons. There isn’t any special weighting between the years; they’re all equal. For seasons with fewer than three priors available, we’ll use what we have. Because 2007 didn’t have previous data available, it’s excluded from the results. I chose to do it this way to avoid any retro-fitting and to allow for easy updates for future seasons.
Also, we’re only looking at 5-on-5 regular season data from seasons with the full complement of 82 games. Notably, this excludes the lockout shortened 2012/13 season as well as the COVID shortened 2019/20 season. There are also a couple games with missing play-by-play data; teams in those games will have their numbers from those seasons excluded.
First, we’ll split teams’ seasons into their first and second halves (games 1 through 41 vs games 42 through 82). We’re going to look at the r² value between the various models’ numbers in the first half and actual goal differentials (which aren’t score and venue adjusted) in the second half.
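In miniature, that check is just a squared correlation. (The parallel-list inputs here are illustrative, one entry per team-season.)

```python
# Split-half check: square the correlation between each team's first-half
# model number and its second-half actual goal differential.
import numpy as np

def split_half_r2(first_half_model, second_half_goal_diff):
    r = np.corrcoef(first_half_model, second_half_goal_diff)[0, 1]
    return r ** 2
```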
I’ll skip the first set of numbers from draglikepull’s article seeing as they include 2007 and mine won’t. Looking at 2009/10 to 2018/19:
| Natural Stat Trick | 0.17 | 0.21 |
Second, we’ll split teams’ seasons into odd and even numbered games. So games 1, 3, …, 81 are in one set and games 2, 4, …, 82 in the other. A small issue here is that I couldn’t replicate the numbers draglikepull presented. I only looked at MoneyPuck’s data, but what I found doesn’t match up. I’m not sure where the problem lies, but I’m pretty confident my numbers are accurate so I’m going with those (which, unfortunately, means excluding NST and EH).
One Final Test
Taking things one step further, let’s do a comparison similar to what Dawson Sprigings and Asmae Toumi looked at with their model.
For this, we’ll take the following steps:
- Select a random sample of games for each team from each season. These games go in Group A, the rest in Group B.
- Calculate the correlation coefficient (r) between the models’ numbers from Group A and actual goals from Group B.
- Use the Fisher-Z transformation to convert these into a z-score.
- Repeat steps 1) through 3) 1000 times.
- Average the z-scores.
- Convert the average into r².
- Repeat using every 5 game interval from 11 to 71 games as the number of games in Group A.
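The steps above can be sketched in part. The game sampling (steps 1 and 2) needs the game-level data, so it’s abstracted here to "one r per repeat"; the Fisher-Z averaging (steps 3, 5, and 6) is the part shown concretely:

```python
# Averaging correlations via the Fisher-Z transformation: transform each r
# to z, average the z-scores, transform back, and report as r^2.
import numpy as np

def fisher_average_r2(r_values):
    z = np.arctanh(np.asarray(r_values, dtype=float))  # step 3: r -> z
    r_bar = np.tanh(z.mean())                          # steps 5-6: average, back to r
    return r_bar ** 2                                  # report as r^2
```

Averaging in z-space rather than averaging the r values directly is the point of the transformation: z is approximately normally distributed, so the mean is better behaved.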
Doing so yields the following:
A couple points should be noted:
- Dawson and Asmae’s model includes shooter talent; we’re not there yet.
- If you’re comparing this to the results in their post, it bears mentioning that I’m not aware of any models living up to the numbers released there.
All in all, it seems we’re doing all right, but there’s still plenty of room for improvement. I’ve got an idea as to how I might improve this expected shooting percentage model (the ideas always come after I think I’ve got things set), but we’ll table that for future iterations and continue on with things as they are.