Player Tracking: A New Cartographer

We’re back at it with a new homography estimation model.  If you haven’t checked out my previous work on the subject, it could be worth reviewing.

With the previous models, the main issue was speed.  They weren’t necessarily slow as far as these things go, but they were too slow for me to want to run with any regularity.  So the goal since the last post has been to find something faster.  I also felt that simplicity would be nice if we could manage it.  I may or may not have spent far too much time making exclusively linear models that didn’t require any of the labels I hand-tracked.  The idea of saying “Computer vision problem solved with linear regression” was too funny for me to not at least give it a try.  I also may be crazy enough to think about trying it again at some point.  You have fun your way; I’ll torment myself my way.

Anywho, today’s work was largely inspired by a couple papers.  When I was looking at ways to attack this as an unsupervised problem, I stumbled across SimCLR.  The end result doesn’t resemble SimCLR so much, but that was the paper that sent me down this path, so let’s give it a mention.

What the end result does resemble is Sports Calibration via Synthetic Data.  We’re working with the same idea, just with a little different execution.  This project was a little more like SimCLR before I realized someone else had previously done the same thing I was trying.  When I saw that, I figured we may as well conform to what’s worked before.

Here’s a quick overview of what we’re looking at today.  We’ve got two models that have the same basic architecture running back-to-back:

  • The semantic segmentation model takes the frames from our video and predicts which part of the rink each pixel belongs to.
  • The embedding model takes these predictions and compresses them down into a smaller latent space (i.e., it finds a way to represent the images with far fewer numbers).
  • This embedding is then compared to the embeddings returned on a large number of synthetic rinks.
  • Our final prediction is whichever homography matrix was used to create the synthetic rink with the most similar embedding.

It’s a lot like the Color-Coded Hockeye model.  The key difference is that we’re no longer comparing pixel colors.  Instead, we’re passing the rinks through another model to create embeddings for storage in our dictionaries.

Image Prep

Let’s start with what the model will be seeing.  Our previous models tended to have trouble with the near side of the ice.  The boards and the player benches obstruct the view of the near ice, making it difficult for our model to figure out what was lurking behind them.  Rather than allowing this area to sully our model’s hard work, the idea finally came to me to cut off the bottom of the image and ignore that section of the rink completely.  It took me months to reach this incredible revelation, but I got there all on my own.  I accept and appreciate your heartfelt praise.

Specifically, we’re going to get rid of the bottom third of the image, only showing our model the top two thirds.  There’s no real reason for this choice of splits.  It’s something I tried and it looked decent, so we’re running with it.

After cropping down our image, we’ll resize it to 256×128 pixels.  This is what’ll get fed into the segmentation model.
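For the sake of concreteness, here’s a minimal sketch of that prep step, assuming OpenCV and that 256×128 means 256 wide by 128 tall (that ordering is an assumption, not gospel):

```python
import cv2

def prep_frame(frame):
    # Drop the bottom third of the frame, keeping only the top two thirds.
    h = frame.shape[0]
    cropped = frame[: (2 * h) // 3]
    # Resize what's left for the segmentation model; cv2.resize takes (width, height).
    return cv2.resize(cropped, (256, 128))
```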

Model Architecture

Image made with NN-SVG.
https://alexlenail.me/NN-SVG/

Transformers are quickly becoming ubiquitous in deep learning.  For image-based tasks, vision transformers (ViT) work by breaking the image into distinct patches.

The patches are then encoded and fed to a transformer block.

After the success of ViT, some other smart folks came up with the ConvMixer model, as detailed in the paper Patches Are All You Need.  Their idea was that the success of ViT may not be directly attributable to the transformers, but rather to the patch representation of the image.  Typically, a convolutional neural network will pass a kernel across overlapping sections of the image.

A 3×3 kernel with a stride length of 1 and no padding.

The ConvMixer architecture instead starts with a patch layer.  The stride length is set to equal the size of the kernel so that there’s no overlap.

A 3×3 kernel with a stride length of 3.
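In code, the difference between those two figures comes down to a single argument.  This is just an illustration in Keras (my framework of choice for the sketches in this post), not anything lifted from ConvMixer itself:

```python
from tensorflow.keras import layers

# Standard convolution: the kernel slides over overlapping windows.
regular_conv = layers.Conv2D(filters=128, kernel_size=3, strides=1)

# "Patch" layer: the stride equals the kernel size, so the windows don't overlap.
patch_conv = layers.Conv2D(filters=128, kernel_size=3, strides=3)
```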

Based on the success of the relatively simple ConvMixer architecture, it seems like patches may indeed be the way to go.

With these models serving as inspiration, we’re also going to break things up into patches.  However, our motivation is a little different.  For us, it isn’t that doing so will necessarily lead to the best model (though we certainly hope it helps) but because it’s an easy way to keep the model relatively compact.  We’re going with three patch layers back-to-back, each using 128 filters of 4×4 kernels.  This helps us drop the number of parameters the model requires in short order.

Following each patch layer is a dropout layer which ignores 20% of the units in training.  After we’re through all the patches and dropouts, we flatten what’s left and hook it up to a fully-connected layer of 128 units.  This is the basic architecture that both our models will employ.  The outputs of the models require different shapes, so we’ll add on to the end of this to get to the actual predictions.
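In rough Keras terms, the shared backbone looks something like this.  The input shape, channel count, and activations here are placeholders (the segmentation model sees cropped video frames while the embedding model sees segmentation outputs, so the channel count differs between the two):

```python
from tensorflow.keras import layers, models

def build_backbone(input_shape=(128, 256, 3)):
    model = models.Sequential()
    model.add(layers.Input(shape=input_shape))
    for _ in range(3):
        # Patch layer: 128 filters of 4x4 kernels, stride 4 so the patches don't overlap.
        model.add(layers.Conv2D(128, kernel_size=4, strides=4, activation="relu"))
        # Ignore 20% of the units in training.
        model.add(layers.Dropout(0.2))
    model.add(layers.Flatten())
    model.add(layers.Dense(128, activation="relu"))
    return model
```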

Before we carry on, let me go out of my way to stress that there is absolutely nothing special about this architecture that allows for the models to work.  This is what we’re using, but no one should come away from this thinking that it’s the “right” architecture to use.  It’s something I thought up that seemed simple and fast that provided results that weren’t terrible.  It shouldn’t be too difficult to design something that provides better predictions, if that’s your goal.

Segmentation Model

On to what we’re predicting.

The image above should remind you a lot of the one from Color-Coded Hockeye, though with some different colors.  A notable difference is that the outline of the rink and the lines on the ice aren’t colored in.  With this model, we’re predicting which of five classes a pixel belongs to:

  • Left ozone
  • Neutral zone
  • Right ozone
  • Faceoff circle
  • Outside the rink

I originally tried this with an even simpler model that only tried to separate the neutral zone from the ozones, but the results from the embedding model weren’t all that could be hoped for.  Giving the ozones different colors and filling in the faceoff circles seemed like the easiest additions to help the model differentiate between similar shapes.

In training, our segmentation model takes an image and randomly applies a few augmentations to it.  This is done to try to prevent the model from overfitting to the color of any specific pixel in any given image.  We’re toying around with the image’s brightness, contrast, saturation, and hue.  After that, we’re adding some Gaussian noise and taking a randomly sized crop.
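If you’d like to see that spelled out, here’s roughly what it looks like with TensorFlow’s image ops.  The specific ranges, noise level, and crop size below are placeholders rather than the values actually used:

```python
import tensorflow as tf

def augment(image):
    # Color jitter so the model can't lean on the exact color of any pixel.
    image = tf.image.random_brightness(image, max_delta=0.2)
    image = tf.image.random_contrast(image, 0.8, 1.2)
    image = tf.image.random_saturation(image, 0.8, 1.2)
    image = tf.image.random_hue(image, max_delta=0.05)
    # Gaussian noise.
    image = image + tf.random.normal(tf.shape(image), stddev=0.02)
    # Random crop (fixed size here; the real version also randomizes the size),
    # then resize back to the model's input dimensions.
    image = tf.image.random_crop(image, size=(96, 192, 3))
    return tf.image.resize(image, (128, 256))
```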

The prediction for this model is a fully-connected layer of 163,840 units, which gets reshaped to 256×128×5.  These correspond to the pixels of the image and the five classes they could belong to.  Recall that the layer before we get to the prediction has 128 units.  That we can go directly from that to the segmentation of the full image seems silly to me.  Throwing in some intermediary layers would probably be an easy place to find some improvements, but the end result this way doesn’t seem so bad.

Our loss is categorical cross-entropy.  I’d originally intended to make this as another GAN, but continuing our appreciation for simplicity, I tried it without a discriminator.  It seems to be doing the job, so we won’t fix what ain’t broke.
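Put together with the backbone sketch from earlier, the segmentation end looks roughly like this (the softmax placement, optimizer, and height/width ordering are all assumptions):

```python
from tensorflow.keras import layers, models

def build_segmentation_model(backbone):
    # `backbone` is the patch/dropout/dense-128 sketch from above.
    x = layers.Dense(256 * 128 * 5)(backbone.output)   # 163,840 units
    x = layers.Reshape((128, 256, 5))(x)               # one 5-way prediction per pixel
    out = layers.Softmax(axis=-1)(x)
    model = models.Model(backbone.input, out)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model
```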

Embedding Model

Our second model is a shared encoder.  These are commonly referred to as Siamese networks, but let’s not do that.  I only mention it here in case someone goes looking for more information.  Unfortunately, that will be the more useful search term.

A shared encoder model takes three images:  a reference image (anchor), a similar image (positive), and a dissimilar image (negative).  It passes these through the same model (hence “shared” encoder) to find a smaller representation of the image (its embedding).  The model’s goal is to close the distance between the embeddings for the anchor and positive images while pushing away the embedding for the negative image.  It isn’t important what numbers actually make up any given embedding; it’s only important that similar images receive similar embeddings.

Our model is trained entirely on synthetic data.  We start with a random homography matrix from our hand-tracked labels and apply random pan/tilt/zoom combinations to it.  We use similar views to provide us with the warped rinks of the anchor and positive images.  The negative image is created by taking the positive image and applying a larger pan, tilt, or zoom to it.  The hope is that the model will recognize that a small pan, tilt, or zoom should result in a smaller change in the embedding than a large one.

The prediction layer for this model is a fully-connected layer of 16 units (the embedding) which gets some L2-normalization applied to it.
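In the same rough Keras terms, that head is only a couple of lines (the class output described a little further down would get added alongside it):

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_embedding_model(backbone):
    # `backbone` is the same patch/dropout/dense-128 sketch as before.
    x = layers.Dense(16)(backbone.output)                               # the embedding
    out = layers.Lambda(lambda v: tf.math.l2_normalize(v, axis=-1))(x)  # L2-normalization
    return models.Model(backbone.input, out)
```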

Our loss for this model is called triplet loss.

max(||f(A) - f(P)||^2 - ||f(A) - f(N)||^2 + \alpha, 0)

Seeing as people generally aren’t partial to math formulas, let’s break it down a little.

  • f(A), f(P), f(N) are the embeddings for the anchor, positive, and negative images respectively.
  • ||f(A) - f(P)||^2 is the squared distance between the embedding for the anchor image and the embedding for the positive image.  We want this to be small.
  • ||f(A) - f(N)||^2 is the squared distance between the embedding for the anchor image and the embedding for the negative image.  We want this to be larger than the anchor-to-positive distance.
  • \alpha is the margin.  When you see a Greek letter in a formula, it typically means it’s a parameter you get to choose/tune.  In our case, the margin will be 1.
  • Taking the max of the loss and 0 effectively clamps the loss, no longer encouraging movement when the differences in the embeddings have reached a large enough size.

The idea with this loss function is to encourage the positive embedding to be closer to the anchor embedding than the negative embedding is by at least the amount of the margin.
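Written as code, the loss is about as direct a translation as it reads; the margin of 1 is the only number carried over from above:

```python
import tensorflow as tf

def triplet_loss(anchor, positive, negative, margin=1.0):
    pos_dist = tf.reduce_sum(tf.square(anchor - positive), axis=-1)  # ||f(A) - f(P)||^2
    neg_dist = tf.reduce_sum(tf.square(anchor - negative), axis=-1)  # ||f(A) - f(N)||^2
    return tf.maximum(pos_dist - neg_dist + margin, 0.0)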

When we’re done training, camera views of similar parts of the ice should have similar embeddings, while views of opposite ends of the rink should have embeddings that share little in common.

Getting Centered

We’re not quite through with the embedding model.  With the Color-Coded Hockeye model, the means of speeding things up came from splitting the dictionary of pre-computed rinks into several dictionaries based on what part of the rink was prominently featured.  When we got a prediction, we could figure out which dictionary it belonged to and only needed to look up the best match from there.  This significantly decreased the number of comparisons we were making, but it came at the expense of some post-processing to count up pixel colors.

With this model, rather than wait until after the model runs to split things up, we’re going to let the model do it for us.

To that end, in addition to the 16-unit embedding, the embedding model will give us another vector of 16 numbers.  This one is a prediction of which class the image belongs to.  The classes are based on which area of the ice the center pixel of the image (pre-crop) is closest to.

Despite having similar embeddings, the anchor and positive images fall into different classes.
Keep in mind that we’ve cropped these images, so the center pixel actually appears in the bottom half of the image.

When we generate our dictionary with synthetic rinks, we can split it up into the 16 separate dictionaries based on where the center pixel falls on the rink.

With these class predictions included, the full loss function for the embedding model is a combination of the triplet loss function on the embeddings and categorical cross-entropy on the class labels.
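Roughly, the combination looks like this.  The equal weighting of the two terms is an assumption on my part; class_pred is the second 16-unit softmax output of the embedding model and class_true is the one-hot center-pixel class:

```python
import tensorflow as tf

def combined_loss(anchor_emb, pos_emb, neg_emb, class_true, class_pred):
    trip = triplet_loss(anchor_emb, pos_emb, neg_emb)                      # on the embeddings
    ce = tf.keras.losses.categorical_crossentropy(class_true, class_pred)  # on the class labels
    return tf.reduce_mean(trip) + tf.reduce_mean(ce)
```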

After doing it this way, I had another idea of how to split things up.  My thought was that we could use k-means on the embeddings and assign rinks to a dictionary based on which cluster they belonged to.  This works, but in my tests, it’s actually a little slower than the class predictions.  The difference comes down to whether or not it’s faster to get an extra prediction from the embedding model or to calculate which cluster the embedding belongs to.  Because the class prediction from the embedding model uses all the same layers as the prediction for the actual embedding, the time spent on getting another output of 16 numbers is negligible.  Adding the extra step to find the embedding’s cluster seems to slow things down enough that we don’t make up the time afterward, even when increasing the number of clusters (thereby decreasing the size of each dictionary).  The highest I’ve tried is 1024 clusters, so maybe increasing that could give k-means the edge.  In any case, it could be something worth returning to.
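For anyone who wants to try tipping the scales toward k-means, it’s only a few lines with scikit-learn (the names here are illustrative):

```python
from sklearn.cluster import KMeans

def build_clusters(synthetic_embeddings, n_clusters=1024):
    # Cluster the pre-computed embeddings once, up front.
    return KMeans(n_clusters=n_clusters).fit(synthetic_embeddings)

def dictionary_for(frame_embedding, kmeans):
    # Route a frame's embedding to the dictionary of its nearest cluster.
    return int(kmeans.predict(frame_embedding.reshape(1, -1))[0])
```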

One final note on the center pixel classes.  Spreading out the coordinates equally really isn’t the best way to do this.  What I should’ve done is create the full dictionary of homography matrices and use that to determine where to place the coordinates.  Clearly, that isn’t what I did.  Yet another aspect to play around with.

Webster’s Defines an Embedding…

The only thing left is to build the actual dictionaries.  We’re doing this the same way we always have:  we take the hand-tracked labels and apply random pan/tilt/zoom combinations to them to get a larger selection to compare against.  We then warp the image of the segmented rink by each of the matrices.  The only change from past models is that these then get passed to the embedding model to construct our rather large set of 16-number representations.
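A sketch of that build, with illustrative names:  rink_template is the segmented overhead rink, matrices is the jittered set of homographies, and center_class is a hypothetical helper that maps the (pre-crop) frame center onto the rink and returns the nearest of the 16 anchor coordinates.  I’m also assuming the synthetic views get the same crop-and-resize treatment as the video frames:

```python
import cv2
import numpy as np

def build_dictionaries(rink_template, matrices, embedding_model, center_class,
                       frame_size=(1280, 720)):
    dictionaries = {c: {"embeddings": [], "matrices": []} for c in range(16)}
    for H in matrices:
        warped = cv2.warpPerspective(rink_template, H, frame_size)  # synthetic camera view
        x = prep_frame(warped)                                      # crop + resize from Image Prep
        emb = embedding_model.predict(x[np.newaxis], verbose=0)[0]  # 16-number embedding
        c = center_class(H)                                         # which of the 16 dictionaries
        dictionaries[c]["embeddings"].append(emb)
        dictionaries[c]["matrices"].append(H)
    return dictionaries
```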

Currently, we’re working with about 190,000 possible rinks.  Though, there is some redundancy there.  Someone (let’s not point fingers) got lazy and didn’t want to ensure all the rinks were sensible.  So we can/should pare that down some.  These 190,000 rinks are split up into 16 dictionaries based on the center pixel.  When all is said and done, we’ve got a couple unfortunately small dictionaries containing about 1,000 to 2,000 embeddings, but most have around 10,000 to 20,000.

Running linear discriminant analysis on the embeddings, we can see there is a lot of overlap, with similar embeddings commonly falling into different classes.

The colors represent the dictionaries; the shapes are the space they occupy.

When we get a prediction from the embedding model, it could be that the best possible rink ends up in a separate dictionary than the one we look through.  But as long as we’re in the right general area, we should be able to find something reasonable to use as our prediction.

From the Top

With all our models trained, let’s review how everything is assembled.  A code sketch of the full loop follows the list.

  1. We generate a whole lot of synthetic rinks to pass to the embedding model.  These get us our 16 dictionaries to compare against.  Once we have them, we can save them and never have to run this again.  Which is to say, I’ll probably have to run it again in the near future.
  2. We grab a frame from the video and crop and resize it to feed to the segmentation model.
  3. We take the prediction from the segmentation model and pass it to the embedding model.  This gets us the video frame’s embedding as well as the class label of the dictionary it should be found in.
  4. We compare the embedding to all our pre-computed embeddings of the same class to find the most similar synthetic rink.
  5. We look up the homography matrix that was used to create the synthetic rink and that becomes our prediction for the frame.
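Those five steps in code, again with illustrative names and the helpers sketched earlier.  Step 1 happens offline; the embedding model here is assumed to return both the embedding and the class probabilities, as described in Getting Centered:

```python
import numpy as np

def predict_homography(frame, segmentation_model, embedding_model, dictionaries):
    x = prep_frame(frame)                                        # step 2: crop + resize
    seg = segmentation_model.predict(x[np.newaxis], verbose=0)   # step 3: segment...
    emb, class_probs = embedding_model.predict(seg, verbose=0)   #         ...then embed + classify
    d = dictionaries[int(np.argmax(class_probs[0]))]             # step 4: pick the dictionary...
    dists = np.linalg.norm(np.array(d["embeddings"]) - emb[0], axis=1)
    return d["matrices"][int(np.argmin(dists))]                  # step 5: most similar rink's matrix
```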

Next Steps

This GIF I keep posting (our “test set”) contains 890 images.  On CPU, it took about 89 seconds to generate the predictions for this one, which includes some time actually making the images in it.  So we’re doing about 10 frames per second, meaning we can reasonably generate predictions on CPU.  In fact, not only can we run the models on CPU, they were trained entirely on CPU.  The predictions could use a little work, but we’ve definitely met our goal of speeding things up.

Looking forward, I’ve got an idea of how I want to re-work the player tracking end of things.  If we get some reasonable results there, we’ll tie everything together.  Hopefully, when we generate all our player coordinates, we can apply some smoothing to them to iron out the rough patches.
