Introducing the Model

Photo by Bozhin Karaivanov / Unsplash

We hope you are enjoying and benefitting from our race predictions. At this point you may be wondering how it all works.

Our race predictions are made using a statistical model and machine learning techniques. The model does all the work of form study for you, saving you time and effort that you can put towards more in-depth, targeted race analysis.

Being based on years of historic data, our model's understanding of form is both broad and deep. Its predictions summarise all this understanding about a race in a useful set of highly interpretable numbers, at high speed, so we can cover all UK and Irish races, every day. Human biases, such as stopping after finding one eye-catching piece of form, studying only some horses in a race, or giving certain form factors too much or too little weight, are not an issue for the model.

We believe ours may be the most advanced of all models currently deployed in the field of horse racing analytics.

In keeping with our commitment to be as transparent as possible, we share some details of the model on this page. It is challenging, however, to convey this technical subject in a way that is understandable to a broad audience, and we must keep some significant details private for commercial reasons. Within those constraints, we will give as much detail as we can, using everyday language wherever possible.

First, we will give a brief summary of the model using technical jargon, so that more specialist readers (e.g. data scientists and statisticians) will know immediately what they are dealing with. The rest of the page is then devoted to explaining the model in more familiar language for a general audience.

For a technical audience

Sequential Monte Carlo model learning.

We use a mixed effects hierarchical model, a form of generalised linear model using the logit link. This is a statistical explanation of the variation in binary race outcomes using numerical and categorical covariates and a set of parameters.

In the field of horse racing modelling and analytics, generalised linear models have been found to have greater external validity (that is, they are less over-fitted to the historic data) than non-parametric methods such as horse ratings or score-based models, making them particularly useful for prediction. Using random effects as well as fixed effects gives our model the horse-level precision from which those non-parametric methods benefit. In particular, in our model, race outcome depends on horse-specific effects.
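As a rough illustration of how fixed effects and horse-specific random effects combine under a logit link, here is a minimal Python sketch. It is not our production model: the covariate (soft going), the horse names and every number are hypothetical, chosen only to show the mechanics.

```python
import math

def inverse_logit(x):
    """Map a log-odds value to a probability between 0 and 1."""
    return 1.0 / (1.0 + math.exp(-x))

# Hypothetical fixed effects: shared by every horse.
FIXED_EFFECTS = {"intercept": -2.0, "soft_going": -0.3}

# Hypothetical random effects: horse-specific deviations from the fixed effects.
HORSE_EFFECTS = {
    "Horse A": {"intercept": 0.8, "soft_going": 0.5},
    "Horse B": {"intercept": -0.1, "soft_going": -0.2},
}

def win_log_odds(horse, soft_going):
    """Linear predictor on the log-odds scale: fixed part plus horse-specific part."""
    fixed = FIXED_EFFECTS["intercept"] + FIXED_EFFECTS["soft_going"] * soft_going
    # A horse with no history gets zero deviation, i.e. the population-level effect only.
    horse_part = HORSE_EFFECTS.get(horse, {"intercept": 0.0, "soft_going": 0.0})
    random = horse_part["intercept"] + horse_part["soft_going"] * soft_going
    return fixed + random

for name in ["Horse A", "Horse B", "Unraced Horse"]:
    probability = inverse_logit(win_log_odds(name, soft_going=1))
    print(f"{name}: {probability:.3f}")
```

In this sketch each horse's outcome is treated as a separate binary event, so the probabilities across a field are not forced to sum to one.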

The hierarchical aspect of the model provides a degree of partial pooling. This means some model parameters are grouped together, and our inferences for one group are modulated by the information contained in other groups.
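The effect of partial pooling can be illustrated with a much simpler stand-in than our hierarchical model: a beta-binomial shrinkage estimate of a strike rate. In this hypothetical Python sketch, a group with few runs is pulled strongly towards the overall rate, while a group with many runs is left almost unchanged. The function name, the `prior_strength` value and the example figures are all assumptions made for illustration.

```python
def partially_pooled_rate(wins, runs, overall_rate, prior_strength=20.0):
    """Shrink a group's raw win rate towards the overall rate.

    prior_strength acts like a number of 'pseudo-runs' at the overall rate,
    so groups with few real runs borrow more information from the others.
    """
    return (wins + prior_strength * overall_rate) / (runs + prior_strength)

overall = 0.10  # hypothetical win rate across all groups

# A hypothetical trainer with 2 wins from 2 runs: the raw rate is 1.0.
small_sample = partially_pooled_rate(wins=2, runs=2, overall_rate=overall)

# A hypothetical trainer with 200 wins from 1000 runs: the raw rate is 0.2.
large_sample = partially_pooled_rate(wins=200, runs=1000, overall_rate=overall)

print(f"small sample: raw 1.000, pooled {small_sample:.3f}")
print(f"large sample: raw 0.200, pooled {large_sample:.3f}")
```

A full hierarchical model learns the amount of shrinkage from the data rather than fixing it in advance, but the direction of the adjustment is the same.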

We take a Bayesian approach to model inference. This requires some cutting-edge results from Bayesian modelling and MCMC. We brought these results together in a sequential Monte Carlo algorithm, which allows continuous updating of the parameter sample in response to new data, keeping our model up to date.
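The general shape of a sequential Monte Carlo update (reweight the parameter sample by the likelihood of the new data, resample, then move) can be sketched in a few lines of Python. This is a toy with a single parameter, one horse's log-odds of winning, and it is not our algorithm: the prior, the jitter step (a crude stand-in for the MCMC move a real implementation would use) and the sequence of results are all assumptions.

```python
import math
import random

def inverse_logit(x):
    return 1.0 / (1.0 + math.exp(-x))

def smc_update(particles, weights, won, rng, jitter_sd=0.05):
    """One reweight-resample-move step for a single new race result."""
    # Reweight: multiply each particle's weight by the likelihood of the result.
    new_weights = []
    for theta, weight in zip(particles, weights):
        p_win = inverse_logit(theta)
        likelihood = p_win if won else 1.0 - p_win
        new_weights.append(weight * likelihood)
    total = sum(new_weights)
    new_weights = [w / total for w in new_weights]

    # Resample: draw particles in proportion to their updated weights.
    resampled = rng.choices(particles, weights=new_weights, k=len(particles))

    # Move: small random jitter to restore diversity (stand-in for an MCMC step).
    moved = [theta + rng.gauss(0.0, jitter_sd) for theta in resampled]
    uniform = [1.0 / len(moved)] * len(moved)
    return moved, uniform

rng = random.Random(0)
particles = [rng.gauss(0.0, 1.0) for _ in range(2000)]  # hypothetical prior sample
weights = [1.0 / len(particles)] * len(particles)
prior_mean = sum(particles) / len(particles)

# Hypothetical new results arriving one at a time: win, win, loss, win.
for won in [True, True, False, True]:
    particles, weights = smc_update(particles, weights, won, rng)

posterior_mean = sum(particles) / len(particles)
print(f"mean log-odds before: {prior_mean:.2f}, after: {posterior_mean:.2f}")
```

The point of the structure is that each new result only requires an update of the existing sample, not a refit from scratch.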

Predictions are made using simulation: we simulate a large number of results for a target race using the model parameterisation represented by each Monte Carlo sample and estimate outcome probabilities from the totality of simulations.
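The simulation step can be sketched as follows. For each parameter sample we simulate many runs of the target race and then take each horse's share of simulated wins across all samples. How a single race is simulated here is an assumption made purely for illustration (the winner is drawn with probability proportional to the exponential of a per-horse strength); the horse names and strengths are hypothetical.

```python
import math
import random
from collections import Counter

def simulate_winner(strengths, rng):
    """Draw one winner; chance is proportional to exp(strength) (an assumed rule)."""
    horses = list(strengths)
    weights = [math.exp(strengths[h]) for h in horses]
    return rng.choices(horses, weights=weights, k=1)[0]

def estimate_win_probabilities(parameter_samples, sims_per_sample, rng):
    """Pool simulated results over every parameter sample and count wins."""
    counts = Counter()
    for strengths in parameter_samples:
        for _ in range(sims_per_sample):
            counts[simulate_winner(strengths, rng)] += 1
    total = len(parameter_samples) * sims_per_sample
    return {horse: counts[horse] / total for horse in parameter_samples[0]}

# Three hypothetical Monte Carlo samples of the horses' strengths.
parameter_samples = [
    {"Horse A": 1.0, "Horse B": 0.0, "Horse C": -0.5},
    {"Horse A": 0.8, "Horse B": 0.2, "Horse C": -0.4},
    {"Horse A": 1.1, "Horse B": -0.1, "Horse C": -0.6},
]

rng = random.Random(1)
win_probabilities = estimate_win_probabilities(parameter_samples, 2000, rng)
for horse, probability in win_probabilities.items():
    print(f"{horse}: {probability:.3f}")
```

Because the simulations are pooled over all parameter samples, the resulting probabilities reflect uncertainty about the parameters as well as the randomness of the race itself.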

For a general audience

The model learns from past data (race results) and predicts the outcome of future races.

The basic idea of a model like ours is that we learn, from past results, the relationships between various data points and race outcomes. These relationships are encoded in some mathematical machinery. When we want to make predictions for a future race, we input the relevant data points for that race into the machinery, which then produces some predictions using everything previously learned.

The format of the predictions is a probability for each horse winning the race. This is a number between 0 and 1 (shown as a percentage in our Full Model Predictions) where 0 means the horse will definitely not win and 1 means it definitely will. In practice values of 0 or 1 never occur because we are always somewhat uncertain about the outcome of a race. You can think of the probability as a fraction: if you imagine running the race with exactly the same starting conditions many, many times, the probability of winning is the fraction of re-runs of the race in which the horse wins. We also provide a probability of each horse being placed, which has a similar meaning.

The "data points" mentioned above are the information the model requires in order to make predictions. These include, but are not limited to:

  • Going
  • Distance
  • Jockey
  • Trainer
  • Pedigree (horse's sire and dam)
  • Stall number/draw (if a flat race)

The influence of these data points on a horse's win probability is computed by the model using a mathematical formula. Each data point is represented in this formula by numbers which are multiplied or added together.

This is a common approach in sports modelling. There are several things we do differently which make our model special. These are:

1. Horse-level modelling

The numbers in the formula at the heart of our model are specific to the individual horse. For example, whether a particular horse in a particular race performs better on fast going or slow going is represented in our model.

This makes the model very powerful. It means we enjoy the best of both worlds: all the general advantages of statistical modelling combined with the ability to learn from and make use of individual horse traits, as in horse ratings-based prediction methods.

2. Race outcome representation

It is typical in models like ours to represent the outcome of a horse race just as a sequence of true or false values, indicating whether each horse wins its race or not. The way race outcomes are represented in a model can have an impact on the efficiency of learning important patterns in past race data.

We therefore thought very hard about how best to do this. The precise details will have to remain a trade secret - the "special sauce" of our model - so all we can say is we came up with something better. Our approach makes our model better able to learn from data and, we believe, makes our predictions more accurate.

3. Continual learning

A model is not much use without two things: 1) high quality data and 2) a good way to learn from those data and give values to all the important numbers in the model.

High-quality data are not too difficult to get these days. Suffice it to say, we have an excellent source of data, updated with new race results every day. As for how to learn from data, we developed a machine learning algorithm using cutting-edge statistical methods (represented by the picture at the top of this page). A key advantage of our algorithm is that it can learn from new data as they arrive. This means our model is updated frequently, keeping up with new evidence about horses' abilities (and about all the other important factors such as jockeys, trainers, etc.).

Further reading

For more information, see our pages: