How It Works

Last updated: May 2026

A detailed look at what feeds the model, how it's trained, and how we know whether it's working. No black boxes — every input and every accuracy number on this site is reproducible from the data in our public database.

The Seven Inputs

Each prediction starts as seven numbers about tonight's matchup. The same seven numbers are computed at training, validation, and serving time — there is no hidden feature drift.

1. ELO Rating Gap

ELO is a single number per team that updates after every game — winning teams gain points, losing teams lose them, and the size of the swing depends on the margin of victory and the strength of the opponent. We use FiveThirtyEight's published NBA ELO parameters (K=20, +100 home-court advantage, 75/25 inter-season regression toward 1505).

In our backtest this turned out to be the model's most important feature.
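
A minimal sketch of the rating update, using the parameters quoted above (K=20, +100 home court, 75/25 regression toward 1505). The function names are hypothetical and are not taken from elo.py. The margin-of-victory multiplier is written in the form FiveThirtyEight described for its NBA ratings; treat it as an assumption about how the swing is scaled here.

```python
HOME_ADVANTAGE = 100.0  # Elo points credited to the home team (from the text)
K_FACTOR = 20.0         # base update size (from the text)
LEAGUE_MEAN = 1505.0    # inter-season regression target (from the text)


def expected_home_win(home_elo: float, away_elo: float) -> float:
    """Standard Elo logistic curve with the home-court bonus applied."""
    diff = home_elo + HOME_ADVANTAGE - away_elo
    return 1.0 / (1.0 + 10.0 ** (-diff / 400.0))


def update_after_game(home_elo, away_elo, home_pts, away_pts):
    """Return (new_home_elo, new_away_elo) after one completed game."""
    home_won = home_pts > away_pts
    diff = home_elo + HOME_ADVANTAGE - away_elo
    winner_diff = diff if home_won else -diff
    margin = abs(home_pts - away_pts)
    # Assumed multiplier: bigger wins move ratings more, and wins by an
    # already-favoured team are damped.
    mov = (margin + 3.0) ** 0.8 / (7.5 + 0.006 * winner_diff)
    actual = 1.0 if home_won else 0.0
    delta = K_FACTOR * mov * (actual - expected_home_win(home_elo, away_elo))
    # The update is zero-sum: what one team gains, the other loses.
    return home_elo + delta, away_elo - delta


def regress_between_seasons(elo: float) -> float:
    """Keep 75% of last season's rating, pull 25% toward the league mean."""
    return 0.75 * elo + 0.25 * LEAGUE_MEAN
```

The ELO rating gap fed to the model would then be the difference between the two teams' current ratings; whether the home bonus is folded into that gap is not stated above.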

2-4. Rolling Point Differential (20-, 10-, 5-game windows)

For each team we compute the average point margin (points scored minus points allowed) across its last 20, 10, and 5 completed games. Three windows give the model three views: the 20-game view is statistically stable, the 5-game view reacts quickly to changes in form, and the 10-game view splits the difference.

Why raw point differential rather than a fancier metric? Because it's grounded in games that actually happened, captured directly from final scores, with no API dependency or version-drift risk.
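
A rough sketch of the rolling calculation. The function names are hypothetical. It also assumes each window feeds the model as a home-minus-visitor gap, which is one way to get three inputs from two teams; the text above does not spell this out.

```python
from typing import Dict, Optional, Sequence


def rolling_point_diff(margins: Sequence[float], window: int) -> Optional[float]:
    """Mean point margin over the most recent `window` completed games.

    `margins` must be ordered oldest to newest and contain only games
    that finished before tonight, so no future result leaks in.
    """
    if window <= 0:
        raise ValueError("window must be positive")
    recent = margins[-window:]
    if not recent:
        return None  # no history yet, e.g. the first game in the database
    return sum(recent) / len(recent)


def window_gaps(home_margins, away_margins, windows=(20, 10, 5)) -> Dict[int, Optional[float]]:
    """Assumed feature form: home rolling margin minus visitor rolling margin."""
    gaps = {}
    for w in windows:
        home = rolling_point_diff(home_margins, w)
        away = rolling_point_diff(away_margins, w)
        gaps[w] = None if home is None or away is None else home - away
    return gaps
```

A team with fewer completed games than the window simply averages what it has; a real pipeline might instead drop those early-season rows.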

5. Rest Day Gap

The difference in days since each team's last game, positive when the home team is fresher. The effect of rest is well documented: teams shoot worse and turn the ball over more on zero days of rest.

6-7. Back-to-Back Flags

Two binary inputs — was the home team on the second leg of a back-to-back? Was the visitor? These overlap with rest days but the model gets to learn that B2B specifically is a step-change effect rather than a smooth one.

In the trained model, the visitor B2B flag matters more than the home B2B flag — visitors compound travel fatigue with back-to-back fatigue.
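
The rest gap and the two flags can be sketched together. This assumes a back-to-back means the team also played on the previous calendar day; the function and key names are hypothetical.

```python
from datetime import date


def days_since_last_game(game_day: date, last_game_day: date) -> int:
    """Calendar days between a team's previous game and tonight's game."""
    return (game_day - last_game_day).days


def rest_features(game_day: date, home_last: date, away_last: date) -> dict:
    home_days = days_since_last_game(game_day, home_last)
    away_days = days_since_last_game(game_day, away_last)
    return {
        # Positive when the home team has had more time off.
        "rest_day_gap": home_days - away_days,
        # Assumed definition: played yesterday, i.e. zero full days of rest.
        "home_b2b": int(home_days == 1),
        "away_b2b": int(away_days == 1),
    }
```

Keeping the flags separate from the gap lets a tree split on "played yesterday" directly instead of having to recover it from a difference of two rest counts.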

The Model: XGBoost + Isotonic Calibration

The classifier is XGBoost — gradient-boosted decision trees. It's well-suited to tabular numeric data with a handful of inputs, robust against feature scale mismatches, and outputs a probability.

We follow three standard training practices:

60/20/20 train/calibration/test split. Sixty percent of the data trains the raw XGBoost model. Twenty percent is held out and used to fit an isotonic regression calibration on top of the raw probabilities. The final twenty percent is touched by neither and used to report accuracy.
Stable feature schema. The training code and the live inference code share the same function (features.build_feature_vector) so the seven inputs are defined identically — no possibility of a silent train/serve gap.
Reproducible from data. The full training database (about 7,600 NBA games with final scores) is committed to our public repository. Anyone can clone the repo, run python src/train.py, and reproduce the model exactly.
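
As an illustration of the 60/20/20 split described above, here is a standard-library sketch. The function name and the fixed seed are assumptions, not a copy of what train.py does.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_60_20_20(rows: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T], List[T]]:
    """Shuffle once with a fixed seed, then cut into train / calibration / test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split reproducible
    n = len(shuffled)
    train_end = n * 6 // 10
    calib_end = n * 8 // 10
    train = shuffled[:train_end]           # fits the raw XGBoost model
    calib = shuffled[train_end:calib_end]  # fits the isotonic calibration only
    test = shuffled[calib_end:]            # touched by neither, used to report accuracy
    return train, calib, test
```

The three pieces are disjoint by construction, which is what allows the final accuracy number to be an honest out-of-sample estimate.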

Calibration: What Confidence Numbers Mean

Raw XGBoost probabilities don't have to mean what they say. A model might output 70% on games that actually win 60% of the time — overconfident — or 60% on games that actually win 70% — underconfident. Isotonic regression fixes this by learning a monotonic mapping from "raw model output" to "calibrated probability" using games the model never saw during training.
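
To make "learning a monotonic mapping" concrete, here is a small pure-Python sketch of isotonic regression using the pool-adjacent-violators idea. It is illustrative only: the function names are hypothetical, the result is a step function, and tied raw scores are handled naively. A production pipeline would more likely use a library implementation.

```python
from typing import List, Sequence, Tuple


def fit_isotonic(raw_probs: Sequence[float], outcomes: Sequence[int]) -> List[Tuple[float, float]]:
    """Return (upper_raw_bound, calibrated_value) steps, non-decreasing in value."""
    blocks = []  # each block: [sum_of_outcomes, count, largest_raw_in_block]
    for raw, won in sorted(zip(raw_probs, outcomes)):
        blocks.append([float(won), 1, raw])
        # Pool adjacent blocks while the earlier one has the higher win rate,
        # since that ordering would violate monotonicity.
        while len(blocks) > 1 and (
            blocks[-2][0] / blocks[-2][1] > blocks[-1][0] / blocks[-1][1]
        ):
            total, count, upper = blocks.pop()
            blocks[-1][0] += total
            blocks[-1][1] += count
            blocks[-1][2] = upper
    return [(upper, total / count) for total, count, upper in blocks]


def calibrate(steps: List[Tuple[float, float]], raw: float) -> float:
    """Map a raw model output to the win rate of the step it falls in."""
    for upper, value in steps:
        if raw <= upper:
            return value
    return steps[-1][1]  # above every calibration score: use the top step
```

The mapping is fitted only on the calibration slice, so the calibrated probabilities reflect how raw scores behaved on games the booster never trained on.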

The result, measured on our historical backtest of 7,146 games:

Model says	Actually wins
54%	52%
65%	63%
74%	72%
85%	86%
100%	95%

Live numbers and the full reliability table are on the Track Record page and refresh daily.
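
A reliability table like the one above can be built by bucketing predictions and comparing each bucket's average prediction with its observed win rate. This is a sketch with equal-width buckets; the bucket edges actually used on the Track Record page are not stated here, and the function name is hypothetical.

```python
from typing import List, Sequence, Tuple


def reliability_table(
    probs: Sequence[float], outcomes: Sequence[int], n_bins: int = 10
) -> List[Tuple[float, float, int]]:
    """Return (mean_predicted, observed_win_rate, games) for each non-empty bucket."""
    sum_prob = [0.0] * n_bins
    wins = [0] * n_bins
    games = [0] * n_bins
    for p, won in zip(probs, outcomes):
        idx = min(int(p * n_bins), n_bins - 1)  # p == 1.0 goes in the top bucket
        sum_prob[idx] += p
        wins[idx] += won
        games[idx] += 1
    return [
        (sum_prob[i] / games[i], wins[i] / games[i], games[i])
        for i in range(n_bins)
        if games[i] > 0
    ]
```

A well-calibrated model produces rows where the first two numbers are close; the game count shows how much weight each row deserves.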

How LOCKs Are Decided

A "LOCK" isn't the model's most confident pick in isolation. It's a comparison between our model's implied point spread and the Vegas closing spread. When the gap is three points or more, we flag the game. The direction is set by the sign of the gap:

If our model favours the home team by more than Vegas does, the LOCK is HOME
If our model favours the visitor by more than Vegas does, the LOCK is VISITOR
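
The rule above can be sketched as follows. To sidestep betting sign conventions, both inputs are expressed as the expected home margin in points (positive means the home team is favoured); that representation and the function name are assumptions. How the model's win probability is converted to an implied spread is not covered here.

```python
from typing import Optional


def lock_side(
    model_home_margin: float, vegas_home_margin: float, threshold: float = 3.0
) -> Optional[str]:
    """Return "HOME", "VISITOR", or None when the disagreement is under the threshold."""
    gap = model_home_margin - vegas_home_margin
    if gap >= threshold:
        return "HOME"     # model likes the home team more than the market does
    if gap <= -threshold:
        return "VISITOR"  # model likes the visitor more than the market does
    return None
```

For example, lock_side(7.5, 4.0) flags HOME because the model's margin exceeds the market's by 3.5 points, while lock_side(3.0, 1.0) flags nothing.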

This isn't betting advice; it's a flag that the model has reasonable disagreement with the market. Vegas wins most disagreements, because Vegas is excellent.

How Often the Model Is Wrong

About a third of the time, straight up: overall accuracy of 67.1% means we miss roughly 32.9% of games. On games where the model is least confident (50-55%), it's essentially a coin flip and shouldn't be treated as a signal. On games where the model is most confident (predicted >65%), it hits about 75%, which still means it's wrong about one time in four.

Streaks happen. Any string of losses lasting fewer than ~10 games is consistent with the model working as designed and just running cold for a stretch.
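
For readers who want to check how likely a cold stretch is, the sketch below computes the chance of seeing at least one run of consecutive misses in a given number of picks. It assumes every pick misses independently with the same probability, which is a simplification, and the function name is hypothetical.

```python
def prob_losing_streak(n_games: int, streak_len: int, miss_rate: float) -> float:
    """P(at least one run of `streak_len` straight misses within `n_games` picks)."""
    # state[j]: probability that no qualifying streak has happened yet and the
    # current trailing run of misses has length j.
    state = [0.0] * streak_len
    state[0] = 1.0
    reached = 0.0
    for _ in range(n_games):
        nxt = [0.0] * streak_len
        for run, p in enumerate(state):
            nxt[0] += p * (1.0 - miss_rate)  # a correct pick resets the run
            if run + 1 == streak_len:
                reached += p * miss_rate     # this miss completes the streak
            else:
                nxt[run + 1] += p * miss_rate
        state = nxt
    return reached
```

Calling it with the overall miss rate of 0.329 and the number of picks in the period of interest gives a baseline for how surprising a particular streak is.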

What the Model Doesn't See

Lineup news that breaks after the morning ETL runs
Coaching matchups, individual player matchups, defensive schemes
Pace specifically — captured indirectly through point differentials but not as a standalone feature in the current version
Anything Vegas knows that we don't — sharps and books have private information

We're transparent about this because honest limitations are part of an honest model. The accuracy gap between our 67.1% and Vegas's roughly 68% is almost entirely down to information we can't see.

How We'd Like to Improve It

The model gets better when we add features that aren't already captured indirectly. On our short list:

True team efficiency metrics (offensive/defensive rating, true shooting %) at the historical date — needs a backfill of historical advanced-stats snapshots first
Per-player impact via RAPTOR-style ratings, properly attributed to who's available tonight
A temporal train/test split (train on older seasons, test on newer) — the current 60/20/20 random split slightly overstates accuracy because it can train on a future game when predicting a past one
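
A temporal version of the split could look like the sketch below: sort by game date, then cut in order so that every calibration and test game is later than every training game. The function name and the date accessor are hypothetical, and games played on the same date at a cut point would need a tie-breaking rule in practice.

```python
from typing import Callable, List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def temporal_split(
    games: Sequence[T], game_date: Callable[[T], object]
) -> Tuple[List[T], List[T], List[T]]:
    """Oldest 60% train, next 20% calibrate, newest 20% test."""
    ordered = sorted(games, key=game_date)
    n = len(ordered)
    train_end = n * 6 // 10
    calib_end = n * 8 // 10
    return ordered[:train_end], ordered[train_end:calib_end], ordered[calib_end:]
```

Unlike a random shuffle, this never lets the model learn from a game played after the one it is being scored on.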

Source Code

Everything described here is in our public GitHub repository:

features.py — feature definitions
elo.py — ELO math
train.py — model training + calibration
predict.py — daily inference
backtest.py — accuracy + Brier score evaluation

Have a feature you think we should add? Email ben.g.ballard@gmail.com.