Machine Learning and the Quest for a Perfect March Madness Bracket

Written by Nathan Babcock:

Every March, the nation’s premier college basketball programs compete in the NCAA Division I Men’s and Women’s Basketball tournaments, better known as March Madness. Millions of people, even the President, fill out bracket predictions and watch the intense games with dreams of winning huge cash prizes. The tournament never fails to live up to its name. The madness is impossible to predict, but many people try anyway, using masses of data and machine learning in an attempt to create the perfect bracket.

Predicting the outcome of the tournament, let alone every single game, is a next-to-impossible task. Why? Unprecedented upsets can occur, and star players can easily be injured during a grueling tournament schedule, among other factors. Mathematically speaking, each of the 63 games in a March Madness bracket has two possible winners, so the number of possible brackets is 2^63, or 9,223,372,036,854,775,808. This means that if you flipped a coin to pick the winner of each game, your odds of predicting all 63 games correctly would be 1 in about 9.2 quintillion. Of course, flipping a coin is an irrational way to pick certain games, especially since 16-seeds and 15-seeds have historically almost always lost their first-round matchups. Taking this into account, one professor has estimated that the odds for an informed picker are closer to 1 in 128 billion. However, ‘almost’ is the word that stands out here, because there are always outliers. Just look at the 2018 Men’s tournament, when the 16-seed UMBC Retrievers knocked out the #1 overall seed Virginia in the first round, or the 1998 Women’s tournament, when 16-seed Harvard beat #1 seed Stanford. These anomalies speak to the unpredictability of the NCAA tournament and show why it is hard to rely on historical precedent and trends alone. So where should people look? Data and algorithms.
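The coin-flip arithmetic above can be checked in a few lines of Python. This is only a sketch of the calculation: the variable names are illustrative, and the 128 billion figure is the professor's estimate quoted in the text, not something the code derives.

```python
# Each of the 63 games has two possible winners, so the number of
# distinct brackets is 2 multiplied by itself 63 times.
GAMES = 63
total_brackets = 2 ** GAMES

# Chance that 63 fair coin flips all match the real results.
coin_flip_probability = 1 / total_brackets

# Estimate quoted in the article for an informed picker (not derived here).
informed_estimate = 128_000_000_000

print(f"Possible brackets: {total_brackets:,}")
print(f"Coin-flip probability: {coin_flip_probability:.3e}")
print(f"Informed odds are about {total_brackets / informed_estimate:,.0f} times better")
```

By this arithmetic, an informed picker's odds are roughly 72 million times better than a coin flipper's, and still vanishingly small.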

Every year, Google hosts an NCAA Tournament Machine Learning competition where data scientists can showcase their skills, attempt to predict the NCAA Tournament bracket, and win prize money. What is machine learning? It is a form of data analysis and artificial intelligence that automates analytical model building: a model learns from data, draws conclusions, and makes predictions with as little human intervention as possible. This competition differs from the typical bracket challenge that ESPN, CBS, or other media companies host. As participant Lotan Weininger stated: “Rather than starting from the tournament lineup and predicting games round-by-round in an attempt to predict an overall winner, competitors create predictions for the outcome of every possible tournament game that could occur; all 2,278 of them. And it doesn’t stop there: alongside predicting an overall winner and loser, competitors must also predict the percentage likelihood that one team will defeat another, with competition points awarded on a logarithmic scale. This means that confident predictions are punished more than conservative ones when incorrect.” Contestants were given a massive dataset of over 40 million rows, collected over decades of college basketball games. With a dataset of that size, it can be challenging to draw real conclusions. The role of the machine learning model is to analyze the data from countless basketball games and discover real-world patterns in it. However, this is difficult because the dataset provided by Google captures only a portion of the real world.
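To make the scoring idea in the quote concrete, here is a minimal sketch. It assumes a 68-team field (which is where the 2,278 possible pairings come from) and assumes the "logarithmic scale" is the standard log loss formula; the function name and the example probabilities are made up for illustration and are not taken from the competition itself.

```python
import math

# Number of distinct pairings among 68 tournament teams ("68 choose 2").
TEAMS = 68
possible_matchups = math.comb(TEAMS, 2)

def log_loss(predicted_prob: float, team_a_won: bool) -> float:
    """Penalty for one game, given the predicted probability that team A wins."""
    # Clip to avoid log(0) when a prediction is exactly 0 or 1.
    p = min(max(predicted_prob, 1e-15), 1 - 1e-15)
    return -math.log(p) if team_a_won else -math.log(1 - p)

# Team A loses in both scenarios; only the forecaster's confidence differs.
confident_miss = log_loss(0.95, team_a_won=False)
cautious_miss = log_loss(0.60, team_a_won=False)

print(f"possible matchups: {possible_matchups}")
print(f"confident miss penalty: {confident_miss:.3f}")
print(f"cautious miss penalty:  {cautious_miss:.3f}")
```

Under this formula, the confident wrong pick costs about 3.0 while the cautious wrong pick costs about 0.9, which is the asymmetry the quote describes. Lower totals are better.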

How can we determine what separates a great team from an average or mediocre one? This is an essential question to answer when predicting a bracket. The easiest way to separate teams into categories is to classify them as “tournament teams” or “non-tournament teams”. After all, only 68 teams in the country make the tournament field. The differences between these two groups are quite telling.

It is clear that some of the metrics derived from the dataset can distinguish great teams from mediocre ones. Lotan Weininger, the data scientist quoted above, picked out seven metrics that best separate the teams: KenPom ranking, offensive rating, defensive rating, NET ranking, tempo, possession time per game, and adjusted KenPom ranking. Ultimately, these metrics can be fed into a binary classification algorithm, such as a logistic regression model, to predict the probability of one team beating another across every possible matchup. This is just one method to the madness, and many contestants took alternative approaches!
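As a rough illustration of the logistic regression idea, the sketch below fits a one-feature model in plain Python. It is not Weininger's model: the data is synthetic, the single "rating difference" feature is a hypothetical stand-in for the seven metrics listed above, and names such as TRUE_WEIGHT and win_probability are invented for this example.

```python
import math
import random

random.seed(0)  # fixed seed so the synthetic data is reproducible

def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))

# Synthetic training data (hypothetical): each row is the difference in a
# made-up "rating" between team A and team B, and whether team A won.
TRUE_WEIGHT = 0.15  # assumed strength of the rating effect, for simulation only
games = []
for _ in range(500):
    rating_diff = random.uniform(-20, 20)
    team_a_won = 1 if random.random() < sigmoid(TRUE_WEIGHT * rating_diff) else 0
    games.append((rating_diff, team_a_won))

# Fit weight and bias by gradient descent on the average log loss.
weight, bias = 0.0, 0.0
learning_rate = 0.01
for _ in range(1000):
    grad_w = grad_b = 0.0
    for x, y in games:
        error = sigmoid(weight * x + bias) - y
        grad_w += error * x
        grad_b += error
    weight -= learning_rate * grad_w / len(games)
    bias -= learning_rate * grad_b / len(games)

def win_probability(rating_diff: float) -> float:
    """Estimated probability that team A beats team B."""
    return sigmoid(weight * rating_diff + bias)

print(f"learned weight: {weight:.3f}, bias: {bias:.3f}")
print(f"P(A wins | +10 rating edge) = {win_probability(10):.2f}")
print(f"P(A wins | even matchup)    = {win_probability(0):.2f}")
```

The output of such a model is a probability rather than a pick, which suits a competition that scores the stated likelihood of each result. A real entry would train on actual game data with several features, typically using a library rather than a hand-written loop.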

March Madness will always captivate people all over the world, even people who do not like or watch basketball. There is so much at stake for both players and fans during this unpredictable tournament, so consider using data to your advantage when entering next year’s bracket pool!

Sources:

“The Absurd Odds of A Perfect NCAA Bracket”. NCAA.com, 2022, https://www.ncaa.com/news/basketball-men/bracketiq/2022-03-10/perfect-ncaa-bracket-absurd-odds-march-madness-dream.

“How We Predicted March Madness Using Machine Learning”. Medium, 2021, https://lotanweininger.medium.com/march-madness-machine-learning-2dbacc948874.

“How To Use Machine Learning To Predict NCAA March Madness”. Analytics8, 2021, https://www.analytics8.com/blog/how-to-use-machine-learning-to-predict-ncaa-march-madness/.

“March Madness”. Medium, 2020, https://towardsdatascience.com/march-madness-9212109bc8e8.