What does it mean to be the best fighter in the world? Is it the person with the best kick? The best punch? Or is it something else? This project seeks to determine which factors are the best predictors of success in the UFC. It explores every fight since the organization's inception in the hopes of gaining insight into what separates winners from losers. We start with an exploratory visual analysis of all variables before cleaning the data and building our machine learning models. Currently, the algorithms are able to predict the outcome with approximately 85-88% accuracy. Our goal is to improve that over time through iteration.
I was originally inspired by this Kaggle kernel as well as my own background in martial arts growing up. Their model had a peak accuracy of 58-63% and I wanted to see if I could improve upon that. I wrote a new scraper, pulling data from fightmetric.com instead of ufc.com, spent more time on exploratory analysis and maintained a smaller overall feature set. That last point especially helped me overcome the curse of dimensionality and reach a peak accuracy of 88%.
The Ultimate Fighting Championship (UFC) is a mixed martial arts organization based out of Las Vegas, Nevada. Founded in 1992, its original goal was to pit fighters from around the world against each other to see which fighting style was the best. In the beginning, there were no rules and there were no weight classes. Eye gouges, groin strikes, headbutts and biting were all legal. Fighters weighing 160 lbs would get tossed in against 400 lb behemoths. This type of no-holds-barred contest had never been attempted before and it quickly gained notoriety for its brutality.
It also gave rise to a new way of thinking about fighting. When kung-fu masters, boxers, kickboxers and karate experts all met for the first time, no one could have predicted the outcome. Most people expected the larger, stronger fighters to win. On the contrary, it was an unassuming 178-pound man named Royce Gracie who went on to win the first contest. What was even stranger was that no one could explain how he won the first time (or the second, or the third). He didn't pummel his opponent into submission as expected, but instead locked him up using grappling moves most people hadn't seen before. This was the world's introduction to Brazilian Jiu Jitsu (BJJ).
After Royce went on to win several UFC contests, people began to realize he was not just getting lucky. Fighters started studying this martial art so they could apply its techniques in their own matches, or at the very least avoid being submitted by Royce. Then they started adding techniques from other fighting styles so that they, too, could become the best in the world. This was the beginning of mixed martial arts, or MMA. It started out with dozens of different styles, but a handful showed up repeatedly: BJJ, wrestling, boxing and kickboxing. Over time, fighters' skill in these four areas evolved to the point where no one art was able to completely dominate the others.
Now the UFC is watched by millions worldwide and its popularity continues to grow. The rules were changed to protect the safety of the fighters: you can no longer gouge your opponent's eyes, pull hair or bite. All fighters are required to wear standard MMA gloves, and shoes are no longer allowed in the octagon. There are weight classes ranging from 115 lbs all the way up to 265 lbs. A lot has changed in the UFC, and this project seeks to determine which components matter most in a match.
The data was scraped from www.fightmetric.com with Beautiful Soup in two stages. The first script compiled a list of event URLs and wrote them to a CSV. The second script crawled through those URLs and extracted all available fight data into a pandas dataframe. There are 31 features and 8624 records.
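To make the two-stage approach concrete, here is a minimal sketch of the first stage. The HTML snippet, table class and URLs below are placeholders, not the actual fightmetric.com markup, so the selectors would need adjusting against the real pages:

```python
from bs4 import BeautifulSoup

# Stage 1 (sketch): extract event links from an events listing page.
# The markup below is a stand-in -- the real fightmetric.com page
# structure and class names will differ.
listing_html = """
<table class="events">
  <tr><td><a href="http://example.com/event/1">UFC 1</a></td></tr>
  <tr><td><a href="http://example.com/event/2">UFC 2</a></td></tr>
</table>
"""

soup = BeautifulSoup(listing_html, "html.parser")
event_urls = [a["href"] for a in soup.select("table.events a")]
print(event_urls)

# Stage 2 would fetch each URL (e.g. with requests.get), parse the
# fight tables the same way, and append the rows to a pandas DataFrame.
```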
During our exploration, we'll examine the data from several perspectives. We'll put it under a microscope through univariate analysis, get a bird's eye view using a correlation matrix and view it from ringside through multivariate analysis.
# Notebook Functionality
import warnings
warnings.filterwarnings('ignore')
# Linear Algebra
import pandas as pd
pd.options.display.max_columns = 50
pd.options.display.max_colwidth = 75
from IPython.display import display
import numpy as np
from pprint import pprint
from scipy.stats import norm
import random
from datetime import datetime
# Visualizations
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
sns.set_style("white")
colors = sns.color_palette()
from plotnine import *
from plotnine.data import *
from mizani.breaks import date_breaks
from mizani.formatters import date_format
import skmisc
from PIL import Image
from wordcloud import WordCloud
import branca
import folium
#SciKit Learn Models
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score, precision_score, recall_score, classification_report, confusion_matrix
from sklearn.preprocessing import MinMaxScaler
fight_data = pd.read_csv("data/fights_v5.csv")
fight_df = pd.DataFrame(fight_data)
fight_df.info()
From this, we can see that we have a total of 31 columns, one of which is the dependent variable. The columns comprise 7 object types (First Name, Last Name, Method, Referee, Date, Location and Title) and 24 int types. This doesn't give us a complete picture of the data, however, so we'll use a few other pandas functions to get a better glimpse.
fight_df.describe(include=['O'])
fight_df.describe()
Most of the variables have low means and large maximum values, which means they are probably highly skewed to the right. It might make sense to apply a log transform to some of the distributions when conducting the univariate analysis.
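As a quick sanity check of that idea, here is a sketch (on synthetic data, not fight_df) showing how a log transform pulls in a heavy right tail; `np.log1p` (log(x + 1)) is used so zero counts are handled cleanly:

```python
import numpy as np
from scipy.stats import skew

# Synthetic right-skewed "count" data standing in for a column like strikes.
rng = np.random.RandomState(0)
counts = rng.lognormal(mean=3.0, sigma=1.0, size=5000)

raw_skew = skew(counts)            # strongly positive for a heavy right tail
log_skew = skew(np.log1p(counts))  # log(x + 1) handles zeros gracefully
print(raw_skew, log_skew)          # the transformed skew lands much closer to 0
```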
In the categorical overview, we can see that Herb Dean is truly a prolific ref! Who knew that he officiated nearly 1300 UFC fights? I would've guessed John McCarthy was the leader in that respect, but Herb Dean has definitely paid his dues. You'll never guess who's the most prolific UFC fighter. Spoiler alert: it's Michael Bisping. I bet he wasn't even on your radar.
Nothing to see here. The only column with any missing data is the referee column, and it's less than 0.4%. Let's keep it moving.
percent_missing = (fight_df.isnull().sum()/fight_df.isnull().count()).sort_values()
percent_missing
Before diving deeper into the dataset, let's take a moment to fix a few minor details.
It bugs me slightly that each fighter's name is buried in the middle of the dataframe. The analysis will be a lot simpler if that information is the first thing we see in each row. Let's take care of that and move the date, fight_id, win/loss and method columns to the front as well.
# Creates a copy of the dataframe for testing
fight_df = pd.DataFrame(fight_data.copy())
# Rearranges the columns so the name, date, fight_id, win/loss and method appear first
cols = fight_df.columns.tolist()
cols = cols[17:18] + cols[4:5] + cols[7:8] + cols[-1:] + cols[16:17] + cols[:4] + cols[5:7] + cols[8:16] + cols[18:-1]
fight_df = fight_df[cols]
# Sets the index as the fight_id
# fight_df.set_index('fight_id', inplace=True)
# Change kd to knock_down for readability
fight_df.rename(columns={'kd': 'knockdowns'}, inplace=True)
fight_df.tail()
Right off the bat I can see the data is organized in reverse-chronological order. Art Jimmerson fought in the first UFC event and I still remember him because he fought with one boxing glove. It was a crazy time... I'll reverse the dataframe so it's in chronological order and re-index the records.
I'm surprised by the lack of NaN values in this dataset. Apparently, Fightmetric is very organized! I'll have to dig deep to see if there are any inconsistencies.
fight_df['fight_id'] = abs(4337 - fight_df['fight_id'])
fight_df = fight_df[::-1]
index = list(range(len(fight_df)))
fight_df.index = index
fight_df['outcome'] = 0
fight_df.loc[fight_df['win/loss'] == 'W', 'outcome'] = 1
Let's format the date column as datetime objects and convert fight_id as a categorical variable while we're at it.
fight_df.loc[:, 'date'] = pd.to_datetime(fight_df.date)
fight_df['fight_id'] = pd.Categorical(fight_df['fight_id'])
all_fights = fight_df.copy()
fight_df.head()
Ah, much better! The earliest fights are at the beginning of the dataset and the later fights are at the end.
Finally, it looks like some of the locations are not formatted uniformly. We'll standardize them and separate the cities, states and countries into separate columns.
locations = sorted(fight_df['location'].unique())
locations_dict = {
'location': {
'Berlin, Germany': 'Berlin, Berlin, Germany',
'Chiba, Japan': 'Chiba, Chiba, Japan',
'Rio de Janeiro, Brazil': 'Rio de Janeiro, Rio de Janeiro, Brazil',
'Saitama, Japan': 'Saitama, Saitama, Japan',
'Sao Paulo, Brazil': 'Sao Paulo, Sao Paulo, Brazil',
'Singapore, Singapore': 'Marina Bay, Singapore',
'Tokyo, Japan': 'Tokyo, Tokyo, Japan',
}
}
updated_locations = fight_df.replace(locations_dict)
#sorted(updated_locations['location'].unique())
updated_locations['city'] = updated_locations.location.str.split(', ').str.get(0)
updated_locations['state'] = updated_locations.location.str.split(',').str.get(1)
updated_locations['country'] = updated_locations.location.str.split(',').str.get(-1)
updated_locations.to_csv('data/updated_locations.csv')
In this section, we'll use visualizations to get a clearer picture of the fight data. There are 32 variables at play here, so we'll keep it brief here and go into more depth in the multivariate section.
To start, we'll analyze data that is not specific to individual fighters, but to UFC fights in general. We'll examine the average number of rounds per fight, method used and decision outcome. Let's dive in!
How are the rounds distributed? Do most fights go the distance or do they usually end early? The maximum number of rounds depends on the type of fight: title fights are 5 rounds long, while non-title fights are 3 rounds long.
counts = fight_df['round'].value_counts()
plt.figure(figsize=(12,8))
plt.bar(counts.index, counts.values, alpha=.7, color='g')
plt.xlabel('Round')
plt.ylabel('Frequency')
for x, y in zip(counts.index, counts.values):
    plt.text(x, y, y, ha='center', va='bottom')
plt.show()
Very interesting. The most frequent fight length was 3 rounds, while the least frequent was 4. This tells me that most title fights go to a decision (5:1 ratio), but non-title fights are pretty evenly split between KOs, injury stoppages and submissions.
fight_df['method'].value_counts()
The "TKO", "Could" and "Other" categories immediately jump out at me as worth exploring. I have no idea what the "Could" or "Other" categories represent. I'm guessing the first is a typo, but I'll have to do some research to see what's going on there. I'm also curious as to why they placed KO/TKO and TKO into separate groups. I'll have to research that as well to see if we can consolidate the two categories.
display(fight_df[fight_df.method=='Could'].tail(2))
display(fight_df[fight_df.method=='Other'].tail(2))
display(fight_df[fight_df.method=='TKO'].tail(2))
After doing a little market research and totally not wasting time by watching UFC fights, it looks like the "Could" category stands for no contests, "Other" stands for draws and "TKO" stands for fights where one fighter could not continue due to injury. Each of these scenarios presents different challenges. Before going further, let's change "Could" to "No Contest", "Other" to "Draw" and "TKO" to "Injury".
fight_df.loc[fight_df.method=='Could', 'method'] = "No Contest"
fight_df.loc[fight_df.method=='TKO', "method"] = "Injury"
fight_df.loc[fight_df.method=='Other', "method"] = "Draw"
Fights that stop due to injury are usually ones in which one fighter was beating their opponent so badly that the fight had to be stopped because the opponent couldn't continue. I'll probably do some more market research later, but right now I see that as a win. The goal is to defeat the other fighter, so mission accomplished. I don't know if I'll combine that category with KO/TKOs, but it's a valid win.
No contests and draws, however, are a different story. A no contest could result from a fighter using an illegal technique, getting popped for doping or a multitude of other reasons, and a draw is such an insignificant anomaly that it doesn't warrant further investigation. Neither category demonstrates that one side dominated the other, and in many cases they are the result of luck. Because of this, they will most likely be dropped from the dataset before building any models.
counts = fight_df['method'].value_counts()
plt.figure(figsize=(12,8))
plt.bar(range(len(counts.index)), counts.values, alpha=.7)
plt.xlabel('Method')
plt.ylabel('Frequency')
plt.xticks(range(len(counts.index)), counts.index)
for x, y in zip(range(len(counts.index)), counts.values):
    plt.text(x, y, y, ha='center', va='bottom')
plt.show()
counts = fight_df['win/loss'].value_counts()
plt.figure(figsize=(12,8))
plt.bar(range(len(counts.index)), counts.values, alpha=.7)
plt.xlabel('Outcome')
plt.ylabel('Frequency')
plt.xticks(range(len(counts.index)), counts.index)
for x, y in zip(range(len(counts.index)), counts.values):
    plt.text(x, y, y, ha='center', va='bottom')
plt.show()
counts = fight_df[fight_df['win/loss']=='W'].method.value_counts()
plt.figure(figsize=(12,8))
plt.bar(range(len(counts.index)), counts.values, alpha=.7)
plt.xlabel('Win Type')
plt.ylabel('Frequency')
plt.xticks(range(len(counts.index)), counts.index)
for x, y in zip(range(len(counts.index)), counts.values):
    plt.text(x, y, y, ha='center', va='bottom')
plt.show()
counts = fight_df[fight_df['win/loss']=='D'].method.value_counts()
plt.figure(figsize=(12,8))
plt.bar(range(len(counts.index)), counts.values, alpha=.7)
plt.xlabel('Draw Type')
plt.ylabel('Frequency')
plt.xticks(range(len(counts.index)), counts.index)
for x, y in zip(range(len(counts.index)), counts.values):
    plt.text(x, y, y, ha='center', va='bottom')
plt.show()
counts = fight_df[fight_df['win/loss']=='NC'].method.value_counts()
plt.figure(figsize=(12,8))
plt.bar(range(len(counts.index)), counts.values, alpha=.7)
plt.xlabel('NC Type')
plt.ylabel('Frequency')
plt.xticks(range(len(counts.index)), counts.index)
for x, y in zip(range(len(counts.index)), counts.values):
    plt.text(x, y, y, ha='center', va='bottom')
plt.show()
Let's restrict our view to wins that result from one of the following methods: KO/TKO, Submission, Decision or Injury. These are situations where one fighter was able to prove their dominance over another through fighting skill. The remaining categories (No Contest, DQ and Overturned) are at best the result of random chance and not relevant to our interests. If we want to build a classifier that distinguishes between winners and losers, we should look at fights based on factors that are within the fighters' control.
fight_df = fight_df[(fight_df['win/loss'] == 'W') | (fight_df['win/loss'] == 'L')]
fight_df = fight_df[(fight_df.method != 'Overturned') & (fight_df.method != 'No Contest') & (fight_df.method != 'DQ')]
fight_df.method.value_counts()
Let's take a bird's eye view of the data. We'll use a heatmap to spot correlations and make quick inferences about the dataset before diving into the individual fight variables.
corrmat = fight_df.corr()
plt.figure(figsize=(16,12))
sns.heatmap(corrmat, square=True)
plt.show()
Wow, there appears to be a strong correlation between every strike and strike attempt. Wayne Gretzky was right: "You miss 100% of the shots you don't take." There are several other related variables scattered throughout the dataset. Significant attempts, head attempts and distance attempts are all closely related. Also of note is the relationship between the number of passes and the number of takedowns. Interestingly, the number of knockdowns does not seem to be strongly correlated with any particular variable, although there is a small connection between knockdowns, win/loss and head strikes.
There do not appear to be many negative correlations. The only small standouts are sub attempts to fight_id, knockdowns to takedown attempts, and sub attempts to distance strikes. The last two make sense to a certain extent: if a fighter is a good striker, they'd have little reason to take their opponent down, and if they're a submission artist they'd probably throw fewer distance strikes. The relationship between sub attempts and fight_id doesn't make as much sense to me right now. In the beginning, not many people were familiar with BJJ, so I would expect the number of submission attempts to grow over time. I'll have to take a closer look at that later.
Applying a log transform yields...
cols = [u'body', u'body_attempts', u'clinch', u'clinch_attempts', u'distance',
u'distance_attempts', u'ground', u'ground_attempts',
u'head', u'head_attempts', u'knockdowns', u'leg', u'leg_attempts',
u'pass', u'reversals', u'sig_attempts', u'sig_strikes',
u'strike_attempts', u'strikes', u'sub_attempts', u'takedowns',
u'td_attempts']
fight_logs = fight_df.copy()
fight_logs.loc[:, cols] = fight_df[cols] + 1
fight_logs.loc[:, cols] = fight_logs[cols].apply(np.log)
corrmat = fight_logs.drop(['fight_id', 'round'], axis=1).corr()
plt.figure(figsize=(16,12))
sns.heatmap(corrmat, square=True)
plt.show()
I was curious about how the number of strikes varied, depending on whether a fighter won or lost. On average, the winner outstruck the loser, but some margins were larger than others. For instance, the number of leg and body attacks were pretty evenly split, but the number of head strikes differed by about a 2:1 ratio. This may hold value for our classifiers later.
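The ratios quoted here come from comparing group means. A minimal sketch of that comparison on a toy frame (the column names mirror fight_df, but the numbers are made up):

```python
import pandas as pd

# Toy stand-in for fight_df: winner and loser rows for two fights.
toy = pd.DataFrame({
    'win/loss': ['W', 'L', 'W', 'L'],
    'head':     [40, 18, 36, 20],
    'leg':      [10,  9, 11, 10],
})

# Mean strikes per outcome, then the winner-to-loser ratio per column.
means = toy.groupby('win/loss').mean()
ratio = means.loc['W'] / means.loc['L']
print(ratio)  # head is 2:1 here; leg is nearly even
```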
# Convert to long form first
strike_targets = [
'body',
'body_attempts',
'head',
'head_attempts',
'leg',
'leg_attempts'
]
targets_df = fight_df.melt(
id_vars=['name', 'win/loss'],
value_vars=strike_targets)
#plt.figure(figsize=(12,8))
sns.factorplot(
x='variable',
y='value',
kind='bar',
hue='win/loss',
palette='Set1',
data=targets_df,
size=8
)
plt.ylabel('Average Value')
plt.xlabel('Strike Type')
plt.xticks(rotation=60, ha='right')
plt.show()
For strike types, the differences between the populations of fight winners and losers are even wider. For instance, the ratio of strikes landed between winners and losers is almost 2:1, but for ground strikes it's 5:1. This tells me that if a fighter can take the fight to the ground, they open up the opportunity to dominate from there.
# Convert to long form first
strike_types = [
'ground',
'ground_attempts',
'distance',
'distance_attempts',
'clinch',
'clinch_attempts',
'sig_strikes',
'sig_attempts',
'strikes',
'strike_attempts',
]
types_df = fight_df.melt(
id_vars=['name', 'win/loss'],
value_vars=strike_types)
#plt.figure(figsize=(12,8))
sns.factorplot(
x='variable',
y='value',
kind='bar',
hue='win/loss',
palette='Set1',
data=types_df,
size=8
)
plt.ylabel('Average Value')
plt.xlabel('Strike Type')
plt.xticks(rotation=60, ha='right')
plt.show()
Grappling statistics show a stark contrast between the winner and loser populations in every category except reversals. Winning fighters pass at a 3:1 ratio vs losing fighters, attempt twice as many submissions and take down their opponents twice as frequently. One possible explanation for the even split in reversals is the nature of the technique itself: if a fighter completed a reversal, they were probably in trouble in the first place, so it may not have been enough to overcome whatever else happened in the fight.
grappling_vars = [
'name',
'win/loss',
'sub_attempts',
'pass',
'reversals',
'takedowns'
]
grappling_df = fight_df.melt(
id_vars=['name', 'date', 'win/loss'],
value_vars=grappling_vars[2:])
sns.factorplot(
x='variable',
y='value',
kind='bar',
hue='win/loss',
palette='Set1',
data=grappling_df,
size=8
)
plt.title('Grappling')
plt.show()
Let's take a look at the distribution of submission attempts vs the method used. My theory is that submission attempts are higher in fights that end that way.
sns.factorplot(
x='method',
y='sub_attempts',
kind='bar',
palette='Set1',
data=fight_df,
size=8
)
plt.title('Submission Attempts vs Win Type')
plt.show()
As expected, fights that end in submission have a higher percentage of submission attempts greater than zero. Interestingly, fights that end in decision are second highest in the average number of submission attempts. This could be due to how fights are scored overall, since submission attempts do not earn a fighter as many points as strikes.
#plt.figure(figsize=(12,8))
sns.factorplot('knockdowns', hue='win/loss', palette='Set1', kind='count', data=fight_df, size=8)
plt.title('Knockdowns')
plt.show()
This was just a fun exercise to show how often different referees and fighters showed up in the database. The larger the name, the more prolific the person.
def custom_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 100%, {}%)".format(random.randint(25, 50))
# Extract referee text from fight_df
text = " ".join("_".join(ref.split()) for ref in fight_df[fight_df.referee.notnull()].referee.tolist())
# read the mask image
ufc_mask = np.array(Image.open("images/UFC_Logo.png"))
wc = WordCloud(background_color="white", mask=ufc_mask, random_state=42)
# generate word cloud
wc.generate(text)
wc.recolor(color_func=custom_color_func, random_state=3)
# store to file
wc.to_file("images/ufc_red_wordcloud.png")
# show
plt.figure(figsize=(12,8))
plt.imshow(wc, interpolation='bilinear')
plt.axis("off")
plt.show()
def custom_color_func(word, font_size, position, orientation, random_state=None, **kwargs):
    return "hsl(0, 100%, {}%)".format(random.randint(25, 50))
# Extract referee text from fight_df
text = " ".join("_".join(fighter.split()) for fighter in fight_df[fight_df.name.notnull()].name.tolist())
# read the mask image
fighter_mask = np.array(Image.open("images/shadowboxing.jpg"))
wc = WordCloud(max_words=1000, background_color="white", mask=fighter_mask, random_state=1)
# generate word cloud
wc.generate(text)
wc.recolor(color_func=custom_color_func, random_state=3)
# store to file
wc.to_file("images/ufc_fighter_red_wordcloud.png")
# show
plt.figure(figsize=(14,12))
plt.imshow(wc.recolor(color_func=custom_color_func, random_state=3), interpolation='bilinear')
plt.axis("off")
plt.show()
Now that we have an overview of the variables involved, let's take a look at some of the interactions.
We can see that as the number of strikes increases, the probability of winning increases as well up to a point. It appears that past 150 strikes, the win probability starts to taper off.
strike_wins = fight_df.loc[fight_df.outcome == 1, 'strikes'].value_counts()
strike_losses = fight_df.loc[fight_df.outcome == 0, 'strikes'].value_counts()
strike_overall = pd.concat([strike_wins, strike_losses], join="inner", axis=1)
strike_overall.columns = ['wins', 'losses']
strike_overall['win_perc'] = strike_overall.wins / (strike_overall.wins + strike_overall.losses)
plt.figure(figsize = (12,8))
plt.scatter(strike_overall.index, strike_overall.win_perc)
plt.xlabel('Strikes')
plt.ylabel('Win Probability')
plt.show()
How have fights changed over time? Do fighters throw more or fewer strikes? Have there been more submission attempts? Curious minds want to know! Let's dig in.
So far I've been using matplotlib and seaborn for the visualizations, but I decided to switch it up this time and throw some ggplot into the mix. I really like the ggplot syntax in R because it is clean and in my opinion less clunky than matplotlib. In addition, ggplot has smoothing and loess functions that make it a snap to adjust the signal to noise ratio in time series data.
On average, we can see that the number of strikes thrown per fight has risen steadily over time. The black line tracks each individual event, which could include several fights in the same year, while the blue line applies a smoothing function to cut down the noise.
(ggplot(aes(x='date', y='strikes'), data=fight_df.groupby('date')['strikes'].mean().reset_index())
+ geom_line()
+ stat_smooth(colour='blue', span=0.2)
+ scale_x_datetime(expand=[0,0], breaks=date_breaks('1 year'), labels=date_format('%Y'))
+ scale_y_continuous(expand=[0,0], breaks=range(0,90,10))
+ theme_classic()
+ theme(figure_size=(12, 8))
+ labs(title="Strikes Landed Per Fight")
)
(ggplot(aes(x='date', y='clinch'), data=fight_df.groupby('date')['clinch'].mean().reset_index())
+ geom_line()
+ stat_smooth(colour='blue', span=0.2)
+ scale_x_datetime(expand=[0,0], breaks=date_breaks('1 year'), labels=date_format('%Y'))
+ scale_y_continuous(expand=[0,0], breaks=range(0,16,2))
+ theme_classic()
+ theme(figure_size=(12, 8))
+ labs(title="Clinch Use Per Fight")
)
sub_wins = fight_df[(fight_df['win/loss']=='W') & (fight_df['method']=='Submission')]
sub_wins = sub_wins.groupby(sub_wins.date)['name'].count().reset_index()
sub_wins = sub_wins.rename(columns={'name':'submissions'})
(ggplot(aes(x='date', y='submissions'), data=sub_wins)
+ geom_line()
+ stat_smooth(colour='blue', span=0.2)
+ scale_x_datetime(expand=[0,0], breaks=date_breaks('1 year'), labels=date_format('%Y'))
+ scale_y_continuous(expand=[0,0], breaks=range(0,16,2))
+ theme_classic()
+ theme(figure_size=(12, 8))
+ labs(title="Submissions Per Fight")
)
(ggplot(aes(x='date', y='round'), data=fight_df.groupby(fight_df.date)['round'].mean().reset_index())
+ geom_line()
+ stat_smooth(colour='blue', span=0.1)
+ scale_x_datetime(expand=[0,0], breaks=date_breaks('1 year'), labels=date_format('%Y'))
+ theme_classic()
+ theme(figure_size=(12, 8))
+ labs(title="Rounds Per Fight")
)
sub_attempts = fight_df.groupby(fight_df.date)['sub_attempts'].mean().reset_index()
(ggplot(aes(x='date', y='sub_attempts'), data=sub_attempts)
+ geom_line()
+ stat_smooth(colour='blue', span=0.2)
+ scale_x_datetime(expand=[0,0], breaks=date_breaks('1 year'), labels=date_format('%Y'))
+ theme_classic()
+ theme(figure_size=(12, 8))
+ labs(title="Submission Attempts Per Fight")
)
The number of fights per year grew exponentially until it peaked in 2014 and leveled off slightly. The sharp dip from 2016 to 2017 is slightly misleading since this data was scraped in September of 2017.
plt.figure(figsize=(12,8))
plt.plot(fight_df.groupby(fight_df.date.dt.year)['name'].count()/2, color=colors[2])
plt.title('Fights Per Year Over Time')
plt.xlabel('Year')
plt.ylabel('Number of Fights')
plt.show()
fight_df.date.dt.day_name().value_counts()/2
This was one of the most interesting visualizations for me because it showed how different fighting strategies shifted over time. Submissions dominated in the beginning before being overtaken by knockouts from 1999-2008, which were then overtaken by judges' decisions.
method_counts = fight_df[fight_df['win/loss']=='W'].groupby([fight_df.date.dt.year,'method'], as_index=True)['name'].count()
method_counts = method_counts.reset_index()
method_counts.rename(columns={'name':'count'}, inplace=True)
methods_complete = method_counts.pivot(index='date',columns='method',values='count')
methods_complete.fillna(0,inplace=True)
methods_complete.reset_index(drop=False,inplace=True)
methods_complete = methods_complete.melt(id_vars=['date'], var_name='method')
methods_complete = methods_complete.rename(columns={'value':'count'})
methods_complete[['date','count']] = methods_complete[['date','count']].astype(int)
methods_complete.index.name=None
methods_complete.sort_values(['date','method'],inplace=True)
methods_complete.reset_index(drop=True,inplace=True)
plt.figure(figsize=(12,8))
sns.pointplot(
x='date',
y='count',
data=methods_complete,
hue='method',
ci=None
)
plt.ylabel('Frequency')
plt.xlabel('Year')
plt.show()
How do fighters compare? Who's undefeated? Who is the most accurate striker? Who has the most KO's? In the next section, we'll group fight data by fighter and take a closer look.
There were a surprising number of undefeated fighters in the dataset, so I narrowed the view to fighters with at least 5 fights who have fought within the last 2 years. Even with this restriction, there were 8 undefeated fighters.
# Restrict to fighters with at least 5 fights who have fought within the last 2 years
repeat_fighters = fight_df[(fight_df.groupby('name').distance.transform(len)>=5) & (fight_df.groupby('name')['date'].transform(max)>=datetime(2015, 1, 1))]
repeat_fighters = repeat_fighters.groupby('name')
win_percentages = repeat_fighters.agg({'outcome':'sum'})/repeat_fighters.agg({'outcome':'count'})
win_percentages = win_percentages.reset_index().sort_values(by=['outcome'])[::-1]
win_percentages.outcome *= 100
plt.figure(figsize=(12,8))
sns.barplot(y='name', x='outcome', data= win_percentages[:10], palette="Purples_d")
plt.title('Top 10 Win Percentages in the UFC')
plt.xlabel('Win Percentage')
plt.ylabel('Fighter')
plt.show()
recent_fighters = fight_df[(fight_df.groupby('name')['date'].transform(len)>=5)&(fight_df.groupby('name')['date'].transform(max)>=datetime(2015,1,1))]
strike_accuracy = recent_fighters.groupby('name')['strikes'].sum()/recent_fighters.groupby('name')['strike_attempts'].sum()
strike_accuracy = strike_accuracy.reset_index().dropna().sort_values(by=[0])[::-1]
strike_accuracy[0] *= 100
plt.figure(figsize=(12,8))
sns.barplot(y='name', x=0, data= strike_accuracy[:10], palette="Reds_d")
plt.title('Top 10 Most Accurate Strikers in the UFC')
plt.xlabel('Striking Accuracy')
plt.ylabel('Fighter')
plt.show()
# Restrict to fighters with at least 5 fights and who have fought since 2012
recent_fighters = fight_df[(fight_df.groupby('name')['date'].transform(len)>=5)&(fight_df.groupby('name')['date'].transform(max)>=datetime(2012,1,1))]
sig_strike_accuracy = recent_fighters.groupby('name')['sig_strikes'].sum()/recent_fighters.groupby('name')['sig_attempts'].sum()
sig_strike_accuracy = sig_strike_accuracy.reset_index().dropna().sort_values(by=[0])[::-1]
sig_strike_accuracy[0] *= 100
plt.figure(figsize=(12,8))
sns.barplot(y='name', x=0, data= sig_strike_accuracy[:10], palette="Reds_d")
plt.title('Top 10 Most Accurate Significant Strikers in the UFC')
plt.xlabel('Significant Striking Accuracy')
plt.ylabel('Fighter')
plt.show()
takedowns = fight_df.groupby('name').sum().takedowns.reset_index().sort_values(by=['takedowns'])[::-1]
plt.figure(figsize=(12,8))
sns.barplot(y='name', x='takedowns', data= takedowns[:10], palette="Greens_d")
plt.title('Most Takedowns in the UFC')
plt.xlabel('Takedowns')
plt.ylabel('Fighter')
plt.show()
# Restrict to fighters with at least 5 takedowns who have fought within the last 5 years
recent_fighters = fight_df[(fight_df.groupby('name').takedowns.transform(sum)>=5)&(fight_df.groupby('name')['date'].transform(max)>datetime(2012,1,1))]
takedown_accuracy = recent_fighters.groupby('name').sum().takedowns/recent_fighters.groupby('name').sum().td_attempts
takedown_accuracy = takedown_accuracy.reset_index().dropna().sort_values(by=[0])[::-1]
takedown_accuracy[0] *= 100
plt.figure(figsize=(12,8))
sns.barplot(y='name', x=0, data= takedown_accuracy[:10], palette="Greens_d")
plt.title('Most Accurate Takedown Artists in the UFC')
plt.xlabel('Takedown Accuracy')
plt.ylabel('Fighter')
plt.show()
win_type = fight_df[fight_df['win/loss']=='W'].groupby(['name','method']).outcome.count()
win_type = win_type.reset_index()
win_type = win_type.rename(columns={'outcome':'value'})
win_type = win_type.sort_values(by=['value'])[::-1]
for method in ['Submission', 'KO/TKO', 'Decision']:
    plt.figure(figsize=(12,8))
    sns.barplot(y='name', x='value', data=win_type[win_type.method==method][:10], palette='Blues_d')
    plt.title('Most {}s in UFC'.format(method))
    plt.show()
temp_df = fight_df.groupby('name')
streaks = []
for name in fight_df.name.unique():
    s = 0
    max_s = 0
    fighter = temp_df.get_group(name)
    # Track the longest run of consecutive wins for each fighter
    for fight in fighter.itertuples():
        if fight.outcome:
            s += 1
            max_s = max(max_s, s)
        else:
            s = 0
    streaks.append([name, max_s])
streaks_sorted = sorted(streaks, key=lambda x: x[1], reverse = True)
names = [streak[0] for streak in streaks_sorted[:10]]
streak_len = [streak[1] for streak in streaks_sorted[:10]]
plt.figure(figsize=(12,8))
sns.barplot(
y=names,
x=streak_len,
palette='Greens_d'
)
plt.title('Longest Win Streak')
plt.show()
The first step in data cleaning is to remove obvious outliers and drop columns that will not contribute to the model. We've already dropped no contests, DQs and draws. Let's check other areas.
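As a minimal sketch of that earlier filtering step (the actual draw/no-contest labels in the scraped data are assumptions here, as is the toy dataframe):

```python
import pandas as pd

# Toy stand-in for fight_df; the real labels for draws, no contests
# and DQs come from the scraper and may differ.
fight_df = pd.DataFrame({
    'name':     ['A', 'B', 'C', 'D', 'E', 'F'],
    'win/loss': ['W', 'L', 'D', 'D', 'NC', 'DQ'],
})

# Keep only clean wins and losses; ambiguous outcomes carry no signal
# for a binary win/loss model.
fight_df = fight_df[fight_df['win/loss'].isin(['W', 'L'])].reset_index(drop=True)

# Then check what's still missing before moving on
print(fight_df.isnull().sum())
```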
Now that the missing values have been taken care of, it's a good idea to check for logical errors. Any situation where strikes landed > strikes attempted, takedowns > takedown attempts, etc., should be looked at carefully. Let's run a short script to see if any of these situations are present.
a_cols = [
u'body_attempts', u'clinch_attempts', u'distance_attempts',
u'ground_attempts', u'head_attempts', u'leg_attempts', u'sig_attempts',
u'strike_attempts', u'td_attempts'
]
l_cols = [
u'body', u'clinch', u'distance', u'ground', u'head', u'leg', u'sig_strikes',
u'strikes', u'takedowns'
]
for attempts, landed in zip(a_cols, l_cols):
    # Rows where more techniques landed than were attempted are logical errors
    errors = fight_df[fight_df[attempts] < fight_df[landed]][[attempts, landed]]
    if len(errors) > 0:
        print(attempts)
        print(errors)
Looks good: there are no records where the number of techniques landed exceeds the number of techniques attempted.
What is it about each fighter that makes him a favorite? Is it the difference in strikes? In takedown percentage? The judging rules state that fighters are scored on effective striking, effective grappling, aggression and octagon control. We will use this to our advantage to identify patterns in fight outcomes.
We will add a column for accuracy and see how it affects the model.
a_cols = [
u'body_attempts', u'clinch_attempts', u'distance_attempts',
u'ground_attempts', u'head_attempts', u'leg_attempts', u'sig_attempts',
u'strike_attempts', u'td_attempts'
]
l_cols = [
u'body', u'clinch', u'distance', u'ground', u'head', u'leg', u'sig_strikes',
u'strikes', u'takedowns'
]
for attempt, landed in zip(a_cols, l_cols):
    accuracy_col = landed + '_accuracy'
    # Accuracy = techniques landed / techniques attempted for each category
    fight_df[accuracy_col] = fight_df[landed] / fight_df[attempt]
    fight_df[accuracy_col] = fight_df[accuracy_col].fillna(0)
Next, we will create a running tally of differences between each fighter in each fight. My hope is that by comparing the winners and losers of each individual fight, we might be able to determine which features, if any, can predict success in the UFC.
The following script sorts the dataframe by id and then by win/loss so that the fight loser is always presented first.
fight_df.sort_values(['fight_id', 'win/loss'], inplace=True)
fight_df.index = range(len(fight_df))
# Sanity check: this should be 0 if every even-indexed row is a loser
len(fight_df[(fight_df['win/loss']=='W') & (fight_df.index%2==0)])
The losing fighters are now presented first, which will make the process of separating winners and losers slightly easier.
This block adds accuracy features for each fighter. It creates a ratio of strikes attempted vs strikes landed for each applicable category.
cols = [u'body', u'body_attempts', u'clinch', u'clinch_attempts', u'distance',
u'distance_attempts', u'ground', u'ground_attempts',
u'head', u'head_attempts', u'knockdowns', u'leg', u'leg_attempts',
u'pass', u'reversals', u'sig_attempts', u'sig_strikes',
u'strike_attempts', u'strikes', u'sub_attempts', u'takedowns',
u'td_attempts']
acc_cols = [
u'body_accuracy', u'clinch_accuracy', u'distance_accuracy',
u'ground_accuracy', u'head_accuracy', u'leg_accuracy',
u'sig_strikes_accuracy', u'strikes_accuracy', u'takedowns_accuracy'
]
cols += acc_cols
scaler = MinMaxScaler()
scaler.fit(fight_df[cols])
fight_df[cols] = scaler.transform(fight_df[cols])
fighter_a = fight_df[fight_df.index%2==0]
fighter_b = fight_df[fight_df.index%2==1]
fighter_a.reset_index(inplace=True, drop=True)
fighter_b.reset_index(inplace=True, drop=True)
fight_loser_diff = fighter_a.copy()
fight_winner_diff = fighter_b.copy()
fight_loser_ratio = fighter_a.copy()
fight_winner_ratio = fighter_b.copy()
fight_loser_diff.loc[:, cols] = fighter_a[cols] - fighter_b[cols]
fight_winner_diff.loc[:, cols] = fighter_b[cols] - fighter_a[cols]
fight_diffs = pd.concat(
[fight_winner_diff,fight_loser_diff],
ignore_index=True
)
The following plots reveal clear separation between fight winners and losers across ground strikes, overall strikes and clinch strikes landed.
sns.lmplot(
'ground_attempts',
'ground',
data=fight_diffs[::-1],
hue='win/loss',
palette='Set1',
scatter_kws={'alpha':.3},
fit_reg=False,
size=8,
aspect=1.5
)
plt.show()
sns.lmplot(
'strike_attempts',
'strikes',
data=fight_diffs[::-1],
hue='win/loss',
palette='Set1',
scatter_kws={'alpha':.3},
fit_reg=False,
size=8,
aspect=1.2
)
plt.show()
sns.lmplot(
'clinch_attempts',
'clinch',
data=fight_diffs,
hue='win/loss',
palette='Set1',
scatter_kws={'alpha':.3},
fit_reg=False,
size=8,
aspect=1.2
)
plt.show()
The next step is to get a running tally of each fighter's stats and compare them to his or her opponent's. We'll start by taking the sum of each fighter's stats across all prior fights divided by the number of those fights, i.e. a running average.
fight_copy = fight_df.copy()
for i in fight_df.index:
    cur_name = fight_df.loc[i, 'name']
    prev_fights = fight_df.iloc[:i]
    # Average of this fighter's stats over all previous fights (NaN on debut)
    fight_copy.loc[i, cols] = np.mean(prev_fights.loc[prev_fights.name == cur_name, cols])
fight_copy = fight_copy.iloc[1:]
test_copy = fight_copy.copy()
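The loop above re-slices the dataframe on every row, which gets slow as the data grows. An equivalent running average can be sketched in one pass with `groupby` and an expanding mean (toy data below; the real `fight_df` has many more stat columns):

```python
import pandas as pd

# Toy data: two fighters ('A', 'B'), three fights each, one stat column
df = pd.DataFrame({
    'name':    ['A', 'B', 'A', 'B', 'A', 'B'],
    'strikes': [10,   20,  30,  40,  50,  60],
})

# expanding().mean() averages all fights up to and including the current
# one; shift() pushes it down a row so each fight only sees the fighter's
# *previous* fights. Debut fights get NaN, just like the loop version.
df['avg_strikes'] = (
    df.groupby('name')['strikes']
      .transform(lambda s: s.expanding().mean().shift())
)
```

For fighter A this yields NaN, 10, 20 across the three fights — the average of everything that came before.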
Now that we have a running tally, the next step is to exclude all fights where one of the fighters is making their debut.
for i in test_copy.fight_id.tolist():
    f1 = test_copy[(test_copy.fight_id == i) & (test_copy.outcome == 1)]
    f2 = test_copy[(test_copy.fight_id == i) & (test_copy.outcome == 0)]
    # A debut shows up as NaN in the running averages; drop the whole fight
    if sum(f1.isnull().any().values) or sum(f2.isnull().any().values):
        test_copy = test_copy[test_copy.fight_id != i]
test_copy
fight_tallies = test_copy.copy()
The next step is to take the difference between each fighter's running averages. Fortunately, we already have the code for that!
fight_tallies.reset_index(inplace=True, drop=True)
fighter_x = fight_tallies[fight_tallies.index%2==0]
fighter_y = fight_tallies[fight_tallies.index%2==1]
fighter_x.reset_index(inplace=True, drop=True)
fighter_y.reset_index(inplace=True, drop=True)
tally_loser_diff = fighter_x.copy()
tally_winner_diff = fighter_y.copy()
tally_loser_diff.loc[:, cols] = fighter_x[cols] - fighter_y[cols]
tally_winner_diff.loc[:, cols] = fighter_y[cols] - fighter_x[cols]
tally_diffs = pd.concat(
[tally_loser_diff, tally_winner_diff],
ignore_index=True
)
We're going to be methodical when building and testing our algorithms. It's easy to run in circles due to the plethora of algorithms and parameters available. Is a decision tree a better choice? Or should we try Random Forest? What about Support Vector Machines? The machines need our support! Rather than going through all these possibilities and forgetting the results, we'll build pipelines and document our process as we go along. By staying organized, it will be easier to tell if we've already tried a solution. It may also serve to highlight the desired path a bit (one can only hope).
Here's a list of models we have so far: Perceptron, Random Forest, Decision Tree, SGD Classifier, Support Vector Machine, Linear SVM, Gaussian Naive Bayes and K-Nearest Neighbors. We may add to this list, but it's a good starting place.
Of course, machine learning isn't just about finding that perfect algorithm. The results depend heavily on earlier steps in the process. One might even say that most of the accuracy depends on the exploratory analysis and feature engineering. That's why so much time was spent in that phase.
Now that we have a few algorithms to explore, we'll pay close attention to each one's hyperparameters. If the data is unbalanced, we'll need to account for that and if it's overfitting / underfitting, we may have to make a different set of changes. There's a lot to absorb here. Let's take our time and keep in mind that data analysis is not a linear process. Sometimes we have to go back and forth between modeling, wrangling, exploring and feature engineering. We can't be afraid to get our hands dirty!
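One habit that helps with this kind of iteration (a sketch, not the notebook's actual pipeline): score each candidate model with k-fold cross-validation rather than a single split, so an accuracy jump isn't just a lucky partition. The data below is synthetic.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for the engineered fight features
rng = np.random.RandomState(42)
X = rng.rand(200, 5)
y = (X[:, 0] > 0.5).astype(int)  # label driven entirely by the first feature

clf = RandomForestClassifier(n_estimators=100, random_state=42)
scores = cross_val_score(clf, X, y, cv=5)

# Mean and spread together tell us more than a single train/test number
print(scores.mean(), scores.std())
```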
numerical_diffs = fight_diffs.select_dtypes(include=[np.int64, 'float'])
numerical_diffs = numerical_diffs.drop(['round'], axis=1)
numerical_tally_diffs = tally_diffs.select_dtypes(include=[np.int64, 'float'])
numerical_tally_diffs = numerical_tally_diffs.drop(['round'], axis=1)
# Each helper below fits one model and stores its accuracy, precision,
# recall and confusion matrix in the shared Models dict
def percep(X_train, Y_train, X_test, Y_test, Models):
    perceptron = Perceptron(max_iter=1000, tol=0.001, random_state=42)
    perceptron.fit(X_train, Y_train)
    Y_pred = perceptron.predict(X_test)
    Models['Perceptron'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def ranfor(X_train, Y_train, X_test, Y_test, Models):
    randomfor = RandomForestClassifier(max_features="sqrt",
                                       n_estimators=700,
                                       max_depth=None,
                                       n_jobs=-1,
                                       random_state=42)
    randomfor.fit(X_train, Y_train)
    Y_pred = randomfor.predict(X_test)
    Models['Random Forests'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def dec_tree(X_train, Y_train, X_test, Y_test, Models):
    decision_tree = DecisionTreeClassifier(random_state=42)
    decision_tree.fit(X_train, Y_train)
    Y_pred = decision_tree.predict(X_test)
    Models['Decision Tree'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def SGDClass(X_train, Y_train, X_test, Y_test, Models):
    sgd = SGDClassifier(max_iter=1000, tol=0.001, penalty='l1', random_state=42)
    sgd.fit(X_train, Y_train)
    Y_pred = sgd.predict(X_test)
    Models['SGD Classifier'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def SVCClass(X_train, Y_train, X_test, Y_test, Models):
    svc = SVC(random_state=42)
    svc.fit(X_train, Y_train)
    Y_pred = svc.predict(X_test)
    Models['SVM'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def linSVC(X_train, Y_train, X_test, Y_test, Models):
    linear_svc = LinearSVC(random_state=42)
    linear_svc.fit(X_train, Y_train)
    Y_pred = linear_svc.predict(X_test)
    Models['Linear SVM'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def bayes(X_train, Y_train, X_test, Y_test, Models):
    gaussian = GaussianNB()
    gaussian.fit(X_train, Y_train)
    Y_pred = gaussian.predict(X_test)
    Models['Bayes'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
    return

def Nearest(X_train, Y_train, X_test, Y_test, Models):
    knn = KNeighborsClassifier(n_neighbors=3)
    knn.fit(X_train, Y_train)
    Y_pred = knn.predict(X_test)
    Models['KNN'] = [
        accuracy_score(Y_test, Y_pred),
        precision_score(Y_test, Y_pred),
        recall_score(Y_test, Y_pred),
        confusion_matrix(Y_test, Y_pred),
    ]
def run_all_and_Plot(df):
    Models = dict()
    from sklearn.model_selection import train_test_split
    X_all = df.drop(['outcome'], axis=1)
    y_all = df['outcome']
    X_train, X_test, Y_train, Y_test = train_test_split(X_all, y_all, test_size=0.2, random_state=42)
    percep(X_train, Y_train, X_test, Y_test, Models)
    ranfor(X_train, Y_train, X_test, Y_test, Models)
    dec_tree(X_train, Y_train, X_test, Y_test, Models)
    SGDClass(X_train, Y_train, X_test, Y_test, Models)
    SVCClass(X_train, Y_train, X_test, Y_test, Models)
    linSVC(X_train, Y_train, X_test, Y_test, Models)
    bayes(X_train, Y_train, X_test, Y_test, Models)
    Nearest(X_train, Y_train, X_test, Y_test, Models)
    return Models
def plot_bar(models, i):
    titles = ['Accuracy', 'Precision', 'Recall']
    labels = tuple(models.keys())
    y_pos = np.arange(len(labels))
    values = [models[n][i] for n in models]
    plt.bar(y_pos, values, align='center', alpha=0.5)
    plt.xticks(y_pos, labels, rotation='vertical')
    plt.ylabel(titles[i])
    plt.title('Model {} Scores'.format(titles[i]))
    plt.show()
def plot_cm(models):
    count = 1
    fig = plt.figure(figsize=(10,10))
    for model in models:
        cm = models[model][3]
        labels = ['W', 'L']
        ax = fig.add_subplot(4, 4, count)
        cax = ax.matshow(cm)
        plt.title(model, y=-0.8)
        fig.colorbar(cax)
        ax.set_xticklabels([''] + labels)
        ax.set_yticklabels([''] + labels)
        plt.xlabel('Predicted')
        plt.ylabel('True')
        count += 1
    plt.tight_layout()
    plt.show()
exclusions = [
    u'body_attempts',
    u'clinch_attempts',
    u'distance_attempts',
    u'head_attempts',
    u'ground_attempts',
    u'leg_attempts',
    u'sig_attempts',
    u'strike_attempts',
    u'td_attempts',
]
model_cols = [col for col in cols if col not in exclusions]
model_cols.append('outcome')
score_metrics = run_all_and_Plot(numerical_diffs[model_cols])
CompareAll = dict()
CompareAll['Baseline'] = score_metrics
for i in range(3):
    for key, val in score_metrics.items():
        print(str(key) + ' ' + str(val[i]))
    plot_bar(score_metrics, i)
plot_cm(score_metrics)
score_metrics = run_all_and_Plot(numerical_tally_diffs)
for i in range(3):
    for key, val in score_metrics.items():
        print(str(key) + ' ' + str(val[i]))
    plot_bar(score_metrics, i)
plot_cm(score_metrics)
This was an informative first step in understanding how different fight metrics affect the outcome of a UFC fight. The time series graphs helped me understand how fighting strategies have changed over the years. We can see that although BJJ practitioners had a significant advantage over other fighters early on, that gap eventually narrowed. The fighter rankings opened up different avenues for exploring what separates top fighters from those at the other end of the spectrum.
The next step is to improve the future fight prediction algorithm. Currently, our models can judge the winner of past fights with 85-88% accuracy, but they are not yet able to predict the winner of future matchups. This may require different feature engineering, or perhaps the incorporation of new features via a scraper rewrite. Currently, the data fed into the algorithms is the difference in previous average metrics between two fighters. The reasoning was that if a fighter outstruck or outgrappled his or her opponents in the past, that might tell us how he or she would perform in the future. Unfortunately, this was not the case, and the approach yielded a peak accuracy of only 58% on future matchups. While disappointing, I'm confident that this can be improved through iteration.
There is still a lot that can be explored in this dataset. It would be interesting to understand what goes into a win streak, or how to spot a bad matchup from a stylistic perspective. For now, however, I'm happy with the results. Initially, the model was able to judge the outcome with around 55-60% accuracy, but by testing different ideas and iterating on our process, we were able to achieve 88% accuracy. Progress will continue, but this feels like a good stopping point. Thanks for taking the time to read through this analysis.
The fightmetric scraper reached completion a couple days ago with added functionality. Initially, I scraped the fights in one large chunk, but this had several drawbacks: it took hours to get any results, there was no way to tell whether it was pulling data properly, and a crash meant losing all data (omg, losing all progress after a script has been running for 3 hours is maddening!). The new iteration runs the scraper in chunks in a separate jupyter notebook. Each script processes 200 fight urls and saves the results to its own CSV. I also enabled runtools in nbextensions to add the "run all cells ignore errors" option, which prevents the whole process from stopping if there's an error in one of the cells. This is a design principle I'm going to hold onto for future large projects because it prevents a lot of headaches.
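The chunking pattern can be sketched roughly like this (function and file names are hypothetical stand-ins for the real scraper code):

```python
import csv
import os

CHUNK_SIZE = 200  # fight urls per chunk / output file

def scrape_fight(url):
    # Placeholder for the real fightmetric page fetch + parse
    return {'url': url, 'winner': 'unknown'}

def scrape_in_chunks(fight_urls, out_dir='scraped'):
    # Each chunk lands in its own CSV, so a crash only loses the chunk
    # in progress, and finished chunks are skipped on rerun.
    os.makedirs(out_dir, exist_ok=True)
    for start in range(0, len(fight_urls), CHUNK_SIZE):
        out_path = os.path.join(out_dir, 'fights_{}.csv'.format(start))
        if os.path.exists(out_path):
            continue  # already scraped this chunk
        rows = [scrape_fight(u) for u in fight_urls[start:start + CHUNK_SIZE]]
        with open(out_path, 'w', newline='') as f:
            writer = csv.DictWriter(f, fieldnames=['url', 'winner'])
            writer.writeheader()
            writer.writerows(rows)
```

The separate CSVs can then be concatenated into a single dataframe once everything has been scraped.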
The following features have been added/updated: