Preamble
The sports industry has seen sizable gains over the last two decades, with deep commercialization fueling franchise growth, sponsorship, product-athlete deals, and media coverage.
Sports performance is increasingly measured, and analytics plays a role in all professional leagues, from roster selection and training to optimizing athlete performance.
Motivation
Sports betting is a domain where careful observation, expertise and analysis can be leveraged to inform decision making, much as in the financial industry, where time-series trajectories, metrics and market sentiment shed light on favorable investment opportunities.
With an abundance of performance statistics available, the question is whether sports-records data can inform betting strategies that deliver higher returns than naive or intuition-based approaches. In the context of Mixed Martial Arts (MMA), I hope to discover whether historical performance provides a reliable way to infer specific outcomes.
Sourcing Data
The primary data source is event outcomes for several thousand fights in various professional fighting leagues over the last 5 years. Detailed performance metrics (strikes attempted/landed, etc.) are not used in this initial protocol, but they may be added to supplement the tool in future iterations.
Data is scraped directly from web pages, as databases for this data are not typically curated by any body or organization.
Libraries such as rvest allow interaction with the HTML elements (and their attributes) on a given webpage.
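As a minimal offline sketch of that idea, the snippet below parses a small inline HTML string instead of a live page. The markup is a hypothetical stand-in for a fighter page, not the site's real structure; only the rvest calls themselves (read_html, html_nodes, html_table, html_attr) are the ones used throughout this post.

```r
library(rvest)

# A tiny inline page standing in for a real fighter page (hypothetical markup).
page <- read_html('
  <table>
    <tr><th>Result</th><th>Fighter</th></tr>
    <tr><td>win</td><td><a href="/fighter/Tony-Ferguson-31239">Tony Ferguson</a></td></tr>
    <tr><td>loss</td><td><a href="/fighter/Dustin-Poirier-50529">Dustin Poirier</a></td></tr>
  </table>')

# Element content: parse every table node, using the first row as the header.
results <- page %>% html_nodes("table") %>% html_table(header = TRUE)
results[[1]]

# Element attributes: the href of every link in the second column.
links <- page %>% html_nodes("td:nth-child(2) a") %>% html_attr("href")
links
```

Working against an inline string like this is a convenient way to try out a selector before pointing it at a real URL.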
1a. Initial properties of a page
# event results
library(rvest)
library(tidyverse)
library(glue)
url <- "https://www.sherdog.com/fighter/Justin-Gaethje-46648"
read_html(url) %>% html_nodes("table") %>% html_table()
## [[1]]
## X1 X2
## 1 Result Fighter
## 2 loss Khabib Nurmagomedov
## 3 win Tony Ferguson
## 4 win Donald Cerrone
## 5 win Edson Barboza
## 6 win James Vick
## 7 loss Dustin Poirier
## 8 loss Eddie Alvarez
## 9 win Michael Johnson
## 10 win Luiz Firmino
## 11 win Brian Foster
## 12 win Luis Palomino
## 13 win Luis Palomino
## 14 win Melvin Guillard
## 15 win Nick Newell
## 16 win Richard Patishnock
## 17 win Dan Lauzon
## 18 win Brian Cobb
## 19 win Gesias Cavalcante
## 20 win Adrian Valdez
## 21 win Drew Fickett
## 22 win Sam Young
## 23 win Marcus Edwards
## 24 win Donnie Bell
## 25 win Joe Kelso
## 26 win Kevin Croom
## X3
## 1 Event
## 2 UFC 254 - Nurmagomedov vs. GaethjeOct / 24 / 2020
## 3 UFC 249 - Ferguson vs. GaethjeMay / 09 / 2020
## 4 UFC Fight Night 158 - Cowboy vs. GaethjeSep / 14 / 2019
## 5 UFC on ESPN 2 - Barboza vs. GaethjeMar / 30 / 2019
## 6 UFC Fight Night 135 - Gaethje vs. VickAug / 25 / 2018
## 7 UFC on Fox 29 - Poirier vs. GaethjeApr / 14 / 2018
## 8 UFC 218 - Holloway vs. Aldo 2Dec / 02 / 2017
## 9 UFC - The Ultimate Fighter 25 FinaleJul / 07 / 2017
## 10 WSOF 34 - Gaethje vs. FirminoDec / 31 / 2016
## 11 WSOF 29 - Gaethje vs. FosterMar / 12 / 2016
## 12 WSOF 23 - Gaethje vs. Palomino 2Sep / 18 / 2015
## 13 WSOF 19 - Gaethje vs. PalominoMar / 28 / 2015
## 14 WSOF 15 - Branch vs. OkamiNov / 15 / 2014
## 15 WSOF 11 - Gaethje vs. NewellJul / 05 / 2014
## 16 WSOF 8 - Gaethje vs. PatishnockJan / 18 / 2014
## 17 WSOF 6 - Burkman vs. CarlOct / 26 / 2013
## 18 WSOF 3 - Fitch vs. Burkman 2Jun / 14 / 2013
## 19 WSOF 2 - Arlovski vs. JohnsonMar / 23 / 2013
## 20 RITC - Rage in the Cage 164Nov / 16 / 2012
## 21 RITC - Rage in the Cage 163Oct / 20 / 2012
## 22 RITC - Rage in the Cage 162Sep / 29 / 2012
## 23 ROF 43 - Bad BloodJun / 02 / 2012
## 24 ROF 42 - Who's NextDec / 17 / 2011
## 25 BTT MMA 2 - GenesisOct / 01 / 2011
## 26 ROF 41 - Bragging RightsAug / 20 / 2011
## X4 X5 X6
## 1 Method/Referee R Time
## 2 Technical Submission (Triangle Choke)Jason Herzog 2 1:34
## 3 TKO (Punch)Herb Dean 5 3:39
## 4 TKO (Punches)Jerin Valel 1 4:18
## 5 KO (Punch)Keith Peterson 1 2:30
## 6 KO (Punches)Kevin MacDonald 1 1:27
## 7 TKO (Punches)Herb Dean 4 0:33
## 8 KO (Knee)Herb Dean 3 3:59
## 9 TKO (Punches and Knees)John McCarthy 2 4:48
## 10 TKO (Doctor Stoppage)Dan Miragliotta 3 5:00
## 11 TKO (Leg Kicks)Tom Johnson 1 1:43
## 12 TKO (Punches)Ryan Brueggeman 2 4:30
## 13 TKO (Leg Kicks and Punches)Al Guinee 3 3:57
## 14 Decision (Split)Andrew Glenn 3 5:00
## 15 TKO (Punches)Telis Assimenios 2 3:09
## 16 TKO (Punches and Elbows)Troy Waugh 1 1:09
## 17 KO (Punches)Jorge Alonso 2 1:40
## 18 TKO (Leg Kicks)Steve Mazzagatti 3 2:19
## 19 TKO (Doctor Stoppage)Keith Peterson 1 2:27
## 20 TKO (Punches)N/A 2 0:19
## 21 KO (Punch)N/A 1 0:12
## 22 Submission (Rear-Naked Choke)N/A 2 1:58
## 23 Decision (Unanimous)Tom Johnson 3 5:00
## 24 TKO (Punches)Eric Heinz 2 2:57
## 25 TKO (Punches)Adam Martinez 1 4:32
## 26 KO (Slam)Tom Johnson 1 1:01
##
## [[2]]
## X1 X2 X3
## 1 Result Fighter Event
## 2 win Aaron Carter ROF 40 - BacklashApr / 16 / 2011
## 3 win Scott Cleve ROF 38 - AscensionJun / 05 / 2010
## 4 win Steve Hanna FTW - WarApr / 17 / 2010
## 5 win Kevin Gonzales ROF 35 - Summer BrawlAug / 01 / 2009
## 6 win Nick Rhoads RITC 127 - Rage in the Cage 127May / 16 / 2009
## 7 win Austin Greer VFC 27 - MayhemMay / 01 / 2009
## 8 win Ben DeAnda BUTS - Battle Under The Stars 1Aug / 02 / 2008
## X4 X5 X6
## 1 Method/Referee R Time
## 2 TKO (Knees)Tim Mills 3 1:48
## 3 Decision (Split)Eric Heinz 3 3:00
## 4 TKO (Punches)N/A 1 1:01
## 5 Submission (Armbar)Curtis Thrasher 1 1:20
## 6 Submission (Armbar)N/A 3 2:07
## 7 TKO (Punches)Jim Axtell 2 1:56
## 8 KO (Punch)Curtis Thrasher 1 0:27
##
## [[3]]
## X1 X2
## 1 NA {firstname} { "nickname" }{lastname}
##
## [[4]]
## X1 X2
## 1 NA {firstname} { "nickname" }{lastname}
##
## [[5]]
## X1 X2
## 1 NA {name} - {title}
Reading every node of type table returns something that approximates the desired result set.
1b. Defining specific elements of interest
Using a tool such as SelectorGadget, one can find the specific CSS selector for the HTML content of interest, in this case:
read_html(url) %>% html_nodes("body > div.container > div:nth-child(3) > div.col_left > section:nth-child(4) > div > div.content.table") %>%
html_children() %>% html_table(header = TRUE) %>%
as.data.frame() -> sample_results
head(sample_results)
## Result Fighter
## 1 loss Khabib Nurmagomedov
## 2 win Tony Ferguson
## 3 win Donald Cerrone
## 4 win Edson Barboza
## 5 win James Vick
## 6 loss Dustin Poirier
## Event
## 1 UFC 254 - Nurmagomedov vs. GaethjeOct / 24 / 2020
## 2 UFC 249 - Ferguson vs. GaethjeMay / 09 / 2020
## 3 UFC Fight Night 158 - Cowboy vs. GaethjeSep / 14 / 2019
## 4 UFC on ESPN 2 - Barboza vs. GaethjeMar / 30 / 2019
## 5 UFC Fight Night 135 - Gaethje vs. VickAug / 25 / 2018
## 6 UFC on Fox 29 - Poirier vs. GaethjeApr / 14 / 2018
## Method.Referee R Time
## 1 Technical Submission (Triangle Choke)Jason Herzog 2 1:34
## 2 TKO (Punch)Herb Dean 5 3:39
## 3 TKO (Punches)Jerin Valel 1 4:18
## 4 KO (Punch)Keith Peterson 1 2:30
## 5 KO (Punches)Kevin MacDonald 1 1:27
## 6 TKO (Punches)Herb Dean 4 0:33
2. Extract all fighters in a weight-class
The above table provides the results for a specific fighter (with a given web-page address). In order to build a database, we need all results from all fighters in this division. The initial (“seeded”) fighter table can be created by extracting the href attribute from the opponent links on this page.
seed_div <- read_html(url) %>% html_nodes("td:nth-child(2) a , .cnaccept") %>% html_attr(.,"href") %>% as.data.frame() %>% unique()
names(seed_div)[1]<-"fighterName"
rownames(seed_div)<-str_split(seed_div$fighterName,"/",simplify = TRUE)[,3]
head(seed_div)
## fighterName
## Khabib-Nurmagomedov-56035 /fighter/Khabib-Nurmagomedov-56035
## Tony-Ferguson-31239 /fighter/Tony-Ferguson-31239
## Donald-Cerrone-15105 /fighter/Donald-Cerrone-15105
## Edson-Barboza-46259 /fighter/Edson-Barboza-46259
## James-Vick-81956 /fighter/James-Vick-81956
## Dustin-Poirier-50529 /fighter/Dustin-Poirier-50529
Then we can use a function to iterate over the initial seed list and extract the fighters in each table.
# fighter crawler function
fighter_crawler <- function(url){
base_url <- "https://www.sherdog.com"
fighter_url <- glue(base_url,"{url}")
# return the opponent links on this fighter's page as a one-column data frame
read_html(fighter_url) %>% html_nodes("td:nth-child(2) a , .cnaccept") %>% html_attr("href") %>% as.data.frame()
}
seed <- seed_div %>% slice_head(n=8)
# iterate over a group of fighters
map_df(seed$fighterName,fighter_crawler) %>% unique %>% rowid_to_column() -> div_lightweights
names(div_lightweights)[2]<-"fighterName"
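When iterating over many pages, a single failed request would abort the whole map_df call. One possible safeguard, sketched below, is to wrap the crawler with purrr::possibly and pause between requests. The fake_crawler function and its paths are hypothetical stand-ins so the example runs offline, and the 0.1 second delay is an arbitrary assumption rather than a recommended rate.

```r
library(purrr)

# Hypothetical stand-in for a page fetcher; fails for one input to mimic a dead link.
fake_crawler <- function(path) {
  if (path == "/fighter/missing") stop("HTTP 404")
  data.frame(fighterName = paste0(path, "-opponent"))
}

# possibly() returns NULL instead of raising an error when the wrapped call fails.
safe_crawler <- possibly(fake_crawler, otherwise = NULL)

paths <- c("/fighter/a", "/fighter/missing", "/fighter/b")
crawled <- map(paths, function(p) {
  Sys.sleep(0.1)  # short pause between requests (arbitrary value)
  safe_crawler(p)
})

# Drop the failed pages and stack the remaining data frames.
result <- do.call(rbind, compact(crawled))
result
```

The same wrapper could be applied to fighter_crawler itself; failed pages would then simply be missing from the combined table.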
3. Extract all results in a weight-class
The same iteration principle can be applied to extract all the results from a division.
# results crawler function
url_crawler <- function(fighter){
base_url <- "https://www.sherdog.com"
full_url <- glue(base_url,"{fighter}")
read_html(full_url) %>%
html_nodes("body > div.container > div:nth-child(3) > div.col_left > section:nth-child(4) > div > div.content.table") %>% html_children() %>%
html_table(header = TRUE, fill=TRUE) %>%
as.data.frame()
}
# test on a sample
url_crawler("/fighter/Khabib-Nurmagomedov-56035") -> sample_results
# execute over all fighters
map_df(div_lightweights$fighterName, url_crawler,.id = "rowid") -> results_lightweights
The result set contains 4873 rows.
Data Preparation
The following steps clean the data (string manipulation) and derive some useful metrics.
library(stringr)
results_lightweights -> wc
Clean Column Names
div_lightweights %>% mutate(Opponent = str_trim(str_replace_all(str_remove_all(str_split(fighterName,"fighter/",simplify = TRUE)[,2],"[0-9]"),"-"," "))) -> div_lightweights
wc %>% select(rowid:Time) %>%
mutate(Method=str_c(str_split(Method.Referee,"[)]",simplify = TRUE)[,1],")")) %>%
select(-Method.Referee) %>% rename(Round = R) %>%
na.omit -> wc
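To make the string manipulation above easier to follow, here is the same logic applied to a few values taken from the scraped output. The variable names (method_referee, path) are illustrative only.

```r
library(stringr)

# "Method/Referee" cells fuse the method and the referee name; keep everything
# up to the first closing parenthesis and put the parenthesis back.
method_referee <- c("TKO (Punch)Herb Dean", "Decision (Split)Andrew Glenn", "KO (Punch)N/A")
method <- str_c(str_split(method_referee, "[)]", simplify = TRUE)[, 1], ")")
method

# Fighter links become readable names: drop the prefix, the numeric id and the hyphens.
path <- "/fighter/Khabib-Nurmagomedov-56035"
name <- str_trim(str_replace_all(
  str_remove_all(str_split(path, "fighter/", simplify = TRUE)[, 2], "[0-9]"),
  "-", " "))
name
```

Note that this approach assumes every method string contains a parenthesised detail; a value without one would still get a ")" appended, so it may be worth checking the distinct Method values after cleaning.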
Extract Dates of Events
wc %>% mutate(DateEvent= as.Date(str_replace_all(str_replace_all(str_sub(wc$Event,start = -15),"/","-")," ",""),format = "%b-%d-%Y")) -> wc
str_sub(wc$Event,start=-15) <-""
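The date extraction relies on the date always occupying the last 15 characters of the Event string (for example "Oct / 24 / 2020"). A small sketch on one value from the scraped output shows each step; it assumes an English locale, since %b matches abbreviated month names in the current locale.

```r
library(stringr)

event <- "UFC 254 - Nurmagomedov vs. GaethjeOct / 24 / 2020"

# Take the fixed-width date suffix and normalise it to "Oct-24-2020".
raw_date <- str_sub(event, start = -15)
clean <- str_replace_all(str_replace_all(raw_date, "/", "-"), " ", "")
date_event <- as.Date(clean, format = "%b-%d-%Y")
date_event

# Remove the date suffix so that only the event name remains.
str_sub(event, start = -15) <- ""
event
```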
Calculate Streaks - Prepare
wc[order(wc$DateEvent),] %>%
group_by(rowid) %>%
mutate(Outcome = case_when(Result=="win" ~ 1, TRUE ~0)) %>%
mutate(lagged = lag(Outcome)) %>% mutate(start=(Outcome != lagged)) -> wc
wc %>% mutate(start=replace_na(start,TRUE)) -> wc
wc %>% mutate(streak_id = cumsum(start)) ->wc
Calculate Streak Length and Arrange
wc %>% group_by(rowid,streak_id) %>%
mutate(streak=row_number()) %>%
ungroup() %>%
arrange(.,rowid,DateEvent) %>%
rename(PreviousOutcome=lagged) -> wc
Previous Result
wc %>% group_by(rowid) %>% mutate(StreakLength = lag(streak)) %>%
mutate(StreakLength = case_when(PreviousOutcome == 0 ~ -StreakLength, TRUE ~ StreakLength)) -> wc
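The streak logic can be traced on a toy record for a single fighter. The sketch below condenses the steps above into one pipeline (so there is no grouping by rowid) and uses a made-up sequence of outcomes; it is meant as an illustration of the idea rather than a replacement for the code above.

```r
library(dplyr)
library(tidyr)

# Made-up outcomes in chronological order: win, win, loss, win, win, win.
toy <- tibble(Outcome = c(1, 1, 0, 1, 1, 1))

toy_streaks <- toy %>%
  mutate(lagged = lag(Outcome),
         # a new streak starts whenever the outcome differs from the previous fight
         start = replace_na(Outcome != lagged, TRUE),
         streak_id = cumsum(start)) %>%
  group_by(streak_id) %>%
  mutate(streak = row_number()) %>%
  ungroup() %>%
  # streak going INTO each fight, negative when the previous fight was a loss
  mutate(StreakLength = lag(streak),
         StreakLength = case_when(lagged == 0 ~ -StreakLength, TRUE ~ StreakLength))

toy_streaks$StreakLength
```

The fourth fight gets a value of -1 because the fighter entered it on a one-fight losing streak, while the first fight has no history and stays NA.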
Win Ratio
wc %>% ungroup() %>% group_by(rowid) %>%
mutate(WinRatio = lag(cumsum(Outcome))/(row_number()-1)) %>%
mutate(CountFights =row_number()-1) ->wc
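The lag in the win ratio matters: it ensures that each row only reflects fights that happened before it, so the outcome being predicted does not leak into its own feature. A small sketch on made-up outcomes for one fighter:

```r
library(dplyr)

# Made-up outcomes in chronological order: win, loss, win, win.
toy <- tibble(Outcome = c(1, 0, 1, 1)) %>%
  mutate(CountFights = row_number() - 1,                  # fights before this one
         WinRatio = lag(cumsum(Outcome)) / CountFights)   # wins before this one / fights before

toy$WinRatio
```

The first value is NA because there is no prior record; the remaining values are 1/1, 1/2 and 2/3.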
Finish Ratio and Finish Previous
wc %>% mutate(FinishRatio = lag(!str_detect(Method,"Decision") & Outcome == 1)) %>%
mutate(FinishRatio=replace_na(FinishRatio,FALSE)) %>%
mutate(FinishRatio = cumsum(FinishRatio)/(CountFights)) %>%
mutate(FinishPrevious = case_when(!str_detect(Method,"Decision") & Outcome == 1 ~TRUE, TRUE ~ FALSE)) %>%
mutate(FinishPrevious=lag(FinishPrevious)) -> wc
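As a check on the finish ratio, the sketch below replays the first six fights from the scraped record shown earlier (all wins). It condenses the steps above into a single mutate for one fighter; the helper column finish is introduced here only for illustration.

```r
library(dplyr)
library(stringr)
library(tidyr)

toy <- tibble(
  Method = c("KO (Slam)", "TKO (Punches)", "TKO (Punches)",
             "Decision (Unanimous)", "Submission (Rear-Naked Choke)", "KO (Punch)"),
  Outcome = 1
) %>%
  mutate(CountFights = row_number() - 1,
         # a finish is a win that did not go to a decision
         finish = !str_detect(Method, "Decision") & Outcome == 1,
         # share of earlier fights that were finishes (first row is 0/0, so undefined)
         FinishRatio = cumsum(replace_na(lag(finish), FALSE)) / CountFights,
         FinishPrevious = lag(finish))

toy %>% select(Method, CountFights, FinishRatio, FinishPrevious)
```

After the decision win in the fourth fight, the ratio going into the fifth fight drops to 3/4, and FinishPrevious for that row is FALSE.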
Finished Ratio and Finished Previous
wc %>% mutate(FinishedRatio = lag(!str_detect(Method,"Decision") & Outcome == 0)) %>% mutate(FinishedRatio = replace_na(FinishedRatio,FALSE)) %>%
mutate(FinishedRatio = cumsum(FinishedRatio)/(CountFights)) %>%
mutate(FinishedPrevious = case_when(!str_detect(Method,"Decision") & Outcome == 0 ~TRUE, TRUE ~ FALSE)) %>%
mutate(FinishedPrevious = lag(FinishedPrevious)) -> wc
Processing Summary and Next Steps
Some minor adjustments to the data structure (a Fighter-Fighter Event Matrix) need to be made before a model can be built for prediction. These, along with the model workflow, will be covered in the next post.
library(formattable)
formattable::format_table(head(cleaned_data))
X | rowid | Result | Fighter | Event | Round | Time | Method | DateEvent | Outcome | PreviousOutcome | start | streak_id | streak | StreakLength | WinRatio | CountFights | FinishRatio | FinishPrevious | FinishedRatio | FinishedPrevious | FinishedCount | FinishCount | DamageDiff |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
1 | 1 | win | Kevin Croom | ROF 41 - Bragging Rights | 1 | 1:01 | KO (Slam) | 2011-08-20 | 1 | NA | TRUE | 1 | 1 | NA | NA | 0 | NA | NA | NA | NA | 0 | 0 | 0 |
2 | 1 | win | Joe Kelso | BTT MMA 2 - Genesis | 1 | 4:32 | TKO (Punches) | 2011-10-01 | 1 | 1 | FALSE | 1 | 2 | 1 | 1 | 1 | 1.00 | TRUE | 0 | FALSE | 0 | 1 | 1 |
3 | 1 | win | Donnie Bell | ROF 42 - Who’s Next | 2 | 2:57 | TKO (Punches) | 2011-12-17 | 1 | 1 | FALSE | 1 | 3 | 2 | 1 | 2 | 1.00 | TRUE | 0 | FALSE | 0 | 2 | 2 |
4 | 1 | win | Marcus Edwards | ROF 43 - Bad Blood | 3 | 5:00 | Decision (Unanimous) | 2012-06-02 | 1 | 1 | FALSE | 1 | 4 | 3 | 1 | 3 | 1.00 | TRUE | 0 | FALSE | 0 | 3 | 3 |
5 | 1 | win | Sam Young | RITC - Rage in the Cage 162 | 2 | 1:58 | Submission (Rear-Naked Choke) | 2012-09-29 | 1 | 1 | FALSE | 1 | 5 | 4 | 1 | 4 | 0.75 | FALSE | 0 | FALSE | 0 | 3 | 3 |
6 | 1 | win | Drew Fickett | RITC - Rage in the Cage 163 | 1 | 0:12 | KO (Punch) | 2012-10-20 | 1 | 1 | FALSE | 1 | 6 | 5 | 1 | 5 | 0.80 | TRUE | 0 | FALSE | 0 | 4 | 4 |