Bootstrapping a Sports-Betting Assistant : Part I

Dan Gray

2021/01/02

Preamble

The sports industry has seen sizable gains over the last two decades, with deep commercialization fueling franchise growth, sponsorship, product-athlete deals, and media coverage.

Sports performance is increasingly measured, and analytics plays a role in all professional leagues, from roster selection and training to optimizing athlete performance.

Motivation

Sports betting is a domain where careful observation, expertise and analysis can inform better decisions, much as in the financial industry, where time-series trajectories, metrics and market sentiment shed light on favorable investment opportunities.

With an abundance of performance statistics available, the question is whether sports-records data can inform betting strategies that deliver higher returns than naive or intuition-based approaches. In the context of Mixed Martial Arts (MMA), I hope to discover whether historical performance provides a reliable way to infer specific outcomes.

Sourcing Data

The primary data source is event outcomes for several thousand fights in various professional fighting leagues over the last 5 years. Detailed performance metrics (strikes attempted/landed etc.) are not used in this initial protocol, but they may be added to supplement the tool in future iterations.

Data is scraped directly from web pages, because databases of these results are not typically curated by any body or organization.

Libraries such as rvest allow interaction with the HTML elements (and their attributes) on a given webpage.
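As a minimal sketch of that interaction, the snippet below parses a small hand-written HTML fragment (an assumption for illustration, not Sherdog's actual markup) and pulls out the link text and the href attribute of each link.

```r
library(rvest)

# Hypothetical HTML fragment, written by hand for illustration only
page <- read_html('
  <html><body>
    <table>
      <tr><td>win</td><td><a href="/fighter/Tony-Ferguson-31239">Tony Ferguson</a></td></tr>
      <tr><td>loss</td><td><a href="/fighter/Dustin-Poirier-50529">Dustin Poirier</a></td></tr>
    </table>
  </body></html>')

# Select every link that sits in the second cell of a table row
links <- html_nodes(page, "td:nth-child(2) a")

link_text <- html_text(links)          # visible text of each element
link_href <- html_attr(links, "href")  # value of the href attribute

print(link_text)
print(link_href)
```

The same two calls, html_text() for content and html_attr() for attributes, cover most of what the scraper in this post needs.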

1a. Initial properties of a page

# event results
library(rvest)
library(tidyverse)
library(glue)

url <- "https://www.sherdog.com/fighter/Justin-Gaethje-46648"
read_html(url) %>% html_nodes("table") %>% html_table() 
## [[1]]
##        X1                  X2
## 1  Result             Fighter
## 2    loss Khabib Nurmagomedov
## 3     win       Tony Ferguson
## 4     win      Donald Cerrone
## 5     win       Edson Barboza
## 6     win          James Vick
## 7    loss      Dustin Poirier
## 8    loss       Eddie Alvarez
## 9     win     Michael Johnson
## 10    win        Luiz Firmino
## 11    win        Brian Foster
## 12    win       Luis Palomino
## 13    win       Luis Palomino
## 14    win     Melvin Guillard
## 15    win         Nick Newell
## 16    win  Richard Patishnock
## 17    win          Dan Lauzon
## 18    win          Brian Cobb
## 19    win   Gesias Cavalcante
## 20    win       Adrian Valdez
## 21    win        Drew Fickett
## 22    win           Sam Young
## 23    win      Marcus Edwards
## 24    win         Donnie Bell
## 25    win           Joe Kelso
## 26    win         Kevin Croom
##                                                         X3
## 1                                                    Event
## 2        UFC 254 - Nurmagomedov vs. GaethjeOct / 24 / 2020
## 3            UFC 249 - Ferguson vs. GaethjeMay / 09 / 2020
## 4  UFC Fight Night 158 - Cowboy vs. GaethjeSep / 14 / 2019
## 5       UFC on ESPN 2 - Barboza vs. GaethjeMar / 30 / 2019
## 6    UFC Fight Night 135 - Gaethje vs. VickAug / 25 / 2018
## 7       UFC on Fox 29 - Poirier vs. GaethjeApr / 14 / 2018
## 8             UFC 218 - Holloway vs. Aldo 2Dec / 02 / 2017
## 9      UFC - The Ultimate Fighter 25 FinaleJul / 07 / 2017
## 10            WSOF 34 - Gaethje vs. FirminoDec / 31 / 2016
## 11             WSOF 29 - Gaethje vs. FosterMar / 12 / 2016
## 12         WSOF 23 - Gaethje vs. Palomino 2Sep / 18 / 2015
## 13           WSOF 19 - Gaethje vs. PalominoMar / 28 / 2015
## 14               WSOF 15 - Branch vs. OkamiNov / 15 / 2014
## 15             WSOF 11 - Gaethje vs. NewellJul / 05 / 2014
## 16          WSOF 8 - Gaethje vs. PatishnockJan / 18 / 2014
## 17                WSOF 6 - Burkman vs. CarlOct / 26 / 2013
## 18             WSOF 3 - Fitch vs. Burkman 2Jun / 14 / 2013
## 19            WSOF 2 - Arlovski vs. JohnsonMar / 23 / 2013
## 20              RITC - Rage in the Cage 164Nov / 16 / 2012
## 21             RITC  - Rage in the Cage 163Oct / 20 / 2012
## 22              RITC - Rage in the Cage 162Sep / 29 / 2012
## 23                       ROF 43 - Bad BloodJun / 02 / 2012
## 24                      ROF 42 - Who's NextDec / 17 / 2011
## 25                      BTT MMA 2 - GenesisOct / 01 / 2011
## 26                 ROF 41 - Bragging RightsAug / 20 / 2011
##                                                   X4 X5   X6
## 1                                     Method/Referee  R Time
## 2  Technical Submission (Triangle Choke)Jason Herzog  2 1:34
## 3                               TKO (Punch)Herb Dean  5 3:39
## 4                           TKO (Punches)Jerin Valel  1 4:18
## 5                           KO (Punch)Keith Peterson  1 2:30
## 6                        KO (Punches)Kevin MacDonald  1 1:27
## 7                             TKO (Punches)Herb Dean  4 0:33
## 8                                 KO (Knee)Herb Dean  3 3:59
## 9               TKO (Punches and Knees)John McCarthy  2 4:48
## 10              TKO (Doctor Stoppage)Dan Miragliotta  3 5:00
## 11                        TKO (Leg Kicks)Tom Johnson  1 1:43
## 12                      TKO (Punches)Ryan Brueggeman  2 4:30
## 13              TKO (Leg Kicks and Punches)Al Guinee  3 3:57
## 14                      Decision (Split)Andrew Glenn  3 5:00
## 15                     TKO (Punches)Telis Assimenios  2 3:09
## 16                TKO (Punches and Elbows)Troy Waugh  1 1:09
## 17                          KO (Punches)Jorge Alonso  2 1:40
## 18                   TKO (Leg Kicks)Steve Mazzagatti  3 2:19
## 19               TKO (Doctor Stoppage)Keith Peterson  1 2:27
## 20                                  TKO (Punches)N/A  2 0:19
## 21                                     KO (Punch)N/A  1 0:12
## 22                  Submission (Rear-Naked Choke)N/A  2 1:58
## 23                   Decision (Unanimous)Tom Johnson  3 5:00
## 24                           TKO (Punches)Eric Heinz  2 2:57
## 25                        TKO (Punches)Adam Martinez  1 4:32
## 26                              KO (Slam)Tom Johnson  1 1:01
## 
## [[2]]
##       X1             X2                                             X3
## 1 Result        Fighter                                          Event
## 2    win   Aaron Carter               ROF 40 - BacklashApr / 16 / 2011
## 3    win    Scott Cleve              ROF 38 - AscensionJun / 05 / 2010
## 4    win    Steve Hanna                       FTW - WarApr / 17 / 2010
## 5    win Kevin Gonzales           ROF 35 - Summer BrawlAug / 01 / 2009
## 6    win    Nick Rhoads RITC 127 - Rage in the Cage 127May / 16 / 2009
## 7    win   Austin Greer                 VFC 27 - MayhemMay / 01 / 2009
## 8    win     Ben DeAnda BUTS - Battle Under The Stars 1Aug / 02 / 2008
##                                   X4 X5   X6
## 1                     Method/Referee  R Time
## 2               TKO (Knees)Tim Mills  3 1:48
## 3         Decision (Split)Eric Heinz  3 3:00
## 4                   TKO (Punches)N/A  1 1:01
## 5 Submission (Armbar)Curtis Thrasher  1 1:20
## 6             Submission (Armbar)N/A  3 2:07
## 7            TKO (Punches)Jim Axtell  2 1:56
## 8          KO (Punch)Curtis Thrasher  1 0:27
## 
## [[3]]
##   X1                                   X2
## 1 NA {firstname} { "nickname" }{lastname}
## 
## [[4]]
##   X1                                   X2
## 1 NA {firstname} { "nickname" }{lastname}
## 
## [[5]]
##   X1               X2
## 1 NA {name} - {title}

Calling html_table() on every node of type table returns a list of tables. This approximates the desired result set, but it also includes unrelated template tables and leaves the header as an ordinary data row.

1b. Defining specific elements of interest

Using a tool such as SelectorGadget, one can find the specific CSS selector for the HTML content of interest, in this case:

read_html(url) %>% html_nodes("body > div.container > div:nth-child(3) > div.col_left > section:nth-child(4) > div > div.content.table") %>% 
  html_children() %>% html_table(header = TRUE) %>% 
  as.data.frame() -> sample_results

head(sample_results)
##   Result             Fighter
## 1   loss Khabib Nurmagomedov
## 2    win       Tony Ferguson
## 3    win      Donald Cerrone
## 4    win       Edson Barboza
## 5    win          James Vick
## 6   loss      Dustin Poirier
##                                                     Event
## 1       UFC 254 - Nurmagomedov vs. GaethjeOct / 24 / 2020
## 2           UFC 249 - Ferguson vs. GaethjeMay / 09 / 2020
## 3 UFC Fight Night 158 - Cowboy vs. GaethjeSep / 14 / 2019
## 4      UFC on ESPN 2 - Barboza vs. GaethjeMar / 30 / 2019
## 5   UFC Fight Night 135 - Gaethje vs. VickAug / 25 / 2018
## 6      UFC on Fox 29 - Poirier vs. GaethjeApr / 14 / 2018
##                                      Method.Referee R Time
## 1 Technical Submission (Triangle Choke)Jason Herzog 2 1:34
## 2                              TKO (Punch)Herb Dean 5 3:39
## 3                          TKO (Punches)Jerin Valel 1 4:18
## 4                          KO (Punch)Keith Peterson 1 2:30
## 5                       KO (Punches)Kevin MacDonald 1 1:27
## 6                            TKO (Punches)Herb Dean 4 0:33
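The effect of header = TRUE can be seen on a toy table. This is a sketch on hand-written HTML (hypothetical markup, assumed to mimic a table whose header is an ordinary row of td cells, as the X1, X2 column names in 1a suggest).

```r
library(rvest)

# Hypothetical two-column table written by hand; the header is an ordinary
# row of <td> cells, so it is not detected as a header automatically
page <- read_html('
  <table>
    <tr><td>Result</td><td>Fighter</td></tr>
    <tr><td>loss</td><td>Khabib Nurmagomedov</td></tr>
    <tr><td>win</td><td>Tony Ferguson</td></tr>
  </table>')

tbl <- html_node(page, "table")

raw_tbl   <- html_table(tbl)                 # header row kept as data
named_tbl <- html_table(tbl, header = TRUE)  # first row promoted to column names

print(names(raw_tbl))
print(names(named_tbl))
```

With the header promoted, the later steps can refer to columns by name (Result, Fighter, Event) rather than by position.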

2. Extract all fighters in a weight-class

The above table provides the results for a specific fighter (with a given web-page address). To build a database, we need the results of all fighters in this division. The initial (“seeded”) fighter table can be created by extracting the href attribute from the opponent links on this page.

seed_div <- read_html(url) %>% html_nodes("td:nth-child(2) a , .cnaccept") %>% html_attr(.,"href") %>% as.data.frame() %>% unique()
names(seed_div)[1]<-"fighterName"
rownames(seed_div)<-str_split(seed_div$fighterName,"/",simplify = TRUE)[,3]

head(seed_div)
##                                                  fighterName
## Khabib-Nurmagomedov-56035 /fighter/Khabib-Nurmagomedov-56035
## Tony-Ferguson-31239             /fighter/Tony-Ferguson-31239
## Donald-Cerrone-15105           /fighter/Donald-Cerrone-15105
## Edson-Barboza-46259             /fighter/Edson-Barboza-46259
## James-Vick-81956                   /fighter/James-Vick-81956
## Dustin-Poirier-50529           /fighter/Dustin-Poirier-50529
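The row names above come from the third piece of each href. A small sketch on two of those links shows why the index is 3, and one possible way (a variation of mine, not the cleaning step used later in this post) to turn a slug into a readable name.

```r
library(stringr)

href <- c("/fighter/Khabib-Nurmagomedov-56035", "/fighter/Tony-Ferguson-31239")

# Splitting on "/" yields three pieces: "" (before the leading slash),
# "fighter", and the slug, so the slug sits in column 3
parts <- str_split(href, "/", simplify = TRUE)
slug <- parts[, 3]

# Readable name: drop the trailing numeric id, then swap hyphens for spaces
fighter_name <- str_replace_all(str_remove(slug, "-[0-9]+$"), "-", " ")

print(slug)
print(fighter_name)
```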

Then we can use a function to iterate over the initial seed list and extract the fighters in each table.

# fighter crawler function: returns the opponent links found on one fighter page
fighter_crawler <- function(url){
    base_url <- "https://www.sherdog.com"
    fighter_url <- glue(base_url,"{url}")
    
    # return the data frame of hrefs directly (an explicit return value is
    # clearer than relying on the invisible result of assign())
    read_html(fighter_url) %>% html_nodes("td:nth-child(2) a , .cnaccept") %>% html_attr("href") %>% as.data.frame()
}

seed <- seed_div %>% slice_head(n=8)

# iterate over a group of fighters
map_df(seed$fighterName,fighter_crawler) %>% unique %>% rowid_to_column() -> div_lightweights 
names(div_lightweights)[2]<-"fighterName"
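When the crawl covers many pages, a single failed request would abort the whole map_df() call. The sketch below shows one way to guard against that with purrr::possibly() and a short pause between requests. Here fetch_links() is a hypothetical stand-in for fighter_crawler() so the example runs offline, and the pause length is an arbitrary choice.

```r
library(purrr)
library(dplyr)

# Hypothetical stand-in for fighter_crawler(): fails for one input so the
# error handling can be shown without touching the network
fetch_links <- function(path) {
  if (path == "/fighter/broken-link") stop("page not found")
  tibble(fighterName = paste0(path, "-opponent"))
}

# possibly() returns a fallback value (here an empty tibble) instead of an error
safe_fetch <- possibly(fetch_links, otherwise = tibble(fighterName = character()))

polite_fetch <- function(path, pause = 0.1) {
  Sys.sleep(pause)  # brief pause between requests
  safe_fetch(path)
}

paths <- c("/fighter/a-1", "/fighter/broken-link", "/fighter/b-2")
crawled <- map(paths, polite_fetch) %>% bind_rows() %>% distinct()
print(crawled)
```

The failing path simply contributes zero rows, so the remaining results are kept.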

3. Extract all results in a weight-class

The same iteration principle can be applied to extract all the results from a division.

# results crawler function
url_crawler <- function(fighter){
  base_url <- "https://www.sherdog.com"
  
  full_url <- glue(base_url,"{fighter}")
  
  read_html(full_url) %>% 
    html_nodes("body > div.container > div:nth-child(3) > div.col_left > section:nth-child(4) > div > div.content.table") %>% html_children() %>% 
    html_table(header = TRUE, fill=TRUE) %>% 
    as.data.frame()
}

# test on a sample
url_crawler("/fighter/Khabib-Nurmagomedov-56035") -> sample_results


# execute over all fighters
map_df(div_lightweights$fighterName, url_crawler,.id = "rowid") -> results_lightweights

The result set contains 4873 rows.

Data Preparation

The following steps need to be carried out to clean the data (string manipulation), and derive some useful metrics.

library(stringr)

results_lightweights -> wc

Clean Column Names

div_lightweights %>%  mutate(Opponent = str_trim(str_replace_all(str_remove_all(str_split(fighterName,"fighter/",simplify = TRUE)[,2],"[0-9]"),"-"," "))) -> div_lightweights

wc %>% select(rowid:Time) %>% 
  mutate(Method=str_c(str_split(Method.Referee,"[)]",simplify = TRUE)[,1],")")) %>% 
  select(-Method.Referee) %>% rename(Round = R) %>% 
  na.omit -> wc
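The Method.Referee column fuses two values, for example "TKO (Punch)Herb Dean". The sketch below replays the split on three values taken from the table in 1a. Recovering the referee as well is my own addition and assumes that everything after the first ")" is the referee name.

```r
library(stringr)

method_referee <- c("TKO (Punch)Herb Dean", "Decision (Split)Andrew Glenn", "KO (Punch)N/A")

pieces <- str_split(method_referee, "[)]", simplify = TRUE)

# Same idea as the pipeline above: keep everything up to the first ")"
method <- str_c(pieces[, 1], ")")

# Assumption: whatever follows the first ")" is the referee name
referee <- str_trim(pieces[, 2])
referee[referee == "N/A"] <- NA

print(method)
print(referee)
```

A method description containing more than one closing parenthesis would need a different rule, so this is only a sketch for the patterns seen here.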

Extract Dates of Events

wc %>% mutate(DateEvent= as.Date(str_replace_all(str_replace_all(str_sub(wc$Event,start = -15),"/","-")," ",""),format = "%b-%d-%Y")) -> wc

str_sub(wc$Event,start=-15) <-""
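The date extraction relies on the date always occupying the last 15 characters of the Event string ("Oct / 24 / 2020"). Note that as.Date() with "%b" depends on the session locale for month abbreviations. The sketch below is a locale-independent variant (an alternative of mine, not the code used above) that looks up the month in R's built-in month.abb vector.

```r
library(stringr)

event <- c("UFC 254 - Nurmagomedov vs. GaethjeOct / 24 / 2020",
           "ROF 42 - Who's NextDec / 17 / 2011")

# The date occupies the last 15 characters, e.g. "Oct / 24 / 2020"
date_raw <- str_sub(event, start = -15)
pieces <- str_split(date_raw, " / ", simplify = TRUE)   # month, day, year

# month.abb holds the English month abbreviations, so match() avoids
# relying on the session locale
month_num <- match(pieces[, 1], month.abb)
date_event <- as.Date(sprintf("%s-%02d-%s", pieces[, 3], month_num, pieces[, 2]))

# Strip the date from the event name
event_name <- str_sub(event, end = -16)

print(date_event)
print(event_name)
```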

Calculate Streaks - Prepare

wc[order(wc$DateEvent),] %>% 
  group_by(rowid) %>% 
  mutate(Outcome = case_when(Result=="win" ~ 1, TRUE ~0)) %>% 
  mutate(lagged = lag(Outcome)) %>% mutate(start=(Outcome != lagged)) -> wc

wc %>% mutate(start=replace_na(start,TRUE)) -> wc

wc %>% mutate(streak_id = cumsum(start)) ->wc

Calculate Streak Length and Arrange

wc %>%  group_by(rowid,streak_id) %>% 
  mutate(streak=row_number()) %>% 
  ungroup() %>% 
  arrange(.,rowid,DateEvent) %>% 
  rename(PreviousOutcome=lagged) -> wc
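The streak logic is easier to follow on a toy example. The sketch below applies the same three steps (flag where a run starts, number the runs, count within each run) to a hypothetical six-fight history for a single fighter, so no grouping by rowid is needed.

```r
library(dplyr)
library(tidyr)

# Hypothetical fight history for one fighter, oldest fight first (1 = win)
toy <- tibble(Outcome = c(1, 1, 0, 1, 1, 1))

toy <- toy %>%
  mutate(start = replace_na(Outcome != lag(Outcome), TRUE),  # TRUE where a new run begins
         streak_id = cumsum(start)) %>%                      # label each run
  group_by(streak_id) %>%
  mutate(streak = row_number()) %>%                          # position within the run
  ungroup()

print(toy)
```

For a single vector, base R gives the same counts with sequence(rle(x)$lengths); the dplyr version is used here because it extends naturally to one group per fighter.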

Previous Result

wc %>% group_by(rowid) %>% mutate(StreakLength = lag(streak)) %>% 
  mutate(StreakLength = case_when(PreviousOutcome == 0 ~ -StreakLength, TRUE ~ StreakLength)) -> wc

Win Ratio

wc %>% ungroup() %>% group_by(rowid) %>%  
  mutate(WinRatio = lag(cumsum(Outcome))/(row_number()-1)) %>% 
  mutate(CountFights =row_number()-1) ->wc
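The lag() in the win ratio matters: each row should describe the fighter's record going into that fight, not including it. A short sketch on a hypothetical four-fight history makes the arithmetic explicit.

```r
library(dplyr)

# Hypothetical history for one fighter, oldest fight first (1 = win)
toy <- tibble(Outcome = c(1, 1, 0, 1)) %>%
  mutate(CountFights = row_number() - 1,                  # fights before this one
         WinRatio = lag(cumsum(Outcome)) / CountFights)   # wins before this one / fights before

print(toy)
```

The first row has no prior fights, so its ratio is NA. The fourth row is 2 prior wins out of 3 prior fights.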

Finish Ratio and Finish Previous

wc %>% mutate(FinishRatio = lag(!str_detect(Method,"Decision") & Outcome == 1)) %>%
  mutate(FinishRatio=replace_na(FinishRatio,FALSE)) %>% 
  mutate(FinishRatio = cumsum(FinishRatio)/(CountFights)) %>% 
  mutate(FinishPrevious = case_when(!str_detect(Method,"Decision") & Outcome == 1 ~TRUE, TRUE ~ FALSE)) %>% 
  mutate(FinishPrevious=lag(FinishPrevious)) -> wc
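As a check on the finish ratio, the sketch below replays the calculation on the five earliest fights of the sample fighter (methods taken from the table in 1a, all wins). The intermediate Finish column is a name introduced here for readability.

```r
library(dplyr)
library(stringr)
library(tidyr)

toy <- tibble(
  Method  = c("KO (Slam)", "TKO (Punches)", "TKO (Punches)",
              "Decision (Unanimous)", "Submission (Rear-Naked Choke)"),
  Outcome = c(1, 1, 1, 1, 1)
) %>%
  mutate(CountFights = row_number() - 1,
         # a finish is a win that did not go to a decision
         Finish = !str_detect(Method, "Decision") & Outcome == 1,
         FinishPrevious = lag(Finish),
         # finishes before this fight divided by fights before this fight;
         # the first row is 0/0, which R reports as NaN
         FinishRatio = cumsum(replace_na(FinishPrevious, FALSE)) / CountFights)

print(toy)
```

Going into the fifth fight, three of the four earlier wins were finishes, which gives a ratio of 0.75.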

Finished Ratio and Finished Previous

wc %>% mutate(FinishedRatio = lag(!str_detect(Method,"Decision") & Outcome == 0)) %>% mutate(FinishedRatio = replace_na(FinishedRatio,FALSE)) %>% 
  mutate(FinishedRatio = cumsum(FinishedRatio)/(CountFights)) %>% 
  mutate(FinishedPrevious = case_when(!str_detect(Method,"Decision") & Outcome == 0 ~TRUE, TRUE ~ FALSE)) %>% 
  mutate(FinishedPrevious = lag(FinishedPrevious)) -> wc

Processing Summary and Next Steps

Some minor adjustments to the data structure (a Fighter-Fighter Event Matrix) need to be made before a model can be built for prediction. These, along with the model workflow, follow in the next post.

library(formattable)

formattable::format_table(head(cleaned_data))
X rowid Result Fighter        Event                       Round Time Method
1 1     win    Kevin Croom    ROF 41 - Bragging Rights    1     1:01 KO (Slam)
2 1     win    Joe Kelso      BTT MMA 2 - Genesis         1     4:32 TKO (Punches)
3 1     win    Donnie Bell    ROF 42 - Who’s Next         2     2:57 TKO (Punches)
4 1     win    Marcus Edwards ROF 43 - Bad Blood          3     5:00 Decision (Unanimous)
5 1     win    Sam Young      RITC - Rage in the Cage 162 2     1:58 Submission (Rear-Naked Choke)
6 1     win    Drew Fickett   RITC - Rage in the Cage 163 1     0:12 KO (Punch)

X DateEvent  Outcome PreviousOutcome start streak_id streak StreakLength WinRatio CountFights
1 2011-08-20 1       NA              TRUE  1         1      NA           NA       0
2 2011-10-01 1       1               FALSE 1         2      1            1        1
3 2011-12-17 1       1               FALSE 1         3      2            1        2
4 2012-06-02 1       1               FALSE 1         4      3            1        3
5 2012-09-29 1       1               FALSE 1         5      4            1        4
6 2012-10-20 1       1               FALSE 1         6      5            1        5

X FinishRatio FinishPrevious FinishedRatio FinishedPrevious FinishedCount FinishCount DamageDiff
1 NA          NA             NA            NA               0             0           0
2 1.00        TRUE           0             FALSE            0             1           1
3 1.00        TRUE           0             FALSE            0             2           2
4 1.00        TRUE           0             FALSE            0             3           3
5 0.75        FALSE          0             FALSE            0             3           3
6 0.80        TRUE           0             FALSE            0             4           4