Pop Music Geniuses: Stevie Wonder Trumped by Katy Perry?

Summary

As time rolls on, perceptions of taste and quality ‘advance’ with it. What was fashionable in the 1970s may not be the recipe for (commercial!) success today. Using the sparklyr package, I investigate trends in pop music based on data from the Billboard Hot 100 Year-End charts. This dataset contains the aggregated yearly rankings of singles released in the US, and is thus a reasonable measure of music popularity in the mass market.

Implementation

I spin up a connection to a local Spark instance, which in this case is trivial. However, tackling larger (280GB) datasets would require scaling out onto a cluster.

library(sparklyr)
library(dplyr)

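# configure a small local Spark instance: 2 cores, 1 GB of driver memory
conf <- spark_config()
conf$`sparklyr.cores.local` <- 2
conf$`sparklyr.shell.driver-memory` <- "1G"
conf$spark.memory.fraction <- 0.7

sc <- spark_connect(master = "local", config = conf)

# read the CSV into R, then copy it into Spark as the table "billboard"
billboard <- read.csv("C:/Users/daniel/Desktop/locstore/portfolio/billboard_lyrics_1964-2015.csv")

billboard <- copy_to(sc, billboard, overwrite = TRUE)
# drop the R-side handle; the Spark table is re-linked with tbl() further down
rm(billboard)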

# list tables
src_tbls(sc)

## [1] "billboard"

Who are the top-performing musical artists over the period 1964-2015, based on number of chart appearances? No surprises here: Madonna, Elton John, Mariah Carey and Michael Jackson all have over 20 appearances in the rankings!

# create link to Spark Data Frame
hits <- tbl(sc, "billboard")

# most prominent artists :: multiple occurrences
repeat_performers <- hits %>% 
  select(Artist,Rank) %>% 
  group_by(Artist) %>% 
  count(Artist,sort=TRUE) %>% 
  filter(n>2) %>% 
  collect()
  
repeat_performers

## # A tibble: 482 x 2
##    Artist              n
##    <chr>           <dbl>
##  1 madonna            35
##  2 elton john         26
##  3 mariah carey       25
##  4 michael jackson    22
##  5 stevie wonder      22
##  6 janet jackson      22
##  7 whitney houston    19
##  8 taylor swift       19
##  9 rihanna            19
## 10 the beatles        17
## # ... with 472 more rows

Musical success could also be defined by overall quality, say placing in the top tier (Top 25) of the Billboard Hot 100. Spoiler: lots of R&B and Adult Contemporary!

# number of top-tier placings per artist
# (note: Rank < 25 keeps ranks 1-24; Rank <= 25 would include rank 25 too)
quality_performers <- hits %>% 
  select(Artist, Rank, Year) %>% 
  filter(Rank < 25) %>% 
  group_by(Artist) %>% 
  count(Artist, sort = TRUE) 

quality_performers

## # Source:     spark<?> [?? x 2]
## # Groups:     Artist
## # Ordered by: desc(n)
##    Artist              n
##    <chr>           <dbl>
##  1 mariah carey       14
##  2 whitney houston     9
##  3 usher               9
##  4 madonna             8
##  5 rihanna             8
##  6 michael jackson     8
##  7 elton john          7
##  8 janet jackson       7
##  9 bee gees            7
## 10 boyz ii men         6
## # ... with more rows

Which artists never quite made it to the big time? To answer this question we can select all artists whose best ranking falls outside the Top 80, split over decades by “bucketing” the Year variable into bins. Miguel, Keyshia Cole and others never quite gained mainstream popularity. Neither did Toby Keith, despite his string of 4 “hits” in the 2000s.

decades <- c(1960.1, 1970.1, 1980.1, 1990.1, 2000.1, 2010.1, 2020.1)

decades_labels <-c("1960-1970", "1970-1980", "1980-1990",
                   "1990-2000", "2000-2010","2010-2015")

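partial_success <- hits %>% 
  select(Artist, Rank, Year) %>% 
  mutate(Year = as.numeric(Year)) %>%
  # bucket index (0-5) of the decade each year falls into
  ft_bucketizer("Year", "Decade", splits = decades) %>%
  group_by(Artist) %>%
  mutate(Best_Rank = min(Rank, na.rm = TRUE)) %>% 
  ungroup() %>% 
  # keep artists whose best placing is outside the Top 80
  filter(Best_Rank > 80) %>% 
  group_by(Artist, Decade) %>% 
  count(Artist, sort = TRUE) %>% 
  collect() %>% 
  # explicit levels keep the labels aligned even if a decade has no rows
  mutate(Decade = factor(Decade, levels = 0:5, labels = decades_labels)) %>% 
  filter(n > 1) %>% 
  arrange(desc(Decade), desc(n))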

partial_success

## # A tibble: 19 x 3
##    Artist                    Decade        n
##    <chr>                     <fct>     <dbl>
##  1 miguel                    2010-2015     3
##  2 5 seconds of summer       2010-2015     2
##  3 lee brice                 2010-2015     2
##  4 lil wayne featuring drake 2010-2015     2
##  5 toby keith                2000-2010     4
##  6 alan jackson              2000-2010     3
##  7 keyshia cole              2000-2010     3
##  8 tpain                     2000-2010     2
##  9 zac brown band            2000-2010     2
## 10 haddaway                  1990-2000     2
## 11 joe jackson               1980-1990     2
## 12 david lee roth            1980-1990     2
## 13 the time                  1980-1990     2
## 14 parliament                1970-1980     2
## 15 the who                   1970-1980     2
## 16 todd rundgren             1970-1980     2
## 17 joe walsh                 1970-1980     2
## 18 the originals             1960-1970     2
## 19 the sandpipers            1960-1970     2
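
The bucketing step can be checked without Spark. The following is a minimal sketch using base R's cut() as a local stand-in for ft_bucketizer(); the years are made-up test values rather than rows from the Billboard data. It suggests that, because the splits are offset by 0.1, a year ending in 0 lands in the earlier bin, so "1960-1970" effectively covers 1961 to 1970.

```r
# Local, Spark-free sketch of the decade bucketing using base R's cut().
# The years below are invented test values, not rows from the Billboard data.
decades <- c(1960.1, 1970.1, 1980.1, 1990.1, 2000.1, 2010.1, 2020.1)
decades_labels <- c("1960-1970", "1970-1980", "1980-1990",
                    "1990-2000", "2000-2010", "2010-2015")

years <- c(1965, 1970, 1971, 1999, 2000, 2001, 2015)

# right = FALSE gives half-open bins [lower, upper), the convention
# Spark's Bucketizer uses for every bucket except the last one
decade <- cut(years, breaks = decades, labels = decades_labels, right = FALSE)

bucketed <- data.frame(Year = years, Decade = decade)
print(bucketed)
```

One difference worth keeping in mind: cut() returns a labelled factor directly, whereas the Spark bucketizer returns a zero-based numeric bucket index, which is why the pipeline above converts Decade to a factor after collect().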

One way of thinking about mainstream success is simply to take the overall rankings into account. Here I build a lookup table of rank and year for every chart entry, and join it to the table of repeat performers.

# lookup-table containing rank and year
LUT <- hits %>% 
  select(Artist,Rank,Year) %>% 
  collect()

# assembling the data table of ranks per year
rank_perYear <- left_join(repeat_performers, LUT, by = "Artist")

# both tables are already local, so no further collect() is needed
rank_perYear %>% 
  arrange(Rank, desc(n))

## # A tibble: 2,746 x 4
##    Artist                  n  Rank  Year
##    <chr>               <dbl> <int> <int>
##  1 elton john             26     1  1997
##  2 mariah carey           25     1  2005
##  3 whitney houston        19     1  1993
##  4 the beatles            17     1  1968
##  5 the black eyed peas    16     1  2009
##  6 chicago                15     1  1989
##  7 rod stewart            14     1  1977
##  8 boyz ii men            13     1  1992
##  9 beyonce                12     1  2007
## 10 olivia newtonjohn      12     1  1982
## # ... with 2,736 more rows
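
To see why the joined table has one row per chart entry rather than one row per artist, here is a small sketch of the same left_join() on toy data. The tables appearances and lut are hypothetical stand-ins for repeat_performers and LUT, and every value in them is invented.

```r
library(dplyr)

# Toy stand-ins for repeat_performers and LUT; all values are invented
appearances <- tibble(Artist = c("artist a", "artist b"),
                      n      = c(3, 2))
lut <- tibble(Artist = c("artist a", "artist a", "artist a",
                         "artist b", "artist b", "artist c"),
              Rank   = c(5L, 40L, 12L, 1L, 77L, 3L),
              Year   = c(1990L, 1991L, 1993L, 2001L, 2002L, 2005L))

# left_join keeps every row of the left table and repeats its n for each
# matching row in the lookup table; "artist c" is dropped because it does
# not appear on the left side
joined <- left_join(appearances, lut, by = "Artist") %>%
  arrange(Rank, desc(n))

print(joined)
```

The appearance count n is repeated on every row belonging to an artist, which is what allows the real table to be sorted by Rank first and by n second.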

And now comes the real question: which artist with repeat appearances in the Hot 100 list is the most popular based solely on average ranking? It turns out no one is on average as popular as Matchbox Twenty, and Jason Mraz was as popular in the late 2000s as John Denver was in the early 1970s.

# who has the best quality music occasionally
consistent_performers <-rank_perYear %>% 
  group_by(Artist) %>% 
  summarize(mean_rank = mean(Rank,na.rm=TRUE),
                mean_year = mean(Year,na.rm=TRUE)) %>% 
  arrange(mean_rank) %>% 
  collect()
  
consistent_performers

## # A tibble: 482 x 3
##    Artist                mean_rank mean_year
##    <chr>                     <dbl>     <dbl>
##  1 matchbox twenty            7        2001.
##  2 the association            8.67     1967.
##  3 the weeknd                10.3      2015 
##  4 blondie                   12.8      1980.
##  5 starship                  14.3      1986 
##  6 all4one                   15        1994.
##  7 janet                     15.7      2000 
##  8 tony orlando and dawn     15.8      1973 
##  9 bruno mars                17.5      2012.
## 10 atlantic starr            17.7      1988.
## # ... with 472 more rows

Naturally this is a biased representation of musical impact, since an artist with only a handful of high-ranking singles can top the list. A somewhat more normalized measure of success divides the average rank by the number of appearances, so a lower score is better. This rewards prolific performers, even when their average rank is only middling.

# normalized success
normalized_success <- left_join(consistent_performers,repeat_performers,"Artist") %>% 
    mutate(norm_SR = mean_rank/n) %>% 
    arrange(norm_SR)

normalized_success

## # A tibble: 482 x 5
##    Artist          mean_rank mean_year     n norm_SR
##    <chr>               <dbl>     <dbl> <dbl>   <dbl>
##  1 mariah carey         28.5     1997.    25    1.14
##  2 madonna              49.3     1991.    35    1.41
##  3 whitney houston      30.5     1991.    19    1.61
##  4 janet jackson        37.7     1991.    22    1.71
##  5 elton john           47.0     1982.    26    1.81
##  6 usher                28.7     2004.    14    2.05
##  7 michael jackson      47.4     1986.    22    2.15
##  8 bruno mars           17.5     2012.     8    2.19
##  9 stevie wonder        48.3     1976.    22    2.20
## 10 katy perry           31.6     2011     14    2.26
## # ... with 472 more rows
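
As a quick sanity check on the score, the sketch below recomputes norm_SR = mean_rank / n for four artists, using the rounded mean_rank and n values printed in the table above. Because the inputs are rounded, this only approximates the Spark result, but it reproduces the printed scores to two decimals.

```r
# Values copied from the printed table above (rounded mean_rank, and n)
checks <- data.frame(
  Artist    = c("mariah carey", "madonna", "stevie wonder", "katy perry"),
  mean_rank = c(28.5, 49.3, 48.3, 31.6),
  n         = c(25, 35, 22, 14)
)

# norm_SR = mean_rank / n; a lower score is better
checks$norm_SR <- round(checks$mean_rank / checks$n, 2)

# order from best (lowest) to worst score
checks <- checks[order(checks$norm_SR), ]
print(checks)
```

Katy Perry has the better average rank (31.6 against 48.3), but Stevie Wonder's 22 appearances against her 14 pull his score just below hers.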

spark_disconnect(sc)

## NULL

All is well in this musical world! Mariah Carey is back at the top, accompanied by Whitney Houston and Elton John. Which brings us to the surprise of the day.

Despite a clearly better average ranking (more popular), Katy Perry is just edged out by Stevie Wonder's consistent run of hits.

Dan Gray

2019/03/07
