Summary
As time rolls on, perceptions of taste and quality ‘advance’ with it. What was fashionable in the 1970s may not be the recipe for (commercial!) success today. Using the sparklyr package, I investigate trends in pop music based on data from the Billboard Hot 100 Year-End charts. This dataset is the aggregated yearly ranking of singles released in the US, and thus a reasonable measure of mass-market popularity.
Implementation
I spin up a connection to a local Spark instance, which in this case is trivial. However, tackling larger (280GB) datasets would require scaling out to a cluster.
library(sparklyr)
library(dplyr)
conf <- spark_config()
conf$`sparklyr.cores.local` <- 2
conf$`sparklyr.shell.driver-memory` <- "1G"
conf$spark.memory.fraction <- 0.7
sc <- spark_connect(master = "local", config = conf)
billboard <- read.csv("C:/Users/daniel/Desktop/locstore/portfolio/billboard_lyrics_1964-2015.csv")
billboard <- copy_to(sc, billboard, overwrite = TRUE)
rm(billboard)
# list tables
src_tbls(sc)
## [1] "billboard"
Who are the top-performing musical artists over the period 1964-2015, based on prominence (the number of appearances in the year-end charts)? No surprises here: Madonna, Elton John, Mariah Carey and Michael Jackson all have over 20 appearances in the rankings!
# create link to Spark Data Frame
hits <- tbl(sc, "billboard")
# most prominent artists :: multiple occurrences (more than two appearances)
repeat_performers <- hits %>%
  select(Artist, Rank) %>%
  group_by(Artist) %>%
  count(Artist, sort = TRUE) %>%
  filter(n > 2) %>%
  collect()
repeat_performers
## # A tibble: 482 x 2
## Artist n
## <chr> <dbl>
## 1 madonna 35
## 2 elton john 26
## 3 mariah carey 25
## 4 michael jackson 22
## 5 stevie wonder 22
## 6 janet jackson 22
## 7 whitney houston 19
## 8 taylor swift 19
## 9 rihanna 19
## 10 the beatles 17
## # ... with 472 more rows
Musical success could also be defined by overall quality, say placing in the top tier (ranked above 25th) of the Billboard Hot 100. Spoiler: lots of RnB and Adult Contemporary!
# artists with the most top-tier placings (Rank < 25)
quality_performers <- hits %>%
  select(Artist, Rank, Year) %>%
  filter(Rank < 25) %>%
  group_by(Artist) %>%
  count(Artist, sort = TRUE)
quality_performers
## # Source: spark<?> [?? x 2]
## # Groups: Artist
## # Ordered by: desc(n)
## Artist n
## <chr> <dbl>
## 1 mariah carey 14
## 2 whitney houston 9
## 3 usher 9
## 4 madonna 8
## 5 rihanna 8
## 6 michael jackson 8
## 7 elton john 7
## 8 janet jackson 7
## 9 bee gees 7
## 10 boyz ii men 6
## # ... with more rows
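One detail worth checking is the cut-off itself: the filter `Rank < 25` keeps ranks 1 to 24, whereas a literal Top 25 would need `Rank <= 25`. The sketch below uses a handful of made-up ranks (not taken from the dataset) in plain base R to show the difference at the boundary.

```r
# Made-up year-end ranks around the cut-off (illustrative only)
ranks <- c(1, 24, 25, 26)

# What the filter in the pipeline keeps: ranks 1 to 24
strict <- ranks[ranks < 25]

# What a literal "Top 25" would keep: ranks 1 to 25
inclusive <- ranks[ranks <= 25]

print(strict)
print(inclusive)
```

Switching to the inclusive form would change the counts shown above only for artists with a single ranked exactly 25th.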
Which artists never quite made it to the big time? To answer this question we can select all artists whose best ranking is below 80th place, split over decades by “bucketing” the Year variable into bins. Miguel, Keyshia Cole and others never quite gained mainstream popularity. Neither did Toby Keith, despite his string of 4 “hits” in the 2000s.
decades <- c(1960.1, 1970.1, 1980.1, 1990.1, 2000.1, 2010.1, 2020.1)
decades_labels <- c("1960-1970", "1970-1980", "1980-1990",
                    "1990-2000", "2000-2010", "2010-2015")
partial_success <- hits %>%
  select(Artist, Rank, Year) %>%
  mutate(Year = as.numeric(Year)) %>%
  ft_bucketizer("Year", "Decade", splits = decades) %>%
  group_by(Artist) %>%
  # best (numerically lowest) rank each artist ever reached
  mutate(Best_Rank = min(Rank, na.rm = TRUE)) %>%
  ungroup() %>%
  filter(Best_Rank > 80) %>%
  group_by(Artist, Decade) %>%
  count(Artist, sort = TRUE) %>%
  collect() %>%
  mutate(Decade = factor(Decade, labels = decades_labels)) %>%
  filter(n > 1) %>%
  arrange(desc(Decade), desc(n))
partial_success
## # A tibble: 19 x 3
## Artist Decade n
## <chr> <fct> <dbl>
## 1 miguel 2010-2015 3
## 2 5 seconds of summer 2010-2015 2
## 3 lee brice 2010-2015 2
## 4 lil wayne featuring drake 2010-2015 2
## 5 toby keith 2000-2010 4
## 6 alan jackson 2000-2010 3
## 7 keyshia cole 2000-2010 3
## 8 tpain 2000-2010 2
## 9 zac brown band 2000-2010 2
## 10 haddaway 1990-2000 2
## 11 joe jackson 1980-1990 2
## 12 david lee roth 1980-1990 2
## 13 the time 1980-1990 2
## 14 parliament 1970-1980 2
## 15 the who 1970-1980 2
## 16 todd rundgren 1970-1980 2
## 17 joe walsh 1970-1980 2
## 18 the originals 1960-1970 2
## 19 the sandpipers 1960-1970 2
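To make the bucketing step more concrete, here is a local base R sketch of what I understand the bucketizer to do: each year gets a 0-based bucket index, and a bucket covers its lower split point up to (but not including) the next one. It uses `findInterval()` as a stand-in rather than Spark itself, the years are made up, and passing explicit `levels` to `factor()` is my own precaution so that the labels stay aligned even if a bucket happens to be empty.

```r
# Same split points and labels as in the pipeline above
decades <- c(1960.1, 1970.1, 1980.1, 1990.1, 2000.1, 2010.1, 2020.1)
decades_labels <- c("1960-1970", "1970-1980", "1980-1990",
                    "1990-2000", "2000-2010", "2010-2015")

# Made-up years, chosen to sit on either side of a split point
years <- c(1965, 1970, 1971, 2010, 2015)

# findInterval() is 1-based; subtracting 1 mimics a 0-based bucket index
bucket <- findInterval(years, decades) - 1

# Explicit levels keep the labels aligned even if a bucket is empty
decade <- factor(bucket, levels = 0:5, labels = decades_labels)

print(data.frame(year = years, bucket = bucket, decade = decade))
```

Because the split points sit at x.1, a year such as 1970 lands in the "1960-1970" bin and 2010 in "2000-2010", so each label is inclusive of its closing year.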
One way of thinking about mainstream success is simply to take the overall rankings into account. Here I build a look-up table of rank and year for every chart entry, and join it to the table of repeat performers.
# lookup table containing rank and year
LUT <- hits %>%
  select(Artist, Rank, Year) %>%
  collect()
# assembling the data table of ranks per year
rank_perYear <- left_join(repeat_performers, LUT, "Artist")
rank_perYear %>%
  arrange(Rank, desc(n)) %>%
  collect()
## # A tibble: 2,746 x 4
## Artist n Rank Year
## <chr> <dbl> <int> <int>
## 1 elton john 26 1 1997
## 2 mariah carey 25 1 2005
## 3 whitney houston 19 1 1993
## 4 the beatles 17 1 1968
## 5 the black eyed peas 16 1 2009
## 6 chicago 15 1 1989
## 7 rod stewart 14 1 1977
## 8 boyz ii men 13 1 1992
## 9 beyonce 12 1 2007
## 10 olivia newtonjohn 12 1 1982
## # ... with 2,736 more rows
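The shape of the joined table (one row per chart entry, with the artist's appearance count repeated on each row) follows from how a left join works. A small base R analogue using `merge()` with made-up artists and ranks shows the same behaviour; the names `counts` and `entries` are hypothetical stand-ins for `repeat_performers` and `LUT`, and the toy counts ignore the more-than-two-appearances filter.

```r
# Made-up appearance counts (stand-in for repeat_performers)
counts <- data.frame(Artist = c("artist a", "artist b"),
                     n = c(2, 1))

# Made-up chart entries (stand-in for LUT); "artist c" has no count row
entries <- data.frame(Artist = c("artist a", "artist a", "artist b", "artist c"),
                      Rank = c(12, 40, 7, 90),
                      Year = c(1990, 1991, 2001, 2005))

# all.x = TRUE keeps every row of counts, which is what a left join does
joined <- merge(counts, entries, by = "Artist", all.x = TRUE)
print(joined)
```

Artists missing from the left-hand table drop out of the result, which is why the joined table covers only the repeat performers.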
And now comes the real question: which artist with repeat appearances in the Top 100 list is the most popular, based solely on average ranking? It turns out that no one is, on average, as popular as Matchbox Twenty, and Jason Mraz was as popular in the late 2000s as John Denver was in the early 1970s.
# who has the best quality music occasionally
consistent_performers <- rank_perYear %>%
  group_by(Artist) %>%
  summarize(mean_rank = mean(Rank, na.rm = TRUE),
            mean_year = mean(Year, na.rm = TRUE)) %>%
  arrange(mean_rank) %>%
  collect()
consistent_performers
## # A tibble: 482 x 3
## Artist mean_rank mean_year
## <chr> <dbl> <dbl>
## 1 matchbox twenty 7 2001.
## 2 the association 8.67 1967.
## 3 the weeknd 10.3 2015
## 4 blondie 12.8 1980.
## 5 starship 14.3 1986
## 6 all4one 15 1994.
## 7 janet 15.7 2000
## 8 tony orlando and dawn 15.8 1973
## 9 bruno mars 17.5 2012.
## 10 atlantic starr 17.7 1988.
## # ... with 472 more rows
Naturally this is a biased representation of musical impact, since it favours artists with only a handful of well-placed singles. A somewhat more normalized measure of success divides the average rank by the number of appearances (lower is better). This rewards consistent performers, even when their average rank is (still) relatively modest.
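A toy calculation may help show how the two measures can disagree. The two acts and their ranks below are invented for illustration, and the sketch uses base R on a local data frame rather than Spark.

```r
# Made-up chart entries: one act with three strong hits,
# another with ten hits spread further down the chart
entries <- data.frame(
  Artist = c(rep("three hit act", 3), rep("steady act", 10)),
  Rank   = c(5, 7, 9, 11, 13, 15, 17, 19, 21, 23, 25, 27, 29)
)

by_artist <- split(entries$Rank, entries$Artist)
mean_rank <- sapply(by_artist, mean)    # average chart position
n         <- sapply(by_artist, length)  # number of appearances
norm_SR   <- mean_rank / n              # lower is better

print(data.frame(mean_rank, n, norm_SR))
```

On mean rank alone the three-hit act comes first (7 against 20), but once the mean is divided by the number of appearances the steady act moves ahead (2.0 against roughly 2.33).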
# normalized success
normalized_success <- left_join(consistent_performers, repeat_performers, "Artist") %>%
  mutate(norm_SR = mean_rank / n) %>%
  arrange(norm_SR)
normalized_success
## # A tibble: 482 x 5
## Artist mean_rank mean_year n norm_SR
## <chr> <dbl> <dbl> <dbl> <dbl>
## 1 mariah carey 28.5 1997. 25 1.14
## 2 madonna 49.3 1991. 35 1.41
## 3 whitney houston 30.5 1991. 19 1.61
## 4 janet jackson 37.7 1991. 22 1.71
## 5 elton john 47.0 1982. 26 1.81
## 6 usher 28.7 2004. 14 2.05
## 7 michael jackson 47.4 1986. 22 2.15
## 8 bruno mars 17.5 2012. 8 2.19
## 9 stevie wonder 48.3 1976. 22 2.20
## 10 katy perry 31.6 2011 14 2.26
## # ... with 472 more rows
spark_disconnect(sc)
## NULL
All is well in this musical world! Mariah Carey is back at the top, accompanied by Whitney Houston and Elton John. Which brings us to the surprise of the day: despite a clearly better average ranking (more popular), Katy Perry is just nudged out by a consistent run of hits from Stevie Wonder.
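The Katy Perry versus Stevie Wonder comparison can be checked by hand from the table above. The sketch below plugs in the printed `mean_rank` and `n` values; since `mean_rank` is displayed rounded, the recomputed scores may differ from the table in the last digit. The helper `score()` is a hypothetical name introduced here for illustration.

```r
# mean_rank and n as printed in the normalized_success table
# (mean_rank is shown rounded, so this is only an approximate check)
katy_perry    <- c(mean_rank = 31.6, n = 14)
stevie_wonder <- c(mean_rank = 48.3, n = 22)

# Normalized score: average rank divided by number of appearances
score <- function(x) unname(x["mean_rank"] / x["n"])

print(round(score(katy_perry), 2))
print(round(score(stevie_wonder), 2))
```

Stevie Wonder's 22 appearances more than offset his weaker average rank, which leaves his score slightly lower (better) than Katy Perry's.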