Matching Mood to Music: Sentiment Analysis of Chart Hits in the 70's and 80's | Dan Gray

Summary

Moody blues or bubblegum-pop? Is it possible to extract relevant metrics of positivity or negativity from song lyrics?

Here I again use the Billboard Hot 100 Year End Charts dataset to investigate which artists tend to produce music associated with positive or negative sentiment, based on lyrical content. I also briefly explore how overall sentiment evolves over time across chart-ranking tiers.

Implementation

Setting up the Spark environment requires specifying how many cores and how much RAM are available for processing.

library(sparklyr)
library(tidyverse)
library(tidytext)

# set up spark configuration
conf <- spark_config()  
conf$`sparklyr.cores.local` <- 2
conf$`sparklyr.shell.driver-memory` <- "1G"
conf$spark.memory.fraction <- 0.7

# make connection
sc <- spark_connect(master="local",config = conf)

# read in data
billboard <- read.csv("C:/Users/eoedd/Desktop/locstore/portfolio/billboard_lyrics_1964-2015.csv",stringsAsFactors = FALSE)

# copy to spark
billboard <- copy_to(sc, billboard,overwrite=TRUE)
# drop the R-side handle; the table stays registered in Spark
rm(billboard)


# list tables
src_tbls(sc)

## [1] "billboard"

# create link to Spark Data Frame
hits <- tbl(sc, "billboard")

For this local implementation, I trim the data to a size that the ft_ (feature transformer) operations can handle comfortably. On a modern workstation the same operations could be scaled to the full dataset.

The ft_bucketizer function splits a numeric variable into groups, much as cut does (with slight differences in the boundaries and inclusion constraints).
I keep only complete cases to avoid an error being thrown at the ft_tokenizer step. The ft_tokenizer function takes a string, converts it to lowercase and splits it on whitespace.
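To make the comparison with cut concrete, here is a minimal local sketch that needs no Spark connection. It assumes the documented behaviour of Spark's Bucketizer (left-closed intervals and a zero-based bucket index) and mimics it with base R. The `ranks` vector is made up for illustration.

```r
# Hypothetical ranks, chosen to sit on the bucket boundaries
ranks <- c(1, 10, 11, 20, 21, 50)

# Same split points and labels as used for the Spark pipeline
tiers <- c(0, 11, 21, 31, 41, 51)
tier_labels <- c("1-10", "11-20", "21-30", "31-40", "41-50")

# cut() with right = FALSE uses left-closed intervals [a, b),
# which is how the bucketizer is assumed to treat its splits
tier_factor <- cut(ranks, breaks = tiers, right = FALSE, labels = tier_labels)

# The bucketizer yields a zero-based bucket index rather than a label;
# findInterval() minus one gives an equivalent index locally
tier_index <- findInterval(ranks, tiers) - 1

data.frame(ranks, tier_factor, tier_index)
```

The index runs from 0 to 4, so the labels could be attached after collecting the data with something like factor(Tiers, levels = 0:4, labels = tier_labels) if readable tier names are wanted.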

At this point we need to collect the results into R for processing into a one-word-per-row format.

The first step converts each song's tokens into a list element of type character. The individual words are then split into their own rows by unnesting that list column.

# subset data for testing
words_70s_80s <- hits %>% 
  select(Artist,Year,Lyrics,Rank) %>% 
  filter(Year > 1969 & Year < 1990) %>% 
  na.omit() %>% 
  filter(Rank < 51)

# group ranks
tiers <- c(0,11,21,31,41,51)
tier_labels <-c("1-10", "11-20", "21-30",
                   "31-40", "41-50")

# compute lyric-splits 
lyrics_70s_80s <- words_70s_80s %>% 
  ft_bucketizer("Rank","Tiers",splits=tiers) %>% 
  ft_tokenizer("Lyrics","word") %>% 
  collect() %>% 
  mutate(word=lapply(word,as.character)) %>% 
  unnest(word)
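The collect-and-unnest step above can be tried out without Spark. The following is a rough sketch on a made-up two-row data frame: strsplit on lowercase text stands in for ft_tokenizer (an approximation, not the Spark implementation), and the toy artist names and lyrics are purely illustrative.

```r
library(dplyr)
library(tidyr)

# Toy stand-in for the collected Spark result (made-up lyrics)
toy <- tibble(
  Artist = c("artist a", "artist b"),
  Lyrics = c("Love me do", "No more tears")
)

toy_words <- toy %>%
  # approximate ft_tokenizer: lowercase, then split on whitespace
  mutate(word = strsplit(tolower(Lyrics), "\\s+")) %>%
  # each element of the list column becomes its own row
  unnest(cols = word) %>%
  select(Artist, word)

toy_words
```

Each song contributes one row per word, which is the shape needed for joining against a word-level sentiment lexicon.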

The sentiment scores are provided by the tidytext package.

# get sentiment matrix
sentiments <- get_sentiments("afinn")
glimpse(sentiments)

## Observations: 2,476
## Variables: 2
## $ word  <chr> "abandon", "abandoned", "abandons", "abducted", "abducti...
## $ score <int> -2, -2, -2, -2, -2, -2, -3, -3, -3, -3, 2, 2, 1, -1, -1,...

All that is left to do in terms of data preparation is to inner-join the individual lyric words to the sentiment scores. In this case I add a grouping variable to summarize the scores by Artist.

# calculate sentiments per artist
sent_artists <- lyrics_70s_80s %>%
  inner_join(sentiments,by="word") %>% 
  group_by(Artist) %>%
  summarize(positivity=sum(score)) 

# artists with the most positive total score
sent_artists %>% 
  arrange(desc(positivity)) %>% 
  top_n(15, positivity)

## # A tibble: 15 x 2
##    Artist            positivity
##    <chr>                  <int>
##  1 kool  the gang           427
##  2 whitney houston          418
##  3 jody watley              254
##  4 commodores               246
##  5 wings                    201
##  6 billy idol               195
##  7 natalie cole             195
##  8 dr hook                  191
##  9 stevie wonder            191
## 10 foreigner                172
## 11 donna summer             170
## 12 olivia newtonjohn        162
## 13 george michael           159
## 14 eagles                   158
## 15 barbra streisand         157

# artists with the most negative total score (ties can return more than 15 rows)
sent_artists %>% 
  arrange(positivity) %>% 
  top_n(-15, positivity)

## # A tibble: 17 x 2
##    Artist                 positivity
##    <chr>                       <int>
##  1 bon jovi                      -75
##  2 duran duran                   -68
##  3 don mclean                    -57
##  4 kenny loggins                 -55
##  5 neil diamond                  -49
##  6 lou rawls                     -38
##  7 daryl hall  john oates        -37
##  8 eddie holman                  -37
##  9 the police                    -37
## 10 corey hart                    -36
## 11 klymaxx                       -36
## 12 bj thomas                     -35
## 13 john waite                    -35
## 14 pet shop boys                 -34
## 15 falco                         -33
## 16 juice newton                  -33
## 17 paper lace                    -33

A few pleasant surprises here: just scrolling through the names in the respective “positive” and “negative” lists tends to evoke an association with feel-good (Kool & the Gang, Dr Hook) vs. reflective music (Neil Diamond, Falco). Naturally, the totals are biased by occurrence: an artist with more charting songs contributes more scored words. A better measure of artist sentiment could be obtained by normalising by the word count (n).
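As a minimal sketch of what normalising by count could look like, the example below summarises a made-up table of word-level scores. The data and the column name avg_positivity are illustrative assumptions; the score column follows the lexicon output shown earlier (newer tidytext releases name it value instead).

```r
library(dplyr)

# Made-up word-level scores standing in for the joined lyrics/sentiment table
toy_scores <- tibble(
  Artist = c("artist a", "artist a", "artist a", "artist a",
             "artist b", "artist b"),
  score  = c(2, 1, 1, 2, 3, 2)
)

toy_summary <- toy_scores %>%
  group_by(Artist) %>%
  summarize(
    n = n(),                       # number of scored words
    positivity = sum(score),       # total, favours prolific artists
    avg_positivity = mean(score)   # total divided by n
  ) %>%
  arrange(desc(avg_positivity))

toy_summary
```

In this toy case "artist a" has the higher total but "artist b" has the higher average, which is the kind of reordering that normalisation is meant to expose.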

An alternative inner-join gives us the most frequently used words and their associated sentiment.

# join tables
sentiment_lyrics <- lyrics_70s_80s %>%
  inner_join(sentiments,by="word")

# which words are used most
sentiment_lyrics %>% 
  select(word,score) %>% 
  group_by(score) %>% 
  count(word,sort=TRUE) %>% 
  arrange(desc(n)) %>% 
  top_n(15)

## Selecting by n

## # A tibble: 110 x 3
## # Groups:   score [10]
##    score word        n
##    <int> <chr>   <int>
##  1     3 love     3138
##  2    -1 no       1261
##  3     2 like     1228
##  4     1 want      890
##  5     1 yeah      857
##  6    -2 ill       620
##  7     3 good      444
##  8     2 sweet     283
##  9     2 better    253
## 10     1 feeling   191
## # ... with 100 more rows

Lastly, it's straightforward to trace the evolution of positivity or negativity in lyrics, split by ranking tier, over the period 1970-1989.

# compute total and average positivity per year and tier
sent_ranks <- sentiment_lyrics %>% 
  group_by(Year,Tiers) %>%
  summarize(tot_positivity=sum(score),
            avg_positivity=mean(score)) 

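# plot of sentiment patterns
# note: the original used theme_plain(), which is not part of ggplot2 and is not
# defined in this post; theme_minimal() is substituted here as a stand-in
ggplot(sent_ranks,aes(Year,tot_positivity,colour=Tiers)) +
  geom_point() +
  geom_smooth(se=FALSE,method="loess") +
  ylab("Total Positivity Lyrical Content") +
  facet_wrap(~Tiers,nrow = 2) +
  theme_minimal() + guides(colour="none")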

Interestingly, the Top 10 songs have the most pronounced temporal variability (closely matched by those just making the Top 50), whereas intermediate rankings seem to have a more erratic emotional palette.

The early 1980s seem to have been a time of rather negative sentiment. However, 1981 was an outright anomaly for that year's most popular songs!

2019/03/13
