How has Music Changed over Time - A Spotify Data Analysis

Jacob Santangelo

Introduction

Over the past 50 to 100 years, as music listening has become a staple in many people's lives, we have seen enormous changes in the types of music that are popular and are listened to by the general public. From R&B, to Rock and Roll, to Hip Hop, popular music has come in many faces. In this project, we are going to look at how music has changed over time using data analysis. There are certainly differences between songs made in the 60s, and those made in present day. The contrast between the Beatles' melodies and Kanye West's sample heavy hits are striking to say the least. But could the same not be said when comparing Ed Sheeran with Kendrick Lamar, two artists who are making music within the same time frame? We need a good way to quantify how popular music has changed over time that reflects the overall trends in music.

To do this, we are going to look at the Billboard top 10 songs from their end of year list, over the years of 1951 to 2020. While not a perfect metric of what is popular over the course of a year, it will hopefully be able to pick up on the general trends for what music people are listening to. This means that while genres like Metal, and early Electronic and Hip Hop may have been important to the music scene as a whole, they are not going to be picked up in the top 10 of the charts, due to not being the number one thing being played on radio. Looking at Billboard means that this analysis will be heavily US focused, due to this being the US charts, and may not give enough attention to music across the world as a whole.

The Billboard top 10 seems like a great idea, the big problem now is how to analyze it. We want to heavily focus on the musical contents of the song, rather than things like album sales, or lyric content, which could make for interesting analysis in other projects. But to find information about the actual contents of the song may prove to be difficult. Luckily, it turns out that Spotify, the largest music streaming platform in the world, has interesting data regarding the musical contents of each of their songs. Let us see what trends can be found through looking the data for each of the billboard hits collected.

Here are some of libraries that will need to be imported. (you may have to install spotipy)

In [601]:
# you may have to do "pip install spotipy" first
import spotipy
from spotipy.oauth2 import SpotifyClientCredentials

# tables and graphing
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

# machine learning
from sklearn.linear_model import LinearRegression, LogisticRegression
from sklearn.preprocessing import PolynomialFeatures
from scipy.stats import pearsonr

# cross validation
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import train_test_split

# decision tree
from sklearn import tree
from sklearn.tree import DecisionTreeRegressor

Data Collection

To do the analysis on the spotify data of these songs, we are going to need to use the spotify library "spotipy". The documentation for this library can be found at https://spotipy.readthedocs.io/en/2.16.1/ . This library uses Spotify's web API to return metadata regarding artists, songs, and playlists. To access Spotify's web API, you need to register for an account at https://developer.spotify.com/ . From here, going to https://developer.spotify.com/dashboard/applications, we have created an app which, essentially serves no purpose other than to make requests for this project. When looking at the app, it has a Client ID, and a Client Secret ID, which can be used to authenticate usage for the API in this python program. For this, I made a new Spotify account with this authentication shown below.

In [602]:
# credentials
client_id = "28a5dce71e444cd6b1c5bd640bad3795"
client_secret = "da2a5a8ab344431cbc9b77639cf41640"

# log into app
creds = SpotifyClientCredentials(client_id=client_id,client_secret=client_secret)
spot = spotipy.Spotify(client_credentials_manager=creds)

Next, the important task is collecting the Billboard top 10's for each year. Just by searching on spotify for playlists, I have found a playlist that already contains the top 10 songs from 1951 to 2017. This is linked here: https://open.spotify.com/playlist/6KOwiWg5zwrt83nEcx7HyI.

The problem with this playlist is that a few of the versions of the songs have been deleted on spotify, so I copied this playlist to my account, and readded these songs in their correct place. I also added the billboard year end top 10 for 2018 through 2020. This playlist is found here: https://open.spotify.com/playlist/52RMPHv7LgvTVdiBtHzaBD . The plan is to read in each song for this playlist, and assign them a year based on their ordering. The problem is that spotify's API does not allow for requests on playlists longer than 100 songs. Past 100 songs, it stops reading the playlist.

To solve this problem, I had to split the playlist by decade, creating new playlists for 50s, 60s, etc. The IDs for these playlists are put into a list to be iterated over. These playlists contain the same songs as the giant playlist, but can be accessed on their own by going to https://open.spotify.com/playlist/(insert playlist id).

In [603]:
# playlists 50s, 60s ...
playlist_ids = ["3nKWjlZ1EKYtGGGFCfZgGj", "3iDMFLvXYMIscy02TM97Fg",
               "65bqRcayc5mrwmVjjf9tTI", "18uxnsrdIzFjpewqWX3m9d",
               "4AuliOnOTGaHC0HpJ9QLQg", "5H7KTVzHlPfqED10JjxJim",
               "1FYOLItA4u30DTJ0RIxHdP"]

song_ids = []
song_title = []
artists = []

For each decade, the tracks of the playlist are found, and for each track, information is stored in seperate lists.

In [604]:
for decade_id in playlist_ids:
    play_dec = spot.playlist(decade_id) #gets playlist
    data_dec = play_dec["tracks"] #tracks in plalist
    
    # iterate over tracks, getting information
    for track in data_dec["items"]: 
        song_ids.append(track['track']['id'])
        song_title.append(track['track']['name'])
        artists.append(track['track']['artists'][0]['name']) #first if there are multiple
    

Here is where the metadata from the songs are stored. Using the audio_features method, we can get the song information for a given song id.

In [605]:
song_info_list = []

# takes a minute or two to get all info
for i in range(len(song_ids)):
    song_info_list.append(spot.audio_features(song_ids[i])) 
    
for i in range(len(song_info_list)):
    song_info_list[i] = song_info_list[i][0]

The metadata is put into a Pandas DataFrame, with a column for each variable that Spotify collects data on. The other information that has been collected, such as title and artist, is added. The year is also calculated based on where the song is in the playlist.

In [606]:
song_table = pd.DataFrame(song_info_list)


#add new column that is length of song in minutes (as opposed to milliseconds)
song_table["duration_min"] = song_table.apply(lambda row: row.duration_ms / (1000 * 60), axis=1)

song_table["title"] = song_title
song_table["artist"] = artists

# array containing year for every song
years = [0 for i in range(len(song_info_list))]

# iterates over years, going up by 1 every 10th year
year = 1951
count = 0
for i in range(len(song_info_list)):
    years[i] = year
    count += 1
    if count == 10:
        year += 1
        count = 0
    
song_table["year"] = years

# removes uselses columns, such as ids
cols = song_table.columns.tolist()
cols = cols = cols[-3:] + cols[:-3]
song_table = song_table[cols]
song_table = song_table.drop(["type", "uri", "track_href", "analysis_url"], axis = 1)
In [ ]:
Lets take a look at some of the data collected from Spotify.
In [607]:
song_table
Out[607]:
title artist year danceability energy key loudness mode speechiness acousticness instrumentalness liveness valence tempo id duration_ms time_signature duration_min
0 Too Young Nat King Cole 1951 0.191 0.202 8 -14.450 1 0.0289 0.9230 0.000362 0.2890 0.181 92.091 44ZAlAAn0bAEzUJgZky04H 205707 1 3.428450
1 Because of You (with Percy Faith & His Orchestra) Tony Bennett 1951 0.235 0.222 6 -13.280 1 0.0302 0.9740 0.001090 0.2550 0.227 84.846 4EcmCSEL77iVCdhylq0yG2 177333 3 2.955550
2 How High The Moon Les Paul 1951 0.695 0.582 11 -9.567 0 0.0381 0.6870 0.009140 0.1780 0.817 118.880 0onHzLoj8yQbjbHtJzwjBx 127267 4 2.121117
3 Come On-A My House Rosemary Clooney 1951 0.531 0.676 10 -7.168 0 0.1250 0.8400 0.000000 0.4000 0.957 183.208 2scbGWVjIYsfieovyRyfv1 117267 4 1.954450
4 Be My Love (From "The Toast of New Orleans") Ray Sinatra 1951 0.240 0.279 0 -15.265 1 0.0373 0.9740 0.000017 0.1740 0.210 110.355 6l8tdrHjXjoM2U8mxfoXnL 210160 4 3.502667
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
695 Someone You Loved Lewis Capaldi 2020 0.501 0.405 1 -5.679 1 0.0319 0.7510 0.000000 0.1050 0.446 109.891 7qEHsqek33rTcFNT9PFqLf 182161 4 3.036017
696 Say So Doja Cat 2020 0.787 0.673 11 -4.577 0 0.1580 0.2560 0.000004 0.0904 0.786 110.962 3Dv1eDb0MEgF93GpLXlucZ 237893 4 3.964883
697 I Hope Gabby Barrett 2020 0.466 0.536 6 -6.227 1 0.0429 0.0951 0.000000 0.1140 0.377 75.998 23T0OX7QOiIUFShSzbJ5Uo 210773 4 3.512883
698 WHATS POPPIN (feat. DaBaby, Tory Lanez & Lil W... Jack Harlow 2020 0.904 0.723 11 -5.224 0 0.2600 0.0631 0.000000 0.1850 0.835 145.013 2MbdDtCv5LUVjYy9RuGTgC 227478 4 3.791300
699 Dance Monkey Tones And I 2020 0.824 0.588 6 -6.400 0 0.0924 0.6920 0.000104 0.1490 0.513 98.027 2XU0oxnq2qxCpomAAuJY8K 209438 4 3.490633

700 rows × 18 columns

It seems that Spotify has metadata for many different measurements. Some of these are obvious, such as tempo, measured in beats per minute, and duration_ms, which is the duration of the song in milliseconds. However, what does Spotify mean when it measures "danceability", and "speechiness"?

From Spotify's API website https://developer.spotify.com/documentation/web-api/reference/tracks/get-audio-features/, the varibles are described as (paraphrased by me):

Danceability: How suitable a track is for dancing based on a combination of musical elements, with 1 being most danceable, and 0 being least.

Energy: A "perceptual measure of intensity and activity". This is measured by looking at dynamic range, perceived loudness, and timbre.

Key: Estimated key of track. C = 0, C# = 1, ..., B = 11.

Loudness: The average loudness (dB) of a track.

Speechiness: The presence of spoken words, with 1 being exclusively spoken words, and 0 being none.

Acoustiness: "A confidence measure from 0.0 to 1.0 of whether the track is acoustic."

Liveness: The precene of an audience in the recording.

Valence: How positive/happy a track sounds.

While these do not give an amazing explanation for how they are measured, they tell us the general idea for what each variable represents. This can be used by us to analyze the trends for songs over time periods.

Data Analysis

First, lets measure some of the variables over a period of time. I have initially created some scatterplots of variables that, when looking at all songs, have some interesting trends and outliers. Matplotlib is used to make the graphs. Information on scatter plots can be found here https://matplotlib.org/3.3.3/api/_as_gen/matplotlib.pyplot.scatter.html

In [608]:
fig = plt.figure(figsize=(15,8))

# Plot of song length, settings
ax = fig.add_subplot()
ax.scatter(song_table.year, song_table.duration_min)
ax.set_title("Song Length vs Year", fontsize=15)
ax.set_xlabel('Year',fontsize = 15) #xlabel
ax.set_ylabel('Song Length (minutes)', fontsize = 15)#ylabel

# shows name of outliers: songs longer than 450000ms, or 7.5 min
for i in range(song_table.shape[0]):
    if song_table.iloc[i]["duration_ms"] > 450000:
        ax.annotate(song_table.iloc[i]["title"], (song_table.iloc[i]["year"] + .5, song_table.iloc[i]["duration_min"]))
        
plt.show()

In the "Song Length vs Year" plot, we can see that the average duration of songs increased dramatically from around 2.5 minutes in the 60s, to almost 5 minutes in the 80s and 90s, and back down to 4 minutes in the present day. This is the only variable that has seemed to both increase and decrease over this time period. The increase starting in the 70s may be due to Rock music becoming popular, with these songs being longer due to guitar solos and long instrumentals. This lines up with the decline in rock at the end of the 90s, as the song length goes back down. When looking at the outliers, we can see some unusually long songs, especially for those on the Billboard top 10. I believe that the version of Funkytown in the playlist was the 8 minute version instead of the 4 minute version, which most likely was the one played on radio. Additionally, Green Day's "Holiday / Boulevard of Broken Dreams" is listed as one song, but likely was two songs when played on the radio.

Next, lets look at a variable that was not initially clear, such as "speechiness".

In [609]:
# same settings, new figure
fig2 = plt.figure(figsize=(15,8))
ax = fig2.add_subplot()
ax.scatter(song_table.year, song_table.speechiness)
ax.set_title("Song \"Speechiness\" vs Year", fontsize=15)
ax.set_xlabel('Year',fontsize = 15) #xlabel
ax.set_ylabel('Song Speechiness (percent)', fontsize = 15)#ylabel

# labels outliers: speechiness is greater than .35, which only a few songs have
for i in range(song_table.shape[0]):
    if song_table.iloc[i]["speechiness"] > .35:
        ax.annotate(song_table.iloc[i]["title"], (song_table.iloc[i]["year"] + .5, song_table.iloc[i]["speechiness"]))
        
plt.show()

We can see that overall, most songs for every year have a very low speechiness. However, most of the points with speechiness above .4 are hip hop songs, and are coming after the year 2000. While this variable does a good job at recognizing certian songs as hip hop, it may not do a good job at predicting year a whole, as many of current songs are not hip hop, and still have a low speechiness.

Next, lets take an average for each of these variables for a given year. Then, a plot of their mean vs year will be displayed, alongside their non-mean scatterplot. This will allow us to further see trends in the data

In [610]:
# list of variable names
variables = ["danceability", "energy", "loudness", "speechiness", "acousticness", "instrumentalness", "liveness", "valence", "tempo", "duration_min"]

# plots all songs, and average per year next to one another
fig3, axes = plt.subplots(len(variables), 2, figsize=(15,40), squeeze=False)
plt.subplots_adjust(hspace=.4)

# graphs each variable vs year
for i in range(len(variables)):
    song_table.plot(kind='scatter', x="year", y=variables[i], ax=axes[i, 0],
                    title = "Song " + variables[i] + " vs Year")
    mean_year.plot(kind='scatter', x="year", y=variables[i], ax=axes[i, 1],
                  title = "Mean " + variables[i] + " vs Year")

plt.show()

It seems that many of the variables are correlated with year when looking at their scatterplots. Danceability, energy, and loudness, all have significant increases over the time period analyzed. When thinking about popular songs, this makes sense, as dance and party music has become more popular over time. When looking at the loudness scatterplot, it is noteworthy that around 1970, loudness immediately drops, and then has a steep increase around 2000. I am not sure why this would be, as rock music is often very loud. This may be due to some change in how songs are produced, such that the volume level has such a sharp change. However, it can be seen that in general, there is an upward trend.

Acousticness and Intstrumentalness have both gone down, with acousticness having a stronger correlation from the scatterplots. Songs have become less acoustic and have less instrumental over time, as music has become more electronic. Acousticness has decreased, but there are still songs from the 2010s that have high acousticness, and those in the 50s with low acousticness. Additionally, it seems that most songs, even in the 50s, have low instrumentalness, and the trend is only pulled up by a few outliers in each year.

Finally, looking at liveness, valence, and tempo, there is no clear trend among the data. This makes sense, because none of these recordings are going to have a live audience, there is no reason for music to have become more "happy" over time, and songs can always have a wide range of tempos.

Next, we will run a linear regression on each variable's average per year, showing the regression and residual plot. I couldn't get these to display side by side, so you will have to scroll. This will be done using skikit-learn's linear regression function, with documentation here https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LinearRegression.html

In [611]:
# make plots for each variable
for i in range(len(variables)):
    
    # normal scatterplot
    mean_year.plot(kind='scatter', x="year", y=variables[i],
              title = "Mean " + variables[i] + " vs Year with Lin Regression")

    # fit regression with varaible
    regression = LinearRegression().fit(mean_year["year"].to_numpy().reshape(-1, 1), mean_year[variables[i]])
    plt.plot(mean_year["year"], regression.coef_[0] * mean_year["year"] + regression.intercept_, color="red")

    # residual column added
    res_col_name = "res_" + variables[i]
    mean_year[res_col_name] = mean_year.apply(lambda row: 0.0, axis=1) #set default value first

    # residual calculated
    for j in range(mean_year.shape[0]):
        mean_year.at[j, res_col_name] = mean_year.at[j, variables[i]] - regression.predict(mean_year.at[j, "year"].reshape(-1, 1))

    # plot residuals
    mean_year.plot(kind='scatter', x="year", y=res_col_name,
          title = "Mean " + variables[i] + " Residuals")

The regressions for each of the variables show expected results based on the scatterplots. The residuals for danceability, energy, and loudness are low, random, and show no patterns, which is good for a linear regression. The same can be said for liveness, valence, and tempo, however the correlation between year and these variables is already weak.

For danceability, energy, and loudness, lets look at the R^2 coefficient, and its p-value.

In [612]:
r, p_value = pearsonr(mean_year['year'], mean_year['danceability'])
print("R^2 for year vs danceability: " + str(np.power(r,2)))
print("p-value for year vs danceability: " + str(p_value))

r, p_value = pearsonr(mean_year['year'], mean_year['energy'])
print("R^2 for year vs energy: " + str(np.power(r,2)))
print("p-value for year vs energy: " + str(p_value))

r, p_value = pearsonr(mean_year['year'], mean_year['loudness'])
print("R^2 for year vs loudness: " + str(np.power(r,2)))
print("p-value for year vs loudness: " + str(p_value))
R^2 for year vs danceability: 0.6578371045076544
p-value for year vs danceability: 1.7209880650121242e-17
R^2 for year vs energy: 0.5150629686192544
p-value for year vs energy: 2.7277109853806854e-12
R^2 for year vs loudness: 0.6604118316701051
p-value for year vs loudness: 1.3287180446864125e-17

It is seen that all three have a somewhat high R^2, raning from 51% to 66%. This means that this amount of the variation in each of the three variables is explained by variation in year. For each of the three combinations, we test the null hypothesis that there is no correlation between the two variables. For each one, the p-value is extremely low, meaning that we reject the null hypothesis, as the probability of seeing results as extreme as ours would be very low if it were true.

For the plot of duration_min and year, it seems that the relationship is quadratic. Lets make a regression with quadratic terms to see if this will fit the data better. This is done using skikit-learn's polynomial features function, with docs shown here https://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.PolynomialFeatures.html

In [613]:
# polynomial features 2, will make array of [1, year, year^2]
poly = PolynomialFeatures(degree=2)
in_arr = poly.fit_transform(mean_year["year"].to_numpy().reshape(-1, 1))

# fit regression
regression = LinearRegression().fit(in_arr, mean_year["duration_min"])

# normal scatterplot
mean_year.plot(kind='scatter', x="year", y="duration_min",
          title = "Mean duration_min vs Year with Quadratic Regression")

# plot regression line
plt.plot(mean_year["year"], np.dot(in_arr, regression.coef_) + regression.intercept_, color="red")

# calculate residual
mean_year["res_duration_min2"] = mean_year.apply(lambda row: row.duration_min - regression.predict(poly.fit_transform([[row.year]])), axis=1)

# plot residuals
mean_year.plot(kind="scatter", x="year", y="res_duration_min2", title = "Mean duration_min Residuals with Quadratic Regression")

# plot useful information
print("duration_min = " + str(regression.coef_[2]) + "x^2 + "  + str(regression.coef_[1]) + "x + " + str(regression.intercept_))
print("R^2: " + str(regression.score(in_arr, mean_year["duration_min"])))
duration_min = -0.0012136990327853425x^2 + 4.838345631299008x + -4817.655144566868
R^2: 0.735552951203934

The quadratic regression predicts this much better, and the residual graph is also shown to be better here. Based on the regression, song length can be predicted based on the year it was made in, and the square of that year. The R^2 is also high at 73.5%, which further shows that there is a strong relationship here.

Now, what if instead of predicting variables from year, we wanted to predict year given these variables? In other words, given a song, listen to it, find information about its metadata, and the predict what year the song was made in. For this, lets try a linear regresseion, with each of the variables that seemed to have some correlation, being an input variable, and year being the output variable.

In [614]:
# variables used in this regression
used_vars = ["danceability", "energy", "loudness", "speechiness", "acousticness", "instrumentalness", "duration_ms"]

y = song_table.year.to_numpy() #predicting year
xs = song_table[used_vars].to_numpy() # using these variables

regression = LinearRegression().fit(xs, y)

# calculate residuals
song_table["residual_mult"] = song_table.apply(lambda row: row.year - regression.predict(row[used_vars].to_numpy().reshape(1, -1)), axis=1)
song_table["residual_mult"] = song_table.apply(lambda row: row.residual_mult[0], axis=1)
In [615]:
# show residuals
song_table.plot(kind='scatter', x="year", y="residual_mult",
          title = "Mean duration_min vs Year with Quadratic Regression")
Out[615]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1ca19d0970>

Looking at the residuals for the regression, it can be seen that songs in the 50s have large negative residuals, and songs in the 2010s have large positive residuals. This is not good for our model, and tells us that this model may not be great for measuring the year of a song.

Lets try a linear regression with interactive variables for each of the inputs. The degree of the polynomial will be 2, so each input will have a squared term (x^2), and a linear term (x), along with an interaction with other variables (x*y).

In [616]:
# polynomial featuers, but this time each x value has constant term
# linear term, and quadratic term, as well as an interaction term
# with all other variables
poly = PolynomialFeatures(degree=2)
in_arr = poly.fit_transform(xs)

regression2 = LinearRegression().fit(in_arr, y)
In [617]:
# calculate residuals
song_table["residual_inter"] = song_table.apply(lambda row: row.year - regression2.predict(poly.fit_transform(row[used_vars].to_numpy().reshape(1, -1))), axis=1)

# plot residuals
song_table.plot(kind='scatter', x="year", y="residual_inter",
                title = "Regression Residual per Song Interactive Regression")


#print(regression2.coef_)
Out[617]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1ca107bb80>

This still seems to do poorly when looking at residual plot. This should not be the case, especially considering how many variables were used in the regression, I would predict overfitting to be more likely than poor predictions. Why was year able to predict each variable reasonably well, but they cannot predict year? Part of it may be due to song duration having a squared relationship with year. When switching the x and y axis of this graph, it can be seen that for a given duration, there are two possible years to predict. This would make it hard for this variable to have positive impact in the regression.

Decision Tree

Another way that we can predict year is by using a decision tree. With this not necessarily being based on a linear relationship, it may do better at predicting year. The scikit-learn library has a decision tree regressor that can be used for our needed purpose. Lets make a tree based on our input and output variables. Its documentation can be found here with examples https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html

When making the decision tree, we need to choose some of the parameters that will be used to make the tree. First, for the criterion, "Mse" is selected, so mean squared error will be minimized as the loss function. Next, the decision for what variable to split on at each node will be decided using the best split, which will help us get a strong prediction, as well as getting the same tree each time. Now, we must choose between some of the other important variables for the tree, such as max_depth, and min_samples_split. Max depth says how far down the tree can go. The further down, the more splits that can be made, but there is an increased risk of overfitting. Min samples split tells us that if there are a certain number of samples in a given node, then we need to continue splitting (unless we are at max_depth). This has similar problems, with two small of a split size could lead to overfitting.

Before deciding what values are best for the parameters, lets just try to run the tree with min_samples_split equal to 4, to see how the residuals of the prediction are.

In [618]:
# make decision tree
regressor = DecisionTreeRegressor(criterion="mse", splitter="best", min_samples_split=4)
reg_tree = regressor.fit(xs, y)

# calculate residuals
song_table["residual"] = song_table.apply(lambda row: row.year - reg_tree.predict(row[used_vars].to_numpy().reshape(1, -1)), axis=1)
song_table["residual"] = song_table.apply(lambda row: row.residual[0], axis=1)

It can be seen in the residual plot below that the residuals are largely 0 and are distributed randomly. This is helpful, and tells us that our model is on the right track.

In [619]:
# average residuals
mean_year = song_table.groupby(["year"], as_index=False).mean()

# plot residuals for every song
song_table.plot(kind='scatter', x="year", y="residual",
                title = "Regression Residual per Song")
Out[619]:
<matplotlib.axes._subplots.AxesSubplot at 0x7f1caa067b80>

Cross Validation

To find what the best max depth is, lets do some cross validation to see how well each version of the model predicts songs that are not included in the training set.

The model is created for each max_depth from 4 to 20, and then split randomly into a training set containing 90% of the data, and a test set containing the remaining 10%. This is repeated 10 times over a random split of the data. Each time, the testing data is used on the model created, and the sum of the absoluted value of the residuals is added to the score. These are averaged together to get the average residual error, and averaged again to get the average residual for a given model using cross validation.

In [620]:
score_per_depth = []

# iterate over depths of 4 through 19
for i in range(4,20):
    
    # regressor made, and fit with data
    regressor3 = DecisionTreeRegressor(criterion="mse", splitter="best", min_samples_split=4, max_depth=i)
    reg_tree = regressor3.fit(xs, y)

    scores = []
    for j in range(10):
        
        # training and testing data selected
        x_train,x_test,y_train,y_test = train_test_split(xs,y,test_size=0.1,shuffle=True)
        reg_tree5 = regressor3.fit(x_train, y_train)
        
        # sum of abs(residuals) is calculated
        sum = 0
        for tx, ty in zip(x_test, y_test):
            sum += abs(reg_tree5.predict(tx.reshape(1, -1)) - ty)
            
        # average of residuals added
        scores.append(sum / (700 * .1))
    
    
    # average of all training sets is stored
    score_per_depth.append(np.asarray(scores).mean())

Lets see what the residuals were for each depth

In [621]:
# depths go from 4-19
depths = [i for i in range(4,20)]
score_per_depth

# plot scores
fig3 = plt.figure(figsize=(10,5))
ax6 = fig3.add_subplot()
ax6.scatter(depths, score_per_depth)
ax6.set_title("Max Depth Cross Validation", fontsize=15)
ax6.set_xlabel('Max Depth',fontsize = 15) #xlabel
ax6.set_ylabel('Average Residual', fontsize = 15)#ylabel
Out[621]:
Text(0, 0.5, 'Average Residual')

From the test we just did, it seems that all max_depths perform at reasonable the same level, at around 9 to 10 average error for a given song. This is pretty good, considering that the songs over a given 10 year period are going to be roughly the same. Since they are all about the same, I would prefer to pick a tree with a smaller max depth for simplicity, so I choose depth = 5 (the second item in list) for the final model, as that also was the best performer. I choose the training and testing sets randomly, so it may be slightly different results if you run it again.

A downside of using a decision tree is that, with many different splits, it is not as intuitive as to how each variable impacts the year prediction, as opposed to a linear regression, which shows how a change in each variable directly changes the year. However, with small depths, we can see a general idea of how the tree is splitting, and what variables it is looking at.

Lets look at the same tree with a max depth of 2

In [622]:
regressor4 = DecisionTreeRegressor(criterion="mse", splitter="best", max_depth=2)

# show decision tree
plt.figure(figsize=(10,10))
tree.plot_tree(regressor4.fit(xs, y), feature_names = used_vars, filled=True)
Out[622]:
[Text(279.0, 453.0, 'duration_ms <= 186167.0\nmse = 408.25\nsamples = 700\nvalue = 1985.5'),
 Text(139.5, 271.8, 'danceability <= 0.729\nmse = 294.604\nsamples = 207\nvalue = 1966.831'),
 Text(69.75, 90.59999999999997, 'mse = 133.342\nsamples = 172\nvalue = 1963.029'),
 Text(209.25, 90.59999999999997, 'mse = 666.993\nsamples = 35\nvalue = 1985.514'),
 Text(418.5, 271.8, 'loudness <= -7.084\nmse = 248.179\nsamples = 493\nvalue = 1993.339'),
 Text(348.75, 90.59999999999997, 'mse = 177.352\nsamples = 275\nvalue = 1985.087'),
 Text(488.25, 90.59999999999997, 'mse = 143.29\nsamples = 218\nvalue = 2003.748')]

It seems that the variables it splits on are duration, danceability, and loudness. If a song is shorter than 186000 ms (about 3 minutes) and not danceable, it is predicted to be from the 60s. If it is short and danceable, it is the 80s. Long and quiet is the 80s, and long and loud is the 2000s. We can also see how many songs go into each bin at the bottom, looking at the samples number.

This can tell us intitively that in the 60s, songs were short and not very dance focused. Towards the 80s, if a song was short, it was danceable (pop/dance music), and if it was long, it was not loud (possibly long rock song). In the present day, songs are still fairly long, and are very loud (modern day pop and hip hop). For our model with more depth, it is hard to say exactly how each of the variables is used, but this simplified version can give us an idea for the general trend of music over the past 70 years.

Finally, lets take an R^2 value of our final model to see how well it explains the inputted data. Since the max depth is 5, the min_samples_split variable is useless since we will never get to a level where there are only 4 samples, so this will be removed.

In [623]:
regressor_final = DecisionTreeRegressor(criterion="mse", splitter="best", max_depth=5)
reg_tree = regressor_final.fit(xs, y)
r2 = regressor_final.score(xs, y)
print("R^2 is " + str(r2))
R^2 is 0.8002047836139728

With an R^2 of 80%, about 80% of the variation in year can be explained by the variables used in the decision tree. This is very good, and shows that there is a clear relationship.

Conclusion

In conclusion, it seems that music has had a drastic change in its sound over the past 70 years. We have seen louder, more danceable music come about, especially in the last 20 years, with hip hip becoming almost the new version of pop when it comes to the charts. We have also seen some information that was non necessarily expected, such as that hit songs became longer starting in the 80s, and started to come back down in length over the past 20 years. When trying to predict an overall trend for how things have changed over the years, it is fairly doable, but there will always be outlier songs due to how much differentiation there is in music, even on the charts.

A decision tree was helpful at predicting year given the non-linear relationships many of the variables have with year, and showed that year can generally be predicted within 10 years of accuracy. The next step logically would be to predict future years and see how they are expecteed to perform in some of these metrics. However, I have my doubts that the future of music will be so predictable, and with many of these variables being between 0 and 1, there is a limit to how far these variables can go in one direction.

Overall, the music we have seen on the charts has shown that trends do exist in popular music. It would be cool to see how trends affect different genres individually. In the, future, this could be done by only looking at Rock charts, or Pop charts. Additionally, different metrics could be used to choose the songs for each year. Instead, we could look at the grammy winners/nominees for each year, or a critics top songs of the year to see how critically acclaimed music sounds differently each year. But for now, we should appreciate how much variety there is in music, and how many unexpected changes there will be in the future.