Spotify Dataset

3 min read · Jan 11, 2023

This project performs exploratory data analysis in Python on music-related datasets, then uses Power BI for visualization. You can find the code on my GitLab: https://gitlab.com/chamimahmoud/spotify-datasets

We begin by importing the libraries we need and loading our data.

import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt
import seaborn as sns

df_tracks = pd.read_csv('../input/spotify-datasets/tracks.csv')

To get an idea of the data and its columns and rows, we can call df_tracks.head(), which shows the first few rows of the dataset.
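As a minimal sketch of this first look, the snippet below uses a small made-up DataFrame in place of tracks.csv (the name `toy_tracks` and its values are hypothetical). Alongside head(), it reads shape and dtypes, which give the row and column counts and the type of each column.

```python
import pandas as pd

# Hypothetical three-row stand-in for tracks.csv; the values are made up.
toy_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C"],
    "popularity": [12, 95, 40],
    "energy": [0.31, 0.82, 0.55],
})

first_rows = toy_tracks.head(2)     # only the first two rows
n_rows, n_cols = toy_tracks.shape   # (number of rows, number of columns)
column_types = toy_tracks.dtypes    # one dtype per column

print(first_rows)
print(n_rows, n_cols)
print(column_types)
```

On the real dataset, the same three calls on df_tracks should give a quick overview before any plotting.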

Next, we check whether the dataset contains any null values:

pd.isnull(df_tracks).sum()
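To illustrate what the null check returns, here is a small sketch on made-up data (the `toy_tracks` frame is hypothetical). It counts missing values per column and then drops the rows where a chosen column is missing, which is one possible way to handle them, not necessarily the approach used in this project.

```python
import pandas as pd

# Hypothetical data with one missing track name.
toy_tracks = pd.DataFrame({
    "name": ["Song A", None, "Song C"],
    "popularity": [12, 95, 40],
})

# Number of missing values in each column.
null_counts = toy_tracks.isnull().sum()

# One option: drop rows whose 'name' is missing.
cleaned = toy_tracks.dropna(subset=["name"])

print(null_counts)
print(len(cleaned))
```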

A first question we can ask of the data is which songs are the most popular. We keep the tracks with a popularity above 90, sort them in descending order, and look at the first 10:

most_popular = df_tracks.query('popularity>90',inplace=False).sort_values('popularity',ascending=False)
most_popular[:10]
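An alternative sketch, assuming only that the data has a numeric popularity column: nlargest returns the top N rows directly, without choosing a threshold first. The `toy_tracks` frame and its values are made up for illustration.

```python
import pandas as pd

# Hypothetical data; the values are made up.
toy_tracks = pd.DataFrame({
    "name": ["Song A", "Song B", "Song C", "Song D"],
    "popularity": [91, 99, 40, 95],
})

# Top 2 rows by popularity, already sorted in descending order.
top_two = toy_tracks.nlargest(2, "popularity")

print(top_two)
```

The threshold version returns fewer than 10 rows when fewer than 10 tracks score above 90, while nlargest always returns N rows as long as the data has that many.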

We can also check the correlation between the features. A common way to do this is a heatmap from the seaborn library, which visualizes the pairwise correlations:

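# select_dtypes('number') keeps only the numeric columns, which is what corr() needs
corr_df=df_tracks.drop(['key','mode','explicit'],axis=1).select_dtypes('number').corr(method="pearson")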
plt.figure(figsize=(14,8))
heatmap = sns.heatmap(corr_df,annot=True,fmt=".1g",vmin=-1,vmax=1,center=0,cmap='inferno', linewidths=1,linecolor="Black")
heatmap.set_title("Correlation HeatMap Between Variables")
heatmap.set_xticklabels(heatmap.get_xticklabels(),rotation=90)

The heatmap shows that the highest correlation is between loudness and energy. We can visualize this relationship with a regression plot on a small random sample of the data:

sample_df = df_tracks.sample(int(0.004*len(df_tracks)))
plt.figure(figsize=(14,8))
sns.regplot(data = sample_df ,y = "loudness",x="energy", color='c').set(title = "The correlation between Loudness and Energy")
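Reading the strongest pair off a heatmap can also be done numerically. The sketch below is one way to do it, using a made-up frame (`toy` and its values are hypothetical, chosen so that energy and loudness move together): it computes the Pearson correlation matrix and picks the pair of distinct columns with the highest coefficient.

```python
from itertools import combinations

import pandas as pd

# Hypothetical audio features; the values are made up.
toy = pd.DataFrame({
    "energy":       [0.1, 0.4, 0.5, 0.8, 0.9],
    "loudness":     [-20.0, -14.0, -11.0, -6.0, -4.0],
    "acousticness": [0.9, 0.5, 0.7, 0.1, 0.3],
})

corr = toy.corr(method="pearson")

# Correlation for every unordered pair of distinct columns.
pair_scores = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# Pair with the highest (most positive) correlation.
strongest_pair = max(pair_scores, key=pair_scores.get)

print(strongest_pair, round(pair_scores[strongest_pair], 3))
```

To rank by strength regardless of sign, the same idea works with the absolute value of each coefficient as the key.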

We can also check how many songs were released each year:

# release_date is read as an ordinary column by pd.read_csv, so convert it directly
df_tracks['dates']=pd.to_datetime(df_tracks['release_date'])
years=df_tracks.dates.dt.year
sns.displot(years,discrete=True,aspect=2,height=5,kind='hist').set(title="Number of songs per year")
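The counts behind this histogram can also be computed as a table. This is a minimal sketch on made-up release dates (the `toy_tracks` frame is hypothetical, and all dates are assumed to be full YYYY-MM-DD strings): it extracts the year and counts the tracks per year.

```python
import pandas as pd

# Hypothetical release dates, all in YYYY-MM-DD form.
toy_tracks = pd.DataFrame({
    "release_date": ["2019-01-04", "2019-06-21", "2020-03-13",
                     "2020-11-27", "2020-12-04"],
})

# Extract the year from each date.
years = pd.to_datetime(toy_tracks["release_date"]).dt.year

# Number of tracks per year, ordered by year.
songs_per_year = years.value_counts().sort_index()

print(songs_per_year)
```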

Finally, a bar chart shows how the duration of songs has changed over the years. Duration peaked in 1935, and it has been going down in recent years:

total_dr = df_tracks.duration  # assumes df_tracks has a 'duration' column
fig_dims = (18,7)
fig, ax = plt.subplots(figsize=fig_dims)
sns.barplot(x=years, y=total_dr, ax=ax, errwidth=False).set(title="Year vs Duration")
plt.xticks(rotation=90)
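By default, sns.barplot draws the mean of y for each value of x, so the bar heights correspond to the average duration per year. The sketch below computes that aggregate directly on made-up data (the `toy_tracks` frame, its column names, and its values are hypothetical).

```python
import pandas as pd

# Hypothetical years and durations; the values are made up.
toy_tracks = pd.DataFrame({
    "year": [1935, 1935, 2019, 2019, 2020],
    "duration": [300, 340, 210, 190, 180],
})

# Average duration for each year (what the bars represent).
mean_duration = toy_tracks.groupby("year")["duration"].mean()

# Year with the longest average duration.
longest_year = mean_duration.idxmax()

print(mean_duration)
print(longest_year)
```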

To explore the data further, we used Power BI to build a general overview of it.


Written by mahmoud chami