Premier League Matches 1993–2022

Mahmoud Chami
6 min read · Feb 22, 2023

Introduction

This dataset covers every match played in the English Premier League, from its inaugural 1992–93 season to the final week of the 2021–22 season.

About the Premier League

For those who are not familiar with football, here is a short introduction to the Premier League.

The Premier League was founded on 20 February 1992 following the decision of clubs in the Football League First Division to break away from the Football League, founded in 1888, and take advantage of a lucrative television rights sale to Sky.

In total, 50 clubs have competed since the inception of the Premier League in 1992: forty-eight English and two Welsh clubs. Seven teams overall have won the title: Manchester United (13), Manchester City (6), Chelsea (5), Arsenal (3), Blackburn Rovers (1), Leicester City (1), and Liverpool (1).

Analysis ideas 💡

In this article, we will answer three major questions:

  1. Does playing at the home stadium give a team any advantage? And if so, how large is that advantage?
  2. What is the best way to collect points, defensive or attacking play?
  3. Who is the best club manager in PL history?

Explore Data

1) Import the libraries:

The libraries we will use in this project are Pandas, Seaborn, and Matplotlib.

2) Import the data:

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv('your_/eplmatches.csv')

3) Explore and analyze the data

Let’s take a look at our data:

data.shape  # Number of rows and columns
data.info()  # Column names, data types, and non-null counts
data.isnull().sum()  # Number of missing values in each column

Before we begin our analysis, we need to understand the meaning of each column:

Season_End_Year: The year in which the season ended
Wk: Week number of the match
Date: Date of the match
Home: Home team name
HomeGoals: Goals scored by the home team
AwayGoals: Goals scored by the away team
Away: Away team name
FTR: Full-time result: H (home win), A (away win), or D (draw)
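
The code later in this article also reads three columns that are not in the list above: Result, WinningTeam, and LosingTeam. How they were built is not shown, so the following is only a sketch of one plausible derivation. It assumes that Result is a "home - away" score string and that WinningTeam and LosingTeam are left empty for draws. The sample rows and the helper add_derived_columns are made up for illustration.

```python
import pandas as pd

# Made-up sample rows with the same columns as the dataset (illustration only).
sample = pd.DataFrame({
    "Home": ["Arsenal", "Chelsea", "Everton"],
    "Away": ["Leeds United", "Liverpool", "Fulham"],
    "HomeGoals": [2, 0, 1],
    "AwayGoals": [1, 3, 1],
    "FTR": ["H", "A", "D"],
})

def add_derived_columns(df):
    """Return a copy with Result, WinningTeam and LosingTeam added (assumed definitions)."""
    out = df.copy()
    # Score as text with the home goals first, e.g. "2 - 1".
    out["Result"] = out["HomeGoals"].astype(str) + " - " + out["AwayGoals"].astype(str)
    home_won = out["FTR"] == "H"
    away_won = out["FTR"] == "A"
    # Winner: home team if FTR is H, away team if FTR is A, missing for a draw.
    out["WinningTeam"] = out["Home"].where(home_won, out["Away"].where(away_won))
    # Loser: the other side of the same match, missing for a draw.
    out["LosingTeam"] = out["Away"].where(home_won, out["Home"].where(away_won))
    return out

derived = add_derived_columns(sample)
print(derived[["Result", "WinningTeam", "LosingTeam"]])
```

On the full dataset the call would be add_derived_columns(data).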

Now that we understand our data, let’s answer the questions.

Question 1:

Does playing at the home stadium give a team any advantage? And if so, how large is that advantage?

The first step is to plot the average number of goals each team scores in its home matches, and to compute the overall average of home goals per match.

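plt.figure(figsize=(20, 8))
# barplot shows the mean of HomeGoals for each home team
sns.barplot(data=data, x="Home", y="HomeGoals")
plt.xticks(rotation=90)  # keep the team names readable
mean_home_goals = data["HomeGoals"].mean()
mean_home_goals  # The result is 1.52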

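Next, we plot the average number of goals each home team concedes, that is, the goals scored by the visiting side, and compute the overall average of away goals per match:

plt.figure(figsize=(20, 8))
# barplot shows the mean of AwayGoals in each team's home matches
sns.barplot(data=data, x="Home", y="AwayGoals")
plt.xticks(rotation=90)  # keep the team names readable
mean_away_goals = data["AwayGoals"].mean()
mean_away_goals  # The result is 1.14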

To confirm this, we draw a pie chart of the “FTR” column showing the share of home wins (H), away wins (A), and draws (D):

home_win = (data['FTR'] == 'H').mean()  # share of matches won by the home team
away_win = (data['FTR'] == 'A').mean()  # share of matches won by the away team
draw = (data['FTR'] == 'D').mean()  # share of matches drawn
FTR = [home_win, away_win, draw]
Type_of_FTR = ['Home', 'Away', 'Draw']
plt.figure(figsize=(14, 8))
palette_color = sns.color_palette('bright')
plt.pie(FTR, labels=Type_of_FTR, colors=palette_color, autopct='%.0f%%')

Summary of this first part:

The graphs show that most teams average at least one goal per match when they play at home. The mean of home goals is 1.52 per match, while the mean of away goals is 1.14, a difference of about 0.38 goals per match in favour of the home side. The pie chart shows that about 46% of matches ended in a home win. In conclusion, playing at home does give a team an advantage.
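
To express the advantage in points rather than goals, one option is to compare the average points earned by the home side with the average earned by the away side, counting 3 points for a win and 1 for a draw. The sketch below runs on made-up results, not on the article's data, and points_per_match is a hypothetical helper.

```python
import pandas as pd

# Made-up full-time results for illustration only.
sample = pd.DataFrame({"FTR": ["H", "H", "A", "D"]})

def points_per_match(ftr):
    """Average points per match for the home side and for the away side."""
    home_points = ftr.map({"H": 3, "D": 1, "A": 0})
    away_points = ftr.map({"H": 0, "D": 1, "A": 3})
    return home_points.mean(), away_points.mean()

home_ppm, away_ppm = points_per_match(sample["FTR"])
# The gap between the two averages is one way to size the home advantage.
print(home_ppm, away_ppm, home_ppm - away_ppm)
```

On the full dataset the call would be points_per_match(data['FTR']).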

Question 2:

What is the best way to collect points, defensive or attacking play?

We can check the most common scorelines in the PL and how often each one occurs:

# Count each scoreline and show the ten most frequent
data['Result'].value_counts().rename_axis('Score').reset_index(name='Count').head(10)
   Score  Count
0  1 - 1   1348
1  1 - 0   1231
2  2 - 1   1012
3  2 - 0    957
4  0 - 0    955
5  0 - 1    864
6  1 - 2    731
7  2 - 2    568
8  3 - 1    517
9  0 - 2    515

Next, we total the goals each team scored at home and away:

# Goals scored by each team in its home matches
HomeGoalCount = data.groupby(['Home'])['HomeGoals'].sum().reset_index()
HomeGoalCount.columns = ['index', 'Home Goals']

# Goals scored by each team in its away matches
AwayGoalCount = data.groupby(['Away'])['AwayGoals'].sum().reset_index()
AwayGoalCount.columns = ['index', 'Away Goals']

TotalGoalScored = pd.merge(HomeGoalCount, AwayGoalCount, how='right', on='index')
TotalGoalScored.sort_values(by=['Home Goals', 'Away Goals'], ascending=False).head(10)

# Total goals and the home-minus-away difference for each team
TotalGoalScored['Total'] = TotalGoalScored['Home Goals'] + TotalGoalScored['Away Goals']
TotalGoalScored['Difference'] = TotalGoalScored['Home Goals'] - TotalGoalScored['Away Goals']
TotalGoalScored.sort_values(by=['Total'], ascending=False).head(10)
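We then do the same for goals conceded:

# Assumed reconstruction: the article does not show how TotalGoalConceded is built.
# At home a team concedes AwayGoals; away it concedes HomeGoals.
HomeGoalConceded = data.groupby(['Home'])['AwayGoals'].sum().reset_index()
HomeGoalConceded.columns = ['index', 'HomeGoals']  # conceded at home
AwayGoalConceded = data.groupby(['Away'])['HomeGoals'].sum().reset_index()
AwayGoalConceded.columns = ['index', 'AwayGoals']  # conceded away
TotalGoalConceded = pd.merge(HomeGoalConceded, AwayGoalConceded, how='right', on='index')

TotalGoalConceded['Total'] = TotalGoalConceded['HomeGoals'] + TotalGoalConceded['AwayGoals']
TotalGoalConceded['Difference'] = TotalGoalConceded['HomeGoals'] - TotalGoalConceded['AwayGoals']
TotalGoalConceded.sort_values(by=['Total'], ascending=False).head(10)

# Stack home and away goals conceded for the ten teams that conceded the most
Top10GoalsConced = TotalGoalConceded.sort_values(by=['Total'], ascending=False).head(10)
Top10GoalsConced.plot(x='index', y=['HomeGoals', 'AwayGoals'], kind='bar', stacked=True)

plt.title('Home vs Away Goals Conceded', fontsize=24, fontweight='bold')
plt.xlabel('Teams', fontsize=24, fontweight='bold')
plt.ylabel('Goals', fontsize=24, fontweight='bold')
plt.show()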

The answer to this question is that the best way to collect points is to attack and score.
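
The tables above count goals but do not link them to points. One way to examine the claim more directly is to compute, for each team, goals scored per match, goals conceded per match, and points per match, and then compare how strongly each goal measure correlates with points. The sketch below uses made-up matches between placeholder teams A, B, and C, and a hypothetical helper, team_summary. It illustrates the approach and is not a result from the dataset.

```python
import pandas as pd

# Made-up matches between placeholder teams (illustration only).
sample = pd.DataFrame({
    "Home": ["A", "B", "C", "A", "B", "C"],
    "Away": ["B", "C", "A", "C", "A", "B"],
    "HomeGoals": [3, 1, 0, 2, 1, 1],
    "AwayGoals": [1, 1, 2, 0, 2, 1],
})

def team_summary(df):
    """Per-team averages of goals for, goals against, and points per match."""
    # One row per team per match, seen from the home side and from the away side.
    home = pd.DataFrame({"Team": df["Home"], "For": df["HomeGoals"], "Against": df["AwayGoals"]})
    away = pd.DataFrame({"Team": df["Away"], "For": df["AwayGoals"], "Against": df["HomeGoals"]})
    long = pd.concat([home, away], ignore_index=True)
    # 3 points for a win, 1 for a draw, 0 for a loss.
    long["Points"] = 0
    long.loc[long["For"] == long["Against"], "Points"] = 1
    long.loc[long["For"] > long["Against"], "Points"] = 3
    return long.groupby("Team")[["For", "Against", "Points"]].mean()

summary = team_summary(sample)
print(summary)
# Correlation of each column with points per match.
print(summary.corr()["Points"])
```

On the full dataset, team_summary(data) would give one row per club. A strongly positive correlation for For and a strongly negative one for Against would indicate that both attack and defence matter, and comparing their sizes gives a rough sense of which matters more.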

Question 3:

Who is the best club manager in PL history?

To answer this question, we will first determine which club is the best in PL history.

# WinningTeam and LosingTeam are derived columns, not part of the raw CSV.
# rename_axis/reset_index(name=...) names the team column 'index' in any pandas version.
HomeWins = data.loc[data['FTR'] == 'H', 'WinningTeam'].value_counts().\
rename_axis('index').reset_index(name='HomeWin')
# Assumed definitions of AwayWins and HomeLosses, inferred from the names used below
AwayWins = data.loc[data['FTR'] == 'A', 'WinningTeam'].value_counts().\
rename_axis('index').reset_index(name='AwayWin')
# A home team loses when FTR is 'A'; an away team loses when FTR is 'H'
HomeLosses = data.loc[data['FTR'] == 'A', 'LosingTeam'].value_counts().\
rename_axis('index').reset_index(name='HomeLose')
AwayLosses = data.loc[data['FTR'] == 'H', 'LosingTeam'].value_counts().\
rename_axis('index').reset_index(name='AwayLose')
TotalWins = pd.merge(HomeWins, AwayWins, how='right', on='index')
TotalLosses = pd.merge(HomeLosses, AwayLosses, how='right', on='index')
# Assumed step: the article does not show how WinsLosses is built
WinsLosses = pd.merge(TotalWins, TotalLosses, how='outer', on='index')
WinsLosses.sort_values(by='HomeLose', ascending=False).head()
WinsLosses['TotalWins'] = WinsLosses['HomeWin'] + WinsLosses['AwayWin']
WinsLosses['TotalLoses'] = WinsLosses['HomeLose'] + WinsLosses['AwayLose']
WinsLosses['DifWinsLoses'] = WinsLosses['TotalWins'] - WinsLosses['TotalLoses']
WinsLosses.sort_values(by='TotalWins', ascending=False).head(10)
# Home matches per team, doubled because each team plays as many away matches as home matches
Matches = data['Home'].value_counts().rename_axis('index').reset_index(name='Matches')

Matches['Matches'] = Matches['Matches'] * 2

Matches.head(10)

Based on these two tables, we can conclude that the best team in PL history is Manchester United. After some research, its best manager is Sir Alex Ferguson, who won 13 Premier League titles between 1993 and 2013.
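
Because clubs have spent different numbers of seasons in the league, total wins can also be divided by matches played to compare teams on an equal footing. The following is a minimal sketch with made-up totals for placeholder teams. The column names follow the tables above, and win_rate_table is a hypothetical helper.

```python
import pandas as pd

# Made-up totals for illustration; the team names are placeholders.
wins = pd.DataFrame({"index": ["Team A", "Team B", "Team C"], "TotalWins": [60, 45, 20]})
matches = pd.DataFrame({"index": ["Team A", "Team B", "Team C"], "Matches": [100, 60, 50]})

def win_rate_table(wins_df, matches_df):
    """Merge wins with matches played and rank teams by share of matches won."""
    table = pd.merge(wins_df, matches_df, on="index")
    table["WinRate"] = table["TotalWins"] / table["Matches"]
    return table.sort_values("WinRate", ascending=False).reset_index(drop=True)

ranking = win_rate_table(wins, matches)
print(ranking)
```

In this made-up example, Team B ranks first despite having fewer total wins than Team A, which is the kind of difference a per-match rate is meant to reveal.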

Conclusion:

To sum up, this Premier League dataset helps us understand the history of the league and identify its most successful team, Manchester United, which has scored a large number of goals and has played in the league since it was created. As a football lover, I enjoyed analyzing this dataset and writing this article.

Dataset source: worldfootballR, Premier League


I am Mahmoud Chami, an international polyvalent engineering student at the Institute of Advanced Industrial Technologie.