Data Analysis using SQL

mahmoud chami
2 min readFeb 1, 2023

--

One of the tasks of a Data Analyst is to analyze the data, and there are different tools to do that. But this time I want to try new things which is creating tables from our data using SQL. And to do that we need to follow these steps:

Step1: Import the libraries

import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
#You need to add the path of your data in this section
path = "../your path"
database = path + 'database.sqlite'

Step2: Create the connection to the DataBase and create variables for each sheet

data = sqlite3.connect(database)
tables = pd.read_sql("""SELECT *
FROM sqlite_master
WHERE type='table';""", data)

And we do the same thing for the other sheet:

countries = pd.read_sql("""SELECT *
FROM Country;""", data)
leagues = pd.read_sql("""SELECT *
FROM League
JOIN Country ON Country.id = League.country_id;""", data)

Depending on your logic and your analysis you can modify the code. If you want the same data or you want to have only the 10 first elements or choose some columns you are interested in and to do this you can use “join” to select the elements you are interested in. Here is an example of a Join

detailed_matches = pd.read_sql("""SELECT Match.id, 
Country.name AS country_name,
League.name AS league_name,
season,
stage,
date,
HT.team_long_name AS home_team,
AT.team_long_name AS away_team,
home_team_goal,
away_team_goal
FROM Match
JOIN Country on Country.id = Match.country_id
JOIN League on League.id = Match.league_id
LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
WHERE country_name = 'Spain'
ORDER by date
LIMIT 10;""", data)

Step3: Analysis of the Data

Finally, we can do our analysis of our data using the library Matplotlib to draw graphs to make our analysis.

df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())

df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goals'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goals'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goals'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goals'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goals'])

df.plot(figsize=(12,5),title='Over Time Goals Per Game Average')
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())

df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_dif'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goal_dif'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goal_dif'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_dif'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_dif'])

df.plot(figsize=(12,5),title='Average Goals Difference In vs Out')

Conclusion

To sum up, the purpose of this article is to show that we have different ways to use our data either with SQL Queries or by reading a CSV file depending on your need and requirements. However, as a data Analyst, you should know different ways and tools to manipulate your data, so you can make good conclusions and decisions.

--

--

mahmoud chami

I am Mahmoud Chami, I am an international polyvalent engineering student at the Institute of Advanced Industrial Technologie.