Data Analysis using SQL
One of the tasks of a Data Analyst is to analyze the data, and there are different tools to do that. But this time I want to try new things which is creating tables from our data using SQL. And to do that we need to follow these steps:
Step1: Import the libraries
import numpy as np
import pandas as pd
import sqlite3
import matplotlib.pyplot as plt
#You need to add the path of your data in this section
path = "../your path"
database = path + 'database.sqlite'
Step2: Create the connection to the DataBase and create variables for each sheet
data = sqlite3.connect(database)
tables = pd.read_sql("""SELECT *
FROM sqlite_master
WHERE type='table';""", data)
And we do the same thing for the other sheet:
countries = pd.read_sql("""SELECT *
FROM Country;""", data)
leagues = pd.read_sql("""SELECT *
FROM League
JOIN Country ON Country.id = League.country_id;""", data)
Depending on your logic and your analysis you can modify the code. If you want the same data or you want to have only the 10 first elements or choose some columns you are interested in and to do this you can use “join” to select the elements you are interested in. Here is an example of a Join
detailed_matches = pd.read_sql("""SELECT Match.id,
Country.name AS country_name,
League.name AS league_name,
season,
stage,
date,
HT.team_long_name AS home_team,
AT.team_long_name AS away_team,
home_team_goal,
away_team_goal
FROM Match
JOIN Country on Country.id = Match.country_id
JOIN League on League.id = Match.league_id
LEFT JOIN Team AS HT on HT.team_api_id = Match.home_team_api_id
LEFT JOIN Team AS AT on AT.team_api_id = Match.away_team_api_id
WHERE country_name = 'Spain'
ORDER by date
LIMIT 10;""", data)
Step3: Analysis of the Data
Finally, we can do our analysis of our data using the library Matplotlib to draw graphs to make our analysis.
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goals'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goals'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goals'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goals'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goals'])
df.plot(figsize=(12,5),title='Over Time Goals Per Game Average')
df = pd.DataFrame(index=np.sort(leages_by_season['season'].unique()), columns=leages_by_season['country_name'].unique())
df.loc[:,'Germany'] = list(leages_by_season.loc[leages_by_season['country_name']=='Germany','avg_goal_dif'])
df.loc[:,'Spain'] = list(leages_by_season.loc[leages_by_season['country_name']=='Spain','avg_goal_dif'])
df.loc[:,'France'] = list(leages_by_season.loc[leages_by_season['country_name']=='France','avg_goal_dif'])
df.loc[:,'Italy'] = list(leages_by_season.loc[leages_by_season['country_name']=='Italy','avg_goal_dif'])
df.loc[:,'England'] = list(leages_by_season.loc[leages_by_season['country_name']=='England','avg_goal_dif'])
df.plot(figsize=(12,5),title='Average Goals Difference In vs Out')
Conclusion
To sum up, the purpose of this article is to show that we have different ways to use our data either with SQL Queries or by reading a CSV file depending on your need and requirements. However, as a data Analyst, you should know different ways and tools to manipulate your data, so you can make good conclusions and decisions.