Retail Sales 2018–2022

Mahmoud Chami
3 min read · Feb 8, 2023

In today’s article, we analyze a dataset of retail sales from 2018 to 2022. We first explore the data and check for missing values and outliers, and then we build a model to predict the sales.

The first step is to import the libraries. In our case we use numpy, pandas, seaborn, and matplotlib:

import numpy as np 
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

After that, we read our dataset and take a first look at the column names, the shape of the data, some key statistics, and the null values.

df = pd.read_csv('retail-sales-dataset-2018-2022/sales_1.csv')
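The first look described above can be sketched as follows. Since the CSV file is not available here, the small DataFrame below is a made-up stand-in; only the column names (Date, Group, Sales_pkg, Avg_price_pkg) come from the code used later in this article.

```python
import pandas as pd

# Hypothetical stand-in for sales_1.csv; the values are invented for illustration.
df = pd.DataFrame({
    "Date": [202201, 202202, 202203, 202204],
    "Group": ["A", "A", "B", "B"],
    "Sales_pkg": [120.0, 95.0, None, 150.0],
    "Avg_price_pkg": [2.5, 2.7, 3.1, 2.9],
})

print(df.columns.tolist())   # column names
print(df.shape)              # (number of rows, number of columns)
print(df.describe())         # key statistics for the numeric columns
missing = df.isnull().sum()  # number of missing values per column
print(missing)
```

On the real dataset, the same four calls can be run directly on the df returned by pd.read_csv.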

We notice that the dates look like “202212” (the year followed by the month), so we convert them to real dates using the format “%Y%m”:

df['Date'] = pd.to_datetime(df['Date'].astype(str), format='%Y%m')
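To see what this conversion produces, here is a minimal sketch on two made-up values. Because the format has no day component, pandas sets each date to the first day of the month.

```python
import pandas as pd

# Two example values in the same "YYYYMM" layout as the Date column.
dates = pd.Series([201801, 202212])

# Convert to strings first, then parse year and month.
parsed = pd.to_datetime(dates.astype(str), format="%Y%m")

print(parsed.dt.year.tolist())   # years extracted from the parsed dates
print(parsed.dt.month.tolist())  # months extracted from the parsed dates
```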

(Optional) You can add a column called ‘Nb_sales’ to get an idea of the number of sales made. To compute it, divide the sales by the average price:

df['Nb_sales'] = df['Sales_pkg'] / df['Avg_price_pkg']
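The introduction also mentions checking for outliers. One common way to do that is the interquartile range (IQR) rule, sketched below on made-up sales values. The 1.5 × IQR threshold is a conventional choice and an assumption here, not something taken from the dataset.

```python
import pandas as pd

# Made-up sales values with one obviously extreme entry.
sales = pd.Series([100, 110, 95, 105, 98, 102, 900], name="Sales_pkg")

# Quartiles and interquartile range.
q1, q3 = sales.quantile(0.25), sales.quantile(0.75)
iqr = q3 - q1

# Values outside [q1 - 1.5*IQR, q3 + 1.5*IQR] are flagged as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = sales[(sales < lower) | (sales > upper)]
print(outliers)
```

On the real data, the same rule could be applied to df['Sales_pkg'] before deciding whether to keep, cap, or drop the flagged rows.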

After checking our data, we move on to the analysis. The first thing we can look at is the sales as a function of the average price:

sns.relplot(data=df, x='Avg_price_pkg', y='Sales_pkg', kind="line")
plt.xlabel("Average Price")
plt.title("Sales per Average Price")

Or the sales over the years:

sns.relplot(data=df, x='Date', y='Sales_pkg', kind="line")
plt.title("Sales over the years")
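To put numbers next to the time plot, the sales can also be summed per year. This is a minimal sketch on made-up rows, assuming the Date column has already been converted to datetimes.

```python
import pandas as pd

# Made-up monthly rows; Date is parsed from the "YYYYMM" layout.
df = pd.DataFrame({
    "Date": pd.to_datetime(["201801", "201807", "201903", "202212"], format="%Y%m"),
    "Sales_pkg": [100.0, 150.0, 200.0, 50.0],
})

# Group by the calendar year of each date and sum the sales.
yearly = df.groupby(df["Date"].dt.year)["Sales_pkg"].sum()
print(yearly)
```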

And since the data contains a lot of groups, we can also look at the sales per group:

sns.catplot(data=df, x='Group', y='Sales_pkg')
plt.xlabel("Different Groups")
plt.title("Sales per Group")
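The per-group plot can be complemented with a small summary table. The sketch below uses made-up values and assumes only the Group and Sales_pkg columns.

```python
import pandas as pd

# Made-up rows for three groups.
df = pd.DataFrame({
    "Group": ["A", "A", "B", "B", "C"],
    "Sales_pkg": [120.0, 80.0, 200.0, 100.0, 50.0],
})

# Total and average sales per group, largest total first.
sales_by_group = df.groupby("Group")["Sales_pkg"].agg(["sum", "mean"])
sales_by_group = sales_by_group.sort_values("sum", ascending=False)
print(sales_by_group)
```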

Let’s move on to the model. We can build it in two ways: the first is to write a function that tries several models and reports the best score of each one, and the second is to pick a single model (for example from sklearn.linear_model or sklearn.tree) and check its score directly.

The first method:

from sklearn.model_selection import GridSearchCV, ShuffleSplit
from sklearn.linear_model import LinearRegression, Lasso
from sklearn.tree import DecisionTreeRegressor

def find_best_model_using_gridsearchcv(X, y):
    # Candidate models and the hyperparameter grid to search for each one
    algos = {
        'linear_regression': {
            'model': LinearRegression(),
            'params': {
                # 'normalize' was removed in scikit-learn 1.2; fit_intercept is searched instead
                'fit_intercept': [True, False]
            }
        },
        'lasso': {
            'model': Lasso(),
            'params': {
                'alpha': [1, 2],
                'selection': ['random', 'cyclic']
            }
        },
        'decision_tree': {
            'model': DecisionTreeRegressor(),
            'params': {
                # 'mse' was renamed 'squared_error' in scikit-learn 1.0
                'criterion': ['squared_error', 'friedman_mse'],
                'splitter': ['best', 'random']
            }
        }
    }
    scores = []
    # Five random 80/20 train/test splits
    cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
    for algo_name, config in algos.items():
        gs = GridSearchCV(config['model'], config['params'], cv=cv, return_train_score=False)
        gs.fit(X, y)
        # Keep the best cross-validated score and parameters of each model
        scores.append({
            'model': algo_name,
            'best_score': gs.best_score_,
            'best_params': gs.best_params_
        })

    return pd.DataFrame(scores, columns=['model', 'best_score', 'best_params'])
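The function expects a feature matrix X and a target y, which are not built in this article. Below is a minimal, self-contained sketch of one possible way to prepare them and run the same kind of search. The data is synthetic, and the choice of features (average price plus one-hot encoded group) is an assumption for illustration only.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import Lasso, LinearRegression
from sklearn.model_selection import GridSearchCV, ShuffleSplit

# Synthetic data standing in for the real dataset (values are made up).
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "Avg_price_pkg": rng.uniform(1.0, 5.0, n),
    "Group": rng.choice(["A", "B", "C"], n),
})
df["Sales_pkg"] = 500 - 60 * df["Avg_price_pkg"] + rng.normal(0, 10, n)

# Assumed feature choice: price plus one-hot encoded group.
X = pd.get_dummies(df[["Avg_price_pkg", "Group"]], columns=["Group"], dtype=float)
y = df["Sales_pkg"]

# Same cross-validation scheme as in the function above.
cv = ShuffleSplit(n_splits=5, test_size=0.2, random_state=0)
candidates = {
    "linear_regression": (LinearRegression(), {"fit_intercept": [True, False]}),
    "lasso": (Lasso(), {"alpha": [1, 2], "selection": ["random", "cyclic"]}),
}

rows = []
for name, (model, params) in candidates.items():
    gs = GridSearchCV(model, params, cv=cv)
    gs.fit(X, y)
    rows.append({"model": name, "best_score": gs.best_score_,
                 "best_params": gs.best_params_})

results = pd.DataFrame(rows)
print(results)
```

The best_score column is the mean cross-validated R² of the best parameter combination, so the model with the highest value is the natural candidate to keep.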


The second method: you choose one of the models, fit it, and check its score:

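dtr = DecisionTreeRegressor().fit(X_train, y_train)  # X_train, y_train: assumed to come from a train/test split
dtr.score(X_test, y_test)  # R² on the held-out split (X_test, y_test are assumed as well)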
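The second method relies on a train/test split that is not shown in this article. Here is a minimal end-to-end sketch on synthetic data; the 80/20 split, the max_depth value, and the random seeds are assumptions chosen for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor

# Synthetic data: sales decrease with price, plus some noise (values are made up).
rng = np.random.default_rng(0)
X = rng.uniform(1.0, 5.0, size=(300, 1))
y = 500 - 60 * X[:, 0] + rng.normal(0, 10, 300)

# Hold out 20% of the rows for testing.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0
)

# Fit a shallow tree on the training split and score it on the test split.
dtr = DecisionTreeRegressor(max_depth=4, random_state=0)
dtr.fit(X_train, y_train)
score = dtr.score(X_test, y_test)  # coefficient of determination (R²)
print(f"R^2 on the test split: {score:.3f}")
```

Scoring on rows the model has not seen gives a more honest estimate than scoring on the training data, which a decision tree can fit almost perfectly.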


To conclude, apart from the need to check and analyze our data first, we saw in this article that we can write a function that tries different models and then show or plot their scores, which lets us choose the model that fits our dataset best.




I am Mahmoud Chami, an international polyvalent engineering student at the Institute of Advanced Industrial Technologies.