Amazon Reviews Project

Mahmoud Chami
Jan 18, 2023

In this project, we walk through a real-world Python machine learning project using the scikit-learn package. We will build a model that automatically classifies a piece of text as expressing a positive, negative, or neutral sentiment, using Amazon reviews as training data. We apply it to book reviews here, but the same approach works for reviews of any other product.

To begin our project, the first thing we will do is import the libraries we will need:

import pandas as pd
import matplotlib.pyplot as plt
import json

Next, we define the classes that will hold our data. We will see what each one is for later in the code:

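class Sentiment:
    # Label constants used throughout the project.
    # The string values are an assumption; only the names appear in the code.
    NEGATIVE = "NEGATIVE"
    NEUTRAL = "NEUTRAL"
    POSITIVE = "POSITIVE"

class Review:
    def __init__(self, text, score):
        self.text = text  # needed by ReviewsContainer.get_text()
        self.score = score
        self.sentiment = self.get_sentiment()

    def get_sentiment(self):
        if self.score <= 2:
            return Sentiment.NEGATIVE
        elif self.score == 3:
            return Sentiment.NEUTRAL
        else:  # For score = 4 or score = 5
            return Sentiment.POSITIVE

class ReviewsContainer:
    def __init__(self, reviews):
        self.reviews = reviews

    def get_text(self):
        return [x.text for x in self.reviews]

    def get_sentiment(self):
        return [x.sentiment for x in self.reviews]

    def distribute(self):
        # Split the reviews by sentiment.
        negative = list(filter(lambda x: x.sentiment == Sentiment.NEGATIVE, self.reviews))
        positive = list(filter(lambda x: x.sentiment == Sentiment.POSITIVE, self.reviews))
        neutral = list(filter(lambda x: x.sentiment == Sentiment.NEUTRAL, self.reviews))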
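
The body of distribute is cut off after the three filters, so what it does with the resulting lists is not shown here. As a minimal sketch only, assuming the intent is to balance the classes by downsampling each one to the size of the smallest, a standalone helper could look like this. The name balance_by_label and its parameters are hypothetical and not part of the project:

```python
import random
from collections import defaultdict


def balance_by_label(items, get_label, seed=0):
    """Downsample every class to the size of the smallest class.

    Hypothetical helper: not taken from the original project.
    """
    groups = defaultdict(list)
    for item in items:
        groups[get_label(item)].append(item)

    if not groups:
        return []

    smallest = min(len(group) for group in groups.values())
    rng = random.Random(seed)  # local RNG so the result is reproducible

    balanced = []
    for group in groups.values():
        balanced.extend(rng.sample(group, smallest))
    rng.shuffle(balanced)  # avoid leaving the classes in contiguous runs
    return balanced


# Toy (text, label) pairs standing in for Review objects.
toy = [("a", "POSITIVE")] * 6 + [("b", "NEGATIVE")] * 2 + [("c", "NEUTRAL")] * 3
balanced = balance_by_label(toy, get_label=lambda pair: pair[1])
```

With Review objects, the label function would be something like lambda r: r.sentiment.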

Next we read our file. It stores one JSON object per line, so instead of loading it straight into a data frame we parse it line by line:

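file_name = '../Books_small.json'  # change it to your own path
reviews = []
with open(file_name) as f:
    for line in f:
        review = json.loads(line)
        # Assumed field names for the Amazon review format; adjust if your file differs.
        reviews.append(Review(review['reviewText'], review['overall']))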
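
Reading one JSON object per line can be tried without the dataset file. The sketch below parses two made-up records from an in-memory string. The field names reviewText and overall are an assumption about the dataset's format, so adjust them to match your file:

```python
import io
import json

# Two made-up records, one JSON object per line (the "JSON Lines" layout).
sample = io.StringIO(
    '{"reviewText": "Loved every chapter.", "overall": 5.0}\n'
    '{"reviewText": "Could not finish it.", "overall": 1.0}\n'
)

records = []
for line in sample:
    line = line.strip()
    if not line:  # skip blank lines instead of crashing on them
        continue
    row = json.loads(line)
    # Assumed keys: "reviewText" (review body) and "overall" (star rating).
    records.append((row["reviewText"], row["overall"]))

print(records)
```

Calling json.load on the whole file would fail here, because the file as a whole is not a single JSON document.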

Let’s move on to splitting our data. For that we use train_test_split from the sklearn.model_selection module:

from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(reviews, test_size= 0.20, random_state=0)

train_container = ReviewsContainer(train_df)
test_container = ReviewsContainer(test_df)

x_train = train_container.get_text()
y_train = train_container.get_sentiment()

x_test = test_container.get_text()
y_test = test_container.get_sentiment()
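
A classifier cannot be trained on raw strings, which is why the model code that follows uses vectorized inputs such as x_test_vector. The code that builds them is not shown in this article, so the following is a sketch under an assumption: it uses scikit-learn's TfidfVectorizer (CountVectorizer would be used the same way) on toy sentences. The key point is to call fit_transform on the training text only and transform on the test text, so the test set never influences the vocabulary:

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins for x_train and x_test.
toy_train = ["great book", "bad book", "great story"]
toy_test = ["great unknownword"]

vectorize = TfidfVectorizer()  # assumption: the article's vectorizer type is not shown
toy_train_vector = vectorize.fit_transform(toy_train)  # learn vocabulary and weights
toy_test_vector = vectorize.transform(toy_test)        # reuse the learned vocabulary

print(sorted(vectorize.vocabulary_))  # ['bad', 'book', 'great', 'story']
print(toy_train_vector.shape)         # (3, 4)
print(toy_test_vector.shape)          # (1, 4): words unseen in training are ignored
```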

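After splitting the data, we choose a model, train it, and predict the sentiment of the first test review. The model works on numeric features, so it is trained on x_train_vector and evaluated on x_test_vector, the vectorized versions of x_train and x_test:

from sklearn.linear_model import LogisticRegression

# x_train_vector and x_test_vector are x_train and x_test after vectorization
# (fit the vectorizer on x_train, then transform both sets).
clf_log = LogisticRegression()
clf_log.fit(x_train_vector, y_train)
clf_log.predict(x_test_vector[0])  # predict the sentiment of the first element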

We can check the accuracy of our model with “clf_log.score(x_test_vector, y_test)”. We got a score of 81.5%, which is a good result in our case.
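
For a classifier, score returns the mean accuracy: the fraction of test samples whose predicted label equals the true label. A small check on made-up labels, comparing a manual count with scikit-learn's accuracy_score:

```python
from sklearn.metrics import accuracy_score

# Made-up labels, not results from the review model.
y_true = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEUTRAL", "POSITIVE"]
y_pred = ["POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE", "POSITIVE"]

manual = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(manual)                          # 0.6
print(accuracy_score(y_true, y_pred))  # 0.6
```

In this toy case a model that always answers POSITIVE still reaches 60% accuracy, which is why a per-class metric is worth checking as well.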

Accuracy alone does not tell us how well the model recognizes each sentiment. The last thing we will do is check that with f1_score, which can be interpreted as the harmonic mean of precision and recall, computed separately for each class:

from sklearn.metrics import f1_score
f1_score(y_test, clf_log.predict(x_test_vector), average=None,
         labels=[Sentiment.POSITIVE, Sentiment.NEUTRAL, Sentiment.NEGATIVE])
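
With average=None, f1_score returns one value per label, in the order given by labels. To make the harmonic-mean definition concrete, this sketch computes precision, recall, and F1 by hand for one class on made-up labels and compares the result with scikit-learn:

```python
from sklearn.metrics import f1_score

# Made-up labels, not results from the review model.
y_true = ["POSITIVE", "POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL"]
y_pred = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "POSITIVE", "POSITIVE"]
labels = ["POSITIVE", "NEUTRAL", "NEGATIVE"]

# zero_division=0 scores a class as 0 when it is never predicted (NEUTRAL here).
per_class = f1_score(y_true, y_pred, average=None, labels=labels, zero_division=0)

# Manual F1 for the POSITIVE class.
tp = sum(t == "POSITIVE" and p == "POSITIVE" for t, p in zip(y_true, y_pred))
precision = tp / y_pred.count("POSITIVE")  # 2 / 4
recall = tp / y_true.count("POSITIVE")     # 2 / 3
manual_f1 = 2 * precision * recall / (precision + recall)

print(per_class)   # one score per entry in `labels`
print(manual_f1)   # matches per_class[0]
```

A low score for one label, typically the rarest class, is a sign that the training data may need more examples of it.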

Finally, we can try the model on a small hand-written test set, or apply it to new data. New text must go through the same fitted vectorizer before prediction:

test_set = ['great', 'bad book', 'horrible', 'long', 'nice']
new_test = vectorize.transform(test_set)
clf_log.predict(new_test)
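
To see the whole flow on unseen text, the sketch below strings the steps together on a tiny made-up training set: vectorize, fit, transform the new strings, and predict. TfidfVectorizer and the toy sentences are assumptions for illustration. With so little data the predictions themselves are not meaningful, so no expected output is shown:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Tiny made-up training set, only to keep the example self-contained.
toy_text = [
    "great book, loved it", "nice and great read",
    "bad book, horrible plot", "horrible and boring",
    "it was ok", "average, nothing special",
]
toy_labels = ["POSITIVE", "POSITIVE", "NEGATIVE", "NEGATIVE", "NEUTRAL", "NEUTRAL"]

vectorize = TfidfVectorizer()
clf = LogisticRegression()
clf.fit(vectorize.fit_transform(toy_text), toy_labels)

test_set = ["great", "bad book", "horrible", "long", "nice"]
new_test = vectorize.transform(test_set)  # transform only: never refit on new data
predictions = clf.predict(new_test)

for text, label in zip(test_set, predictions):
    print(f"{text!r} -> {label}")
```

A word that never appeared in training, such as "long" here, becomes an all-zero vector, so its prediction reflects only the model's bias term.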

The link to the full project:

The link to the dataset:



I am Mahmoud Chami, an international polyvalent engineering student at the Institute of Advanced Industrial Technologies.