Car Price Prediction

Feb 15, 2023

Introduction:

The aim of this project is to build a Machine Learning model that predicts a reasonable price for a used car. We use a Random Forest Regressor to predict the selling price from the car's features: manufacturing year, present (showroom) price, kilometers driven, fuel type, seller type, transmission type, number of previous owners, etc. With such a model, a buyer or seller can estimate the market value of a car before purchasing or selling it. In this project, we implemented and evaluated various machine learning methods on this dataset.

Dataset Description:

— Car_Name: The name of the car.
— Year: The manufacturing year of the car.
— Selling_Price: The price at which the car is being sold (the value we want to predict).
— Present_Price: The present price of the car in the showroom.
— Kms_Driven: The distance the car has been driven, in kilometers.
— Fuel_Type: The type of fuel the car uses: Petrol, Diesel, or CNG.
— Seller_Type: Whether the seller is a Dealer or an Individual.
— Transmission: Whether the transmission is Manual or Automatic.
— Owner: The number of previous owners: 0, 1, or 3.
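As a quick sanity check, the columns described above can be compared against a loaded DataFrame. The sketch below is only an illustration: the two rows are made-up values, not taken from the dataset, and `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced here.

```python
import pandas as pd

# Column names as listed in the dataset description
EXPECTED_COLUMNS = [
    "Car_Name", "Year", "Selling_Price", "Present_Price", "Kms_Driven",
    "Fuel_Type", "Seller_Type", "Transmission", "Owner",
]

def missing_columns(df: pd.DataFrame) -> list:
    """Return the expected columns that are absent from df."""
    return [col for col in EXPECTED_COLUMNS if col not in df.columns]

# Two made-up rows, used only to demonstrate the check
sample = pd.DataFrame({
    "Car_Name": ["car_a", "car_b"],
    "Year": [2014, 2017],
    "Selling_Price": [3.5, 7.2],
    "Present_Price": [5.6, 9.5],
    "Kms_Driven": [27000, 43000],
    "Fuel_Type": ["Petrol", "Diesel"],
    "Seller_Type": ["Dealer", "Individual"],
    "Transmission": ["Manual", "Automatic"],
    "Owner": [0, 1],
})

print(missing_columns(sample))                        # []
print(missing_columns(sample.drop(columns="Owner")))  # ['Owner']
```

Running a check like this right after loading the CSV catches a renamed or missing column before it causes a confusing error later.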

Explore The Data:

1- Import the libraries:

The libraries we will use in this project are: numpy, pandas, seaborn, matplotlib, and sklearn (scikit-learn).
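These libraries are usually imported under the aliases shown here. The aliases `np`, `pd`, `plt`, and `sns` are common conventions and match the names used by the code in this post; the exact import list of the original project is an assumption.

```python
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# scikit-learn pieces used later in the post
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV, train_test_split
```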

2- Import our data:

car_dataset = pd.read_csv('car-price-prediction/car data.csv')  # Change this path to match yours

3- Explore the data:

Let’s have a look at the data:

car_dataset.head()  # Show the first 5 rows
car_dataset.info()  # Show the column names, the non-null count,
                    # and the dtype of each column

We will add a new column called “Number_of_years” that holds the age of the car. To compute it, we first add a helper column called “Current_year”:

car_dataset['Current_year'] = 2023
car_dataset["Number_of_years"] = car_dataset["Current_year"] - car_dataset["Year"]
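The same calculation can be wrapped in a small helper so that the reference year is passed in rather than stored in its own column. This is a minimal sketch on made-up rows; the helper name `add_car_age` and its `reference_year` parameter are assumptions for illustration, not part of the original project.

```python
import pandas as pd

def add_car_age(df: pd.DataFrame, reference_year: int = 2023) -> pd.DataFrame:
    """Return a copy of df with a Number_of_years column (car age in years)."""
    out = df.copy()
    out["Number_of_years"] = reference_year - out["Year"]
    return out

cars = pd.DataFrame({"Year": [2014, 2017, 2020]})  # made-up manufacturing years
aged = add_car_age(cars)
print(aged["Number_of_years"].tolist())  # [9, 6, 3]
```

The default of 2023 matches the year used in this post; passing a different `reference_year` changes every age by the same constant.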

We then drop the columns we no longer need, keeping only the data required for the model. “Car_Name” is not used as a feature, and “Year” and “Current_year” are already captured by “Number_of_years”:

car_dataset = car_dataset.drop(['Car_Name','Year','Current_year'],axis=1)

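The “Fuel_Type” column holds three categories stored as text, and “Seller_Type” and “Transmission” are text as well. The model needs numeric inputs, so we convert these columns into 0/1 indicator columns with “get_dummies” from the “Pandas” library. With drop_first=True the first category of each column is dropped, so “Fuel_Type” becomes two columns (Diesel and Petrol), and a row where both are 0 is a CNG car.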

car_dataset = pd.get_dummies(car_dataset,drop_first=True)
car_dataset.head()
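To see what this encoding produces, here is a small sketch on three made-up rows (the values are illustrative, not from the dataset). It shows which indicator columns `get_dummies` creates when `drop_first=True`.

```python
import pandas as pd

# Made-up rows covering every category of the three text columns
toy = pd.DataFrame({
    "Selling_Price": [3.5, 7.2, 2.1],
    "Fuel_Type": ["Petrol", "Diesel", "CNG"],
    "Seller_Type": ["Dealer", "Individual", "Dealer"],
    "Transmission": ["Manual", "Automatic", "Manual"],
})

encoded = pd.get_dummies(toy, drop_first=True)

# Numeric columns come first, followed by one indicator per kept category
print(list(encoded.columns))
# ['Selling_Price', 'Fuel_Type_Diesel', 'Fuel_Type_Petrol',
#  'Seller_Type_Individual', 'Transmission_Manual']
```

The dropped categories (CNG, Dealer, Automatic) are the alphabetically first value of each column and are represented by all of that column's indicators being 0.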

Before moving to our model, we want to know how the columns correlate with each other. To make this easy to check, we use the “heatmap” from the library “seaborn”:

correlation = car_dataset.corr()  # pairwise correlation between the columns
plt.figure(figsize=(14,8))
sns.heatmap(correlation, annot=True, cmap="RdYlGn")  # plot the matrix itself, not correlation.corr()
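Besides the heatmap, it can help to list how strongly each feature correlates with the target. The following is a sketch on made-up numbers (not the real dataset), and the variable name `target_corr` is introduced here for illustration.

```python
import pandas as pd

# Made-up numeric rows with the same column names as the cleaned dataset
toy = pd.DataFrame({
    "Selling_Price":   [3.5, 7.2, 2.1, 5.0, 1.2],
    "Present_Price":   [5.6, 9.5, 4.1, 6.9, 3.0],
    "Kms_Driven":      [27000, 43000, 69000, 5200, 90000],
    "Number_of_years": [9, 6, 12, 5, 15],
})

# Correlation of every feature with the target, strongest first
target_corr = (
    toy.corr()["Selling_Price"]
    .drop("Selling_Price")                   # the target correlates 1.0 with itself
    .sort_values(key=abs, ascending=False)   # rank by absolute strength
)
print(target_corr)
```

Sorting by absolute value keeps strong negative correlations near the top instead of pushing them to the bottom of the list.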

Make our model:

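The first step is to choose our target. In our case the target is the selling price of the car, which is the first column of the dataset; all the remaining columns are the features:

X = car_dataset.iloc[:,1:]  # features: every column except the first

y = car_dataset.iloc[:,0]  # target: Selling_Price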
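Selecting by position relies on the target being the first column. An alternative, sketched here on made-up rows, is to select by column name, which keeps working if the column order changes. The names `X_named` and `y_named` are hypothetical and only used in this example.

```python
import pandas as pd

# Made-up rows; only the column names mirror the dataset
toy = pd.DataFrame({
    "Selling_Price": [3.5, 7.2, 2.1],
    "Present_Price": [5.6, 9.5, 4.1],
    "Kms_Driven": [27000, 43000, 69000],
})

X_named = toy.drop(columns="Selling_Price")  # features: everything but the target
y_named = toy["Selling_Price"]               # target column, selected by name

print(list(X_named.columns))  # ['Present_Price', 'Kms_Driven']
print(y_named.name)           # Selling_Price
```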

Now we split our data into training and test sets, using “train_test_split” from the library “sklearn.model_selection”:

from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2)  # hold out 20% of the rows for testing
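The split is random, so the rows that end up in the test set change on every run unless a seed is fixed. The sketch below uses synthetic arrays (not the car data) to show the resulting shapes and the effect of the optional `random_state` argument; the value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic data: 10 rows, 2 features
X_demo = np.arange(20).reshape(10, 2)
y_demo = np.arange(10)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print(X_tr.shape, X_te.shape)  # (8, 2) (2, 2)

# Repeating the call with the same seed gives the same split
_, _, _, y_te_again = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=42
)
print((y_te == y_te_again).all())  # True
```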

And finally, let’s choose the right model for our case. To do that, we can follow the map from this website: https://scikit-learn.org/stable/tutorial/machine_learning_map/index.html
We choose a random forest. Its advantage is that it is a meta-estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting.

from sklearn.ensemble import RandomForestRegressor
rf = RandomForestRegressor()  # base model with default hyperparameters

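To find the best hyperparameters we use “RandomizedSearchCV” from the library “sklearn.model_selection”, which samples parameter combinations from a grid and scores each one with cross-validation: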

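from sklearn.model_selection import RandomizedSearchCV
# Candidate values are illustrative assumptions; the original post does not list them
n_estimators = [100, 200, 300, 500]
max_features = [1.0, 'sqrt']
max_depth = [5, 10, 15, 20, None]
min_samples_split = [2, 5, 10]
min_samples_leaf = [1, 2, 5]
random_grid = {'n_estimators':n_estimators,
                "max_features":max_features,
                "max_depth":max_depth,
                "min_samples_split":min_samples_split,
                'min_samples_leaf':min_samples_leaf}
rf = RandomForestRegressor()
rf_random = RandomizedSearchCV(estimator=rf, param_distributions =random_grid,scoring='neg_mean_squared_error',n_iter=10,cv=5,verbose=2,random_state=42,n_jobs=1)
rf_random.fit(X_train,y_train)  # tries 10 sampled combinations with 5-fold cross-validation
rf_random.best_params_  # the best combination found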
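Because the search is scored with 'neg_mean_squared_error', its best score is a negative number, and it is easier to read once converted to an RMSE. The sketch below shows that conversion on synthetic data from `make_regression`, with a deliberately tiny grid so it runs quickly; the data, the grid values, and the names `search` and `best_rmse` are assumptions for illustration.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import RandomizedSearchCV

# Synthetic regression problem standing in for the car data
X_demo, y_demo = make_regression(
    n_samples=60, n_features=4, noise=5.0, random_state=0
)

# Small illustrative grid: 4 combinations, of which 3 are sampled
param_grid = {"n_estimators": [10, 20], "max_depth": [3, None]}

search = RandomizedSearchCV(
    estimator=RandomForestRegressor(random_state=0),
    param_distributions=param_grid,
    scoring="neg_mean_squared_error",
    n_iter=3,
    cv=3,
    random_state=42,
)
search.fit(X_demo, y_demo)

# best_score_ is the mean negative MSE across folds; flip the sign, then take the root
best_rmse = np.sqrt(-search.best_score_)
print(search.best_params_)
print(f"cross-validated RMSE: {best_rmse:.2f}")
```

The RMSE is in the same unit as the target, which makes it simpler to judge than the raw negative MSE.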

As the last step, let’s use our “X_test” to make predictions and look at the results.

predictions=rf_random.predict(X_test)
sns.displot(predictions) # We can plot our prediction to visualize it
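A plot gives a visual impression, but a few error metrics make the quality of the predictions concrete. The sketch below computes MAE, RMSE, and R² with `sklearn.metrics` on small made-up arrays; in the project the same calls would take `y_test` and `predictions`. The numbers here are illustrative only and say nothing about the real model's performance.

```python
import numpy as np
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score

# Made-up true and predicted prices
y_true = np.array([3.0, 5.0, 2.5, 7.0])
y_pred = np.array([2.5, 5.0, 3.0, 8.0])

mae = mean_absolute_error(y_true, y_pred)            # average absolute error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))   # root of the mean squared error
r2 = r2_score(y_true, y_pred)                        # share of variance explained

print(f"MAE={mae:.3f} RMSE={rmse:.3f} R2={r2:.3f}")
# MAE=0.500 RMSE=0.612 R2=0.882
```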

Conclusion:

In conclusion, we have developed a Machine Learning model that predicts the market value of used cars. By considering important features such as the manufacturing year, showroom price, distance driven, fuel type, seller type, transmission type, and ownership history, our model gives buyers and sellers an estimate of the value of a vehicle. We tuned the model's hyperparameters with a randomized search to get the best results we could on this dataset. Overall, our project shows how machine learning can be used to predict the price of used cars, which can benefit the automotive industry and its consumers.

Important Links:

Data Source: www.cardekho.com
The GitLab link for the project: https://gitlab.com/chamimahmoud/car-price-prediction