Customer Clustering using K means 📊


Skills: 🐍python, 💾 SQL, 👨‍💻 Machine Learning

1. Overview

This is and End-to-End Machine Learning Project for Customer Clusterization, encompassing many skills such as:

Our main goal here is to use KMeans clustering to define the customer importance, based on two main variables: the total monetary spent (BRL), and frequency of purchase by customer.

About the Dataset:

This is a Brazilian ecommerce public dataset of orders made at Olist Store. The dataset has information of 100k orders from 2016 to 2018 made at multiple marketplaces in Brazil. Its features allows viewing an order from multiple dimensions: from order status, price, payment and freight performance to customer location, product attributes and finally reviews written by customers. This is real commercial data, it has been anonymised, and references to the companies and partners

This data was collected and can be seen at Kaggle website


Frist, let's run the commands that will help python excecute CRUD (Create, Read, Update, Delete) operations at mySQL workbench. We need to provide to the connector parameters the Host, User, Password and Database on mySQL Workbench with access to the related Database.

Next, take a look into the data Schema, from which we will build our DataFrame:

:
Source: Kaggle


Next, create the SQL query and execute, then use pandas to convert the data into a Pandas DF:
OBS: The query is written as a string, therefore we cannot see if the query is working until we tranform into a python object!

2. EDA

EDA stands for Explanatory Data Analysis, and consists of performing initial investigations on data, so as to discover patterns, spot anomalies, test hypothesis and check assumptions with the help of summary statistics and graphical representations.
Here we need to clean the data. Fill empty registrations and see the data distribution among different clients:

By looking into the histogram, we can assume most of the clients do not spend much on the platform. They mostly spent between 1 and 400 BRL for all the time they have registered.

Last, reshape the number of transactions based on the time a client has entered the system, creating a column that measures average purchases/year:

Then, we check the data same as for the totalamount feature:

That means our data is full with outliers. We have to drop them, to build a more reliable model:

3. KMeans clusterization


Next step is to build KMeans model, using the data fetched from mySQL:

K-means clustering is a method of vector quantization, originally from signal processing, that aims to partition n observations into k clusters in which each observation belongs to the cluster with the nearest mean (cluster centers or cluster centroid), serving as a prototype of the cluster.

K-means clustering minimizes within-cluster variances (squared Euclidean distances), but not regular Euclidean distances, which would be the more difficult Weber problem: the mean optimizes squared errors, whereas only the geometric median minimizes Euclidean distances.

Use an Elbow Plot to verify the ideal number of clusters in the data.

4. Conclusion

It is clear that beyond k = 5 the Sum Squared Distance starts to take a more plain decrease. Therefore, it is best to assume the optimal number of clusters is 5. Last we will use a scatterplot to see how the data is clustered:

And now we have it! Using ML, we can continuously classify new customers and update the database.