---
title: 'IDS 572: Clustering Project'
author: ""
date: 2020-04-15T21:13:14-05:00
categories: []
tags: ["Clustering"]
summary: Comparing K-Means, K-Medoids, and Hierarchical clustering on a business dataset.
---

## Introduction

CRISA, a well-known market research company, tracks about 60-70 brands across 30 product categories in order to determine the best marketing strategies for its clients. To do this, CRISA conducts household panels in India, with data covering about 50,000 urban and 25,000 rural Indian households. Households are selected using stratified sampling, and in urban areas the data captures roughly 80% of the market.

CRISA uses this data to provide market research services to its clients, who fall into two main groups.

For a long time now, CRISA has used segmentation algorithms that cluster consumers based on their demographic characteristics. There is now demand for CRISA to segment the market further in order to better capture brand loyalty and the consumer purchasing process. The two sets of variables on which CRISA wants to cluster are purchase behavior variables (e.g., brand loyalty, number of transactions, and volume) and basis-for-purchase variables (e.g., promotion usage and product category proportions).

The objective of this additional clustering is to gain insight on purchase behaviors and brand loyalty, and identify the most important attributes which help to identify this behavior. This way, CRISA's clients can better use the information provided to make decisions. The goal is to develop unique strategies targeting different segments, in order to better reach individuals in each cluster and increase brand loyalty of consumers. This is more cost-effective than implementing a general marketing strategy which may only appeal to a fraction of consumers.

## Data Exploration and Cleaning

We converted several categorical variables, such as Mother Tongue, Gender, Children, and Education, into dummy variables via one-hot encoding, so that differences between clusters on these attributes could be examined.
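As a rough illustration, the encoding can be done in base R along these lines (a minimal sketch; the data frame `hh` and its column names are hypothetical stand-ins for the CRISA file):

```r
# One-hot encode the categorical columns (hypothetical names).
cat_cols <- c("MT", "SEX", "CHILD", "EDU")

one_hot <- function(df, col) {
  f <- factor(df[[col]])
  m <- model.matrix(~ f - 1)                  # one indicator column per level
  colnames(m) <- paste(col, levels(f), sep = "_")
  m
}

dummies <- do.call(cbind, lapply(cat_cols, one_hot, df = hh))
hh_enc  <- cbind(hh[setdiff(names(hh), cat_cols)], dummies)
```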

## K-Means Clustering

### Purchase Behavior Variables

First, we built a clustering model using only the variables related to purchase behavior: the number of brands purchased, brand loyalty, the number of transactions, the number of runs purchasing the same brand, volume of product, and average price. We built several k-means models while varying parameters such as `centers`, `nstart`, and `iter.max`; changing these parameters made no significant difference between models. Our baseline model uses 25 random starting sets (`nstart = 25`) and a maximum of 12 iterations (`iter.max = 12`).
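A minimal sketch of this baseline fit, assuming the scaled purchase-behavior columns are stored in a data frame named `pb_vars` (a hypothetical name):

```r
# Baseline k-means on the purchase-behavior variables.
set.seed(123)               # k-means starting points are random
pb_vars <- scale(pb_vars)   # the variables sit on very different scales
km3 <- kmeans(pb_vars, centers = 3, nstart = 25, iter.max = 12)

km3$withinss   # within-cluster sum of squares, per cluster
km3$betweenss  # between-cluster sum of squares
km3$size       # cluster sizes
```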

| K-Means (k = 3) | Sum of Squares |
|-----------------|----------------|
| Cluster 1       | 1242.74        |
| Cluster 2       | 1141.76        |
| Cluster 3       | 1585.15        |
| Total Within    | 3969.64        |
| Between         | 2020.36        |
| Total           | 5990           |

| Cluster   | Size |
|-----------|------|
| Cluster 1 | 259  |
| Cluster 2 | 166  |
| Cluster 3 | 175  |

Then, we examined the characteristics of the clusters to understand the differences between them. The household size in cluster 2 is the largest, which drives up its consumption measures such as the number of brands, the number of transactions, and volume. Households in cluster 3 have higher brand loyalty than the other clusters (they generally buy products of the same brand) and the lowest affluence index. Education level is highest in cluster 1 and lowest in cluster 3.

In light of this information, marketing strategies must vary from cluster to cluster. For example, to reach consumers with a lower economic class and higher brand loyalty, you should target households in cluster 3 and shape your marketing plan around cluster 3's patterns.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.339768 | 3.474903 | 16.60618 | 0.2193920 | 3.200772 | 23.91120 | 13.509652 | 7778.097 |
| 2 | 2.409639 | 5.108434 | 20.99398 | 0.2350520 | 5.138554 | 50.01205 | 27.180723 | 16856.536 |
| 3 | 2.822857 | 4.382857 | 13.86286 | 0.7250765 | 2.857143 | 23.98286 | 8.228571 | 13349.429 |

Next, we drew the cluster plot using the `fviz_cluster` function from the factoextra package. It is clear that this clustering model is not optimal, since the clusters overlap heavily; the model was not able to segment the households in the intersection area.
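A sketch of the plot call, assuming the factoextra package and the `km3` fit and `pb_vars` data from above:

```r
# Visualize the k-means clusters on the first two principal components.
library(factoextra)
fviz_cluster(km3, data = pb_vars, geom = "point")
```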

Then, we applied both the elbow and the silhouette method to decide the number of clusters. According to the elbow method, the best k value is 6; the silhouette method instead suggests 4 clusters. We selected 4 clusters, since with more clusters the scope of the business cannot be easily managed by marketing teams.
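These diagnostics can be produced with `fviz_nbclust` (a sketch, under the same assumptions as above):

```r
# Elbow (total within-cluster sum of squares) and average silhouette width.
fviz_nbclust(pb_vars, kmeans, method = "wss")         # elbow method
fviz_nbclust(pb_vars, kmeans, method = "silhouette")  # silhouette method
```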

This plot shows the result when we use 4 clusters, following the elbow and silhouette analysis. This model is slightly better than the previous one, but still not ideal, because the clusters overlap a fair amount.

| K-Means (k = 4) | Sum of Squares |
|-----------------|----------------|
| Cluster 1       | 1170.81        |
| Cluster 2       | 502.28         |
| Cluster 3       | 879.76         |
| Cluster 4       | 875.11         |
| Total Within    | 3427.96        |
| Between         | 2562.04        |
| Total           | 5990           |

### Basis for Purchase Variables

Second, we applied k-means clustering using a different set of variables. The basis-for-purchase variables comprise the percent of volume purchased without promotion, under promo code 6, and under promo codes other than 6, as well as the proportions of beauty, health, and baby products purchased.

| K-Means (k = 3) | Sum of Squares |
|-----------------|----------------|
| Cluster 1       | 1693           |
| Cluster 2       | 229.32         |
| Cluster 3       | 2500.69        |
| Total Within    | 4423.01        |
| Between         | 1566.99        |
| Total           | 5990           |

| Cluster   | Size |
|-----------|------|
| Cluster 1 | 335  |
| Cluster 2 | 73   |
| Cluster 3 | 192  |

The graph below indicates that the basis-for-purchase variables are not sufficient to segment the households with 3 clusters: the clusters overlap, and thus the variance between clusters is small.

Then, we examined the behavioral differences between clusters. Households in cluster 2 have higher brand loyalty (77%) than the other clusters (meaning they generally buy products of the same brand) and the lowest affluence index. Their socioeconomic class is also the lowest (almost 3.4). People in cluster 3 are more educated than the others. Overall, households in clusters 1 and 3 show similar consumption patterns.

Based on these insights, marketing strategies must vary from cluster to cluster. For instance, when launching medium-priced products, you should target households in cluster 2 and build marketing strategies that match the characteristics of that cluster.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.591045 | 4.483582 | 17.14030 | 0.3670430 | 3.800000 | 31.60597 | 15.853731 | 13008.752 |
| 2 | 3.356164 | 4.150685 | 8.90411 | 0.7642848 | 2.904110 | 25.46575 | 8.506849 | 13279.315 |
| 3 | 2.015625 | 3.697917 | 19.89583 | 0.2290487 | 3.630208 | 32.52604 | 18.328125 | 9487.188 |

Then, we applied both the elbow and silhouette methods to decide the number of clusters. According to the elbow method, the best k value is 7, while the silhouette method suggests 8 clusters. Instead of using 7 or 8, we set the number of clusters to 4, since this helps marketing teams manage their plans systematically.

This plot shows the clustering model with 4 clusters, considering the elbow and silhouette results above. It is slightly better than the 3-cluster model, but still not ideal because the clusters overlap.

| K-Means (k = 4) | Sum of Squares |
|-----------------|----------------|
| Cluster 1       | 1170.81        |
| Cluster 2       | 502.28         |
| Cluster 3       | 879.76         |
| Cluster 4       | 875.11         |
| Total Within    | 3427.96        |
| Between         | 2562.04        |
| Total           | 5990           |

### Combined Variables

Lastly, we applied k-means clustering to the combined variables from the two previous parts, i.e., both the purchase behavior and the basis-for-purchase variables. The tables below describe the clustering model with k = 3.

| K-Means (k = 3) | Sum of Squares |
|-----------------|----------------|
| Cluster 1       | 878.75         |
| Cluster 2       | 3440.63        |
| Cluster 3       | 5068.85        |
| Total Within    | 9388.23        |
| Between         | 2591.77        |
| Total           | 11980          |

| Cluster   | Size |
|-----------|------|
| Cluster 1 | 71   |
| Cluster 2 | 243  |
| Cluster 3 | 286  |

The graph below indicates that the combined variables perform slightly better than the previous models, yet they are still not sufficient to segment the households: the clusters overlap (the intersections are hard to disentangle), so the variance between clusters is small.

Then, we checked the characteristics of the clusters for meaningful differences. Household sizes are similar across clusters. Households in cluster 1 have higher brand loyalty than the other clusters (hence their ratio of transactions to brand runs is also high) and a lower affluence index. Education level is ordered cluster 3 > cluster 2 > cluster 1. Overall, households in clusters 1 and 3 show similar volumes of consumption.

Based on this information, marketing strategies must vary from cluster to cluster. For example, to launch premium products into the market, you should reach people with a higher affluence index and education level, which corresponds to cluster 3. Cluster 3 has higher purchasing power and is quality oriented rather than price oriented.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 3.450704 | 4.140845 | 8.098591 | 0.7808103 | 2.732394 | 23.91549 | 7.492958 | 13055.14 |
| 2 | 2.588477 | 4.242798 | 15.395062 | 0.4660238 | 3.193416 | 24.25103 | 11.251029 | 12173.96 |
| 3 | 2.188811 | 4.160839 | 20.615385 | 0.1889798 | 4.237762 | 38.81469 | 21.625874 | 11411.45 |

Then, we applied both the elbow and silhouette methods to benchmark different numbers of clusters. According to the elbow method, the best k value is 6, while the silhouette method suggests 9 clusters. We chose 6 clusters, due to the business implications of having too many clusters.

This plot shows the clustering model with 6 clusters, following the benchmarking above. The within-cluster sum of squares is lower and the between-cluster sum of squares is higher than in the previous model, so this model is slightly better, but it is still not ideal because the clusters overlap.

| K-Means (k = 6) | Sum of Squares |
|-----------------|----------------|
| Cluster 1       | 1506.44        |
| Cluster 2       | 1526.87        |
| Cluster 3       | 733.63         |
| Cluster 4       | 790.09         |
| Cluster 5       | 1743.87        |
| Cluster 6       | 1095.17        |
| Total Within    | 7396.07        |
| Between         | 4583.93        |
| Total           | 11980          |

## K-Medoids Clustering

### Purchase Behavior Variables

Several issues arise with k-means clustering. For example, the algorithm is very sensitive to outliers and noise, since the mean itself is sensitive to them. To address these issues, we explored k-medoids clustering as a second technique.

First, we applied this clustering to the purchase behavior variables, starting again with 3 clusters. Here is a plot of the resulting clusters.
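A minimal sketch of the fit, assuming the cluster package and the same `pb_vars` data frame as before:

```r
# k-medoids (PAM) with 3 clusters on the purchase-behavior variables.
library(cluster)
pam3 <- pam(pb_vars, k = 3)

# Per-cluster size, max/average dissimilarity, diameter, and separation,
# matching the table below.
pam3$clusinfo
```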

| Cluster | size | max_diss | av_diss | diameter | separation |
|---|---|---|---|---|---|
| 1 | 160 | 6.474649 | 2.044774 | 9.059644 | 0.4692172 |
| 2 | 242 | 8.784529 | 2.255817 | 11.178010 | 0.6147253 |
| 3 | 198 | 10.019243 | 2.553799 | 13.270996 | 0.4692172 |

This plot does not vary much visually from the k-means algorithm. Once again, though, we wanted to verify what the optimal number of clusters actually is. Therefore, we chose to use both the elbow and silhouette methods to check for the optimal number of clusters. The elbow method does not demonstrate a clear elbow, but the silhouette method suggests 4 clusters to be optimal. Because of this, we will run the k-medoids again using 4 clusters instead of 3.
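The same `fviz_nbclust` diagnostics work with pam in place of kmeans (a sketch):

```r
# Elbow and silhouette diagnostics for k-medoids.
fviz_nbclust(pb_vars, cluster::pam, method = "wss")
fviz_nbclust(pb_vars, cluster::pam, method = "silhouette")
```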

The graph below indicates that the k-medoids model works slightly better than k-means on these variables, yet there is still overlap, so the distance between clusters is small.

| Cluster | size | max_diss | av_diss | diameter | separation |
|---|---|---|---|---|---|
| 1 | 156 | 6.474649 | 2.005804 | 9.059644 | 0.4692172 |
| 2 | 212 | 7.349266 | 2.017348 | 9.071292 | 0.6147253 |
| 3 | 180 | 8.425903 | 2.324993 | 11.441277 | 0.4692172 |
| 4 | 52 | 7.794822 | 3.004423 | 12.126794 | 1.0068283 |

According to the table, cluster 2 has a higher affluence index and lower brand loyalty than the other clusters. Households in cluster 3, on the other hand, are the most loyal customers in this market, with the lowest brand-runs metric. Clusters 1 and 4 have similar consumption patterns with slight differences; the household size of cluster 4 is the highest, so its total consumption differs significantly from the other clusters.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.391026 | 3.141026 | 13.98077 | 0.1591571 | 2.634615 | 20.59615 | 11.185897 | 6841.955 |
| 2 | 2.316038 | 4.551887 | 21.52830 | 0.2478214 | 5.014151 | 44.36792 | 25.268868 | 12592.934 |
| 3 | 2.727778 | 4.005556 | 13.92222 | 0.7224212 | 2.866667 | 22.88889 | 8.205556 | 10747.722 |
| 4 | 2.788461 | 6.519231 | 18.48077 | 0.2947516 | 3.692308 | 37.55769 | 16.769231 | 28408.173 |

### Basis for Purchase Variables

Now, just as with k-means, we cluster on the basis-for-purchase variables. Once again, three clusters serve as a baseline. Here is the result:

| Cluster | size | max_diss | av_diss | diameter | separation |
|---|---|---|---|---|---|
| 1 | 290 | 9.565754 | 1.967946 | 12.687231 | 0.4508119 |
| 2 | 237 | 15.534372 | 2.951697 | 21.109567 | 0.4508119 |
| 3 | 73 | 6.857316 | 1.564375 | 7.758894 | 0.6283877 |

The silhouette method clearly indicates that 2 clusters would be optimal in the case of clustering on basis for purchase variables.

Two clusters are too coarse for useful segmentation, however, so we ran the pam algorithm with 4 clusters instead; this still yields clusters of very different sizes, as is visible below.

| Cluster | size | max_diss | av_diss | diameter | separation |
|---|---|---|---|---|---|
| 1 | 254 | 9.825695 | 1.717100 | 12.687231 | 0.4625144 |
| 2 | 179 | 11.236392 | 2.662583 | 13.730433 | 0.4625144 |
| 3 | 70 | 6.857316 | 1.466505 | 7.758894 | 0.6932850 |
| 4 | 97 | 15.234251 | 2.775529 | 21.109567 | 0.6949584 |

This table shows the differences between clusters. For example, to launch a premium soap, we should try to understand clusters 2 and 4; conversely, to increase sales of low-priced products, we should focus on clusters 1 and 3.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.622047 | 4.433071 | 16.374016 | 0.4110557 | 3.610236 | 30.13780 | 14.606299 | 13184.57 |
| 2 | 2.039106 | 3.905028 | 20.061453 | 0.2254913 | 3.765363 | 34.11173 | 17.944134 | 10513.85 |
| 3 | 3.385714 | 3.928571 | 8.471429 | 0.7709884 | 2.857143 | 24.92857 | 8.114286 | 12940.21 |
| 4 | 2.391753 | 4.278351 | 19.268041 | 0.2473120 | 4.030928 | 32.84536 | 20.216495 | 10434.90 |

### Combined Variables

Lastly, we implement the k-medoids algorithm on the combined variables, starting with the baseline of three clusters.

| Cluster | size | max_diss | av_diss | diameter | separation |
|---|---|---|---|---|---|
| 1 | 308 | 10.502413 | 3.628277 | 14.88709 | 1.296149 |
| 2 | 72 | 8.068223 | 3.358211 | 11.94597 | 1.656396 |
| 3 | 220 | 16.028957 | 4.176661 | 21.67919 | 1.296149 |

The three clusters are visualized below. As the table above shows, the size differences between clusters are large; we therefore searched for a number of clusters that balances them better.

The silhouette method indicates 8 as the optimal number of clusters. Given the business implications, though, this is a lot: 8 separate marketing plans would be very intensive to develop and implement. Therefore, we selected 4 clusters, following the elbow method; 4 clusters also look reasonable on the silhouette graph.

| Cluster | size | max_diss | av_diss | diameter | separation |
|---|---|---|---|---|---|
| 1 | 177 | 10.327528 | 3.317120 | 13.87253 | 0.8590833 |
| 2 | 196 | 9.397880 | 3.490261 | 14.05870 | 1.0664848 |
| 3 | 65 | 8.068223 | 3.176974 | 11.94597 | 1.2785782 |
| 4 | 162 | 16.028957 | 4.023063 | 21.67919 | 0.8590833 |

The clusters are somewhat closer in size than with 3 groups, though cluster 3 is still much smaller than the others. In the table below, we can see the differences between the 4 clusters on some key variables. Cluster 3 has higher brand loyalty than the other clusters. The other three, larger clusters vary on important variables such as brand runs, volume, and household size; cluster 4 has the lowest household size and purchases the smallest volume.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.728814 | 4.711864 | 16.050847 | 0.4815375 | 3.112994 | 25.58757 | 10.949153 | 14096.186 |
| 2 | 2.403061 | 4.821429 | 21.066326 | 0.2102490 | 5.000000 | 46.08673 | 26.132653 | 14036.668 |
| 3 | 3.446154 | 3.938461 | 7.953846 | 0.7977172 | 2.646154 | 23.92308 | 7.123077 | 13018.692 |
| 4 | 1.987654 | 2.962963 | 16.820988 | 0.2743024 | 2.956790 | 22.06790 | 11.901235 | 6521.204 |

## Hierarchical Clustering

As a third clustering algorithm, we chose to use hierarchical clustering.

### Purchase Behavior Variables

We implemented agglomerative hierarchical clustering using different linkage methods: weighted average, complete, and Ward's method.
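A sketch of the fits, assuming the cluster package and Euclidean distances on the purchase-behavior variables:

```r
# Agglomerative clustering with three linkage methods.
library(cluster)
d <- dist(pb_vars, method = "euclidean")

hc_weighted <- agnes(d, method = "weighted")
hc_complete <- agnes(d, method = "complete")
hc_ward     <- agnes(d, method = "ward")

# Agglomerative coefficients (closer to 1 = stronger structure).
c(weighted = hc_weighted$ac, complete = hc_complete$ac, ward = hc_ward$ac)

# Cluster sizes from cutting the Ward tree into 3 groups.
table(cutree(as.hclust(hc_ward), k = 3))
```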

We chose the clustering based on Ward's method rather than complete linkage. The cluster sizes under the complete and weighted linkages vary dramatically: most households end up in cluster 1 (about 89% with complete linkage). Ward's method, by contrast, creates more balanced clusters. We can also compare the agglomerative coefficient, which measures the amount of clustering structure found (values closer to 1 suggest a stronger structure); on this basis, Ward's method is clearly the best for this dataset.

| Method | Agglomerative Coefficient |
|---|---|
| Weighted | 0.89 |
| Complete | 0.93 |
| Ward | 0.98 |

| Cluster (weighted linkage) | Size |
|---|---|
| 1 | 328 |
| 2 | 269 |
| 3 | 3 |

| Cluster (complete linkage) | Size |
|---|---|
| 1 | 533 |
| 2 | 53 |
| 3 | 14 |

| Cluster (Ward's method) | Size |
|---|---|
| 1 | 217 |
| 2 | 241 |
| 3 | 142 |

We checked for the optimal number of clusters using the hierarchical method.

Both the elbow and silhouette methods point to 5 clusters. However, we set the number of clusters to 4: increasing the count only makes several clusters smaller (compare the k = 4 and k = 5 sizes below), and focusing on such small household segments does not add significant business value.
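Cutting the Ward tree at the two candidate values shows the effect directly (a sketch, reusing `hc_ward` from above):

```r
# Cluster sizes at k = 4 versus k = 5.
hc <- as.hclust(hc_ward)
table(cutree(hc, k = 4))
table(cutree(hc, k = 5))
```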

| Cluster (k = 4) | Size |
|---|---|
| 1 | 217 |
| 2 | 202 |
| 3 | 142 |
| 4 | 39 |

| Cluster (k = 5) | Size |
|---|---|
| 1 | 193 |
| 2 | 202 |
| 3 | 142 |
| 4 | 39 |
| 5 | 24 |

Looking at the household features, cluster 2 has a much higher affluence index but lower brand loyalty, meaning its households buy a wider variety of brands. Cluster 1, on the other hand, has the highest brand loyalty in this market. The average household size of cluster 4 is much larger than the others, which is why its consumption volume is also higher.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.645161 | 3.903226 | 14.12903 | 0.6618342 | 3.082949 | 22.84332 | 9.336405 | 10251.530 |
| 2 | 2.326733 | 4.678218 | 21.40594 | 0.2316532 | 5.039604 | 45.64356 | 25.950495 | 13423.292 |
| 3 | 2.500000 | 3.436620 | 14.50000 | 0.1445449 | 2.598591 | 22.75352 | 11.450704 | 7752.676 |
| 4 | 2.589744 | 6.025641 | 19.56410 | 0.3023641 | 3.230769 | 32.92308 | 14.282051 | 28510.128 |

The first plot shows the cluster assignments of the observations visually, and the second shows the dendrogram of the hierarchical clustering built with Ward's method and 4 clusters.

### Basis for Purchase Variables

Second, we applied hierarchical clustering to the other variable set. As before, the basis-for-purchase variables comprise the percent of volume purchased without promotion, under promo code 6, and under other promo codes, as well as the proportions of beauty, health, and baby products purchased.

We again chose the clustering based on Ward's method rather than complete linkage. The cluster sizes under the complete and weighted linkages vary dramatically, with nearly all households (99%) in cluster 1, whereas Ward's method creates far more balanced clusters. Ward's method is clearly the best choice for this dataset.

| Method | Agglomerative Coefficient |
|---|---|
| Weighted | 0.95 |
| Complete | 0.96 |
| Ward | 0.98 |

| Cluster (weighted linkage) | Size |
|---|---|
| 1 | 592 |
| 2 | 7 |
| 3 | 1 |

| Cluster (complete linkage) | Size |
|---|---|
| 1 | 594 |
| 2 | 5 |
| 3 | 1 |

| Cluster (Ward's method) | Size |
|---|---|
| 1 | 419 |
| 2 | 66 |
| 3 | 115 |

Using these clustering variables, we again checked for the optimal number of clusters with the silhouette and elbow methods. Both indicate 7-8 clusters as optimal, but this is a large number from a business perspective.

Instead, we compared smaller numbers of clusters. In the k = 3 solution below, cluster 1 dominates the others, while the 4-cluster solution is more balanced. Increasing the count beyond 4, on the other hand, produces clusters too small for useful market segmentation. We therefore set the number of clusters to 4, since its clusters are the most balanced; concentrating on feasible customer segments is preferable.

| Cluster (k = 3) | Size |
|---|---|
| 1 | 419 |
| 2 | 66 |
| 3 | 115 |

| Cluster (k = 4) | Size |
|---|---|
| 1 | 183 |
| 2 | 236 |
| 3 | 66 |
| 4 | 115 |

| Cluster (k = 5) | Size |
|---|---|
| 1 | 25 |
| 2 | 236 |
| 3 | 66 |
| 4 | 115 |
| 5 | 158 |

The first plot shows the cluster distribution of the observations; the clusters overlap, so these features were not able to separate the observations properly. The second shows the dendrogram of the hierarchical clustering built with Ward's method and 4 clusters.

The following table indicates the differences between the clusters. Cluster 3 differs significantly from the others (higher brand loyalty and a lower affluence index), while clusters 1 and 4 show similar consumption behaviors.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.065574 | 3.841530 | 19.224044 | 0.2375024 | 3.677596 | 32.89071 | 16.912568 | 10649.81 |
| 2 | 2.635593 | 4.559322 | 17.072034 | 0.4206826 | 3.758475 | 31.03814 | 15.211864 | 13402.30 |
| 3 | 3.439394 | 3.848485 | 7.636364 | 0.7899970 | 2.818182 | 24.54545 | 7.757576 | 12936.59 |
| 4 | 2.373913 | 4.191304 | 18.791304 | 0.2421365 | 3.791304 | 32.41739 | 19.600000 | 10288.61 |

### Combined Variables

Lastly, we applied hierarchical clustering to the combined variables, i.e., both the purchase behavior and the basis-for-purchase sets.

We used Ward's method as the linkage (it performed well above) and cut the tree after inspecting the dendrogram. Overall, we can use 5 clusters to segment the households and then concentrate on the characteristics of those clusters.

| Method | Agglomerative Coefficient |
|---|---|
| Ward | 0.96 |

| Cluster (k = 3) | Size |
|---|---|
| 1 | 340 |
| 2 | 68 |
| 3 | 192 |

| Cluster (k = 4) | Size |
|---|---|
| 1 | 290 |
| 2 | 68 |
| 3 | 192 |
| 4 | 50 |

| Cluster (k = 5) | Size |
|---|---|
| 1 | 109 |
| 2 | 181 |
| 3 | 68 |
| 4 | 192 |
| 5 | 50 |

We also checked the elbow and silhouette methods for the combined variables. The elbow method does not give a distinctive result here, and the silhouette method indicates 2 clusters, which is too few; we still prefer a few more.

The first plot shows the cluster distribution of the observations, and it is evident that the clusters overlap, meaning the features were not able to separate the observations properly. The second shows the dendrogram of the hierarchical clustering built with Ward's method and 4 clusters.

This table shows the differences between clusters based on the combined variables. Clusters 1 and 2 clearly behave very differently: cluster 2 has the lowest affluence index and the highest brand loyalty (target this group with low-priced products), whereas for premium products you should consider households in clusters 1 and 3.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.400000 | 4.465517 | 19.720690 | 0.2172261 | 4.434483 | 38.97241 | 21.748276 | 12500.06 |
| 2 | 3.382353 | 3.852941 | 8.397059 | 0.7820743 | 2.720588 | 23.45588 | 7.308823 | 12401.69 |
| 3 | 2.578125 | 4.281250 | 15.708333 | 0.4877121 | 3.088542 | 25.20833 | 11.229167 | 12566.09 |
| 4 | 1.580000 | 2.720000 | 18.120000 | 0.2582438 | 2.360000 | 19.10000 | 9.820000 | 5356.80 |

## Best Segmentation and Clustering Model

We implemented 3 different clustering methods: k-means, k-medoids, and hierarchical clustering. The clusters obtained from these procedures differ slightly because the modeling approach varies. K-means bases similarity on cluster means, which makes it sensitive to outliers; k-medoids instead uses actual observations (medoids) as cluster centers, which is more robust. Agglomerative hierarchical clustering begins with the maximum number of clusters (one per observation) and merges until a single cluster is left, giving us the chance to cut the dendrogram at the proper level; its advantage is a more in-depth view of the cluster structure.

We decided to use a low number of clusters, such as 3 or 4, to segment the households, because we found no benefit from a higher number in terms of business interpretation. Besides, the dataset contains only about 600 households, a considerably small sample, which also justifies a small number of clusters.

Overall, k-medoids performs slightly better than k-means, and hierarchical clustering and k-medoids show similar performance in terms of the distribution of observations. We selected the k-medoids model with 4 clusters as the best segmentation; it is shown below for the households in the soap market.

| Cluster | SEC | HS | Affluence | maxBr | No. of Brands | No. of Trans | Brand Runs | Volume |
|---|---|---|---|---|---|---|---|---|
| 1 | 2.728814 | 4.711864 | 16.050847 | 0.4815375 | 3.112994 | 25.58757 | 10.949153 | 14096.186 |
| 2 | 2.403061 | 4.821429 | 21.066326 | 0.2102490 | 5.000000 | 46.08673 | 26.132653 | 14036.668 |
| 3 | 3.446154 | 3.938461 | 7.953846 | 0.7977172 | 2.646154 | 23.92308 | 7.123077 | 13018.692 |
| 4 | 1.987654 | 2.962963 | 16.820988 | 0.2743024 | 2.956790 | 22.06790 | 11.901235 | 6521.204 |

The table above shows that cluster 3 has the lowest affluence index and the highest brand loyalty. Clusters 1 and 2 have similar household sizes, yet their brand loyalties differ. Cluster 4 purchases less than the other clusters because its household size is the lowest. This model explains consumer behavior well and adds vital value to marketing planning.

For this best segmentation, we built a decision tree to predict the cluster labels of the observations. We used information gain as the splitting criterion and set the `minsplit` and complexity (`cp`) parameters to 40 and 0.001, respectively. We then checked variable importance; the table below shows the importance score of each feature:
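A sketch of the tree fit with rpart, assuming a data frame `soap` (a hypothetical name) that contains the predictors plus a `cluster` column holding the 4-cluster k-medoids labels:

```r
# Classification tree predicting the k-medoids cluster labels.
library(rpart)
soap$cluster <- factor(soap$cluster)  # labels must be a factor for "class"
tree <- rpart(cluster ~ ., data = soap, method = "class",
              parms = list(split = "information"),  # information gain
              control = rpart.control(minsplit = 40, cp = 0.001))

tree$variable.importance  # importance score of each feature
```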

```
##            Avg__Price              Pr_Cat_3            Brand_Runs 
##            154.719836            140.866784            118.814076 
##              Pr_Cat_1         No__of__Trans         No__of_Brands 
##             98.126803             45.781570             42.372578 
##            Others_999  Pur_Vol_No_Promo____              Pr_Cat_2 
##             39.129662             36.143951             36.056754 
##              Vol_Tran                 maxBr     Pur_Vol_Promo_6__ 
##             32.375785             31.303730             25.681229 
##    Trans___Brand_Runs            PropCat_15 Pur_Vol_Other_Promo__ 
##             13.043221             10.900374              8.560410 
##            PropCat_12          Total_Volume             PropCat_5 
##              7.630262              6.009459              5.007882
```

According to this table, `Avg__Price`, `Pr_Cat_3`, `Brand_Runs`, `Pr_Cat_1`, and `No__of__Trans` are the 5 most important variables for predicting the cluster label accurately. Notably, the most important variables are a combination of purchase behavior and basis-for-purchase variables.

This table shows the confusion matrix and accuracy on the training data:

```
##    predTrn
##       1   2   3   4
##   1 116   7   1   3
##   2  11 120   3   3
##   3   0   0  50   0
##   4   8  20   0  78
```

| Metric | Result |
|---|---|
| Training Accuracy | 0.8667 |

This table shows the confusion matrix and accuracy on the test data:

```
##    predTst
##      1  2  3  4
##   1 44  4  1  1
##   2 12 44  0  3
##   3  0  0 15  0
##   4  7 12  0 37
```

| Metric | Result |
|---|---|
| Testing Accuracy | 0.7778 |
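For reference, a sketch of how these matrices and accuracies could be computed, assuming train/test splits `soapTrn` and `soapTst` (hypothetical names) with the true labels in `cluster`:

```r
# Confusion matrices and accuracies for the tree.
predTrn <- predict(tree, soapTrn, type = "class")
table(true = soapTrn$cluster, predTrn)
mean(predTrn == soapTrn$cluster)  # training accuracy

predTst <- predict(tree, soapTst, type = "class")
table(true = soapTst$cluster, predTst)
mean(predTst == soapTst$cluster)  # testing accuracy
```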