Finally, we can move on to the clustering analysis, whose goal is to separate customers by their characteristics and to find the representative tendency of each group (segment). To that end, I will use two approaches: k-means and hierarchical clustering. The former partitions observations into k groups so that each observation is close to the representative point of its group, while the latter (in its agglomerative form) builds clusters from the bottom up by repeatedly merging the most similar observations. You can get more information about k-means clustering at this page and hierarchical clustering at this page in Wikipedia.
There are several packages for clustering analysis. cluster is one of the popular ones and provides diverse functions; a detailed explanation of the package can be found in this document. You need to install the package and load it as follows.
[code language="r"]
install.packages("cluster")
library(cluster)
[/code]
Based on this package, I will start with k-means clustering, which is the most frequently used method. Before running the analysis, it is important to find the optimal number of clusters, called "k". Each of the k clusters is summarized by a representative center, and the algorithm chooses those centers so that the sum of squared distances between observations and their cluster center is minimized.
[code language="r"]
# Elbow method: total within-group sum of squares for k = 1..15
d5 <- (nrow(d4)-1)*sum(apply(d4, 2, var))  # k = 1: total variance
set.seed(123)  # kmeans() uses random starting centers
for (i in 2:15) d5[i] <- sum(kmeans(d4, centers=i)$withinss)
plot(1:15, d5, type="b", xlab = "Number of Clusters", ylab = "Within groups sum of squares")
[/code]
As you can see from the chart, the elbow point is 13: beyond 13 clusters, the within-group sum of squares is hardly reduced any further, so "k" could be 13. However, I wanted fewer clusters than that for this product, since the total sample size is only 2010 (the original 2012 samples minus two with N/A values). I chose four instead of 13.
[code language="r"]
# pam() performs partitioning around medoids, a robust variant of k-means
d6 <- pam(d4, 4, metric = "euclidean", stand = TRUE)
head(d6)
[/code]
You can see which observation is the representative (medoid) for each group and how many observations belong to each group. You can also see what characteristics each group has (refer to the four IDs). This output can be saved as a text file by using sink(); using write() is complicated because the returned object, d6, is a list.
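For reference, a minimal sketch of saving the printed output with sink(). The file name is just an example, and a toy data frame stands in for d4/d6 here so the snippet runs on its own:

[code language="r"]
library(cluster)
# toy data standing in for d4 (the real data comes from the earlier steps)
toy <- data.frame(x = c(1, 2, 8, 9), y = c(1, 2, 8, 9))
res <- pam(toy, 2)              # stands in for d6
sink("pam_result.txt")          # redirect console output to a text file
print(res)                      # printed representation of the pam list object
sink()                          # restore output to the console
[/code]

Everything printed between the two sink() calls lands in the file, which is why it works where write() struggles with a list.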
[ Outcome of clustering analysis (part) ]

Based on agnes(), the hierarchical clustering analysis can be shown as a dendrogram as follows.
[code language="r"]
d7 <- agnes(d4, metric = "euclidean", stand = TRUE, method = "average")
plot(d7)
[/code]
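The dendrogram only shows the merging order; to actually assign each observation to a fixed number of groups, the tree can be cut with cutree(). A minimal sketch, again using toy data in place of d4 (the number of groups, 3, is arbitrary here):

[code language="r"]
library(cluster)
# toy data standing in for d4
toy <- data.frame(x = c(1, 2, 8, 9, 15, 16), y = c(1, 2, 8, 9, 15, 16))
h <- agnes(toy, metric = "euclidean", stand = TRUE, method = "average")
# cutree() expects an hclust object, so convert the agnes result first
groups <- cutree(as.hclust(h), k = 3)
groups  # one cluster label per observation
[/code]

The resulting vector of labels can then be attached back to the original data to profile each segment.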

