{"id":250,"date":"2014-12-31T13:25:54","date_gmt":"2014-12-31T18:25:54","guid":{"rendered":"http:\/\/jkthinks.synology.me\/?p=1752"},"modified":"2020-09-04T22:58:35","modified_gmt":"2020-09-04T22:58:35","slug":"clustering-analysis-in-r-4-data-analysis","status":"publish","type":"post","link":"https:\/\/www.jkthinks.com\/?p=250","title":{"rendered":"[Clustering Analysis in R] #4. Data Analysis"},"content":{"rendered":"<p>Finally, we can move on to the clustering analysis itself, which separates customers by their characteristics and identifies the representative tendency of each group (segment). To that end, I will use two approaches: k-means and hierarchical clustering. The former partitions the observations into k groups so that each observation is close to the representative of its own group, while the latter, in its agglomerative form, builds groups from the bottom up by repeatedly merging the most similar observations. You can find more information about k-means clustering <a href=\"http:\/\/en.wikipedia.org\/wiki\/K-means_clustering\">on this page<\/a> and about hierarchical clustering <a href=\"http:\/\/en.wikipedia.org\/wiki\/Hierarchical_clustering\">on this page<\/a> of Wikipedia.<\/p>\n<p>Several R packages support clustering analysis. <code>cluster<\/code> is one of the popular ones and provides a wide range of functions. A detailed explanation of the package can be found <a href=\"http:\/\/cran.r-project.org\/web\/packages\/cluster\/cluster.pdf\">in this document<\/a>. Install the package and load it as follows.<br \/>\n[code language=&#8221;r&#8221;]<br \/>\ninstall.packages(&quot;cluster&quot;)<br \/>\nlibrary(cluster)<br \/>\n[\/code]<\/p>\n<p>Based on this package, I will start with k-means clustering, which is the most frequently used method. Before running the analysis, it is important to find a suitable number of clusters. This number is called \u201ck\u201d. It is also the number of representative observations, one for each group. 
These \u201ck\u201d representatives are chosen to minimize the sum of squared distances to the other observations within their groups.<br \/>\n[code language=&#8221;r&#8221;]<br \/>\n# Within-group sum of squares for a single cluster (k = 1)<br \/>\nd5 &lt;- (nrow(d4)-1)*sum(apply(d4,2,var))<br \/>\n# Within-group sum of squares for k = 2 to 15<br \/>\nfor (i in 2:15) d5[i] &lt;- sum(kmeans(d4, centers=i)$withinss)<br \/>\nplot(1:15, d5, type=&quot;b&quot;, xlab = &quot;Number of Clusters&quot;, ylab = &quot;Within groups sum of squares&quot;)<br \/>\n[\/code]<\/p>\n<p><a href=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2014\/12\/elbow.jpg\"><img loading=\"lazy\" class=\" size-full wp-image-1742 aligncenter\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2014\/12\/elbow.jpg\" alt=\"elbow\" width=\"369\" height=\"359\" \/><\/a><\/p>\n<p>As you can see from the chart, the elbow point is at 13. That is, adding clusters beyond 13 hardly reduces the within-group sum of squares, so \u201ck\u201d could be 13. However, I wanted fewer clusters than that for the product, since there are only 2010 samples in total (the original 2012 samples minus two that contained N\/A values). I chose four instead of 13 and ran the clustering with <code>pam()<\/code>, which partitions the data around medoids (k-medoids), a close relative of k-means in which each group's representative is an actual observation.<br \/>\n[code language=&#8221;r&#8221;]<br \/>\nd6 &lt;- pam(d4, 4, metric = &quot;euclidean&quot;, stand = TRUE)<br \/>\nhead(d6)<br \/>\n[\/code]<\/p>\n<p>The output shows which observation is the representative of each group and how many observations each group contains. You can also see the characteristics of each group (refer to the four IDs). The output can be saved to a text file by using <code>sink()<\/code>. 
Note that <code>write()<\/code> is awkward to use here because the returned object, d6, is a list.<\/p>\n<p style=\"text-align: center;\"><b>[ Outcome of clustering analysis (part) ]<\/b><\/p>\n<p><strong><strong><img loading=\"lazy\" class=\" size-full wp-image-1737 aligncenter\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2014\/12\/outcome-of-clustering.jpg\" alt=\"outcome of clustering\" width=\"669\" height=\"97\" \/><\/strong><\/strong><\/p>\n<p>With <code>agnes()<\/code>, the result of hierarchical clustering can be drawn as a dendrogram as follows.<br \/>\n[code language=&#8221;r&#8221;]<br \/>\nd7 &lt;- agnes(d4, metric = &quot;euclidean&quot;, stand = TRUE, method = &quot;average&quot;)<br \/>\nplot(d7)<br \/>\n[\/code]<\/p>\n<p><strong><br \/>\n<a href=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2014\/12\/dendrogram.jpg\"><img loading=\"lazy\" class=\" wp-image-1738 aligncenter\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2014\/12\/dendrogram-1024x529.jpg\" alt=\"dendrogram\" width=\"766\" height=\"396\" \/><\/a><\/strong><\/p>\n<p>&nbsp;<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Finally, we can move on to the clustering analysis itself, which separates customers by their characteristics and identifies the representative tendency of each group (segment). To that end, I will use two approaches: k-means and hierarchical clustering. 
The former partitions the observations into k groups so that each observation is close to the representative [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts\/250"}],"collection":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=250"}],"version-history":[{"count":1,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts\/250\/revisions"}],"predecessor-version":[{"id":279,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts\/250\/revisions\/279"}],"wp:attachment":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=250"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=250"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=250"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}