{"id":17,"date":"2015-03-31T21:33:41","date_gmt":"2015-04-01T02:33:41","guid":{"rendered":"http:\/\/jkthinks.synology.me\/?p=1798"},"modified":"2020-09-04T22:58:34","modified_gmt":"2020-09-04T22:58:34","slug":"principal-component-analysis-factor-analysis-in-r","status":"publish","type":"post","link":"https:\/\/www.jkthinks.com\/?p=17","title":{"rendered":"Principal Component Analysis &#038; Factor Analysis in R"},"content":{"rendered":"<p>Let\u2019s say you have a chunk of survey data consisting of more than fifty questions, with as many as 60,000 respondents. It may take a lot of time to analyze it according to your original intention or analysis objectives. Most people try to classify data or divide them into pieces after getting summary information such as mean and standard deviation for each variable.<\/p>\n<p>As I mentioned in \u201cClustering Analysis (or Cluster Analysis)\u201d, it is important to look at similarities in the data to get more informative groups from highly varied and dispersed raw data. To that end, Principal Component Analysis and Factor Analysis are very effective ways to find which components or factors explain most of the variation and correlation among the variables.<\/p>\n<p>Before doing any work in R, we need to clarify what exactly they are and how these approaches are used in applied areas such as data science and marketing.<\/p>\n<blockquote><p><b>Principal component analysis<\/b> (<b>PCA<\/b>) is a statistical procedure that uses an <a href=\"http:\/\/en.wikipedia.org\/wiki\/Orthogonal_transformation\">orthogonal transformation<\/a> to convert a set of observations of possibly correlated variables into a set of values of <a href=\"http:\/\/en.wikipedia.org\/wiki\/Correlation_and_dependence\">linearly uncorrelated<\/a> variables called principal components. The number of principal components is less than or equal to the number of original variables.  
&#8211; <a href=\"http:\/\/en.wikipedia.org\/wiki\/Principal_component_analysis\">Wikipedia<\/a><\/p><\/blockquote>\n<p>Simply put, PCA lets you derive distinct components from the variables by examining the correlations between them. In most marketing surveys, these variables are the survey questions. Importantly, a component is not a single question but a linear combination of questions. Also, the number of components is less than or equal to the number of variables or questions in the survey, as stated in the Wikipedia description above.<\/p>\n<blockquote><p><b>Factor analysis<\/b> is a <a href=\"http:\/\/en.wikipedia.org\/wiki\/Statistics\">statistical<\/a> method used to describe <a href=\"http:\/\/en.wikipedia.org\/wiki\/Variance\">variability<\/a> among observed, correlated <a href=\"http:\/\/en.wikipedia.org\/wiki\/Variable_(mathematics)\">variables<\/a> in terms of a potentially lower number of unobserved variables called factors. &#8211; <a href=\"http:\/\/en.wikipedia.org\/wiki\/Factor_analysis\">Wikipedia<\/a><\/p><\/blockquote>\n<p>Factor analysis likewise looks for factors, which are often loosely referred to as components. In fact, there has been some controversy over whether the two methods, PCA and Factor Analysis, are equivalent. However, I will not dive into how they are mathematically derived or how they differ.<\/p>\n<p>These methods assume that the variables hold continuous numeric data rather than categorical or dummy (0 or 1) data. Categorical data can be analyzed by <a href=\"http:\/\/en.wikipedia.org\/wiki\/Correspondence_analysis\">Correspondence Analysis<\/a>. FYI, a simpler way to handle mixed data types is to use the FactoMineR package in R. 
If the variables consist only of continuous numeric data, you do not need to install any additional packages.<\/p>\n<p>Let\u2019s do a simple analysis with PCA and Factor Analysis.<\/p>\n<p>I will use the wine DB in FactoMineR, which has 21 observations and 31 variables. As you can see from the head of the DB, there are two categorical variables, <i>Label<\/i> and <i>Soil<\/i>.<\/p>\n<p style=\"text-align: center;\">Summary of DB<\/p>\n<p><strong><strong><a href=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/DB_wine.jpg\"><img loading=\"lazy\" class=\"aligncenter wp-image-1800\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/DB_wine-1024x529.jpg\" alt=\"DB_wine\" width=\"600\" height=\"310\" \/><\/a><\/strong><\/strong><\/p>\n<pre class=\"lang:r decode:true\">library(FactoMineR)\r\nlibrary(psych)\r\nlibrary(corrgram)\r\ndata(wine)\r\nsummary(wine)\r\n\r\n# Check correlations between the numeric variables (columns 4 to 31)\r\ncorrgram(cor(wine[,4:31]), type=\"corr\", upper.panel=panel.conf)\r\n\r\n# Principal Component Analysis\r\n# princomp fails when there are fewer observations than variables,\r\n# so duplicate the rows of the wine DB as a workaround\r\nwine1 &lt;- rbind(wine, wine)\r\npca &lt;- princomp(wine1[,4:31], scores=TRUE, cor=TRUE)\r\n\r\n# Check the analysis result\r\nsummary(pca)\r\nloadings(pca)<\/pre>\n<p>Running this code produces components, each of which is a combination of several variables. You can see which variables load strongly on which component, which gives each component a profile that distinguishes it from the others (see the red boxes in the following screenshot). From this result, you can give each component a new label that matches its underlying characteristics. 
This new label represents <a href=\"http:\/\/en.wikipedia.org\/wiki\/Latent_variable\">latent variables<\/a>, which are hidden aspects of a group, category, etc.<\/p>\n<p><a href=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/PCA_wine.jpg\"><img loading=\"lazy\" class=\"aligncenter wp-image-1802\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/PCA_wine-1024x559.jpg\" alt=\"PCA_wine\" width=\"600\" height=\"328\" \/><\/a><\/p>\n<p>Now, you may wonder how many components are appropriate for the wine DB. To decide, draw a bar graph or a scree plot of the PCA. A component whose eigenvalue (the y-value in the following bar graph) is greater than one can be treated as significant, so components 1 to 5 are meaningful in this case. See <a href=\"http:\/\/en.wikipedia.org\/wiki\/Eigenvalues_and_eigenvectors\">here<\/a> for more information about eigenvalues.<\/p>\n<pre class=\"lang:r decode:true \"># Draw bar graph and scree plot\r\nplot(pca)\r\nscreeplot(pca, npcs=28, type=\"line\", main=\"Scree Plot\")\r\nbiplot(pca)<\/pre>\n<p><a href=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/bargraph_wine.jpeg\"><img loading=\"lazy\" class=\"aligncenter wp-image-1799\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/bargraph_wine.jpeg\" alt=\"bargraph_wine\" width=\"450\" height=\"433\" \/><\/a><\/p>\n<p>Likewise, Factor Analysis groups variables into a predetermined number of factors. In this case, factanal fails with a singular matrix error, so I use a different function to get the factors.<\/p>\n<pre class=\"lang:r decode:true \"># Factor Analysis\r\n# Note: this call stops with a singular matrix error on this data\r\nfa &lt;- factanal(wine1[,4:31], factors=6)\r\n\r\n# Use the minimum residual method (fa in the psych package) instead of factanal because of the singular matrix error\r\nfa &lt;- fa(r=cor(wine1[,4:31]), nfactors=6, rotate=\"varimax\", smc=FALSE, fm=\"minres\")\r\nloadings(fa)<\/pre>\n<p>The result of Factor Analysis is as follows. 
The table shows the loading (correlation) of each variable on each factor, and its last rows show how much of the variance each factor explains; for example, the first factor (MR1) accounts for 41%.<\/p>\n<p><a href=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/FA_wine.jpg\"><img loading=\"lazy\" class=\" size-full wp-image-1801 aligncenter\" src=\"http:\/\/jkthinks.synology.me\/wp-content\/uploads\/2015\/03\/FA_wine.jpg\" alt=\"FA_wine\" width=\"577\" height=\"703\" \/><\/a><\/p>\n<p>As I mentioned, mixed data requires other functions such as MFA in FactoMineR. You can find further information on how to handle this <a href=\"http:\/\/factominer.free.fr\/advanced-methods\/multiple-factor-analysis.html\">here<\/a>.<\/p>\n","protected":false},"excerpt":{"rendered":"<p>Let\u2019s say you have a chunk of survey data consisting of more than fifty questions, with as many as 60,000 respondents. It may take a lot of time to analyze it according to your original intention or analysis objectives. 
Most people try to classify data or divide them into pieces [&hellip;]<\/p>\n","protected":false},"author":1,"featured_media":0,"comment_status":"open","ping_status":"open","sticky":false,"template":"","format":"standard","meta":[],"categories":[9],"tags":[],"_links":{"self":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts\/17"}],"collection":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts"}],"about":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/types\/post"}],"author":[{"embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/users\/1"}],"replies":[{"embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcomments&post=17"}],"version-history":[{"count":1,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts\/17\/revisions"}],"predecessor-version":[{"id":275,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=\/wp\/v2\/posts\/17\/revisions\/275"}],"wp:attachment":[{"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Fmedia&parent=17"}],"wp:term":[{"taxonomy":"category","embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Fcategories&post=17"},{"taxonomy":"post_tag","embeddable":true,"href":"https:\/\/www.jkthinks.com\/index.php?rest_route=%2Fwp%2Fv2%2Ftags&post=17"}],"curies":[{"name":"wp","href":"https:\/\/api.w.org\/{rel}","templated":true}]}}