Principal Component Analysis & Factor Analysis in R

Suppose you have survey data with more than fifty questions and as many as 60,000 respondents. Analyzing every question one by one against your original objectives can take a lot of time. Most people first compute summary information such as the mean and standard deviation for each variable and then try to classify the data or split it into pieces.

As I mentioned in “Clustering Analysis (or Cluster Analysis)”, it is important to look at similarities in the data to get more informative groups out of varied and dispersed raw data. To that end, Principal Component Analysis and Factor Analysis are effective ways to find the components or factors that account for most of the correlation among the observed variables. Neither method establishes causality or distinguishes dependent from independent variables.

Before doing any work in R, we need to clarify what exactly these methods are and how they are used in applied areas such as data science and marketing.

Principal component analysis (PCA) is a statistical procedure that uses an orthogonal transformation to convert a set of observations of possibly correlated variables into a set of values of linearly uncorrelated variables called principal components. The number of principal components is less than or equal to the number of original variables. – Wikipedia

Simply put, PCA lets you derive components from a set of variables by looking at the correlations between them. In most marketing surveys, the variables are the survey questions. Importantly, a component is not a single question but a weighted combination of questions. As the Wikipedia description above states, the number of components is less than or equal to the number of variables (questions) in the survey.
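To make the idea of "a combination of variables" concrete, here is a minimal sketch. It assumes the built-in USArrests data as a stand-in (it is not survey data), and the object names are only illustrative.

```r
# Minimal sketch on the built-in USArrests data (an assumption for illustration).
pca_demo <- prcomp(USArrests, scale. = TRUE)  # scale. = TRUE standardizes the variables first

# Each column of `rotation` holds the weights that combine the original
# variables into one component.
print(round(pca_demo$rotation, 2))

# The component scores are linearly uncorrelated with each other,
# so their correlation matrix is (numerically) the identity.
score_cor <- cor(pca_demo$x)
print(round(score_cor, 2))

# There are at most as many components as original variables.
n_components <- ncol(pca_demo$x)
n_variables  <- ncol(USArrests)
```

In a survey setting, each row of `rotation` would correspond to one question and each column to one component.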

Factor analysis is a statistical method used to describe variability among observed, correlated variables in terms of a potentially lower number of unobserved variables called factors. – Wikipedia

Factor analysis looks for latent factors, which are often loosely referred to as components. There has long been a debate over whether the two methods, PCA and Factor Analysis, are equivalent. I will not go into how they are derived mathematically or how exactly they differ.

Both methods assume that the variables are continuous and numeric rather than categorical or dummy-coded (0 or 1). Categorical data can be analyzed with Correspondence Analysis. FYI, a simpler way to handle mixed data types is the FactoMineR package in R. If all variables are continuous numeric data, the princomp and factanal functions that ship with R are enough and no additional package is required.
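One way to check this requirement before running either method is to select columns by type instead of by position. The sketch below assumes the built-in iris data as a stand-in for any data frame that mixes numeric and categorical columns.

```r
# iris stands in for a data frame with both numeric and categorical columns.
is_num <- sapply(iris, is.numeric)

# Keep only the continuous columns for PCA or Factor Analysis.
numeric_part <- iris[, is_num]

# The remaining columns would need Correspondence Analysis or a mixed-data method.
categorical_cols <- names(iris)[!is_num]
print(categorical_cols)
```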

Let’s do a simple analysis for PCA and Factor Analysis.

I will use the wine dataset in FactoMineR, which has 21 observations and 31 variables. As the summary shows, two of the variables, Label and Soil, are categorical.

Summary of DB

library(FactoMineR)
library(psych)
library(corrgram)
data(wine)
summary(wine)

# Check correlation between variables except categorical variables
corrgram(cor(wine[,4:31]), type="corr", upper.panel=panel.conf)

# Principal Component Analysis
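# princomp() stops when there are fewer observations (21) than variables (28), so stack the data on itself.
# Duplicating every row leaves the correlation matrix unchanged; prcomp() accepts the original rows directly.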
wine1 <- rbind(wine, wine)
pca <- princomp(wine1[,4:31], scores=TRUE, cor=TRUE)

# Check the analysis result
summary(pca)
loadings(pca)
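The row-duplication workaround above can look suspicious, so here is a small sketch of why it does not change the PCA result, and of prcomp() as an alternative. The data is simulated (an assumption, chosen only to have the same 21 x 28 shape), and all object names are illustrative.

```r
set.seed(1)
# Simulated stand-in with 21 observations and 28 variables.
X <- matrix(rnorm(21 * 28), nrow = 21, ncol = 28)

# princomp() refuses data with fewer rows than columns.
princomp_fails <- inherits(try(princomp(X, cor = TRUE), silent = TRUE), "try-error")

# Stacking the data on itself does not change the correlation matrix,
# which is all that a correlation-based PCA looks at.
same_cor <- isTRUE(all.equal(cor(rbind(X, X)), cor(X)))

# prcomp() works from a singular value decomposition and accepts the
# original rows without duplication.
pc_svd  <- prcomp(X, scale. = TRUE)
eig_svd <- pc_svd$sdev^2

# Its component variances match the eigenvalues of the correlation matrix.
eig_cor <- eigen(cor(X), symmetric = TRUE)$values
```

With only 21 observations, at most 20 eigenvalues can be non-zero, so the comparison is meaningful for the first 20 components.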

Running this code yields components, each a combination of several variables. The loadings show which variables are strongly correlated with which component, and that pattern is what distinguishes one component from another (see the red boxes in the captured image of the output). From this result, you can give each component a new label that matches its characteristics. The label stands for a latent variable, a hidden aspect of a group, category, etc.

Now you may wonder how many components are appropriate for the wine data. To decide, draw a bar graph or a scree plot of the PCA result. A component whose eigenvalue (the y-value of the following bar graph) is greater than one is usually treated as significant, a rule of thumb known as the Kaiser criterion, so components 1 to 5 are meaningful in this case. See here for more information about eigenvalues.

# Draw bar graph and scree plot
plot(pca)
screeplot(pca, npcs=28, type="lines", main="Scree Plot")
biplot(pca)
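The eigenvalue rule can also be checked numerically instead of being read off the plot. The sketch below assumes the built-in USArrests data so that it runs on its own; for the wine analysis the same expressions would be applied to the `pca` object.

```r
# Stand-in PCA on built-in data, using the correlation matrix as above.
pca_demo <- princomp(USArrests, cor = TRUE)

# With cor = TRUE, the squared standard deviations are the eigenvalues
# of the correlation matrix.
eigenvalues <- pca_demo$sdev^2

# Kaiser criterion: keep the components whose eigenvalue exceeds one.
n_keep <- sum(eigenvalues > 1)

# Share of total variance per component and the running total.
var_share <- eigenvalues / sum(eigenvalues)
cum_share <- cumsum(var_share)
print(round(cbind(eigenvalues, var_share, cum_share), 3))
```

The eigenvalue rule is only a rule of thumb; the cumulative share and the elbow of the scree plot are worth checking alongside it.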

Likewise, Factor Analysis groups variables into a predetermined number of factors. Here factanal fails because the correlation matrix is singular, so I use a different function to extract the factors.

# Factor Analysis
# factanal() stops with a singular matrix error on this data; try() lets the rest of the script run
fa_ml <- try(factanal(wine1[,4:31], factors=6))

# Use the minimum residual method from the psych package instead of factanal
fa_minres <- fa(r=cor(wine1[,4:31]), nfactors=6, rotate="varimax", SMC=FALSE, fm="minres")
loadings(fa_minres)

The result of Factor Analysis is as follows. The loadings table shows the correlation between each variable and each factor, and the rows at the end of the table show how much of the variance each factor explains; for example, the first factor (MR1) accounts for 41%.
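The proportion printed at the end of the loadings table can be reproduced by hand from the loadings themselves. The sketch below assumes the built-in ability.cov covariance data and base R's factanal rather than the wine data, so the numbers differ from the 41% above.

```r
# Two-factor model on the built-in ability.cov data (an assumption for illustration).
fa_demo <- factanal(factors = 2, covmat = ability.cov)

# Plain matrix of loadings: one row per variable, one column per factor.
L <- unclass(loadings(fa_demo))

# Sum of squared loadings per factor, then divided by the number of variables,
# gives the proportion of variance each factor explains.
ss_loadings <- colSums(L^2)
prop_var    <- ss_loadings / nrow(L)
print(round(prop_var, 3))
```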

As I mentioned, mixed data requires other functions like MFA in FactoMineR. You can obtain further information on how to handle this here.
