[Clustering Analysis in R] #2. Data Processing

By and large, there are two types of data: quantitative and qualitative. Analyses such as regression and classification work on numbers, so qualitative data have to be transformed into quantitative data first. Most categorical data can be turned into dummy variables. For instance, male can be coded as zero and female as one. The same approach works for a wide range of qualitative data, such as jobs and personal interests.

I made dummy variables for gender and income level (low, med, and high), as well as for personal preferences for several activities, which can be used to find customers’ interests.
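As a rough sketch of what such coding can look like, the snippet below builds dummy variables from a tiny made-up data frame. The data frame `customers` and its column names are assumptions for illustration only, not the data used in this post.

```r
# Hypothetical toy data; names and values are illustrative only
customers <- data.frame(
  gender = c("male", "female", "female", "male"),
  income = c("low", "high", "med", "low"),
  stringsAsFactors = FALSE
)

# Two-level category: a single 0/1 column is enough (male = 0, female = 1)
customers$female <- ifelse(customers$gender == "female", 1, 0)

# Three-level category: one 0/1 column per level
customers$lowIncome  <- as.integer(customers$income == "low")
customers$medIncome  <- as.integer(customers$income == "med")
customers$highIncome <- as.integer(customers$income == "high")

print(customers)
```

Each row has exactly one of the three income columns set to 1, so the original category can always be recovered from the dummies.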

Load the data from your desktop or from Google Docs (you can refer to this post on how to retrieve your data from Google Docs).

It’s time to process data prior to analysis.

First, I check the data with head(). This shows what the data look like and what needs to be changed. If the order of the columns is not convenient for analysis, you can rearrange it. I want to push the dummy variables to the back. FYI, the data that I used have 11 columns in total.

d1 <- d[c(1, 7, 2, 3, 4, 5, 6, 8, 9, 10, 11)]

This moves the seventh column to the second position of the data frame and shifts the remaining columns back.
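Index vectors like the one above are easy to mistype when there are many columns. As an alternative sketch, the same kind of move can be done by column name. The data frame and the column name `target` below are made up for illustration.

```r
# Hypothetical data frame; column names are illustrative only
d <- data.frame(a = 1:3, b = 4:6, c = 7:9, target = c(10, 20, 30))

# Move "target" to the second position and keep the other columns in order
first <- names(d)[1]
rest  <- setdiff(names(d), c(first, "target"))
d1    <- d[c(first, "target", rest)]

names(d1)
```

Selecting by name keeps working even if a column is later added to or removed from the source data.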

Second, you may see that some column names contain a space, like “Monthly Purchase”. A name with a space in it is not syntactically valid in R, so you would have to wrap it in backticks every time you refer to it. It is easier to replace those names with new ones.

colnames(d1)[2] <- "Purchase"

colnames(d1)[5] <- "highIncome"
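If many columns need cleaning, renaming them one by one gets tedious. Here is a minimal sketch that finds names containing a space and strips the space out. The toy data frame is an assumption for illustration, and dropping the space is just one possible naming choice.

```r
# Hypothetical data frame with one awkward column name
d1 <- data.frame(id = 1:2, x = c(10, 20))
names(d1)[2] <- "Monthly Purchase"

# Which names contain a space?
has_space <- grepl(" ", names(d1), fixed = TRUE)
names(d1)[has_space]

# Remove the spaces from every column name
names(d1) <- gsub(" ", "", names(d1), fixed = TRUE)
names(d1)
```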

Third, you have to check whether the data are numeric. This step is very important because both the scale function and the agnes function that you will use soon need numeric data. To check the class of each column, I generally use sapply().

sapply(d1, class)

sapply() applies the class function to each column of d1, so you will know which columns are numeric and which are character. You can convert all columns to numeric by simply using sapply() again.

d2 <- as.data.frame(sapply(d1, as.numeric))

I used as.data.frame() because sapply() returns a matrix here, not a data frame. Note that any value that cannot be read as a number becomes NA after as.numeric().
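The conversion is worth double-checking, since text that is not a number silently turns into NA. The sketch below uses made-up values and an alternative idiom, assigning the result of lapply() into `d2[]`, which keeps the data frame structure without going through a matrix.

```r
# Hypothetical character data; "n/a" cannot be parsed as a number
d1 <- data.frame(
  Purchase = c("120", "85", "n/a"),
  gender   = c("1", "0", "1"),
  stringsAsFactors = FALSE
)

# as.numeric() warns about the unparseable value; the warning is
# suppressed here only to keep the example output clean
d2 <- d1
d2[] <- suppressWarnings(lapply(d1, as.numeric))

sapply(d2, class)     # every column should now be numeric
colSums(is.na(d2))    # number of NA values introduced per column
```

If a column is a factor rather than character, as.numeric() returns the internal level codes, so it is safer to convert with as.character() first.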

Fourth, it is important to check for NA values with is.na(). Missing values often distort analysis results. Therefore, you had better remove the rows that contain NA by using na.omit(), even though the agnes function accepts NA values.

d3 <- na.omit(d2)
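Before dropping rows, it helps to see how many values are missing and how many rows will be lost. This is a small sketch on made-up data; the column names are illustrative only.

```r
# Hypothetical numeric data with two missing values
d2 <- data.frame(
  Purchase = c(120, NA, 95, 60),
  Age      = c(34, 28, NA, 41),
  female   = c(1, 0, 1, 0)
)

# Count missing values per column
colSums(is.na(d2))

# Drop every row that has at least one NA
d3 <- na.omit(d2)

# How many rows were removed?
nrow(d2) - nrow(d3)
```

If a large share of rows disappears, it may be worth reconsidering whether removing them is acceptable for your data.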

Lastly, all columns except the dummy variables are standardized with scale() for feature scaling. You can check why feature scaling is required for analysis from here.

d4 <- cbind(as.data.frame(scale(d3[1:2])), d3[3:11])

As you see above, there are two types of numeric data: ordinary numbers and dummies. So the dummy variables in columns three to 11 are simply bound to the standardized data of columns one and two.
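To confirm that the standardization did what you expect, you can check that the scaled columns have mean 0 and standard deviation 1 while the dummies are untouched. The sketch below uses a made-up four-column data frame instead of the 11-column data from this post.

```r
# Hypothetical data: two continuous columns followed by two dummies
d3 <- data.frame(
  Purchase   = c(120, 85, 95, 60),
  Age        = c(34, 28, 45, 41),
  female     = c(1, 0, 1, 0),
  highIncome = c(0, 0, 1, 1)
)

# Standardize only the continuous columns, then bind the dummies back
d4 <- cbind(as.data.frame(scale(d3[1:2])), d3[3:4])

round(colMeans(d4[1:2]), 10)   # means of the scaled columns
apply(d4[1:2], 2, sd)          # standard deviations of the scaled columns
```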
