[Clustering Analysis in R] #3. Data Diagnostics

Next, we need to check whether the data are suitable for analysis, so that the results are not driven by a biased sample distribution or by correlated variables. To that end, a multicollinearity check looks at the correlation between the independent variables. I used corrgram() for this, a function from the corrgram package in R.

> install.packages("corrgram")

> library(corrgram)

> corrgram(cor(d4), type="corr", upper.panel=panel.conf)

The result is shown in the following chart. The darker the cell, the stronger the correlation. Here, most variables show low correlation with each other, which suggests that multicollinearity is not a major problem in this data set.
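
A chart is easy to read, but it can also help to list the strongly correlated pairs as numbers. The following is only a minimal sketch: the data frame `demo` is a made-up stand-in for d4, and the cutoff of 0.8 is an assumed rule of thumb, not a value taken from this analysis.

```r
# Hypothetical stand-in for d4: three numeric columns, two of them strongly related
set.seed(1)
x1 <- rnorm(100)
demo <- data.frame(x1 = x1,
                   x2 = x1 + rnorm(100, sd = 0.1),  # nearly a copy of x1
                   x3 = rnorm(100))                 # unrelated to the others

cm <- cor(demo)      # pairwise Pearson correlation matrix
threshold <- 0.8     # assumed cutoff, adjust it to your own data

# Keep only the upper triangle so that each pair is reported once
idx <- which(abs(cm) > threshold & upper.tri(cm), arr.ind = TRUE)
high_pairs <- data.frame(var1 = rownames(cm)[idx[, "row"]],
                         var2 = colnames(cm)[idx[, "col"]],
                         r    = cm[idx])
print(high_pairs)
```

Replacing `demo` with your own numeric data frame gives one row per pair of variables that deserves a closer look.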

If independent variables are highly correlated, you need to adjust them, for example by combining them into a single variable through multiplication or addition.
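
As a rough illustration of combining two correlated variables by addition, the sketch below standardizes both columns first so that neither scale dominates the sum. The variables `height`, `weight`, and `other` are hypothetical and are not part of d4, and standardizing before adding is my own assumption about a sensible way to do it.

```r
# Hypothetical data: weight is built to depend strongly on height
set.seed(2)
height <- rnorm(50, mean = 170, sd = 10)
weight <- 0.9 * height - 90 + rnorm(50, sd = 3)
other  <- rnorm(50)
demo   <- data.frame(height, weight, other)

r_before <- cor(demo$height, demo$weight)  # correlation of the original pair

# Add the standardized columns to form one combined variable
demo$size <- as.numeric(scale(demo$height) + scale(demo$weight))

# Use the combined variable in place of the two original ones
reduced <- demo[, c("size", "other")]
print(r_before)
print(cor(reduced))
```

Whether adding, multiplying, or simply dropping one of the variables is the better choice depends on what the variables mean in your analysis.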

In addition, it is important to confirm whether the samples are normally distributed, and there are several methods for this. A boxplot is one of the easiest ways to check how the samples are distributed.

> boxplot(d4)

These are the boxplots of the independent variables. As the chart shows, the first two non-dummy variables are close to the shape of a normal distribution. Based on this result, we can decide whether data transformation is needed. Square root, log, or logit transformations are commonly used.
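
To make the transformations concrete, here is a small sketch on simulated data. The variables `income` and `rate` are hypothetical and do not come from d4, and the Shapiro-Wilk test is used only as one possible numeric check alongside the boxplot.

```r
set.seed(3)
income <- rlnorm(200, meanlog = 10, sdlog = 0.8)  # hypothetical right-skewed variable
rate   <- rbeta(200, 2, 5)                        # hypothetical proportion between 0 and 1

log_income  <- log(income)   # log: for positive, right-skewed values
sqrt_income <- sqrt(income)  # square root: a milder correction than log
logit_rate  <- qlogis(rate)  # logit: log(p / (1 - p)), for proportions

# Shapiro-Wilk test: a small p-value suggests a departure from normality
p_raw <- shapiro.test(income)$p.value
p_log <- shapiro.test(log_income)$p.value
print(c(raw = p_raw, log = p_log))
```

Running boxplot() on the variable before and after the transformation shows the same change visually: the long upper tail of the raw values should shrink after taking the log.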

Data processing takes a lot of time, because you have to check for statistical issues, and the data may be changed or transformed along the way to suit the objective of the analysis. In that sense, this step is more important than the actual data analysis. As the saying “Garbage in, garbage out” goes, data should be processed with the actual environment and the analysis objectives in mind. If some data are distorted or do not reflect the actual situation, it is hard to claim that the results meaningfully explain the relationships between the variables we want to clarify.
