[Clustering Analysis in R] #1. Introduction & Data Gathering

Before Starting the Process

I started writing these posts because, as a beginner and learner of data science, I wanted to share what I know about clustering analysis and improve it through active discussion. The main objective of these posts is to understand the whole process, from data gathering to analysis.

I chose clustering analysis as the first topic since it is widely used in marketing, especially in STP (segmentation, targeting, and positioning), as well as in machine learning. I will work on other areas such as regression and classification after finishing the clustering analysis. There are several approaches (and programs) for finding clusters. I previously did it in Excel using the Solver add-in. However, Solver requires Excel, which is not free, and Excel is not well suited to handling large amounts of data. I think the best substitute on both counts is R. For these reasons, these posts will be based on R and its packages.

Briefly, the process of data analysis consists of data gathering, data processing, data diagnostics, and analysis (interpretation). I will cover most of these steps but skip some points, such as data transformation and the validation process. Those topics are important enough that I will handle them in detail later on. This series of posts will focus on the basic scheme of data processing and the application of clustering analysis using R packages.

Data Gathering

I need virtual customer survey data that includes demographic, usage, and preference information. To that end, I randomly drew sample data of 2012 according to the characteristics of each variable, with the proportions decided by the following assumptions.

no gender preference
customers range from their late teens to their sixties
low price elasticity, as in CPG (consumer packaged goods)
different preferences and interests in sports, art, music, and books among consumers

At this step, you can use an internal DB or data crawled from SNS as raw data. However, an analyst may need to create the data by random sampling instead, for reasons such as data size and the need to align the data with the actual environment.
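
As a rough illustration of the random sampling described above, here is a minimal sketch in R. It is not the exact procedure used for this post: the sample size, column names, probabilities, and 1-5 scoring scales are all assumptions chosen only to mirror the four assumptions listed earlier.

```r
# Hypothetical sketch: simulate a small virtual customer survey.
# Every name and number below is an illustrative assumption.
set.seed(42)  # make the random draws reproducible
n <- 500      # assumed sample size

interests <- c("sports", "art", "music", "books")

survey <- data.frame(
  id = seq_len(n),
  # no gender preference: both levels equally likely
  gender = sample(c("F", "M"), n, replace = TRUE, prob = c(0.5, 0.5)),
  # late teens to sixties: ages 18 to 69, uniform as a placeholder
  age = sample(18:69, n, replace = TRUE),
  # usage: purchases per month (the Poisson mean is an assumed value)
  monthly_purchases = rpois(n, lambda = 4),
  # low price elasticity: 1-5 sensitivity score, weighted toward low values
  price_sensitivity = sample(1:5, n, replace = TRUE,
                             prob = c(0.35, 0.30, 0.20, 0.10, 0.05)),
  stringsAsFactors = FALSE
)

# different preferences and interests: one 1-5 score per topic
for (topic in interests) {
  survey[[paste0("pref_", topic)]] <- sample(1:5, n, replace = TRUE)
}

str(survey)      # check column types
summary(survey)  # check ranges and rough proportions
```

Changing the `prob` vectors is the simplest way to encode a different assumption about the population, and `set.seed()` keeps the generated data identical between runs so later steps can be reproduced.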

[ Random samples ]

The next post will cover the process of cleaning and scrubbing the data to improve the quality of the analysis.
