Maximum-likelihood Estimation in R

Suppose you are a marketer for a sports team and your mission is to boost sales of annual memberships for home games. What do you do first? You probably want to know which factors to focus on to encourage customers to renew their annual membership and, for the sake of cost efficiency, which of them matter more than the others. After computing basic statistics such as the mean and standard deviation of each variable, most people would start with a regression analysis. However, if you want parameters that are tied directly to the objective of increasing the repurchase rate, you are better off with MLE (maximum-likelihood estimation).

Simply put, MLE is a method of estimating the parameters of a statistical model (Wikipedia). Whereas a simple linear regression fits a linear trend to the data, MLE looks for the parameter values that maximize the likelihood of the observed data. In this case, the likelihood is the probability, under the model, of the renewal decisions we actually observed, and it is built from how likely each customer is to buy an annual ticket. If you want to know more about MLE, visit this website.

Let’s prepare a data table before running MLE in R. Although you could randomly generate values for your dependent and independent variables in several ways, I will use data in CSV format in this post.

As you can see from the table above, the data consists of 5 variables and 20 observations. Among the variables, “repurchase” is the dependent variable and indicates whether a customer renewed the membership. The independent variables are two categorical variables, gender and age; age enters the model through the columns “young” and “middle”. Now we need to find out which kinds of customers are more likely to renew their annual tickets by estimating and comparing the parameters.
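If you do not have a CSV file at hand, a table with the same layout can be simulated. The sketch below is only a stand-in, not the data used in this post: the values are random, the `id` column, the 0/1 coding of gender and the "old" baseline age group are my assumptions, and the file is written to a temporary folder so that nothing in your working directory is overwritten.

```r
set.seed(42)  # arbitrary seed so the simulated table is reproducible
n <- 20       # same number of observations as the sample table

# Age group per customer; "old" is assumed as the baseline (young = middle = 0)
age_group <- sample(c("young", "middle", "old"), n, replace = TRUE)

sample_data <- data.frame(
  id         = seq_len(n),                          # hypothetical customer id
  repurchase = rbinom(n, size = 1, prob = 0.5),     # 1 = renewed, 0 = did not
  gender     = rbinom(n, size = 1, prob = 0.5),     # assumed 0/1 coding
  young      = as.integer(age_group == "young"),
  middle     = as.integer(age_group == "middle")
)

# Write to a temporary location; use "sample.csv" instead to feed part 1 directly
csv_path <- file.path(tempdir(), "sample.csv")
write.csv(sample_data, csv_path, row.names = FALSE)

data <- read.csv(csv_path)
str(data)
```

Because the values are random, the estimates you get from such a table say nothing about real customers; it is only useful for checking that the code runs.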

Now, let’s open RStudio and load the sample data (part 1). MLE in R takes two steps: declaring the function to optimize (part 2) and passing it to the optim function (part 3). To make the process clearer, it is easier to explain optim first. optim searches for the parameter values that minimize the value returned by the declared function. You can get some sense of how to use the optim function here. As for the declaration, you can use any kind of distribution or model, such as a Poisson distribution or a linear model. This post uses the log likelihood of a binary outcome: for each customer, log(p) if the membership was renewed and log(1 - p) if it was not, summed over all customers, where p is the modeled probability of renewal.
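Written out in symbols, the log likelihood computed in part 2 looks like the following. The notation is mine and introduced only for illustration: y_i is the value of repurchase for customer i, n is the number of customers, and β_1 to β_4 correspond to par[1] to par[4] in the code.

```latex
\begin{align}
p_i &= \beta_1 + \beta_2\,\mathrm{gender}_i + \beta_3\,\mathrm{young}_i + \beta_4\,\mathrm{middle}_i \\
\log L(\beta) &= \sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i) \log (1 - p_i) \right]
\end{align}
```

Each p_i has to stay strictly between 0 and 1 for both logarithms to be defined.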

The negative sign applied to logLL in part 2 is there because optim minimizes the function it is given. In other words, the parameters that minimize -logLL are exactly the ones that maximize logLL.
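To see the sign flip in isolation, here is a toy sketch that has nothing to do with the ticket data. The function `score` is hypothetical and has a known maximum at (1, 2); handing optim its negative recovers that point. I picked method = "BFGS" for this toy problem, whereas the code in this post uses optim's default method.

```r
# Hypothetical function with a known maximum at p = (1, 2)
score <- function(p) -((p[1] - 1)^2 + (p[2] - 2)^2)

# optim() minimizes, so pass the negated function to maximize score()
neg_score <- function(p) -score(p)

toy <- optim(par = c(0, 0), fn = neg_score, method = "BFGS")

toy$par     # close to c(1, 2)
-toy$value  # value of score() at the maximum, close to 0
```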

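## part 1. repurchase analysis based on sample data
library(stats4)  # optional here: optim() comes from the base stats package
data <- read.csv("sample.csv")  # read.csv() already returns a data frame

## part 2. declare the function
## par[1] is the intercept; par[2], par[3] and par[4] are the coefficients
## of gender, young and middle
fn <- function(data, par){
 # modeled probability of renewal for each customer
 p <- par[1] + par[2]*data$gender + par[3]*data$young + par[4]*data$middle
 # log likelihood of the observed 0/1 outcomes, summed over all customers
 logLL <- sum(data$repurchase*log(p) + (1-data$repurchase)*log(1-p))
 # optim() minimizes, so return the negative log likelihood
 return(-logLL)
}

## part 3. optim function
## every parameter starts at 0.1; data is passed through to fn
out <- optim(par = c(0.1, 0.1, 0.1, 0.1), fn, data=data)
out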

After running the model, you get the parameters (coefficients) at $par and can see which variable has more influence than the others. The first value of $par is the intercept, and the second, third and fourth values are the coefficients of gender, young and middle (see the sample output image). If “young” has the largest value, target young customers and run more promotions for this group. To apply a wider range of functions to MLE, refer to this document.

Sample output image
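As one example of a different function, nothing in the linear form of part 2 keeps the modeled probability between 0 and 1, so log() can return NaN during the search. A common alternative, which is not what this post uses, is to pass the linear combination through the logistic function with plogis(). The sketch below tries this on simulated data; the data frame `sim`, the values in `true_par` and the name `fn_logit` are all hypothetical, and method = "BFGS" is my choice. The result is compared with glm(), which fits the same model.

```r
set.seed(1)  # arbitrary seed for reproducibility
n <- 200

# Simulated customers (hypothetical, not the sample data from this post)
sim <- data.frame(
  gender = rbinom(n, 1, 0.5),
  young  = rbinom(n, 1, 0.3)
)
# A customer cannot be both young and middle
sim$middle <- ifelse(sim$young == 1, 0, rbinom(n, 1, 0.5))

# Made-up coefficients used only to generate the outcomes
true_par <- c(-0.5, 0.4, 0.8, 0.3)
eta <- true_par[1] + true_par[2]*sim$gender +
  true_par[3]*sim$young + true_par[4]*sim$middle
sim$repurchase <- rbinom(n, 1, plogis(eta))

# Negative log likelihood with a logistic link
fn_logit <- function(data, par) {
  eta <- par[1] + par[2]*data$gender + par[3]*data$young + par[4]*data$middle
  p <- plogis(eta)  # always strictly between 0 and 1 for moderate eta
  -sum(dbinom(data$repurchase, size = 1, prob = p, log = TRUE))
}

out_logit <- optim(par = c(0, 0, 0, 0), fn = fn_logit, data = sim,
                   method = "BFGS")

# The same model fitted by glm() for comparison
fit_glm <- glm(repurchase ~ gender + young + middle, data = sim,
               family = binomial)

round(cbind(optim = out_logit$par, glm = unname(coef(fit_glm))), 3)
```

With this form the coefficients are on the log-odds scale, so they are not directly comparable with the ones from part 2.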
