
Tobit regression - measuring the students' academic aptitude
Let us measure the academic aptitude of a student on a scale of 200-800. This measurement is based on a model using reading and math scores. The type of program in which the student is enrolled is also taken into consideration. There are three types of programs: academic, general, and vocational. The problem is that some students may answer all the questions on the academic aptitude test correctly and score 800, even though it is likely that these students are not truly equal in aptitude. Similarly, some students may answer all the questions incorrectly and score 200, even though their true aptitudes differ.
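Before working through the recipe, it helps to keep the underlying model in mind: Tobit regression assumes a latent (uncensored) aptitude score that is observed only within the limits of the scale. The following minimal sketch, using made-up numbers rather than the recipe's dataset, shows how censoring at 200 and 800 clips such a latent score:
> set.seed(1)
> latent <- rnorm(1000, mean = 600, sd = 150)   # hypothetical latent aptitude
> observed <- pmin(pmax(latent, 200), 800)      # observed score is clipped at the scale limits
> mean(observed == 800)                         # share of cases piling up at the ceiling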
Getting ready
In order to complete this recipe we shall use a student dataset. The first step is collecting the data.
Step 1 - collecting data
To develop the Tobit regression model we shall use the student dataset titled tobit, which is available at http://www.ats.ucla.edu/stat/data/tobit.csv in CSV format. There are 200 data rows and five variables in the dataset. The four numeric measurements are as follows:
id
read
math
apt
The non-numeric measurement is as follows:
prog
How to do it...
Let's get into the details.
Step 2 - exploring data
The first step is to load the following packages. The require() function is designed for use inside other functions; it returns FALSE and gives a warning (rather than an error, as library() does by default) if the package does not exist. Use the following commands:
> require(ggplot2)
> require(GGally)
> require(VGAM)
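Because require() returns TRUE or FALSE, it is also handy for installing a missing package on the fly; an optional one-liner along these lines:
> if (!require(VGAM)) install.packages("VGAM")   # install only if loading failed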
Note
Version info: The code in this recipe was tested in R version 3.2.3 (2015-12-10)
Explore the data and understand the relationships among the variables. Begin by importing the CSV data file named tobit.csv. This will save the data to the dat data frame as follows:
> dat <- read.table("d:/tobit.csv", header=TRUE, sep=",", row.names="id")
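The path d:/tobit.csv assumes the file has already been downloaded. Alternatively, assuming the URL is still reachable, read.csv() can load the file directly from the web:
> dat <- read.csv("http://www.ats.ucla.edu/stat/data/tobit.csv", row.names = "id")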
In this dataset, the lowest value of apt is 352. This indicates that no student received the lowest possible score of 200. So, even though censoring from below was possible, it is not required for this dataset. Use the following command:
> summary(dat)
The results are as follows:
       id              read           math             prog          apt     
 Min.   :  1.0   Min.   :28.0   Min.   :33.0   academic  : 45   Min.   :352  
 1st Qu.: 50.8   1st Qu.:44.0   1st Qu.:45.0   general   :105   1st Qu.:576  
 Median :100.5   Median :50.0   Median :52.0   vocational: 50   Median :633  
 Mean   :100.5   Mean   :52.2   Mean   :52.6                    Mean   :640  
 3rd Qu.:150.2   3rd Qu.:60.0   3rd Qu.:59.0                    3rd Qu.:705  
 Max.   :200.0   Max.   :76.0   Max.   :75.0                    Max.   :800  
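To quantify the censoring directly, we can count the observations sitting at each boundary of the scale (output omitted); consistent with the minimum of 352, the count at 200 is zero:
> sum(dat$apt == 800)   # right-censored cases at the ceiling
> sum(dat$apt == 200)   # zero, since the minimum of apt is 352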
Step 3 - plotting data
Write a function that gives the density of a normal distribution for a given mean and standard deviation, scaled on the count metric. To overlay it on a histogram, formulate the count as density * sample size * bin width, using the following code:
> f <- function(x, var, bw = 15) {
+   # normal density rescaled to counts: density * sample size * bin width
+   dnorm(x, mean = mean(var), sd = sd(var)) * length(var) * bw
+ }
Now we shall set up the base plot as follows:
> p <- ggplot(dat, aes(x = apt, fill=prog))
Now we shall prepare a histogram, colored by proportion in different programs with a normal distribution overlaid as follows:
> p + stat_bin(binwidth=15) + stat_function(fun = f, size = 1, args = list(var = dat$apt))
The histogram plotted is shown in the following figure:

Looking at the preceding histogram, we can see the censoring in the values of apt; that is, there are far more cases with scores between 750 and 800 than one would expect, compared to the rest of the distribution.
In the following alternative histogram, the excess of cases where apt = 800 has been highlighted. Setting binwidth = 1 produces a histogram in which each unique value of apt gets its own bar. Because apt is continuous, most of its values are unique in the dataset, although close to the center of the distribution a few values of apt occur two or three times.
The spike on the far right of the histogram is the bar for cases where apt = 800; the height of this bar, relative to all the others, clearly shows the excess number of cases with this value. Use the following command:
> p + stat_bin(binwidth = 1) + stat_function(fun = f, size = 1, args = list(var = dat$apt, bw = 1))

Step 4 - exploring relationships
The following command enables us to explore the bivariate relationships in the dataset:
> cor(dat[, c("read", "math", "apt")])
The results are as follows:
          read      math       apt
read 1.0000000 0.6622801 0.6451215
math 0.6622801 1.0000000 0.7332702
apt  0.6451215 0.7332702 1.0000000
Now plot the matrix as follows:
> ggpairs(dat[, c("read", "math", "apt")])

The scatterplot matrix shows the pairwise relationships: both read and math have a clear positive relationship with apt, consistent with the correlations computed above.
Step 5 - training the model
Run the Tobit model using the vglm() function of the VGAM package with this command:
> summary(m <- vglm(apt ~ read + math + prog, tobit(Upper = 800), data = dat))
The results are as follows:
Call: vglm(formula = apt ~ read + math + prog, family = tobit(Upper = 800), data = dat)
Pearson residuals:
             Min      1Q   Median     3Q   Max
mu       -2.5684 -0.7311 -0.03976 0.7531 2.802
loge(sd) -0.9689 -0.6359 -0.33365 0.2364 4.845
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept):1   209.55956   32.54590   6.439 1.20e-10 ***
(Intercept):2     4.18476    0.05235  79.944  < 2e-16 ***
read              2.69796    0.61928   4.357 1.32e-05 ***
math              5.91460    0.70539   8.385  < 2e-16 ***
proggeneral     -12.71458   12.40857  -1.025 0.305523    
progvocational  -46.14327   13.70667  -3.366 0.000761 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of linear predictors: 2

Names of linear predictors: mu, loge(sd)

Dispersion Parameter for tobit family: 1

Log-likelihood: -1041.063 on 394 degrees of freedom

Number of iterations: 5
The preceding output first restates the model options specified. The table labeled Coefficients gives the coefficients, their standard errors, the z-statistics, and p-values based on the standard normal distribution.
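To see the same coefficients arranged by linear predictor (mu and loge(sd)), coef() can be called with matrix = TRUE (output omitted here):
> coef(m, matrix = TRUE)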
The interpretation of the Tobit regression coefficients is similar to that of OLS regression coefficients; however, the linear effect of each coefficient is on the uncensored latent variable, not on the observed outcome:
- For a one-unit increase in read, there is a 2.6980-point increase in the predicted value of apt.
- A one-unit increase in math is associated with a 5.9146-unit increase in the predicted value of apt.
- The terms for prog have a slightly different interpretation. The predicted value of apt is 46.1433 points lower for students in a vocational program than for students in an academic program.
- The coefficient labeled (Intercept):1 is the intercept or constant for the model.
- The coefficient labeled (Intercept):2 is an ancillary statistic. The exponential of this value, 65.6773, is analogous to the square root of the residual variance in OLS regression. It can be compared to the standard deviation of academic aptitude, which was 99.21: a substantial reduction (verified in the quick check after this list).
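The following quick check recomputes those two quantities from the fitted model and the raw data (output omitted):
> exp(coef(m)[["(Intercept):2"]])   # residual standard deviation, about 65.68
> sd(dat$apt)                       # raw standard deviation of apt, about 99.21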
The final log likelihood, -1041.063, is shown toward the bottom of the output; it can be used in comparisons of nested models.
Step 6 - testing the model
Calculate the p-value for each of the coefficients using the z-values and the t distribution with the residual degrees of freedom, and then display them in a tabular manner. The coefficients for read, math, and prog = vocational are statistically significant, as follows:
> ctable <- coef(summary(m))
> pvals <- 2 * pt(abs(ctable[, "z value"]), df.residual(m), lower.tail = FALSE)
> cbind(ctable, pvals)
The results are as follows:
                  Estimate  Std. Error   z value     Pr(>|z|)         pvals
(Intercept):1   209.559557 32.54589921  6.438893 1.203481e-10  3.505839e-10
(Intercept):2     4.184759  0.05234618 79.943922 0.000000e+00 1.299833e-245
read              2.697959  0.61927743  4.356625 1.320835e-05  1.686815e-05
math              5.914596  0.70538721  8.384892 5.077232e-17  9.122434e-16
proggeneral     -12.714581 12.40856959 -1.024661 3.055230e-01  3.061517e-01
progvocational  -46.143271 13.70667208 -3.366482 7.613343e-04  8.361912e-04
We can test the overall significance of the program type by fitting a model without prog in it and using a likelihood ratio test, as follows:
> m2 <- vglm(apt ~ read + math, tobit(Upper = 800), data = dat)
> (p <- pchisq(2 * (logLik(m) - logLik(m2)), df = 2, lower.tail = FALSE))
The results are as follows:
[1] 0.003155176
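As an aside, VGAM also supplies a convenience function, lrtest(), which performs the same likelihood ratio comparison in a single call (output omitted):
> lrtest(m, m2)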
The statistical significance of the prog variable is indicated by the p-value of 0.0032. Next, we calculate the upper and lower 95% confidence intervals for the coefficients as follows:
> b <- coef(m)
> se <- sqrt(diag(vcov(m)))
> cbind(LL = b - qnorm(0.975) * se, UL = b + qnorm(0.975) * se)
The results are as follows:
                       LL         UL
(Intercept):1  145.770767 273.348348
(Intercept):2    4.082163   4.287356
read             1.484198   3.911721
math             4.532062   7.297129
proggeneral    -37.034931  11.605768
progvocational -73.007854 -19.278687
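Depending on the VGAM version installed, confint() may compute the same Wald intervals directly; a sketch, assuming a sufficiently recent VGAM:
> confint(m, level = 0.95)   # Wald confidence intervals for the vglm fit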
We may also wish to examine how well our model fits the data. One way to start is with plots of the residuals, assessing their absolute as well as relative (Pearson) values and checking assumptions such as normality and homogeneity of variance. Use the following commands:
> dat$yhat <- fitted(m)[,1]
> dat$rr <- resid(m, type = "response")
> dat$rp <- resid(m, type = "pearson")[,1]
> par(mfcol = c(2, 3))
> with(dat, {
+   plot(yhat, rr, main = "Fitted vs Residuals")
+   qqnorm(rr)
+   plot(yhat, rp, main = "Fitted vs Pearson Residuals")
+   qqnorm(rp)
+   plot(apt, rp, main = "Actual vs Pearson Residuals")
+   plot(apt, yhat, main = "Actual vs Fitted")
+ })
The plot is as shown in the following screenshot:

Establish the correlation between the fitted and observed values as follows:
> (r <- with(dat, cor(yhat, apt)))
The results are as follows:
[1] 0.7825
Variance accounted for is calculated as follows:
> r^2
The results are as follows:
[1] 0.6123
The correlation between the predicted and observed values of apt is 0.7825. Squaring this value gives the multiple squared correlation, which indicates that the predicted values share 61.23% of their variance with apt.