
Tobit regression - measuring the students' academic aptitude
Let us measure the academic aptitude of a student on a scale of 200-800. This measurement is based on a model using reading and math scores. The type of program in which the student is enrolled is also taken into consideration. There are three types of programs: academic, general, and vocational. The problem is that some students may answer all the questions on the academic aptitude test correctly and score 800, even though it is likely that these students are not truly equal in aptitude. Similarly, some students may answer all the questions incorrectly and score 200, even though their true aptitudes differ.
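Before working through the recipe, it helps to keep the underlying model in mind: Tobit regression assumes a latent (uncensored) aptitude score that is observed only within the limits of the scale. The following minimal sketch, using made-up numbers rather than the recipe's dataset, shows how censoring at 200 and 800 clips such a latent score:
> set.seed(1)
> latent <- rnorm(1000, mean = 600, sd = 150)   # hypothetical latent aptitude
> observed <- pmin(pmax(latent, 200), 800)      # observed score is clipped at the scale limits
> mean(observed == 800)                         # share of cases piling up at the ceiling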
Getting ready
In order to complete this recipe we shall use a student dataset. The first step is collecting the data.
Step 1 - collecting data
To develop the Tobit regression model we shall use the student dataset titled tobit, which is available at http://www.ats.ucla.edu/stat/data/tobit.csv in CSV format. There are 200 data rows and five variables in the dataset. The four numeric measurements are as follows:
id
read
math
apt
The non-numeric measurement is as follows:
prog
How to do it...
Let's get into the details.
Step 2 - exploring data
The first step is to load the following packages. The require() function is designed for use inside other functions; it returns FALSE and gives a warning (rather than an error, as library() does by default) if the package does not exist. Use the following commands:
> require(ggplot2)
> require(GGally)
> require(VGAM)
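Because require() returns TRUE or FALSE, it is also handy for installing a missing package on the fly; an optional one-liner along these lines:
> if (!require(VGAM)) install.packages("VGAM")   # install only if loading failed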
Note
Version info: The code in this recipe was tested in R version 3.2.3 (2015-12-10)
Explore the data and understand the relationships among the variables. Begin by importing the CSV data file named tobit.csv. This will save the data to the dat data frame as follows:
> dat <- read.table("d:/tobit.csv", header=TRUE, sep=",", row.names="id")
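The path d:/tobit.csv assumes the file has already been downloaded. Alternatively, assuming the URL is still reachable, read.csv() can load the file directly from the web:
> dat <- read.csv("http://www.ats.ucla.edu/stat/data/tobit.csv", row.names = "id")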
In this dataset, the lowest value of apt is 352. This indicates that no student received the lowest possible score of 200. So, even though censoring from below was possible, it is not required for this dataset. Use the following command:
> summary(dat)
The results are as follows:
       id              read           math             prog          apt     
 Min.   :  1.0   Min.   :28.0   Min.   :33.0   academic  : 45   Min.   :352  
 1st Qu.: 50.8   1st Qu.:44.0   1st Qu.:45.0   general   :105   1st Qu.:576  
 Median :100.5   Median :50.0   Median :52.0   vocational: 50   Median :633  
 Mean   :100.5   Mean   :52.2   Mean   :52.6                    Mean   :640  
 3rd Qu.:150.2   3rd Qu.:60.0   3rd Qu.:59.0                    3rd Qu.:705  
 Max.   :200.0   Max.   :76.0   Max.   :75.0                    Max.   :800  
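To quantify the censoring directly, we can count the observations sitting at each boundary of the scale (output omitted); consistent with the minimum of 352, the count at 200 is zero:
> sum(dat$apt == 800)   # right-censored cases at the ceiling
> sum(dat$apt == 200)   # zero, since the minimum of apt is 352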
Step 3 - plotting data
Write a function that gives the density of a normal distribution for a given mean and standard deviation, scaled on the count metric. To overlay it on a histogram, formulate the count as density * sample size * bin width, using the following code:
> f <- function(x, var, bw = 15) {
+   # normal density rescaled to counts: density * sample size * bin width
+   dnorm(x, mean = mean(var), sd = sd(var)) * length(var) * bw
+ }
Now we shall set up the base plot as follows:
> p <- ggplot(dat, aes(x = apt, fill=prog))
Now we shall prepare a histogram, colored by proportion in different programs with a normal distribution overlaid as follows:
> p + stat_bin(binwidth=15) + stat_function(fun = f, size = 1, args = list(var = dat$apt))
The histogram plotted is shown in the following figure:

Looking at the preceding histogram, we can see the censoring in the values of apt; that is, there are far more cases with scores between 750 and 800 than one would expect, compared to the rest of the distribution.
In the following alternative histogram, the excess of cases where apt = 800 has been highlighted. Setting binwidth = 1 produces a histogram in which each unique value of apt gets its own bar. Because apt is continuous, most of its values are unique in the dataset, although close to the center of the distribution a few values of apt occur two or three times.
The spike on the far right of the histogram is the bar for cases where apt = 800; the height of this bar, relative to all the others, clearly shows the excess number of cases with this value. Use the following command:
> p + stat_bin(binwidth = 1) + stat_function(fun = f, size = 1, args = list(var = dat$apt, bw = 1))

Step 4 - exploring relationships
The following command enables us to explore the bivariate relationships in the dataset:
> cor(dat[, c("read", "math", "apt")])
The results are as follows:
          read      math       apt
read 1.0000000 0.6622801 0.6451215
math 0.6622801 1.0000000 0.7332702
apt  0.6451215 0.7332702 1.0000000
Now plot the matrix as follows:
> ggpairs(dat[, c("read", "math", "apt")])

The scatterplot matrix shows the pairwise relationships: both read and math have a clear positive relationship with apt, consistent with the correlations computed above.
Step 5 - training the model
Run the Tobit model using the vglm() function of the VGAM package with this command:
> summary(m <- vglm(apt ~ read + math + prog, tobit(Upper = 800), data = dat))
The results are as follows:
Call: vglm(formula = apt ~ read + math + prog, family = tobit(Upper = 800), data = dat)
Pearson residuals:
             Min      1Q   Median     3Q   Max
mu       -2.5684 -0.7311 -0.03976 0.7531 2.802
loge(sd) -0.9689 -0.6359 -0.33365 0.2364 4.845
Coefficients:
                 Estimate Std. Error z value Pr(>|z|)    
(Intercept):1   209.55956   32.54590   6.439 1.20e-10 ***
(Intercept):2     4.18476    0.05235  79.944  < 2e-16 ***
read              2.69796    0.61928   4.357 1.32e-05 ***
math              5.91460    0.70539   8.385  < 2e-16 ***
proggeneral     -12.71458   12.40857  -1.025 0.305523    
progvocational  -46.14327   13.70667  -3.366 0.000761 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Number of linear predictors: 2

Names of linear predictors: mu, loge(sd)

Dispersion Parameter for tobit family: 1

Log-likelihood: -1041.063 on 394 degrees of freedom

Number of iterations: 5
The preceding output first restates the model options specified. The table labeled Coefficients gives the coefficients, their standard errors, the z-statistics, and p-values based on the standard normal distribution.
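To see the same coefficients arranged by linear predictor (mu and loge(sd)), coef() can be called with matrix = TRUE (output omitted here):
> coef(m, matrix = TRUE)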
The interpretation of the Tobit regression coefficients is similar to that of OLS regression coefficients; however, the linear effect of each coefficient is on the uncensored latent variable, not on the observed outcome:
- For a one-unit increase in read, there is a 2.6980-point increase in the predicted value of apt.
- A one-unit increase in math is associated with a 5.9146-unit increase in the predicted value of apt.
- The terms for prog have a slightly different interpretation. The predicted value of apt is 46.1433 points lower for students in a vocational program than for students in an academic program.
- The coefficient labeled (Intercept):1 is the intercept or constant for the model.
- The coefficient labeled (Intercept):2 is an ancillary statistic. The exponential of this value, 65.6773, is analogous to the square root of the residual variance in OLS regression. It can be compared to the standard deviation of academic aptitude, which was 99.21: a substantial reduction (verified in the quick check after this list).
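The following quick check recomputes those two quantities from the fitted model and the raw data (output omitted):
> exp(coef(m)[["(Intercept):2"]])   # residual standard deviation, about 65.68
> sd(dat$apt)                       # raw standard deviation of apt, about 99.21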
The final log likelihood, -1041.063, is shown toward the bottom of the output; it can be used in comparisons of nested models.
Step 6 - testing the model
Calculate the p-value for each of the coefficients using the z-values and the t distribution with the residual degrees of freedom, and then display them in a tabular manner. The coefficients for read, math, and prog = vocational are statistically significant, as follows:
> ctable <- coef(summary(m))
> pvals <- 2 * pt(abs(ctable[, "z value"]), df.residual(m), lower.tail = FALSE)
> cbind(ctable, pvals)
The results are as follows:
                  Estimate  Std. Error   z value     Pr(>|z|)         pvals
(Intercept):1   209.559557 32.54589921  6.438893 1.203481e-10  3.505839e-10
(Intercept):2     4.184759  0.05234618 79.943922 0.000000e+00 1.299833e-245
read              2.697959  0.61927743  4.356625 1.320835e-05  1.686815e-05
math              5.914596  0.70538721  8.384892 5.077232e-17  9.122434e-16
proggeneral     -12.714581 12.40856959 -1.024661 3.055230e-01  3.061517e-01
progvocational  -46.143271 13.70667208 -3.366482 7.613343e-04  8.361912e-04
We can test the overall significance of the program type by fitting a model without prog in it and using a likelihood ratio test, as follows:
> m2 <- vglm(apt ~ read + math, tobit(Upper = 800), data = dat)
> (p <- pchisq(2 * (logLik(m) - logLik(m2)), df = 2, lower.tail = FALSE))
The results are as follows:
[1] 0.003155176
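As an aside, VGAM also supplies a convenience function, lrtest(), which performs the same likelihood ratio comparison in a single call (output omitted):
> lrtest(m, m2)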
The statistical significance of the prog variable is indicated by the p-value of 0.0032. Next, we calculate the upper and lower 95% confidence intervals for the coefficients as follows:
> b <- coef(m)
> se <- sqrt(diag(vcov(m)))
> cbind(LL = b - qnorm(0.975) * se, UL = b + qnorm(0.975) * se)
The results are as follows:
                       LL         UL
(Intercept):1  145.770767 273.348348
(Intercept):2    4.082163   4.287356
read             1.484198   3.911721
math             4.532062   7.297129
proggeneral    -37.034931  11.605768
progvocational -73.007854 -19.278687
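Depending on the VGAM version installed, confint() may compute the same Wald intervals directly; a sketch, assuming a sufficiently recent VGAM:
> confint(m, level = 0.95)   # Wald confidence intervals for the vglm fit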
We may also wish to examine how well our model fits the data. One way to start is with plots of the residuals, assessing their absolute as well as relative (Pearson) values and checking assumptions such as normality and homogeneity of variance. Use the following commands:
> dat$yhat <- fitted(m)[,1]
> dat$rr <- resid(m, type = "response")
> dat$rp <- resid(m, type = "pearson")[,1]
> par(mfcol = c(2, 3))
> with(dat, {
+   plot(yhat, rr, main = "Fitted vs Residuals")
+   qqnorm(rr)
+   plot(yhat, rp, main = "Fitted vs Pearson Residuals")
+   qqnorm(rp)
+   plot(apt, rp, main = "Actual vs Pearson Residuals")
+   plot(apt, yhat, main = "Actual vs Fitted")
+ })
The plot is as shown in the following screenshot:

Establish the correlation between the fitted and observed values as follows:
> (r <- with(dat, cor(yhat, apt)))
The results are as follows:
[1] 0.7825
Variance accounted for is calculated as follows:
> r^2
The results are as follows:
[1] 0.6123
The correlation between the predicted and observed values of apt is 0.7825. Squaring this value gives the multiple squared correlation, which indicates that the predicted values share 61.23% of their variance with apt.