Practical Machine Learning Cookbook

Introduction

Discriminant analysis is used to distinguish distinct sets of observations and to allocate new observations to previously defined groups. For example, if a study were carried out to investigate the variables that discriminate between fruits eaten by (1) primates, (2) birds, or (3) squirrels, the researcher could collect data on numerous characteristics of the fruits eaten by each animal group. Most fruits will naturally fall into one of the three categories. Discriminant analysis could then be used to determine which variables are the best predictors of whether a fruit will be eaten by birds, primates, or squirrels. Discriminant analysis is commonly used in biological species classification, in the medical classification of tumors, in facial recognition technologies, and in the credit card and insurance industries for determining risk. The main goals of discriminant analysis are discrimination and classification. Its assumptions are multivariate normality, equality of the within-group variance-covariance matrices, and low multicollinearity among the variables.
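The fruit example above can be sketched in a few lines with scikit-learn's linear discriminant analysis. The fruit measurements and group labels below are simulated for illustration, not data from the text:

```python
# A minimal sketch of discriminant analysis: fit on fruits with known
# animal groups, then allocate a new fruit to one of the defined groups.
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(0)
# Hypothetical fruit characteristics: diameter (cm), sugar (%), hardness
X = np.vstack([
    rng.normal([8.0, 15.0, 2.0], 1.0, size=(30, 3)),   # eaten by primates
    rng.normal([1.5, 20.0, 1.0], 0.5, size=(30, 3)),   # eaten by birds
    rng.normal([3.0, 10.0, 5.0], 0.8, size=(30, 3)),   # eaten by squirrels
])
y = np.repeat(["primate", "bird", "squirrel"], 30)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Allocate a new observation to one of the previously defined groups
new_fruit = [[7.5, 14.0, 2.5]]
print(lda.predict(new_fruit))
```

The fitted discriminant functions also reveal which variables best separate the groups, via `lda.coef_`.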

Multinomial logistic regression is used to predict categorical placement in, or the probability of category membership on, a dependent variable, based on multiple independent variables. It is used when the dependent variable has more than two nominal or unordered categories, in which case dummy coding of the independent variables is quite common. The independent variables can be either dichotomous (binary) or continuous (interval or ratio in scale). Multinomial logistic regression uses maximum likelihood estimation, rather than the least squares estimation used in traditional multiple regression, to evaluate the probability of category membership. The general form of the distribution is assumed; starting values of the estimated parameters are chosen, and the likelihood that the sample came from a population with those parameters is computed. The values of the estimated parameters are then adjusted iteratively until the maximum likelihood value is obtained.
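A minimal sketch of this idea, again with scikit-learn, whose `LogisticRegression` fits the multinomial model by (penalized) maximum likelihood when there are more than two classes; the three-category data here are synthetic:

```python
# Multinomial logistic regression: continuous predictors, three unordered
# outcome categories, predicted probabilities of category membership.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = np.vstack([
    rng.normal([0, 0], 1.0, size=(50, 2)),
    rng.normal([4, 0], 1.0, size=(50, 2)),
    rng.normal([0, 4], 1.0, size=(50, 2)),
])
y = np.repeat([0, 1, 2], 50)

model = LogisticRegression()  # multinomial for more than two classes
model.fit(X, y)               # iterative maximum likelihood under the hood

# Probability of membership in each category for a new observation
print(model.predict_proba([[3.5, 0.2]]).round(3))
```

The solver starts from initial parameter values and iterates until the likelihood converges, mirroring the estimation procedure described above.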

Tobit regression is used to describe the relationship between a non-negative dependent variable and independent variables. It is also known as a censored regression model, designed to estimate linear relationships between variables when there is either left or right censoring in the dependent variable. Censoring takes place when all cases with a value at or above some threshold take on the value of that threshold, so that the true value might equal the threshold but might also be higher. The Tobit model has been used in a large number of applications where the dependent variable is observed to be zero for some individuals in the sample (automobile expenditures, medical expenditures, hours worked, wages, and so on). The model is for metric dependent variables that are limited in the sense that we observe them only above or below some cutoff level. For example:

  • Wages, which may be limited from below by the minimum wage
  • The amount donated to charity, which cannot fall below zero
  • Top-coded income, which is censored from above
  • Time individuals spend on leisure activities, which cannot be negative
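A Tobit model is usually estimated by maximum likelihood: censored observations contribute the probability of falling at or below the threshold, uncensored ones the normal density. The following is a hedged sketch of a left-censored (at zero) Tobit fit using scipy's general-purpose optimizer on simulated data; it is an illustration of the likelihood, not a production estimator:

```python
# Tobit regression by maximum likelihood: latent y* = b0 + b1*x + e,
# but we only observe y = max(y*, 0) (left censoring at zero).
import numpy as np
from scipy import stats, optimize

rng = np.random.default_rng(2)
n = 500
x = rng.normal(size=n)
y_star = 1.0 + 2.0 * x + rng.normal(scale=1.0, size=n)  # latent variable
y = np.maximum(y_star, 0.0)                             # observed, censored

def neg_log_likelihood(params):
    b0, b1, log_sigma = params
    sigma = np.exp(log_sigma)  # parameterize on the log scale: sigma > 0
    mu = b0 + b1 * x
    censored = y <= 0
    ll = np.where(
        censored,
        stats.norm.logcdf(-mu / sigma),             # P(y* <= 0)
        stats.norm.logpdf(y, loc=mu, scale=sigma),  # density of observed y
    )
    return -ll.sum()

result = optimize.minimize(neg_log_likelihood, x0=[0.0, 0.0, 0.0],
                           method="BFGS")
b0_hat, b1_hat = result.x[0], result.x[1]
sigma_hat = np.exp(result.x[2])
print(round(b0_hat, 2), round(b1_hat, 2), round(sigma_hat, 2))
```

An ordinary least squares fit on the same data would be biased toward zero, because it treats the censored zeros as true values of the dependent variable.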

Poisson regression deals with situations in which the dependent variable is a count. Poisson regression is similar to regular multiple regression except that the dependent (Y) variable is an observed count that follows the Poisson distribution. Thus, the possible values of Y are the nonnegative integers: 0, 1, 2, 3, and so on. It is assumed that large counts are rare. Hence, Poisson regression is similar to logistic regression, which also has a discrete response variable. However, the response is not limited to specific values as it is in logistic regression.
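A minimal sketch of Poisson regression with scikit-learn's `PoissonRegressor`, which models the log of the expected count as a linear function of the predictors; the counts below are simulated so the example is self-contained:

```python
# Poisson regression: the dependent variable is a count (0, 1, 2, ...),
# with log(E[y]) = intercept + coef * x.
import numpy as np
from sklearn.linear_model import PoissonRegressor

rng = np.random.default_rng(3)
X = rng.uniform(0, 1, size=(200, 1))
# Counts drawn from a Poisson with log-linear mean: log(mu) = 0.5 + 1.5*x
y = rng.poisson(np.exp(0.5 + 1.5 * X[:, 0]))

model = PoissonRegressor(alpha=0.0)  # alpha=0: plain maximum likelihood
model.fit(X, y)

# Estimated coefficients on the log scale
print(round(model.intercept_, 2), round(model.coef_[0], 2))
```

Because the link is logarithmic, `exp(coef)` is interpreted as a multiplicative effect on the expected count, and predictions are never negative, unlike those of ordinary multiple regression.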