Lab 5: Binary Logistic Regression

An overview of binary logistic regression, with two examples to demonstrate the models and their interpretation.

library(pander) #pander()
library(psych) # describe()
library(gtsummary) #tbl_summary()
library(equatiomatic) # extract_eq()
library(sjPlot) # tab_xtab(), tab_model()
library(tidyverse)

This lab follows along with the logistic regression lecture presented in class and provides two examples to demonstrate the models and their interpretation.

What is Binary Logistic Regression?

  • It is a regression with an outcome variable (or dependent variable) that is dichotomous/binary (i.e., only two categories, such as Yes or No, 0 or 1, Disorder or No Disorder, Win or Lose).

    • The predictor variables (or independent variables or explanatory variables) can be either continuous or categorical.
  • In binary logistic regression, we are interested in predicting which of two possible events (e.g., win or lose) is going to happen, given the predictor variable(s).

  • Assumes a binomial distribution, or “a probability distribution that summarizes the likelihood that a value will take one of two independent values under a given set of parameters or assumptions” (Source). A small numerical illustration follows.
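For a concrete, purely illustrative sense of what a binomial distribution describes, here is a minimal sketch in base R. The 0.3 admission probability and the group size of 10 are made-up numbers for illustration only, not values from the data used below:

# probability of seeing exactly 4 admissions among 10 applicants,
# if each applicant independently has a 0.3 chance of admission
dbinom(4, size = 10, prob = 0.3)   # about 0.20

# simulate 5 such groups of 10 applicants and count the admissions in each
rbinom(5, size = 10, prob = 0.3)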


Binary Logistic Regression in R

Admissions Example

This example comes from the UCLA Statistical Consulting page found here. This example assesses how GRE scores, GPA, and the prestige of the undergraduate institution affect graduate school admission.

Variable Description
gre Graduate Record Exam (GRE) scores, continuous
gpa Grade point average, continuous
rank Rank of undergraduate institutions, 1 = highest prestige, 4 = lowest, categorical
admit Admission decision, 0 = did not admit, 1 = admitted, binary



Read in data

log <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv") %>% 
  mutate(rank = factor(rank),  #we need to convert this variable to a factor
         admit = factor(admit, levels = c("0", "1"), labels = c("Not Admitted", "Admitted"))) # add labels to outcome

Summary

# psych::describe()
round(describe(log),2) %>% 
  pander() 
        vars   n   mean    sd median trimmed   mad  min max range  skew kurtosis   se
admit*     1 400   1.32  0.47      1    1.27     0    1   2     1  0.78    -1.39 0.02
gre        2 400  587.7 115.5    580   589.1 118.6  220 800   580 -0.14    -0.36 5.78
gpa        3 400   3.39  0.38    3.4     3.4   0.4 2.26   4  1.74 -0.21     -0.6 0.02
rank*      4 400   2.48  0.94      2    2.48  1.48    1   4     3   0.1    -0.91 0.05
# I like this table more, but you can do either! Especially if you're getting some errors with gtsummary(). 

# gtsummary::tbl_summary()
tbl_summary(log,
            statistic = list(all_continuous() ~ "{mean} ({sd})"),
            missing = "no")

Characteristic    N = 400¹
admit
  Not Admitted    273 (68%)
  Admitted        127 (32%)
gre               588 (116)
gpa               3.39 (0.38)
rank
  1               61 (15%)
  2               151 (38%)
  3               121 (30%)
  4               67 (17%)
¹ n (%); Mean (SD)

Contingency tables, sometimes called cross tabulations or cross tabs, display the frequency distribution of your categorical variables. In R, we use sjPlot::tab_xtab() (found by Karen!). We want to check the contingency table to make sure there are values in each cell. Since we only have one categorical predictor, rank, we only need to check one contingency table.

sjPlot::tab_xtab(var.row = log$rank, var.col = log$admit, title = "Cross tab of Admit rate by Rank of Undergraduate Institute", show.row.prc = TRUE, show.summary=F)

Cross tab of Admit rate by Rank of Undergraduate Institute

rank     Not Admitted     Admitted        Total
1        28 (45.9 %)      33 (54.1 %)     61 (100 %)
2        97 (64.2 %)      54 (35.8 %)     151 (100 %)
3        93 (76.9 %)      28 (23.1 %)     121 (100 %)
4        55 (82.1 %)      12 (17.9 %)     67 (100 %)
Total    273 (68.2 %)     127 (31.8 %)    400 (100 %)

This table shows the row and column totals, as well as row percentages. You can also ask for column percentages, if you prefer.


Binary Logistic Model

Recall that in linear regression we used the lm() function. With logistic regression, we use glm(), the generalized linear model function. The setup is otherwise similar to linear regression. One important note is to make sure we specify the “family,” or the description of the error distribution used in the model. For logistic regression, it’s family = "binomial".

In our example, the predictors are GRE scores, GPA, and university rank, and the outcome is graduate school admission status.

mylogit <- glm(admit ~ gre + gpa + rank, data = log, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial", 
##     data = log)
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.6268  -0.8662  -0.6388   1.1490   2.0790  
## 
## Coefficients:
##              Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.989979   1.139951  -3.500 0.000465 ***
## gre          0.002264   0.001094   2.070 0.038465 *  
## gpa          0.804038   0.331819   2.423 0.015388 *  
## rank2       -0.675443   0.316490  -2.134 0.032829 *  
## rank3       -1.340204   0.345306  -3.881 0.000104 ***
## rank4       -1.551464   0.417832  -3.713 0.000205 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 499.98  on 399  degrees of freedom
## Residual deviance: 458.52  on 394  degrees of freedom
## AIC: 470.52
## 
## Number of Fisher Scoring iterations: 4

Let’s dissect this output. It looks similar to output from simple linear regression, but the interpretation is different (as we saw in lecture). Looking at the coefficients, we see that gre, gpa, and the levels of rank are significant predictors of admission status.

With binary logistic regression, a regression coefficient is the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the others constant. The intercept is the log odds for a student with a value of zero on all the predictors, a combination not observed in this sample (e.g., no one has a GPA or GRE score of zero).

So this is how we would interpret those same variables when running a logistic regression model (Note: the test statistic is now a z statistic, the Wald statistic formed by dividing each estimate by its standard error, rather than the t statistic from linear regression):

  • GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, the results indicate that for every one-unit increase in GRE score, there is an increase of B = 0.002 in the log odds of being admitted to graduate school, controlling for GPA and university rank.

Additionally, this is how we would interpret the categorical variable rank. Note that because this is a categorical variable, we are comparing the categories to a comparison group (or a reference group). In this example, the comparison group is rank 1:

  • Students who attended rank 2 universities are less likely to be admitted to graduate school, compared to students from rank 1 universities (z = -2.134, p = 0.033).

The log odds, or logits, are not always easily interpreted and can be converted to odds ratios for easier interpretation.
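Before converting to odds ratios, a quick optional check on the output above: the z values reported by summary() are simply each estimate divided by its standard error. A minimal sketch, using the coefficient matrix that summary() returns:

# Wald z statistics: estimate / standard error
coefs <- coef(summary(mylogit))
coefs[, "Estimate"] / coefs[, "Std. Error"]
# these should match the "z value" column above, e.g. 2.070 for gre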


Odds Ratios

The coefficients presented above are log odds. To convert them to odds ratios, we exponentiate the coefficients (Slide 21 in lecture):

$$ \text{odds ratio} = e^{\beta} $$

# converting the log odds (logits) to odds ratios
exp(coef(mylogit))
## (Intercept)         gre         gpa       rank2       rank3       rank4 
##   0.0185001   1.0022670   2.2345448   0.5089310   0.2617923   0.2119375
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
##                    OR       2.5 %    97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre         1.0022670 1.000137602 1.0044457
## gpa         2.2345448 1.173858216 4.3238349
## rank2       0.5089310 0.272289674 0.9448343
## rank3       0.2617923 0.131641717 0.5115181
## rank4       0.2119375 0.090715546 0.4706961

These values likely make a little more sense. Considering GRE in the example again:

  • GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, for every one-unit increase in GRE score, the odds of being admitted to graduate school increase by a factor of 1.002, controlling for GPA and university rank. (Note this is a very small amount because a one-point increase in GRE doesn’t represent a large difference.)

Said differently,

  • GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, for every one-unit increase in GRE score, we expect to see about a 0.2% increase in the odds of being admitted to graduate school, controlling for GPA and university rank (a quick way to compute this percent change for every predictor at once is sketched below).
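If you want this percent-change-in-odds framing for all of the predictors, a small sketch: it simply subtracts 1 from each odds ratio and rescales to a percentage.

# percent change in the odds of admission per one-unit increase in each predictor
# (odds ratio minus 1, expressed as a percentage)
round(100 * (exp(coef(mylogit)) - 1), 2)
# for gre this works out to roughly a 0.23% increase per additional GRE point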

Predicted Probability

To make the coefficients easier to interpret, we can convert odds to probabilities. To do this, we take the odds and divide them by the odds plus 1 (See slide 25):

$$ \text{probability} = \frac{\text{odds}}{\text{odds} + 1} $$

Before we dive right into conversions, we could also calculate the predicted probabilities of admission for specific values of the predictors. Recall from lecture (Slide 23) that to find the log odds for a particular x-value, we plug it into our regression equation:

extract_eq(mylogit, use_coefs = TRUE)

$$ \log\left[ \frac { \widehat{P( \operatorname{admit} = \operatorname{Admitted} )} }{ 1 - \widehat{P( \operatorname{admit} = \operatorname{Admitted} )} } \right] = -3.99 + 0(\operatorname{gre}) + 0.8(\operatorname{gpa}) - 0.68(\operatorname{rank}_{\operatorname{2}}) - 1.34(\operatorname{rank}_{\operatorname{3}}) - 1.55(\operatorname{rank}_{\operatorname{4}}) $$
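As an illustration of plugging values into this equation, here is a short sketch that computes the predicted probability by hand for a student with the average GRE and GPA from a rank 1 institution (the reference category). It should agree with the predict() output further below:

b <- coef(mylogit)

# log odds at the mean GRE and GPA for rank 1 (the reference group, so the rank terms drop out)
log_odds <- b["(Intercept)"] + b["gre"] * mean(log$gre) + b["gpa"] * mean(log$gpa)

# convert log odds -> odds -> probability (plogis(log_odds) does the same in one step)
exp(log_odds) / (1 + exp(log_odds))   # about .52, matching the first row of the predict() output below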

To create predicted probabilities, we need to create a new data frame with the values we want the predictors to take on for our predictions (in lecture, slide 23, the value was SurvRate = 40). For this example, let’s use the means. And since we have a categorical predictor, rank, let’s look at the mean of gre and gpa at each value of university rank. We could, of course, do all the calculations by hand, but R is easier :)

Let’s first create the data frame using with:

pred_prob <- with(log, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
pred_prob
##     gre    gpa rank
## 1 587.7 3.3899    1
## 2 587.7 3.3899    2
## 3 587.7 3.3899    3
## 4 587.7 3.3899    4

So we have the mean of gpa and gre at each level of rank, which means we are going to get four predicted probabilities, each evaluated at the means of the other two variables. If we did not have a categorical predictor, we would only get one.

Now let’s create the predicted probabilities using the mean of gre and gpa and each level of rank with the predict() function. What R is doing here is taking the logits from mylogit and converting them to predicted probabilities based on the values we provided in pred_prob. We assign the result to pred_prob$rankP to create a new column within the pred_prob dataset called rankP.

pred_prob$rankP <- predict(mylogit, newdata = pred_prob, type = "response")
pred_prob
##     gre    gpa rank     rankP
## 1 587.7 3.3899    1 0.5166016
## 2 587.7 3.3899    2 0.3522846
## 3 587.7 3.3899    3 0.2186120
## 4 587.7 3.3899    4 0.1846684

Great! Now we have predicted probabilities. Here is our interpretation:

  • For students with an average GPA and GRE score, the predicted probability of being accepted into a graduate program is .516 for those from the highest-ranked undergraduate institutions (rank = 1) and .185 for students from the lowest-ranked undergraduate institutions (rank = 4). This demonstrates the relation of university rank to the probability of being admitted to graduate school.

Classification Table

A classification table is similar to a contingency table, but in this table we compare the actual values to the predicted values based on the binary logistic regression model. This is a way to see how well the model did at predicting the outcome of being admitted to graduate school.

In this code, we take the fitted values from the model (mylogit$fitted.values) and split them at a cut-off of p = .5: observations with a fitted value of .50 or above are classified as "Admitted." We then compare the classifications given by the model (fitted values) to whether or not the students actually were admitted. The purpose of this is to see how effective our model is at predicting admission status.

# Converting from probability to actual output
log$fitted_admit <- ifelse(mylogit$fitted.values >= 0.5, "1", "0") #1 = Admitted, 0 = Not Admitted

# Generating the classification table
ctab <- table(log$admit, log$fitted_admit)
ctab
##               
##                  0   1
##   Not Admitted 254  19
##   Admitted      97  30
#I (Karen) like this table better because it includes row and column totals
sjPlot::tab_xtab(var.row = log$admit, var.col = log$fitted_admit, title = "Observed Admission versus Predicted Admission counts", show.row.prc = TRUE, show.summary = F )

Observed Admission versus Predicted Admission counts

admit            fitted_admit = 0    fitted_admit = 1    Total
Not Admitted     254 (93 %)          19 (7 %)            273 (100 %)
Admitted         97 (76.4 %)         30 (23.6 %)         127 (100 %)
Total            351 (87.8 %)        49 (12.2 %)         400 (100 %)

Here, we can see that the model correctly predicted 30 of the 127 students (23.6%) who actually were admitted; the other 97 admitted students (76.4%) were misclassified as "not admitted." Additionally, the model correctly classified 254 of the 273 not-admitted students (93%). Thus this model did a better job of predicting who was not admitted than who was admitted. (The sketch below pulls these percentages directly from the classification table.)
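A minimal sketch using prop.table(), which turns the counts in ctab into row-wise proportions:

# of the students actually (not) admitted, what fraction fell in each predicted category?
round(prop.table(ctab, margin = 1), 3)
# the diagonal entries (.930 for Not Admitted, .236 for Admitted) are the
# correct-classification rates discussed above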


Accuracy

We can check the accuracy percentage:

accuracy <- sum(diag(ctab))/sum(ctab)*100
accuracy
## [1] 71

Overall, the accuracy of our model is 71%; that is, our logistic model classifies 71% of the observations correctly.
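Keep in mind that this accuracy depends on the .5 cut-off we chose above. A purely illustrative sketch of how the accuracy shifts if the threshold is moved (the alternative cut-offs here are arbitrary, not recommendations):

# accuracy (in percent) at a few different probability cut-offs
sapply(c(0.3, 0.4, 0.5, 0.6), function(cutoff) {
  predicted <- ifelse(mylogit$fitted.values >= cutoff, "Admitted", "Not Admitted")
  mean(predicted == log$admit) * 100
})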


Hierarchical model building

Just as we did in linear regression, we can proceed through a model building process with logistic regression. In this context, we are interested in seeing how we can improve the prediction of the outcome in a hierarchical fashion.

Let’s say we fit the following models:

  1. Predicting admission from just GPA (log1)

  2. Predicting admission from GPA and GRE (log2)

  3. Predicting admission using GPA, GRE, and rank (log3)

We can create a table of all three models using tab_model() from the sjPlot package and compare the R2 estimates across the models. Using the anova() function, we can also test whether there is a statistical improvement in fit from adding the predictor(s). In logistic regression models, we test whether the residual deviance significantly decreases. We are using the anova() function, but it’s not an ANOVA test like you may have done before. In this case, we are using it to compare the deviance of each model to that of the subsequent model. If the chi-square value is significant, we can interpret that the addition of the extra variables (going from log1 to log2, for example) significantly improves the model prediction (i.e., lowers the residual deviance).

log1 <- glm(admit ~ gpa , data = log, family = "binomial")
log2 <- glm(admit ~ gpa + gre , data = log, family = "binomial")
log3 <- glm(admit ~ gpa + gre + rank, data = log, family = "binomial")

tab_model(log1, log2, log3)
              Model 1 (admit)              Model 2 (admit)              Model 3 (admit)
Predictors    Odds Ratios  CI          p   Odds Ratios  CI          p   Odds Ratios  CI          p
(Intercept)   0.01  0.00 – 0.09  <0.001    0.01  0.00 – 0.06  <0.001    0.02  0.00 – 0.17  <0.001
gpa           2.86  1.61 – 5.20  <0.001    2.13  1.14 – 4.02   0.018    2.23  1.17 – 4.32   0.015
gre                                        1.00  1.00 – 1.00   0.011    1.00  1.00 – 1.00   0.038
rank [2]                                                                0.51  0.27 – 0.94   0.033
rank [3]                                                                0.26  0.13 – 0.51  <0.001
rank [4]                                                                0.21  0.09 – 0.47  <0.001
Observations  400                          400                          400
R2 Tjur       0.032                        0.047                        0.102
anova(log1, log2, log3, test="Chisq")
## Analysis of Deviance Table
## 
## Model 1: admit ~ gpa
## Model 2: admit ~ gpa + gre
## Model 3: admit ~ gpa + gre + rank
##   Resid. Df Resid. Dev Df Deviance  Pr(>Chi)    
## 1       398     486.97                          
## 2       397     480.34  1   6.6236   0.01006 *  
## 3       394     458.52  3  21.8265 7.088e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Based on the ANOVA results, we see that each additional predictor significantly improves the prediction of graduate school admission. Note, you don’t have to do this, but if you have research questions about the specific predictive power of a variable, controlling for the others, you can use this approach.
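For completeness, another quick way to line the three models up is by AIC (already reported in the glm() summaries); lower AIC indicates better fit after penalizing for the extra parameters. A small sketch:

# AIC for each model in the hierarchy; the full model (log3) has the lowest AIC,
# consistent with the deviance tests above
AIC(log1, log2, log3)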


The API Example

Data from UCLA - Academic Performance Index. Note: this is the same example used in the simple linear regression lab; however, we are changing the model to have a binary outcome for logistic regression. Below is the list of variables we use in this analysis, noting that there are more variables in the API dataset.

Variable Description
api00 Academic Performance Index, continuous (dichotomized below as api_factor: 0 = under the statewide target of 800, 1 = at or above 800)
enroll School enrollment, continuous
meals Percentage of students on free and reduced lunch, continuous
full Percentage of teachers with full credentials, continuous

Here, we are reading in the data directly from the UCLA website. Additionally, we are dichotomizing the API (api00) variable to separate schools at or above the statewide target of 800 from those below it. This cut-off may be outdated and/or controversial; however, we are going to use it for the purpose of providing an example of how to do logistic regression in R.

api <- read.csv(
"https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv") %>% 
  select(api00, full, enroll, meals) %>%
  mutate(api_factor = cut(api00,
                          breaks = c(0, 800, 940),
                          labels = c("Below Target", "At or Above Target"))) # here we are dichotomizing API, see lab 1 for a refresher on what this all does.

For a short descriptor of each variable, scroll down here.

# psych::describe()
round(describe(api),2) %>% 
  pander() 
             vars   n   mean    sd median trimmed    mad min  max range  skew kurtosis    se
api00           1 400  647.6 142.2    643   645.8  177.2 369  940   571   0.1    -1.13  7.11
full            2 400  84.55 14.95     88    86.6  14.83  37  100    63 -0.97     0.17  0.75
enroll          3 400  483.5 226.4    435   459.4  202.4 130 1570  1440  1.34     3.02 11.32
meals           4 400  60.31 31.91   67.5   62.18  37.81   0  100   100 -0.41     -1.2   1.6
api_factor*     5 400   1.19  0.39      1    1.12      0   1    2     1  1.55     0.42  0.02
# I like this table more, but you can do either! Especially if you're getting some errors with gtsummary(). Useful when you have a subset, but here we left all variables in the sample so it's a long table. 

# gtsummary::tbl_summary()
#tbl_summary(api,
#            statistic = list(all_continuous() ~ "{mean} ({sd})"),
#                             missing = "no")

There is no contingency table in this example because all of our predictors are continuous.

Binary Logistic Model

We are using a binary logistic regression model to see how school characteristics relate to a school making adequate progress (api=1) or not (api=0).

Binary outcome: API adequate progress = 1, not adequate progress = 0. Predictors: % fully credentialed teachers, student enrollment, and % of students on free and reduced lunch.

mylogit <- glm(api_factor ~ full + enroll + meals, data = api, family = "binomial")
summary(mylogit)
## 
## Call:
## glm(formula = api_factor ~ full + enroll + meals, family = "binomial", 
##     data = api)
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -1.86557  -0.16221  -0.03269  -0.01028   3.12996  
## 
## Coefficients:
##               Estimate Std. Error z value Pr(>|z|)    
## (Intercept) -3.0746093  2.7320279  -1.125   0.2604    
## full         0.0582641  0.0282187   2.065   0.0389 *  
## enroll       0.0004592  0.0013843   0.332   0.7401    
## meals       -0.1096184  0.0144185  -7.603  2.9e-14 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 391.86  on 399  degrees of freedom
## Residual deviance: 147.45  on 396  degrees of freedom
## AIC: 155.45
## 
## Number of Fisher Scoring iterations: 8

Odds Ratios

exp(coef(mylogit))
## (Intercept)        full      enroll       meals 
##  0.04620768  1.05999495  1.00045932  0.89617604
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
##                     OR        2.5 %    97.5 %
## (Intercept) 0.04620768 0.0001656294 7.8435280
## full        1.05999495 1.0055761469 1.1237801
## enroll      1.00045932 0.9978836793 1.0032716
## meals       0.89617604 0.8684201733 0.9192615

Predicted Probability

#creating the means of each variable
pred_prob_api<- with(api, data.frame(full = mean(full), enroll = mean(enroll), meals = mean(meals)))
pred_prob_api
##    full  enroll  meals
## 1 84.55 483.465 60.315
pred_prob_api$apiP<- predict(mylogit, newdata = pred_prob_api, type = "response")
pred_prob_api
##    full  enroll  meals       apiP
## 1 84.55 483.465 60.315 0.01058167

This result says that for schools at the mean of all the predictors, the probability of being at target is .01. Yikes!
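To see how strongly the free-and-reduced-lunch percentage drives this, here is a sketch that recomputes the predicted probability at several values of meals while holding full and enroll at their means; the particular meals values are chosen only for illustration:

# predicted probability of being at/above target across a range of meals values,
# with full and enroll held at their means (meals values are illustrative)
meals_grid <- with(api, data.frame(full = mean(full),
                                   enroll = mean(enroll),
                                   meals = c(0, 25, 50, 75, 100)))
meals_grid$apiP <- predict(mylogit, newdata = meals_grid, type = "response")
meals_grid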

Classification Table

# Converting from probability to actual output
api$fitted_api <- ifelse(mylogit$fitted.values >= 0.5, "1", "0") #1 = At or Above Target, 0 = Below Target

# Generating the classification table
ctab <- table(api$api_factor, api$fitted_api)
ctab
##                     
##                        0   1
##   Below Target       304  19
##   At or Above Target  15  62
#I (Karen) like this table better because it includes row and column totals
sjPlot::tab_xtab(var.row = api$api_factor, var.col = api$fitted_api, title = "Observed Progress versus Predicted Progress counts", show.row.prc = TRUE, show.summary = F )

Observed Progress versus Predicted Progress counts

api_factor             fitted_api = 0    fitted_api = 1    Total
Below Target           304 (94.1 %)      19 (5.9 %)        323 (100 %)
At or Above Target     15 (19.5 %)       62 (80.5 %)       77 (100 %)
Total                  319 (79.8 %)      81 (20.2 %)       400 (100 %)

Accuracy

accuracy <- sum(diag(ctab))/sum(ctab)*100
accuracy
## [1] 91.5

Sample write up

Using a binary logistic regression, we explored how schools’ characteristics (% fully credentialed teachers, student enrollment, and % of students on free and reduced lunch) relate to whether a school makes adequate progress. Both the percent of credentialed teachers (z = 2.06, p = .04) and the percent of students on free and reduced lunch (z = -7.60, p < .01) were significant predictors. Specifically, the percent of fully credentialed teachers had an OR = 1.06, 95% CI [1.01, 1.12]. This implies that for a one percent increase in fully credentialed teachers, a school has about 6% higher odds of making target, controlling for the other predictors. Free and reduced lunch had an estimated OR = .90, 95% CI [.87, .92], which indicates that for each additional percent of students on free and reduced lunch, there is a significant decrease in the log odds of being at target. That is, for each additional percent increase in free or reduced lunch, a school has about a 10% decrease in the odds of making target growth, controlling for the other predictors. There was no effect of school enrollment when controlling for the other variables. Overall, this model had high prediction accuracy, classifying 91.5% of schools correctly.