Lab 5: Binary Logistic Regression
An overview of binary logistic regression, with two examples demonstrating the models and their interpretation.
library(pander) #pander()
library(psych) # describe()
library(gtsummary) #tbl_summary()
library(equatiomatic) # extract_eq()
library(sjPlot) # tab_xtab(), tab_model()
library(tidyverse)
This lab follows along the logistic regression lecture presented in class and provides two examples to demonstrate the models and their interpretation.
What is Binary Logistic Regression?
- It is a regression with an outcome variable (or dependent variable) that is dichotomous/binary (i.e., only two categories, such as Yes or No, 0 or 1, Disorder or No Disorder, Win or Lose).
- The predictor variables (or independent variables or explanatory variables) can be either continuous or categorical.
- In binary logistic regression, we are interested in predicting which of two possible events (e.g., win or lose) is going to happen given the predictor variable(s). A quick sketch of the function that maps the predictors onto this probability follows this list.
- It assumes a binomial distribution, or "a probability distribution that summarizes the likelihood that a value will take one of two independent values under a given set of parameters or assumptions." (Source)
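Concretely, here is a minimal sketch of that mapping: the logistic (inverse-logit) function takes any value of the linear predictor and turns it into a probability between 0 and 1.

logistic <- function(x) 1 / (1 + exp(-x)) # the logistic (inverse-logit) function
logistic(c(-4, 0, 4)) # approximately 0.02, 0.50, 0.98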
Binary Logistic Regression in R
Admissions Example
This example comes from the UCLA Statistical Consulting page found here. It assesses how GRE scores, GPA, and the prestige of the undergraduate institution affect graduate school admission.
Variable | Description |
---|---|
gre | Graduate Record Exam (GRE) scores, continuous |
gpa | Grade point average, continuous |
rank | Rank of undergraduate institutions, 1 = highest prestige, 4 = lowest, categorical |
admit | Admission decision, 0 = did not admit, 1 = admitted, binary |
Read in data
log <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv") %>%
mutate(rank = factor(rank), #we need to convert this variable to a factor
admit = factor(admit, levels = c("0", "1"), labels = c("Not Admitted", "Admitted"))) # add labels to outcome
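As an optional quick check, we can confirm that the two variables were converted to factors with the expected levels:

# optional check: rank and admit should now be factors
str(log[, c("rank", "admit")])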
Summary
# psych::describe()
round(describe(log),2) %>%
pander()
 | vars | n | mean | sd | median | trimmed | mad | min |
---|---|---|---|---|---|---|---|---|
admit* | 1 | 400 | 1.32 | 0.47 | 1 | 1.27 | 0 | 1 |
gre | 2 | 400 | 587.7 | 115.5 | 580 | 589.1 | 118.6 | 220 |
gpa | 3 | 400 | 3.39 | 0.38 | 3.4 | 3.4 | 0.4 | 2.26 |
rank* | 4 | 400 | 2.48 | 0.94 | 2 | 2.48 | 1.48 | 1 |
Table continues below
 | max | range | skew | kurtosis | se |
---|---|---|---|---|---|
admit* | 2 | 1 | 0.78 | -1.39 | 0.02 |
gre | 800 | 580 | -0.14 | -0.36 | 5.78 |
gpa | 4 | 1.74 | -0.21 | -0.6 | 0.02 |
rank* | 4 | 3 | 0.1 | -0.91 | 0.05 |
# I like this table more, but you can do either! Especially if you're getting some errors with gtsummary().
# gtsummary::tbl_summary()
tbl_summary(log,
statistic = list(all_continuous() ~ "{mean} ({sd})"),
missing = "no")
Characteristic | N = 400¹ |
---|---|
admit | |
Not Admitted | 273 (68%) |
Admitted | 127 (32%) |
gre | 588 (116) |
gpa | 3.39 (0.38) |
rank | |
1 | 61 (15%) |
2 | 151 (38%) |
3 | 121 (30%) |
4 | 67 (17%) |
¹ n (%); Mean (SD) |
Contingency tables, sometimes called cross tabulations or cross tabs, display the frequency distribution of your categorical variables. In R, we use sjPlot::tab_xtab() (found by Karen!). We want to check the contingency table to make sure there are values in each cell. Since we only have one categorical variable, rank, we only need to check one contingency table.
sjPlot::tab_xtab(var.row = log$rank, var.col = log$admit, title = "Cross tab of Admit rate by Rank of Undergraduate Institute", show.row.prc = TRUE, show.summary=F)
rank | Not Admitted | Admitted | Total
---|---|---|---
1 | 28 (45.9 %) | 33 (54.1 %) | 61 (100 %)
2 | 97 (64.2 %) | 54 (35.8 %) | 151 (100 %)
3 | 93 (76.9 %) | 28 (23.1 %) | 121 (100 %)
4 | 55 (82.1 %) | 12 (17.9 %) | 67 (100 %)
Total | 273 (68.2 %) | 127 (31.8 %) | 400 (100 %)
This table shows the row and column totals, as well as row percentages. You can also ask for column percentages, if you prefer.
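If you just want a quick look at the raw cell counts without the formatted table, a minimal base R alternative does the same check:

# a minimal base-R check that every rank-by-admit cell has observations
table(log$rank, log$admit)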
Binary Logistic Model
Recall that in linear regression, we used the lm() function. With logistic regression, we use glm(), the generalized linear model. The setup is similar to linear regression, however. One important note is to make sure we identify the "family function," or the description of the error distribution used in the model. For logistic regression, it's family = "binomial". In our example, the predictors are GRE scores, GPA, and university rank, and the outcome is graduate school admission status.
mylogit <- glm(admit ~ gre + gpa + rank, data = log, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
Let's dissect this output. It looks similar to output from simple linear regression, but the interpretation is different (as we saw in lecture). When looking at the coefficients, we see gre, gpa, and the levels of rank are significant predictors of admission status.

With binary logistic regression, the regression coefficient is the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the others constant. The intercept is the log odds for a student with a value of zero on all the predictors, a combination not observed in this sample (e.g., no one has a GPA or GRE score of zero).

So this is how we would interpret those same variables when running a logistic regression model (Note: the test statistic is now a z-statistic, the coefficient divided by its standard error, rather than the t-statistic from linear regression):
- GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, the results indicate that for every one-unit increase in GRE score, there is an increase of B = 0.002 in the log odds of being admitted to graduate school, controlling for GPA and university rank.
Additionally, this is how we would interpret the categorical variable rank. Note that because this is a categorical variable, we are comparing the categories to a comparison group (or reference group). In this example, the comparison group is rank 1:

- Students who attended rank 2 universities are less likely to be admitted to graduate school, compared to students from rank 1 universities (z = -2.134, p = 0.033).
The log odds, or logits, are not always easily interpreted and can be converted to odds ratios for easier interpretation.
Odds Ratios
The coefficients presented above are log odds. To convert them to odds ratios, we take the exponent of the coefficients (Slide 21 in lecture):

$$ \text{odds ratio} = e^{\beta} $$
# converting the log odds (logits) to odds ratios
exp(coef(mylogit))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
These values likely make a little more sense. Considering GRE in the example again:

- GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, for every one-unit increase in GRE score, the odds of being admitted to graduate school increase by a factor of 1.002, controlling for GPA and university rank. (Note this is a very small amount because a one-point increase in GRE doesn't represent a large difference.)

Said differently,

- GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, for every one-unit increase in GRE score, we expect to see about a 0.2% increase in the odds of being admitted to graduate school, controlling for GPA and university rank (see the sketch below).
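Where does that 0.2% come from? A quick sketch: subtracting 1 from each odds ratio and multiplying by 100 expresses it as a percent change in the odds.

# a quick sketch: express each odds ratio as a percent change in the odds, (OR - 1) * 100
(exp(coef(mylogit)) - 1) * 100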
Predicted Probability
To make the coefficients easier to interpret, we can take the odds ratio values and convert them to probabilities. To do this, we take the odds ratio and divide it by the odds ratio plus 1 (See slide 25):
$$ \text{probability} = \frac{\text{odds}}{\text{odds} + 1} $$
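For instance, here is a minimal by-hand sketch of that formula (the odds value is made up purely for illustration):

odds <- 1.65 # a made-up odds value; e.g., it could come from exp(log odds) for a particular case
odds / (odds + 1) # probability, approximately 0.62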
Before we dive right into conversions, we could also calculate the predicted probabilities of admission for specific values of the predictors. Recall in lecture (Slide 23) that to find the log odds of an x-value, we want to plug it into our regression equation:
extract_eq(mylogit, use_coefs = TRUE)
$$ \log\left[ \frac { \widehat{P( \operatorname{admit} = \operatorname{Admitted} )} }{ 1 - \widehat{P( \operatorname{admit} = \operatorname{Admitted} )} } \right] = -3.99 + 0(\operatorname{gre}) + 0.8(\operatorname{gpa}) - 0.68(\operatorname{rank}_{\operatorname{2}}) - 1.34(\operatorname{rank}_{\operatorname{3}}) - 1.55(\operatorname{rank}_{\operatorname{4}}) $$
To create predicted probabilities, we need to create a new data frame holding the values we want the independent variables to take on when making our predictions (in lecture, slide 23, the value was SurvRate = 40). For this example, let's use the mean. And since we have a categorical predictor, rank, let's look at the mean of gre and gpa at each value of university rank. We could, of course, do all the calculations by hand. But R is easier :)

Let's first create the data frame using with():
pred_prob <- with(log, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
pred_prob
## gre gpa rank
## 1 587.7 3.3899 1
## 2 587.7 3.3899 2
## 3 587.7 3.3899 3
## 4 587.7 3.3899 4
So we have our means of gpa and gre and each level of rank, which means we are going to get four predicted probabilities, each at the mean of the other two variables. If we did not have a categorical predictor, we would only have one.

Now let's create the predicted probabilities, using the mean of gre and gpa and each level of rank, with the predict() function. What R is doing here is taking the logits from mylogit and converting them to predicted probabilities based on the values we provided in pred_prob. We assign the result to pred_prob$rankP to create a new column within the pred_prob dataset called rankP.
pred_prob$rankP <- predict(mylogit, newdata = pred_prob, type = "response")
pred_prob
## gre gpa rank rankP
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
Great! Now we have predicted probabilities. Here is our interpretation, followed by an optional by-hand check:

- For those with an average GPA and GRE score (holding GPA and GRE constant), the predicted probability of being accepted into a graduate program is .516 for students from the highest ranked undergraduate institutions (rank = 1) and .184 for students from the lowest ranked undergraduate institutions (rank = 4). This demonstrates the relation between the rank of the undergraduate institution and the probability of being admitted to graduate school.
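To convince yourself that predict() is just applying the regression equation, here is a sketch that rebuilds the rank 1 predicted probability by hand from the coefficients and the mean predictor values:

# optional check: rebuild the rank 1 predicted probability by hand
b <- coef(mylogit)
logit_rank1 <- b["(Intercept)"] + b["gre"] * mean(log$gre) + b["gpa"] * mean(log$gpa)
exp(logit_rank1) / (1 + exp(logit_rank1)) # approximately 0.52, matching the first row of pred_prob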
Classification Table
A classification table is similar to a contingency table, but in this table we compare the actual values to the predicted values based on the binary logistic regression model. This is a way to see how well the model did at predicting the outcome of being admitted to graduate school.
In this code, we take the fitted values from the model (mylogit$fitted.values) and split them at a cutoff of p = .5: cases with a fitted value at or above .50 are assigned "Admitted." We then compare the values given by the model (fitted values) to whether or not each student was actually admitted. The purpose of this is to see how effective our model is at predicting admission status.
# Converting from probability to actual output
log$fitted_admit <- ifelse(mylogit$fitted.values >= 0.5, "1", "0") #1 = Admitted, 0 = Not Admitted
# Generating the classification table
ctab <- table(log$admit, log$fitted_admit)
ctab
##
## 0 1
## Not Admitted 254 19
## Admitted 97 30
# I (Karen) like this table better because it includes row and column totals
sjPlot::tab_xtab(var.row = log$admit, var.col = log$fitted_admit, title = "Observed Admission versus Predicted Admission counts", show.row.prc = TRUE, show.summary = F )
admit | fitted_admit = 0 | fitted_admit = 1 | Total
---|---|---|---
Not Admitted | 254 (93 %) | 19 (7 %) | 273 (100 %)
Admitted | 97 (76.4 %) | 30 (23.6 %) | 127 (100 %)
Total | 351 (87.8 %) | 49 (12.2 %) | 400 (100 %)
Here, we can see that the model correctly predicted 30 of 127, or 23.6%, of the students who actually were admitted; it misclassified the remaining 76.4% (97) of admitted students as "not admitted." Additionally, we see that it correctly classified 254 of 273 (93%) of the not admitted students. Thus, this model did a better job of predicting who was not admitted than who was admitted.
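If you want those row percentages without reading them off the formatted table, a quick sketch with base R's prop.table() reproduces them directly from ctab:

# a quick sketch: row percentages of the classification table (rows = observed, columns = predicted)
round(prop.table(ctab, margin = 1) * 100, 1)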
Accuracy
We can check the accuracy percentage:
accuracy <- sum(diag(ctab))/sum(ctab)*100
accuracy
## [1] 71
Overall, the accuracy of our model is 71%; that is, our logistic model classifies 71% of the observations correctly (the 254 + 30 = 284 correct predictions on the diagonal of ctab, out of 400).
Hierarchical model building
Just as we did in linear regression, we can proceed through a model building process with logistic regression. In this context, we are interested in seeing how we can improve the prediction of the outcome in a hierarchical fashion.
Let’s say we fit the following models:
- Predicting admission from just GPA (log1)
- Predicting admission from GPA and GRE (log2)
- Predicting admission from GPA, GRE, and rank (log3)
We can create a table of all three models using tab_model() from the sjPlot package and compare the R2 estimates across the models. Using the anova() function, we can also test whether there is a statistical improvement in fit from adding the predictor(s). In logistic regression models, we test whether the residual deviance significantly decreases. We are using the anova() function, but this is not the ANOVA test you may have run before; here it compares the deviance of each model to that of the subsequent model. If the chi-square value is significant, we can conclude that the addition of the extra variables (going from log1 to log2, for example) significantly improves the model prediction (i.e., lowers the residual deviance).
log1 <- glm(admit ~ gpa , data = log, family = "binomial")
log2 <- glm(admit ~ gpa + gre , data = log, family = "binomial")
log3 <- glm(admit ~ gpa + gre + rank, data = log, family = "binomial")
tab_model(log1, log2, log3)
Predictors | Odds Ratios (log1) | CI | p | Odds Ratios (log2) | CI | p | Odds Ratios (log3) | CI | p |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | 0.01 | 0.00 – 0.09 | \<0.001 | 0.01 | 0.00 – 0.06 | \<0.001 | 0.02 | 0.00 – 0.17 | \<0.001 |
gpa | 2.86 | 1.61 – 5.20 | \<0.001 | 2.13 | 1.14 – 4.02 | 0.018 | 2.23 | 1.17 – 4.32 | 0.015 |
gre | 1.00 | 1.00 – 1.00 | 0.011 | 1.00 | 1.00 – 1.00 | 0.038 | |||
rank \[2\] | 0.51 | 0.27 – 0.94 | 0.033 | ||||||
rank \[3\] | 0.26 | 0.13 – 0.51 | \<0.001 | ||||||
rank \[4\] | 0.21 | 0.09 – 0.47 | \<0.001 | ||||||
Observations | 400 | 400 | 400 | ||||||
R2 Tjur | 0.032 | 0.047 | 0.102 |
anova(log1, log2, log3, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: admit ~ gpa
## Model 2: admit ~ gpa + gre
## Model 3: admit ~ gpa + gre + rank
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 398 486.97
## 2 397 480.34 1 6.6236 0.01006 *
## 3 394 458.52 3 21.8265 7.088e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
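To see where those chi-square p-values come from, here is a small optional sketch that reproduces the first one from the drop in residual deviance between the nested models:

# a small sketch: the chi-square test is based on the drop in residual deviance between nested models
dev_drop <- deviance(log1) - deviance(log2) # about 6.62
pchisq(dev_drop, df = 1, lower.tail = FALSE) # about 0.01, matching the anova() output above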
Based on the ANOVA results, we see that each addition of predictors significantly improves the prediction of graduate school admission. Note, you don't have to do this, but if you have research questions about the specific predictive power of a variable, controlling for the others, you can use this approach.
The API Example
Data from UCLA - Academic Performance Index. Note: this is the same dataset used in the simple linear regression lab; however, we are changing the model to have a binary outcome for logistic regression. Below is the list of variables we used in this analysis, noting that there are more variables in the full API dataset.
Variable | Description |
---|---|
api00 | Academic Performance Index, 0 = under statewide target of 800, 1 = above 800, binary |
enroll | School enrollment, continuous |
meals | Number of students receiving free and reduced-price lunch, continuous |
full | Percentage of teachers with full credentials, continuous |
Here, we are reading in the data directly from the UCLA website. Additionally, we are dichotomizing the API (api00) variable to separate schools above the statewide target of 800 from those below. This cutoff may be outdated and/or controversial; however, we are going to use it for the purpose of providing an example of how to do logistic regression in R.
api <- read.csv(
  "https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv") %>%
  select(api00, full, enroll, meals) %>%
  mutate(api_factor = cut(api00,
                          breaks = c(0, 800, 940), # bins: (0, 800] = below target, (800, 940] = at or above
                          labels = c("Below Target", "At or Above Target"))) # here we are dichotomizing API, see lab 1 for a refresher on what this all does.
For a short description of each variable, scroll down here.
# psych::describe()
round(describe(api),2) %>%
pander()
 | vars | n | mean | sd | median | trimmed | mad | min |
---|---|---|---|---|---|---|---|---|
api00 | 1 | 400 | 647.6 | 142.2 | 643 | 645.8 | 177.2 | 369 |
full | 2 | 400 | 84.55 | 14.95 | 88 | 86.6 | 14.83 | 37 |
enroll | 3 | 400 | 483.5 | 226.4 | 435 | 459.4 | 202.4 | 130 |
meals | 4 | 400 | 60.31 | 31.91 | 67.5 | 62.18 | 37.81 | 0 |
api_factor* | 5 | 400 | 1.19 | 0.39 | 1 | 1.12 | 0 | 1 |
Table continues below
 | max | range | skew | kurtosis | se |
---|---|---|---|---|---|
api00 | 940 | 571 | 0.1 | -1.13 | 7.11 |
full | 100 | 63 | -0.97 | 0.17 | 0.75 |
enroll | 1570 | 1440 | 1.34 | 3.02 | 11.32 |
meals | 100 | 100 | -0.41 | -1.2 | 1.6 |
api_factor* | 2 | 1 | 1.55 | 0.42 | 0.02 |
# I like this table more, but you can do either! Especially if you're getting some errors with gtsummary(). Useful when you have a subset, but here we left all variables in the sample so it's a long table.
# gtsummary::tbl_summary()
#tbl_summary(api,
# statistic = list(all_continuous() ~ "{mean} ({sd})"),
# missing = "no")
There is no contingency table in this example because all of our predictors are continuous.
Binary Logistic Model
We are using a binary logistic regression model to see how school characteristics relate to a school making adequate progress (api=1) or not (api=0).
- Binary outcome: API adequate progress = 1, not adequate progress = 0
- Predictors: % fully credentialed teachers, student enrollment, and % on free and reduced lunch
mylogit <- glm(api_factor ~ full + enroll + meals, data = api, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = api_factor ~ full + enroll + meals, family = "binomial",
## data = api)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86557 -0.16221 -0.03269 -0.01028 3.12996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0746093 2.7320279 -1.125 0.2604
## full 0.0582641 0.0282187 2.065 0.0389 *
## enroll 0.0004592 0.0013843 0.332 0.7401
## meals -0.1096184 0.0144185 -7.603 2.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 391.86 on 399 degrees of freedom
## Residual deviance: 147.45 on 396 degrees of freedom
## AIC: 155.45
##
## Number of Fisher Scoring iterations: 8
Odds Ratios
exp(coef(mylogit))
## (Intercept) full enroll meals
## 0.04620768 1.05999495 1.00045932 0.89617604
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.04620768 0.0001656294 7.8435280
## full 1.05999495 1.0055761469 1.1237801
## enroll 1.00045932 0.9978836793 1.0032716
## meals 0.89617604 0.8684201733 0.9192615
Predicted Probability
#creating the means of each variable
pred_prob_api<- with(api, data.frame(full = mean(full), enroll = mean(enroll), meals = mean(meals)))
pred_prob_api
## full enroll meals
## 1 84.55 483.465 60.315
pred_prob_api$apiP<- predict(mylogit, newdata = pred_prob_api, type = "response")
pred_prob_api
## full enroll meals apiP
## 1 84.55 483.465 60.315 0.01058167
This result says that for schools with average values on all the predictors, the predicted probability of being at target is only about .01. Yikes!
Classification Table
# Converting from probability to actual output
api$fitted_api <- ifelse(mylogit$fitted.values >= 0.5, "1", "0") #1 = At or Above Target, 0 = Below Target
# Generating the classification table
ctab <- table(api$api_factor, api$fitted_api)
ctab
##
## 0 1
## Below Target 304 19
## At or Above Target 15 62
# I (Karen) like this table better because it includes row and column totals
sjPlot::tab_xtab(var.row = api$api_factor, var.col = api$fitted_api, title = "Observed Progress versus Predicted Progress counts", show.row.prc = TRUE, show.summary = F )
api_factor | fitted_api = 0 | fitted_api = 1 | Total
---|---|---|---
Below Target | 304 (94.1 %) | 19 (5.9 %) | 323 (100 %)
At or Above Target | 15 (19.5 %) | 62 (80.5 %) | 77 (100 %)
Total | 319 (79.8 %) | 81 (20.2 %) | 400 (100 %)
Accuracy
accuracy <- sum(diag(ctab))/sum(ctab)*100
accuracy
## [1] 91.5
Sample write up
Using a binary logistic regression, we explored how school characteristics (% fully credentialed teachers, student enrollment, and % on free and reduced lunch) relate to whether a school makes adequate progress. Both the percent of fully credentialed teachers (z = 2.06, p = .04) and the percent of students on free and reduced lunch (z = -7.60, p < .001) were significant predictors. Specifically, the percent of fully credentialed teachers had OR = 1.06, 95% CI [1.01, 1.12]; this implies that for a one percent increase in fully credentialed teachers, a school has about 6% higher odds of making target. Free and reduced lunch had an estimated OR = .90, 95% CI [.87, .92], indicating that each additional percent of students on free and reduced lunch is associated with a significant decrease in the odds of being at target. That is, for each additional percent of students on free or reduced lunch, a school has about a 10% decrease in the odds of making target growth, controlling for the other predictors. There was no effect of school enrollment when controlling for the other variables. Overall, this model had high prediction accuracy, classifying 91.5% of schools correctly.