Lab 5: Binary Logistic Regression
An overview of binary logistic regression, with two examples demonstrating the models and their interpretation.
library(pander) #pander()
library(psych) # describe()
library(gtsummary) #tbl_summary()
library(equatiomatic) # extract_eq()
library(sjPlot) # tab_xtab(), tab_model()
library(tidyverse)
This lab follows along the logistic regression lecture presented in class and provides two examples to demonstrate the models and their interpretation.
What is Binary Logistic Regression?
- It is a regression with an outcome variable (or dependent variable) that is dichotomous/binary (i.e., only two categories, such as Yes or No, 0 or 1, Disorder or No Disorder, Win or Lose).
- The predictor variables (or independent variables or explanatory variables) can be either continuous or categorical.
- In binary logistic regression, we are interested in predicting which of two possible events (e.g., win or lose) is going to happen given the predictor variable(s). A quick sketch of the function that maps the predictors onto this probability follows this list.
- It assumes a binomial distribution, or "a probability distribution that summarizes the likelihood that a value will take one of two independent values under a given set of parameters or assumptions." (Source)
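Concretely, here is a minimal sketch of that mapping: the logistic (inverse-logit) function takes any value of the linear predictor and turns it into a probability between 0 and 1.

logistic <- function(x) 1 / (1 + exp(-x)) # the logistic (inverse-logit) function
logistic(c(-4, 0, 4)) # approximately 0.02, 0.50, 0.98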
Binary Logistic Regression in R
Admissions Example
This example comes from the UCLA Statistical Consulting page found here. It assesses how GRE scores, GPA, and the prestige of the undergraduate institution affect graduate school admission.
Variable | Description |
---|---|
gre | Graduate Record Exam (GRE) scores, continuous |
gpa | Grade point average, continuous |
rank | Rank of undergraduate institutions, 1 = highest prestige, 4 = lowest, categorical |
admit | Admission decision, 0 = did not admit, 1 = admitted, binary |
Read in data
log <- read.csv("https://stats.idre.ucla.edu/stat/data/binary.csv") %>%
mutate(rank = factor(rank), #we need to convert this variable to a factor
admit = factor(admit, levels = c("0", "1"), labels = c("Not Admitted", "Admitted"))) # add labels to outcome
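As an optional quick check, we can confirm that the two variables were converted to factors with the expected levels:

# optional check: rank and admit should now be factors
str(log[, c("rank", "admit")])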
Summary
# psych::describe()
round(describe(log),2) %>%
pander()
 | vars | n | mean | sd | median | trimmed | mad | min |
---|---|---|---|---|---|---|---|---|
admit* | 1 | 400 | 1.32 | 0.47 | 1 | 1.27 | 0 | 1 |
gre | 2 | 400 | 587.7 | 115.5 | 580 | 589.1 | 118.6 | 220 |
gpa | 3 | 400 | 3.39 | 0.38 | 3.4 | 3.4 | 0.4 | 2.26 |
rank* | 4 | 400 | 2.48 | 0.94 | 2 | 2.48 | 1.48 | 1 |
Table continues below
 | max | range | skew | kurtosis | se |
---|---|---|---|---|---|
admit* | 2 | 1 | 0.78 | -1.39 | 0.02 |
gre | 800 | 580 | -0.14 | -0.36 | 5.78 |
gpa | 4 | 1.74 | -0.21 | -0.6 | 0.02 |
rank* | 4 | 3 | 0.1 | -0.91 | 0.05 |
# I like this table more, but you can do either! Especially if you're getting some errors with gtsummary().
# gtsummary::tbl_summary()
tbl_summary(log,
statistic = list(all_continuous() ~ "{mean} ({sd})"),
missing = "no")
Characteristic | N = 400¹ |
---|---|
admit | |
Not Admitted | 273 (68%) |
Admitted | 127 (32%) |
gre | 588 (116) |
gpa | 3.39 (0.38) |
rank | |
1 | 61 (15%) |
2 | 151 (38%) |
3 | 121 (30%) |
4 | 67 (17%) |
¹ n (%); Mean (SD) |
Contingency tables, sometimes called cross tabulations or cross tabs, display the frequency distribution of your categorical variables. In R, we use sjPlot::tab_xtab() (found by Karen!). We want to check the contingency table to make sure there are values in each cell. Since we only have one categorical variable, rank, we only need to check one contingency table.
sjPlot::tab_xtab(var.row = log$rank, var.col = log$admit, title = "Cross tab of Admit rate by Rank of Undergraduate Institute", show.row.prc = TRUE, show.summary=F)
rank | Not Admitted | Admitted | Total
---|---|---|---
1 | 28 (45.9 %) | 33 (54.1 %) | 61 (100 %)
2 | 97 (64.2 %) | 54 (35.8 %) | 151 (100 %)
3 | 93 (76.9 %) | 28 (23.1 %) | 121 (100 %)
4 | 55 (82.1 %) | 12 (17.9 %) | 67 (100 %)
Total | 273 (68.2 %) | 127 (31.8 %) | 400 (100 %)
This table shows the row and column totals, as well as row percentages. You can also ask for column percentages, if you prefer.
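If you just want a quick look at the raw cell counts without the formatted table, a minimal base R alternative does the same check:

# a minimal base-R check that every rank-by-admit cell has observations
table(log$rank, log$admit)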
Binary Logistic Model
Recall that in linear regression, we used the lm() function. With logistic regression, we use glm(), the generalized linear model. The setup is similar to linear regression, however. One important note is to make sure we identify the "family function," or the description of the error distribution used in the model. For logistic regression, it's family = "binomial". In our example, the predictors are GRE scores, GPA, and university rank, and the outcome is graduate school admission status.
mylogit <- glm(admit ~ gre + gpa + rank, data = log, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = admit ~ gre + gpa + rank, family = "binomial",
## data = log)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.6268 -0.8662 -0.6388 1.1490 2.0790
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.989979 1.139951 -3.500 0.000465 ***
## gre 0.002264 0.001094 2.070 0.038465 *
## gpa 0.804038 0.331819 2.423 0.015388 *
## rank2 -0.675443 0.316490 -2.134 0.032829 *
## rank3 -1.340204 0.345306 -3.881 0.000104 ***
## rank4 -1.551464 0.417832 -3.713 0.000205 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 499.98 on 399 degrees of freedom
## Residual deviance: 458.52 on 394 degrees of freedom
## AIC: 470.52
##
## Number of Fisher Scoring iterations: 4
Let's dissect this output. It looks similar to output from simple linear regression, but the interpretation is different (as we saw in lecture). When looking at the coefficients, we see gre, gpa, and the levels of rank are significant predictors of admission status.

With binary logistic regression, the regression coefficient is the change in the log odds of the outcome for a one-unit increase in the predictor variable, holding the others constant. The intercept is the log odds for a student with a value of zero on all the predictors, a combination not observed in this sample (e.g., no one has a GPA or GRE score of zero).

So this is how we would interpret those same variables when running a logistic regression model (Note: the test statistic is now a z-statistic, the coefficient divided by its standard error, rather than the t-statistic from linear regression):
- GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, the results indicate that for every one-unit increase in GRE score, there is an increase of B = 0.002 in the log odds of being admitted to graduate school, controlling for GPA and university rank.
Additionally, this is how we would interpret the categorical variable rank. Note that because this is a categorical variable, we are comparing the categories to a comparison group (or reference group). In this example, the comparison group is rank 1:

- Students who attended rank 2 universities are less likely to be admitted to graduate school, compared to students from rank 1 universities (z = -2.134, p = 0.033).
The log odds, or logits, are not always easily interpreted and can be converted to odds ratios for easier interpretation.
Odds Ratios
The coefficients presented above are log odds. To convert them to odds ratios, we take the exponent of the coefficients (Slide 21 in lecture):

$$ \text{odds ratio} = e^{\beta} $$
# converting the log odds (logits) to odds ratios
exp(coef(mylogit))
## (Intercept) gre gpa rank2 rank3 rank4
## 0.0185001 1.0022670 2.2345448 0.5089310 0.2617923 0.2119375
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.0185001 0.001889165 0.1665354
## gre 1.0022670 1.000137602 1.0044457
## gpa 2.2345448 1.173858216 4.3238349
## rank2 0.5089310 0.272289674 0.9448343
## rank3 0.2617923 0.131641717 0.5115181
## rank4 0.2119375 0.090715546 0.4706961
These values likely make a little more sense. Considering GRE in the example again:

- GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, for every one-unit increase in GRE score, the odds of being admitted to graduate school increase by a factor of 1.002, controlling for GPA and university rank. (Note this is a very small amount because a one-point increase in GRE doesn't represent a large difference.)

Said differently,

- GRE was significantly related to admission decision (z = 2.070, p = 0.038). Specifically, for every one-unit increase in GRE score, we expect to see about a 0.2% increase in the odds of being admitted to graduate school, controlling for GPA and university rank (see the sketch below).
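Where does that 0.2% come from? A quick sketch: subtracting 1 from each odds ratio and multiplying by 100 expresses it as a percent change in the odds.

# a quick sketch: express each odds ratio as a percent change in the odds, (OR - 1) * 100
(exp(coef(mylogit)) - 1) * 100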
Predicted Probability
To make the coefficients easier to interpret, we can take the odds ratio values and convert them to probabilities. To do this, we take the odds ratio and divide it by the odds ratio plus 1 (See slide 25):
$$ \text{probability} = \frac{\text{odds}}{\text{odds} + 1} $$
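For instance, here is a minimal by-hand sketch of that formula (the odds value is made up purely for illustration):

odds <- 1.65 # a made-up odds value; e.g., it could come from exp(log odds) for a particular case
odds / (odds + 1) # probability, approximately 0.62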
Before we dive right into conversions, we could also calculate the predicted probabilities of admission for specific values of the predictors. Recall in lecture (Slide 23) that to find the log odds of an x-value, we want to plug it into our regression equation:
extract_eq(mylogit, use_coefs = TRUE)
$$ \log\left[ \frac { \widehat{P( \operatorname{admit} = \operatorname{Admitted} )} }{ 1 - \widehat{P( \operatorname{admit} = \operatorname{Admitted} )} } \right] = -3.99 + 0(\operatorname{gre}) + 0.8(\operatorname{gpa}) - 0.68(\operatorname{rank}_{\operatorname{2}}) - 1.34(\operatorname{rank}_{\operatorname{3}}) - 1.55(\operatorname{rank}_{\operatorname{4}}) $$
To create predicted probabilities, we need to create a new data frame holding the values we want the independent variables to take on when making our predictions (in lecture, slide 23, the value was SurvRate = 40). For this example, let's use the mean. And since we have a categorical predictor, rank, let's look at the mean of gre and gpa at each value of university rank. We could, of course, do all the calculations by hand. But R is easier :)

Let's first create the data frame using with():
pred_prob <- with(log, data.frame(gre = mean(gre), gpa = mean(gpa), rank = factor(1:4)))
pred_prob
## gre gpa rank
## 1 587.7 3.3899 1
## 2 587.7 3.3899 2
## 3 587.7 3.3899 3
## 4 587.7 3.3899 4
So we have our means of gpa and gre and each level of rank, which means we are going to get four predicted probabilities, each at the mean of the other two variables. If we did not have a categorical predictor, we would only have one.

Now let's create the predicted probabilities, using the mean of gre and gpa and each level of rank, with the predict() function. What R is doing here is taking the logits from mylogit and converting them to predicted probabilities based on the values we provided in pred_prob. We assign the result to pred_prob$rankP to create a new column within the pred_prob dataset called rankP.
pred_prob$rankP <- predict(mylogit, newdata = pred_prob, type = "response")
pred_prob
## gre gpa rank rankP
## 1 587.7 3.3899 1 0.5166016
## 2 587.7 3.3899 2 0.3522846
## 3 587.7 3.3899 3 0.2186120
## 4 587.7 3.3899 4 0.1846684
Great! Now we have predicted probabilities. Here is our interpretation, followed by an optional by-hand check:

- For those with an average GPA and GRE score (holding GPA and GRE constant), the predicted probability of being accepted into a graduate program is .516 for students from the highest ranked undergraduate institutions (rank = 1) and .184 for students from the lowest ranked undergraduate institutions (rank = 4). This demonstrates the relation between the rank of the undergraduate institution and the probability of being admitted to graduate school.
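To convince yourself that predict() is just applying the regression equation, here is a sketch that rebuilds the rank 1 predicted probability by hand from the coefficients and the mean predictor values:

# optional check: rebuild the rank 1 predicted probability by hand
b <- coef(mylogit)
logit_rank1 <- b["(Intercept)"] + b["gre"] * mean(log$gre) + b["gpa"] * mean(log$gpa)
exp(logit_rank1) / (1 + exp(logit_rank1)) # approximately 0.52, matching the first row of pred_prob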
Classification Table
A classification table is similar to a contingency table, but in this table we compare the actual values to the predicted values based on the binary logistic regression model. This is a way to see how well the model did at predicting the outcome of being admitted to graduate school.
In this code, we take the fitted values from the model (mylogit$fitted.values) and split them at a cutoff of p = .5: cases with a fitted value at or above .50 are assigned "Admitted." We then compare the values given by the model (fitted values) to whether or not each student was actually admitted. The purpose of this is to see how effective our model is at predicting admission status.
# Converting from probability to actual output
log$fitted_admit <- ifelse(mylogit$fitted.values >= 0.5, "1", "0") #1 = Admitted, 0 = Not Admitted
# Generating the classification table
ctab <- table(log$admit, log$fitted_admit)
ctab
##
## 0 1
## Not Admitted 254 19
## Admitted 97 30
# I (Karen) like this table better because it includes row and column totals
sjPlot::tab_xtab(var.row = log$admit, var.col = log$fitted_admit, title = "Observed Admission versus Predicted Admission counts", show.row.prc = TRUE, show.summary = F )
admit | fitted_admit = 0 | fitted_admit = 1 | Total
---|---|---|---
Not Admitted | 254 (93 %) | 19 (7 %) | 273 (100 %)
Admitted | 97 (76.4 %) | 30 (23.6 %) | 127 (100 %)
Total | 351 (87.8 %) | 49 (12.2 %) | 400 (100 %)
Here, we can see that the model correctly predicted 30 of 127, or 23.6%, of the students who actually were admitted; it misclassified the remaining 76.4% (97) of admitted students as "not admitted." Additionally, we see that it correctly classified 254 of 273 (93%) of the not admitted students. Thus, this model did a better job of predicting who was not admitted than who was admitted.
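If you want those row percentages without reading them off the formatted table, a quick sketch with base R's prop.table() reproduces them directly from ctab:

# a quick sketch: row percentages of the classification table (rows = observed, columns = predicted)
round(prop.table(ctab, margin = 1) * 100, 1)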
Accuracy
We can check the accuracy percentage:
accuracy <- sum(diag(ctab))/sum(ctab)*100
accuracy
## [1] 71
Overall, the accuracy of our model is 71%; that is, our logistic model classifies 71% of the observations correctly (the 254 + 30 = 284 correct predictions on the diagonal of ctab, out of 400).
Hierarchical model building
Just as we did in linear regression, we can proceed through a model building process with logistic regression. In this context, we are interested in seeing how we can improve the prediction of the outcome in a hierarchical fashion.
Let’s say we fit the following models:
- Predicting admission from just GPA (log1)
- Predicting admission from GPA and GRE (log2)
- Predicting admission from GPA, GRE, and rank (log3)
We can create a table of all three models using tab_model() from the sjPlot package and compare the R2 estimates across the models. Using the anova() function, we can also test whether there is a statistical improvement in fit from adding the predictor(s). In logistic regression models, we test whether the residual deviance significantly decreases. We are using the anova() function, but this is not the ANOVA test you may have run before; here it compares the deviance of each model to that of the subsequent model. If the chi-square value is significant, we can conclude that the addition of the extra variables (going from log1 to log2, for example) significantly improves the model prediction (i.e., lowers the residual deviance).
log1 <- glm(admit ~ gpa , data = log, family = "binomial")
log2 <- glm(admit ~ gpa + gre , data = log, family = "binomial")
log3 <- glm(admit ~ gpa + gre + rank, data = log, family = "binomial")
tab_model(log1, log2, log3)
Predictors | Odds Ratios (log1) | CI | p | Odds Ratios (log2) | CI | p | Odds Ratios (log3) | CI | p |
---|---|---|---|---|---|---|---|---|---|
(Intercept) | 0.01 | 0.00 – 0.09 | \<0.001 | 0.01 | 0.00 – 0.06 | \<0.001 | 0.02 | 0.00 – 0.17 | \<0.001 |
gpa | 2.86 | 1.61 – 5.20 | \<0.001 | 2.13 | 1.14 – 4.02 | 0.018 | 2.23 | 1.17 – 4.32 | 0.015 |
gre | 1.00 | 1.00 – 1.00 | 0.011 | 1.00 | 1.00 – 1.00 | 0.038 | |||
rank \[2\] | 0.51 | 0.27 – 0.94 | 0.033 | ||||||
rank \[3\] | 0.26 | 0.13 – 0.51 | \<0.001 | ||||||
rank \[4\] | 0.21 | 0.09 – 0.47 | \<0.001 | ||||||
Observations | 400 | 400 | 400 | ||||||
R2 Tjur | 0.032 | 0.047 | 0.102 |
anova(log1, log2, log3, test="Chisq")
## Analysis of Deviance Table
##
## Model 1: admit ~ gpa
## Model 2: admit ~ gpa + gre
## Model 3: admit ~ gpa + gre + rank
## Resid. Df Resid. Dev Df Deviance Pr(>Chi)
## 1 398 486.97
## 2 397 480.34 1 6.6236 0.01006 *
## 3 394 458.52 3 21.8265 7.088e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
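To see where those chi-square p-values come from, here is a small optional sketch that reproduces the first one from the drop in residual deviance between the nested models:

# a small sketch: the chi-square test is based on the drop in residual deviance between nested models
dev_drop <- deviance(log1) - deviance(log2) # about 6.62
pchisq(dev_drop, df = 1, lower.tail = FALSE) # about 0.01, matching the anova() output above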
Based on the ANOVA results, we see that each addition of predictors significantly improves the prediction of graduate school admission. Note, you don't have to do this, but if you have research questions about the specific predictive power of a variable, controlling for the others, you can use this approach.
The API Example
Data from UCLA - Academic Performance Index. Note: this is the same dataset used in the simple linear regression lab; however, we are changing the model to have a binary outcome for logistic regression. Below is the list of variables we used in this analysis, noting that there are more variables in the full API dataset.
Variable | Description |
---|---|
api00 | Academic Performance Index, 0 = under statewide target of 800, 1 = above 800, binary |
enroll | School enrollment, continuous |
meals | Number of students receiving free and reduced-price lunch, continuous |
full | Percentage of teachers with full credentials, continuous |
Here, we are reading in the data directly from the UCLA website. Additionally, we are dichotomizing the API (api00) variable to separate schools above the statewide target of 800 from those below. This cutoff may be outdated and/or controversial; however, we are going to use it for the purpose of providing an example of how to do logistic regression in R.
api <- read.csv(
  "https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv") %>%
  select(api00, full, enroll, meals) %>%
  mutate(api_factor = cut(api00,
                          breaks = c(0, 800, 940), # bins: (0, 800] = below target, (800, 940] = at or above
                          labels = c("Below Target", "At or Above Target"))) # here we are dichotomizing API, see lab 1 for a refresher on what this all does.
For a short description of each variable, scroll down here.
# psych::describe()
round(describe(api),2) %>%
pander()
 | vars | n | mean | sd | median | trimmed | mad | min |
---|---|---|---|---|---|---|---|---|
api00 | 1 | 400 | 647.6 | 142.2 | 643 | 645.8 | 177.2 | 369 |
full | 2 | 400 | 84.55 | 14.95 | 88 | 86.6 | 14.83 | 37 |
enroll | 3 | 400 | 483.5 | 226.4 | 435 | 459.4 | 202.4 | 130 |
meals | 4 | 400 | 60.31 | 31.91 | 67.5 | 62.18 | 37.81 | 0 |
api_factor* | 5 | 400 | 1.19 | 0.39 | 1 | 1.12 | 0 | 1 |
Table continues below
 | max | range | skew | kurtosis | se |
---|---|---|---|---|---|
api00 | 940 | 571 | 0.1 | -1.13 | 7.11 |
full | 100 | 63 | -0.97 | 0.17 | 0.75 |
enroll | 1570 | 1440 | 1.34 | 3.02 | 11.32 |
meals | 100 | 100 | -0.41 | -1.2 | 1.6 |
api_factor* | 2 | 1 | 1.55 | 0.42 | 0.02 |
# I like this table more, but you can do either! Especially if you're getting some errors with gtsummary(). Useful when you have a subset, but here we left all variables in the sample so it's a long table.
# gtsummary::tbl_summary()
#tbl_summary(api,
# statistic = list(all_continuous() ~ "{mean} ({sd})"),
# missing = "no")
There is no contingency table in this example because all of our predictors are continuous.
Binary Logistic Model
We are using a binary logistic regression model to see how school characteristics relate to a school making adequate progress (api=1) or not (api=0).
- Binary outcome: API adequate progress = 1, not adequate progress = 0
- Predictors: % fully credentialed teachers, student enrollment, and % on free and reduced lunch
mylogit <- glm(api_factor ~ full + enroll + meals, data = api, family = "binomial")
summary(mylogit)
##
## Call:
## glm(formula = api_factor ~ full + enroll + meals, family = "binomial",
## data = api)
##
## Deviance Residuals:
## Min 1Q Median 3Q Max
## -1.86557 -0.16221 -0.03269 -0.01028 3.12996
##
## Coefficients:
## Estimate Std. Error z value Pr(>|z|)
## (Intercept) -3.0746093 2.7320279 -1.125 0.2604
## full 0.0582641 0.0282187 2.065 0.0389 *
## enroll 0.0004592 0.0013843 0.332 0.7401
## meals -0.1096184 0.0144185 -7.603 2.9e-14 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
## Null deviance: 391.86 on 399 degrees of freedom
## Residual deviance: 147.45 on 396 degrees of freedom
## AIC: 155.45
##
## Number of Fisher Scoring iterations: 8
Odds Ratios
exp(coef(mylogit))
## (Intercept) full enroll meals
## 0.04620768 1.05999495 1.00045932 0.89617604
## odds ratios and 95% CI
exp(cbind(OR = coef(mylogit), confint(mylogit)))
## OR 2.5 % 97.5 %
## (Intercept) 0.04620768 0.0001656294 7.8435280
## full 1.05999495 1.0055761469 1.1237801
## enroll 1.00045932 0.9978836793 1.0032716
## meals 0.89617604 0.8684201733 0.9192615
Predicted Probability
#creating the means of each variable
pred_prob_api<- with(api, data.frame(full = mean(full), enroll = mean(enroll), meals = mean(meals)))
pred_prob_api
## full enroll meals
## 1 84.55 483.465 60.315
pred_prob_api$apiP<- predict(mylogit, newdata = pred_prob_api, type = "response")
pred_prob_api
## full enroll meals apiP
## 1 84.55 483.465 60.315 0.01058167
This result says that for schools with average values on all the predictors, the predicted probability of being at target is only about .01. Yikes!
Classification Table
# Converting from probability to actual output
api$fitted_api <- ifelse(mylogit$fitted.values >= 0.5, "1", "0") #1 = At or Above Target, 0 = Below Target
# Generating the classification table
ctab <- table(api$api_factor, api$fitted_api)
ctab
##
## 0 1
## Below Target 304 19
## At or Above Target 15 62
# I (Karen) like this table better because it includes row and column totals
sjPlot::tab_xtab(var.row = api$api_factor, var.col = api$fitted_api, title = "Observed Progress versus Predicted Progress counts", show.row.prc = TRUE, show.summary = F )
api_factor | fitted_api = 0 | fitted_api = 1 | Total
---|---|---|---
Below Target | 304 (94.1 %) | 19 (5.9 %) | 323 (100 %)
At or Above Target | 15 (19.5 %) | 62 (80.5 %) | 77 (100 %)
Total | 319 (79.8 %) | 81 (20.2 %) | 400 (100 %)
Accuracy
accuracy <- sum(diag(ctab))/sum(ctab)*100
accuracy
## [1] 91.5
Sample write up
Using a binary logistic regression, we explored how school characteristics (% fully credentialed teachers, student enrollment, and % on free and reduced lunch) relate to whether a school makes adequate progress. Both the percent of fully credentialed teachers (z = 2.06, p = .04) and the percent of students on free and reduced lunch (z = -7.60, p < .001) were significant predictors. Specifically, the percent of fully credentialed teachers had OR = 1.06, 95% CI [1.01, 1.12]; this implies that for a one percent increase in fully credentialed teachers, a school has about 6% higher odds of making target. Free and reduced lunch had an estimated OR = .90, 95% CI [.87, .92], indicating that each additional percent of students on free and reduced lunch is associated with a significant decrease in the odds of being at target. That is, for each additional percent of students on free or reduced lunch, a school has about a 10% decrease in the odds of making target growth, controlling for the other predictors. There was no effect of school enrollment when controlling for the other variables. Overall, this model had high prediction accuracy, classifying 91.5% of schools correctly.