Lab 2: Simple Linear Regression
Overview of simple linear regression.
library(psych) #describe()
library(PerformanceAnalytics) #chart.correlation()
library(lm.beta) #lm.beta()
library(sjPlot) #tab_model()
library(gridExtra) #grid.arrange()
library(tidyverse)
This lab is an overview of simple linear regression and was created alongside Karen’s lectures and her code.
Simple Linear Regression
The Equation
The equation for simple linear regression is:
$$Y_i = \alpha + \beta x_i + \epsilon_i$$
Expressed with betas:
$$Y_i = \beta_0 + \beta_1 x_i + \epsilon_i$$
When we are predicting Y, we write the prediction equation as:
$$\hat{Y} = a + bx$$
As presented in lecture and Lab 1, here is how we plug the estimated intercept and slope into the equation:
# Reading in physics.csv dataset
physics <- read_csv("physics.csv")
#`msat` predicting `physics`
slr_model <- lm(physics~msat,data=physics)
summary(slr_model)
##
## Call:
## lm(formula = physics ~ msat, data = physics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.735 -7.484 -0.243 8.980 30.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.54334 10.46659 -5.593 7.54e-08 ***
## msat 0.27977 0.02112 13.246 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.76 on 193 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.4762, Adjusted R-squared: 0.4735
## F-statistic: 175.5 on 1 and 193 DF, p-value: < 2.2e-16
$$\hat{Y} = -58.54 + 0.28x$$
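To check the arithmetic (a minimal sketch, assuming the slr_model fit above), we can pull the coefficients out of the model and ask R for the same prediction, here for a hypothetical student with an msat score of 500:
# Intercept and slope from the fitted model
coef(slr_model)
# Predicted physics score for a hypothetical msat of 500
# (by hand: -58.54 + 0.28 * 500 is roughly 81.3)
predict(slr_model, newdata = data.frame(msat = 500))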
An Example
Data from UCLA - Academic Performance Index:
Variable | Description |
---|---|
api00 | Academic Performance Index |
enroll | School enrollment |
meals | Number of students on free or reduced-price lunch |
full | Percentage of teachers with full credentials |
Read in data and view first 6 rows. Here, we are reading in the data directly from the UCLA website:
api <- read.csv(
"https://stats.idre.ucla.edu/wp-content/uploads/2019/02/elemapi2v2.csv") %>%
select(api00, full, enroll, meals)
#First 6 rows
head(api)
## api00 full enroll meals
## 1 693 76 247 67
## 2 570 79 463 92
## 3 546 68 395 97
## 4 571 87 418 90
## 5 478 87 520 89
## 6 858 100 343 10
Descriptive Statistics:
describe(api)
## vars n mean sd median trimmed mad min max range skew
## api00 1 400 647.62 142.25 643.0 645.79 177.17 369 940 571 0.10
## full 2 400 84.55 14.95 88.0 86.60 14.83 37 100 63 -0.97
## enroll 3 400 483.46 226.45 435.0 459.41 202.37 130 1570 1440 1.34
## meals 4 400 60.31 31.91 67.5 62.18 37.81 0 100 100 -0.41
## kurtosis se
## api00 -1.13 7.11
## full 0.17 0.75
## enroll 3.02 11.32
## meals -1.20 1.60
Correlations:
cor(api, method = "pearson", use = "complete.obs") %>%
round(2)
## api00 full enroll meals
## api00 1.00 0.57 -0.32 -0.90
## full 0.57 1.00 -0.34 -0.53
## enroll -0.32 -0.34 1.00 0.24
## meals -0.90 -0.53 0.24 1.00
A fancier correlation matrix from Karen’s lecture and code:
chart.Correlation(api, histogram=TRUE, pch=19)
Histogram for API schools:
api %>%
ggplot(aes(x = api00)) +
geom_histogram(col="dark green",
fill="green",
alpha = .2,
binwidth = 30) +
labs(title="Histogram for API for schools", x="API", y="Count") +
xlim(c(200, 1000)) +
ylim(c(0,50))
Simple linear regression:
[Research question]{.ul}: Does school enrollment predict school API?
m1 <- lm(api00 ~ enroll, data = api)
summary(m1)
##
## Call:
## lm(formula = api00 ~ enroll, data = api)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
Standardized Beta Coefficients:
lm.beta(m1)
##
## Call:
## lm(formula = api00 ~ enroll, data = api)
##
## Standardized Coefficients::
## (Intercept) enroll
## NA -0.3181722
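In simple regression the standardized slope is just the Pearson correlation between the predictor and the outcome (note the -0.32 between enroll and api00 in the correlation matrix above). A quick check of the lm.beta() value, refitting the model on z-scores:
# Standardized beta by refitting on z-scored variables
coef(lm(scale(api00) ~ scale(enroll), data = api))
# Equivalent hand calculation: b * (SD of x / SD of y)
with(api, coef(m1)["enroll"] * sd(enroll) / sd(api00))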
Nicely presented regression table using `tab_model()`:
tab_model(m1, show.std = TRUE, show.stat = TRUE, show.zeroinf = TRUE)
Dependent variable: api00

Predictors | Estimates | std. Beta | CI | standardized CI | Statistic | p |
---|---|---|---|---|---|---|
(Intercept) | 744.25 | 0.00 | 712.93 – 775.57 | -0.09 – 0.09 | 46.71 | <0.001 |
enroll | -0.20 | -0.32 | -0.26 – -0.14 | -0.41 – -0.22 | -6.70 | <0.001 |
Observations | 400 | | | | | |
R² / R² adjusted | 0.101 / 0.099 | | | | | |
Here, we can see both unstandardized and standardized regression coefficients.
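For reference, the unstandardized confidence intervals in the table can be reproduced with base R’s confint():
# 95% confidence intervals for the intercept and slope of m1
round(confint(m1), 2)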
Outliers
Outlier Detection
From Karen’s slides:
There are various ways to diagnose outliers:
[Distance (residuals)]{.ul}: useful in identifying potential outliers in the dependent variable (Y).
[Leverage]{.ul} (h~i~): useful in identifying potential outliers in the independent variables (x’s); leverage identifies outliers with respect to the X values.
[Influence]{.ul}: combines distance and leverage to identify unusually influential observations (use Cook’s D).
Distance (residuals)
The studentized residual:
$$ e_{s_i} = \frac{e_i}{s \sqrt{1-h_i}} = \frac{e_i}{\sqrt{MS_{Error}(1-h_i)}} $$
The studentized residual identifies outliers with respect to the Y values.
To find studentized residuals in R:
# We use the MASS package for studres(). I don't like this package because it masks
# the `select()` function from dplyr (tidyverse), so after we find the studentized
# residuals, let's detach it.
library(MASS)
#Using API data (m1)
stud_res <- as.data.frame(studres(m1)) %>%
rename("stud_res" = "studres(m1)")
#View first 6 rows
head(stud_res)
## stud_res
## 1 -0.01397309
## 2 -0.60544501
## 3 -0.88459579
## 4 -0.66480228
## 5 -1.20436717
## 6 1.35389271
#Let's detach the package so we don't get any errors moving forward
detach("package:MASS")
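If you would rather not attach MASS at all, base R’s rstudent() extracts studentized residuals directly from the fitted model and should match the studres() values above:
# Studentized residuals without attaching MASS
head(rstudent(m1))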
Let’s plot the predictor variable (`enroll`) and the corresponding studentized residuals:
plot(api$enroll, as.matrix(stud_res), ylab='Studentized Residuals', xlab='Predictor')
abline(0,0)
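Reference lines at ±3 make the usual cutoff easier to see (a small optional addition while the plot above is still the active device):
# Dashed reference lines at the common |studentized residual| = 3 cutoff
abline(h = c(-3, 3), lty = 2, col = "red")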
We can see that there is a studentized residual with an absolute value greater than (or pretty close to) 3 (top right). Let’s combine the studentized outliers with our dataset to get an idea of which observation is the outlier here.
We can add the studentized residuals as a column using `add_column()` (from the tibble package, loaded with the tidyverse) and then sort in descending order using `arrange()` and `desc()`.
new_api <- api %>%
add_column(stud_res)
#Sort stud_res by descending:
new_api %>%
arrange(desc(stud_res)) %>%
head()
## api00 full enroll meals stud_res
## 8 831 96 1513 2 2.993100
## 242 903 100 696 2 2.222040
## 86 912 97 611 6 2.160181
## 196 894 93 615 13 2.030676
## 339 892 78 602 2 1.995923
## 239 897 75 547 11 1.950388
We can see that observation 8 has a studentized residual of about 3.
Leverage
$$ h_i = \frac{1}{N} + \frac{(x_i - \overline{x})^2}{\sum_{i = 1}^{N}(x_i - \overline{x})^2} $$
To calculate a leverage score for each observation in R, we use the `hatvalues()` function on the fitted model:
#Using API example (m1)
hats <- as.data.frame(hatvalues(m1)) %>%
rename("hats" = "hatvalues(m1)")
head(hats)
## hats
## 1 0.005232891
## 2 0.002520470
## 3 0.002882500
## 4 0.002709463
## 5 0.002565239
## 6 0.003464328
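As a quick sanity check, the leverage formula above reproduces the first value (about 0.0052) by hand:
# Hand calculation of h_1 for the first school, using the leverage formula
x <- api$enroll
(1 / length(x)) + (x[1] - mean(x))^2 / sum((x - mean(x))^2)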
We can also plot the leverage values for each observation:
plot(hatvalues(m1), type = 'h')
This plot shows a few clear spikes in the leverage values. Let’s find the cutoff.
Leverage (h~i~) values greater than \(\frac{3(p+1)}{N}\) are considered influential (\(p\) = number of predictors, \(N\) = sample size).
To calculate the leverage cut-off scores in R:
# Using our API example
#Number of predictors: 1
p <- 1
#Sample Size: 400
n <- 400
leverage <- (3*(p+1))/n
leverage
## [1] 0.015
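The same cutoff can also be computed from the model object itself, so it stays correct if the model or sample size changes:
# Number of predictors and sample size taken from the fitted model
p <- length(coef(m1)) - 1
n <- nobs(m1)
(3 * (p + 1)) / n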
To find the leverage values above our cutoff (influential values):
influential <- hats %>%
filter(hats > leverage)
influential
## hats
## 8 0.05430490
## 106 0.01704568
## 120 0.03227648
## 130 0.01742135
## 134 0.01868945
## 140 0.01620497
## 145 0.02414863
## 163 0.05592762
## 164 0.02274105
## 165 0.01599870
## 172 0.01599870
## 187 0.02180846
## 193 0.01574307
## 197 0.01995080
## 199 0.01589615
## 210 0.06020004
We can see that observation 8 has the highest leverage value.
If we want to remove all observations with leverage values above the cutoff from our `api` dataset:
# Keep only the schools whose leverage falls below the cutoff
new_api_no_lev <- api %>%
filter(hats$hats < leverage)
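Since 16 schools exceeded the cutoff, the filtered data should keep 384 of the original 400 rows:
# Quick check on how many rows remain after dropping high-leverage schools
nrow(new_api_no_lev)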
Influence
Combines distance and leverage to identify unusually influential observations (use Cook’s D).
Cook’s distance measure is defined as:
$$ \text{Cook's } D_i = \frac{e_{s_i}^2}{p + 1} \times \frac{h_i}{1-h_i} $$
where \(e_{s_i}\) is the studentized residual, \(p\) is the number of predictors, and \(h_i\) is the leverage for the \(i\)th observation.
To find Cook’s D in R:
First let’s create a scatterplot for our model (`enroll` by `api00`):
api %>%
ggplot(aes(x = enroll, y = api00)) +
geom_point() +
geom_smooth(method = "lm") +
theme_minimal()
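A variation on this plot highlights the school our leverage check flagged (row 8) so it is easier to spot (a sketch using dplyr’s slice() to pull out that row):
# Same scatterplot, with observation 8 drawn in red
api %>%
  ggplot(aes(x = enroll, y = api00)) +
  geom_point() +
  geom_point(data = slice(api, 8), colour = "red", size = 3) +
  geom_smooth(method = "lm") +
  theme_minimal()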
We can clearly see an outlier at the far right (a very high-enrollment school) that compresses the rest of the data points toward the left of the plot. Let’s calculate and plot Cook’s D using R (this is Karen’s code!):
#Calculating Cook's D
cooksd <- cooks.distance(m1)
#Plotting Values
plot(cooksd, pch="*",
cex=2,
main="Influential Obs by Cooks distance") # plot
abline(h = 4*mean(cooksd, na.rm=T),
col="red") # add cutoff line
text(x=1:length(cooksd)+1,
y=cooksd,
labels=ifelse(cooksd>4*mean(cooksd, na.rm=T),
names(cooksd),""),
col="red") # add labels
As we’ve noted multiple times, observation 8 has proven to be the outlier in this dataset.
Let’s remove the outlier (row 8) using `slice()` and compare scatterplots (using `gridExtra::grid.arrange()` to look at the plots side by side):
#Remove outlier
api_no_outlier <- api %>%
slice(-8)
# Scatterplot without the outlier
no_outlier <- api_no_outlier %>%
ggplot(aes(x = enroll, y = api00)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "No Outlier") +
theme_minimal()
#Scatterplot with outlier
outlier <- api %>%
ggplot(aes(x = enroll, y = api00)) +
geom_point() +
geom_smooth(method = "lm") +
labs(title = "With Outlier") +
theme_minimal()
grid.arrange(no_outlier, outlier, ncol = 2)
We can also rerun and compare regression models:
#Rerun regression
m2 <- lm(api00 ~ enroll, data = api_no_outlier)
summary(m2)
##
## Call:
## lm(formula = api00 ~ enroll, data = api_no_outlier)
##
## Residuals:
## Min 1Q Median 3Q Max
## -286.945 -111.413 -5.914 95.955 303.286
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 753.23324 16.05899 46.904 < 2e-16 ***
## enroll -0.22057 0.03036 -7.266 1.97e-12 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 133.7 on 397 degrees of freedom
## Multiple R-squared: 0.1174, Adjusted R-squared: 0.1152
## F-statistic: 52.8 on 1 and 397 DF, p-value: 1.975e-12
tab_model(m2, show.std = TRUE, show.stat = TRUE, show.zeroinf = TRUE)
Dependent variable: api00

Predictors | Estimates | std. Beta | CI | standardized CI | Statistic | p |
---|---|---|---|---|---|---|
(Intercept) | 753.23 | -0.00 | 721.66 – 784.80 | -0.09 – 0.09 | 46.90 | <0.001 |
enroll | -0.22 | -0.34 | -0.28 – -0.16 | -0.44 – -0.25 | -7.27 | <0.001 |
Observations | 399 | | | | | |
R² / R² adjusted | 0.117 / 0.115 | | | | | |
#Compare to model with outlier
summary(m1)
##
## Call:
## lm(formula = api00 ~ enroll, data = api)
##
## Residuals:
## Min 1Q Median 3Q Max
## -285.50 -112.55 -6.70 95.06 389.15
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 744.25141 15.93308 46.711 < 2e-16 ***
## enroll -0.19987 0.02985 -6.695 7.34e-11 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 135 on 398 degrees of freedom
## Multiple R-squared: 0.1012, Adjusted R-squared: 0.09898
## F-statistic: 44.83 on 1 and 398 DF, p-value: 7.339e-11
tab_model(m1, show.std = TRUE, show.stat = TRUE, show.zeroinf = TRUE)
Dependent variable: api00

Predictors | Estimates | std. Beta | CI | standardized CI | Statistic | p |
---|---|---|---|---|---|---|
(Intercept) | 744.25 | 0.00 | 712.93 – 775.57 | -0.09 – 0.09 | 46.71 | <0.001 |
enroll | -0.20 | -0.32 | -0.26 – -0.14 | -0.41 – -0.22 | -6.70 | <0.001 |
Observations | 400 | | | | | |
R² / R² adjusted | 0.101 / 0.099 | | | | | |
What are the differences in the output? In the coefficients? In R^2^?
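One convenient way to answer this is to pass both models to tab_model() at once, which prints them side by side (the dv.labels argument simply labels the two columns):
# Side-by-side regression table: with vs. without the outlier
tab_model(m1, m2, show.std = TRUE, show.stat = TRUE,
          dv.labels = c("With outlier", "Outlier removed"))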