Lab 1: Data Screening and Cleaning
Information on how to clean and screen your data and write up the results.
#Load packages here
library(haven) #for read_sav() function
library(psych) # describe()
library(ggpubr) # ggdensity() and ggqqplot()
library(apaTables) # apa.cor.table()
library(tidyverse)
Data Screening & Cleaning
The data sets used in this lab come from Chapter 4 of Tabachnick & Fidell (2012; I believe there is a newer edition from 2019 now). The chapter is provided in GauchoSpace and is a useful resource if you need more information on how to clean and screen your data and write up the results.
Here is a table of the variables and their descriptions:

| Variable | Description |
|----------|-------------|
| subno | Subject number |
| timedrs | Visits to health professionals |
| attdrug | Attitudes toward medication |
| atthouse | Attitudes toward housework |
| income | Ordinal income variable |
| emplmnt | Whether currently employed |
| mstatus | Whether currently married |
| race | Binary race variable |
The style of code and package we will be using for this class is called `tidyverse`. This shouldn't be anything too different from what you've seen in the previous two courses in this series. Most functions are within the `tidyverse` package; if not, I've indicated the packages used in the code chunk above.
As a reminder, if a function does not work and you receive an error like could not find function "random_function", or if you try to load a package and you receive an error like there is no package called `random_package`, then you will need to install the package using install.packages("random_package") in the console (the bottom-left window in RStudio). Once you have installed a package, you never need to install it again; however, you must always load your packages at the beginning of your R Markdown file using library(random_package), as shown in this document.
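For example, here is a minimal sketch of the install-once, load-every-session workflow, using the psych package this lab already relies on:

# Run once, in the console (not in your R Markdown document):
# install.packages("psych")
# Load at the top of every R Markdown document that uses it:
library(psych)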
Read in Data
Let's read in a data set called `screen1.sav`:
# Notice this is an SPSS dataset
screen1 <- read_sav("screen1.sav")
Usually, we want to work with .csv files, so let’s convert this data set from .sav to .csv and save it to our computers.
# write_csv saves a .csv version of your dataset to your working directory.
# Enter the name of the object that contains your data set (in this case, "screen1"), then enter the name you want to save your dataset as. We can call it the same thing: "screen1.csv"
write_csv(screen1, "screen1.csv")
Let’s read in the new .csv data set:
# If you get an error, you might need to wait a few seconds for the .csv to save to your folder before running this.
screen1 <- read_csv("screen1.csv")
Use `names()` to view the variable names in the data set:
names(screen1)
## [1] "subno" "timedrs" "atthouse"
Let's read in another data set, `screen2.sav`, and convert it to a .csv like we did before:
# Read in .sav
screen2 <- read_sav("screen2.sav")
# Convert and save a .csv version
# (note: write.csv, unlike write_csv, also writes row numbers as an unnamed
# first column, which read_csv will name `...1`)
write.csv(screen2, "screen2.csv")
# Read in the new .csv version
screen2 <- read_csv("screen2.csv")
# Look at variable names
names(screen2)
## [1] "...1" "income" "mstatus" "race"
Merge Data
What if we want to merge the two data sets? As with many things in R, there are multiple ways to do this. Probably the most efficient is `tidyverse::full_join()`, which joins two data sets on a "join" variable and returns all rows and columns from both. Because our two data sets use different names for the variable we are joining by (i.e., the ID or subject number), we use the `by` argument, shown below, to tell R that the names differ. As a reminder, use the pipe operator `%>%` to create a sequence of functions.
# Here we created a new dataset called `merged_screen`
merged_screen <- screen1 %>%
full_join(screen2, by = c("subno" = "...1"))
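For reference, here is a sketch of the same merge in base R; merge() with all = TRUE behaves like a full join (this object isn't used later in the lab):

# Base R equivalent of the full_join() above
merged_screen_base <- merge(screen1, screen2,
                            by.x = "subno", by.y = "...1",
                            all = TRUE)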
Examine Data
Let’s examine the data set:
# View the first 6 rows (head() returns 6 rows by default)
head(merged_screen)
## # A tibble: 6 × 6
## subno timedrs atthouse income mstatus race
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 1 1 27 5 2 1
## 2 2 3 20 6 2 1
## 3 3 0 23 3 2 1
## 4 4 13 28 8 2 1
## 5 5 15 24 1 2 1
## 6 6 3 25 4 2 1
Let's look at descriptive statistics for each variable. Because descriptives for the ID variable (`subno`) are unnecessary, we use `select()` to remove it with the minus (`-`) sign:
merged_screen %>%
select(-subno) %>%
summary()
## timedrs atthouse income mstatus
## Min. : 0.000 Min. : 2.00 Min. : 1.00 Min. :1.000
## 1st Qu.: 2.000 1st Qu.:21.00 1st Qu.: 2.50 1st Qu.:2.000
## Median : 4.000 Median :24.00 Median : 4.00 Median :2.000
## Mean : 7.901 Mean :23.54 Mean : 4.21 Mean :1.778
## 3rd Qu.:10.000 3rd Qu.:27.00 3rd Qu.: 6.00 3rd Qu.:2.000
## Max. :81.000 Max. :35.00 Max. :10.00 Max. :2.000
## NA's :128 NA's :128 NA's :154 NA's :128
## race
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.088
## 3rd Qu.:1.000
## Max. :2.000
## NA's :128
Alternatively, we can use the `psych::describe()` function, which gives more information:
merged_screen %>%
select(-subno) %>%
describe()
## vars n mean sd median trimmed mad min max range skew kurtosis
## timedrs 1 465 7.90 10.95 4 5.61 4.45 0 81 81 3.23 12.88
## atthouse 2 465 23.54 4.48 24 23.62 4.45 2 35 33 -0.45 1.52
## income 3 439 4.21 2.42 4 4.01 2.97 1 10 9 0.58 -0.38
## mstatus 4 465 1.78 0.42 2 1.85 0.00 1 2 1 -1.34 -0.21
## race 5 465 1.09 0.28 1 1.00 0.00 1 2 1 2.90 6.40
## se
## timedrs 0.51
## atthouse 0.21
## income 0.12
## mstatus 0.02
## race 0.01
What if we want to look at a subset of the data? For example, what if we want to see only the people who have been to the doctor more than 10 times? We can use `tidyverse::filter()` to subset the data using certain criteria.
merged_screen %>%
filter(timedrs > 10) %>%
describe()
## vars n mean sd median trimmed mad min max range skew
## subno 1 107 324.25 199.63 292 320.41 269.83 4 756 752 0.14
## timedrs 2 107 22.49 14.90 16 19.40 5.93 11 81 70 1.88
## atthouse 3 107 24.33 4.14 24 24.36 4.45 14 35 21 -0.02
## income 4 68 4.19 2.32 4 4.05 1.48 1 10 9 0.52
## mstatus 5 72 1.81 0.40 2 1.88 0.00 1 2 1 -1.51
## race 6 72 1.11 0.32 1 1.02 0.00 1 2 1 2.42
## kurtosis se
## subno -1.14 19.30
## timedrs 3.06 1.44
## atthouse -0.10 0.40
## income -0.42 0.28
## mstatus 0.29 0.05
## race 3.93 0.04
#You can use any operator to filter: >, <, ==, >=, etc.
Recode Missing Values
Let's check for missing values. First, how are missing values identified? They could be `-999`, `NA`, or literally anything else. The simplest way to check is to look back at the `summary()` output, where we can see that they are identified as `NA`: for example, `timedrs` has 128 `NA`'s. We also know the missing code isn't `-999`, because the min and max values look normal.
merged_screen %>%
select(-subno) %>%
summary()
## timedrs atthouse income mstatus
## Min. : 0.000 Min. : 2.00 Min. : 1.00 Min. :1.000
## 1st Qu.: 2.000 1st Qu.:21.00 1st Qu.: 2.50 1st Qu.:2.000
## Median : 4.000 Median :24.00 Median : 4.00 Median :2.000
## Mean : 7.901 Mean :23.54 Mean : 4.21 Mean :1.778
## 3rd Qu.:10.000 3rd Qu.:27.00 3rd Qu.: 6.00 3rd Qu.:2.000
## Max. :81.000 Max. :35.00 Max. :10.00 Max. :2.000
## NA's :128 NA's :128 NA's :154 NA's :128
## race
## Min. :1.000
## 1st Qu.:1.000
## Median :1.000
## Mean :1.088
## 3rd Qu.:1.000
## Max. :2.000
## NA's :128
Additionally, we can look at the unique values (using `unique()`) for each variable to see whether there are any unusual values or whether multiple missing-value codes were used. For example, for `timedrs` (visits to health professionals), the unique values don't look unusual, and the only missing code is `NA`.
unique(merged_screen$timedrs)
## [1] 1 3 0 13 15 2 7 4 5 12 6 14 60 20 8 9 10 23 39 16 33 38 22 11 34
## [26] 27 30 17 25 19 49 52 18 24 57 21 58 43 37 75 29 56 81 NA
Usually, we want our missing values to be identified as `NA`, since that's what R understands as missing by default. What if your data set doesn't code missing values as `NA`? What if the missing values are `-999`? Or vice versa? Here is how you would recode the values from `NA` to `-999` (though usually you would never do this, since you want missing values labeled `NA`), and then back again:
# Use tidyverse::mutate() to adjust an existing variable
# Then use tidyverse::replace_na() to replace NA with -999 for the timedrs variable
new_data_with_999 <- merged_screen %>%
  mutate(timedrs = replace_na(timedrs, -999))
# Recode back to NA using tidyverse::na_if()
new_data_with_NA <- new_data_with_999 %>%
  mutate(timedrs = na_if(timedrs, -999))
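If every variable uses the same missing code, here is a sketch of how you could recode all columns at once with dplyr's across() (this assumes -999 is the only missing code in the data set):

# Replace -999 with NA in every column
all_recoded <- merged_screen %>%
  mutate(across(everything(), ~ na_if(.x, -999)))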
Recode Continuous Variable into Factor
What if you want to recode a continuous variable into different levels (e.g., high, medium, and low)?
Let’s use the variable income
as an example.
First, let’s recall the descriptives:
merged_screen %>%
select(income) %>%
summary()
## income
## Min. : 1.00
## 1st Qu.: 2.50
## Median : 4.00
## Mean : 4.21
## 3rd Qu.: 6.00
## Max. :10.00
## NA's :154
Here, we can see that the values range from 1 to 10. Let's recode the variable to look like this:

| Level | Range |
|-------|-------|
| Low | 1 - 3 |
| Medium | 4 - 6 |
| High | 7 - 10 |
We use `cut()` to divide continuous variables into intervals:
income_factor <- merged_screen %>%
mutate(income_factor = cut(income,
breaks = c(0, 3, 6, 10), #Use 0 at the beginning to ensure that 1 is included in the first break, then break by last number in each section (3, 6, 10)
labels = c("low", "medium", "high")))
# View summary
income_factor %>%
select(income, income_factor) %>%
summary()
## income income_factor
## Min. : 1.00 low :189
## 1st Qu.: 2.50 medium:166
## Median : 4.00 high : 84
## Mean : 4.21 NA's :154
## 3rd Qu.: 6.00
## Max. :10.00
## NA's :154
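If the groupings are harder to express as numeric breaks, here is a sketch of the same recode using dplyr's case_when() instead of cut() (an alternative, not something this lab requires):

# Equivalent recode with case_when(); wrapping in factor() sets the level order
income_factor2 <- merged_screen %>%
  mutate(income_factor = factor(case_when(
    income >= 1 & income <= 3  ~ "low",
    income >= 4 & income <= 6  ~ "medium",
    income >= 7 & income <= 10 ~ "high"
  ), levels = c("low", "medium", "high")))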
Normality and Transformation
It's important to inspect our data to see whether they are normally distributed. Many analyses are sensitive to violations of normality, so to be confident our data are normal there are several things we can look at: density plots, histograms, QQ plots, the Shapiro-Wilk normality test, and descriptives such as skewness and kurtosis. Normally, we would inspect every variable, but for demonstration purposes, let's focus on the `timedrs` variable.
Density Plots
A density plot is a visualization of the data over a continuous interval. As we can see from this density plot, the variable `timedrs` is positively skewed: more people make few doctor visits than make many.
ggdensity(merged_screen$timedrs, fill = "lightgray")
Histogram
A histogram provides the same information as the density plot, but shows a count instead of a density on the y-axis.
gghistogram(merged_screen$timedrs, fill = "lightgray")
QQ Plots
A QQ plot, or quantile-quantile plot, plots the quantiles of a sample against the quantiles of the normal distribution. In a QQ plot, each observation is plotted as a single dot. If the data are normal, the dots should form a straight line. If the data are skewed, you will see either a downward curve (negatively skewed) or an upward curve (positively skewed).
ggqqplot(merged_screen$timedrs)
As you can see in this QQ plot, there is an upward curve, which further tells us that we have a positively skewed variable.
Shapiro-Wilk’s Test of Normality
Sometimes it's difficult to tell whether a variable is normal based on the plots alone (not in this case, however). The Shapiro-Wilk normality test is a significance test that compares the sample distribution to a normal distribution. If the test is not significant, the distribution is normal; that is, there is no significant difference between the sample and a normal distribution. If the test is significant, the distribution is not normal; that is, there is a significant difference between the sample and a normal distribution. We can use `rstatix::shapiro_test()` to test this.
library(rstatix) # for shapiro_test()
merged_screen %>%
  shapiro_test(timedrs)
To no one's surprise, the Shapiro-Wilk test was significant for the `timedrs` variable.
Skewness and Kurtosis
One final thing to look at is the skewness and kurtosis values in the descriptive statistics provided earlier. Different sources give different cut-offs, but as a general rule of thumb, skewness and kurtosis values beyond +3/-3 indicate a non-normal distribution. Positive skew values indicate positively skewed variables and negative skew values indicate negatively skewed variables. Positive kurtosis values indicate leptokurtic distributions, with higher peaks and taller tails than a normal distribution. Negative kurtosis values indicate platykurtic distributions, with flat peaks and thin tails.
describe(merged_screen$timedrs)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 465 7.9 10.95 4 5.61 4.45 0 81 81 3.23 12.88 0.51
Here we can see that the skew value is positive and greater than 3, indicating a positively skewed distribution. The kurtosis value is also positive and well above 3, consistent with the high peak and heavy tail we saw in the density plot.
Transformation
So, you've violated normality. Now what? Well, we need to consider the analysis we're about to run. Is it robust to minor violations of normality (such as regression or ANOVA)? If we decide that the violation of normality is a big deal, we can use a mathematical function to transform each point in the data set.

Most of the time, we don't like to transform a variable unless it's absolutely necessary. I've found that researchers in other departments (other than social science) transform data all the time, but since we're working with responses from real people, it's a bit trickier. There are many different ways to do this; the one we will use is the logarithmic transformation.
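For reference, here is a sketch of other transformations commonly recommended for positive skew (stronger transformations for more severe skew; these are alternatives we won't use here):

# Square root: milder than log; defined at zero, so no constant is needed
sqrt_timedrs <- sqrt(merged_screen$timedrs)
# Reciprocal: stronger than log; add 1 to avoid dividing by zero
recip_timedrs <- 1 / (merged_screen$timedrs + 1)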
Log-transform `timedrs` using `log()`:
log_data <- merged_screen %>% # named log_data (not log) to avoid clashing with base R's log()
  drop_na() %>% # dropping NA values
  mutate(log_timedrs = log(timedrs + 1)) # Adding a constant (1) because there were zeros; log(0) is undefined. You do not need to add 1 if there are no zeros.
#Before transformation
ggdensity(merged_screen$timedrs, fill = "lightgray")
#After transformation
ggdensity(log_data$log_timedrs, fill = "lightgray")
#Let's also look at changes in descriptives
describe(log_data$log_timedrs)
## vars n mean sd median trimmed mad min max range skew kurtosis se
## X1 1 321 1.66 0.97 1.61 1.65 0.76 0 4.33 4.33 0.26 -0.21 0.05
Here, we can see a big improvement in the distribution compared to the previous, untransformed variable.
Regression Overview
Let's start this section with applied examples.

Simple linear regression: predict physics scores from math SAT scores.

$$\hat{Y} = \beta_0 + \beta_1 x$$

$$physics = \beta_0 + \beta_1(msat)$$

Multiple regression: predict physics scores from emotional intelligence, IQ, verbal SAT, math SAT, creativity, and abstract thinking.

$$\hat{Y} = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \dots + \beta_i x_i$$

$$physics = \beta_0 + \beta_1(ei) + \beta_2(iq) + \beta_3(vsat) + \beta_4(msat) + \beta_5(creativity) + \beta_6(abstract)$$
Read in data:
physics <- read_sav("physics.sav")
write.csv(physics, "physics.csv")
physics <- read_csv("physics.csv") %>%
select(-...1, -idno)
# Descriptive Statistics
describe(physics)
## vars n mean sd median trimmed mad min max range skew
## ei 1 199 107.59 12.17 108.0 107.82 10.38 75.0 139 64.0 -0.15
## iq 2 199 99.99 14.59 100.0 99.89 14.83 57.0 134 77.0 0.01
## vsat 3 199 534.46 105.37 526.0 532.13 108.23 314.0 801 487.0 0.21
## msat 4 199 493.38 42.98 487.0 490.93 43.00 364.0 601 237.0 0.40
## create 5 200 9.57 3.78 9.0 9.53 4.45 1.0 19 18.0 0.14
## abstract 6 199 9.46 2.98 10.0 9.64 2.97 1.0 15 14.0 -0.55
## physics 7 195 79.57 17.58 79.5 79.34 16.90 29.5 119 89.5 0.04
## kurtosis se
## ei -0.20 0.86
## iq -0.23 1.03
## vsat -0.51 7.47
## msat -0.25 3.05
## create -0.57 0.27
## abstract 0.15 0.21
## physics -0.10 1.26
Assumptions of Regression
- Linearity - The relationship between the outcome variable and each predictor is linear.
- Normality - Are the data normally distributed?
- Homoscedasticity - Residuals are assumed to have a constant variance (this can be checked after running the model).
- Multicollinearity - Bivariate correlations between predictors are less than .90.
Linearity & Multicollinearity
The easiest way to check linearity of the relationship between the outcome variable and each predictor is to look at the bivariate correlations and scatterplots. For the most part, the variables are (positively) correlated with each other, implying linear relationships. To assess multicollinearity, we want to see whether any predictors are too highly correlated with each other (r > .90).
Use `cor()` to look at bivariate correlations for the entire data set (note: missing values are not allowed in correlation analyses; use `drop_na()` to do listwise deletion):
physics %>%
drop_na() %>% #remove missing data
cor(method = "pearson")
## ei iq vsat msat create abstract
## ei 1.00000000 0.03400695 0.53927222 0.04823131 -0.02356759 0.02311522
## iq 0.03400695 1.00000000 0.64810303 0.70635708 0.64281265 0.57259750
## vsat 0.53927222 0.64810303 1.00000000 0.50502038 0.45670255 0.35357159
## msat 0.04823131 0.70635708 0.50502038 1.00000000 0.57382957 0.50705131
## create -0.02356759 0.64281265 0.45670255 0.57382957 1.00000000 0.42365795
## abstract 0.02311522 0.57259750 0.35357159 0.50705131 0.42365795 1.00000000
## physics -0.38171959 0.35304329 -0.09995838 0.69101577 0.35328045 0.25249211
## physics
## ei -0.38171959
## iq 0.35304329
## vsat -0.09995838
## msat 0.69101577
## create 0.35328045
## abstract 0.25249211
## physics 1.00000000
#Fun tip: `apa.cor.table()` creates an APA-formatted correlation matrix and saves it to your computer
#apa.cor.table(physics, filename = "cor_table.doc")
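As an alternative to listwise deletion, cor() can handle missing data itself through its use argument. Here is a sketch using pairwise deletion (note that each correlation may then be based on a different n):

# Pairwise deletion: each correlation uses all complete pairs for that variable pair
physics %>%
  cor(method = "pearson", use = "pairwise.complete.obs")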
You can also subset the data set to include only the variables of interest:
physics %>%
select(ei, iq, physics) %>%
drop_na() %>% #remove missing data
cor()
## ei iq physics
## ei 1.0000000 0.0360554 -0.3941978
## iq 0.0360554 1.0000000 0.3440316
## physics -0.3941978 0.3440316 1.0000000
[Correlation Interpretation:]{.ul}
Among the students in the sample, IQ and physics test scores were positively correlated, r = 0.34, p < .001.
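Note that cor() reports only the correlations themselves; the p-value in the write-up comes from a significance test. Here is a sketch using base R's cor.test() for the iq-physics pair:

# Tests whether the iq-physics correlation differs from zero; reports r, t, df, and p
cor.test(physics$iq, physics$physics)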
We can also look at partial correlations, using `ppcor::pcor()`, to see how each predictor relates to the outcome after controlling for all the other variables in both the predictor and the outcome; and at semi-partial correlations, using `ppcor::spcor()`, to see how each predictor relates to the outcome after controlling for the other variables in the predictor (X) but not in the outcome (Y).
library(ppcor)
# Partial Correlation
physics %>%
drop_na() %>%
pcor()
## $estimate
## ei iq vsat msat create abstract
## ei 1.00000000 -0.32568951 0.4650756 0.16441386 -0.12115000 0.01488448
## iq -0.32568951 1.00000000 0.4746335 0.17078572 0.23857783 0.29849210
## vsat 0.46507555 0.47463350 1.0000000 0.44653013 0.16077919 -0.13045988
## msat 0.16441386 0.17078572 0.4465301 1.00000000 0.05092278 0.21300341
## create -0.12115000 0.23857783 0.1607792 0.05092278 1.00000000 0.07149435
## abstract 0.01488448 0.29849210 -0.1304599 0.21300341 0.07149435 1.00000000
## physics -0.21342029 0.04833899 -0.5147686 0.80097974 0.08561041 -0.14365807
## physics
## ei -0.21342029
## iq 0.04833899
## vsat -0.51476858
## msat 0.80097974
## create 0.08561041
## abstract -0.14365807
## physics 1.00000000
##
## $p.value
## ei iq vsat msat create
## ei 0.000000e+00 4.805721e-06 1.560295e-11 2.377604e-02 0.0967917262
## iq 4.805721e-06 0.000000e+00 5.208355e-12 1.879216e-02 0.0009463296
## vsat 1.560295e-11 5.208355e-12 0.000000e+00 1.193188e-10 0.0270993829
## msat 2.377604e-02 1.879216e-02 1.193188e-10 0.000000e+00 0.4865040527
## create 9.679173e-02 9.463296e-04 2.709938e-02 4.865041e-01 0.0000000000
## abstract 8.389149e-01 3.021094e-05 7.356918e-02 3.253293e-03 0.3282642697
## physics 3.191877e-03 5.089145e-01 3.517794e-14 1.575852e-43 0.2414812417
## abstract physics
## ei 8.389149e-01 3.191877e-03
## iq 3.021094e-05 5.089145e-01
## vsat 7.356918e-02 3.517794e-14
## msat 3.253293e-03 1.575852e-43
## create 3.282643e-01 2.414812e-01
## abstract 0.000000e+00 4.859688e-02
## physics 4.859688e-02 0.000000e+00
##
## $statistic
## ei iq vsat msat create abstract
## ei 0.0000000 -4.7105727 7.184029 2.2793444 -1.6689948 0.2035648
## iq -4.7105727 0.0000000 7.374048 2.3702833 3.3595141 4.2767879
## vsat 7.1840291 7.3740478 0.000000 6.8243429 2.2276026 -1.7993904
## msat 2.2793444 2.3702833 6.824343 0.0000000 0.6972632 2.9811919
## create -1.6689948 3.3595141 2.227603 0.6972632 0.0000000 0.9801788
## abstract 0.2035648 4.2767879 -1.799390 2.9811919 0.9801788 0.0000000
## physics -2.9873047 0.6617994 -8.210795 18.2952888 1.1750187 -1.9850851
## physics
## ei -2.9873047
## iq 0.6617994
## vsat -8.2107955
## msat 18.2952888
## create 1.1750187
## abstract -1.9850851
## physics 0.0000000
##
## $n
## [1] 194
##
## $gp
## [1] 5
##
## $method
## [1] "pearson"
# Semi-Partial Correlation
physics %>%
drop_na() %>%
spcor()
## $estimate
## ei iq vsat msat create abstract
## ei 1.00000000 -0.24386511 0.3719153 0.11800106 -0.08640342 0.01053850
## iq -0.17991579 1.00000000 0.2816447 0.09053069 0.12831341 0.16334780
## vsat 0.24787217 0.25442843 1.0000000 0.23546184 0.07685947 -0.06208477
## msat 0.06775628 0.07045955 0.2028619 1.00000000 0.02072700 0.08861955
## create -0.08966473 0.18048584 0.1196753 0.03745962 1.00000000 0.05265892
## abstract 0.01184521 0.24886148 -0.1047045 0.17347221 0.05703550 1.00000000
## physics -0.10442322 0.02313364 -0.2870138 0.63952396 0.04107356 -0.06938997
## physics
## ei -0.15465198
## iq 0.02527679
## vsat -0.28329892
## msat 0.54384971
## create 0.06312646
## abstract -0.11550987
## physics 1.00000000
##
## $p.value
## ei iq vsat msat create
## ei 0.0000000000 0.0007207838 1.369299e-07 1.058482e-01 0.23713321
## iq 0.0132390483 0.0000000000 8.634775e-05 2.153898e-01 0.07847628
## vsat 0.0005840884 0.0004109848 0.000000e+00 1.107964e-03 0.29317174
## msat 0.3542482772 0.3353314498 5.117270e-03 0.000000e+00 0.77710849
## create 0.2198294618 0.0129459172 1.009535e-01 6.088274e-01 0.00000000
## abstract 0.8714869384 0.0005542444 1.516168e-01 1.697934e-02 0.43566074
## physics 0.1527254403 0.7520272679 6.223148e-05 3.984530e-23 0.57468919
## abstract physics
## ei 0.88556108 3.360256e-02
## iq 0.02471236 7.299077e-01
## vsat 0.39605865 7.811472e-05
## msat 0.22527418 6.106787e-16
## create 0.47174829 3.881654e-01
## abstract 0.00000000 1.134748e-01
## physics 0.34273760 0.000000e+00
##
## $statistic
## ei iq vsat msat create abstract
## ei 0.0000000 -3.4386200 5.478884 1.6249933 -1.1859843 0.1441198
## iq -2.5011248 0.0000000 4.013921 1.2430931 1.7692850 2.2641585
## vsat 3.4987885 3.5976493 0.000000 3.3130430 1.0541557 -0.8506374
## msat 0.9286874 0.9659206 2.833000 0.0000000 0.2834983 1.2166410
## create -1.2311057 2.5093158 1.648381 0.5126124 0.0000000 0.7211004
## abstract 0.1619921 3.5136730 -1.439726 2.4087159 0.7812205 0.0000000
## physics -1.4358157 0.3164324 -4.097240 11.3757630 0.5621468 -0.9511863
## physics
## ei -2.1405874
## iq 0.3457654
## vsat -4.0395481
## msat 8.8622340
## create 0.8649665
## abstract -1.5902181
## physics 0.0000000
##
## $n
## [1] 194
##
## $gp
## [1] 5
##
## $method
## [1] "pearson"
detach("package:ppcor", unload = TRUE)
Finally, use `pairs()` to look at bivariate scatterplots. Do the relationships look linear?
pairs(physics)
# Or we can look at individual scatterplots:
physics %>%
ggplot(aes(physics, ei)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
theme_minimal()
Normality
Density Plot:
lapply(physics, ggdensity, fill = "lightgray")
## $ei
##
## $iq
##
## $vsat
##
## $msat
##
## $create
##
## $abstract
##
## $physics
# alternatively, you may look at each one individually like this:
# ggdensity(physics$ei, fill = "lightgray")
QQ Plots for individual predictors (Later, we will want to look at the QQ plot of the entire model, which looks at the normality of residuals):
lapply(physics, ggqqplot, fill = "lightgray")
## $ei
##
## $iq
##
## $vsat
##
## $msat
##
## $create
##
## $abstract
##
## $physics
Shapiro-Wilk’s Test of Normality:
physics %>%
  shapiro_test(ei, iq, vsat, msat, create, abstract, physics)
# Here, abstract and msat are significant. However, the Shapiro-Wilk test is very sensitive to minor deviations from normality, and the density plots suggest approximate normality.
Skewness and Kurtosis:
describe(physics)
## vars n mean sd median trimmed mad min max range skew
## ei 1 199 107.59 12.17 108.0 107.82 10.38 75.0 139 64.0 -0.15
## iq 2 199 99.99 14.59 100.0 99.89 14.83 57.0 134 77.0 0.01
## vsat 3 199 534.46 105.37 526.0 532.13 108.23 314.0 801 487.0 0.21
## msat 4 199 493.38 42.98 487.0 490.93 43.00 364.0 601 237.0 0.40
## create 5 200 9.57 3.78 9.0 9.53 4.45 1.0 19 18.0 0.14
## abstract 6 199 9.46 2.98 10.0 9.64 2.97 1.0 15 14.0 -0.55
## physics 7 195 79.57 17.58 79.5 79.34 16.90 29.5 119 89.5 0.04
## kurtosis se
## ei -0.20 0.86
## iq -0.23 1.03
## vsat -0.51 7.47
## msat -0.25 3.05
## create -0.57 0.27
## abstract 0.15 0.21
## physics -0.10 1.26
Simple Linear Regression
Finally, we have arrived at our main goal. It's good to reflect on how much work it takes to prepare data for analysis: it's important to clean and screen the data before running any models.
Recall our research hypothesis:
Research Example: Predict physics scores from math SAT scores.
Run a simple linear regression model using `lm()`:
#`msat` predicting `physics`
slr_model <- lm(physics ~ msat, data = physics)
summary(slr_model)
##
## Call:
## lm(formula = physics ~ msat, data = physics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -36.735 -7.484 -0.243 8.980 30.098
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -58.54334 10.46659 -5.593 7.54e-08 ***
## msat 0.27977 0.02112 13.246 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 12.76 on 193 degrees of freedom
## (5 observations deleted due to missingness)
## Multiple R-squared: 0.4762, Adjusted R-squared: 0.4735
## F-statistic: 175.5 on 1 and 193 DF, p-value: < 2.2e-16
[Simple Linear Regression Interpretation:]{.ul}
For every one-unit increase in `msat` scores, `physics` scores increase by 0.28 points, p < .001.
$$ physics = -58.54 + 0.28 (msat) $$
To check the homoscedasticity assumption, we use the `plot()` function. To assess constant variance, we want to see a straight, horizontal red line across this plot.
# 3 selects the Scale-Location plot. It shows whether residuals are spread equally across the range of predictors. A roughly horizontal line means constant variance.
plot(slr_model, 3)
To check the normality-of-residuals assumption (whereas before, we were looking at the normality of each predictor), we again use the `plot()` function:
# 2 selects the QQ plot of the model's residuals. We want to see the points fall on the diagonal.
plot(slr_model, 2)
Multiple Regression
Recall our research hypothesis:
Research Example: Predict physics scores from: emotional intelligence, IQ, verbal SAT, math SAT, creativity, and abstract thinking.
Run a multiple regression model using lm()
:
# using the period (.) tells R you want ALL the variables except the outcome (`physics`)
# alternatively, you can type out all the variables by doing this:
# lm(physics~ei+iq+vsat+msat+create+abstract, data=physics)
mr_model <- lm(physics ~ ., data = physics)
summary(mr_model)
##
## Call:
## lm(formula = physics ~ ., data = physics)
##
## Residuals:
## Min 1Q Median 3Q Max
## -18.4419 -6.4058 -0.6015 6.2519 21.1205
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -40.80967 10.06104 -4.056 7.32e-05 ***
## ei -0.20493 0.06860 -2.987 0.00319 **
## iq 0.05214 0.07878 0.662 0.50891
## vsat -0.08664 0.01055 -8.211 3.52e-14 ***
## msat 0.37663 0.02059 18.295 < 2e-16 ***
## create 0.25462 0.21669 1.175 0.24148
## abstract -0.50442 0.25411 -1.985 0.04860 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 8.425 on 187 degrees of freedom
## (6 observations deleted due to missingness)
## Multiple R-squared: 0.7715, Adjusted R-squared: 0.7642
## F-statistic: 105.2 on 6 and 187 DF, p-value: < 2.2e-16
[Multiple Regression Interpretation:]{.ul}
Controlling for all other variables, for every one-unit increase in `ei` scores, `physics` scores decrease by 0.20 points, p = .003.
Rinse and repeat for the remaining predictors. For non-significant predictors, "IQ was not a significant predictor in the model" is sufficient.
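Beyond the bivariate correlations checked earlier, a common follow-up check for multicollinearity is the variance inflation factor (VIF). Here is a sketch using car::vif() (the car package is an assumption here; it isn't loaded at the top of this document). VIF values near 1 are ideal, and values above roughly 10 are usually taken as problematic:

# install.packages("car") # if needed
library(car)
vif(mr_model) # one VIF per predictor in the multiple regression model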
Homoscedasticity
Scale-Location plot:
# 3 selects the Scale-Location plot. It shows whether residuals are spread equally across the range of predictors. A roughly horizontal line means constant variance.
plot(mr_model, 3)
Normality of Residuals
QQ plot:
# 2 selects the QQ plot of the model's residuals. We want to see the points fall on the diagonal.
plot(mr_model, 2)