Portfolio Data Wrangling in R
Data Wrangling in R
A series of functions that could be useful when preparing your dataset for analysis.
library(stringr) # str_to_lower()
library(janitor) # clean_names()
library(palmerpenguins) # penguins
library(sjmisc) # move_columns
library(tidyverse)
Data Wrangling Tips in R
Let’s use a dataset called penguins
in the palmerpenguins
package (Reference:
Horst AM, Hill AP, & Gorman KB, 2020)
Read in and view data
data("penguins")
head(penguins_raw) # reads first 6 rows
## # A tibble: 6 × 17
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin … Anvers Torge… Adul… N1A1
## 2 PAL0708 2 Adelie Penguin … Anvers Torge… Adul… N1A2
## 3 PAL0708 3 Adelie Penguin … Anvers Torge… Adul… N2A1
## 4 PAL0708 4 Adelie Penguin … Anvers Torge… Adul… N2A2
## 5 PAL0708 5 Adelie Penguin … Anvers Torge… Adul… N3A1
## 6 PAL0708 6 Adelie Penguin … Anvers Torge… Adul… N3A2
## # … with 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## # `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
names(penguins_raw) # lists variable names
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Date Egg"
## [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
## [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
## [16] "Delta 13 C (o/oo)" "Comments"
summary(penguins_raw) # provides some descriptive statistics for each variable
## studyName Sample Number Species Region
## Length:344 Min. : 1.00 Length:344 Length:344
## Class :character 1st Qu.: 29.00 Class :character Class :character
## Mode :character Median : 58.00 Mode :character Mode :character
## Mean : 63.15
## 3rd Qu.: 95.25
## Max. :152.00
##
## Island Stage Individual ID Clutch Completion
## Length:344 Length:344 Length:344 Length:344
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
## Min. :2007-11-09 Min. :32.10 Min. :13.10 Min. :172.0
## 1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
## Median :2008-11-09 Median :44.45 Median :17.30 Median :197.0
## Mean :2008-11-27 Mean :43.92 Mean :17.15 Mean :200.9
## 3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
## Max. :2009-12-01 Max. :59.60 Max. :21.50 Max. :231.0
## NA's :2 NA's :2 NA's :2
## Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
## Min. :2700 Length:344 Min. : 7.632 Min. :-27.02
## 1st Qu.:3550 Class :character 1st Qu.: 8.300 1st Qu.:-26.32
## Median :4050 Mode :character Median : 8.652 Median :-25.83
## Mean :4202 Mean : 8.733 Mean :-25.69
## 3rd Qu.:4750 3rd Qu.: 9.172 3rd Qu.:-25.06
## Max. :6300 Max. :10.025 Max. :-23.79
## NA's :2 NA's :14 NA's :13
## Comments
## Length:344
## Class :character
## Mode :character
##
##
##
##
How to read in data and make note of the missing values:
data <- read_csv("data.csv", na = c("-999", "NA", "000")) #note: the `na` argument only works for "read_csv" and not "read_sav."
Clean names
We see that the variable names are not “clean.” As in, they have spaces and capital letters.
Usually, we want to use a format called “snake case”, where we have lower case letters and spaces are replaced with underscores.
We can easily do this with the clean_names()
function in the janitor
package:
penguins_new <- penguins_raw %>%
clean_names()
#Compare names
names(penguins_new)
## [1] "study_name" "sample_number" "species"
## [4] "region" "island" "stage"
## [7] "individual_id" "clutch_completion" "date_egg"
## [10] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm"
## [13] "body_mass_g" "sex" "delta_15_n_o_oo"
## [16] "delta_13_c_o_oo" "comments"
names(penguins_raw)
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Date Egg"
## [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
## [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
## [16] "Delta 13 C (o/oo)" "Comments"
Move variable names
We can use the move_columns()
function to rearrange our variable names. Lets say we want to move clutch_completion
to the left of body_mass_g
:
penguins_move <- penguins_new %>%
move_columns(clutch_completion, .before = body_mass_g)
names(penguins_move)
## [1] "study_name" "sample_number" "species"
## [4] "region" "island" "stage"
## [7] "individual_id" "date_egg" "culmen_length_mm"
## [10] "culmen_depth_mm" "flipper_length_mm" "clutch_completion"
## [13] "body_mass_g" "sex" "delta_15_n_o_oo"
## [16] "delta_13_c_o_oo" "comments"
Recoding
Examine class
Let’s examine the clutch_completion
variable.
(a character string denoting if the study nest observed with a full clutch, i.e., 2 eggs).
Let’s see what the labels look like using unique()
:
unique(penguins_new$clutch_completion)
## [1] "Yes" "No"
Right now, the levels are Yes
and No
.
Let’s also check the variable type using class()
:
class(penguins_new$clutch_completion)
## [1] "character"
Let’s make this a factor.
Create factor
Since this variable is not a character (but a factor), we need to convert it.
We can do that using factor()
:
penguins_factor <- penguins_new %>%
mutate(clutch_factor = factor(clutch_completion))
Let’s check the variable again:
class(penguins_factor$clutch_factor)
## [1] "factor"
unique(penguins_factor$clutch_factor)
## [1] Yes No
## Levels: No Yes
Recoding a factor
Let’s say we want to recode our labels from Yes
and No
to 1
and 0
:
penguins_recoded <- penguins_factor %>%
mutate(clutch_recoded = recode(clutch_factor,"Yes" = "1", "No" = "0"))
class(penguins_recoded$clutch_recoded)
## [1] "factor"
unique(penguins_recoded$clutch_recoded)
## [1] 1 0
## Levels: 0 1
Recoding a character into a factor
Let’s look at another variable in the penguins
dataset: Sex
of the penguin:
class(penguins_new$sex)
## [1] "character"
unique(penguins_new$sex)
## [1] "MALE" "FEMALE" NA
Let’s say we didn’t do the process above, and it was still in character
format.
We can combine those two steps by recoding character variables into factors (from MALE
and FEMALE
to 0
and 1
) using recode_factor():
penguins_sex_recoded <- penguins_new %>%
mutate(sex_recoded = recode_factor(sex,"MALE" = "0", "FEMALE" = "1"))
Lets check the variable again:
class(penguins_sex_recoded$sex_recoded)
## [1] "factor"
unique(penguins_sex_recoded$sex_recoded)
## [1] 0 1 <NA>
## Levels: 0 1
Side Note. I also wanted to make a quick note on what happens when you select the variable of interest first vs. not:
penguins_recoded_select <- penguins_new %>%
select(sex) %>%
mutate(sex_recoded = recode_factor(sex,"MALE" = "0", "FEMALE" = "1"))
head(penguins_recoded_select)
## # A tibble: 6 × 2
## sex sex_recoded
## <chr> <fct>
## 1 MALE 0
## 2 FEMALE 1
## 3 FEMALE 1
## 4 <NA> <NA>
## 5 FEMALE 1
## 6 MALE 0
head(penguins_sex_recoded)
## # A tibble: 6 × 18
## study_name sample_number species region island stage individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin (Py… Anvers Torge… Adul… N1A1
## 2 PAL0708 2 Adelie Penguin (Py… Anvers Torge… Adul… N1A2
## 3 PAL0708 3 Adelie Penguin (Py… Anvers Torge… Adul… N2A1
## 4 PAL0708 4 Adelie Penguin (Py… Anvers Torge… Adul… N2A2
## 5 PAL0708 5 Adelie Penguin (Py… Anvers Torge… Adul… N3A1
## 6 PAL0708 6 Adelie Penguin (Py… Anvers Torge… Adul… N3A2
## # … with 11 more variables: clutch_completion <chr>, date_egg <date>,
## # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## # comments <chr>, sex_recoded <fct>
Recoding multiple variables
The dataset doesn’t have any two variables that need to be recoded the same, but let’s recode both sex
and clutch_completion
at the same time. Note: this code should be used if you have two variables that need to have the same recoding, but it looks like this works too.
penguins_multiple <- penguins_new %>%
mutate_at(c("sex", "clutch_completion"),
~ recode_factor(., "MALE" = "0", "FEMALE" = "1",
"Yes" = "1", "No" = "0"))
penguins_multiple %>%
select(sex, clutch_completion) %>%
head()
## # A tibble: 6 × 2
## sex clutch_completion
## <fct> <fct>
## 1 0 1
## 2 1 1
## 3 1 1
## 4 <NA> 1
## 5 1 1
## 6 0 1
Creating reference variables
Recall slide 33 in the Multiple Regression lecture:
Remember that when we enter a categorical variable into the model using lm()
, R creates the reference variable for us, and picks the first category as the reference category.
For example, let’s look at the species
variable:
class(penguins_new$species)
## [1] "character"
unique(penguins_new$species)
## [1] "Adelie Penguin (Pygoscelis adeliae)"
## [2] "Gentoo penguin (Pygoscelis papua)"
## [3] "Chinstrap penguin (Pygoscelis antarctica)"
Let’s clean this variable.
First, let’s make the labels lowercase using str_to_lower()
:
penguin_species <- penguins_new %>%
mutate(species = str_to_lower(species))
unique(penguin_species$species)
## [1] "adelie penguin (pygoscelis adeliae)"
## [2] "gentoo penguin (pygoscelis papua)"
## [3] "chinstrap penguin (pygoscelis antarctica)"
Now, let’s just keep the simple names of the penguins: adelie
, gentoo
, and chinstrap
using the separate()
function:
penguins_species_clean <- penguin_species %>%
separate(species, into = c("species", "delete"), sep = " penguin") %>%
select(-delete)
unique(penguins_species_clean$species)
## [1] "adelie" "gentoo" "chinstrap"
class(penguins_species_clean$species)
## [1] "character"
Let’s also convert this new variable into a factor:
penguins_species_clean$species <- factor(penguins_species_clean$species)
unique(penguins_species_clean$species)
## [1] adelie gentoo chinstrap
## Levels: adelie chinstrap gentoo
class(penguins_species_clean$species)
## [1] "factor"
Let’s quickly run a simple linear regression using the species
variable without creating the reference variables.
Let’s predict flipper_length_mm
from species
:
model_1 <- lm(flipper_length_mm ~ species, data = penguins_species_clean)
summary(model_1)
##
## Call:
## lm(formula = flipper_length_mm ~ species, data = penguins_species_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9536 -4.8235 0.0464 4.8130 20.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.9536 0.5405 351.454 < 2e-16 ***
## specieschinstrap 5.8699 0.9699 6.052 3.79e-09 ***
## speciesgentoo 27.2333 0.8067 33.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7782, Adjusted R-squared: 0.7769
## F-statistic: 594.8 on 2 and 339 DF, p-value: < 2.2e-16
Here, it automatically created the reference variables and made adelie
penguins the reference category.
What if we want to compare our results to gentoo
penguins?
We can do that by converting the chinstrap
and adelie
into their own variables using ifelse()
chinstrap <- ifelse(penguins_species_clean$species == 'chinstrap', 1, 0)
adelie <- ifelse(penguins_species_clean$species == 'adelie', 1, 0)
Then, we add them to our dataset using cbind()
:
penguins_reference <- penguins_species_clean %>%
cbind(chinstrap, adelie)
Let’s look at the model again with gentoo
as the reference variable.
Instead of species
, we use chinstrap
and adelie
:
model_2 <- lm(flipper_length_mm ~ chinstrap + adelie, data = penguins_reference)
summary(model_2)
##
## Call:
## lm(formula = flipper_length_mm ~ chinstrap + adelie, data = penguins_reference)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9536 -4.8235 0.0464 4.8130 20.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 217.1870 0.5988 362.68 <2e-16 ***
## chinstrap -21.3635 1.0036 -21.29 <2e-16 ***
## adelie -27.2333 0.8067 -33.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7782, Adjusted R-squared: 0.7769
## F-statistic: 594.8 on 2 and 339 DF, p-value: < 2.2e-16
Changing the reference category does NOT change our model. It only allows us to make a more meaningful comparison.
Collapsing variables
Continous to categorical
In lab 1, I go over how to collapse continuous variables into categorical. Let’s categorize body_mass_g
into 1
, 2
, 3
, and 4
levels.
summary(penguins_new$body_mass_g)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2700 3550 4050 4202 4750 6300 2
cont_cat <- penguins_new %>%
mutate(body_mass_factor = cut(body_mass_g,
breaks = c(2700, 3550, 4202, 4750, 6300),
labels = c("low", "medium_low", "medium_high", "high")))
Categorical to categorical
Let’s collapse our body_mass_g
even further. This code is useful if you have a Likert-type scale with four categories, such as Strong Agree
to Strongly Disagree
and want to collapse to a category with those who responded Strongly Agree
and Agree
as 1
, and Disagree
and Strongly Disagree
as 2
. We can use fct_collapse
to collapse factor levels into groups.
cat_cat <- cont_cat %>%
mutate(body_mass_collpase =
fct_collapse(body_mass_factor,
low = c("low", "medium_low"),
high = c("medium_high", "high")))