Portfolio Data Wrangling in R
Data Wrangling in R
A series of functions that could be useful when preparing your dataset for analysis.
library(stringr) # str_to_lower()
library(janitor) # clean_names()
library(palmerpenguins) # penguins
library(sjmisc) # move_columns
library(tidyverse)
Data Wrangling Tips in R
Let’s use a dataset called penguins in the palmerpenguins package (Reference:
Horst AM, Hill AP, & Gorman KB, 2020)
Read in and view data
data("penguins")
head(penguins_raw) # reads first 6 rows
## # A tibble: 6 × 17
## studyName `Sample Number` Species Region Island Stage `Individual ID`
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin … Anvers Torge… Adul… N1A1
## 2 PAL0708 2 Adelie Penguin … Anvers Torge… Adul… N1A2
## 3 PAL0708 3 Adelie Penguin … Anvers Torge… Adul… N2A1
## 4 PAL0708 4 Adelie Penguin … Anvers Torge… Adul… N2A2
## 5 PAL0708 5 Adelie Penguin … Anvers Torge… Adul… N3A1
## 6 PAL0708 6 Adelie Penguin … Anvers Torge… Adul… N3A2
## # … with 10 more variables: `Clutch Completion` <chr>, `Date Egg` <date>,
## # `Culmen Length (mm)` <dbl>, `Culmen Depth (mm)` <dbl>,
## # `Flipper Length (mm)` <dbl>, `Body Mass (g)` <dbl>, Sex <chr>,
## # `Delta 15 N (o/oo)` <dbl>, `Delta 13 C (o/oo)` <dbl>, Comments <chr>
names(penguins_raw) # lists variable names
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Date Egg"
## [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
## [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
## [16] "Delta 13 C (o/oo)" "Comments"
summary(penguins_raw) # provides some descriptive statistics for each variable
## studyName Sample Number Species Region
## Length:344 Min. : 1.00 Length:344 Length:344
## Class :character 1st Qu.: 29.00 Class :character Class :character
## Mode :character Median : 58.00 Mode :character Mode :character
## Mean : 63.15
## 3rd Qu.: 95.25
## Max. :152.00
##
## Island Stage Individual ID Clutch Completion
## Length:344 Length:344 Length:344 Length:344
## Class :character Class :character Class :character Class :character
## Mode :character Mode :character Mode :character Mode :character
##
##
##
##
## Date Egg Culmen Length (mm) Culmen Depth (mm) Flipper Length (mm)
## Min. :2007-11-09 Min. :32.10 Min. :13.10 Min. :172.0
## 1st Qu.:2007-11-28 1st Qu.:39.23 1st Qu.:15.60 1st Qu.:190.0
## Median :2008-11-09 Median :44.45 Median :17.30 Median :197.0
## Mean :2008-11-27 Mean :43.92 Mean :17.15 Mean :200.9
## 3rd Qu.:2009-11-16 3rd Qu.:48.50 3rd Qu.:18.70 3rd Qu.:213.0
## Max. :2009-12-01 Max. :59.60 Max. :21.50 Max. :231.0
## NA's :2 NA's :2 NA's :2
## Body Mass (g) Sex Delta 15 N (o/oo) Delta 13 C (o/oo)
## Min. :2700 Length:344 Min. : 7.632 Min. :-27.02
## 1st Qu.:3550 Class :character 1st Qu.: 8.300 1st Qu.:-26.32
## Median :4050 Mode :character Median : 8.652 Median :-25.83
## Mean :4202 Mean : 8.733 Mean :-25.69
## 3rd Qu.:4750 3rd Qu.: 9.172 3rd Qu.:-25.06
## Max. :6300 Max. :10.025 Max. :-23.79
## NA's :2 NA's :14 NA's :13
## Comments
## Length:344
## Class :character
## Mode :character
##
##
##
##
How to read in data and make note of the missing values:
data <- read_csv("data.csv", na = c("-999", "NA", "000")) #note: the `na` argument only works for "read_csv" and not "read_sav."
Clean names
We see that the variable names are not “clean.” As in, they have spaces and capital letters.
Usually, we want to use a format called “snake case”, where we have lower case letters and spaces are replaced with underscores.
We can easily do this with the clean_names() function in the janitor package:
penguins_new <- penguins_raw %>%
clean_names()
#Compare names
names(penguins_new)
## [1] "study_name" "sample_number" "species"
## [4] "region" "island" "stage"
## [7] "individual_id" "clutch_completion" "date_egg"
## [10] "culmen_length_mm" "culmen_depth_mm" "flipper_length_mm"
## [13] "body_mass_g" "sex" "delta_15_n_o_oo"
## [16] "delta_13_c_o_oo" "comments"
names(penguins_raw)
## [1] "studyName" "Sample Number" "Species"
## [4] "Region" "Island" "Stage"
## [7] "Individual ID" "Clutch Completion" "Date Egg"
## [10] "Culmen Length (mm)" "Culmen Depth (mm)" "Flipper Length (mm)"
## [13] "Body Mass (g)" "Sex" "Delta 15 N (o/oo)"
## [16] "Delta 13 C (o/oo)" "Comments"
Move variable names
We can use the move_columns() function to rearrange our variable names. Lets say we want to move clutch_completion to the left of body_mass_g:
penguins_move <- penguins_new %>%
move_columns(clutch_completion, .before = body_mass_g)
names(penguins_move)
## [1] "study_name" "sample_number" "species"
## [4] "region" "island" "stage"
## [7] "individual_id" "date_egg" "culmen_length_mm"
## [10] "culmen_depth_mm" "flipper_length_mm" "clutch_completion"
## [13] "body_mass_g" "sex" "delta_15_n_o_oo"
## [16] "delta_13_c_o_oo" "comments"
Recoding
Examine class
Let’s examine the clutch_completion variable.
(a character string denoting if the study nest observed with a full clutch, i.e., 2 eggs).
Let’s see what the labels look like using unique():
unique(penguins_new$clutch_completion)
## [1] "Yes" "No"
Right now, the levels are Yes and No.
Let’s also check the variable type using class():
class(penguins_new$clutch_completion)
## [1] "character"
Let’s make this a factor.
Create factor
Since this variable is not a character (but a factor), we need to convert it.
We can do that using factor():
penguins_factor <- penguins_new %>%
mutate(clutch_factor = factor(clutch_completion))
Let’s check the variable again:
class(penguins_factor$clutch_factor)
## [1] "factor"
unique(penguins_factor$clutch_factor)
## [1] Yes No
## Levels: No Yes
Recoding a factor
Let’s say we want to recode our labels from Yes and No to 1 and 0:
penguins_recoded <- penguins_factor %>%
mutate(clutch_recoded = recode(clutch_factor,"Yes" = "1", "No" = "0"))
class(penguins_recoded$clutch_recoded)
## [1] "factor"
unique(penguins_recoded$clutch_recoded)
## [1] 1 0
## Levels: 0 1
Recoding a character into a factor
Let’s look at another variable in the penguins dataset: Sex of the penguin:
class(penguins_new$sex)
## [1] "character"
unique(penguins_new$sex)
## [1] "MALE" "FEMALE" NA
Let’s say we didn’t do the process above, and it was still in character format.
We can combine those two steps by recoding character variables into factors (from MALE and FEMALE to 0 and 1) using recode_factor():
penguins_sex_recoded <- penguins_new %>%
mutate(sex_recoded = recode_factor(sex,"MALE" = "0", "FEMALE" = "1"))
Lets check the variable again:
class(penguins_sex_recoded$sex_recoded)
## [1] "factor"
unique(penguins_sex_recoded$sex_recoded)
## [1] 0 1 <NA>
## Levels: 0 1
Side Note. I also wanted to make a quick note on what happens when you select the variable of interest first vs. not:
penguins_recoded_select <- penguins_new %>%
select(sex) %>%
mutate(sex_recoded = recode_factor(sex,"MALE" = "0", "FEMALE" = "1"))
head(penguins_recoded_select)
## # A tibble: 6 × 2
## sex sex_recoded
## <chr> <fct>
## 1 MALE 0
## 2 FEMALE 1
## 3 FEMALE 1
## 4 <NA> <NA>
## 5 FEMALE 1
## 6 MALE 0
head(penguins_sex_recoded)
## # A tibble: 6 × 18
## study_name sample_number species region island stage individual_id
## <chr> <dbl> <chr> <chr> <chr> <chr> <chr>
## 1 PAL0708 1 Adelie Penguin (Py… Anvers Torge… Adul… N1A1
## 2 PAL0708 2 Adelie Penguin (Py… Anvers Torge… Adul… N1A2
## 3 PAL0708 3 Adelie Penguin (Py… Anvers Torge… Adul… N2A1
## 4 PAL0708 4 Adelie Penguin (Py… Anvers Torge… Adul… N2A2
## 5 PAL0708 5 Adelie Penguin (Py… Anvers Torge… Adul… N3A1
## 6 PAL0708 6 Adelie Penguin (Py… Anvers Torge… Adul… N3A2
## # … with 11 more variables: clutch_completion <chr>, date_egg <date>,
## # culmen_length_mm <dbl>, culmen_depth_mm <dbl>, flipper_length_mm <dbl>,
## # body_mass_g <dbl>, sex <chr>, delta_15_n_o_oo <dbl>, delta_13_c_o_oo <dbl>,
## # comments <chr>, sex_recoded <fct>
Recoding multiple variables
The dataset doesn’t have any two variables that need to be recoded the same, but let’s recode both sex and clutch_completion at the same time. Note: this code should be used if you have two variables that need to have the same recoding, but it looks like this works too.
penguins_multiple <- penguins_new %>%
mutate_at(c("sex", "clutch_completion"),
~ recode_factor(., "MALE" = "0", "FEMALE" = "1",
"Yes" = "1", "No" = "0"))
penguins_multiple %>%
select(sex, clutch_completion) %>%
head()
## # A tibble: 6 × 2
## sex clutch_completion
## <fct> <fct>
## 1 0 1
## 2 1 1
## 3 1 1
## 4 <NA> 1
## 5 1 1
## 6 0 1
Creating reference variables
Recall slide 33 in the Multiple Regression lecture:

Remember that when we enter a categorical variable into the model using lm(), R creates the reference variable for us, and picks the first category as the reference category.
For example, let’s look at the species variable:
class(penguins_new$species)
## [1] "character"
unique(penguins_new$species)
## [1] "Adelie Penguin (Pygoscelis adeliae)"
## [2] "Gentoo penguin (Pygoscelis papua)"
## [3] "Chinstrap penguin (Pygoscelis antarctica)"
Let’s clean this variable.
First, let’s make the labels lowercase using str_to_lower():
penguin_species <- penguins_new %>%
mutate(species = str_to_lower(species))
unique(penguin_species$species)
## [1] "adelie penguin (pygoscelis adeliae)"
## [2] "gentoo penguin (pygoscelis papua)"
## [3] "chinstrap penguin (pygoscelis antarctica)"
Now, let’s just keep the simple names of the penguins: adelie, gentoo, and chinstrap using the separate() function:
penguins_species_clean <- penguin_species %>%
separate(species, into = c("species", "delete"), sep = " penguin") %>%
select(-delete)
unique(penguins_species_clean$species)
## [1] "adelie" "gentoo" "chinstrap"
class(penguins_species_clean$species)
## [1] "character"
Let’s also convert this new variable into a factor:
penguins_species_clean$species <- factor(penguins_species_clean$species)
unique(penguins_species_clean$species)
## [1] adelie gentoo chinstrap
## Levels: adelie chinstrap gentoo
class(penguins_species_clean$species)
## [1] "factor"
Let’s quickly run a simple linear regression using the species variable without creating the reference variables.
Let’s predict flipper_length_mm from species:
model_1 <- lm(flipper_length_mm ~ species, data = penguins_species_clean)
summary(model_1)
##
## Call:
## lm(formula = flipper_length_mm ~ species, data = penguins_species_clean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9536 -4.8235 0.0464 4.8130 20.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 189.9536 0.5405 351.454 < 2e-16 ***
## specieschinstrap 5.8699 0.9699 6.052 3.79e-09 ***
## speciesgentoo 27.2333 0.8067 33.760 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7782, Adjusted R-squared: 0.7769
## F-statistic: 594.8 on 2 and 339 DF, p-value: < 2.2e-16
Here, it automatically created the reference variables and made adelie penguins the reference category.
What if we want to compare our results to gentoo penguins?
We can do that by converting the chinstrap and adelie into their own variables using ifelse()
chinstrap <- ifelse(penguins_species_clean$species == 'chinstrap', 1, 0)
adelie <- ifelse(penguins_species_clean$species == 'adelie', 1, 0)
Then, we add them to our dataset using cbind():
penguins_reference <- penguins_species_clean %>%
cbind(chinstrap, adelie)
Let’s look at the model again with gentoo as the reference variable.
Instead of species, we use chinstrap and adelie:
model_2 <- lm(flipper_length_mm ~ chinstrap + adelie, data = penguins_reference)
summary(model_2)
##
## Call:
## lm(formula = flipper_length_mm ~ chinstrap + adelie, data = penguins_reference)
##
## Residuals:
## Min 1Q Median 3Q Max
## -17.9536 -4.8235 0.0464 4.8130 20.0464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 217.1870 0.5988 362.68 <2e-16 ***
## chinstrap -21.3635 1.0036 -21.29 <2e-16 ***
## adelie -27.2333 0.8067 -33.76 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 6.642 on 339 degrees of freedom
## (2 observations deleted due to missingness)
## Multiple R-squared: 0.7782, Adjusted R-squared: 0.7769
## F-statistic: 594.8 on 2 and 339 DF, p-value: < 2.2e-16
Changing the reference category does NOT change our model. It only allows us to make a more meaningful comparison.
Collapsing variables
Continous to categorical
In lab 1, I go over how to collapse continuous variables into categorical. Let’s categorize body_mass_g into 1, 2, 3, and 4 levels.
summary(penguins_new$body_mass_g)
## Min. 1st Qu. Median Mean 3rd Qu. Max. NA's
## 2700 3550 4050 4202 4750 6300 2
cont_cat <- penguins_new %>%
mutate(body_mass_factor = cut(body_mass_g,
breaks = c(2700, 3550, 4202, 4750, 6300),
labels = c("low", "medium_low", "medium_high", "high")))
Categorical to categorical
Let’s collapse our body_mass_g even further. This code is useful if you have a Likert-type scale with four categories, such as Strong Agree to Strongly Disagree and want to collapse to a category with those who responded Strongly Agree and Agree as 1, and Disagree and Strongly Disagree as 2. We can use fct_collapse to collapse factor levels into groups.
cat_cat <- cont_cat %>%
mutate(body_mass_collpase =
fct_collapse(body_mass_factor,
low = c("low", "medium_low"),
high = c("medium_high", "high")))