Data manipulation with dplyr
Overview
Teaching: 60 min
Exercises: 5 minQuestions
How can I add variables to my data?
How can I alter the variables already in my data?
Objectives
Use
mutate()
to add and alter variablesUse if_else where approriate
Use
case_when
where approriateUnderstand basic consepts of different data types
Motivation
Often, the data we have do not contain exactly what we need. We might need to change the order of factors, create new variables based on other columns in the data, or even variables conditional on specific values in other columns.
Adding new variables,
In {tidyverse}, when we add new variables, we use the mutate()
function. Just like the other {tidyverse} functions, mutate work specifically with data sets, and provides a nice shorthand for working directly with the columns in the data set.
penguins %>%
mutate(new_var = 1)
# A tibble: 344 × 9
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# … with 334 more rows, and 3 more variables: sex <fct>, year <int>,
# new_var <dbl>
The output of this can be hard to spot, depending on the size of the screen. Let us for convenience create a subsetted data set to work on so we can easily see what we are doing.
penguins_s <- penguins %>%
select(1:3, starts_with("bill"))
Lets try our command again on this new data.
penguins_s %>%
mutate(new_var = 1)
# A tibble: 344 × 5
species island bill_length_mm bill_depth_mm new_var
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 1
2 Adelie Torgersen 39.5 17.4 1
3 Adelie Torgersen 40.3 18 1
4 Adelie Torgersen NA NA 1
5 Adelie Torgersen 36.7 19.3 1
6 Adelie Torgersen 39.3 20.6 1
7 Adelie Torgersen 38.9 17.8 1
8 Adelie Torgersen 39.2 19.6 1
9 Adelie Torgersen 34.1 18.1 1
10 Adelie Torgersen 42 20.2 1
# … with 334 more rows
There is now a new column in the data set called “new_var”, and it has the value 1 for all rows!
This is what we told mutate()
to do! We specified a new column by name, and gave it a specific value, 1
.
This works because its easy to assigning a single value to all rows. What if we try to give it three values? What would we expect?
penguins_s %>%
mutate(var = 1:3)
Error in `mutate()`:
! Problem while computing `var = 1:3`.
x `var` must be size 344 or 1, not 3.
Here, it’s failing with a mysterious message. The error is telling us that input must be of size 344 or 1. 344 are the number of rows in the data set, so its telling us the input we gave it is not suitable because its neither of length 344 nor of length 1.
So now we know the premises for mutate, it takes inputs that are either of the same length as there are rows in the data set or length 1.
penguins_s %>%
mutate(var = 1:344)
# A tibble: 344 × 5
species island bill_length_mm bill_depth_mm var
<fct> <fct> <dbl> <dbl> <int>
1 Adelie Torgersen 39.1 18.7 1
2 Adelie Torgersen 39.5 17.4 2
3 Adelie Torgersen 40.3 18 3
4 Adelie Torgersen NA NA 4
5 Adelie Torgersen 36.7 19.3 5
6 Adelie Torgersen 39.3 20.6 6
7 Adelie Torgersen 38.9 17.8 7
8 Adelie Torgersen 39.2 19.6 8
9 Adelie Torgersen 34.1 18.1 9
10 Adelie Torgersen 42 20.2 10
# … with 334 more rows
But generally, we create new columns based on other data in the data set. So let’s do a more useful example. For instance, perhaps we want to use the ratio between the bill length and depth as a measurement for a model.
penguins_s %>%
mutate(bill_ratio = bill_length_mm / bill_depth_mm)
# A tibble: 344 × 5
species island bill_length_mm bill_depth_mm bill_ratio
<fct> <fct> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 2.09
2 Adelie Torgersen 39.5 17.4 2.27
3 Adelie Torgersen 40.3 18 2.24
4 Adelie Torgersen NA NA NA
5 Adelie Torgersen 36.7 19.3 1.90
6 Adelie Torgersen 39.3 20.6 1.91
7 Adelie Torgersen 38.9 17.8 2.19
8 Adelie Torgersen 39.2 19.6 2
9 Adelie Torgersen 34.1 18.1 1.88
10 Adelie Torgersen 42 20.2 2.08
# … with 334 more rows
So, here we have asked for the ratio between bill length and depth to be calculated and stored in a column named bill_ratio
. Then we selected just the bill
columns to have a peak at the output more directly.
We can do almost anything within a mutate()
to get the values as we want them, also use functions that exist in R to transform the data. For instance, perhaps we want to scale the variables of interest to have a mean of 0 and standard deviation of 1, which is quite common to improve statistical modelling. We can do that with the scale()
function.
penguins_s %>%
mutate(bill_ratio = bill_length_mm / bill_depth_mm,
bill_length_mm_z = scale(bill_length_mm))
# A tibble: 344 × 6
species island bill_length_mm bill_depth_mm bill_ratio bill_length_mm_z[,…
<fct> <fct> <dbl> <dbl> <dbl> <dbl>
1 Adelie Torgersen 39.1 18.7 2.09 -0.883
2 Adelie Torgersen 39.5 17.4 2.27 -0.810
3 Adelie Torgersen 40.3 18 2.24 -0.663
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 1.90 -1.32
6 Adelie Torgersen 39.3 20.6 1.91 -0.847
7 Adelie Torgersen 38.9 17.8 2.19 -0.920
8 Adelie Torgersen 39.2 19.6 2 -0.865
9 Adelie Torgersen 34.1 18.1 1.88 -1.80
10 Adelie Torgersen 42 20.2 2.08 -0.352
# … with 334 more rows
Challenge 1
Create a column where bill length is transformed to cm. To transform mm to cm, you must divide the mm value by 10. Name the column bill_length_cm.
Solution 1
penguins_s %>% mutate(bill_length_cm = bill_length_mm / 10)
# A tibble: 344 × 5 species island bill_length_mm bill_depth_mm bill_length_cm <fct> <fct> <dbl> <dbl> <dbl> 1 Adelie Torgersen 39.1 18.7 3.91 2 Adelie Torgersen 39.5 17.4 3.95 3 Adelie Torgersen 40.3 18 4.03 4 Adelie Torgersen NA NA NA 5 Adelie Torgersen 36.7 19.3 3.67 6 Adelie Torgersen 39.3 20.6 3.93 7 Adelie Torgersen 38.9 17.8 3.89 8 Adelie Torgersen 39.2 19.6 3.92 9 Adelie Torgersen 34.1 18.1 3.41 10 Adelie Torgersen 42 20.2 4.2 # … with 334 more rows
Challenge 2
Create a column for body mass in kilos, rather than grams, in the main penguins data set. Name the column body_mass_kg. To transform grams to kilograms, divide the grams by 1000.
Solution 2
penguins %>% mutate(body_mass_kg = body_mass_g / 1000)
# A tibble: 344 × 9 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, # body_mass_kg <dbl>
Adding conditional variables
Sometimes, we want to assign certain data values based on other variables in the data set. For instance, maybe we want to classify all penguins with body mass above 4.5 kg as “large” while the rest are “normal”?
The if_else()
function takes expressions, much like filter()
.
The first value after the expression is the value assigned if the expression is TRUE
, while the second is if the expression is FALSE
penguin_weight <- penguins %>%
select(year, body_mass_g)
penguin_weight %>%
mutate(size = if_else(condition = body_mass_g > 4500,
true = "large",
false = "normal"))
# A tibble: 344 × 3
year body_mass_g size
<int> <int> <chr>
1 2007 3750 normal
2 2007 3800 normal
3 2007 3250 normal
4 2007 NA <NA>
5 2007 3450 normal
6 2007 3650 normal
7 2007 3625 normal
8 2007 4675 large
9 2007 3475 normal
10 2007 4250 normal
# … with 334 more rows
Now we have a column with two values, large
and normal
based on whether the penguins are above or below 4.5 kilos.
We can for instance use that in a plot.
penguin_weight %>%
mutate(size = if_else(condition = body_mass_g > 4500,
true = "large",
false = "normal")) %>%
ggplot() +
geom_jitter(mapping = aes(x = year, y = body_mass_g, colour = size))
Warning: Removed 2 rows containing missing values (geom_point).
That shows us clearly that we have grouped the penguins based on their size. But there is this strange NA
in the plot legend. what is that?
In R, missing values are usually given the value NA
which stands for Not applicable
, i.e., missing data. This is a very special name in R. Like TRUE
and FALSE
are capitalized, RStudio immediately recognizes the combination of capital letters and gives it another colour than all other values. In this case it means, there are some penguins we do not have the body mass of.
Now we know how to create new variables, and even how to make them if there are conditions on how to add the data.
But, we often want to add several columns of different types, and maybe even add new variables based on other new columns! Oh, it’s starting to sound complicated, but it does not have to be!
mutate()
is so-called lazy-evaluated. This sounds weird, but it means that each new column you make is made in the sequence you make them. So as long as you think about the order of your mutate()
creations, you can do that in a single mutate call.
penguins_s %>%
mutate(
bill_ratio = bill_depth_mm / bill_length_mm,
bill_type = if_else(condition = bill_ratio < 0.5,
true = "elongated",
false = "stumped")
)
# A tibble: 344 × 6
species island bill_length_mm bill_depth_mm bill_ratio bill_type
<fct> <fct> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 0.478 elongated
2 Adelie Torgersen 39.5 17.4 0.441 elongated
3 Adelie Torgersen 40.3 18 0.447 elongated
4 Adelie Torgersen NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 0.526 stumped
6 Adelie Torgersen 39.3 20.6 0.524 stumped
7 Adelie Torgersen 38.9 17.8 0.458 elongated
8 Adelie Torgersen 39.2 19.6 0.5 stumped
9 Adelie Torgersen 34.1 18.1 0.531 stumped
10 Adelie Torgersen 42 20.2 0.481 elongated
# … with 334 more rows
Now you’ve created two variables. One for bill_ratio
, and then another one conditional on the values of the bill_ratio
.
If you switched the order of these two, R would produce an error, because there would be no bill ratio to create the other column.
penguins_s %>%
mutate(
bill_ratio = bill_depth_mm / bill_length_mm,
bill_type = if_else(condition = bill_ratio < 0.5,
true = "elongated",
false = "stumped"),
bill_ratio = bill_depth_mm / bill_length_mm
)
# A tibble: 344 × 6
species island bill_length_mm bill_depth_mm bill_ratio bill_type
<fct> <fct> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 0.478 elongated
2 Adelie Torgersen 39.5 17.4 0.441 elongated
3 Adelie Torgersen 40.3 18 0.447 elongated
4 Adelie Torgersen NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 0.526 stumped
6 Adelie Torgersen 39.3 20.6 0.524 stumped
7 Adelie Torgersen 38.9 17.8 0.458 elongated
8 Adelie Torgersen 39.2 19.6 0.5 stumped
9 Adelie Torgersen 34.1 18.1 0.531 stumped
10 Adelie Torgersen 42 20.2 0.481 elongated
# … with 334 more rows
But what if we want to categorize based on more than one condition? Nested if_else()
?
penguins_s %>%
mutate(
bill_ratio = bill_depth_mm / bill_length_mm,
bill_type = if_else(condition = bill_ratio < 0.35,
true = "elongated",
false = if_else(condition = bill_ratio < 0.45,
true = "normal",
false = "stumped")))
# A tibble: 344 × 6
species island bill_length_mm bill_depth_mm bill_ratio bill_type
<fct> <fct> <dbl> <dbl> <dbl> <chr>
1 Adelie Torgersen 39.1 18.7 0.478 stumped
2 Adelie Torgersen 39.5 17.4 0.441 normal
3 Adelie Torgersen 40.3 18 0.447 normal
4 Adelie Torgersen NA NA NA <NA>
5 Adelie Torgersen 36.7 19.3 0.526 stumped
6 Adelie Torgersen 39.3 20.6 0.524 stumped
7 Adelie Torgersen 38.9 17.8 0.458 stumped
8 Adelie Torgersen 39.2 19.6 0.5 stumped
9 Adelie Torgersen 34.1 18.1 0.531 stumped
10 Adelie Torgersen 42 20.2 0.481 stumped
# … with 334 more rows
what if you have even more conditionals? It can get pretty messy pretty fast.
Thankfully, {dplyr} has a smarter way of doing this, called case_when()
. This function is similar to if_else()
, but where you specify what each condition should be assigned.
On the left you have the logical expression, and the on the right of the tilde (~
) is the value to be assigned if that expression is TRUE
penguins_s %>%
mutate(
bill_ratio = bill_depth_mm / bill_length_mm,
bill_type = case_when(
bill_ratio < 0.35 ~ "elongated",
bill_ratio < 0.45 ~ "normal",
TRUE ~ "stumped")
) %>%
ggplot(mapping = aes(x = bill_length_mm,
y = bill_depth_mm,
colour = bill_type)) +
geom_point()
Warning: Removed 2 rows containing missing values (geom_point).
That looks almost the same. The NA
’s are gone! That’s not right. We cannot categorize values that are missing. It’s our last statement that does this, which just says “make the remainder this value”. Which is not what we want. We need the NA
s to stay NA
’s.
case_when()
, like the mutate()
, evaluates the expressions in sequence. Which is why we can have two statements evaluating the same column with similar expressions (below 0.35 and then below 0.45). All values that are below 0.45 are also below 0.35. Since we first assign everything below 0.35, and then below 0.45, they do not collide. We can do the same for our last statement, saying that all values that are not NA
should be given this category.
penguins %>%
mutate(
bill_ratio = bill_depth_mm / bill_length_mm,
bill_type = case_when(
bill_ratio < 0.35 ~ "elongated",
bill_ratio < 0.45 ~ "normal",
!is.na(bill_ratio) ~ "stumped")
) %>%
ggplot(mapping = aes(x = bill_length_mm,
y = bill_depth_mm,
colour = bill_type)) +
geom_point()
Warning: Removed 2 rows containing missing values (geom_point).
Here, we use the is.na()
, which is a special function in R to detect NA
values. But it also has an !
in front, what does that mean? In R’s logical expressions, the !
is a negation specifier. It means it flips the logical so the TRUE
becomes FALSE
, and vice versa. So here, it means the bill_ratio
is not NA
.
Challenge 3
Create a column named
bill_ld_ratio_log
that is the natural logarithm (using thelog()
function) ofbill_length_mm
divided bybill_depth_mm
Solution 3
penguins %>% mutate(bill_ld_ratio_log = log(bill_length_mm / bill_depth_mm))
# A tibble: 344 × 9 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, # bill_ld_ratio_log <dbl>
Challenge 4
Create a new column called
body_type
, where animals below 3 kg aresmall
, animals between 3 and 4.5 kg arenormal
, and animals larger than 4.5 kg arelarge
. In the same command, create a new column namedbiscoe
and its content should beTRUE
if the island isBiscoe
andFALSE
for everything else.Solution 4
penguins %>% mutate( body_type = case_when( body_mass_g < 3000 ~ "small", body_mass_g >= 3000 & body_mass_g < 4500 ~ "normal", body_mass_g >= 4500 ~ "large"), biscoe = if_else(island == "Biscoe", true = TRUE, false = FALSE) )
# A tibble: 344 × 10 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 4 more variables: sex <fct>, year <int>, # body_type <chr>, biscoe <lgl>
Wrap up
Now we’ve learned a little about adding and altering variables in data sets using {dplyr}’s mutate()
function.
You should be able to play around with the examples provided and learn more about how things work through trial and error.
Key Points
Using
mutate
Using
if_else
Using
case_when