Data manipulation across columns
Overview
Teaching: 45 min
Exercises: 6 minQuestions
How can I calculate the mean of several columns for every row of data?
How can I apply the same function across several related columns?
Objectives
Motivation
We have covered many topics so far, and changing (or mutating) variables has been a key consept. The need to create new columns of data, often based on information in other columns of data, is a type of operation that we need very often. But some times, you also need to calculate something per row for several solumns. For instance, you want the sum of all columns in a certain collection, or the mean of them, how can we do that?
One way, is to write it our entirely. Let just pretend there is a good reason to get the sum of bill length and bill depth. Let us also make a subsetted sample with just the bill measurements so we cab easily see what we are doing. We can do that in the following way.
penguins_s <- penguins %>%
select(species, starts_with("bill"))
penguins_s %>%
mutate(
bill_sum = bill_depth_mm + bill_length_mm
)
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 57.8
2 Adelie 39.5 17.4 56.9
3 Adelie 40.3 18 58.3
4 Adelie NA NA NA
5 Adelie 36.7 19.3 56
6 Adelie 39.3 20.6 59.9
7 Adelie 38.9 17.8 56.7
8 Adelie 39.2 19.6 58.8
9 Adelie 34.1 18.1 52.2
10 Adelie 42 20.2 62.2
# … with 334 more rows
We’ve seen similar types of operations before.
But what if you want to sum 20 columns, you would need to type our all 20 column names!
Again, tedious.
We have a special type of operations we can do to get that easily.
We will use the function sum
to calculate the sum of several variables when using this pipeline.
penguins_s %>%
mutate(bill_sum = sum(c_across(starts_with("bill"))))
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 NA
2 Adelie 39.5 17.4 NA
3 Adelie 40.3 18 NA
4 Adelie NA NA NA
5 Adelie 36.7 19.3 NA
6 Adelie 39.3 20.6 NA
7 Adelie 38.9 17.8 NA
8 Adelie 39.2 19.6 NA
9 Adelie 34.1 18.1 NA
10 Adelie 42 20.2 NA
# … with 334 more rows
hm, that is not what we expected.
I know why, but the reason is not always easy to understand.
By default, c_across
will summarise all the rows for all the bill columns, and give a single value for the entire data set.
There are some NA
s the entire data set, so it returns NA
.
So how can we force it to work in a row-wise fashion?
We can apply a function called rowwise()
which is a special type of group_by
that groups your data by each row, so each row is its own group.
Then, c_across()
will calculate the mean of the columns just for that group (i.e. row in this case).
penguins_s %>%
rowwise() %>%
mutate(bill_sum = sum(c_across(starts_with("bill"))))
# A tibble: 344 × 4
# Rowwise:
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 57.8
2 Adelie 39.5 17.4 56.9
3 Adelie 40.3 18 58.3
4 Adelie NA NA NA
5 Adelie 36.7 19.3 56
6 Adelie 39.3 20.6 59.9
7 Adelie 38.9 17.8 56.7
8 Adelie 39.2 19.6 58.8
9 Adelie 34.1 18.1 52.2
10 Adelie 42 20.2 62.2
# … with 334 more rows
Now we can see that we get the row sum of all the bill columns for each row, and the tibble tells us it is “Rowwise”.
To stop the data set being rowwise, we can use the ungroup()
function we learned before.
penguins_s %>%
rowwise() %>%
mutate(bill_sum = sum(c_across(starts_with("bill")))) %>%
ungroup()
# A tibble: 344 × 4
species bill_length_mm bill_depth_mm bill_sum
<fct> <dbl> <dbl> <dbl>
1 Adelie 39.1 18.7 57.8
2 Adelie 39.5 17.4 56.9
3 Adelie 40.3 18 58.3
4 Adelie NA NA NA
5 Adelie 36.7 19.3 56
6 Adelie 39.3 20.6 59.9
7 Adelie 38.9 17.8 56.7
8 Adelie 39.2 19.6 58.8
9 Adelie 34.1 18.1 52.2
10 Adelie 42 20.2 62.2
# … with 334 more rows
Challenge 1
Calculate the mean of all the columns with millimeter measurements, an call it
mm_mean
, for each row of data.Solution
penguins %>% rowwise() %>% mutate( mm_mean = mean(c_across(ends_with("mm"))) )
# A tibble: 344 × 9 # Rowwise: species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 3 more variables: sex <fct>, year <int>, # mm_mean <dbl>
Challenge 2
Calculate the mean of all the columns with millimeter measurements, an call it
mm_mean
, for each row of data. Then, group the data by species, and calculate the mean of themm_mean
within each species and add it as a column namedmm_mean_species
. IgnoreNA
s in the last calculationSolution
penguins %>% rowwise() %>% mutate( mm_mean = mean(c_across(ends_with("mm"))), ) %>% group_by(species) %>% mutate(mm_mean_species = mean(mm_mean, na.rm = TRUE))
# A tibble: 344 × 10 # Groups: species [3] species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 4 more variables: sex <fct>, year <int>, # mm_mean <dbl>, mm_mean_species <dbl>
Mutating several columns in one go
So far, we’ve been looking at adding or summarising variables one by one, or in a pivoted fashion. This is of course something we do all the time, but some times we need to do the same change to multiple columns at once. Imagine you have a data set with 20 column and you want to scale them all to the same scale. Writing the same command with different columns 20 times is very tedious.
In our case, let us say we want to scale the three columns with millimeter measurements so that they have a mean of 0 and standard deviation of 1.
We’ve already used the scale()
function once before, so we will do it again.
In this simple example we might have done so:
penguins %>%
mutate(
bill_depth_sc = scale(bill_depth_mm),
bill_length_sc = scale(bill_length_mm),
flipper_length_sc = scale(flipper_length_mm)
)
# A tibble: 344 × 11
species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
<fct> <fct> <dbl> <dbl> <int> <int>
1 Adelie Torgersen 39.1 18.7 181 3750
2 Adelie Torgersen 39.5 17.4 186 3800
3 Adelie Torgersen 40.3 18 195 3250
4 Adelie Torgersen NA NA NA NA
5 Adelie Torgersen 36.7 19.3 193 3450
6 Adelie Torgersen 39.3 20.6 190 3650
7 Adelie Torgersen 38.9 17.8 181 3625
8 Adelie Torgersen 39.2 19.6 195 4675
9 Adelie Torgersen 34.1 18.1 193 3475
10 Adelie Torgersen 42 20.2 190 4250
# … with 334 more rows, and 5 more variables: sex <fct>, year <int>,
# bill_depth_sc <dbl[,1]>, bill_length_sc <dbl[,1]>,
# flipper_length_sc <dbl[,1]>
Its just three columns, we can do that.
But let us imagine we have 20 of these, typing all that out is tedious and error prone.
You might forget to alter the name or keep the same type of naming convention.
We are only human, we easily make mistakes.
With {dplyr}’s across()
we can combine our knowledge of tidy-selectors and mutate to create the entire transformation for these columns at once.
penguins %>%
mutate(across(.cols = ends_with("mm"),
.fns = scale))
# A tibble: 344 × 8
species island bill_length_mm[… bill_depth_mm[,… flipper_length_… body_mass_g
<fct> <fct> <dbl> <dbl> <dbl> <int>
1 Adelie Torge… -0.883 0.784 -1.42 3750
2 Adelie Torge… -0.810 0.126 -1.06 3800
3 Adelie Torge… -0.663 0.430 -0.421 3250
4 Adelie Torge… NA NA NA NA
5 Adelie Torge… -1.32 1.09 -0.563 3450
6 Adelie Torge… -0.847 1.75 -0.776 3650
7 Adelie Torge… -0.920 0.329 -1.42 3625
8 Adelie Torge… -0.865 1.24 -0.421 4675
9 Adelie Torge… -1.80 0.480 -0.563 3475
10 Adelie Torge… -0.352 1.54 -0.776 4250
# … with 334 more rows, and 2 more variables: sex <fct>, year <int>
Whoa! So fast!
Now the three columns are scaled.
.col
argument takes a tidy-selection of columns, and .fns
it where you let it know which function to apply.
But oh no! The columns have been overwritten. Rather than creating new ones, we replaced the old ones. This might be your intention in some instances, or maybe you will just create a new data set with the scaled variables.
penguins_mm_sc <- penguins %>%
mutate(across(.cols = ends_with("mm"),
.fns = scale))
but often, we’d like to keep the original but add the new variants. We can do that to within the across!
penguins %>%
mutate(across(.cols = ends_with("mm"),
.fns = scale,
.names = "{.col}_sc")) %>%
select(contains("mm"))
# A tibble: 344 × 6
bill_length_mm bill_depth_mm flipper_length_mm bill_length_mm_sc[,1]
<dbl> <dbl> <int> <dbl>
1 39.1 18.7 181 -0.883
2 39.5 17.4 186 -0.810
3 40.3 18 195 -0.663
4 NA NA NA NA
5 36.7 19.3 193 -1.32
6 39.3 20.6 190 -0.847
7 38.9 17.8 181 -0.920
8 39.2 19.6 195 -0.865
9 34.1 18.1 193 -1.80
10 42 20.2 190 -0.352
# … with 334 more rows, and 2 more variables: bill_depth_mm_sc <dbl[,1]>,
# flipper_length_mm_sc <dbl[,1]>
Now they are all there! neat! But that .names
argument is a little weird. What does it really mean?
Internally, across()
stores the column names in a vector it calls .col
.
We can use this knowledge to tell the across function what to name our new columns.
In this case, we want to append the column name with _sc
.
Challenge 3
Transform all the colmns with an underscore in their name so they are scaled, and add the prefix
sc_
to the columns names.Solution
penguins %>% mutate(across(.cols = contains("_"), .fns = scale, .names = "sc_{.col}"))
# A tibble: 344 × 12 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 6 more variables: sex <fct>, year <int>, # sc_bill_length_mm <dbl[,1]>, sc_bill_depth_mm <dbl[,1]>, # sc_flipper_length_mm <dbl[,1]>, sc_body_mass_g <dbl[,1]>
Challenge 4
Transform all the colmns with an underscore in their name so they are scaled, and add the prefix
sc_
to the columns names. Add another standard change of the body mass column to kilograms Hint: you can add a standard mutate within the same mutate as acrossSolution
penguins %>% mutate( across(.cols = contains("_"), .fns = scale, .names = "sc_{.col}"), body_mass_kg = body_mass_g / 1000 )
# A tibble: 344 × 13 species island bill_length_mm bill_depth_mm flipper_length_mm body_mass_g <fct> <fct> <dbl> <dbl> <int> <int> 1 Adelie Torgersen 39.1 18.7 181 3750 2 Adelie Torgersen 39.5 17.4 186 3800 3 Adelie Torgersen 40.3 18 195 3250 4 Adelie Torgersen NA NA NA NA 5 Adelie Torgersen 36.7 19.3 193 3450 6 Adelie Torgersen 39.3 20.6 190 3650 7 Adelie Torgersen 38.9 17.8 181 3625 8 Adelie Torgersen 39.2 19.6 195 4675 9 Adelie Torgersen 34.1 18.1 193 3475 10 Adelie Torgersen 42 20.2 190 4250 # … with 334 more rows, and 7 more variables: sex <fct>, year <int>, # sc_bill_length_mm <dbl[,1]>, sc_bill_depth_mm <dbl[,1]>, # sc_flipper_length_mm <dbl[,1]>, sc_body_mass_g <dbl[,1]>, # body_mass_kg <dbl>
Wrap-up
We hope these sessions have given your a leg-up in starting you R and tidyverse journey. Remember that learning to code is like learning a new language, the best way to learn is to keep trying. We promise, your efforts will not be in vain as you uncover the power of R and the tidyverse.
Learning more
As and end to this workshop, we would like to provide you with some learning materials that might aid you in further pursuits of learning R.
The tidyverse webpage offers lots of resources on learning the tidyverse way of working, and information about what great things you can do with this collection of packages. There is an R for Datascience learning community that is an excellent and welcoming community of other learners navigating the tidyverse. We wholeheartedly recommend joining this community! The Rstudio community is also a great place to ask questions or look for solutions for questions you may have, and so is stackoverflow.
Key Points