class: middle, right, title-slide

.title[
# Tidy data reshaping & summaries
]

.author[
### Athanasia Monika Mowinckel
]

---

layout: true

<div class="my-sidebar"></div>

---
class: dark, center
background-image: url(img/tidyr.png), url(img/dplyr.png), url(img/purrr.png)
background-size: 15%
background-position: 32% 65%, 50% 65%, 68% 65%

# Part 2
## Tidy data reshaping & summaries

---
class: middle, inverse

## Tidy data reshaping & summaries

<ul style="color: white;">

- pivoting data with [tidyr](https://tidyr.tidyverse.org/) (~25 min)
- grouped summaries with [dplyr](https://dplyr.tidyverse.org/) (~25 min)
- working with nested data using [purrr](https://purrr.tidyverse.org/) (~25 min)

</ul>

---
class: dark, center
background-image: url(img/tidyr.png)
background-size: 15%
background-position: 50% 65%

# tidyr
## pivoting / altering data shape

---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

## tidyr

The goal of tidyr is to help you create tidy data. Tidy data is data where:

- Every column is a variable.
- Every row is an observation.
- Every cell is a single value.

Tidy data describes a standard way of storing data that is used wherever possible throughout the tidyverse. If you ensure that your data is tidy, you’ll spend less time fighting with the tools and more time working on your analysis. Learn more about tidy data in `vignette("tidy-data")`.

---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

### Tall/long vs. wide data

- Tall (or long) data are considered "tidy", in that they adhere to the three tidy-data principles
- Wide data are not necessarily "messy", but have a shape less ideal for easy handling in the tidyverse

Example in longitudinal data design:

- wide data: each participant has a single row of data, with all longitudinal observations in separate columns
- tall data: a participant has as many rows as longitudinal time points, with measures in separate columns

---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

## tidyr

.pull-left[
#### pivoting

`pivot_longer()` - wide to long

`pivot_wider()` - long to wide

Transforms data shape
]

--

.pull-right[
![](gifs/tall_wide.gif)<!-- -->
]

---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

## Pivoting longer

`pivot_longer()` takes tidy-select column arguments, so it is easy to grab all the columns you are after.

```r
penguins |>
  pivot_longer(contains("_"))
```

```
## # A tibble: 1,376 × 6
##    species island    sex     year name    value
##    <fct>   <fct>     <fct>  <int> <chr>   <dbl>
##  1 Adelie  Torgersen male    2007 bill_l…   39.1
##  2 Adelie  Torgersen male    2007 bill_d…   18.7
##  3 Adelie  Torgersen male    2007 flippe…  181
##  4 Adelie  Torgersen male    2007 body_m… 3750
##  5 Adelie  Torgersen female  2007 bill_l…   39.5
##  6 Adelie  Torgersen female  2007 bill_d…   17.4
##  7 Adelie  Torgersen female  2007 flippe…  186
##  8 Adelie  Torgersen female  2007 body_m… 3800
##  9 Adelie  Torgersen female  2007 bill_l…   40.3
## 10 Adelie  Torgersen female  2007 bill_d…   18
## # … with 1,366 more rows
```

---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

## Why pivot longer?

.pull-left[
Long data are convenient for faceted sub-plots with ggplot2

```r
penguins |>
  pivot_longer(contains("_")) |>
  ggplot(aes(x = value,
             fill = species)) +
  geom_density() +
  facet_wrap(~ name, scales = "free") +
  scale_fill_viridis_d(alpha = .5) +
  theme(legend.position = "bottom")
```
]

.pull-right[
![](002-tidy-summaries_files/figure-html/long-ggplot1-remd-1.png)<!-- -->
]

---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

### pivoting wider

```r
penguins_long <- penguins |>
  mutate(id = row_number()) |>
  pivot_longer(contains("_"),
               names_to = c("body_part", "measure", "unit"),
               names_sep = "_")

penguins_long |>
  pivot_wider(names_from = c("body_part", "measure", "unit"), # pivot these columns
              values_from = "value",                          # take the values from here
              names_sep = "_")                                # separate names_from with this character
```

```
## # A tibble: 344 × 9
##    species island    sex     year    id bill_l…¹
##    <fct>   <fct>     <fct>  <int> <int>    <dbl>
##  1 Adelie  Torgersen male    2007     1     39.1
##  2 Adelie  Torgersen female  2007     2     39.5
##  3 Adelie  Torgersen female  2007     3     40.3
##  4 Adelie  Torgersen <NA>    2007     4     NA
##  5 Adelie  Torgersen female  2007     5     36.7
##  6 Adelie  Torgersen male    2007     6     39.3
##  7 Adelie  Torgersen female  2007     7     38.9
##  8 Adelie  Torgersen male    2007     8     39.2
##  9 Adelie  Torgersen <NA>    2007     9     34.1
## 10 Adelie  Torgersen <NA>    2007    10     42
## # … with 334 more rows, 3 more variables:
## #   bill_depth_mm <dbl>,
## #   flipper_length_mm <dbl>, body_mass_g <dbl>,
## #   and abbreviated variable name
## #   ¹bill_length_mm
```

---
class: inverse, middle, center

## Go to RStudio
### live demo

---
class: inverse, middle, center

## Go to pivoting exercises
### `learnr::run_tutorial("005-pivoting", "tidyquintro")`
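---
background-image: url(img/tidyr.png)
background-size: 8%
background-position: 95% 5%

### Sketch: a pivoting round trip

The wide and tall layouts described for longitudinal designs can be illustrated with a small round trip. This is a minimal sketch rather than part of the exercises: the `wide` data frame and its `score_t1` to `score_t3` columns are invented for illustration, and the code assumes tidyr 1.0 or later and R 4.1 or later (for `|>`).

```r
library(tidyr)

# Toy data invented for illustration: one row per participant (wide)
wide <- data.frame(
  id = 1:3,
  score_t1 = c(10, 12, 9),
  score_t2 = c(11, 14, 10),
  score_t3 = c(13, 15, 12)
)

# Wide to tall: one row per participant per time point
tall <- wide |>
  pivot_longer(
    starts_with("score_"),
    names_to = "timepoint",
    names_prefix = "score_",  # drop the shared prefix from the new names
    values_to = "score"
  )

# Tall back to wide: one column per time point again
wide_again <- tall |>
  pivot_wider(
    names_from = timepoint,
    values_from = score,
    names_prefix = "score_"   # add the prefix back
  )
```

`names_prefix` strips the shared `score_` prefix on the way to tall data and adds it back on the way to wide data, so `wide_again` has the same columns as `wide`.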
---
class: dark, center
background-image: url(img/dplyr.png), url(img/tidyr.png)
background-size: 15%
background-position: 41% 65%, 59% 65%

## dplyr + tidyr
### data summaries

---
background-image: url(img/dplyr.png)
background-size: 8%
background-position: 95% 5%

## dplyr - comparison to base-R

#### tidy

```r
penguins |>
  summarise(mean(bill_length_mm, na.rm = TRUE))
```

#### base

```r
mean(penguins$bill_length_mm, na.rm = TRUE)
```

<div style="font-size: 15px;">
<a href="https://dplyr.tidyverse.org/articles/base.html">https://dplyr.tidyverse.org/articles/base.html</a>
</div>

---
class: inverse, middle, center

## Go to RStudio
### live demo

---
class: inverse, middle, center

## Go to summarising exercises
### `learnr::run_tutorial("006-summarising", "tidyquintro")`
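---
background-image: url(img/dplyr.png)
background-size: 8%
background-position: 95% 5%

### Sketch: grouped summaries

The `summarise()` comparison earlier in this part returns a single row for the whole data set. A grouped summary adds `group_by()` in front. The sketch below uses a small `birds` tibble invented for illustration (it is not the penguins data), and assumes dplyr 1.0 or later and tidyr 1.0 or later.

```r
library(dplyr)
library(tidyr)

# Toy data invented for illustration, loosely shaped like penguins
birds <- tibble(
  species = c("A", "A", "B", "B", "B"),
  bill_length_mm = c(39.1, NA, 46.5, 50.0, 45.5),
  bill_depth_mm = c(18.7, 17.4, 17.9, 19.5, 15.0)
)

# One summary row per species
by_species <- birds |>
  group_by(species) |>
  summarise(
    n = n(),
    bill_length_mean = mean(bill_length_mm, na.rm = TRUE),
    .groups = "drop"
  )

# Summarise several measures at once by pivoting longer first
by_measure <- birds |>
  pivot_longer(starts_with("bill_"), names_to = "measure") |>
  group_by(species, measure) |>
  summarise(mean = mean(value, na.rm = TRUE), .groups = "drop")
```

Pivoting longer before grouping is one way to combine the two packages: the measure name becomes a grouping variable, so one `summarise()` call covers every measure.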
---
class: dark, center
background-image: url(img/dplyr.png), url(img/tidyr.png), url(img/purrr.png)
background-size: 15%
background-position: 32% 65%, 50% 65%, 68% 65%

# dplyr + tidyr + purrr
## Working with nested data - avoiding loops

---
background-image: url(img/dplyr.png), url(img/tidyr.png), url(img/purrr.png)
background-size: 8%
background-position: 93% 5%, 97.5% 19%, 84.5% 5%

## comparison to base-R

#### tidy

```r
penguins |>
  nest_by(species, island) |>
  mutate(lm_model = list(
    lm(bill_length_mm ~ bill_depth_mm, data = data)
  ))
```

#### base

```r
penguins$groups <- interaction(penguins$species, penguins$island)
groups <- unique(penguins$groups)  # only the combinations present in the data

models <- list()
for (i in seq_along(groups)) {
  # subset to one species/island combination, then fit on that subset
  tmp <- penguins[penguins$groups == groups[i], ]
  models[[i]] <- lm(bill_length_mm ~ bill_depth_mm, data = tmp)
}

# or
lapply(unique(penguins$groups),
       function(x) lm(bill_length_mm ~ bill_depth_mm,
                      data = penguins[penguins$groups == x, ]))
```

<div style="font-size: 15px;">
<a href="https://dplyr.tidyverse.org/articles/base.html">https://dplyr.tidyverse.org/articles/base.html</a>
</div>

---
class: inverse, middle, center

## Go to RStudio
### live demo

---
class: inverse, middle, center

## Go to nesting exercises
### `learnr::run_tutorial("006-nesting", "tidyquintro")`
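---
background-image: url(img/dplyr.png), url(img/tidyr.png), url(img/purrr.png)
background-size: 8%
background-position: 93% 5%, 97.5% 19%, 84.5% 5%

### Sketch: using the nested models

Once `nest_by()` has stored one model per group in a list column, purrr can pull values out of each model without a loop. This is a minimal sketch under stated assumptions: the `toy` tibble and the `slope` column are invented for illustration, and the code assumes dplyr 1.0 or later, purrr, and R 4.1 or later (for `|>` and `\(m)`).

```r
library(dplyr)
library(purrr)

# Toy data invented for illustration: two groups with known linear trends
toy <- tibble(
  group = rep(c("a", "b"), each = 4),
  x = rep(1:4, times = 2),
  y = c(2, 4, 6, 8, 1, 0, -1, -2)  # slope 2 in "a", slope -1 in "b"
)

models <- toy |>
  nest_by(group) |>                                   # one row per group, data in a list column
  mutate(lm_model = list(lm(y ~ x, data = data))) |>  # fit one model per row
  ungroup() |>                                        # drop the row-wise grouping
  mutate(slope = map_dbl(lm_model, \(m) coef(m)[["x"]]))
```

`map_dbl()` returns a plain numeric vector, so `slope` becomes an ordinary column next to the list column of models.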
---
class: dark, middle, center

# End of part 2