Mind your data

.title[
# Mind your data
]
.subtitle[
## - Creating Organised Research Projects -
]
.author[
### Athanasia Monika Mowinckel
]
.date[
### 29.03.2023
]

---

---
background-image: url(tube_structured.png)

---

---
class: dark
# Series inspiration

ways of working I see, make
- it hard for me to help
- loosing track of the work easy
- loosing vital project files or output easy

<br>

Great rOpenSci community call on reproducible research
  - see [video and materials](https://ropensci.org/commcalls/2019-07-30/)
  - see it when you can, it's 50 minutes with lots of great perspectives

---
class: dark

# Series inspiration

[Jenny Bryan](https://twitter.com/JennyBryan)'s blogpost on [workflow vs. scripts](https://www.tidyverse.org/blog/2017/12/workflow-vs-script/) in R
> If the first line of your R script is  
> `setwd("C:\Users\jenny\path\that\only\I\have")`  
> I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.  
> 
> If the first line of your R script is  
> `rm(list = ls())`  
> I will come into your office and SET YOUR COMPUTER ON FIRE 🔥.

---
class: dark, middle

&nbsp; _Part I_   &nbsp; RStudio projects  
&nbsp; _Part II_  &nbsp; Organising your files and workflow  
&nbsp; _Part III_ &nbsp; Package / Library management  
&nbsp; _Part IV_  &nbsp; git & GitHub crash course!

---

## Why do we care about project management?

.pull-left[
**Portability**  
_The ability to move the project without breaking code or needing adapting_
- you will change computers
- you will reorganise your file structure
- you will share your code with others
]

.pull-right[
**Reproducibility**  
_The ability to rerun the entire process from scratch_
- not just for reviews
- not just for best-practice science
- also for future (or even present) you
- for your collaborators/helpers
]

---

## Project workflows

- All necessary files should be contained in the project and referenced relatively

- All necessary outputs are created by code in the project and stored in the project

]

- All code can be run in fresh sessions and produce the same output

- Does not force other users to alter their own work setup
]

---

## Folder structure

.pull-left[
- **data**
  - all raw data files, organised in meaningful ways
  - never, ever write _back_ to this folder, read only
  - if using git, never commit to history, place in `.gitignore`
- **results/outputs**
  - write all analysis etc. results to
  - treat as disposable, can be overwritten
  - may also include figures etc if wanted
]

.pull-right[
- **docs**
  - documentation
  - Rmarkdown files
- **R (advanced)**
  - if you write functions that are used in several places
  - this is the standard R folder for keeping these files that might be called with `source()` calls
- **scripts/analysis**
  - files with full analysis pipelines
  - might have `source` calls to files in `R/`
]

---

## Files

.pull-left[
- **README.Rmd/README.md**
  - markdown file describing the project content and intent
  - maybe also explains which files to look in for what
  - ideal if storing on GitHub
  - `usethis::use_readme_rmd()`
- **DESCRIPTION**
  - R specific file with information about the project
  - can be created with `lcbcr::create_paper_project()` or
  with the project wizard from the [lcbcr-package](https://github.com/LCBC-UiO/lcbcr)
]

.pull-right[
**if you plan on putting it online (GitHub or equivalent)**
- **LICENCE**
  - dictates how project content can be reused
  - ask me at need, or check out [Choose a license](https://choosealicense.com/) 
- **CITATION**
  - gives users/readers a clear instruction on how to cite if they use the project content
  - ask me at need, or read on [GitHub docs](https://docs.github.com/en/repositories/managing-your-repositorys-settings-and-features/customizing-your-repository/about-citation-files)
]

---

## File naming

.pull-left[
Organising files in `data/`, `results/`, `docs/`, and `scripts/` require some ideas of how to name files for:
- easy machine reading
- easy human reading
- easy understanding of file content
- choosing the correct type of file to store
]

.pull-right[
If you are using the `R/` folder to store R-functions, these might need somewhat different naming conventions than the other folders, as these are functions you can use across the other files.

Here, naming should be particularly thought in terms of **content** rather than structural organisation.
]

---

## File naming - Machine reading

Be consistent, be meticulous.

Some machines are more clever than others, so name files in a way that the "dumbest" of them can deal with.
]

.pull-right[
- **don't use white space**
  - decide on a separator and use consistently
  - recommend the dash `-`
- **use small case letters** 
  - certain machines care about capitalization
- **use numbers smartly** 
  - numbers are awesome to use and can help organise files meaningfully
  - but needs some thinking about before implementing
]

---

## File naming - Machine reading

.pull-left[
- Machines list files in numeric-alphabetic order.
- they'll first list files starting with numbers (ascdendingly)
- then in alphabetic order.

```
1_file.txt
2_file.txt
file_one.txt
file_three.txt
file_two.txt
```
]

```
10_file.txt
1_file.txt
2_file.txt
file_1.txt
file_10.txt
file_2.txt
```
]

---

## File naming - Machine reading

```
01_file.txt
02_file.txt
10_file.txt
file_01.txt
file_02.txt
file_10.txt
```
]

.pull-right[
- using dates in file names may also ensure decent organisation
- but be consistent. Recommend using YYYY-MM-DD formatting

```
13-11-21_initial-submission-results.txt
22-01-03_revised-results.txt
2022-02-28_results.txt
```

vs.

```
2021-11-13_initial-submission-results.txt
2022-02-28_results.txt
2022-03-01_revised-results.txt
```
]

---

## File naming - Machine reading

- Consider using different space separators for different parts of the file name
- This way you can use the file name it self, programatically, if needed

```
2021-11-13_initial_submission_results.txt
2022-02-28_results.txt
2022-03-01_revised_results.txt
```

---

```r
file_names <- c(
  "2021-11-13_initial_submission_results.txt",
  "2022-02-28_results.txt",
  "2022-03-01_revised_results.txt")

stringr::str_split(file_names, "_")
```

```
## [[1]]
## [1] "2021-11-13"  "initial"     "submission"  "results.txt"
## 
## [[2]]
## [1] "2022-02-28"  "results.txt"
## 
## [[3]]
## [1] "2022-03-01"  "revised"     "results.txt"
```

---

## File naming - Machine reading

- Consider using different space separators for different parts of the file name
- This way you can use the file name it self, programatically, if needed

```
2021-11-13_initial-submission-results.txt
2022-02-28_results.txt
2022-03-01_revised-results.txt
```

---

```r
file_names <- c(
  "2021-11-13_initial-submission-results.txt",
  "2022-02-28_results.txt",
  "2022-03-01_revised-results.txt")

stringr::str_split(file_names, "_")
```

```
## [[1]]
## [1] "2021-11-13"                     "initial-submission-results.txt"
## 
## [[2]]
## [1] "2022-02-28"  "results.txt"
## 
## [[3]]
## [1] "2022-03-01"          "revised-results.txt"
```

---

## File naming - Human reading & understanding

Optimising file names for computers is great, but ultimately its us humans that need to choose files to work with.
Naming files in a way that makes the file content obvious (or at least give an idea of content) by the file name is good for such interactions.

.pull-left[
```
2021-11-13_final-results.txt
2022-02-28_finalfinal-results.txt
2022-03-01_finished-results.txt
```
]

.pull-right[
```
2021-11-13_first-submission-results.txt
2022-02-28_revision-round1-results.txt
2022-03-01_revision-round2-results.txt
2022-03-01_revision-round2-no-sex-results.txt
```
]

---

## File naming - Human & machine

common file extensions for tables:
- `.tab`/`.tsv` : tabulation separated (`\t`)
- `.csv` - comma (`,`) or semicolon (`;`) separated
- `.dat` - fixed-width (quite uncommon type)
- `.txt` - space (` `) separated
]

.pull-right[
Other formats can be saved on need.
From R, we can sometimes want to save entire
R-objects or environments as `.rda`, `.rds` or `.RData`

**But be careful!** If you are saving output from a statistical model,
the R-object also contains the data used, and you will have
exposed raw data!
]

---

## File naming - Human & machine

- `.png` - supports transparency and has no quality loss upon re-saving
- `.svg` - can rescale to infinity without getting grainy
- `.jpg` - best for photos, quality loss on rescale, blurry edges and poor text rendering
]

.pull-right[
Images can also some times be saved in pdf, but pdf while a vector format, cannot support transparency.

Tiff has fallen out of favour due to high file sizes, but are preferable to jpeg for photos.
]

---
background-image: url(montreal2.png)

---

---
background-image: url("https://drmowinckels.io/about/profile.png")
background-position: 95% center
background-size: 50% auto
class: middle, dark

[<i class="fab fa-twitter"></i>  @DrMowinckels](https://twitter.com/DrMowinckels)  
[<i class="fab fa-mastodon"></i> @Drmowinckels@fosstodon.org](https://fosstodon.org/@Drmowinckels)  
[<i class="fab fa-github" aria-hidden="true"></i>  drmowinckels](https://github.com/drmowinckels)  
[<i class="fa fa-globe" aria-hidden="true"></i>  drmowinckels.io/](https://drmowinckels.io/)

- Staff scientist / Research Software Engineer   
- PhD in cognitive psychology  
- Software Carpentry & Posit Tidyverse Instructor   
- [R-Ladies Leadership](www.rladies.org)
]