Good Practices

Last updated on 2025-04-15

Overview

Questions

What good practices can help us develop reproducible, reusable and computationally correct R code?

Objectives

After completing this episode, participants should be able to:

Identify some good practices that help us develop reproducible, reusable and computationally correct R code
Explain how these practices can support reproducibility

Tools and good practices

There are various tools and practices that support the development of reproducible, reusable and computationally correct R code. In later episodes we will describe these tools and practices in more detail.

Coding conventions

Following coding conventions and style guides that are agreed upon by the community is an important practice: it makes it easier for others to read your R code and to reuse or extend it in their own examples and applications.

For R, some key resources include:

The tidyverse style guide, for consistent naming conventions, indentation, and code structure. This guide is especially useful if you’re working with packages like ggplot2, dplyr, and tidyr.
styler - An R package that helps you automatically format your code according to a style guide.
lintr - An R package that checks your code for style issues and syntax errors.
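
As a rough illustration of what these two packages do, the sketch below formats and then checks a deliberately untidy snippet. It assumes that styler and lintr are installed; the contents of `messy_code` are made up for this example.

```r
# Assumes styler and lintr are installed:
# install.packages(c("styler", "lintr"))

# A deliberately untidy, made-up snippet
messy_code <- "total_hours=sum( c(0.6,2.1) )"

# Reformat the snippet using styler's default (tidyverse) style
styler::style_text(messy_code)

# Report style issues in the same snippet without changing it
lintr::lint(text = messy_code)
```

For whole files, `styler::style_file()` and `lintr::lint()` accept a file path instead of a text snippet.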

Project Structure (`rrtools` Research compendia)

A well-structured project is essential for ensuring that your R code is reproducible, maintainable, and easy to share when you are ready to do so.

Using a research compendium provides an organized directory structure that makes it easy to manage your R code, data, documentation, and results.

A typical R research compendium structure might look like this:

.
├── CONDUCT.md
├── CONTRIBUTING.md
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── R
│   └── process-data.R
├── README.Rmd
├── README.md
├── analysis
│   ├── data
│   │   ├── DO-NOT-EDIT-ANY-FILES-IN-HERE-BY-HAND
│   │   ├── derived_data
│   │   └── raw_data
│   │       └── gillespie.csv
│   ├── figures
│   ├── paper
│   │   ├── elsarticle.cls
│   │   ├── mybibfile.bib
│   │   ├── numcompress.sty
│   │   ├── paper.Rmd
│   │   ├── paper.fff
│   │   ├── paper.pdf
│   │   ├── paper.spl
│   │   ├── paper.tex
│   │   ├── paper_files
│   │   │   └── figure-latex
│   │   │       └── figure1-1.pdf
│   │   └── refs.bib
│   └── templates
│       ├── journal-of-archaeological-science.csl
│       ├── template.Rmd
│       └── template.docx
├── inst
│   └── testdata
│       └── gillespie.csv
├── man
│   └── recode_system_size.Rd
├── rrcompendium.Rproj
└── tests
    ├── testthat
    │   └── test-process-data.R
    └── testthat.R

Reproduced from “Reproducible Research with rrtools” by Anna Krystalli, licensed under CC BY 4.0.

Using rrtools or similar packages helps you to automatically set up this structure, which makes it easier for others to navigate your project.
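
To make the idea concrete, here is a hand-rolled sketch that creates a small part of such a structure with base R only. It is not what rrtools itself runs; the project name `mycompendium` is hypothetical, and everything is created in a temporary directory so nothing in your own project is touched.

```r
# Hand-rolled sketch of a compendium-style skeleton (base R only).
project_root <- file.path(tempdir(), "mycompendium")  # hypothetical project name

folders <- c(
  "R",                                            # reusable functions
  file.path("analysis", "data", "raw_data"),      # original data, never edited by hand
  file.path("analysis", "data", "derived_data"),  # data produced by the code
  file.path("analysis", "figures"),
  file.path("analysis", "paper"),
  file.path("tests", "testthat")
)

for (folder in folders) {
  dir.create(file.path(project_root, folder), recursive = TRUE, showWarnings = FALSE)
}

# Empty top-level files that describe the project
invisible(file.create(file.path(project_root, c("README.md", "LICENSE", "DESCRIPTION"))))

# List what was created
list.files(project_root, recursive = TRUE, include.dirs = TRUE)
```

With rrtools installed, `rrtools::use_compendium()` and `rrtools::use_analysis()` generate the fuller structure shown in the tree above.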

Code testing (`testthat`)

Testing helps to ensure that your code is correct and does what it is set out to do. When you write code you often feel very confident that it is perfect, but in larger programs, or in code that performs complex operations, it is very hard to consider all possible edge cases or notice every typing mistake. Testing also gives other people confidence in your code as they can see an example of how it is meant to run and be assured that it does work correctly on their machine, which helps with code understanding and reusability.
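
A minimal sketch of what such a test can look like with testthat is shown below. The helper `duration_to_hours()` is hypothetical (it is not part of the course code) and assumes durations are strings in "H:MM" format, as in the dataset used later in this episode.

```r
library(testthat)  # assumes testthat is installed

# Hypothetical helper: convert a duration string "H:MM" to decimal hours
duration_to_hours <- function(duration) {
  parts <- strsplit(duration, ":", fixed = TRUE)[[1]]
  as.numeric(parts[1]) + as.numeric(parts[2]) / 60
}

test_that("duration strings are converted to decimal hours", {
  expect_equal(duration_to_hours("0:36"), 0.6)
  expect_equal(duration_to_hours("2:30"), 2.5)
  expect_equal(duration_to_hours("0:00"), 0)  # edge case: zero-length activity
})
```

Each `expect_equal()` states one expectation; if any of them fails, testthat reports which one and why.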

Code- and project- level documentation (`roxygen2`, `pkgdown`)

Documentation comes in many forms. Code-level documentation includes descriptive names for variables and functions, and comments that explain lines of your code. Project-level documentation (README, LICENSE, CITATION and similar files) explains the legal terms of reuse, describes what the code does, and shows how to install and run it. Larger projects may add whole websites of documentation with function definitions, usage examples, tutorials and guides. You may not need as much documentation as a large commercial software product, but making your code reusable relies on other people being able to understand what your code does and how to use it.
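
As an illustration of code-level documentation, the sketch below documents a small function with roxygen2-style comments. The function `extract_year()` is hypothetical and assumes dates are ISO 8601 strings that start with a four-digit year.

```r
#' Extract the year from a date string
#'
#' @param date_string A character vector of ISO 8601 timestamps,
#'   for example "1969-07-20T00:00:00.000".
#' @return An integer vector of years.
#' @examples
#' extract_year("1969-07-20T00:00:00.000")
extract_year <- function(date_string) {
  # The first four characters of an ISO 8601 timestamp are the year
  as.integer(substr(date_string, 1, 4))
}

extract_year("1969-07-20T00:00:00.000")
```

Inside an R package, `roxygen2::roxygenise()` turns comments like these into help files under `man/`, and `pkgdown::build_site()` can build a documentation website from them.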

Code licensing

A licence is a legal document which sets down the terms under which the creator of work (such as written text, photographs, films, music, software code) is releasing what they have created for others to use, modify, extend or exploit. It is important to state the terms under which software can be reused - the lack of a licence for your code implies that no one can reuse the software at all.

A common way to declare your copyright of a piece of software and the licence you are distributing it under is to include a file called LICENSE in the root directory of your code project folder or repository.

Code citation

We should add citation instructions to our project README, or add a CITATION file to our project, to explain how and when to cite our code. A citation file can be plain text (CITATION.txt) or Markdown (CITATION.md), but there are benefits to using a special file format called the Citation File Format (CFF), which provides a way to include richer metadata about the code (or datasets) we want cited, making it easy for both humans and machines to use this information.

Managing dependencies (`renv`)

Managing dependencies is essential for ensuring that your R code runs consistently across different environments. renv is a powerful R package that helps you manage the libraries your project relies on, ensuring that the exact versions of packages are used when your project is shared or run in the future.
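
The sketch below illustrates the underlying idea using base R only: record which package versions are in use. The helper `record_versions()` is hypothetical and is not how renv works internally; the usual renv calls are listed in comments because running them modifies the project.

```r
# Illustration of the idea only: record which package versions are in use.
record_versions <- function(packages) {
  versions <- vapply(
    packages,
    function(pkg) as.character(utils::packageVersion(pkg)),
    character(1)
  )
  data.frame(
    package = packages,
    version = versions,
    row.names = NULL,
    stringsAsFactors = FALSE
  )
}

# "stats" and "utils" ship with R, so this runs on any installation
record_versions(c("stats", "utils"))

# Typical renv workflow, commented out because it changes the project:
# renv::init()      # set up a project-specific library and a lockfile (renv.lock)
# renv::snapshot()  # write the package versions in use to renv.lock
# renv::restore()   # reinstall the versions recorded in renv.lock
```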

Code and data used in this course

We are going to follow a fairly typical experience of a new PhD student or postdoc joining a research group. They were emailed some data and analysis code bundled in a .zip archive and written by another group member who worked on similar things but has since left the group. They need to be able to install and run this code on their machine, check they can understand it and then adapt it to their own project.

As part of the setup for this course, you should have downloaded a .zip archive containing the software project the new research team member was given. Let’s unzip this archive and inspect its content in RStudio. The software project contains:

a JSON file (data.json) with data on extra-vehicular activities (EVAs or spacewalks) undertaken by astronauts and cosmonauts from 1965 to 2013 (data provided by NASA via its Open Data Portal). The first few lines are:

JSON

[{"eva": "1", "country": "USA", "crew": "Ed White;", "vehicle": "Gemini IV", "date": "1965-06-03T00:00:00.000", "duration": "0:36", "purpose": "First U.S. EVA. Used HHMU and took  photos.  Gas flow cooling of 25ft umbilical overwhelmed by vehicle ingress work and helmet fogged.  Lost overglove.  Jettisoned thermal gloves and helmet sun visor"}
,{"eva": "2", "country": "USA", "crew": "David Scott;", "vehicle": "Gemini VIII", "duration": "0:00", "purpose": "HHMU EVA cancelled before starting by stuck on vehicle thruster that ended mission early"}
,{"eva": "3", "country": "USA", "crew": "Eugene Cernan;", "vehicle": "Gemini IX-A", "date": "1966-06-05T00:00:00.000", "duration": "2:07", "purpose": "Inadequate restraints, stiff 25ft umbilical and high workloads exceeded suit vent loop cooling capacity and caused fogging.  Demo called off of tethered astronaut maneuvering unit"}
]

Let’s have a closer look at one record to understand the dataset a little better:

JSON

[
...
{
  "eva": "13",
  "country": "USA",
  "crew": "Neil Armstrong;Buzz Aldrin;",
  "vehicle": "Apollo 11",
  "date": "1969-07-20T00:00:00.000",
  "duration": "2:32",
  "purpose": "First to walk on the moon.  Some trouble getting out small hatch.  46.3 lb of geologic material collected.  EASEP seismograph and laser reflector exp deployed.  Solar wind exp deployed & retrieved.  400 ft (120m) circuit on foot.  Dust issue post EVA"
}
...
]

an R script (my_code_v2.R) containing some analysis. The first few lines are:

R

# https://data.nasa.gov/resource/eva.json (with modifications)

# File paths
data_f <- "/home/sarah/Projects/ssi-ukrn-fair-course/data.json"
data_t <- "/home/sarah/Projects/ssi-ukrn-fair-course/data.csv"
g_file <- "myplot.png"

fieldnames <- c("eva", "country", "crew", "vehicle", "date", "duration", "purpose")

data <- list()
data_raw <- readLines(data_f, warn = FALSE)

# 374
library(jsonlite)
for (i in 1:374) {
  line <- data_raw[i]
  print(line)
  data[[i]] <- fromJSON(substr(line, 2, nchar(line)))
}

# Initialize empty vectors
time <- c()
dates <- c()
years <- c()

j <- 1
w <- 0
for (i in data) {  # Iterate manually

  if ("duration" %in% names(data[[j]])) {
    tt <- data[[j]]$duration

    if (tt == "") {
      # Do nothing if empty
    } else {
      t_parts <- strsplit(tt, ":")[[1]]
      ttt <- as.numeric(t_parts[1]) + as.numeric(t_parts[2]) / 60  # Convert to hours
      print(ttt)

...

The code in the R script does some common research tasks:

Read in the data from the JSON file
Change the data from one data format to another and save to a file in the new format (CSV)
Make a plot to visualise the data

Let’s have a critical look at this code and think about how easy it is to reproduce the outputs of this project.

Barriers to Reproducibility ¹

Look at the code in RStudio:

Can you rerun the code in R?

(Hint: what changes do you need to make to get the code to run?)

Are the results of the analysis repeatable?
Are there any barriers that would prevent the results generated by the code from being reproduced by someone else in the future?

Give me a hint

Here are some questions to help you identify barriers to reproducibility in the code:

Code readability - Is the code easy to understand? Are there clear comments explaining what each part of the code does?
Reusability - Can you easily modify the code to run on different data or compute a different result?
Environment & dependencies - Does the code specify what tools or packages (e.g., R libraries, specific versions) need to be installed to run it? Is it clear which version of the code was used in the analysis?

Show me the solution

The code fails with errors because the file paths to the input and output data are specific to the author’s computer and not available on our computers:

ERROR

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file '/home/sarah/Projects/ssi-ukrn-fair-course/data.json': No such file or directory

You need to change the data input/output file paths to get the code to run:

data_f <- "data.json"
data_t <- "data.csv"
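
Relative paths only work when the working directory is the project folder, so it can help to fail early with a clear message when the input is missing. A minimal sketch, assuming a hypothetical helper `check_input()` that is not part of the course code:

```r
# Hypothetical helper: stop with a clear message if the input file is missing
check_input <- function(path) {
  if (!file.exists(path)) {
    stop(
      "Input file not found: ", path,
      " (working directory: ", getwd(), ")"
    )
  }
  invisible(path)
}

# Demonstrate the message using a file that does not exist
result <- tryCatch(
  check_input(file.path(tempdir(), "no-such-file.json")),
  error = function(e) conditionMessage(e)
)
print(result)
```

In the script itself, the call would be `check_input(data_f)` placed just before the data is read.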

The results of the analysis are not repeatable because the code produces different output depending on whether it is run for the first time in our RStudio session or not.


In the RStudio menus, select Session > Restart R and run the code once using "Source with Echo":
```output
> print(ct[length(ct)])
[1] 1840.2

```

Now run the code again:
```output
> print(ct[length(ct)])
[1] 1951.2
```

The two results differ by exactly 111, the value that `ct` is incorrectly reset to here:
```r
ggplot(tdf, aes(x = years, y = ct)) + geom_line(color = "black") + geom_point(color = "black") +
    labs( x = "Year", y = "Total time spent in space to date (hours)", title = "Cumulative Spacewalk Time" ) +
    theme_minimal() ; ct <- c(111)

```
Let's correct this as follows:
```r
ggplot(tdf, aes(x = years, y = ct)) + geom_line(color = "black") + geom_point(color = "black") +
    labs( x = "Year", y = "Total time spent in space to date (hours)", title = "Cumulative Spacewalk Time" ) +
    theme_minimal()

ct <- c(0)

```
The code should now be repeatable.
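
More generally, results stay repeatable when a calculation does not depend on values left over from an earlier run. The sketch below shows one way to arrange that, by computing the running total inside a function; the durations and the function name `cumulative_time()` are made up for illustration and are not taken from the course script.

```r
# Hypothetical durations in hours, for illustration only
durations <- c(0.6, 2.1, 2.5)

# All state is created inside the function, so nothing left over
# from a previous run can change the result
cumulative_time <- function(hours) {
  cumsum(hours)
}

first_run <- cumulative_time(durations)
second_run <- cumulative_time(durations)

identical(first_run, second_run)  # TRUE: both runs give the same totals
```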

Barriers to reproducibility include:
- The code lacks clear comments, and the variable names and file names are not descriptive. It is hard to determine the purpose of the code or how it works. This may hinder another researcher’s ability to get the code running.
- The code does not explicitly specify which third-party packages need to be installed to run it. There are library() statements in the code, but these are scattered throughout rather than collected in a single place where they can be easily identified. We don’t know which versions of the packages were used in the analysis.
- It is really difficult to understand what the code does and how it does it. This makes it hard to modify the code to run on a different dataset or plot another facet of the data.

Attribution

This episode is a remix of the episodes “FAIR research software” (Section: Software and data used in this course) and “Tools and Practices for FAIR research software development” from the Software Carpentries Incubator course “Tools and practices of FAIR research software”, used under a CC-BY-4.0 licence with modifications.

The material has been edited to target an audience of R users and to focus on reproducibility, correctness and reuse as end goals rather than FAIR (findability, accessibility, interoperability and reusability). Consequently, several tools considered in the original course, e.g. persistent identifiers, have been omitted. The section “Code and data used in this course” has been adapted to reflect the R version of the spacewalks repository used in this course.

Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R focussed content. Some original material has been added – this is marked with a footnote ².

¹ Original material.
² Original material.
