Content from Course introduction


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • What is reproducible research?
  • Why is it important?

Objectives

After completing this episode, participants should be able to:

  • Understand the concept of reproducible research

Jargon busting


Before we start, let's cover the terminology and explain the terms, phrases, and concepts associated with software development in reproducible research that we will use throughout this course.

  • Reproducibility - the ability to be reproduced or copied; the extent to which consistent results are obtained when an experiment is repeated (definition from Google’s English dictionary provided by Oxford Languages)
  • Computational reproducibility - obtaining consistent results using the same input data, computational methods (code), and conditions of analysis; work that can be independently recreated from the same data and the same code (definition by the Turing Way’s “Guide to Reproducible Research”)
  • Reproducible research - the idea that scientific results should be documented in such a way that their deduction is fully transparent (definition from Wikipedia)
  • Open research - research that is openly accessible by others; concerned with making research more transparent, more collaborative, more wide-reaching, and more efficient (definition from Wikipedia)
  • FAIR - an acronym that stands for Findable, Accessible, Interoperable, and Reusable
  • Sustainable software development - software development practice that takes into account longevity and maintainability of code (e.g. beyond the lifetime of the project), environmental impact, societal responsibility and ethics in our software practices.

Computational reproducibility

In this course, we use the term “reproducibility” as a synonym for “computational reproducibility”.

What is reproducible research?


The Turing Way’s “Guide to Reproducible Research” provides an excellent overview of definitions of “reproducibility” and “replicability” found in literature, and their different aspects and levels.

In this course, we adopt the Turing Way’s definitions:

  • Reproducible research: a result is reproducible when the same analysis steps performed on the same data consistently produce the same answer.
    • For example, two different people drop a pen 10 times each and every time it falls to the floor. Or, we run the same code multiple times on different machines and each time it produces the same result.
  • Replicable research: a result is replicable when the same analysis performed on different data produces qualitatively similar answers.
    • For example, instead of a pen, we drop a pencil, and it also falls to the floor. Or, we collect two different datasets as part of two different studies and run the same code over these datasets with the same result each time.
  • Robust research: a result is robust when the same data is subjected to different analysis workflows to answer the same research question and a qualitatively similar or identical answer is produced.
    • For example, I lend you my pen and you drop it out the window, and it still falls to the floor. Or we run the same analysis implemented in both Python and R over the same data and it produces the same result.
  • Generalisable research: combining replicable and robust findings allows us to form generalisable results that are broadly applicable to different types of data or contexts.
    • For example, everything we drop falls; therefore gravity exists.
[Figure: four cartoon images depicting two researchers at two machines which take in data and output the same landscape image in four different ways, visually describing the four scenarios listed above. The Turing Way project illustration of aspects of reproducible research by Scriberia, used under a CC-BY 4.0 licence, DOI: 10.5281/zenodo.3332807]

In this course we mainly address the aspect of reproducibility - i.e. enabling others to run our code to obtain the same results.

The reproducibility spectrum

As with most things, reproducibility is non-binary. Having access to code and data is a good start, but what about describing the specific version of each piece of software and each package used? Do you need to describe the operating system? What about the architecture of the CPU?

How exact do results need to be to count as successfully reproduced? Do we expect results that are identical bit by bit, or do we allow for small changes in non-significant digits and cosmetic variation in figures, such as fonts or colours?

For these reasons, instead of saying that a research project is reproducible or not, it is often more helpful to say that a research project is more or less reproducible, or harder or easier to reproduce. A project that publishes code and data but not software versions is probably harder to reproduce than one that provides the virtual machine in which the code was run.

Why do reproducible research?


Scientific transparency and rigor are key factors in research. Scientific methodology and results need to be published openly and replicated and confirmed by several independent parties. However, research papers often lack the full details required for independent reproduction or replication. Many attempts at reproducing or replicating the results of scientific studies have failed in a variety of disciplines ranging from psychology (The Open Science Collaboration (2015)) to cancer sciences (Errington et al (2021)). These are called the reproducibility and replicability crises - ongoing methodological crises in which the results of many scientific studies are difficult or impossible to repeat.

Reproducible research is a practice that ensures that researchers can repeat the same analysis multiple times with the same results. It offers many benefits to those who practice it:

  • Reproducible research helps researchers remember how and why they performed specific tasks and analyses; this enables easier explanation of work to collaborators and reviewers.
  • Reproducible research enables researchers to quickly modify analyses and figures - this is often required at all stages of research and automating this process saves loads of time.
  • Reproducible research enables reusability of previously conducted tasks so that new projects that require the same or similar tasks become much easier and efficient by reusing or reconfiguring previous work.
  • Reproducible research supports researchers’ career development by facilitating the reuse and citation of all research outputs - including both code and data.
  • Reproducible research is a strong indicator of rigor, trustworthiness, and transparency in scientific research. This can increase the quality and speed of peer review, because reviewers can directly access the analytical process described in a manuscript. It increases the probability that errors are caught early on - by collaborators or during the peer-review process, helping alleviate the reproducibility crisis.

Other Considerations


An important consideration is that reproducible results are not necessarily scientifically, statistically or computationally correct - incorrect results can be perfectly reproducible.1

As researchers, we often also want our work to be “computationally correct” and “reusable” - that is, the code we write should do what we think it does, and we want to be able to use our code in future projects or related tasks without having to rewrite it from scratch.2

Tools and Practices


Developing high quality, reusable and reproducible research code often requires that researchers adopt new practices.

This course teaches good practices and reproducible working methods for those working with R and aims to provide researchers with the tools and knowledge to feel confident when writing good quality and sustainable code to support their research.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Acknowledgements and references


The content of this course borrows from or references various work.

Attribution


This episode reuses material from the “Introduction” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 licence with modifications: (i) minor edits have been made to make the material suitable for an audience of R users (e.g. replacing “software” with “code” in places); (ii) the section “Software in research and research software” from the original source has been renamed “Tools and Practices” and shortened; (iii) objectives, questions, key points and further reading have been edited to focus the material on reproducibility rather than FAIR; (iv) some original material has been introduced to maintain flow and is indicated by a footnoted reference; (v) the callout “The reproducibility spectrum” has been added verbatim from the lesson “What is reproducibility anyway” (under CC-BY-4.0) from “An R reproducibility toolkit for the practical researcher” by Elio Campitelli and Paola Corrales.


  1. From “An R reproducibility toolkit for the practical researcher” by Elio Campitelli and Paola Corrales.↩︎

  2. Original material.↩︎

Content from Good Practices


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • What good practices can help us develop reproducible, reusable and computationally correct R code?

Objectives

After completing this episode, participants should be able to:

  • Identify some good practices that help us develop reproducible, reusable and computationally correct R code
  • Explain how these practices can support reproducibility

Tools and good practices


There are various tools and practices that support the development of reproducible, reusable and computationally correct R code. In later episodes we will describe these tools and practices in more detail.

Coding conventions

Following coding conventions and guides for your R code that are agreed upon by the community and other programmers is an important practice to ensure that others find it easy to read your code and to reuse or extend it in their own examples and applications.

For R, some key resources include:

  • The tidyverse style guide, for consistent naming conventions, indentation, and code structure. This guide is especially useful if you’re working with packages like ggplot2, dplyr, and tidyr.
  • styler - An R package that helps you automatically format your code according to a style guide.
  • lintr - An R package that checks your code for style issues and syntax errors.
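
For illustration, here is a minimal sketch of how these two packages might be applied to a script (the file name shown is just an example):

R

# Install the packages once, then apply them to a script.
install.packages(c("styler", "lintr"))

styler::style_file("eva_data_analysis.R")  # reformat the script in place according to a style guide
lintr::lint("eva_data_analysis.R")         # report style issues and potential problems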

Project Structure (rrtools Research compendia)

A well-structured project is essential for ensuring that your R code is reproducible, maintainable, and easy to share when you are ready to do so.

Using a research compendium provides an organized directory structure that makes it easy to manage your R code, data, documentation, and results.

A typical R research compendium structure might look like this:

.
├── CONDUCT.md
├── CONTRIBUTING.md
├── DESCRIPTION
├── LICENSE
├── LICENSE.md
├── NAMESPACE
├── R
│   └── process-data.R
├── README.Rmd
├── README.md
├── analysis
│   ├── data
│   │   ├── DO-NOT-EDIT-ANY-FILES-IN-HERE-BY-HAND
│   │   ├── derived_data
│   │   └── raw_data
│   │       └── gillespie.csv
│   ├── figures
│   ├── paper
│   │   ├── elsarticle.cls
│   │   ├── mybibfile.bib
│   │   ├── numcompress.sty
│   │   ├── paper.Rmd
│   │   ├── paper.fff
│   │   ├── paper.pdf
│   │   ├── paper.spl
│   │   ├── paper.tex
│   │   ├── paper_files
│   │   │   └── figure-latex
│   │   │       └── figure1-1.pdf
│   │   └── refs.bib
│   └── templates
│       ├── journal-of-archaeological-science.csl
│       ├── template.Rmd
│       └── template.docx
├── inst
│   └── testdata
│       └── gillespie.csv
├── man
│   └── recode_system_size.Rd
├── rrcompendium.Rproj
└── tests
    ├── testthat
    │   └── test-process-data.R
    └── testthat.R

Reproduced from “Reproducible Research with rrtools” by Anna Krystalli, licensed under CC BY 4.0.

Using rrtools or similar packages helps you to automatically set up this structure, which makes it easier for others to navigate your project.
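
As a rough sketch (rrtools is typically installed from GitHub; the compendium name "spacewalks" below is just an illustration), the setup might look like:

R

# install.packages("remotes")
remotes::install_github("benmarwick/rrtools")

rrtools::use_compendium("spacewalks")  # create a new compendium (an R package skeleton)
rrtools::use_analysis()                # add the analysis/ directory with data, figures and paper folders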

Code testing (testthat)

Testing ensures that your code is correct and does what it sets out to do. When you write code you often feel very confident that it is perfect, but when writing larger programs or code that performs complex operations it is very hard to consider all possible edge cases or notice every single typing mistake. Testing also gives other people confidence in your code as they can see an example of how it is meant to run and be assured that it does work correctly on their machine - helping with code understanding and reusability.
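
As a brief illustration, a testthat test might look like the sketch below, which checks a small helper similar to the text_to_duration function we will write later in the course:

R

library(testthat)

# A small helper that converts an "HH:MM" string into hours (illustrative sketch)
text_to_duration <- function(duration) {
  parts <- as.numeric(strsplit(duration, ":")[[1]])
  parts[1] + parts[2] / 60
}

test_that("text_to_duration converts HH:MM to hours", {
  expect_equal(text_to_duration("01:30"), 1.5)
  expect_equal(text_to_duration("00:00"), 0)
})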

Code- and project- level documentation (roxygen2, pkgdown)

Documentation comes in many forms - from code-level documentation, including descriptive names of variables and functions and additional comments that explain lines of your code; to project-level documentation (including README, LICENCE, CITATION, etc. files) that explains the legal terms of reusing the code, describes its functionality, and explains how to install and run it; to whole websites full of documentation with function definitions, usage examples, tutorials and guides. You may not need as much documentation as a large commercial software product, but making your code reusable relies on other people being able to understand what your code does and how to use it.

Code licensing

A licence is a legal document which sets down the terms under which the creator of work (such as written text, photographs, films, music, software code) is releasing what they have created for others to use, modify, extend or exploit. It is important to state the terms under which software can be reused - the lack of a licence for your code implies that no one can reuse the software at all.

A common way to declare your copyright of a piece of software and the license you are distributing it under is to include a file called LICENSE in the root directory of your code project folder / repository.
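
If your project is set up as an R package or compendium, helper functions can create this file for you - for example (a sketch; the licence choice and copyright holder shown are placeholders):

R

usethis::use_mit_license("Jane Doe")  # creates LICENSE and LICENSE.md in the project root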

Code citation

We should add citation instructions to our project README or a CITATION file to provide instructions on how and when to cite our code. A citation file can be a plain text (CITATION.txt) or a Markdown file (CITATION.md), but there are certain benefits to using a special file format called the Citation File Format (CFF), which provides a way to include richer metadata about the code (or datasets) we want to cite, making it easy for both humans and machines to use this information.
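
For illustration, a minimal CITATION.cff file might look something like the sketch below (all values shown are placeholders):

YAML

cff-version: 1.2.0
message: "If you use this software, please cite it as below."
title: "Spacewalks analysis code"
authors:
  - family-names: Doe
    given-names: Jane
version: 0.1.0
date-released: "2025-01-01"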

Managing dependencies (renv)

Managing dependencies is essential for ensuring that your R code runs consistently across different environments. renv is a powerful R package that helps you manage the libraries your project relies on, ensuring that the exact versions of packages are used when your project is shared or run in the future.
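
A typical renv workflow (a sketch) looks like this:

R

renv::init()      # set up a project-specific library and a lockfile (renv.lock)
# ...install and use packages as normal...
renv::snapshot()  # record the exact package versions in renv.lock
renv::restore()   # later, or on another machine, reinstall those recorded versions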

Code and data used in this course


We are going to follow a fairly typical experience of a new PhD or postdoc joining a research group. They were emailed some data and analysis code bundled in a .zip archive and written by another group member who worked on similar things but has since left the group. They need to be able to install and run this code on their machine, check they can understand it and then adapt it to their own project.

As part of the setup for this course, you should have downloaded a .zip archive containing the software project the new research team member was given. Let’s unzip this archive and inspect its content in RStudio. The software project contains:

  1. a JSON file (data.json) - a snippet of which is shown below - with data on extra-vehicular activities (EVAs or spacewalks) undertaken by astronauts and cosmonauts from 1965 to 2013 (data provided by NASA via its Open Data Portal). The first few lines are:

JSON

[{"eva": "1", "country": "USA", "crew": "Ed White;", "vehicle": "Gemini IV", "date": "1965-06-03T00:00:00.000", "duration": "0:36", "purpose": "First U.S. EVA. Used HHMU and took  photos.  Gas flow cooling of 25ft umbilical overwhelmed by vehicle ingress work and helmet fogged.  Lost overglove.  Jettisoned thermal gloves and helmet sun visor"}
,{"eva": "2", "country": "USA", "crew": "David Scott;", "vehicle": "Gemini VIII", "duration": "0:00", "purpose": "HHMU EVA cancelled before starting by stuck on vehicle thruster that ended mission early"}
,{"eva": "3", "country": "USA", "crew": "Eugene Cernan;", "vehicle": "Gemini IX-A", "date": "1966-06-05T00:00:00.000", "duration": "2:07", "purpose": "Inadequate restraints, stiff 25ft umbilical and high workloads exceeded suit vent loop cooling capacity and caused fogging.  Demo called off of tethered astronaut maneuvering unit"}
]

Let’s have a closer look at one record to understand the dataset a little better:

JSON

[
...
{
  "eva": "13",
  "country": "USA",
  "crew": "Neil Armstrong;Buzz Aldrin;",
  "vehicle": "Apollo 11",
  "date": "1969-07-20T00:00:00.000",
  "duration": "2:32",
  "purpose": "First to walk on the moon.  Some trouble getting out small hatch.  46.3 lb of geologic material collected.  EASEP seismograph and laser reflector exp deployed.  Solar wind exp deployed & retrieved.  400 ft (120m) circuit on foot.  Dust issue post EVA"
}
...
]
  2. an R script (my_code_v2.R) containing some analysis. The first few lines are:

R

# https://data.nasa.gov/resource/eva.json (with modifications)

# File paths
data_f <- "/home/sarah/Projects/ssi-ukrn-fair-course/data.json"
data_t <- "/home/sarah/Projects/ssi-ukrn-fair-course/data.csv"
g_file <- "myplot.png"

fieldnames <- c("eva", "country", "crew", "vehicle", "date", "duration", "purpose")

data <- list()
data_raw <- readLines(data_f, warn = FALSE)

# 374
library(jsonlite)
for (i in 1:374) {
  line <- data_raw[i]
  print(line)
  data[[i]] <- fromJSON(substr(line, 2, nchar(line)))
}

# Initialize empty vectors
time <- c()
dates <- c()
years <- c()

j <- 1
w <- 0
for (i in data) {  # Iterate manually

  if ("duration" %in% names(data[[j]])) {
    tt <- data[[j]]$duration

    if (tt == "") {
      # Do nothing if empty
    } else {
      t_parts <- strsplit(tt, ":")[[1]]
      ttt <- as.numeric(t_parts[1]) + as.numeric(t_parts[2]) / 60  # Convert to hours
      print(ttt)

...

The code in the R script does some common research tasks:

  • Read in the data from the JSON file
  • Change the data from one data format to another and save to a file in the new format (CSV)
  • Make a plot to visualise the data

Let’s have a critical look at this code and think about how easy it is to reproduce the outputs of this project.

Barriers to Reproducibility 1

Look at the code in RStudio:

  1. Can you rerun the code in R?

(Hint: what changes do you need to make to get the code to run?)

  2. Are the results of the analysis repeatable?
  3. Are there any barriers that would prevent the results generated by the code from being reproduced by someone else in the future?

Here are some questions to help you identify barriers to reproducibility in the code:

  • Code readability - Is the code easy to understand? Are there clear comments explaining what each part of the code does?
  • Reusability - Can you easily modify the code to run on different data or compute a different result?
  • Environment & dependencies - Does the code specify what tools or packages (e.g., R libraries, specific versions) need to be installed to run it? Is it clear which version of the code was used in the analysis?
  1. The code fails with errors because the file paths to the input and output data are specific to the author’s computer and not available on our computers:

ERROR

Error in file(con, "r") : cannot open the connection
In addition: Warning message:
In file(con, "r") :
  cannot open file '/home/sarah/Projects/ssi-ukrn-fair-course/data.json': No such file or directory

You need to change the data input/ output file paths to get the code to run:

data_f <- "data.json"
data_t <- "data.csv"

  2. The results of the analysis are not repeatable because the code produces different output depending on whether it is run for the first time in our RStudio session or not.

In the RStudio menus, select Session > Restart R and run the code once using "Source with Echo":

OUTPUT

> print(ct[length(ct)])
[1] 1840.2

Now run the code again:

OUTPUT

> print(ct[length(ct)])
[1] 1951.2

Notice how ct is incorrectly reset to c(111) at the end of the plotting command:

R

ggplot(tdf, aes(x = years, y = ct)) + geom_line(color = "black") + geom_point(color = "black") +
    labs( x = "Year", y = "Total time spent in space to date (hours)", title = "Cumulative Spacewalk Time" ) +
    theme_minimal() ; ct <- c(111)

Let's correct this as follows:

R

ggplot(tdf, aes(x = years, y = ct)) + geom_line(color = "black") + geom_point(color = "black") +
    labs( x = "Year", y = "Total time spent in space to date (hours)", title = "Cumulative Spacewalk Time" ) +
    theme_minimal()

ct <- c(0)
The code should now be repeatable.
  3. Barriers to reproducibility include:
    • The code lacks clear comments, and the variable names and file names are not descriptive. It is hard to determine the purpose of the code or how it works. This may hinder another researcher’s ability to get the code running.
    • The code does not explicitly specify which third-party packages need to be installed to run it. There are library() statements in the code, but these are scattered throughout the code rather than placed in a single location where they can be easily identified. We also don’t know which versions of the packages were used in the analysis.
    • It is really difficult to understand what the code does and how it does it. This makes it hard to modify the code to run on a different dataset or to plot another facet of the data.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Coding conventions ensure your R code is easy to read, reuse, and extend.
  • Research compendia provide an organized directory for your code, data, documentation, and results.
  • Testing helps you check that your code is behaving as expected and will continue to do so in the future or when used by someone else.
  • Documentation is essential for explaining what your code does, how to use it, and the legal terms for reuse.
  • Dependency management helps make your code reproducible across different computational environments.

Attribution


This episode is a remix of the episodes “FAIR research software” (section: Software and data used in this course) and “Tools and Practices for FAIR research software development” from the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 licence with modifications.

The material has been edited to target an audience of R users and to focus on reproducibility, correctness and reuse as end goals rather than FAIR (findability, accessibility, interoperability and reusability). Consequently, several tools considered in the original course (e.g. persistent identifiers) have been omitted. The section Code and data used in this course has been adapted to reflect the R version of the spacewalks repository used in this course.

Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R focussed content. Some original material has been added – this is marked with a footnote 2.


  1. Original material.↩︎

  2. Original material.↩︎

Content from Code readability


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • Why does code readability matter?
  • How can I organise my code to be more readable?
  • What types of documentation can I include to improve the readability of my code?

Objectives

After completing this episode, participants should be able to:

  • Organise R code into reusable functions that achieve a singular purpose
  • Choose function and variable names that help explain the purpose of the function or variable
  • Write informative comments to provide more detail about what the code is doing

In this episode, we will introduce the concept of readable code and consider how it can help create reusable scientific software and empower collaboration between researchers.

When someone writes code, they do so based on requirements that are likely to change in the future. Requirements change because code interacts with the real world, which is dynamic. When these requirements change, the developer (who is not necessarily the same person who wrote the original code) must implement the new requirements. They do this by reading the original code to understand how it currently works and identify what needs to change. Readable code facilitates this process, saving future you and other developers time and effort.

In order to develop readable code, we should ask ourselves: “If I re-read this piece of code in fifteen days or one year, will I be able to understand what I have done and why?” Or even better: “If a new person who just joined the project reads my software, will they be able to understand what I have written here?”

We will now learn about a few best practices we can follow to help create more readable code.

Remove unused files and folders


Let’s start by removing any files or folders that are not needed in our project directory.

BASH

rm -r astronaut-data-analyses-old

Use meaningful file names


Let’s give our analysis script a meaningful name that reflects what it does and do the same for our input and output files.

BASH

mv my_code_v2.R eva_data_analysis.R
mv data.json eva-data.json

Place library statements at the top


Let’s have a look at our code again - the first thing we may notice is that our script currently places library() statements throughout the code. Conventionally, all library() statements are placed at the top of the script so that dependent libraries are clearly visible and not buried inside the code. This will help readability (accessibility) and reusability of our code.

Our code after the modification should look like the following.

R

library(jsonlite)
library(ggplot2)

# https://data.nasa.gov/resource/eva.json (with modifications)
# File paths
data_f <- "eva-data.json"
data_t <- "eva-data.csv"
g_file <- "cumulative_eva_graph.png"

fieldnames <- c("eva", "country", "crew", "vehicle", "date", "duration", "purpose")

data <- list()
data_raw <- readLines(data_f, warn = FALSE)

# 374
for (i in 1:374) {
  line <- data_raw[i]
  print(line)
  data[[i]] <- fromJSON(substr(line, 2, nchar(line)))
}

# Initialize empty vectors
time <- c()
dates <- c()
years <- c()

j <- 1
w <- 0
for (i in data) {  # Iterate manually

  if ("duration" %in% names(data[[j]])) {
    tt <- data[[j]]$duration

    if (tt == "") {
      # Do nothing if empty
    } else {
      t_parts <- strsplit(tt, ":")[[1]]
      ttt <- as.numeric(t_parts[1]) + as.numeric(t_parts[2]) / 60  # Convert to hours
      print(ttt)
      time <- c(time, ttt)

      if (("date" %in% names(data[[j]]) & ("eva" %in% names(data[[j]])))) {
        date <- as.Date(substr(data[[j]]$date, 1, 10), format = "%Y-%m-%d")
        year <- as.numeric(format(date,"%Y"))
        dates <- c(dates, date)
        years <- c(years, year)
        row_data <- as.data.frame(data[[j]])
      } else {
        time <- time[-1]
      }
    }
  }

  ## Comment out this bit if you don't want the spreadsheet
  if (exists("row_data")) {
    print(row_data)
    if (w==0) {
      write.table(row_data, data_t, sep = ",", row.names = FALSE, col.names = TRUE, append = FALSE)
    } else {
      write.table(row_data, data_t, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE)
    }
    w <- w+1
    rm(row_data)
  }

  j <- j + 1
}

if (!exists("ct")) {ct <- c(0)}

for (k in time) {
  ct <- c(ct, ct[length(ct)] + k)
}

sorted_indices <- order(dates)
years <- years[sorted_indices]
time <- time[sorted_indices]

# Print total time in space
print(ct[length(ct)])

tdf <- data.frame(
  years = years,
  ct = ct[-1]
)


# Plot the data
ggplot(tdf, aes(x = years, y = ct)) + geom_line(color = "black") + geom_point(color = "black") +
  labs( x = "Year", y = "Total time spent in space to date (hours)", title = "Cumulative Spacewalk Time" ) + theme_minimal()

# Correction for repeatability
ct <- c(0)

# Save plot
ggsave(g_file, width = 8, height = 6)

Use meaningful variable names


Variables are the most common thing you will assign when coding, and it’s really important that it is clear what each variable means in order to understand what the code is doing. If you return to your code after a long time doing something else, or share your code with a colleague, it should be easy enough to understand what variables are involved in your code from their names. Therefore we need to give them clear names, but we also want to keep them concise so the code stays readable. There are no “hard and fast rules” here, and it’s often a case of using your best judgment.

Some useful tips for naming variables are:

  • Short words are better than single character names. For example, if we were creating a variable to store the speed to read a file, s (for ‘speed’) is not descriptive enough but MBReadPerSecondAverageAfterLastFlushToLog is too long to read and prone to misspellings. ReadSpeed (or read_speed) would suffice.
  • If you are finding it difficult to come up with a variable name that is both short and descriptive, go with the short version and use an inline comment to describe it further (more on those in the next section). This guidance does not necessarily apply if your variable is a well-known constant in your domain - for example, c represents the speed of light in physics.
  • Try to be descriptive where possible and avoid meaningless or funny names like foo, bar, var, thing, etc.

Remember there are some restrictions to consider when naming variables in R - variable names can only contain letters, numbers, underscores and full-stops. They cannot start with a number or contain spaces.
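
For example (an illustrative sketch):

R

read_speed <- 120      # valid: letters, numbers and underscores
read.speed <- 120      # valid: full stops are allowed, though underscores are generally preferred
# 2nd_speed <- 120     # invalid: a name cannot start with a number
# read speed <- 120    # invalid: a name cannot contain spaces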

Give a descriptive name to a variable

Below we have a variable called var being set to the value 9.81. var is not a very descriptive name here as it doesn’t tell us what 9.81 means, yet it is a very common constant in physics! Go online and find out which constant 9.81 relates to and suggest a new name for this variable.

Hint: the units are metres per second squared!

R

var <- 9.81

\[ 9.81 \, m/s^2 \] is the acceleration due to gravity at the Earth’s surface. It is often referred to as “little g” to distinguish it from “big G”, which is the Gravitational Constant. A more descriptive name for this variable therefore might be:

R

g_earth <- 9.81

Rename our variables to be more descriptive

Let’s apply this to eva_data_analysis.R.

  1. Edit the code as follows to use descriptive variable names:

    • Change data_f to input_file
    • Change data_t to output_file
    • Change g_file to graph_file
  2. What other variable names in our code would benefit from renaming?

  1. Updated code:

R

library(jsonlite)
library(ggplot2)

# https://data.nasa.gov/resource/eva.json (with modifications)
# File paths
input_file <- "eva-data.json"
output_file <- "eva-data.csv"
graph_file <- "cumulative_eva_graph.png"

fieldnames <- c("eva", "country", "crew", "vehicle", "date", "duration", "purpose")

data <- list()
data_raw <- readLines(input_file, warn = FALSE)

# 374
for (i in 1:374) {
  line <- data_raw[i]
  print(line)
  data[[i]] <- fromJSON(substr(line, 2, nchar(line)))
}

# Initialize empty vectors
time <- c()
dates <- c()
years <- c()

j <- 1
rownno <- 0
for (i in data) {  # Iterate manually

  if ("duration" %in% names(data[[j]])) {
    time_text <- data[[j]]$duration

    if (time_text == "") {
      # Do nothing if empty
    } else {
      t_parts <- strsplit(time_text, ":")[[1]]
      t_hours <- as.numeric(t_parts[1]) + as.numeric(t_parts[2]) / 60  # Convert to hours
      print(t_hours)
      time <- c(time, t_hours)

      if (("date" %in% names(data[[j]]) & ("eva" %in% names(data[[j]])))) {
        date <- as.Date(substr(data[[j]]$date, 1, 10), format = "%Y-%m-%d")
        year <- as.numeric(format(date,"%Y"))
        dates <- c(dates, date)
        years <- c(years, year)
        row_data <- as.data.frame(data[[j]])
      } else {
        time <- time[-1]
      }
    }
  }

  ## Comment out this bit if you don't want the spreadsheet
  if (exists("row_data")) {
    print(row_data)
    if (rownno==0) {
      write.table(row_data, output_file, sep = ",", row.names = FALSE, col.names = TRUE, append = FALSE)
    } else {
      write.table(row_data, output_file, sep = ",", row.names = FALSE, col.names = FALSE, append = TRUE)
    }
    rownno <- rownno+1
    rm(row_data)
  }

  j <- j + 1
}

if (!exists("cumulative_time")) {cumulative_time <- c(0)}

for (k in time) {
  cumulative_time <- c(cumulative_time, cumulative_time[length(cumulative_time)] + k)
}

sorted_indices <- order(dates)
years <- years[sorted_indices]
time <- time[sorted_indices]

# Print total time in space
print(cumulative_time[length(cumulative_time)])

tdf <- data.frame(
  years = years,
  cumulative_time = cumulative_time[-1]
)


# Plot the data
ggplot(tdf, aes(x = years, y = cumulative_time)) + geom_line(color = "black") + geom_point(color = "black") +
  labs( x = "Year", y = "Total time spent in space to date (hours)", title = "Cumulative Spacewalk Time" ) + theme_minimal()

# Correction for repeatability
cumulative_time <- c(0)

# Save plot
ggsave(graph_file, width = 8, height = 6)
  2. We should also rename the variables w, tt, ttt and ct to be more descriptive. In the solution above, these have been renamed to rownno, time_text, t_hours and cumulative_time respectively.

Use standard libraries


Our script currently reads the data line-by-line from the JSON data file and uses custom code to manipulate the data. Variables of interest are stored in lists but there are more suitable data structures (e.g. data frames) to store data in our case. By choosing custom code over standard and well-tested libraries, we are making our code less readable and understandable and more error-prone.

The main functionality of our code can be rewritten as follows using the dplyr library (together with other tidyverse packages) to load and manipulate the data in data frames.

The code should now look like:

R

library(jsonlite)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(readr)
library(stringr)

input_file <- "eva-data.json"
output_file <- "eva-data.csv"
graph_file <- "cumulative_eva_graph.png"

eva_data <- fromJSON(input_file, flatten = TRUE) |>
 mutate(eva = as.numeric(eva)) |>
 mutate(date = ymd_hms(date)) |>
 mutate(year = year(date)) |>
 drop_na() |>
 arrange(date)


write_csv(eva_data, output_file)

time_in_space_plot <- eva_data |>
  rowwise()  |>
  mutate(duration_hours =
                  sum(as.numeric(str_split_1(duration, "\\:"))/c(1, 60))
  ) |>
  ungroup() |>
  mutate(cumulative_time = cumsum(duration_hours)) |>
  ggplot(aes(x = year, y = cumulative_time)) +
  geom_line(color = "black") +
  labs(
    x = "Year",
    y = "Total time spent in space to date (hours)",
    title = "Cumulative Spacewalk Time" ) +
  theme_minimal()

ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)

We should replace the existing code in our R script eva_data_analysis.R with the above code.

Use comments to explain functionality


Commenting is a very useful practice to help convey the context of the code. It can be helpful as a reminder for your future self or your collaborators as to why code is written in a certain way, how it is achieving a specific task, or the real-world implications of your code.

There are several ways to add comments to code:

  • An inline comment is a comment on the same line as a code statement. Typically, it comes after the code statement and finishes when the line ends and is useful when you want to explain the code line in short. Inline comments in R should be separated by at least two spaces from the statement; they start with a # followed by a single space, and have no end delimiter.

R

x <- 5  # In R, inline comments begin with the `#` symbol and a single space.
  • A multi-line or block comment spans multiple lines. R doesn’t have a specific multi-line comment syntax; a common practice is to use a single # at the start of each line and make the result look like a block comment for readability. You can also align the # symbols vertically to create a visually consistent block of comments.

R

# ==========================================
# This is a block comment.
# You can write a detailed explanation here.
# It can span multiple lines as needed.
# ==========================================

Here are a few things to keep in mind when commenting your code:

  • Focus on the why and the how of your code - avoid using comments to explain what your code does. If your code is too complex for other programmers to understand, consider rewriting it for clarity rather than adding comments to explain it.
  • Make sure you are not reiterating something that your code already conveys on its own. Comments should not echo your code.
  • Keep comments short and concise. Large blocks of text quickly become unreadable and difficult to maintain.
  • Comments that contradict the code are worse than no comments. Always make a priority of keeping comments up-to-date when code changes.

Examples of unhelpful comments

R

statetax <- 1.0625  # Assigns the float 1.0625 to the variable 'statetax'
citytax <- 1.01  # Assigns the float 1.01 to the variable 'citytax'
specialtax <- 1.01  # Assigns the float 1.01 to the variable 'specialtax'

The comments in this code simply tell us what the code does, which is easy enough to figure out without the inline comments.

Examples of helpful comments

R

statetax <- 1.0625  # State sales tax rate is 6.25% through Jan. 1
citytax <- 1.01  # City sales tax rate is 1% through Jan. 1
specialtax <- 1.01  # Special sales tax rate is 1% through Jan. 1

In this case, it might not be immediately obvious what each variable represents, so the comments offer helpful, real-world context. The date in the comment also indicates when the code might need to be updated.

Add comments to our code

  1. Examine eva_data_analysis.R. Add as many comments as you think are required to help yourself and others understand what the code is doing.

  2. Add as many print statements as you think are required to keep the user informed about what the code is doing as it runs.

Some good comments and print statements may look like the example below.

R

library(jsonlite)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(readr)
library(stringr)

# https://data.nasa.gov/resource/eva.json (with modifications)
input_file <- "eva-data.json"
output_file <- "eva-data.csv"
graph_file <- "cumulative_eva_graph.png"

print("--START--")
print("Reading JSON file")

# Read the data from a JSON file into a dataframe
eva_data <- fromJSON(input_file, flatten = TRUE) |>
 mutate(eva = as.numeric(eva)) |>
 mutate(date = ymd_hms(date)) |>
 mutate(year = year(date)) |>
 drop_na() |>
 arrange(date)


print("Saving to CSV file")
# Save dataframe to CSV file for later analysis
write_csv(eva_data, output_file)

print("Plotting cumulative spacewalk duration and saving to file")
# Plot cumulative time spent in space over years
time_in_space_plot <- eva_data |>
  rowwise() |>
  mutate(duration_hours =
                  sum(as.numeric(str_split_1(duration, "\\:"))/c(1, 60))
  ) |>
  ungroup() |>
  # Calculate cumulative time
  mutate(cumulative_time = cumsum(duration_hours)) |>
  ggplot(aes(x = year, y = cumulative_time)) +
  geom_line(color = "black") +
  labs(
    x = "Year",
    y = "Total time spent in space to date (hours)",
    title = "Cumulative Spacewalk Time" ) +
  theme_minimal()

ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
print("--END--")

Separate units of functionality


Functions are a fundamental concept in writing software and are one of the core ways you can organise your code to improve its readability. A function is an isolated section of code that performs a single, specific task that can be simple or complex. It can then be called multiple times with different inputs throughout a codebase, but its definition only needs to appear once.

Breaking up code into functions in this manner benefits readability since the smaller sections are easier to read and understand. Since functions can be reused, codebases naturally begin to follow the Don’t Repeat Yourself principle which prevents software from becoming overly long and confusing. The software also becomes easier to maintain because, if the code encapsulated in a function needs to change, it only needs updating in one place instead of many. As we will learn in a future episode, testing code also becomes simpler when code is written in functions. Each function can be individually checked to ensure it is doing what is intended, which improves confidence in the software as a whole.

Callout

Decomposing code into functions helps with reusability of blocks of code and eliminating repetition, but, equally importantly, it helps with code readability and testing.

Looking at our code, you may notice it contains different pieces of functionality:

  1. reading the data from a JSON file
  2. converting and saving the data in the CSV format
  3. processing/cleaning the data and preparing it for analysis
  4. data analysis and visualising the results

Let’s refactor our code so that reading the data in JSON format into a dataframe (step 1.) is extracted into a separate function, which we will name read_json_to_dataframe, while converting the data and saving it to the CSV format (step 2.) remains in the main script. The main part of the script is then simplified to invoke this new function, with the function itself containing the complexity of this step. We will continue to work on steps 3. and 4. above later on.

After the initial refactoring, our code may look something like the following.

R

library(jsonlite)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(readr)
library(stringr)

read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- fromJSON(input_file, flatten = TRUE) |>
    mutate(eva = as.numeric(eva)) |>
    mutate(date = ymd_hms(date)) |>
    mutate(year = year(date)) |>
    drop_na() |>
    arrange(date)

  return(eva_df)
}


# https://data.nasa.gov/resource/eva.json (with modifications)
input_file <- "eva-data.json"
output_file <- "eva-data.csv"
graph_file <- "cumulative_eva_graph.png"

print("--START--")


# Read the data from a JSON file into a dataframe
eva_data <- read_json_to_dataframe(input_file)

# Save dataframe to CSV file for later analysis
write_csv(eva_data, output_file)

print("Plotting cumulative spacewalk duration and saving to file")
# Plot cumulative time spent in space over years
time_in_space_plot <- eva_data |>
  rowwise() |>
  mutate(duration_hours =
                  sum(as.numeric(str_split_1(duration, "\\:"))/c(1, 60))
  ) |>
  ungroup() |>
  # Calculate cumulative time
  mutate(cumulative_time = cumsum(duration_hours)) |>
  ggplot(aes(x = year, y = cumulative_time)) +
  geom_line(color = "black") +
  labs(
    x = "Year",
    y = "Total time spent in space to date (hours)",
    title = "Cumulative Spacewalk Time" ) +
  theme_minimal()

ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
print("--END--")

We have chosen to create a function for reading in our data file since this is a very common task within research software. While this function does not contain many lines of code, because the tidyverse functions do all the complex data reading and data preparation operations, it can be useful to package these steps together into a reusable function if you need to read in or write out a lot of similarly structured files and process them in the same way.

Function Documentation


Now that we have written some functions, it is time to document them so that we can quickly recall (and others looking at our code in the future can understand) what the functions do without having to read the code.

Roxygen comments are a specific type of function-level documentation written directly above R functions (and classes). Function-level documentation should explain what that particular code is doing, what parameters the function needs (inputs) and what form they should take, what the function outputs (you may see words like ‘return’ here), and what errors (if any) might be raised.

Providing roxygen comments helps improve code readability since it makes the function code more transparent and aids understanding. In particular, roxygen comments that provide information on the input and output of functions make it easier to reuse them in other parts of the code, without having to read the full function to understand what needs to be provided and what will be returned.

Roxygen comment lines in R start with #' to indicate that they are documentation and are written directly above the function definition. The tag @param is used to document function parameters, while @return describes what the function returns, and you can also use @details to give additional information such as errors.

R

#' Divide number x by number y.
#' 
#' @param x A number to be divided.
#' @param y A number to divide by.
#' @return The result of dividing x by y.
#' 
#' @details
#' If y is zero, this will result in an error (division by zero).
divide <- function(x, y) {
    return(x / y)
}

We’ll see later in the course how the roxygen2 package can be used to automatically generate formal documentation for the function, while the package pkgdown can be used to generate a documentation website for our whole project.

Let’s write a roxygen comment for the function read_json_to_dataframe we introduced in the previous exercise. Remember, questions we want to answer when writing the function-level documentation include:

  • What does the function do?
  • What kind of inputs does the function take? Are they required or optional? Do they have default values?
  • What output will the function produce?
  • What exceptions/errors, if any, can it produce?

Our read_json_to_dataframe function fully described by a roxygen comment block may look like:

R

#' Read and Clean EVA Data from JSON
#'
#' This function reads EVA data from a JSON file, cleans it by converting
#' the 'eva' column to numeric, converting data from text to date format,
#' creating a year variable and removing rows with missing values, and sorts
#' the data by the 'date' column.
#'
#' @param input_file A character string specifying the path to the input JSON file.
#'
#' @return A cleaned and sorted data frame containing the EVA data.
#'
read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- fromJSON(input_file, flatten = TRUE) |>
    mutate(eva = as.numeric(eva)) |>
    mutate(date = ymd_hms(date)) |>
    mutate(year = year(date)) |>
    drop_na() |>
    arrange(date)

  return(eva_df)
}

Writing roxygen comments

Write a roxygen comment for the function text_to_duration:

R

text_to_duration <- function(duration) {
  time_parts <- stringr::str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 60
  return(duration_hours)
}

Our text_to_duration function fully described by roxygen comments may look like:

R

#' Convert Duration from HH:MM Format to Hours
#'
#' This function converts a duration in "HH:MM" format (as a character string)
#' into the total duration in hours (as a numeric value).
#'
#' @details
#' When applied to a vector, it will only process and return the first element
#' so this function must be applied to a data frame rowwise.
#'
#' @param duration A character string representing the duration in "HH:MM" format.
#'
#' @return A numeric value representing the duration in hours.
#'
#' @examples
#' text_to_duration("03:45")  # Returns 3.75 hours
#' text_to_duration("12:30")  # Returns 12.5 hours
text_to_duration <- function(duration) {
  time_parts <- stringr::str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 60
  return(duration_hours)
}

Functions for modular and reusable code revisited


As we have already seen earlier in this episode, functions play a key role in creating modular and reusable code. We are going to carry on improving our code following these principles:

  • Each function should have a single, clear responsibility. This makes functions easier to understand, test, and reuse.
  • Write functions that can be easily combined or reused with other functions to build more complex functionality.
  • Functions should accept parameters to allow flexibility and reusability in different contexts; avoid hard-coding values inside functions/code (e.g. data files to read from/write to) and pass them as arguments instead.

Bearing in mind the above principles, we can further simplify the main part of our code by extracting the code that processes and analyses our data and plots a graph into a separate function, plot_cumulative_time_in_space.

We can also extract the code that converts the spacewalk duration text into a number (to allow for arithmetic calculations) into a separate function, text_to_duration.

R

library(jsonlite)
library(ggplot2)
library(dplyr)
library(tidyr)
library(lubridate)
library(readr)
library(stringr)

#' Read and Clean EVA Data from JSON
#'
#' This function reads EVA data from a JSON file, cleans it by converting
#' the 'eva' column to numeric, converting data from text to date format,
#' creating a year variable and removing rows with missing values, and sorts
#' the data by the 'date' column.
#'
#' @param input_file A character string specifying the path to the input JSON file.
#'
#' @return A cleaned and sorted data frame containing the EVA data.
#'
read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- fromJSON(input_file, flatten = TRUE) |>
    mutate(eva = as.numeric(eva)) |>
    mutate(date = ymd_hms(date)) |>
    mutate(year = year(date)) |>
    drop_na() |>
    arrange(date)

  return(eva_df)
}

#' Convert Duration from HH:MM Format to Hours
#'
#' This function converts a duration in "HH:MM" format (as a character string)
#' into the total duration in hours (as a numeric value).
#'
#' @details
#' When applied to a vector, it will only process and return the first element
#' so this function must be applied to a data frame rowwise.
#'
#' @param duration A character string representing the duration in "HH:MM" format.
#'
#' @return A numeric value representing the duration in hours.
#'
#' @examples
#' text_to_duration("03:45")  # Returns 3.75 hours
#' text_to_duration("12:30")  # Returns 12.5 hours
text_to_duration <- function(duration) {
  time_parts <- stringr::str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 60
  return(duration_hours)
}

#' Plot Cumulative Time in Space Over the Years
#'
#' This function plots the cumulative time spent in space over the years based on
#' the data in the dataframe. The cumulative time is calculated by converting the
#' "duration" column into hours, then computing the cumulative sum of the duration.
#' The plot is saved as a PNG file at the specified location.
#'
#' @param tdf A dataframe containing a "duration" column in "HH:MM" format and a "date" column.
#' @param graph_file A character string specifying the path to save the graph.
#'
#' @return NULL
plot_cumulative_time_in_space <- function(tdf, graph_file) {

  time_in_space_plot <- tdf |>
    rowwise() |>
    mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    ungroup() |>
    mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot(aes(x = date, y = cumulative_time)) +
    geom_line(color = "black") +
    labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}

The main part of our code then becomes much simpler and more readable, only containing the invocation of the following three functions:

R

# https://data.nasa.gov/resource/eva.json (with modifications)
input_file <- "eva-data.json"
output_file <- "eva-data.csv"
graph_file <- "cumulative_eva_graph.png"

print("--START--")

# Read the data from a JSON file into a dataframe
eva_data <- read_json_to_dataframe(input_file)

print("Writing CSV File")
# Save dataframe to CSV file for later analysis
write_csv(eva_data, output_file)

print("Plotting cumulative spacewalk duration and saving to file")
# Plot cumulative time spent in space over years
plot_cumulative_time_in_space(eva_data, graph_file)

print("--END--")
...

Using the styler package 1

  1. Look up the styler package. How can it help you make your code more readable?
  2. Install styler and apply it to this poorly formatted version of the plot_cumulative_time_in_space function we created.

R

plot_cumulative_time_in_space <- function(tdf, graph_file) {
  time_in_space_plot <- tdf |>
    rowwise() |>
       mutate(duration_hours = text_to_duration(duration)) |> # Add duration_hours column
    ungroup() |>
    mutate(cumulative_time = cumsum(duration_hours)) |> # Calculate cumulative time
       ggplot(aes(x = date, y = cumulative_time)) +
    geom_line(color = "black") +
    labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      
      title = "Cumulative Spacewalk Time"
    )

  ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}
  1. See R for Data Science for a discussion of styler and how it can help you make your code more readable.

  2. plot_cumulative_time_in_space after applying styler

R

plot_cumulative_time_in_space <- function(tdf, graph_file) {
  time_in_space_plot <- tdf |>
    rowwise() |>
    mutate(duration_hours = text_to_duration(duration)) |> # Add duration_hours column
    ungroup() |>
    mutate(cumulative_time = cumsum(duration_hours)) |> # Calculate cumulative time
    ggplot(aes(x = date, y = cumulative_time)) +
    geom_line(color = "black") +
    labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Readable code is easier to understand, maintain, debug and extend (reuse) - saving time and effort.
  • Choosing descriptive variable and function names will communicate their purpose more effectively.
  • Using comments and function-level documentation (roxygen) to describe parts of the code will help transmit understanding and context.
  • Use libraries or packages for common functionality to avoid duplication.
  • Creating functions from the smallest, reusable units of code will make the code more readable, help compartmentalise which parts of the code are doing what, and isolate specific code sections for re-use.

Attribution


This episode reuses material from the “Code Readability” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 licence with modifications: (i) adaptations have been made to make the material suitable for an audience of R users (e.g. replacing “software” with “code” in places, docstrings with roxygen2), (ii) all code has been ported from Python to R (iii) the example code has been deliberately modified to be non-repeatable for teaching purposes. (iv) Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R focussed content. Some original material has been added – this is marked with a footnote.


  1. Original Material↩︎

Content from Code structure


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • How can we best structure our R project?
  • What are conventional places to store data, code, results and tests within our research project?

Objectives

After completing this episode, participants should be able to:

  • Set up and use an R “research compendium” using rrtools to organise a reproducible research project.

In the previous episode we saw some tools and practices that can help us improve the readability of our code - including breaking our code into small, reusable functions that perform one specific task.

In this episode we will expand these practices to our (research) projects as a whole.

Introducing Research Compendia


Ensuring that our R project is organised and well-structured is just as important as writing well-structured code. Following conventions on consistent and informative directory structure for our project will ensure people will immediately know where to find things and is especially helpful for long-term research projects or when working in teams.

Our project is currently set up as a simple R project with all of our files stored in the root of the project folder. We could improve on this significantly by creating a more structured directory layout like the one below:

OUTPUT

project_name/
├── README.md             # overview of the project
├── data/                 # data files used in the project
│   ├── README.md         # describes where data came from
│   ├── raw/
│   └── processed/
├── manuscript/           # manuscript describing the results
├── results/              # results of the analysis (data, tables)
│   ├── preliminary/
│   └── final/
├── figures/              # results of the analysis (figures)
│   ├── comparison_plot.png
│   └── regression_chart.pdf
├── src/                  # contains source code for the project
│   ├── LICENSE           # license for your code
│   ├── main_script.R    # main script/code entry point
│   └── ...
├── doc/                  # documentation for your project
└── ...

However, we are going to structure our project as an rrtools (package) research compendium instead.

An rrtools compendium is essentially an R package containing everything required to reproduce an analysis (data and functions). While experience of building R packages is helpful - it isn’t necessary to get started working with rrtools compendia - we will cover the necessary detail as we go along.

Research Compendium vs R Project

A research compendium offers additional benefits over a simple R project structure, particularly for ensuring reproducibility and long-term sustainability of the project.

  • While a simple R project can be well-organised, a research compendium follows a standardised R package structure that aligns with best practices for reproducible research. This makes it far easier for someone new to the project to understand and run the analysis.

  • rrtools compendia include metadata files like DESCRIPTION and NAMESPACE that provide clear documentation of dependencies, which helps other collaborators or future users run your project with the correct packages and versions.

  • The compendium setup also allows you to include automatic documentation generation through tools like roxygen2, making it easier to maintain and update as the project evolves.

  • A research compendium supports automated testing (using tools like testthat).

Setting up a research compendium


In this section we are going to set up an R compendium using the rrtools package and copy over the content from our current project to the new compendium.

The top-level of our folder structure for this course is organised as follows:

OUTPUT

advanced_r/
├── project/
│   └── spacewalks1/ # contains source code for the project
└── compendium/

Before we start, we need to close our current spacewalks1 project (File >> Close Project) and create and open a new R project spacewalks2 in the compendium subfolder of advanced_r (File >> New Project >> New Directory >> New Project).

OUTPUT

advanced_r/
├── project/
│   └── spacewalks1/
└── compendium/
    └── spacewalks2/

We also need to install a number of packages we’ll need to start working with the compendium:

R

install.packages("rrtools")
install.packages("usethis")
install.packages("devtools")

Once we have created and launched the new project, we can start setting up the compendium by running the following commands in the R console:

R

library(rrtools)
rrtools::use_compendium(simple=FALSE) 

OUTPUT

> library(rrtools)
✔ Git is installed on this computer, your username is abc123
New project 'spacewalks2' is nested inside an existing project '/Users/myusername/projects/astronaut-data-analysis-r/advanced_r/compendium', which is rarely a good idea.
If this is unexpected, the here package has a function, `here::dr_here()` that reveals why '/Users/myusername/projects/astronaut-data-analysis-r/advanced_r/compendium' is regarded as a project.
Do you want to create anyway?

1: I agree
2: Absolutely not
3: Not now

OUTPUT

Selection: 1
✔ Setting active project to '/Users/myusername/projects/astronaut-data-analysis-r/advanced_r/compendium/spacewalks2'
✔ Creating 'R/'
✔ Writing 'DESCRIPTION'
Package: spacewalks2
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R (parsed):
    * First Last <first.last@example.com> [aut, cre]
Description: What the package does (one paragraph).
License: MIT + file LICENSE
ByteCompile: true
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
✔ Writing 'NAMESPACE'
Overwrite pre-existing file 'spacewalks2.Rproj'?

1: Yup
2: Nope
3: Not now

OUTPUT

Selection: 1
✔ Writing 'spacewalks2.Rproj'
✔ Adding '^spacewalks2\\.Rproj$' to '.Rbuildignore'
✔ Adding '.Rproj.user' to '.gitignore'
✔ Adding '^\\.Rproj\\.user$' to '.Rbuildignore'
✔ Opening '/Users/myusername/projects/astronaut-data-analysis-r/advanced_r/compendium/spacewalks2/' in new RStudio session
✔ Setting active project to '<no active project>'
✔ The package spacewalks2 has been created

Next, you need to:  ↓ ↓ ↓
• Edit the DESCRIPTION file
• Add a license file (e.g. with usethis::use_mit_license(copyright_holder = 'Your Name'))
• Use other 'rrtools' functions to add components to the compendium

The output of the use_compendium function provides a list of next steps to complete the setup of the compendium.

We will follow these steps to complete the setup, but first let’s take a look at the new directory structure that has been created:

OUTPUT

.
├── DESCRIPTION <- .............................package metadata
|                                               dependency management
├── NAMESPACE <- ...............................AUTO-GENERATED on build
├── R <- .......................................folder for functions
└── spacewalks2.Rproj <- ......................RStudio project file

rrtools::use_compendium() creates the bare backbone of infrastructure required for a research compendium.

At this point it provides facilities to store general metadata about our compendium (e.g. bibliographic details to create a citation) and manage dependencies in the DESCRIPTION file, and to store and document functions in the R/ folder.

Together these allow us to manage, install and share functionality associated with our project.

  • A DESCRIPTION file is a required component of an R package. This file contains metadata about our package, including the name, version, author, and dependencies.
  • A NAMESPACE file is a required component of an R package. Its role is to define the functions, methods, and datasets that are exported from a package (i.e., made available to users) and those that are kept internal (i.e., not accessible directly by users). It helps manage the visibility of functions and ensures that only the intended parts of the package are exposed to the outside world. This file is auto-generated when the package is built.
  • The R/ folder contains the R scripts that contain the functions in the package.
  • RStudio Project file spacewalks2.Rproj - this file is used to open the project in RStudio.

Edit the DESCRIPTION file

Let’s start by editing the DESCRIPTION file.

Package: spacewalks2
Title: What the Package Does (One Line, Title Case)
Version: 0.0.0.9000
Authors@R:
    person("First", "Last", , "first.last@example.com", role = c("aut", "cre"))
Description: What the package does (one paragraph).
License: MIT + file LICENSE
ByteCompile: true
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2

The following fields need to be updated:

  • Package - the name of the package.
  • Title - a short description of what the package does.
  • Version - the version number of the package.
  • Authors@R - the authors of the package.
  • Description - a longer description of what the package does.
  • License - the license under which the package code is distributed.

After updating these fields, our DESCRIPTION file looks like this:
Package: spacewalks2
Title: Analysis of NASA's extravehicular activity datasets
Version: 0.0.0.9000
Authors@R: c(
    person(given   = "Kamilla",
           family  = "Kopec-Harding",
           role    = c("cre"),
           email   = "k.r.kopec-harding@bham.ac.uk",
           comment = c(ORCID = "{{0000-0002-2960-7944}}")),
    person(given   = "Sarah",
           family  = "Jaffa",
           role    = c("aut"),
           email   = "sarah.jaffa@manchester.ac.uk",
           comment = c(ORCID = "{{0000-0002-6711-6345}}")),
    person(given   = "Aleksandra",
           family  = "Nenadic",
           role    = c("aut"),
           comment = c(ORCID = "{{0000-0002-2269-3894}}"))
    )
Description: An R research compendium for researchers to generate visualisations and statistical
    summaries of NASA's extravehicular activity datasets.
License: MIT + file LICENSE
ByteCompile: true
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2

Add a license file

Next, we need to add a license file to the compendium.
A license file is a text file that specifies the legal terms under which the code in the package can be used.

We can add a license file using the usethis package. We’ll use an MIT license for this project; we discuss how to select a license in the next episode.

R

usethis::use_mit_license(copyright_holder = "Kamilla Kopec-Harding")

OUTPUT

✔ Writing 'LICENSE'
✔ Writing 'LICENSE.md'
✔ Adding '^LICENSE\\.md$' to '.Rbuildignore'

Add components to the compendium

Once we have filled in the DESCRIPTION file and added a license file, we can start adding content to the compendium. We’ll look at this in detail in the next section.

Managing functionality in a package


We mentioned previously that an R compendium is an R package. rrtools essentially provides us with an R package template and in this section we will start populating this with the functionality from our current project spacewalks1.

Our first task is to copy over the code from our current project to the new compendium and get it working within the compendium / R package structure. We’ll follow these steps:

  • Create package functions
  • Document our functions
  • Build and install our compendium package
  • Check our package for issues
  • Fix any issues that come up
  • Re-build and install our compendium package
  • Write a script to run (drive) our analysis

Create package functions

We’ll start by creating a new R script file in the R/ folder of the compendium and copying over the functions we have created so far.

To create or edit .R files in the R/ directory, we can use:

R

usethis::use_r("eva_data_analysis.R")

This creates a file called eva_data_analysis.R in the R/ directory and opens it up for editing. Let’s populate this with the functions we created in the previous episode.

R

# R/eva_data_analysis.R

#' Read and Clean EVA Data from JSON
#'
#' This function reads EVA data from a JSON file, cleans it by converting
#' the 'eva' column to numeric, converting data from text to date format,
#' creating a year variable and removing rows with missing values, and sorts
#' the data by the 'date' column.
#'
#' @param input_file A character string specifying the path to the input JSON file.
#'
#' @return A cleaned and sorted data frame containing the EVA data.
#'
read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- fromJSON(input_file, flatten = TRUE) |>
    mutate(eva = as.numeric(eva)) |>
    mutate(date = ymd_hms(date)) |>
    mutate(year = year(date)) |>
    drop_na() |>
    arrange(date)

  return(eva_df)
}

#' Convert Duration from HH:MM Format to Hours
#'
#' This function converts a duration in "HH:MM" format (as a character string)
#' into the total duration in hours (as a numeric value).
#'
#' @details
#' When applied to a vector, it will only process and return the first element
#' so this function must be applied to a data frame rowwise.
#'
#' @param duration A character string representing the duration in "HH:MM" format.
#'
#' @return A numeric value representing the duration in hours.
#'
#' @examples
#' text_to_duration("03:45")  # Returns 3.75 hours
#' text_to_duration("12:30")  # Returns 12.5 hours
text_to_duration <- function(duration) {
  time_parts <- stringr::str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 60
  return(duration_hours)
}

#' Plot Cumulative Time in Space Over the Years
#'
#' This function plots the cumulative time spent in space over the years based on
#' the data in the dataframe. The cumulative time is calculated by converting the
#' "duration" column into hours, then computing the cumulative sum of the duration.
#' The plot is saved as a PNG file at the specified location.
#'
#' @param tdf A dataframe containing a "duration" column in "HH:MM" format and a "date" column.
#' @param graph_file A character string specifying the path to save the graph.
#'
#' @return NULL
plot_cumulative_time_in_space <- function(tdf, graph_file) {

  time_in_space_plot <- tdf |>
    rowwise() |>
    mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    ungroup() |>
    mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot(ggplot2::aes(x = date, y = cumulative_time)) +
    geom_line(color = "black") +
    labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggplot2::ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}

Notice that this file only contains functions and we have omitted library() calls. We will add the main script that calls these functions later.

Document package functions

Now, to have our functions exported as part of the spacewalks2 package, we need to document them using Roxygen2.

As we saw earlier, Roxygen2 provides a documentation framework in R and allows us to write specially-structured comments preceding each function definition. When we document our package, these comments are processed automatically to produce .Rd help files for our functions, and their contents control which functions are exported to the package NAMESPACE.

The @export tag tells Roxygen2 to add a function as an export in the NAMESPACE file, so that it will be accessible and available for use after package installation. This means that we need to add the @export tag to each of our functions:

R

#' Read and Clean EVA Data from JSON
#'
#' This function reads EVA data from a JSON file, cleans it by converting
#' the 'eva' column to numeric, converting data from text to date format,
#' creating a year variable and removing rows with missing values, and sorts
#' the data by the 'date' column.
#'
#' @param input_file A character string specifying the path to the input JSON file.
#'
#' @return A cleaned and sorted data frame containing the EVA data.
#' @export
read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- fromJSON(input_file, flatten = TRUE) |>
    mutate(eva = as.numeric(eva)) |>
    mutate(date = ymd_hms(date)) |>
    mutate(year = year(date)) |>
    drop_na() |>
    arrange(date)

  return(eva_df)
}

#' Convert Duration from HH:MM Format to Hours
#'
#' This function converts a duration in "HH:MM" format (as a character string)
#' into the total duration in hours (as a numeric value).
#'
#' @details
#' When applied to a vector, it will only process and return the first element
#' so this function must be applied to a data frame rowwise.
#'
#' @param duration A character string representing the duration in "HH:MM" format.
#'
#' @return A numeric value representing the duration in hours.
#'
#' @examples
#' text_to_duration("03:45")  # Returns 3.75 hours
#' text_to_duration("12:30")  # Returns 12.5 hours
#' @export
text_to_duration <- function(duration) {
  time_parts <- str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 60
  return(duration_hours)
}

#' Plot Cumulative Time in Space Over the Years
#'
#' This function plots the cumulative time spent in space over the years based on
#' the data in the dataframe. The cumulative time is calculated by converting the
#' "duration" column into hours, then computing the cumulative sum of the duration.
#' The plot is saved as a PNG file at the specified location.
#'
#' @param tdf A dataframe containing a "duration" column in "HH:MM" format and a "date" column.
#' @param graph_file A character string specifying the path to save the graph.
#'
#' @return NULL
#' @export
plot_cumulative_time_in_space <- function(tdf, graph_file) {

  time_in_space_plot <- tdf |>
    rowwise() |>
    mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    ungroup() |>
    mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot(ggplot2::aes(x = date, y = cumulative_time)) +
    geom_line(color = "black") +
    labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggplot2::ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}

Build and install package

Build Roxygen documentation

Now that we’ve annotated our source code we can build the documentation either by clicking on More > Document in the RStudio Build panel or from the console using:

R

devtools::document()

OUTPUT

ℹ Updating spacewalks2 documentation
ℹ Loading spacewalks2
Writing NAMESPACE
Writing read_json_to_dataframe.Rd
Writing text_to_duration.Rd
Writing plot_cumulative_time_in_space.Rd

The man/ directory will now contain an .Rd file for each of our functions.

OUTPUT

man
├── plot_cumulative_time_in_space.Rd
├── read_json_to_dataframe.Rd
└── text_to_duration.Rd

and the NAMESPACE now contains an export() entry for each of our functions:

OUTPUT

# Generated by roxygen2: do not edit by hand
export(plot_cumulative_time_in_space)
export(read_json_to_dataframe)
export(text_to_duration)

Install Package

The usual workflow for package development is to:

  • make some changes
  • build and install the package
  • unload and reload the package (often in a new R session)

The best way to install and reload a package in a fresh R session is to use the 🔨 Clean and Install command in the Build panel, which performs several steps in sequence to ensure a clean and correct result (a console-based sketch of roughly equivalent commands is shown after this list):

  • Unloads any existing version of the package (including shared libraries if necessary).
  • Builds and installs the package using R CMD INSTALL.
  • Restarts the underlying R session to ensure a clean environment for re-loading the package.
  • Reloads the package in the new R session by executing the library function.
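
For reference, here is a rough console-based sketch of the same workflow using devtools (the Build panel remains the recommended route; restarting the R session is left to you):

R

# Regenerate the Rd help files and NAMESPACE before installing
devtools::document()
# Build and install the package (roughly what R CMD INSTALL does under the hood)
devtools::install(build = TRUE)
# Restart R (Session >> Restart R), then reload the package in the fresh session
library(spacewalks2)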

Running the 🔨 Clean and Install command on our package produces the following output in the Build panel:

OUTPUT

==> R CMD INSTALL --preclean --no-multiarch --with-keep.source spacewalks2

* installing to library ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library’
* installing *source* package ‘spacewalks2’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (spacewalks2)

We can inspect the resulting documentation for our function using ?plot_cumulative_time_in_space.

Check package

Automated checking

An important part of the package development process is R CMD check. R CMD check automatically checks your code and detects many common problems that we’d otherwise discover the hard way.

To check our package, we can:

  • use devtools::check()
  • click on the ✅Check tab in the Build panel.

Either route:

  • ensures that the documentation is up-to-date by running devtools::document()
  • bundles the package before checking it

More info on checks here.

Both run R CMD check, which returns three types of messages:

  • ERRORs: Severe problems that you should fix regardless of whether or not you’re submitting to CRAN.
  • WARNINGs: Likely problems that you must fix if you’re planning to submit to CRAN (and a good idea to look into even if you’re not).
  • NOTEs: Mild problems. If you are submitting to CRAN, you should strive to eliminate all NOTEs, even if they are false positives.

Let’s check our package:

R

devtools::check()

OUTPUT

── R CMD check results ──────────────────────────────────────────────────────────── spacewalks2 0.0.0.9000 ────
Duration: 10s

❯ checking dependencies in R code ... WARNING
  '::' or ':::' imports not declared from:
    ‘ggplot2’ ‘stringr’

❯ checking R code for possible problems ... NOTE
  plot_cumulative_time_in_space: no visible global function definition
    for ‘ggplot’
  plot_cumulative_time_in_space: no visible global function definition
    for ‘mutate’
  plot_cumulative_time_in_space: no visible global function definition
    for ‘ungroup’
  plot_cumulative_time_in_space: no visible global function definition
    for ‘rowwise’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘duration’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘duration_hours’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘cumulative_time’
  plot_cumulative_time_in_space: no visible global function definition
    for ‘geom_line’
  plot_cumulative_time_in_space: no visible global function definition
    for ‘labs’
  read_json_to_dataframe: no visible global function definition for
    ‘arrange’
  read_json_to_dataframe: no visible global function definition for
    ‘drop_na’
  read_json_to_dataframe: no visible global function definition for
    ‘mutate’
  read_json_to_dataframe: no visible global function definition for
    ‘fromJSON’
  read_json_to_dataframe: no visible binding for global variable ‘eva’
  read_json_to_dataframe: no visible global function definition for
    ‘ymd_hms’
  read_json_to_dataframe: no visible global function definition for
    ‘year’
  Undefined global functions or variables:
    arrange cumulative_time drop_na duration duration_hours eva fromJSON
    geom_line ggplot labs mutate rowwise ungroup year ymd_hms

0 errors ✔ | 1 warning ✖ | 1 notes ✖

R CMD check ran successfully, but it has flagged a WARNING and a NOTE. Let’s start troubleshooting with the NOTE:

OUTPUT

our_function_name: no visible global function definition for ‘third_party_function_name’

read_json_to_dataframe: no visible global function definition for ‘mutate’
plot_cumulative_time_in_space: no visible global function definition for ‘ggplot’

This arises because we are using lots of functions from third-party packages in our code, e.g. mutate and ggplot from dplyr and ggplot2 respectively. However, we have not specified that they are imported from the dplyr and ggplot2 NAMESPACEs, so the checks look for functions with those names in our package (spacewalks2) instead and obviously can’t find anything.

To fix this we need to add the namespace of every third-party function we use.

To specify the namespace of a function we use the package::function notation, so let’s update our functions with these details.

Let’s start with read_json_to_dataframe:

R

read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- fromJSON(input_file, flatten = TRUE) |>
    mutate(eva = as.numeric(eva)) |>
    mutate(date = ymd_hms(date)) |>
    mutate(year = year(date)) |>
    drop_na() |>
    arrange(date)

  return(eva_df)
}

Once we’ve added namespace notation to all the functions our read_json_to_dataframe function looks like this:

R

read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- jsonlite::fromJSON(input_file, flatten = TRUE) |>
    dplyr::mutate(eva = as.numeric(eva)) |>
    dplyr::mutate(date = lubridate::ymd_hms(date)) |>
    dplyr::mutate(year = lubridate::year(date)) |>
    tidyr::drop_na() |>
    dplyr::arrange(date)

  return(eva_df)
}

Challenge

Add namespace notation to the plot_cumulative_time_in_space function

plot_cumulative_time_in_space <- function(tdf, graph_file) {

  time_in_space_plot <- tdf |>
    rowwise() |>
    mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    ungroup() |>
    mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot(aes(x = date, y = cumulative_time)) +
    geom_line(color = "black") +
    labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}

plot_cumulative_time_in_space <- function(tdf, graph_file) {

  time_in_space_plot <- tdf |>
    dplyr::rowwise() |>
    dplyr::mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    dplyr::ungroup() |>
    dplyr::mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot2::ggplot(ggplot2::aes(x = date, y = cumulative_time)) +
    ggplot2::geom_line(color = "black") +
    ggplot2::labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggplot2::ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)
}

We can update the rest of our functions as follows:

R

#' Read and Clean EVA Data from JSON
#'
#' This function reads EVA data from a JSON file, cleans it by converting
#' the 'eva' column to numeric, converting data from text to date format,
#' creating a year variable and removing rows with missing values, and sorts
#' the data by the 'date' column.
#'
#' @param input_file A character string specifying the path to the input JSON file.
#'
#' @return A cleaned and sorted data frame containing the EVA data.
#' @export
read_json_to_dataframe <- function(input_file) {
  print("Reading JSON file")

  eva_df <- jsonlite::fromJSON(input_file, flatten = TRUE) |>
    dplyr::mutate(eva = as.numeric(eva)) |>
    dplyr::mutate(date = lubridate::ymd_hms(date)) |>
    dplyr::mutate(year = lubridate::year(date)) |>
    tidyr::drop_na() |>
    dplyr::arrange(date)

  return(eva_df)
}

#' Convert Duration from HH:MM Format to Hours
#'
#' This function converts a duration in "HH:MM" format (as a character string)
#' into the total duration in hours (as a numeric value).
#'
#' @details
#' When applied to a vector, it will only process and return the first element
#' so this function must be applied to a data frame rowwise.
#'
#' @param duration A character string representing the duration in "HH:MM" format.
#'
#' @return A numeric value representing the duration in hours.
#'
#' @examples
#' text_to_duration("03:45")  # Returns 3.75 hours
#' text_to_duration("12:30")  # Returns 12.5 hours
#' @export
text_to_duration <- function(duration) {
  time_parts <- stringr::str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 60
  return(duration_hours)
}

#' Plot Cumulative Time in Space Over the Years
#'
#' This function plots the cumulative time spent in space over the years based on
#' the data in the dataframe. The cumulative time is calculated by converting the
#' "duration" column into hours, then computing the cumulative sum of the duration.
#' The plot is saved as a PNG file at the specified location.
#'
#' @param tdf A dataframe containing a "duration" column in "HH:MM" format and a "date" column.
#' @param graph_file A character string specifying the path to save the graph.
#'
#' @return NULL
#' @export
plot_cumulative_time_in_space <- function(tdf, graph_file) {

  time_in_space_plot <- tdf |>
    dplyr::rowwise() |>
    dplyr::mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    dplyr::ungroup() |>
    dplyr::mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot2::ggplot(ggplot2::aes(x = date, y = cumulative_time)) +
    ggplot2::geom_line(color = "black") +
    ggplot2::labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

  ggplot2::ggsave(graph_file, width = 8, height = 6, plot = time_in_space_plot)

}

Let’s run Check again:

R

devtools::check()

OUTPUT

── R CMD check results ───────────────── spacewalks2 0.0.0.9000 ────
Duration: 1m 7.7s

❯ checking dependencies in R code ... WARNING
  '::' or ':::' imports not declared from:
    ‘dplyr’ ‘ggplot2’ ‘jsonlite’ ‘lubridate’ ‘stringr’ ‘tidyr’

❯ checking R code for possible problems ... NOTE
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘duration’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘duration_hours’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘cumulative_time’
  read_json_to_dataframe: no visible binding for global variable ‘eva’
  Undefined global functions or variables:
    cumulative_time duration duration_hours eva

0 errors ✔ | 1 warning ✖ | 1 notes ✖
Error: R CMD check found WARNINGs
Execution halted

Exited with status 1.

Add dependency

In this next round of checks, the note about undefined global functions is gone, but now we have a warning about ‘::’ or ‘:::’ imports not declared from ‘dplyr’, ‘ggplot2’ and the other packages. It’s flagging the fact that we want to import functions from dplyr and these other packages but have not yet declared them as dependencies in the Imports field of the DESCRIPTION file.

We can add dplyr and these other packages to Imports with:

R

usethis::use_package("dplyr", "Imports")
usethis::use_package("ggplot2", "Imports")
usethis::use_package("jsonlite", "Imports")
usethis::use_package("lubridate", "Imports")
usethis::use_package("stringr", "Imports")
usethis::use_package("tidyr", "Imports")

OUTPUT

✔ Setting active project to '/Users/krharding/projects/uob/astronaut-data-analysis-not-so-fair-r/advanced_r/compendium/spacewalks2'
✔ Adding 'dplyr' to Imports field in DESCRIPTION
• Refer to functions with `dplyr::fun()`
> usethis::use_package("ggplot2", "Imports")
✔ Adding 'ggplot2' to Imports field in DESCRIPTION
• Refer to functions with `ggplot2::fun()`
> usethis::use_package("jsonlite", "Imports")
✔ Adding 'jsonlite' to Imports field in DESCRIPTION
• Refer to functions with `jsonlite::fun()`
> usethis::use_package("lubridate", "Imports")
✔ Adding 'lubridate' to Imports field in DESCRIPTION
• Refer to functions with `lubridate::fun()`
> usethis::use_package("stringr", "Imports")
✔ Adding 'stringr' to Imports field in DESCRIPTION
• Refer to functions with `stringr::fun()`
> usethis::use_package("tidyr", "Imports")
✔ Adding 'tidyr' to Imports field in DESCRIPTION
• Refer to functions with `tidyr::fun()` 

Our description file now looks like this:

Package: spacewalks2
Title: Analysis of NASA's extravehicular activity datasets
Version: 0.0.0.9000
Authors@R: c(
    person(given   = "Kamilla",
           family  = "Kopec-Harding",
           role    = c("cre"),
           email   = "k.r.kopec-harding@bham.ac.uk",
           comment = c(ORCID = "{{0000-0002-2960-7944}}")),
    person(given   = "Sarah",
           family  = "Jaffa",
           role    = c("aut"),
           email   = "sarah.jaffa@manchester.ac.uk",
           comment = c(ORCID = "{{0000-0002-6711-6345}}")),
    person(given   = "Aleksandra",
           family  = "Nenadic",
           role    = c("aut"),
           comment = c(ORCID = "{{0000-0002-2269-3894}}"))
    )
Description: An R research compendium for researchers to generate visualisations and statistical
    summaries of NASA's extravehicular activity datasets.
License: MIT + file LICENSE
ByteCompile: true
Encoding: UTF-8
LazyData: true
Roxygen: list(markdown = TRUE)
RoxygenNote: 7.3.2
Imports:
    dplyr,
    ggplot2,
    jsonlite,
    lubridate,
    stringr,
    tidyr

Let’s do one final check.

R

devtools::check()

OUTPUT

── R CMD check results ───────────────── spacewalks2 0.0.0.9000 ────
Duration: 13.7s

❯ checking for future file timestamps ... NOTE
  unable to verify current time

❯ checking R code for possible problems ... NOTE
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘duration’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘duration_hours’
  plot_cumulative_time_in_space: no visible binding for global variable
    ‘cumulative_time’
  read_json_to_dataframe: no visible binding for global variable ‘eva’
  Undefined global functions or variables:
    cumulative_time duration duration_hours eva

0 errors ✔ | 0 warnings ✔ | 1 notes ✖

We’ll ignore this note for the time being. It results from the non-standard evaluation used in dplyr functions. You can find out more about it in the Programming with dplyr vignette. One common way to suppress the note is sketched below.
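
This approach (not required for the rest of this course; the file name below is just a suggestion) declares the column names used inside dplyr verbs as package-level global variables:

R

# R/globals.R (suggested file name)
# Tell R CMD check that these names are data frame columns used inside
# dplyr verbs, not undefined global variables
utils::globalVariables(c("cumulative_time", "duration", "duration_hours", "eva"))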

Creating a Driver Script

  1. Now that we’ve updated our functions to include the correct namespace for dplyr functions, let’s build and install the package again so that our functions are available for use.

  2. Setup a run script that calls the functions in the package to run the analysis on the EVA data. The steps below will lead you through this process.

    1. Create a new folder analysis in the root of the compendium.

    2. Create the following sub-folders in the analysis folder to organise our scripts, data and results:

      • scripts - to store our analysis scripts
      • data - to store raw and processed data
      • data/raw_data - to store our raw data
      • data/derived_data - to store the output of our analysis
      • figures - to store any figures generated by our analysis
      • tables - to store any tables generated by our analysis
    3. Place the file eva-data.json in the analysis/data/raw_data folder.

    4. Create a new R script file in the scripts folder called run_analysis.R. Use the run_analysis function from spacewalks1 and related code to run the analysis.

      Hint: remember to update the input and output file locations in your code.

    5. One final piece of housekeeping we need to do is to edit the file .Rbuildignore in the root of the compendium and add the following line to it:

      ^analysis$

      This will ensure that the analysis folder is not included in the package the next time it is built (a usethis-based alternative is sketched after these steps).
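
As an alternative to editing .Rbuildignore by hand, usethis can add the entry for you (a sketch, assuming usethis is installed):

R

# Adds the escaped pattern "^analysis$" to .Rbuildignore
usethis::use_build_ignore("analysis")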

  1. Build and install the package

Run “Clean and Install” from the build panel.

OUTPUT

==> R CMD INSTALL --preclean --no-multiarch --with-keep.source spacewalks2

* installing to library ‘/Library/Frameworks/R.framework/Versions/4.3-arm64/Resources/library’
* installing *source* package ‘spacewalks2’ ...
** using staged installation
** R
** byte-compile and prepare package for lazy loading
** help
*** installing help indices
** building package indices
** testing if installed package can be loaded from temporary location
** testing if installed package can be loaded from final location
** testing if installed package keeps a record of temporary installation path
* DONE (spacewalks2)
  1. Your run_analysis.R script should look like this:

R

library(spacewalks2)
library(dplyr)
library(readr)

run_analysis <- function(input_file, output_file, graph_file) {
  print("--START--\n")

  eva_data <- read_json_to_dataframe(input_file)
  write_csv(eva_data, output_file)
  plot_cumulative_time_in_space(eva_data, graph_file)

  print("--END--\n")
}


input_file <- 'analysis/data/raw_data/eva-data.json'
output_file <- 'analysis/data/derived_data/eva-data.csv'
graph_file <- 'analysis/figures/cumulative_eva_graph.png'
run_analysis(input_file, output_file, graph_file)

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Good practices for code and project structure are essential for creating readable, reusable and reproducible projects.

Attribution


This episode reuses material from the pages “Create a compendium” and “Manage functionality as a package” from “Reproducible Research in R with rrtools” by Anna Krystalli under a CC-BY-4.0 licence with modifications. Sections covering Git and GitHub have been removed. Output has been modified to reflect the spacewalks case study project used in this course. Some original material has been added to introduce the episode and to connect sections together where needed; this is indicated with a footnote 1, e.g. the Challenge “Create a Driver Script”. The section “Test function” is omitted and the “Document function” section has been cut down as roxygen is covered elsewhere in this course.
Questions, Objectives and Key Points have been re-used from the “Code Structure” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 licence with modifications: adaptations for R code.


  1. Original Material↩︎

Content from Code documentation


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • How should we document our R code?
  • Why is documentation important?
  • What are the minimum elements of documentation needed to support reproducible research?

Objectives

After completing this episode, participants should be able to:

  • Use a README file to provide an overview of an R project including citation instructions
  • Describe the main types of code documentation (tutorials, how to guides, reference and explanation).
  • Describe the different formats available for delivering code documentation (Markdown files, static webpages).
  • [Supplementary Material] Use pkgdown to generate and manage comprehensive project documentation
  • [Supplementary Material] Apply a documentation framework to write effective documentation of any type.

We have seen how writing inline comments and roxygen2 function-level documentation within our code can help with improving its readability. The purpose of project-level code documentation is to communicate other important information about our analysis (its purpose, dependencies, how to install and run it, etc.) to the people who need it – both users and developers.

Why document our code?


Code documentation is often perceived as a thankless and time-consuming task with few tangible benefits and is often neglected in research projects. However, like testing, documenting our code can help ourselves and others conduct better research and produce reproducible research:

  • Good documentation captures important methodological details ready for when we come to publish our research
  • Good documentation can help us return to a project seamlessly after time away
  • Documentation can facilitate collaborations by helping us onboard new project members quickly and more easily
  • Good documentation can save us time by answering frequently asked questions (FAQs) about our code for us
  • Code documentation improves the re-usability of our code.
    • Good documentation can make our code more understandable and reusable by others, and can bring us some citations and credit
    • How-to guides and tutorials ensure that users can install our code independently and make use of its basic features
    • Reference guides and background information can help developers understand our code sufficiently to modify/extend/repurpose it.

Project-level documentation


In previous episodes we encountered several different forms of in-code documentation, including in-line comments and roxygen2 function-level documentation. These are an excellent way to improve the readability of our code, but by themselves they are insufficient to ensure that our code is easy to use, understand and modify - this requires additional project-level documentation.

Project-level documentation includes various information and metadata about our code, such as the legal terms for reusing it, a high-level description of its functionality, and instructions on how to install, run and contribute to it.

There are many different types of project-level documentation, including:

Technical documentation

Project-level technical documentation encompasses:

  • Tutorials - lessons that guide learners through a series of exercises to build proficiency in using the code
  • How-To Guides - step by step instructions on how to accomplish specific goals using the code.
  • Reference - a lookup manual to help users find relevant information about the software e.g. functions and their parameters.
  • Explanation - conceptual discussion of the code to help users understand implementation decisions

Project metadata files

A common way to provide project-level documentation is to include various metadata files in the software repository together with the code. Many of these files can be described as “social documentation”, i.e. they indicate how users should “behave” in relation to our project. Some common examples of repository metadata files and their role are explained below:

File              Description
README            Provides an overview of the project, including installation and usage instructions, dependencies and links to other metadata files and technical documentation (tutorial/how-to/explanation/reference)
CONTRIBUTING      Explains to developers how to contribute code to the project, including the processes and standards that should be followed
CODE_OF_CONDUCT   Defines expected standards of conduct when engaging in a software project
LICENSE           Defines the (legal) terms of using, modifying and distributing the code
CITATION          Provides instructions on how to cite the code
AUTHORS           Provides information on who authored the code (can also be included in CITATION)

Just enough documentation

For many small projects the following three pieces of project-level documentation may be sufficient: README, LICENSE and CITATION.

Let’s look at each of these files in turn.

README file

A README file is the first piece of documentation users are likely to read and should provide sufficient information for users and developers to get started using your code.

Let’s create a simple README for our repository:

R

rrtools::use_readme_qmd()

NB: the README created by rrtools provides a good template. For the purposes of this course we’ll create our own.

We can start by adding a one-liner that explains the purpose of our code and who it is for.

# Spacewalks

## Overview

Spacewalks is a research compendium written in R which contains the data
and code underpinning our analysis of NASA’s extravehicular activity
datasets. It is intended for researchers who want to reproduce our analysis.

Now let’s add a list of Spacewalks’ key features:


## Features

Key features of Spacewalks:

- Generates a line plot to show the cumulative duration of space walks
  over time

Now let’s tell users about any pre-requisites required to run the software:

## Pre-requisites

This research compendium has been developed using the statistical
programming language R. To work with the compendium, you will need
installed on your computer the [R
software](https://cloud.r-project.org/) itself (version: \>=4.3.3) and
optionally [RStudio
Desktop](https://rstudio.com/products/rstudio/download/).

Additional dependencies are documented in the DESCRIPTION file.

Spacewalks README

Extend the README for Spacewalks by adding:

  1. Installation instructions
  2. A simple usage example / instructions

Installation instructions:

NB: In the solution below the back ticks of each code block have been escaped to avoid rendering issues (if you are copying and pasting the text, make sure you unescape them).

## Installation instructions

+ Clone the Spacewalks repository to your local machine using Git.
If you don't have Git installed, you can download it from the official Git website.

\`\`\`
git clone https://github.com/your-repository-url/spacewalks.git
cd spacewalks
\`\`\`

+ Open the project in RStudio by clicking on the spacewalks.Rproj file.
+ Navigate to Build > Check to check the package.
+ Navigate to Build > Install to install the package and restart the R session.


## Usage Instructions

To run this analysis, navigate to the analysis/scripts folder and open the
file `run_analysis.R` and run the script using `Source with echo` from
the text editor bar.
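
As an aside, if your users prefer the R console to the RStudio Build panel, the installation steps could equally be described with devtools (a sketch, assuming devtools is installed and the commands are run from the root of the cloned compendium):

R

# Install the packages listed under Imports in DESCRIPTION
devtools::install_deps(dependencies = TRUE)
# Optionally run R CMD check on the compendium
devtools::check()
# Build and install the spacewalks2 package itself
devtools::install()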

LICENSE file

Copyright allows a creator of work (such as written text, photographs, films, music, software code) to state that they own the work they have created. Copyright is automatically implied - even if the creator does not explicitly assert it, copyright of the work exists from the moment of creation. A licence is a legal document which sets down the terms under which the creator is releasing what they have created for others to use, modify, extend or exploit.

Because any creative work is copyrighted the moment it is created, even without any kind of licence agreement, it is important to state the terms under which code can be reused. The lack of a licence for your software implies that no one can reuse the software at all - hence it is imperative you declare it. A common way to declare your copyright of a piece of software and the license you are distributing it under is to include a file called LICENSE in the root directory of your code repository.

There is an optional extra episode in this course on different open source software licences that you can choose for your code and that we recommend for further reading.

Tools to help you choose a licence

  • A short intro on different open source software licences included as extra content to this course.
  • The open source guide on applying, changing and editing licenses.
  • choosealicense.com online tool has some great resources to help you choose a license that is appropriate for your needs, and can even automate adding the LICENSE file to your GitHub code repository.

Change the license of your code

NB: Once you’ve shared your code outside your research group or made your code public, it isn’t common to change the license and there are complexities associated with this (see the 12.2.4 Relicensing section in R Packages (2e)). This challenge covers the situation prior to code release, when you might change your mind about the license you’d like to use.

  1. Many R packages are licensed using GPL-2 / 3. Read the description of GPL-2 and GPL-3 licenses from the choosealicense.com website.

  2. Use usethis::use_gpl_license(version = 3, include_future = TRUE) to change the license of your code to GPL-3. This function will create a LICENSE file in the root of your repository with the text of the GPL-3 license.

  1. See choosealicense.com
  2. usethis::use_gpl_license(version = 3, include_future = TRUE)

CITATION information

We should add citation instructions to our README to explain how to cite our code. This encourages users to credit us when they make use of our code:


### How to cite

Please cite this compendium as:

> Kopec-Harding et al. (2025). _Compendium of R code and data for an analysis of NASA extravehicular data_. Accessed 15 Apr 2025. Online at <https://doi.org/xxx/xxx>

In this example, we assume that the code of the R compendium will be shared online via a repository such as Zenodo (see “Archiving code to Zenodo and obtaining a DOI”).

CITATION.cff

We can include citation information in our README file, but there are certain benefits to using a special file format called the Citation File Format (CFF), which provides a way to include richer metadata about code (or datasets) we want to cite, making it easy for both humans and machines to use this information.

Further information is available from the Turing Way’s guide to software citation.
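
If you would like to generate a CITATION.cff file for this compendium, one option (a sketch, assuming the cffr package from CRAN) is to build it from the metadata already recorded in the DESCRIPTION file:

R

# install.packages("cffr")
library(cffr)
cff_obj <- cff_create()          # build citation metadata from DESCRIPTION
cff_write(cff_obj)               # write CITATION.cff to the project root
cff_validate("CITATION.cff")     # check the file against the CFF schema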

[Supplementary Material] Documentation tools


Once our project reaches a certain size or level of complexity we may want to add additional documentation such as a standalone tutorial or “background” explaining our methodological choices.

Once we move beyond using a README as our primary source of documentation, we need to consider how we will distribute our documentation to our users. Options include:

  • A docs/ folder of Markdown files
  • Adding a Wiki to our repository (if sharing online)
  • Creating a set of web pages (either bundled with our project folder or hosted online) for our documentation using a static site generator for our documentation such as pkgdown.

Creating a static site is a popular solution as it has the key benefit of being able to automatically generate a reference manual from any roxygen2 comment blocks we have added to our code.

pkgdown [^1]

You can use the pkgdown package to create an online site for your documentation. It effectively recycles the documentation you have already created for your functions and the information in your README and DESCRIPTION files, and presents it in a standardised website form.

Let’s create such a site for our package.

R

pkgdown::build_site()

This creates HTML documentation for our package in the docs/ folder and presents you with a preview of the site.

Explore your documentation

Explore documentation in docs/ folder built with pkgdown for your project, starting from the index.html file.

Open index.html file in a Web browser to see how it renders.

Check Reference page in the top menu to see how roxygen2 blocks from your functions are provided here as a reference manual.

Hosting documentation

We saw how pkgdown documentation can be distributed with our repository and viewed “offline” using a browser.

We can also make our documentation available as a live website by deploying it to a hosting service like GitHub Pages. See Create documentation site in the course “Reproducible Research Data and Project Management in R” for further details.
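
One way to automate this, assuming the compendium is hosted on GitHub and you are happy to use GitHub Actions, is the usethis helper below (a sketch):

R

# Sets up GitHub Pages for the repository and adds a GitHub Actions workflow
# that rebuilds and publishes the pkgdown site on every push
usethis::use_pkgdown_github_pages()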

Documentation guides


Once we start to consider other forms of documentation beyond the README, we can also increase reusability of our code by ensuring that the content and style of our documentation matches its purpose.

Documentation guides such as Write the Docs, The Good Docs Project and the Diataxis framework provide a range of resources, including documentation templates, to help us do this.

Spacewalks how-to guide

  1. Review the Diataxis guidance page on writing a How-to guide. Identify three features of an effective how-to guide.

  2. Following the Diataxis guidelines, write a how-to guide to show users how to change the destination filename for the output CSV dataset generated by the Spacewalks software.

An effective how-to guide should:

  • be goal oriented and focus on action.
  • avoid teaching or explanation
  • use appropriate language e.g. conditional imperatives
  • have an informative title

An example how-to guide for our project, saved to the file docs/how-to-guides.md:

# How to change the file path of Spacewalk's output dataset

This guide shows you how to set the file path for Spacewalk's output
data set to a location of your choice.

By default, the cleaned data set in CSV format, generated by the Spacewalk software, is saved to the `analysis/data/derived_data`
folder within the working directory with the file name `eva-data.csv`.

If you would like to modify the name or location of the output dataset, you should edit
the run_analysis.R script in the `analysis/scripts` folder directly and modify the `output_file` variable.

The specified destination folder must exist before you run the Spacewalks analysis script.

Remember to rebuild your documentation:

R

devtools::document()

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Documentation allows users to run and understand software without having to work things out for themselves directly from the source code.
  • Software documentation improves the reusability of research code.
  • A (good) README, CITATION entry/file and LICENSE file are the minimum documentation elements required to support reproducible and reusable research code.
  • Documentation can be provided to users in a variety of formats including a docs folder of Markdown files, a repository Wiki and static webpages.
  • A static documentation site can be created using the tool pkgdown.
  • Documentation frameworks such as Diataxis provide content and style guidelines that can help us write high-quality documentation.

Attribution


This episode reuses material from the “Code Documentation” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 licence with modifications: (i) adaptations have been made to make the material suitable for an audience of R users (e.g. replacing “software” with “code” in places and introducing R-specific content, e.g. discussing pkgdown instead of mkdocs), (ii) all code has been ported from Python to R, (iii) Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R-focussed content.

[^1]: Reused from “Create documentation site” in the course “Reproducible Research Data and Project Management in R” by Anna Krystalli under CC-BY-4.0.

Content from Code correctness


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • How can we verify that our code is correct?
  • How can we automate testing in R?
  • What makes a “good” test?
  • Which parts of our code should we prioritise for testing?

Objectives

After completing this episode, participants should be able to:

  • Explain why code testing is important and how this supports reproducibility.
  • Describe the different types of software tests (unit tests, integration tests, regression tests).
  • Implement unit tests to verify that functions behave as expected using the R testing framework testthat.
  • Interpret the output from testthat to identify which functions are not behaving as expected.
  • Write tests using typical values, edge cases and invalid inputs to ensure that the code can handle extreme values and invalid inputs appropriately.
  • Evaluate code coverage to identify how much of the code is being tested and identify areas that need further tests.

Now that we have improved the structure and readability of our code - it is much easier to test its functionality and improve it further. The goal of software testing is to check that the actual results produced by a piece of code meet our expectations, i.e. are correct.

Callout

Open your research compendium in RStudio and clear your environment:

  • Double-click on spacewalks2.Rproj to open your project in RStudio.
  • Clear your environment using the built-in GUI:
    • Go to the Environment pane (usually on the top right).
    • Click on the **broom icon** (Clear All) to remove all objects in the environment (a console equivalent is sketched below).
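
Alternatively, the same clean-up can be done from the R console (a one-line sketch):

R

# Remove all objects from the global environment
rm(list = ls())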

Why use software testing?


Adopting software testing as part of our research workflow helps us to conduct better research and produce reproducible software:

  • Software testing can help us be more productive as it helps us to identify and fix problems with our code early and quickly and allows us to demonstrate to ourselves and others that our code does what we claim. More importantly, we can share our tests alongside our code, allowing others to verify our software for themselves.
  • The act of writing tests encourages us to structure our code as individual functions and often results in a more readable, modular and maintainable codebase that is easier to extend or repurpose.
  • Software testing improves the accessibility and reusability of our code - well-written software tests capture the expected behaviour of our code and can be used alongside documentation to help other developers quickly make sense of our code. In addition, a well tested codebase allows developers to experiment with new features safe in the knowledge that tests will reveal if their changes have broken any existing functionality.
  • Software testing also gives us the confidence to engage in open research practices - if we are not sure that our code works as intended and produces accurate results, we are unlikely to feel confident about sharing our code with others. Software testing brings peace of mind by providing a step-by-step approach that we can apply to verify that our code is correct.

Types of software tests


There are many different types of software tests, including:

  • Unit tests focus on testing individual functions in isolation. They ensure that each small part of the software performs as intended. By verifying the correctness of these individual units, we can catch errors early in the development process.

  • Integration tests check how different parts of the code (e.g. functions) work together.

  • Regression tests are used to ensure that new changes or updates to the codebase do not adversely affect the existing functionality. They involve checking whether a program or part of a program still generates the same results after changes have been made.

  • End-to-end tests are a special type of integration testing which checks that a program as a whole behaves as expected.

In this course, our primary focus will be on unit testing. However, the concepts and techniques we cover will provide a solid foundation applicable to other types of testing.
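
To make the difference between unit and integration tests concrete, here is a minimal, hypothetical sketch (add_two() and scale_by_ten() are made-up functions, not part of our spacewalks code):

R

# Two made-up functions, used only for illustration
add_two <- function(x) x + 2
scale_by_ten <- function(x) x * 10

# A unit test checks one function in isolation against a known expected value
stopifnot(add_two(3) == 5)

# An integration test checks that the two functions work together correctly
stopifnot(scale_by_ten(add_two(3)) == 50)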

Types of software tests

Fill in the blanks in the sentences below:

  • __________ tests compare the ______ output of a program to its ________ output to demonstrate correctness.
  • Unit tests compare the actual output of a ______ ________ to the expected output to demonstrate correctness.
  • __________ tests check that results have not changed since the previous test run.
  • __________ tests check that two or more parts of a program are working together correctly.
  • End-to-end tests compare the actual output of a program to the expected output to demonstrate correctness.
  • Unit tests compare the actual output of a single function to the expected output to demonstrate correctness.
  • Regression tests check that results have not changed since the previous test run.
  • Integration tests check that two or more parts of a program are working together correctly.

Informal testing


How should we test our code?

One approach is to load the code or a function into our R environment.

From the R console, we can then run one function or a piece of code at a time and check that it behaves as expected. To do this, we can observe how the function behaves using input values for which we know what the correct return value should be.

Let’s do this for our text_to_duration function.

Callout

Before we do so, let’s deliberately introduce a bug into our code:

  1. Open R/eva_data_analysis.R
  2. Let’s modify text_to_duration so that the minutes component is divided by 6 instead of 60 (an easy typo to make!).

R

#' Convert Duration from HH:MM Format to Hours
#'
#' This function converts a duration in "HH:MM" format (as a character string)
#' into the total duration in hours (as a numeric value).
#'
#' @param duration A character string representing the duration in "HH:MM" format.
#'
#' @return A numeric value representing the duration in hours.
#' @export
#'
#' @examples
#' text_to_duration("03:45")  # Returns 3.75 hours
#' text_to_duration("12:30")  # Returns 12.5 hours
text_to_duration <- function(duration) {
  time_parts <- stringr::str_split(duration, ":")[[1]]
  hours <- as.numeric(time_parts[1])
  minutes <- as.numeric(time_parts[2])
  duration_hours <- hours + minutes / 6
  return(duration_hours)
}
3. Once we've done this, we must rebuild and install our compendium using "Clean and Install" from the Build panel.

Recall that the text_to_duration function converts a spacewalk duration stored as a string in format “HH:MM” to a duration in hours - e.g. duration 01:15 (1 hour and 15 minutes) should return a numerical value of 1.25.

Open R/eva_data_analysis.R in RStudio and click “Source” to load all of the functions into the R environment.

On the R console, let’s invoke our function with the value “10:00”:

R

> text_to_duration("10:00")
10.0

So, we have invoked our function with the value “10:00” and it returned the numeric value 10 as expected.

We can then further explore the behaviour of our function by running:

R

> text_to_duration("00:00")
0.0

This all seems correct so far.

Testing code in this “informal” way is an important process to go through as we draft our code for the first time. However, there are some serious drawbacks to this approach if used as our only form of testing.

What are the limitations of informally testing code? (5 minutes)

Think about the questions below. Your instructors may ask you to share your answers in a shared notes document and/or discuss them with other participants.

  • Why might we choose to test our code informally?
  • What are the limitations of relying solely on informal tests to verify that a piece of code is behaving as expected?

It can be tempting to test our code informally because this approach:

  • is quick and easy
  • provides immediate feedback

However, there are limitations to this approach:

  • Working interactively is error prone
  • We must reload our function in R each time we change our code
  • We must repeat our tests every time we update our code which is time consuming
  • We must rely on memory to keep track of how we have tested our code, e.g. what input values we tried
  • We must rely on memory to keep track of which functions have been tested and which have not (informal testing may work well on smaller pieces of code but it becomes impractical for a large codebase)
  • Once we close the R console, we lose all the test scenarios we have tried

Formal testing


Caution

The way we set up and store tests in this section is not conventional for R and is used for teaching purposes to introduce the concept of testing.

In section “Testing Frameworks”, we cover the conventional way to write tests in R.

We can overcome some of these limitations by formalising our testing process. A formal approach to testing our code is to write dedicated test functions to check it. These test functions:

  • Run the function we want to test - the target function with known inputs
  • Compare the output to known, valid results
  • Raise an error if the function’s actual output does not match the expected output
  • Are recorded in a test script that can be re-run on demand.

Let’s explore this process by writing some formal tests for our text_to_duration function.

In RStudio, let’s create a new R file test-code.R in analysis/scripts to store our tests.

We need to load spacewalks using a library() call so that we can access text_to_duration in our test script.

Then, we define our first test function and run it:

R

library(spacewalks)
test_text_to_duration_integer <- function() {
  input_value <- "10:00"
  test_result <- text_to_duration(input_value) == 10
  print(paste("text_to_duration('10:00') == 10?", test_result))
}

test_text_to_duration_integer()

We can run this code with RStudio using the “Source” button or by running the code in the R console:

R

> test_text_to_duration_integer()
[1] "text_to_duration('10:00') == 10? TRUE"

This test checks that when we apply text_to_duration to input value 10:00, the output matches the expected value of 10.

In this example, we use a print statement to report whether the actual output from text_to_duration meets our expectations.

However, this does not meet our requirement to “Raise an error if the function’s output does not match the expected output” and means that we must carefully read our test function’s output to identify whether it has failed.

To ensure that our code raises an error if the function’s output does not match the expected output, we use the expect_true function from the testthat package.

The expect_true function checks whether an expression evaluates to TRUE. If it does, expect_true returns silently and the code continues to run. However, if the expression evaluates to FALSE, expect_true raises an error.

Let’s rewrite our test with an expect_true statement:

R


library(spacewalks)
library(testthat)

test_text_to_duration_integer <- function() {
  input_value <- "10:00"
  test_result <- text_to_duration(input_value) == 10
  expect_true(test_result)
}

test_text_to_duration_integer()

Notice that when we run test_text_to_duration_integer(), nothing happens - there is no output. That is because our function is working correctly and returning the expected value of 10.

Let’s add another test to check whether our function can handle durations with a non-zero minute component (i.e. durations that are not a whole number of hours), and rerun our test code.

R

library(spacewalks)
library(testthat)

test_text_to_duration_integer <- function() {
  input_value <- "10:00"
  test_result <- text_to_duration(input_value) == 10
  expect_true(test_result)
}

test_text_to_duration_float <- function() {
  input_value <- "10:15"
  test_result <- all.equal(text_to_duration(input_value), 10.25)
  expect_true(test_result)
}

test_text_to_duration_float()
test_text_to_duration_integer()

ERROR

> test_text_to_duration_float()
Error: `test_result` is not TRUE

`actual`:   FALSE
`expected`: TRUE

Notice that this time, our test test_text_to_duration_float fails. Our expect_true statement has raised an Error - a clear signal that there is a problem in our code that we need to fix.

We know that duration 10:15 should be converted to number 10.25. What is wrong with our code? If we look at our text_to_duration function, we may identify the following line of our code as problematic:

R

text_to_duration <- function(duration) {
  ...
  duration_hours <- hours + minutes / 6
  ...
}

Recall that our conversion code contains a bug - the minutes component should have been divided by 60 and not 6. We were able to spot this tiny bug only by testing our code (note that just by looking at the results graph there is no way to spot the incorrect results).

Let’s fix the problematic line and rerun our tests. To do this we need to:

  • Navigate to R/eva_data_analysis.R in RStudio
  • Correct the affected line of code

R


duration_hours <- hours + minutes / 60

Checklist

Remember to “Clean and Install” (Build panel) to make sure that our changes have been installed.

This time our tests run without problem.

You may have noticed that we have to repeat a lot of code to add each individual test for each test case. You may also have noticed that our test script stopped after the first test failure and none of the tests after that were run.

To run our remaining tests we would have to manually comment out our failing test and re-run the test script. As our code base grows, testing in this way becomes cumbersome and error-prone. These limitations can be overcome by automating our tests using a testing framework.

Testing frameworks


Testing frameworks can automatically find all the tests in our code base, run all of them (so we do not have to invoke them explicitly or, even worse, forget to invoke them), and present the test results as a readable summary.

We will use the R testing framework testthat with the code coverage package covr.

Let’s install these packages and add them to our DESCRIPTION file.

R

install.packages("testthat")
install.packages("covr)

DESCRIPTION file

R

Suggests: 
  testthat (>= 3.0.0),
  covr,
  ...

Let’s make sure that our tests are ready to work with testthat.

  • testthat automatically discovers tests based on specific naming patterns. It looks for files that start with “test_” or “test-” and end with “.r” or “.R”. Then, within these files, testthat looks for function calls to “test_that()”. Our test file already meets these requirements, so there is nothing to do here. However, our script does contain lines to run each of our test functions. These are no longer required as testthat will run our tests, so we can remove them:

    R

    # Delete these 2 lines
    test_text_to_duration_float()
    test_text_to_duration_integer()
  • It is also conventional when working with testthat to place test files in a tests/testthat directory at the root of our project and to name each test file after the code file that it targets. This helps in maintaining a clean structure and makes it easier for others to understand where the tests are located.

  • Finally, a standard setup file testthat.R is placed in the tests folder:

    R

    # This file is part of the standard setup for testthat.
    # It is recommended that you do not modify it.
    #
    # Where should you do additional test configuration?
    # Learn more about the roles of various files in:
    # * https://r-pkgs.org/testing-design.html#sec-tests-files-overview
    # * https://testthat.r-lib.org/articles/special-files.html
    
    library(testthat)
    library(spacewalks)
    
    test_check("spacewalks")

We can set up the folder structure and setup file manually or by running the following command in the R console:

usethis::use_testthat()

A set of tests for a given piece of code is called a test suite. Our test suite is currently located in analysis/scripts. Let’s move it to a conventional test folder tests/testthat and rename our test-code.R file to test-eva_data_analysis.R.

You can do this using the file panel in RStudio or by typing the following commands in the command line terminal:

BASH

mv analysis/scripts/test-code.R tests/testthat/test-eva_data_analysis.R
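
After the move, the testing-related parts of our compendium should look roughly like this (a sketch; other files and folders are omitted):

.
├── R
│   └── eva_data_analysis.R
├── tests
│   ├── testthat.R
│   └── testthat
│       └── test-eva_data_analysis.R
└── DESCRIPTION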

Before we re-run our tests using testthat, let’s convert our test functions to test_that() calls and add some inline comments to clarify what each test is doing. We will also expand our syntax to highlight the logic behind our approach:

R

test_that("text_to_duration returns expected ground truth values
    for typical durations with a non-zero minute component", {
  actual_result <- text_to_duration("10:15")
  expected_result <- 10.25
  expect_true(isTRUE(all.equal(actual_result, expected_result)))
})

test_that("text_to_duration returns expected ground truth values
    for typical whole hour durations", {
  actual_result <- text_to_duration("10:00")
  expected_result <- 10
  expect_true(actual_result==expected_result)
})

Writing our tests this way highlights the key idea that each test should compare the actual results returned by our function with expected values.

Similarly, writing inline comments for our tests that complete the sentence “Test that …” helps us to understand what each test is doing and why it is needed.

Before running our tests with testthat, let’s reintroduce our old bug in function text_to_duration that affects the durations with a non-zero minute component like “10:25” but not those that are whole hours, e.g. “10:00”:

R

text_to_duration <- function(duration) {
  ...
  duration_hours <- hours + minutes / 6 # 6 instead of 60
  ...
}

Finally, let’s run our tests. We can do this by running the following command in the R console:

R

testthat::test_dir("tests/testthat")

This runs all of the tests in the tests/testthat directory and provides a summary of the results.

ERROR

> testthat::test_dir("tests/testthat")
✔ | F W  S  OK | Context
✖ | 1        1 | eva_data_analysis
────────────────────────────────────────────────────────────────────────────────
Failure (test-eva_data_analysis.R:5:3): text_to_duration returns expected ground truth values
    for typical durations with a non-zero minute component
`isTRUE(all.equal(actual_result, expected_result))` is not TRUE

`actual`:   FALSE
`expected`: TRUE
────────────────────────────────────────────────────────────────────────────────

══ Results ═════════════════════════════════════════════════════════════════════
── Failed tests ────────────────────────────────────────────────────────────────
Failure (test-eva_data_analysis.R:5:3): text_to_duration returns expected ground truth values
    for typical durations with a non-zero minute component
`isTRUE(all.equal(actual_result, expected_result))` is not TRUE

`actual`:   FALSE
`expected`: TRUE

[ FAIL 1 | WARN 0 | SKIP 0 | PASS 1 ]
Error: Test failures

From the above output from testthat’s execution of our tests, we notice that:

  • If a test finishes without failing any expectations or raising an error, it is counted as a pass under the “OK” column.
  • If an expectation fails (or the test code raises an error), the test is counted as a failure under the “F” column.
  • The output includes details about each failure to help identify what went wrong.

Let’s fix our bug once again, reload our compendium (devtools::load_all()) and rerun our tests using testthat.

R

text_to_duration <- function(duration) {
  ...
  duration_hours <- hours + minutes / 60 
  ...
}

Checklist

Remember to “Clean and Install” (Build panel) to make sure that our changes have been installed.

R

testthat::test_dir("tests/testthat")

Interpreting testthat output

A colleague has asked you to conduct a pre-publication review of their code which analyses time spent in space by various individual astronauts.

You tested their code using testthat, and got the following output. Inspect it and answer the questions below.

OUTPUT

> testthat::test_dir("tests/testthat")
✔ | F W  S  OK | Context
✖ | 2        4 | analyse
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
Failure (test-analyse.R:7:3): test_total_duration
`actual` (`actual`) not equal to `expected` (`expected`).

  `actual`: 100
`expected`:  10

Error (test-analyse.R:14:3): test_mean_duration
Error in `len(durations)`: could not find function "len"
Backtrace:
    ▆
 1. └─spacetravel:::calculate_mean_duration(durations) at test-analyse.R:14:3
──────────────────────────────────────────────────────────────────────────────────────────────────────────────
✔ |      1    2 | prepare

══ Results ═══════════════════════════════════════════════════════════════════════════════════════════════════
── Failed tests ──────────────────────────────────────────────────────────────────────────────────────────────
Failure (test-analyse.R:7:3): test_total_duration
`actual` (`actual`) not equal to `expected` (`expected`).

  `actual`: 100
`expected`:  10

Error (test-analyse.R:14:3): test_mean_duration
Error in `len(durations)`: could not find function "len"
Backtrace:
    ▆
 1. └─spacetravel:::calculate_mean_duration(durations) at test-analyse.R:14:3

[ FAIL 2 | WARN 0 | SKIP 1 | PASS 6 ]
Error: Test failures
  1. How many tests has our colleague included in the test suite?
  2. How many tests failed?
  3. Why did “test_total_duration” fail?
  4. Why did “test_mean_duration” fail?
  1. 9 tests were detected in the test suite (2 failed + 1 skipped + 6 passed; the “S” column stands for “skipped”).
  2. 2 tests failed, both in the test file test-analyse.R.
  3. test_total_duration failed because the calculated total duration differs from the expected value by a factor of 10.
  4. test_mean_duration failed because there is an error in calculate_mean_duration - our colleague has used len (which is not an R function) instead of length. As a result, calling the function raises a “could not find function” error rather than returning a value, and the test fails.

Test suite design


We now have the tools in place to automatically run tests. However, that alone is not enough to properly test code. We will now look into what makes a good test suite and good practices for testing code.

Let’s start by considering the following scenario. A collaborator on our project has sent us the following function which can be used to add a new column called crew_size to our data containing the number of astronauts participating in any given spacewalk. How do we know that it works as intended and that it will not break the rest of our code? For this, we need to write a test suite with a comprehensive coverage of the new code.

R


#' Calculate the Size of the Crew
#'
#' This function calculates the number of crew members from a string containing
#' their names, separated by semicolons. The crew size is determined by counting
#' the number of crew members listed and subtracting 1 to account for an empty string
#' at the end of the list.  This function should be applied to a dataframe.
#'
#' @param crew A character string containing the names of crew members, separated by semicolons.
#'
#' @return An integer representing the size of the crew (the number of crew members).
#' @export
#'
#' @examples
#' calculate_crew_size("John Doe;Jane Doe;")  # Returns 2
#' calculate_crew_size("John Doe;")  # Returns 1
calculate_crew_size <- function(crew) {
  # Use purrr::map_int to iterate over each crew element and return an integer vector
  purrr::map_int(crew, function(c) {
    trimmed_crew <- stringr::str_trim(c)
    if (trimmed_crew == "") {
      return(NA_integer_)  # Return NA as an integer (NA_integer_)
    } else {
      crew_list <- stringr::str_split(c, ";")[[1]]
      return(length(crew_list) - 1)  # Return the number of crew members (excluding the last empty string)
    }
  })
}
    

Let’s add this function to R/eva_data_analysis.R and update analysis/scripts/run_analysis.R to include it.

R

run_analysis <- function(input_file, output_file, graph_file) {
  cat("--START--\n")

  eva_data <- read_json_to_dataframe(input_file)
  
  eva_data <- eva_data |> # Add this line
    mutate(crew_size = calculate_crew_size(crew)) # Add this line

  write_dataframe_to_csv(eva_data, output_file) # Add this line

  plot_cumulative_time_in_space(eva_data, graph_file)
  generate_summary_table(eva_data, "crew_size", "analysis/tables/summary_table.html")
  cat("--END--\n")
}


Checklist

Remember to “Clean and Install” (Build panel) to make sure that our changes have been installed.

Writing good tests

The aim of writing good tests is to verify that each of our functions behaves as expected with the full range of inputs that it might encounter. It is helpful to consider each argument of a function in turn and identify the range of typical values it can take. Once we have identified this typical range or ranges (where a function takes more than one argument), we should do the following (a short example follows the list):

  • Test all values at the edge of the range
  • Test at least one interior point
  • Test invalid values
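
For example, for text_to_duration the typical input is an “HH:MM” string; an interior point such as “10:15” is already covered by our earlier tests, so the sketch below adds an edge case and an invalid input. Note that the invalid-input test assumes the current implementation, which returns NA (with a warning from as.numeric()) rather than raising an error.

R

# Edge of the range: a zero duration
test_that("text_to_duration returns 0 for a zero duration", {
  expect_equal(text_to_duration("00:00"), 0)
})

# Invalid input: a string that is not in HH:MM format
test_that("text_to_duration returns NA for a malformed duration string", {
  # as.numeric() warns when coercion fails, so capture the warning explicitly
  expect_warning(result <- text_to_duration("not a duration"))
  expect_true(is.na(result))
})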

Let’s have a look at the calculate_crew_size function from our colleague’s new code and write some tests for it.

Unit tests for calculate_crew_size

Implement unit tests for the calculate_crew_size function. Cover typical cases and edge cases.

Hint - use the following template when writing tests:

test_that("MYFUNCTION ...", {
    """
    Test that ...   #FIXME
    """

    # Typical value 1
    actual_result <-  _______________ #FIXME
    expected_result <- ______________ #FIXME
    expect_equal(actual_result == expected_result, tolerance = 1e-6)

    # Typical value 2
    actual_result <-  _______________ #FIXME
    expected_result <- ______________ #FIXME
    expect_equal(actual_result == expected_result, tolerance = 1e-6)
}

We can add the following test functions to our test suite.

R


test_that("calculate_crew_size returns correct values for typical crew inputs", {
  
  # First test case
  actual_result_1 <- calculate_crew_size("Valentina Tereshkova;")
  expected_result_1 <- 1
  expect_equal(actual_result_1, expected_result_1)
  
  # Second test case
  actual_result_2 <- calculate_crew_size("Judith Resnik; Sally Ride;")
  expected_result_2 <- 2
  expect_equal(actual_result_2, expected_result_2)
  
})


# Edge cases
test_that("calculate_crew_size returns expected value for an empty crew string", {
  actual_result <- calculate_crew_size("")
  expect_true(is.na(actual_result))
})

Let’s run our tests:

R

testthat::test_dir("tests/testthat")

OUTPUT

> testthat::test_dir("tests/testthat")
✔ | F W  S  OK | Context
✔ |          5 | eva_data_analysis

══ Results ═══════════════════════════════════════════════════════════════════════════════════════════════════
[ FAIL 0 | WARN 0 | SKIP 0 | PASS 5 ]

Just enough tests

In this episode, so far we have (only) written tests for two individual functions: text_to_duration and calculate_crew_size.

We can quantify the proportion of our code base that is run (also referred to as “exercised”) by a given test suite using a metric called code coverage:

\[ \text{Line Coverage} = \left( \frac{\text{Number of Executed Lines}}{\text{Total Number of Executable Lines}} \right) \times 100 \]
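
For example, if our test suite executes 8 of the 20 executable lines in a file, the line coverage for that file is (8 / 20) × 100 = 40%.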

We can calculate our test coverage using the covr package as follows.

R

install.packages("covr")
install.packages("DT")
install.packages("htmltools")

library(covr)
coverage <- package_coverage()

OUTPUT

> coverage
spacewalks Coverage: 40.00%
R/eva_data_analysis.R: 40.00%

To get an in-depth report about which parts of our code are tested and which are not, we can run:

R

 covr::report(coverage)

This option generates a report in the RStudio viewer. This provides structured information about our test coverage including:

  • a table showing the proportion of lines in each file that are currently tested, and
  • an annotated copy of our code where untested lines are highlighted in red.

Ideally, all the lines of code in our code base should be exercised by at least one test. However, if we lack the time and resources to test every line of our code we should:

  • avoid testing R’s built-in functions or functions imported from well-known and well-tested libraries like dplyr or ggplot2.
  • focus on the parts of our code that carry the greatest “reputational risk”, i.e. that could affect the accuracy of our reported results.

Callout

Test coverage of less than 100% indicates that more testing may be helpful.

Test coverage of 100% does not mean that our code is bug-free.

Evaluating code coverage

Generate the code coverage report for your compendium using the covr::report(coverage) command.

Inspect the report generated and extract the following information:

  1. What proportion of the code base is currently “not” exercised by the test suite?
  2. Which functions in our code base are currently untested?
  1. The proportion of the code base NOT covered by our tests is ~60% (100% - 40%) - this may differ for your version of the code.
  2. You can find this information by checking which functions in the annotated source code section of the report contain red (untested) lines. The following functions in our code base are currently untested:
    • read_json_to_dataframe
    • write_dataframe_to_csv
    • add_duration_hours_variable
    • plot_cumulative_time_in_space
    • add_crew_size_variable

Summary


During this episode, we have covered how to use tests to verify the correctness of our code. We have seen how to write a unit test, how to manage and run our tests using the testthat framework and how to identify which parts of our code require additional testing using test coverage reports.

These skills reduce the probability that there will be a mistake in our code and support reproducible research by giving us the confidence to engage in open research practices. Tests also document the intended behaviour of our code for other developers and mean that we can experiment with changes to our code knowing that our tests will let us know if we break any existing functionality.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Also check the full reference set for the course.

Key Points

  • Code testing supports reproducibility by demonstrating that your code behaves as you expect and consistently generates the same output with a given set of inputs.
  • Unit testing is crucial as it ensures each function works correctly.
  • Using the testthat package you can write basic unit tests for R functions to verify their correctness.
  • Identifying and handling edge cases in unit tests is essential to ensure your code performs correctly under a variety of conditions.
  • Test coverage can help you to identify parts of your code that require additional testing.

Attribution


This episode reuses material from the “Code Correctness” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 with modifications (i) adaptations have been made to make the material suitable for an audience of R users (e.g. replacing “software” with “code” in places, pytest with testthat), (ii) all code has been ported from Python to R (iii) Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R focussed content.

Content from Reproducible development environment


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • How can we manage R dependencies in our analysis projects?

Objectives

After completing this episode, participants should be able to:

  • Set up a local package library for an R project using renv.

Our code has dependencies 1


Now that we’ve finished developing our project, let’s take a look back at our code.

If we have a look at analysis/scripts/run_analysis.R, we can see a number of library() calls at the top of the file:

R

library(spacewalks2)
library(dplyr)
etc.

Similarly, if we have a look at our script R/eva_data_analysis.R, we can see a number of functions from external packages being used including dplyr::mutate and ggplot2::ggplot:

R

time_in_space_plot <- tdf |>
    dplyr::rowwise() |>
    dplyr::mutate(duration_hours = text_to_duration(duration)) |>  # Add duration_hours column
    dplyr::ungroup() |>
    dplyr::mutate(cumulative_time = cumsum(duration_hours)) |>     # Calculate cumulative time
    ggplot2::ggplot(ggplot2::aes(x = date, y = cumulative_time)) +
    ggplot2::geom_line(color = "black") +
    ggplot2::labs(
      x = "Year",
      y = "Total time spent in space to date (hours)",
      title = "Cumulative Spacewalk Time"
    )

This means that our code requires several external libraries (also called third-party packages or dependencies) - jsonlite, readr, dplyr, stringr, tidyr etc.

Managing dependencies 2


One of the most important aspects of reproducible research is managing dependencies.

When your analysis runs, it interacts with:

  • Operating system: The operating system of your computer.

  • System configurations: such as locations of libraries and the search path your computer uses to find files and libraries

  • System-level libraries. These are non-R libraries that R, or the R packages you are using, depend on (for example, libraries such as GEOS, GDAL and PROJ4 that geospatial R packages like sf and rgeos depend on). Such external dependencies to R are listed as System Requirements in R package DESCRIPTION files (e.g. have a look at the relevant line in the sf package DESCRIPTION).

  • R and the R packages your analysis depends on. For example, the version of R we’ve been using as well as packages ggplot2 and dplyr.

When someone else tries to reproduce your analysis on a different computer, if any of these elements differ from what existed on your own system when you last ran your analysis (e.g. some of the dependencies are missing, different versions are available, or code behaves differently on a different operating system), they may not be able to reproduce your work.

Callout

It is essential to document the specific package versions used in a project and to offer a way for others to install those exact versions on their own machines. The renv package addresses both of these needs effectively.

Managing R dependencies with renv 3


The renv package is a recent effort to bring project-local R dependency management to R projects.

Underlying the philosophy of renv is that any of your existing workflows should just work as they did before – renv helps manage library paths (and other project-specific state) to help isolate your project’s R dependencies.

renv Workflow

The general workflow when working with renv is as follows (a short code sketch of the cycle appears after the list):

  1. Call renv::init() to initialize a new project-local environment with a private R library,

  2. Work in the project as normal, installing and removing new R packages as they are needed in the project,

  3. Call renv::snapshot() to save the state of the project library to the lockfile (called renv.lock),

  4. Continue working on your project, installing and updating R packages as needed.

  5. Call renv::snapshot() again to save the state of your project library if your attempts to update R packages were successful, or call renv::restore() to revert to the previous state as encoded in the lockfile if your attempts to update packages introduced some new problems.
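
As a condensed sketch, the cycle looks something like this in the R console (the install.packages() call is just an illustrative example):

R

renv::init()                # 1. create a project-local library and the renv.lock lockfile
install.packages("dplyr")   # 2. install and remove packages as the project evolves
renv::snapshot()            # 3. record the current package versions in renv.lock
# ... 4. continue working, installing and updating packages as needed ...
renv::snapshot()            # 5. save the new state if the updates worked, or
renv::restore()             #    roll the library back to the state recorded in renv.lock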

The renv::init() function attempts to ensure the newly-created project library includes all R packages currently used by the project. It does this by crawling R files within the project for dependencies with the renv::dependencies() function. The discovered packages are then installed into the project library with the renv::hydrate() function, which will also attempt to save time by copying packages from your user library (rather than re-installing from CRAN) as appropriate.

Calling renv::init() will also write out the infrastructure necessary to automatically load and use the private library for new R sessions launched from the project root directory. This is accomplished by creating (or amending) a project-local .Rprofile with the necessary code to load the project when the R session is started.

The following files are written to and used by projects using renv:

File              Usage
.Rprofile         Used to activate renv for new R sessions launched in the project.
renv.lock         The lockfile, describing the state of your project’s library at some point in time.
renv/activate.R   The activation script run by the project .Rprofile.
renv/library      The private project library.

Reproducibility with renv

Using renv, it’s possible to “save” and “load” the state of your project library. More specifically, you can use:

  • renv::snapshot() to save the state of your project to renv.lock; and
  • renv::restore() to restore the state of your project from renv.lock.

For each package used in your project, renv will record the package version, and (if known) the external source from which that package can be retrieved. renv::restore() uses that information to retrieve and re-install those packages in your project.
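
For illustration, a heavily trimmed renv.lock entry looks something like the excerpt below (this is a sketch, not the full lockfile format; the versions shown are examples and will differ for your project):

JSON

{
  "R": {
    "Version": "4.3.3"
  },
  "Packages": {
    "dplyr": {
      "Package": "dplyr",
      "Version": "1.1.4",
      "Source": "Repository",
      "Repository": "CRAN"
    }
  }
}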

Beyond the DESCRIPTION file 4

So far we have been using the DESCRIPTION file to record the dependencies of our research compendium. This file contains a list of all of the direct dependencies of our research compendium. If we were to share our research compendium with someone else, they could install these dependencies by running devtools::install().

However, this approach has some limitations:

  • We’ve had to manually track the dependencies of our analysis scripts and their versions.
  • By default devtools::install() will install the dependencies into the user’s global R library, not into a project-specific library. This makes it challenging to work with multiple research compendia that may require different versions of the same package.
  • The DESCRIPTION file only records the direct dependencies of our code - not the transitive dependencies (the dependencies of our dependencies).

renv helps to ensure that our R project can be reproduced by:

  • providing tools to automatically identify the dependencies of our compendium.
  • logging the exact versions of both our direct AND transitive dependencies.

Using renv in our spacewalks project

Let’s go back into our spacewalks compendium and capture our dependencies into a project level R package library.

First, let’s configure the install location that renv will use to renv/library using a .Renviron file.

TXT

# .Renviron
RENV_PATHS_LIBRARY = renv/library

Then, we can initialise renv

R

renv::init()

OUTPUT

> renv::init()
This project contains a DESCRIPTION file.
Which files should renv use for dependency discovery in this project?

1: Use only the DESCRIPTION file. (explicit mode)
2: Use all files in this project. (implicit mode)

Selection: 2

We are presented with two options for automatic discovery of our compendium’s dependencies. “explicit mode” uses only the DESCRIPTION file, while “implicit mode” scans all the files in our project for dependencies.

We choose “2” to use all files in the project for dependency discovery.

OUTPUT


- Using 'implicit' snapshot type. Please see `?renv::snapshot` for more details.

- Linking packages into the project library ... Done!
- Resolving missing dependencies ...
# Installing packages --------------------------------------------------------
- Installing magrittr ...                       OK [linked from cache]
- Installing dplyr ...                          OK [linked from cache]
- Installing ggplot2 ...                        OK [linked from cache]
- Installing here ...                           OK [linked from cache]
- Installing jsonlite ...                       OK [linked from cache]
- Installing xfun ...                           OK [linked from cache]
- Installing highr ...                          OK [linked from cache]
- Installing knitr ...                          OK [linked from cache]
- Installing purrr ...                          OK [linked from cache]
- Installing readr ...                          OK [linked from cache]
- Installing rmarkdown ...                      OK [linked from cache]
- Installing stringr ...                        OK [linked from cache]
- Installing evaluate ...                       OK [linked from cache]
- Installing withr ...                          OK [linked from cache]
- Installing waldo ...                          OK [linked from cache]
- Installing testthat ...                       OK [linked from cache]
- Installing tidyr ...                          OK [linked from cache]
- Installing roxygen2 ...                       OK [linked from cache]
- Installing devtools ...                       OK [linked from cache]
The following package(s) will be updated in the lockfile:

# CRAN -----------------------------------------------------------------------
- askpass        [* -> 1.2.0]
- base64enc      [* -> 0.1-3]
- bit            [* -> 4.0.5]
- bit64          [* -> 4.0.5]
- brew           [* -> 1.0-10]
- brio           [* -> 1.1.4]
- bslib          [* -> 0.7.0]
- cachem         [* -> 1.0.8]
- callr          [* -> 3.7.6]
- cli            [* -> 3.6.2]
- clipr          [* -> 0.8.0]
- colorspace     [* -> 2.1-0]
- commonmark     [* -> 1.9.1]
- conflicted     [* -> 1.2.0]
- cpp11          [* -> 0.4.7]
- crayon         [* -> 1.5.2]
- credentials    [* -> 2.0.1]
- desc           [* -> 1.4.3]
- devtools       [* -> 2.4.5]
- diffobj        [* -> 0.3.5]
- digest         [* -> 0.6.35]
- downlit        [* -> 0.4.3]
- dplyr          [* -> 1.1.4]
- ellipsis       [* -> 0.3.2]
- evaluate       [* -> 1.0.3]
- fansi          [* -> 1.0.6]
- farver         [* -> 2.1.1]
- fastmap        [* -> 1.1.1]
- fontawesome    [* -> 0.5.2]
- fs             [* -> 1.6.3]
- generics       [* -> 0.1.3]
- gert           [* -> 2.0.1]
- ggplot2        [* -> 3.5.1]
- gh             [* -> 1.4.1]
- git2r          [* -> 0.35.0]
- gitcreds       [* -> 0.1.2]
- glue           [* -> 1.7.0]
- gtable         [* -> 0.3.4]
- here           [* -> 1.0.1]
- highr          [* -> 0.11]
- hms            [* -> 1.1.3]
- htmltools      [* -> 0.5.8.1]
- htmlwidgets    [* -> 1.6.4]
- httpuv         [* -> 1.6.15]
- httr           [* -> 1.4.7]
- httr2          [* -> 1.0.1]
- ini            [* -> 0.3.1]
- isoband        [* -> 0.2.7]
- jquerylib      [* -> 0.1.4]
- jsonlite       [* -> 1.9.0]
- knitr          [* -> 1.49]
- labeling       [* -> 0.4.3]
- later          [* -> 1.3.2]
- lattice        [* -> 0.22-5]
- lifecycle      [* -> 1.0.4]
- magrittr       [* -> 2.0.3]
- MASS           [* -> 7.3-60.0.1]
- Matrix         [* -> 1.6-5]
- memoise        [* -> 2.0.1]
- mgcv           [* -> 1.9-1]
- mime           [* -> 0.12]
- miniUI         [* -> 0.1.1.1]
- munsell        [* -> 0.5.1]
- nlme           [* -> 3.1-164]
- pillar         [* -> 1.9.0]
- pkgbuild       [* -> 1.4.4]
- pkgconfig      [* -> 2.0.3]
- pkgdown        [* -> 2.0.7]
- pkgload        [* -> 1.3.4]
- praise         [* -> 1.0.0]
- prettyunits    [* -> 1.2.0]
- processx       [* -> 3.8.4]
- profvis        [* -> 0.3.8]
- progress       [* -> 1.2.3]
- promises       [* -> 1.2.1]
- ps             [* -> 1.7.6]
- purrr          [* -> 1.0.4]
- R6             [* -> 2.5.1]
- rappdirs       [* -> 0.3.3]
- rcmdcheck      [* -> 1.4.0]
- RColorBrewer   [* -> 1.1-3]
- Rcpp           [* -> 1.0.12]
- readr          [* -> 2.1.5]
- remotes        [* -> 2.5.0]
- renv           [* -> 1.0.5]
- rlang          [* -> 1.1.3]
- rmarkdown      [* -> 2.29]
- roxygen2       [* -> 7.3.2]
- rprojroot      [* -> 2.0.4]
- rstudioapi     [* -> 0.16.0]
- rversions      [* -> 2.1.2]
- sass           [* -> 0.4.9]
- scales         [* -> 1.3.0]
- sessioninfo    [* -> 1.2.2]
- shiny          [* -> 1.8.1.1]
- sourcetools    [* -> 0.1.7-1]
- stringi        [* -> 1.8.3]
- stringr        [* -> 1.5.1]
- sys            [* -> 3.4.2]
- testthat       [* -> 3.2.3]
- tibble         [* -> 3.2.1]
- tidyr          [* -> 1.3.1]
- tidyselect     [* -> 1.2.1]
- tinytex        [* -> 0.50]
- tzdb           [* -> 0.4.0]
- urlchecker     [* -> 1.0.1]
- usethis        [* -> 2.2.3]
- utf8           [* -> 1.2.4]
- vctrs          [* -> 0.6.5]
- viridisLite    [* -> 0.4.2]
- vroom          [* -> 1.6.5]
- waldo          [* -> 0.6.1]
- whisker        [* -> 0.4.1]
- withr          [* -> 3.0.2]
- xfun           [* -> 0.51]
- xopen          [* -> 1.0.1]
- xtable         [* -> 1.8-4]
- yaml           [* -> 2.3.8]
- zip            [* -> 2.3.1]

# https://carpentries.r-universe.dev -----------------------------------------
- curl           [* -> 5.2.1]
- openssl        [* -> 2.1.1]
- ragg           [* -> 1.3.0]
- systemfonts    [* -> 1.0.6]
- textshaping    [* -> 0.3.7]
- xml2           [* -> 1.3.6]

The version of R recorded in the lockfile will be updated:
- R              [* -> 4.3.3]

- Lockfile written to "~/projects/uob/astronaut-data-analysis-fair-r/spacewalks/renv.lock".

Restarting R session...

- Project '~/projects/uob/astronaut-data-analysis-fair-r/spacewalks' loaded. [renv 1.0.5]

Alongside our existing content, our compendium now also contains the infrastructure that powers renv dependency management:

.
├── .Rprofile
├── renv
│   ├── .gitignore
│   ├── activate.R
│   ├── settings.json
│   └── library
│       └── R-4.3
└── renv.lock

The renv.lock contains a list of names and versions of packages used in our project.

The folder renv/library/R-4.3 contains the project-specific library of installed packages for the R version the analysis is currently being performed on. This is never shared with recipients of our compendium. Rather, we share the rest of the files (the renv.lock file, the .Rprofile file and renv/activate.R). Making these available means other users of your code will have the appropriate packages installed in their own local library when they download and use your code.

Once we have run renv::init(), we should run renv::status() to check the status of our project library.

R

renv::status()

OUTPUT

No issues found -- the project is in a consistent state.


Restoring an Environment 5


Anyone who wants to install the same packages that we use with their exact versions can download our code, open the compendium and use renv::restore().

This will install everything in their local project library so they can be up and running in no time.

Challenge

Restore an environment 6

  1. Download this reproducible project.

  2. (Due to a recent issue with RStudio, you might need to press Enter before continuing).

  3. Open the project and run renv::status() in the R console. What’s the status of the packages?

  4. Run renv::restore() in the R console and proceed.

  5. Run renv::status() again to check that the project is in a consistent state.

  6. Render analysis/report.Rmd to make sure that it worked.

Updating an Environment


If we continue working on our project and add some new functionality that uses a new dependency, we will need to update our DESCRIPTION file, our renv.lock file and our project library renv/library.

Let’s add a new function to R/eva_data_analysis.R that tabulates crew sizes in our spacewalks data.

R

generate_summary_table <- function(data, column_name, output_file) {

  # Check if the column exists in the data
  if (!(column_name %in% colnames(data))) {
    stop("Column not found in the data frame")
  }

  # Create summary statistics table using gtsummary
  summary_table <- data |>
    select(all_of(column_name)) |>
    gtsummary::tbl_summary()

  # Save the summary table to the specified output file
  gtsummary::as_gt(summary_table) |>
    gt::gtsave(output_file)

  cat("Summary table saved to", output_file, "\n")
}

This uses the gtsummary package to tabulate the data.

Let’s add the new function to R/eva_data_analysis.R and add the following lines to analysis/scripts/run_analysis.R to call the function and save the result:

R

generate_summary_table(eva_data, "crew_size", "summary_table.html")

Checklist

Remember to “Clean and Install” (Build panel) to make sure that our changes have been installed.

When we run analysis/scripts/run_analysis.R we see an error because gtsummary is not installed in our local project library.

ERROR

- The project is out-of-sync -- use `renv::status()` for details.
[conflicted] Will prefer dplyr::filter over any other package.
[conflicted] Will prefer dplyr::lag over any other package.
Error in library(package, pos = pos, lib.loc = lib.loc, character.only = TRUE,  :
  there is no package called ‘spacewalks’
Calls: library -> library
Execution halted


Before we install gtsummary let’s run renv::status() to check the status of our current project dependencies.

R

renv::status()

OUTPUT

No issues found -- the project is in a consistent state.

Now let’s install our missing dependency in the console and rerun analysis/scripts/run_analysis.R to check that everything is running correctly.

R

> install.packages("gtsummary")

OUTPUT

# Downloading packages -------------------------------------------------------
- Downloading gtsummary from CRAN ...           OK [1.6 Mb in 1.5s]
- Downloading cards from CRAN ...               OK [523.8 Kb in 1.3s]
- Downloading gt from CRAN ...                  OK [5.7 Mb in 1.8s]
- Downloading bigD from CRAN ...                OK [1.1 Mb in 1.3s]
- Downloading bitops from CRAN ...              OK [24.8 Kb in 0.81s]
- Downloading juicyjuice from CRAN ...          OK [1.1 Mb in 1.3s]
- Downloading V8 from CRAN ...                  OK [8.4 Mb in 2.2s]
- Downloading markdown from CRAN ...            OK [142.2 Kb in 0.82s]
- Downloading reactable from CRAN ...           OK [1 Mb in 1.0s]
- Downloading reactR from CRAN ...              OK [594.8 Kb in 0.89s]
Successfully downloaded 10 packages in 17 seconds.

The following package(s) will be installed:
- bigD       [0.3.0]
- bitops     [1.0-9]
- cards      [0.5.0]
- gt         [0.11.1]
- gtsummary  [2.1.0]
- juicyjuice [0.1.0]
- markdown   [1.13]
- reactable  [0.4.4]
- reactR     [0.6.1]
- V8         [6.0.1]
These packages will be installed into "~/projects/uob/astronaut-data-analysis-fair-r/spacewalks/renv/library/R-4.3/aarch64-apple-darwin20".
Would you like to proceed? [Y/n]: Y

OUTPUT

# Installing packages --------------------------------------------------------
- Installing cards ...                          OK [installed binary and cached in 0.29s]
- Installing bigD ...                           OK [installed binary and cached in 0.29s]
- Installing bitops ...                         OK [installed binary and cached in 0.23s]
- Installing V8 ...                             OK [installed binary and cached in 0.5s]
- Installing juicyjuice ...                     OK [installed binary and cached in 0.28s]
- Installing markdown ...                       OK [installed binary and cached in 0.31s]
- Installing reactR ...                         OK [installed binary and cached in 0.32s]
- Installing reactable ...                      OK [installed binary and cached in 0.29s]
- Installing gt ...                             OK [installed binary and cached in 0.32s]
- Installing gtsummary ...                      OK [installed binary and cached in 0.28s]
Successfully installed 10 packages in 3.5 seconds.

Now let’s run renv::status() in the console:

R

> renv::status()

The following package(s) are in an inconsistent state:

 package    installed recorded used
 bigD       y         n        y
 bitops     y         n        y
 cards      y         n        y
 gt         y         n        y
 gtsummary  y         n        y
 juicyjuice y         n        y
 markdown   y         n        y
 reactable  y         n        y
 reactR     y         n        y
 V8         y         n        y

See ?renv::status() for advice on resolving these issues.


Now let's run renv::snapshot() in the console.

R

renv::snapshot()

OUTPUT

The following package(s) will be updated in the lockfile:

# CRAN ----------------------------------------------------------------------
- bigD         [* -> 0.3.0]
- bitops       [* -> 1.0-9]
- cards        [* -> 0.5.0]
- gt           [* -> 0.11.1]
- gtsummary    [* -> 2.1.0]
- juicyjuice   [* -> 0.1.0]
- markdown     [* -> 1.13]
- reactable    [* -> 0.4.4]
- reactR       [* -> 0.6.1]
- V8           [* -> 6.0.1]

Do you want to proceed? [Y/n]: Y

OUTPUT

- Lockfile written to "~/projects/uob/astronaut-data-analysis-fair-r/spacewalks/renv.lock".

Finally, let’s run renv::status() in the console:

R

renv::status()

OUTPUT

No issues found -- the project is in a consistent state.

Caveats 7


The lockfile

The lockfile holds a snapshot of the project library at a moment in time, but it doesn’t guarantee that this corresponds to the rendered result. The lockfile could be out of date when the code is run, or the code might have been run with a previous version of the lockfile. It’s up to you to always render your file when the project is in a consistent state.

Dependency discovery

The automatic dependency discovery is really cool, but somewhat limited. It understands the most common and obvious ways a package can be loaded in a script, but it can fail if you use some more indirect methods (for example, a dependency like markdown that is only used indirectly may not be detected). It also fails if you use functionality in a package that depends on Suggested packages (e.g. ggplot2::geom_hex() requires the hexbin package, so you need to add an explicit library(hexbin) somewhere in your project).
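
If you rely on this kind of indirect functionality, a simple workaround is to mention the package explicitly somewhere renv can see it, for example:

R

# ggplot2::geom_hex() needs the hexbin package at run time, but renv cannot
# infer that from the geom_hex() call alone - loading it explicitly lets
# renv::dependencies() pick it up
library(hexbin)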

Package installation

Sometimes package installation fails. One common case would be if you installed a CRAN-compiled package in Windows but the person trying to restore() the environment is running Linux. Since CRAN doesn’t offer compiled packages for Linux, renv will try to install from source, which can fail if compilation requires missing system dependencies. There’s nothing renv can do in this case, but the problem can be resolved by installing the relevant system dependencies.

Package installation will fail if the remote repository that hosts a package is unreachable either due to local connection issues or it being down, or deleted. Again, there’s nothing renv can do in that situation.

Package installation can fail if the package requires compilation but the machine doesn’t have enough RAM to compile. This is the case with the sf package, which cannot be compiled in the free-tier RStudio Cloud machine.

System dependencies

Furthermore, some R packages require certain system dependencies to be installed to run. renv does not handle these cases yet, so if you are using a package that needs system dependencies, installation could fail if these are not met. Even if installation goes well, a package might not work if it has unmet runtime dependencies.

Even in the case in which system dependencies are fulfilled, renv offers no guarantee that these are the same versions used to run the analysis. This means that if results depend on the version of some system dependency, renv will not be able to ensure reproducibility. This includes the version of R itself!

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

  • The renv package official documentation

Also check the full reference set for the course.

Key Points

  • renv environments keep different R package versions and dependencies required by different projects separate.
  • An renv environment is essentially a project-specific directory structure that isolates the packages and their versions used within that project/compendium.
  • You can use renv to create and manage R project environments, and use renv::restore() to install and manage external (third-party) libraries (packages) in your project.
  • By convention, you can save and export your R environment in a set of files (renv.lock), located in your project’s root directory. This file can then be shared with collaborators/users and used to replicate your environment elsewhere using renv::restore().

Attribution


This episode reuses material from the “Reproducible Development Environment” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 with modifications (i) adaptations have been made to make the material suitable for an audience of R users (e.g. replacing “software” with “code” in places, pytest with testthat), (ii) all code has been ported from Python to R (iii) Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R focussed content.


  1. Material re-used from “Reproducible Development Environment” under CC-BY-4.0 with modifications detailed above.↩︎

  2. Reused from Managing Dependences in course “Reproducible Research Data and Project Management in R” by Anna Krystalli under CC-BY-4.0 with modifications to update R output for our case-study.↩︎

  3. Reused from Managing Dependences in course “Reproducible Research Data and Project Management in R” by Anna Krystalli under CC-BY-4.0 with modifications to update R output for our case-study.↩︎

  4. Original material.↩︎

  5. Re-used from the lesson “Managing R dependencies with renv” under CC-BY-4.0 from “An R reproducibility toolkit for the practical researcher” by Elio Campitelli and Paola Corrales.↩︎

  6. Re-used from the lesson “Managing R dependencies with renv” under CC-BY-4.0 from “An R reproducibility toolkit for the practical researcher” by Elio Campitelli and Paola Corrales.↩︎

  7. Re-used from the lesson “Managing R dependencies with renv” under CC-BY-4.0 from “An R reproducibility toolkit for the practical researcher” by Elio Campitelli and Paola Corrales.↩︎

Content from Wrap-up


Last updated on 2025-04-15 | Edit this page

Overview

Questions

  • What reproducible research practices have we covered in this course?
  • What tools and practices should you learn next?

Objectives

  • Reflect on the reproducible research practices in this course and identify next steps for additional skills

In this course we have explored the significance of reproducible research and introduced some tools and practices we can use in our own R development that can help us and others do better research, including unit testing and dependency management.

Reproducible research often requires that researchers implement new practices and learn new tools - in this course we taught you some of these as a starting point but you will discover what works best for yourself, your group, community and domain.

Some of these practices may take a while to implement and may require perseverance, others you can start practicing today.

An image of a Chinese proverb “The best time to plant a tree was 20 years ago. The second best time is now” by CCNULL, used under a CC-BY 2.0 licence

Next Steps 1


In this course we focused on tools and practices relevant to R development.

A key practice / tool that we did not cover in this course is the use of version control tools to track changes in your project. Some of the tools covered in this course, e.g. the rrtools compendium, will work even better when used with version control tools.

This is an extremely important skill that can help support reproducible research - but that’s a story for another course.

Further reading


Please check out the following resources for some additional reading on the topic of this course and the full reference set.

Key Points

  • We covered a range of best practices for reproducible research in this course
  • These practices will deliver even better results if you pair them with the use of version control, e.g. git/GitHub.

Attribution


This episode reuses material from the “Wrap-up” episode of the Software Carpentries Incubator course “Tools and practices of FAIR research software” under a CC-BY-4.0 with modifications: Objectives, Questions, Key Points and Further Reading sections have been updated to reflect the remixed R focussed content.


  1. Original Content↩︎