Content from Introduction to R and RStudio


Last updated on 2024-12-10

Overview

Questions

  • How can I find my way around RStudio?
  • How can I manage my projects in RStudio?
  • How can I interact with R for data analysis and coding?

Objectives

  • Describe the purpose and functionality of each pane in RStudio
  • Locate key buttons and options within the RStudio interface
  • Create and manage self-contained projects in RStudio
  • Define and work with variables
  • Assign data to a variable
  • Use mathematical and comparison operators
  • Call and work with functions

Motivation


Science is a multi-step process: once you’ve designed an experiment and collected data, the real fun begins with analysis!

Although we could use a spreadsheet in Microsoft Excel or Google Sheets to analyze our data, these tools are limited in their flexibility and accessibility. Critically, they also make it difficult to share the steps used to explore and change the raw data, which is key to “reproducible” research.

Therefore, this lesson will teach you how to begin exploring your data using R and RStudio. The R program is available for Windows, Mac, and Linux operating systems, and is freely available from where you downloaded it above. To run R, all you need is the R program.

However, to make using R easier, we will use the program RStudio, which we also downloaded above. RStudio is a free, open-source, Integrated Development Environment (IDE). It provides a built-in editor, works on all platforms (including on servers) and provides many advantages such as integration with version control and project management.

Before Starting The Workshop


Please ensure you have the latest version of R and RStudio installed on your machine. This is important, as some packages used in the workshop may not install correctly (or at all) if R is not up to date.

Data Set


We will begin with raw data, perform exploratory analyses, and learn how to plot results graphically. We’ll use a version of the Palmer Penguins dataset from the Palmer Penguins package that we have adapted for the purposes of this course. The Palmer Penguins dataset provides morphological measurements (bill length, bill depth, flipper length, body mass) for three penguin species (Adélie, Chinstrap, and Gentoo) collected from three islands (Torgersen, Dream, and Biscoe) in the Palmer Archipelago, Antarctica.

Meet the penguins cartoon illustration of the penguin species Chinstrap! Adelie! and Gentoo! by @Allison Horst (CC-BY-4.0)

Can you read the data into R? Can you plot the relationship between bill length and flipper length? Can you calculate the mean body mass of penguins on the island of Dream? By the end of these lessons you will be able to explore key relationships in the data in minutes!

Introduction to RStudio


Throughout this lesson, we’re going to teach you some of the fundamentals of the R language as well as some best practices for organizing code for scientific projects that will make your life easier.

Basic layout

When you first open RStudio, you will be greeted by three panels:

  • The interactive R console/Terminal (entire left)
  • Environment/History/Connections (tabbed in upper right)
  • Files/Plots/Packages/Help/Viewer (tabbed in lower right)
RStudio layout

Once you open files, such as R scripts, an editor panel will also open in the top left.

RStudio layout with .R file open

Projects in RStudio


The scientific process is naturally incremental, and many projects start life as random notes, some code, then a manuscript, and eventually everything is a bit mixed together.

Most people tend to organize their projects like this:

Screenshot of file manager demonstrating bad project organisation

There are many reasons why we should ALWAYS avoid this:

  1. It is really hard to tell which version of your data is the original and which is the modified;
  2. It gets really messy because it mixes files with various extensions together;
  3. It probably takes you a lot of time to actually find things, and to relate the correct figures to the exact code that was used to generate them.

A good project layout will ultimately make your life easier:

  • It will help ensure the integrity of your data;
  • It makes it simpler to share your code with someone else (a lab-mate, collaborator, or supervisor);
  • It allows you to easily upload your code with your manuscript submission;
  • It makes it easier to pick the project back up after a break.

Creating a self-contained project

Fortunately, there are tools and packages which can help you manage your work effectively.

One of the most powerful and useful aspects of RStudio is its project management functionality. We’ll be using this today to create a self-contained, reproducible project.

We’re going to create a new project in RStudio:

  1. Click the “File” menu button, then “New Project”.
  2. Click “New Directory”.
  3. Click “New Project”.
  4. Type in the name of the directory to store your project, e.g. “my_project”.
  5. If available, select the checkbox for “Create a git repository.”
  6. Click the “Create Project” button.

Best practices for project organization

Although there is no “best” way to lay out a project, there are some general principles to adhere to that will make project management easier:

Treat data as read only

This is probably the most important goal of setting up a project. Data is typically time consuming and/or expensive to collect. Working with it interactively (e.g., in Excel) where it can be modified means you are never sure of where the data came from, or how it has been modified since collection. It is therefore a good idea to treat your data as “read-only”.

Data Cleaning

In many cases your data will be “dirty”: it will need significant preprocessing to get into a format R (or any other programming language) will find useful. This task is sometimes called “data munging”. Storing the scripts that do this cleaning in a separate folder, and creating a second “read-only” data folder to hold the “cleaned” data sets, can prevent confusion between the two sets.

Treat generated output as disposable

Anything generated by your scripts should be treated as disposable: it should all be able to be regenerated from your scripts.

There are lots of different ways to manage this output. Having an output folder with different sub-directories for each separate analysis makes it easier later, since many analyses are exploratory and don’t end up being used in the final project, and some analyses get shared between projects.

Tip: Good Enough Practices for Scientific Computing

Good Enough Practices for Scientific Computing gives the following recommendations for project organization:

  1. Put each project in its own directory, which is named after the project.
  2. Put text documents associated with the project in the doc directory.
  3. Put raw data and metadata in the data directory, and files generated during cleanup and analysis in a results directory.
  4. Put source for the project’s scripts and programs in the src directory, and programs brought in from elsewhere or compiled locally in the bin directory.
  5. Name all files to reflect their content or function.

Use Project Subfolders

Create data, src, and results directories in your project directory.

Copy the files penguins_teaching.csv, weather-data.csv, and weather-data_v2.csv from the zip file you downloaded as part of the lesson set up to the data/ folder within your project. We will load and use these files later in the course.
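
The sub-directories can also be created from the R console instead of the file manager. The following is a minimal sketch using base R's dir.create() and file.path(); the project_dir location is a hypothetical stand-in (a temporary folder), so inside your own RStudio project you would likely use "." instead.

```r
# Hypothetical project location, used for illustration only;
# inside an RStudio project you could set project_dir <- "." instead.
project_dir <- file.path(tempdir(), "my_project")
dir.create(project_dir, showWarnings = FALSE)

# Create the three sub-directories suggested above.
for (folder in c("data", "src", "results")) {
  dir.create(file.path(project_dir, folder), showWarnings = FALSE)
}

# List the folders that now exist inside the project directory.
list.dirs(project_dir, full.names = FALSE, recursive = FALSE)
```

Setting showWarnings = FALSE simply keeps the code quiet if a folder already exists, so it can be re-run safely.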

Tip: Working directory

Knowing R’s current working directory is important because when you need to access other files (for example, to import a data file), R will look for them relative to the current working directory.

Each time you create a new RStudio Project, it will create a new directory for that project. When you open an existing .Rproj file, it will open that project and set R’s working directory to the folder that file is in.

You can check the current working directory with the getwd() command, or by using the menus in RStudio.

  1. In the console, type getwd() (“wd” is short for “working directory”) and hit Enter.
  2. In the Files pane, double click on the data folder to open it (or navigate to any other folder you wish). To get the Files pane back to the current working directory, click “More” and then select “Go To Working Directory”.

You can change the working directory with setwd(), or by using RStudio menus.

  1. In the console, type setwd("data") and hit Enter. Type getwd() and hit Enter to see the new working directory.
  2. In the menus at the top of the RStudio window, click the “Session” menu button, and then select “Set Working Directory” and then “Choose Directory”. Next, in the windows navigator that opens, navigate back to the project directory, and click “Open”. Note that a setwd command will automatically appear in the console.
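
As a small sketch of the same steps in code: the example below records the current working directory, moves somewhere else, and then moves back. It uses tempdir() only as a stand-in for a folder such as "data", so it should run on any machine; the variable name original_wd is our own choice.

```r
original_wd <- getwd()   # remember where we started
setwd(tempdir())         # stand-in for a folder such as "data"
getwd()                  # now reports the temporary folder
setwd(original_wd)       # go back to the starting directory
getwd()                  # the same path as at the start
```

Saving the original path first is a cautious habit: after setwd(), relative paths such as "data/weather-data.csv" are resolved against the new location.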

Workflow within RStudio


There are two main ways one can work within RStudio:

  1. Test and play within the interactive R console then copy code into a .R file to run later.
     • This works well when doing small tests and initially starting off.
     • It quickly becomes laborious.
  2. Start writing in a .R file and use RStudio’s shortcut keys for the Run command to push the current line, selected lines or modified lines to the interactive R console.
     • This is a great way to start; all your code is saved for later.
     • You will be able to run the file you create from within RStudio or using R’s source() function.

In this lesson we’ll spend a lot of time working in the interactive console. We’ll move on to creating scripts in the next lesson.
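
To illustrate the source() function mentioned above, here is a minimal sketch. The script name my_analysis.R and its two lines of content are invented for this example, and the file is written to a temporary folder so nothing in your project is touched.

```r
# Hypothetical script file, created in a temporary folder for illustration.
script_path <- file.path(tempdir(), "my_analysis.R")
writeLines(c("total <- 1 + 100",
             "print(total)"),
           script_path)

# Run every line of the script, as if it had been typed into the console.
source(script_path)
```

After source() finishes, any variables the script created (here, total) are available in your session.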

Tip: Running segments of your code

RStudio offers you great flexibility in running code from within the editor window. There are buttons, menu choices, and keyboard shortcuts. To run the current line, you can

  1. click on the Run button above the editor panel, or
  2. select “Run Lines” from the “Code” menu, or
  3. hit Ctrl+Return in Windows or Linux or ⌘+Return on macOS. (This shortcut can also be seen by hovering the mouse over the button.)

To run a block of code, select it and then Run. If you have modified a line of code within a block of code you have just run, there is no need to reselect the section and Run; you can use the next button along, Re-run the previous region. This will run the previous code block including the modifications you have made.

R scripts

Any commands that you write in the R console can be saved to a file to be re-run later. Files containing R code to be run in this way are called R scripts. R scripts have .R at the end of their names to let you know what they are.

Introduction to R


Much of your time in R will be spent in the R interactive console. This is where you will run all of your code, and can be a useful environment to try out ideas before adding them to an R script file. This console in RStudio is the same as the one you would get if you typed in R in your command-line environment.

The first thing you will see in the R interactive session is a bunch of information, followed by a “>” and a blinking cursor. In many ways this is similar to the shell environment you learned about during the shell lessons: it operates on the same idea of a “Read, evaluate, print loop”: you type in commands, R tries to execute them, and then returns a result.

Using R as a calculator


The simplest thing you could do with R is to do arithmetic:

R

1 + 100

OUTPUT

[1] 101

And R will print out the answer, with a preceding “[1]”. [1] is the index of the first element of the line being printed in the console.

If you type in an incomplete command, R will wait for you to complete it. If you are familiar with Unix Shell’s bash, you may recognize this behavior from bash.

R

> 1 +

OUTPUT

+

Any time you hit return and the R session shows a “+” instead of a “>”, it means it’s waiting for you to complete the command. If you want to cancel a command you can hit Esc and RStudio will give you back the “>” prompt.

Tip: Canceling commands

If you’re using R from the command line instead of from within RStudio, you need to use Ctrl+C instead of Esc to cancel the command. This applies to Mac users as well!

Canceling a command isn’t only useful for killing incomplete commands: you can also use it to tell R to stop running code (for example if it’s taking much longer than you expect), or to get rid of the code you’re currently writing.

When using R as a calculator, the order of operations is the same as you would have learned back in school.

From highest to lowest precedence:

  • Parentheses: (, )
  • Exponents: ^ or **
  • Multiply and divide: * and / (equal precedence, evaluated left to right)
  • Add and subtract: + and - (equal precedence, evaluated left to right)

R

3 + 5 * 2

OUTPUT

[1] 13

Use parentheses to group operations in order to force the order of evaluation if it differs from the default, or to make clear what you intend.

R

(3 + 5) * 2

OUTPUT

[1] 16

This can get unwieldy when not needed, but clarifies your intentions. Remember that others may later read your code.

R

(3 + (5 * (2 ^ 2))) # hard to read
3 + 5 * 2 ^ 2       # clear, if you remember the rules
3 + 5 * (2 ^ 2)     # if you forget some rules, this might help

The text after each line of code is called a “comment”. Anything that follows after the hash (or octothorpe) symbol # is ignored by R when it executes code.
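
A few more expressions can help check your understanding of the precedence rules. These are extra illustrations rather than part of the original examples; the values in the comments follow from the rules listed above.

```r
10 / 2 * 5   # 25: division and multiplication are evaluated left to right
10 - 2 + 5   # 13: subtraction and addition are evaluated left to right
-2 ^ 2       # -4: the exponent is applied before the minus sign
(-2) ^ 2     # 4: parentheses force the minus sign to be applied first
2 ^ 3 ^ 2    # 512: repeated exponents are evaluated right to left, as 2 ^ (3 ^ 2)
```

When in doubt, adding parentheses costs nothing and makes the intended order explicit.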

Very small or very large numbers are printed in scientific notation:

R

2/10000

OUTPUT

[1] 2e-04

Here the e is shorthand for “multiplied by 10^XX”. So 2e-4 is shorthand for 2 * 10^(-4).

You can write numbers in scientific notation too:

R

5e3  # Note the lack of minus here

OUTPUT

[1] 5000

Mathematical functions


R has many built-in mathematical functions. To call a function, we type its name, followed by opening and closing parentheses. Functions take arguments as inputs: anything we type inside the parentheses of a function is considered an argument. Depending on the function, the number of arguments can vary from none to multiple. For example:

R

getwd() #returns an absolute filepath

doesn’t require an argument, whereas for the next set of mathematical functions we will need to supply the function a value in order to compute the result.

R

sin(1)  # trigonometry functions

OUTPUT

[1] 0.841471

R

log(1)  # natural logarithm

OUTPUT

[1] 0

R

log10(10) # base-10 logarithm

OUTPUT

[1] 1

R

exp(0.5) # e^(1/2)

OUTPUT

[1] 1.648721

Don’t worry about trying to remember every function in R. You can look them up on Google, or if you can remember the start of the function’s name, use the tab completion in RStudio.

This is one advantage that RStudio has over R on its own, it has auto-completion abilities that allow you to more easily look up functions, their arguments, and the values that they take.

Typing a ? before the name of a command will open the help page for that command. When using RStudio, this will open the ‘Help’ pane; if using R in the terminal, the help page will open in the terminal’s pager or in your browser, depending on your platform and settings. The help page will include a detailed description of the command and how it works. Scrolling to the bottom of the help page will usually show a collection of code examples which illustrate command usage.
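
When you only need a reminder of a function's arguments, the base R function args() prints them without opening the full help page. The sketch below uses log() as the example; the choice of log() and the value 100 are ours.

```r
# ?log or help("log") would open the full help page;
# args() prints only the argument names and their defaults.
args(log)

log(100, base = 10)   # supply the optional 'base' argument by name
log(100, 10)          # the same call, matching arguments by position
```

Naming arguments, as in the first call, tends to be easier to read than relying on their position.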

Comparing things


We can also do comparisons in R:

R

1 == 1  # equality (note two equals signs, read as "is equal to")

OUTPUT

[1] TRUE

R

1 != 2  # inequality (read as "is not equal to")

OUTPUT

[1] TRUE

R

1 < 2  # less than

OUTPUT

[1] TRUE

R

1 <= 1  # less than or equal to

OUTPUT

[1] TRUE

R

1 > 0  # greater than

OUTPUT

[1] TRUE

R

1 >= -9 # greater than or equal to

OUTPUT

[1] TRUE

Tip: Comparing Numbers

A word of warning about comparing numbers: you should never use == to compare two numbers unless they are integers (a data type which can specifically represent only whole numbers).

Computers may only represent decimal numbers with a certain degree of precision, so two numbers which look the same when printed out by R, may actually have different underlying representations and therefore be different by a small margin of error (called Machine numeric tolerance).

Instead you should use the all.equal function.

Further reading: http://floating-point-gui.de/
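
A short sketch of the problem and the suggested fix, using the classic example of 0.1 + 0.2. Note that all.equal() returns a description of the difference rather than FALSE when the numbers differ, so it is usually wrapped in isTRUE() when a plain TRUE or FALSE is needed.

```r
0.1 + 0.2 == 0.3                    # FALSE: the two values differ by a tiny rounding error
all.equal(0.1 + 0.2, 0.3)           # TRUE: equal within a small tolerance
isTRUE(all.equal(0.1 + 0.2, 0.3))   # TRUE
isTRUE(all.equal(1, 1.1))           # FALSE: the difference is far larger than the tolerance
```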

Variables and assignment


We can store values in variables using the assignment operator <-, like this:

R

x <- 1/40

Notice that assignment does not print a value. Instead, we stored it for later in something called a variable. x now contains the value 0.025:

R

x

OUTPUT

[1] 0.025

More precisely, the stored value is a decimal approximation of this fraction called a floating point number.

Look for the Environment tab in the top right panel of RStudio, and you will see that x and its value have appeared. Our variable x can be used in place of a number in any calculation that expects a number:

R

log(x)

OUTPUT

[1] -3.688879

Notice also that variables can be reassigned:

R

x <- 100

x used to contain the value 0.025 and now it has the value 100.

Assignment values can contain the variable being assigned to:

R

x <- x + 1 #notice how RStudio updates its description of x on the top right tab
y <- x * 2

The right hand side of the assignment can be any valid R expression. The right hand side is fully evaluated before the assignment occurs.

Variable names can contain letters, numbers, underscores and periods but no spaces. They must start with a letter or a period not followed by a number (they cannot start with a number or an underscore). Variables beginning with a period are hidden variables. Different people use different conventions for long variable names, including:

  • periods.between.words
  • underscores_between_words
  • camelCaseToSeparateWords

What you use is up to you, but be consistent.

It is also possible to use the = operator for assignment:

R

x = 1/40

But this is much less common among R users. The most important thing is to be consistent with the operator you use. There are occasionally places where it is less confusing to use <- than =, and it is the most common symbol used in the community. So the recommendation is to use <-.

Vectorization


One final thing to be aware of is that R is vectorized, meaning that variables and functions can have vectors as values. A vector is a one-dimensional array of ordered values, all of the same data type. For example:

R

1:5

OUTPUT

[1] 1 2 3 4 5

We can assign a vector to a variable:

R

x <- 5:10

We can apply functions to all the elements of a vector:

R

(1:5) * 2

OUTPUT

[1]  2  4  6  8 10

R

2^(1:5)

OUTPUT

[1]  2  4  8 16 32
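
The same element-by-element behaviour applies to named functions and to operations between two vectors. The following lines are a small additional sketch; the particular numbers are chosen only for illustration.

```r
sqrt(c(1, 4, 9))             # 1 2 3: the function is applied to each element
c(1, 2, 3) + c(10, 20, 30)   # 11 22 33: elements are added pairwise
(1:5) > 2                    # FALSE FALSE TRUE TRUE TRUE: comparisons are vectorized too
```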

We can also create vectors “by hand” using the c() function; this tersely named function is used to combine values into a vector; these values can, themselves, be vectors:

R

c(2, 4, -1)

OUTPUT

[1]  2  4 -1

R

c(x, 2, 2, 3)

OUTPUT

[1]  5  6  7  8  9 10  2  2  3

Vectors aren’t limited to storing numbers:

R

c("a", "b", "c", "def")

OUTPUT

[1] "a"   "b"   "c"   "def"

R comes with a few built in constants, containing useful values:

R

LETTERS
letters
month.abb
month.name

We will use some of these in the examples that follow.

Vector lengths

We can calculate how many elements a vector contains using the length() function:

R

length(x)

OUTPUT

[1] 6

R

length(letters)

OUTPUT

[1] 26

Subsetting vectors

Having defined a vector, it’s often useful to extract parts of a vector. We do this with the [] operator. Using the built in month.name vector:

R

month.name[2]

OUTPUT

[1] "February"

R

month.name[2:4]

OUTPUT

[1] "February" "March"    "April"   

Let’s unpick the second example; 2:4 generates the sequence 2,3,4. This gets passed to the extract operator []. We can also generate this sequence using the c() function:

R

month.name[c(2,3,4)]

OUTPUT

[1] "February" "March"    "April"   

Vector numbering in R starts at 1

In many programming languages (C and Python, for example), the first element of a vector has an index of 0. In R, the first element has an index of 1.

Values are returned in the order that we specify the indices. We can extract the same element more than once:

R

month.name[4:2]

OUTPUT

[1] "April"    "March"    "February"

R

month.name[c(1,1,2,3,4)]

OUTPUT

[1] "January"  "January"  "February" "March"    "April"   

If we try and extract an element that doesn’t exist in the vector, the missing values are NA:

R

month.name[10:13]

OUTPUT

[1] "October"  "November" "December" NA        

Missing data

NA is a special value, that is used to represent “not available”, or “missing”. If we perform computations which include NA, the result is usually NA:

R

1 + NA

OUTPUT

[1] NA

This raises an interesting point; how do we test if a value is NA? This doesn’t work:

R

x <- NA
x == NA

OUTPUT

[1] NA

Handling special values

There are a number of special functions you can use to handle missing data, and other special values:

  • is.na returns a logical vector that is TRUE at every position in a vector, matrix, or data.frame containing NA.
  • likewise, is.nan and is.infinite do the same for NaN and Inf.
  • is.finite returns a logical vector that is TRUE at every position in a vector, matrix, or data.frame that does not contain NA, NaN or Inf.
  • na.omit will filter out all missing values from a vector.
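
The functions listed above can be tried on a small vector. This is a sketch with made-up values; the name values is our own. Note that is.na() is also TRUE for NaN, and that na.omit() drops both NA and NaN.

```r
values <- c(1, NA, 3, NaN, Inf)

is.na(values)        # FALSE  TRUE FALSE  TRUE FALSE
is.nan(values)       # FALSE FALSE FALSE  TRUE FALSE
is.infinite(values)  # FALSE FALSE FALSE FALSE  TRUE
is.finite(values)    #  TRUE FALSE  TRUE FALSE FALSE

# Keep only the non-missing values; as.numeric() drops the bookkeeping
# attributes that na.omit() attaches to its result.
as.numeric(na.omit(values))   # 1 3 Inf
```

So the working replacement for x == NA is is.na(x).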

Skipping and removing elements

If we use a negative number as the index of a vector, R will return every element except for the one specified:

R

month.name[-2]

OUTPUT

 [1] "January"   "March"     "April"     "May"       "June"      "July"
 [7] "August"    "September" "October"   "November"  "December" 

We can skip multiple elements:

R

month.name[c(-1, -5)]  # or month.name[-c(1,5)]

OUTPUT

 [1] "February"  "March"     "April"     "June"      "July"      "August"
 [7] "September" "October"   "November"  "December" 

Tip: Order of operations

A common error occurs when trying to skip slices of a vector. Most people first try to negate a sequence like so:

R

month.name[-1:3]

This gives a somewhat cryptic error:

ERROR

Error in month.name[-1:3]: only 0's may be mixed with negative subscripts

But remember the order of operations. : is really a function, so what happens is it takes its first argument as -1, and second as 3, so generates the sequence of numbers: -1, 0, 1, 2, 3.

The correct solution is to wrap that function call in brackets, so that the - operator is applied to the sequence:

R

-(1:3)

OUTPUT

[1] -1 -2 -3

R

month.name[-(1:3)]

OUTPUT

[1] "April"     "May"       "June"      "July"      "August"    "September"
[7] "October"   "November"  "December" 

Subsetting with logical vectors

As well as providing a list of indices we want to keep (or delete, if we prefix them with -), we can pass a logical vector to R indicating the indices we wish to select:

R

fourmonths <- month.name[1:4]
fourmonths

OUTPUT

[1] "January"  "February" "March"    "April"   

R

fourmonths[c(TRUE, FALSE, TRUE, TRUE)]

OUTPUT

[1] "January" "March"   "April"  

What happens if we supply a logical vector that is shorter than the vector we’re extracting the elements from?

R

fourmonths[c(TRUE,FALSE)]

OUTPUT

[1] "January" "March"  

This illustrates the idea of vector recycling; the [] extract operator “recycles” the subsetting vector:

R

fourmonths[c(TRUE,FALSE,TRUE,FALSE)]

OUTPUT

[1] "January" "March"  

This can be useful, but can easily catch you out.

The idea of selecting elements of a vector using a logical subsetting vector may seem a bit esoteric, and a lot more typing than just selecting the elements you want by index. It becomes really useful when we write code to generate the logical vector:

R

my_vector <- c(0.01, 0.69, 0.51, 0.39)
my_vector > 0.5

OUTPUT

[1] FALSE  TRUE  TRUE FALSE

R

my_vector[my_vector > 0.5]

OUTPUT

[1] 0.69 0.51

Tip: Combining logical conditions

There are many situations in which you will wish to combine multiple logical criteria. For example, we might want to find all the elements that are between two values. Several operations for combining logical vectors exist in R:

  • &, the “logical AND” operator: returns TRUE if both the left and right are TRUE.
  • |, the “logical OR” operator: returns TRUE, if either the left or right (or both) are TRUE.

The recycling rule applies with both of these, so TRUE & c(TRUE, FALSE, TRUE) will compare the first TRUE on the left of the & sign with each of the three conditions on the right.

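You may sometimes see && and || instead of & and |. These operators do not use the recycling rule: they expect a single TRUE or FALSE on each side. Older versions of R only looked at the first element of each vector and ignored the remaining elements; since R 4.3.0, passing a vector with more than one element is an error. The longer operators are mainly used in programming, rather than data analysis.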

  • !, the “logical NOT” operator: converts TRUE to FALSE and FALSE to TRUE. It can negate a single logical condition (e.g. !TRUE becomes FALSE), or a whole vector of conditions (e.g. !c(TRUE, FALSE) becomes c(FALSE, TRUE)).

Additionally, you can compare the elements within a single vector using the all function (which returns TRUE if every element of the vector is TRUE) and the any function (which returns TRUE if one or more elements of the vector are TRUE).
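
The operators and functions in this tip can be sketched with the my_vector example from earlier; the thresholds 0.3, 0.6 and 0.1 are arbitrary values chosen for illustration.

```r
my_vector <- c(0.01, 0.69, 0.51, 0.39)

# Elements between two values: both conditions must be TRUE.
my_vector[my_vector > 0.3 & my_vector < 0.6]   # 0.51 0.39

# Elements outside a range: either condition may be TRUE.
my_vector[my_vector < 0.1 | my_vector > 0.6]   # 0.01 0.69

# Negate a condition to select everything that is not above 0.5.
my_vector[!(my_vector > 0.5)]                  # 0.01 0.39

any(my_vector > 0.5)   # TRUE: at least one element is above 0.5
all(my_vector > 0.5)   # FALSE: not every element is above 0.5
```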

Challenge 1

Which of the following are valid R variable names?

R

min_depth
max.depth
_region
.mass
MaxLength
min-length
15Ndelta
celsius2kelvin

The following can be used as R variables:

R

min_depth
max.depth
MaxLength
celsius2kelvin

The following creates a hidden variable:

R

.mass

The following cannot be used to create a variable:

R

_region
min-length
15Ndelta
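
One way to check such names programmatically is the base R function make.names(), which rewrites any string into a syntactically valid name. If a name comes back unchanged, it was already valid. This is a sketch of that idea; the variable names candidates and is_valid are our own.

```r
candidates <- c("min_depth", "max.depth", "_region", ".mass",
                "MaxLength", "min-length", "15Ndelta", "celsius2kelvin")

# A name is syntactically valid if make.names() leaves it unchanged.
is_valid <- make.names(candidates) == candidates

candidates[is_valid]    # names that can be used as variables
candidates[!is_valid]   # names that R would reject
```

Note that .mass counts as valid here: it can be used, but it creates a hidden variable.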

Challenge 2

What will be the value of each variable after each statement in the following program?

R

mass <- 47.5
depth <- 122
mass <- mass * 2.3
depth <- depth - 20

R

mass <- 47.5

This will give a value of 47.5 for the variable mass

R

depth <- 122

This will give a value of 122 for the variable depth

R

mass <- mass * 2.3

This will multiply the existing value of 47.5 by 2.3 to give a new value of 109.25 to the variable mass.

R

depth <- depth - 20

This will subtract 20 from the existing value of 122 to give a new value of 102 to the variable depth.

Challenge 3

Run the code from the previous challenge, and write a command to compare mass to depth. Is mass larger than depth?

One way of answering this question in R is to use the > operator:

R

mass > depth

OUTPUT

[1] TRUE

This should yield a boolean value of TRUE since 109.25 is greater than 102.

Challenge 4

Return a vector containing the letters of the alphabet in reverse order

We can extract the elements in reverse order by generating the sequence 26, 25, … 1 using the : operator:

R

letters[26:1]

OUTPUT

 [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
[20] "g" "f" "e" "d" "c" "b" "a"

Although this works, it makes the assumption that letters will always be length 26. Although this is probably a safe assumption in English, other languages may have more (or fewer) letters in their alphabet. It is good practice to avoid assuming anything about your data.

We can avoid assuming letters is length 26, by using the length() function:

R

letters[length(letters):1]

OUTPUT

 [1] "z" "y" "x" "w" "v" "u" "t" "s" "r" "q" "p" "o" "n" "m" "l" "k" "j" "i" "h"
[20] "g" "f" "e" "d" "c" "b" "a"
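
Base R also provides the rev() function, which reverses a vector without any index arithmetic. It is not used elsewhere in this lesson, so treat this as an optional alternative:

```r
reversed <- rev(letters)   # reverse the built-in vector of lower-case letters
reversed[1:3]              # "z" "y" "x"
```

Like the length() version, rev() makes no assumption about how many elements the vector contains.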

Challenge 5

Given the following code:

R

x <- c(5.4, 6.2, 7.1, 7.5, 4.8)
x

OUTPUT

[1] 5.4 6.2 7.1 7.5 4.8

Come up with at least 3 different commands that will produce the following output:

OUTPUT

[1] 6.2 7.1 7.5

After you find 3 different commands, compare notes with your neighbour. Did you have different strategies?

R

x[2:4]

OUTPUT

[1] 6.2 7.1 7.5

R

x[-c(1,5)]

OUTPUT

[1] 6.2 7.1 7.5

R

x[c(2,3,4)]

OUTPUT

[1] 6.2 7.1 7.5

R

x[c(FALSE, TRUE, TRUE, TRUE, FALSE)]

OUTPUT

[1] 6.2 7.1 7.5

(We can use vector recycling to make the previous example slightly shorter:

R

x[c(FALSE, TRUE, TRUE, TRUE)]

OUTPUT

[1] 6.2 7.1 7.5

The first element of the logical vector will be recycled)

We can construct a logical test to generate the logical vector:

R

x[x > 6]

OUTPUT

[1] 6.2 7.1 7.5

Unpicking this example, x > 6 generates the logical vector:

OUTPUT

[1] FALSE  TRUE  TRUE  TRUE FALSE

Which is passed to the [ ] subsetting operator.

Key Points

  • Use RStudio to write and run R programs.
  • R has the usual arithmetic operators and mathematical functions.
  • Use <- to assign values to variables.
  • Use RStudio to create and manage projects with a consistent structure.
  • Treat raw data as read-only.
  • Treat generated output as disposable.

Content from Data Structures and Subsetting Data


Last updated on 2024-12-10

Overview

Questions

  • How can I read data into R?
  • What are the basic data types in R?
  • How can I work with subsets of data in R?
  • How do I represent and work with categorical data in R?

Objectives

  • Identify and understand the five main data types in R
  • Explore data frames and understand their relationship with vectors and lists
  • Subset vectors, lists, and data frames
  • Extract individual and multiple elements: by index, by name, using comparison operations
  • Skip and remove elements from various data structures

One of R’s most powerful features is its ability to deal with tabular data - such as you may already have in a spreadsheet or a CSV file. Let’s start by making a toy dataset in your data/ directory, called weather-data.csv:

R

weather <- data.frame(
  island = c("torgersen", "biscoe", "dream"),
  temperature = c(1.6, 1.5, -2.6),
  snowfall = c(0, 0, 1)
)

We can now save weather as a CSV file. It is good practice to call the argument names explicitly so the function knows what default values you are changing. Here we are setting row.names = FALSE. Recall you can use ?write.csv to pull up the help file to check out the argument names and their default values.

R

write.csv(x = weather, file = "data/weather-data.csv", row.names = FALSE)

The contents of the new file, weather-data.csv:

R

island,temperature,snowfall
torgersen,1.6,0
biscoe,1.5,0
dream,-2.6,1

Tip: Editing Text files in R

Alternatively, you can create data/weather-data.csv using a text editor (such as Nano), or within RStudio with the File -> New File -> Text File menu item.

We can load this into R via the following:

R

weather <- read.csv(file = "data/weather-data.csv")
weather

OUTPUT

     island temperature snowfall
1 torgersen         1.6        0
2    biscoe         1.5        0
3     dream        -2.6        1

The read.table function is used for reading in tabular data stored in a text file where the columns of data are separated by punctuation characters, such as CSV files (csv = comma-separated values). Tabs and commas are the most common punctuation characters used to separate or delimit data points in such files. For convenience R provides two other versions of read.table: read.csv for files where the data are separated with commas and read.delim for files where the data are separated with tabs. Of these three functions read.csv is the most commonly used. If needed, it is possible to override the default delimiting punctuation marks for both read.csv and read.delim.
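
As a sketch of overriding the delimiter, the example below writes a small semicolon-separated copy of the weather table to a temporary file (the file and its contents are made up for illustration) and reads it back in two equivalent ways using the sep argument.

```r
# Temporary, semicolon-separated stand-in for a data file.
tmp <- tempfile(fileext = ".txt")
writeLines(c("island;temperature;snowfall",
             "torgersen;1.6;0",
             "biscoe;1.5;0",
             "dream;-2.6;1"),
           tmp)

# read.csv assumes commas by default, so we override the separator.
weather_semicolon <- read.csv(file = tmp, sep = ";")

# read.table needs to be told about the header row as well.
weather_table <- read.table(file = tmp, header = TRUE, sep = ";")

weather_semicolon
```

Both calls should produce the same three-row data frame; read.csv is mostly read.table with different default arguments.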

Check your data for factors

The way R handles textual data by default has changed in recent versions. Older versions of R automatically converted text data into a format called “factors”. But there is an easier format that is called “character”. We will hear about factors later, and what to use them for. For now, remember that in most cases, they are not needed and only complicate your life, which is why newer R versions read in text as “character”. Check now if your version of R has automatically created factors and convert them to “character” format:

  1. Check the data types of your input by typing str(weather)
  2. In the output, look at the type codes after the colons: If you see only “num”, “int” and “chr”, you can continue with the lesson and skip this box. If you find “Factor”, continue to step 3.
  3. Prevent R from automatically creating “factor” data. That can be done by the following code: options(stringsAsFactors = FALSE). Then, re-read the weather table for the change to take effect.
  4. You must set this option every time you restart R. To not forget this, include it in your analysis script before you read in any data, for example in one of the first lines.
  5. For R versions greater than 4.0.0, text data is no longer converted to factors anymore. So you can install this or a newer version to avoid this problem. If you are working on an institute or company computer, ask your administrator to do it.
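The steps above can also be tried without changing a global option. The sketch below uses the stringsAsFactors argument of read.csv to request factors explicitly (mimicking the old default), converts the column back with as.character, and then shows how to avoid the conversion for a single call. The table is passed as a string only to keep the example self-contained, and the variable names are illustrative.

```r
csv_text <- "island,temperature,snowfall
torgersen,1.6,0
biscoe,1.5,0
dream,-2.6,1"

# Ask explicitly for factors, mimicking the default of R versions before 4.0.0
weather_factors <- read.csv(text = csv_text, stringsAsFactors = TRUE)
island_was_factor <- is.factor(weather_factors$island)
island_was_factor

# Option 1: convert an existing factor column back to character
weather_factors$island <- as.character(weather_factors$island)
class(weather_factors$island)

# Option 2: prevent the conversion for this one call
weather_chars <- read.csv(text = csv_text, stringsAsFactors = FALSE)
class(weather_chars$island)
```

Setting the argument in the call keeps the behaviour visible in the script itself, which can be easier to follow than an option set somewhere earlier.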

We can begin exploring our dataset right away, pulling out columns by specifying them using the $ operator:

R

weather$temperature

OUTPUT

[1]  1.6  1.5 -2.6

R

weather$island

OUTPUT

[1] "torgersen" "biscoe"    "dream"    

We can also use the subsetting operator [] on our data frame. While a vector is one-dimensional, a data frame is two-dimensional, so we pass two arguments to the [] operator.

The first argument indicates the row(s) we require and the second indicates the columns. So to return rows 1 and 2, and columns 2 and 3 we can use:

R

weather[1:2,2:3]

OUTPUT

  temperature snowfall
1         1.6        0
2         1.5        0

We can do other operations on the columns:

R

## Let's convert the temperature from Celsius to Kelvin:
weather$temperature + 273.15

OUTPUT

[1] 274.75 274.65 270.55

R

paste("The location is", weather$island)

OUTPUT

[1] "The location is torgersen" "The location is biscoe"
[3] "The location is dream"    

But what about

R

weather$temperature + weather$island

ERROR

Error in weather$temperature + weather$island: non-numeric argument to binary operator

Understanding what happened here is key to successfully analyzing data in R.

Data Types

If you guessed that the last command will return an error because 1.6 plus "torgersen" is nonsense, you’re right - and you already have some intuition for an important concept in programming called data types. We can ask what type of data something is:

R

typeof(weather$temperature)

OUTPUT

[1] "double"

There are 5 main types: double, integer, complex, logical and character. For historic reasons, double is also called numeric.

R

typeof(3.14)

OUTPUT

[1] "double"

R

typeof(1L) # The L suffix forces the number to be an integer, since by default R uses float numbers

OUTPUT

[1] "integer"

R

typeof(1+1i)

OUTPUT

[1] "complex"

R

typeof(TRUE)

OUTPUT

[1] "logical"

R

typeof('banana')

OUTPUT

[1] "character"

No matter how complicated our analyses become, all data in R is interpreted as one of these basic data types. This strictness has some really important consequences.

A colleague has added weather observations for another island. This information is in the file data/weather-data_v2.csv.

R

file.show("data/weather-data_v2.csv")

R

island,temperature,snowfall
torgersen,"1.6",0
biscoe,"1.5",0
dream,"-2.6",1
deception,"-3.4 or 3.5",1

Load the new weather data like before, and check what type of data we find in the temperature column:

R

weather <- read.csv(file="data/weather-data_v2.csv")
typeof(weather$temperature)

OUTPUT

[1] "character"

Oh no, our temperatures aren’t the double type anymore! If we try to do the same math we did on them before, we run into trouble:

R

weather$temperature + 273.15

ERROR

Error in weather$temperature + 273.15: non-numeric argument to binary operator

What happened? The weather data we are working with is something called a data frame. Data frames are one of the most common and versatile types of data structures we will work with in R. A given column in a data frame cannot be composed of different data types. In this case, R cannot read every value in the temperature column as a double (the new entry contains text), so it changes the data type of the entire column to one that is suitable for every value in it: character.

When R reads a csv file, it reads it in as a data frame. Thus, when we loaded the weather csv file, it was stored as a data frame. We can recognize data frames by the first row that is written by the str() function:

R

str(weather)

OUTPUT

'data.frame':	4 obs. of  3 variables:
 $ island     : chr  "torgersen" "biscoe" "dream" "deception"
 $ temperature: chr  "1.6" "1.5" "-2.6" "-3.4 or 3.5"
 $ snowfall   : int  0 0 1 1

Data frames are composed of rows and columns, where each column has the same number of rows. Different columns in a data frame can be made up of different data types (this is what makes them so versatile), but everything in a given column needs to be the same type (e.g., double, character, or logical).
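When a column unexpectedly turns into character, it helps to find out which entries caused it. The following is a small sketch of one way to do that, not the only one: it tries to convert every value with as.numeric (one of the conversion functions covered later in this lesson) and looks for the values that end up as NA. The vector is typed in by hand to mirror the temperature column of data/weather-data_v2.csv, and the variable names are made up.

```r
# The temperature column as it was read from data/weather-data_v2.csv
temperature <- c("1.6", "1.5", "-2.6", "-3.4 or 3.5")

# Values that cannot be parsed as numbers become NA. R also emits a warning,
# which is suppressed here because we expect it.
converted <- suppressWarnings(as.numeric(temperature))

# Positions and contents of the problematic entries
problem_rows <- which(is.na(converted))
problem_rows
temperature[problem_rows]
```

Note that this check would also flag values that were already missing in the file, so the flagged rows still need to be inspected by eye.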

Let’s explore more about different data structures and how they behave. For now, let’s remove that extra line from our weather data and reload it, while we investigate this behavior further:

weather-data.csv:

island,temperature,snowfall
torgersen,"1.6",0
biscoe,"1.5",0
dream,"-2.6",1

And back in RStudio:

R

weather <- read.csv(file="data/weather-data.csv")

Vectors and Type Coercion

To better understand this behavior, let’s meet another of the data structures: the vector.

R

my_vector <- vector(length=3)
my_vector

OUTPUT

[1] FALSE FALSE FALSE

A vector in R is essentially an ordered list of things, with the special condition that everything in the vector must be the same basic data type. If you don’t choose the datatype, it’ll default to logical; or, you can declare an empty vector of whatever type you like.

R

another_vector <- vector(mode='character', length=3)
another_vector

OUTPUT

[1] "" "" ""

You can check if something is a vector:

R

str(another_vector)

OUTPUT

 chr [1:3] "" "" ""

The somewhat cryptic output from this command indicates the basic data type found in this vector - in this case chr, character; an indication of the number of things in the vector - actually, the indexes of the vector, in this case [1:3]; and a few examples of what’s actually in the vector - in this case empty character strings. If we similarly do

R

str(weather$temperature)

OUTPUT

 num [1:3] 1.6 1.5 -2.6

we see that weather$temperature is a vector, too - the columns of data we load into R data.frames are all vectors, and that’s the root of why R forces everything in a column to be the same basic data type.

Coercion by combining vectors

As we saw in lesson 1, you can also make vectors with explicit contents with the combine function:

R

combine_vector <- c(2,6,3)
combine_vector

OUTPUT

[1] 2 6 3

Given what we’ve learned so far, what do you think the following will produce?

R

quiz_vector <- c(2,6,'3')
quiz_vector

OUTPUT

[1] "2" "6" "3"

This is something called type coercion, and it is the source of many surprises and the reason why we need to be aware of the basic data types and how R will interpret them. When R encounters a mix of types (here double and character) to be combined into a single vector, it will force them all to be the same type. Consider:

R

coercion_vector <- c('a', TRUE)
coercion_vector

OUTPUT

[1] "a"    "TRUE"

R

another_coercion_vector <- c(0, TRUE)
another_coercion_vector

OUTPUT

[1] 0 1

The type hierarchy

The coercion rules go: logical -> integer -> double (“numeric”) -> complex -> character, where -> can be read as are transformed into. For example, combining logical and character transforms the result to character:

R

c('a', TRUE)

OUTPUT

[1] "a"    "TRUE"

A quick way to recognize character vectors is by the quotes that enclose them when they are printed.
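To see each step of the hierarchy in action, the short sketch below combines neighbouring types with c() and asks typeof() what the result became. The vector name coerced_types and its element names are only labels for this illustration.

```r
# Each entry mixes two neighbouring types from the hierarchy;
# typeof() reports the type that c() settled on
coerced_types <- c(
  logical_and_integer   = typeof(c(TRUE, 1L)),
  integer_and_double    = typeof(c(1L, 2.5)),
  double_and_complex    = typeof(c(2.5, 1+1i)),
  complex_and_character = typeof(c(1+1i, "a"))
)
coerced_types
```

In every pair the result takes the type further to the right in the hierarchy, which is why a single character value is enough to turn a whole vector into text.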

You can try to force coercion against this flow using the as. functions:

R

character_vector_example <- c('0','2','4')
character_vector_example

OUTPUT

[1] "0" "2" "4"

R

character_coerced_to_double <- as.double(character_vector_example)
character_coerced_to_double

OUTPUT

[1] 0 2 4

R

double_coerced_to_logical <- as.logical(character_coerced_to_double)
double_coerced_to_logical

OUTPUT

[1] FALSE  TRUE  TRUE

As you can see, some surprising things can happen when R forces one basic data type into another! Nitty-gritty of type coercion aside, the point is: if your data doesn’t look like what you thought it was going to look like, type coercion may well be to blame; make sure everything is the same type in your vectors and your columns of data.frames, or you will get nasty surprises!

But coercion can also be very useful! For example, in our weather data snowfall is numeric, but we know that the 1s and 0s actually represent TRUE and FALSE (a common way of representing them). We should use the logical datatype here, which has two states: TRUE or FALSE, which is exactly what our data represents. We can ‘coerce’ this column to be logical by using the as.logical function:

R

weather$snowfall

OUTPUT

[1] 0 0 1

R

weather$snowfall <- as.logical(weather$snowfall) # Note this is the first time we have modified our data!
weather$snowfall

OUTPUT

[1] FALSE FALSE  TRUE
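One reason the logical type is convenient: TRUE and FALSE behave like 1 and 0 in arithmetic, and a logical vector can be used to pick elements out of another vector. The sketch below illustrates both with hand-typed copies of the island and snowfall columns; it is an illustration and not part of the lesson's script.

```r
island <- c("torgersen", "biscoe", "dream")
snowfall <- as.logical(c(0, 0, 1))

# TRUE counts as 1 and FALSE as 0 in arithmetic
sum(snowfall)    # number of islands with snowfall
mean(snowfall)   # proportion of islands with snowfall

# A logical vector selects the matching elements of another vector
island[snowfall]
island[!snowfall]
```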

Finally, we can ask a few questions about vectors:

R

sequence_example <- 20:25
head(sequence_example, n=2)

OUTPUT

[1] 20 21

R

tail(sequence_example, n=4)

OUTPUT

[1] 22 23 24 25

R

length(sequence_example)

OUTPUT

[1] 6

R

typeof(sequence_example)

OUTPUT

[1] "integer"

Lists

Another data structure you’ll want in your bag of tricks is the list. A list is simpler in some ways than the other types, because you can put anything you want in it. Remember everything in the vector must be of the same basic data type, but a list can have different data types:

R

list_example <- list(1, "a", TRUE, 1+4i)
list_example

OUTPUT

[[1]]
[1] 1

[[2]]
[1] "a"

[[3]]
[1] TRUE

[[4]]
[1] 1+4i

When printing the object structure with str(), we see the data types of all elements:

R

str(list_example)

OUTPUT

List of 4
 $ : num 1
 $ : chr "a"
 $ : logi TRUE
 $ : cplx 1+4i

What is the use of lists? They can organize data of different types. For example, you can organize different tables that belong together, similar to spreadsheets in Excel. But there are many other uses, too.

We will see another example that will maybe surprise you in the next chapter.

To retrieve one of the elements of a list, use the double bracket:

R

list_example[[2]]

OUTPUT

[1] "a"
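The single bracket also works on lists, but it returns something different. As a brief sketch (the variable names as_sublist and as_element are made up), compare what the two forms give back:

```r
list_example <- list(1, "a", TRUE, 1+4i)

# Single bracket: a shorter list that still wraps the element
as_sublist <- list_example[2]
typeof(as_sublist)
length(as_sublist)

# Double bracket: the element itself
as_element <- list_example[[2]]
typeof(as_element)
```

A useful way to remember it: the single bracket keeps the container, the double bracket opens it.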

The elements of lists can also have names. They are given by prepending them to the values, separated by an equals sign:

R

another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE )
another_list

OUTPUT

$title
[1] "Numbers"

$numbers
 [1]  1  2  3  4  5  6  7  8  9 10

$data
[1] TRUE

This results in a named list. Names give our object a new capability: we can now access single elements in an additional way!

R

another_list$title

OUTPUT

[1] "Numbers"

Names

With names, we can give meaning to elements. This is the first time that we have not only the data, but also information that explains it. It is metadata that can be stuck to the object like a label. In R, this is called an attribute. Some attributes enable us to do more with our object, for example, as here, accessing an element by a self-defined name.
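Attributes can be inspected directly. The sketch below rebuilds the named list from above and looks at its attributes with the base R functions attributes() and attr(); it is meant as a quick look under the hood rather than something you need in everyday analysis.

```r
another_list <- list(title = "Numbers", numbers = 1:10, data = TRUE)

# All attributes attached to the object, as a list
attributes(another_list)

# One attribute, requested by its name
attr(another_list, "names")

# names() is the usual shortcut for the same information
identical(attr(another_list, "names"), names(another_list))
```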

Accessing vectors and lists by name

We have already seen how to generate a named list. The way to generate a named vector is very similar. You have seen this function before:

R

pizza_price <- c( pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50 )

The way to retrieve elements is different, though:

R

pizza_price["pizzasubito"]

OUTPUT

pizzasubito
       5.64 

The approach used for the list does not work:

R

pizza_price$pizzafresh

ERROR

Error in pizza_price$pizzafresh: $ operator is invalid for atomic vectors

It will pay off if you remember this error message, because you will meet it in your own analyses. It means that you have just tried accessing an element as if it were in a list, but it is actually in a vector.
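If you run into this error, there are a few ways around it. The sketch below shows some options for a named vector: the single bracket (which keeps the name), the double bracket (which returns just the value), and converting the vector to a list so that $ becomes available. The name pizza_list is made up for this illustration.

```r
pizza_price <- c(pizzasubito = 5.64, pizzafresh = 6.60, callapizza = 4.50)

# Single bracket: a named vector of length one
pizza_price["pizzafresh"]

# Double bracket: only the value, without the name
pizza_price[["pizzafresh"]]

# Several elements at once, by name
pizza_price[c("pizzasubito", "callapizza")]

# After converting to a list, the $ operator works
pizza_list <- as.list(pizza_price)
pizza_list$pizzafresh
```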

Accessing and changing names

If you are only interested in the names, use the names() function:

R

names(pizza_price)

OUTPUT

[1] "pizzasubito" "pizzafresh"  "callapizza" 

We have seen how to access and change single elements of a vector. The same is possible for names:

R

names(pizza_price)[3]

OUTPUT

[1] "callapizza"

R

names(pizza_price)[3] <- "call-a-pizza"
pizza_price

OUTPUT

 pizzasubito   pizzafresh call-a-pizza
        5.64         6.60         4.50 

Data frames


We have seen data frames at the very beginning of this lesson; they represent a table of data. We didn’t go into much detail with our example weather data frame:

R

weather

OUTPUT

     island temperature snowfall
1 torgersen         1.6    FALSE
2    biscoe         1.5    FALSE
3     dream        -2.6     TRUE

We can now understand something a bit surprising in our data.frame; what happens if we run:

R

typeof(weather)

OUTPUT

[1] "list"

We see that data.frames look like lists ‘under the hood’. Think again what we heard about what lists can be used for:

Lists organize data of different types

Columns of a data frame are vectors of different types, that are organized by belonging to the same table.

A data.frame is really a list of vectors. It is a special list in which all the vectors must have the same length.

How is this “special”-ness written into the object, so that R does not treat it like any other list, but as a table?

R

class(weather)

OUTPUT

[1] "data.frame"

A class, just like names, is an attribute attached to the object. It tells us what this object means for humans.

You might wonder: why do we need another what-type-of-object-is-this function when we already have typeof()? That function tells us how the object is constructed in the computer. The class is the meaning of the object for humans. Consequently, what typeof() returns is fixed in R (mainly the five data types), whereas the output of class() is diverse and extendable by R packages.
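To make the difference concrete, the sketch below compares typeof() and class() for three objects: a small data frame, a factor, and a date. The objects and the date value are invented for this example; only typeof(), class(), factor() and as.Date() are base R functions.

```r
weather_example <- data.frame(
  island = c("torgersen", "biscoe"),
  temperature = c(1.6, 1.5)
)
island_factor <- factor(c("torgersen", "biscoe"))
observation_date <- as.Date("2020-01-15")  # made-up date for illustration

# Collect storage type and class side by side
type_and_class <- data.frame(
  object = c("data frame", "factor", "date"),
  typeof = c(typeof(weather_example), typeof(island_factor), typeof(observation_date)),
  class  = c(class(weather_example), class(island_factor), class(observation_date))
)
type_and_class
```

The storage types come from a small fixed set, while the classes describe what the objects mean: a table, a categorical variable, a calendar date.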

In our weather example, we have a character, a double and a logical variable. As we have seen already, each column of a data.frame is a vector.

R

weather$island

OUTPUT

[1] "torgersen" "biscoe"    "dream"    

R

weather[,1]

OUTPUT

[1] "torgersen" "biscoe"    "dream"    

R

typeof(weather[,1])

OUTPUT

[1] "character"

R

str(weather[,1])

OUTPUT

 chr [1:3] "torgersen" "biscoe" "dream"

Each row is an observation of different variables, itself a data.frame, and thus can be composed of elements of different types.

R

weather[1,]

OUTPUT

     island temperature snowfall
1 torgersen         1.6    FALSE

R

typeof(weather[1,])

OUTPUT

[1] "list"

R

str(weather[1,])

OUTPUT

'data.frame':	1 obs. of  3 variables:
 $ island     : chr "torgersen"
 $ temperature: num 1.6
 $ snowfall   : logi FALSE

Challenge 1

An important part of every data analysis is cleaning the input data. If you know that the input data is all of the same format (e.g. numbers), your analysis is much easier! Clean the weather data set from the chapter about type coercion.

Copy the code template

Create a new script in RStudio and copy and paste the following code. Then move on to the tasks below, which help you to fill in the gaps (______).

# Read data
weather <- read.csv("data/weather-data_v2.csv")

# 1. Print the data
_____

# 2. Show an overview of the table with all data types
_____(weather)

# 3. The "temperature" column has the incorrect data type __________.
#    The correct data type is: ____________.

# 4. Correct the 4th temperature data point with the mean of the two given values
weather$temperature[4] <- -3.45
#    print the data again to see the effect
weather

# 5. Convert the temperature to the right data type
weather$temperature <- ______________(weather$temperature)

#    Calculate the mean to test yourself
mean(weather$temperature)

# If you see the correct mean value (and not NA), you did the exercise
# correctly!

Instructions for the tasks

1. Print the data

Execute the first statement (read.csv(...)). Then print the data to the console.

Show the content of any variable by typing its name.

Two correct solutions:

weather
print(weather)

2. Overview of the data types

The data type of your data is as important as the data itself. Use a function we saw earlier to print out the data types of all columns of the weather table.

In the chapter “Data types” we saw two functions that can show data types. One printed just a single word, the data type name. The other printed a short form of the data type, and the first few values. We need the second here.

 str(weather)

3. Which data type do we need?

The shown data type is not the right one for this data (temperature at a location). Which data type do we need?

  • Why did the read.csv() function not choose the correct data type?
  • Fill in the gap in the comment with the correct data type for temperature!

Scroll up to the section about the type hierarchy to review the available data types

  • temperature is expressed on a continuous scale (real numbers). The R data type for this is “double” (also known as “numeric”).
  • The fourth row has the value “-3.4 or -3.5”. That is not a number but two numbers and an English word. Therefore, the “character” data type is chosen. The whole column is now text, because all values in the same column have to be the same data type.

4. Correct the problematic value

The code to assign a new temperature value to the problematic fourth row is given. Think first and then execute it: What will be the data type after assigning a number like in this example? You can check the data type after executing to see if you were right.

Revisit the hierarchy of data types when two different data types are combined.

The data type of the column “temperature” is “character”. The assigned data type is “double”. Combining two data types yields the data type that is higher in the following hierarchy:

 logical < integer < double < complex < character

Therefore, the column is still of type character! We need to manually convert it to “double”.

5. Convert the column “temperature” to the correct data type

Temperature readings are numbers. But the column does not have this data type yet. Coerce the column to floating point numbers.

The functions to convert data types start with as.. You can look for the function further up in the manuscript or use the RStudio auto-complete function: Type “as.” and then press the TAB key.

There are two functions that are synonymous for historic reasons:

weather$temperature <- as.double(weather$temperature)
weather$temperature <- as.numeric(weather$temperature)

Challenge 2

There are several subtly different ways to call variables, observations and elements from data.frames:

  • weather[1]
  • weather[[1]]
  • weather$island
  • weather["island"]
  • weather[1, 1]
  • weather[, 1]
  • weather[1, ]

Try out these examples and explain what is returned by each one.

Hint: Use the function typeof() to examine what is returned in each case.

R

weather[1]

OUTPUT

     island
1 torgersen
2    biscoe
3     dream

We can think of a data frame as a list of vectors. The single bracket [1] returns the first slice of the list, as another list. In this case it is the first column of the data frame.

R

weather[[1]]

OUTPUT

[1] "torgersen" "biscoe"    "dream"    

The double bracket [[1]] returns the contents of the list item. In this case it is the contents of the first column, a vector of type character.

R

weather$island

OUTPUT

[1] "torgersen" "biscoe"    "dream"    

This example uses the $ character to address items by name. island is the first column of the data frame, again a vector of type character.

R

weather["island"]

OUTPUT

     island
1 torgersen
2    biscoe
3     dream

Here we are using a single bracket ["island"], replacing the index number with the column name. Like example 1, the returned object is a list.

R

weather[1, 1]

OUTPUT

[1] "torgersen"

This example uses a single bracket, but this time we provide row and column coordinates. The returned object is the value in row 1, column 1. The object is a vector of type character.

R

weather[, 1]

OUTPUT

[1] "torgersen" "biscoe"    "dream"    

Like the previous example, we use single brackets and provide row and column coordinates. The row coordinate is not specified, so R interprets this missing value as all the elements in this column and returns them as a vector.

R

weather[1, ]

OUTPUT

     island temperature snowfall
1 torgersen         1.6    FALSE

Again we use the single bracket with row and column coordinates. The column coordinate is not specified. The return value is a list (more precisely, a data frame with one row) containing all the values in the first row.

Tip: Renaming data frame columns

Data frames have column names, which can be accessed with the names() function.

R

names(weather)

OUTPUT

[1] "island"      "temperature" "snowfall"   

If you want to rename the second column of weather, you can assign a new name to the second element of names(weather).

R

names(weather)[2] <- "temperature_celsius"
weather

OUTPUT

     island temperature_celsius snowfall
1 torgersen                 1.6    FALSE
2    biscoe                 1.5    FALSE
3     dream                -2.6     TRUE

Key Points

  • Use read.csv to read tabular data in R.
  • The basic data types in R are double, integer, complex, logical, and character.
  • Data structures such as data frames are built on top of lists and vectors, with some added attributes.
  • Indexing in R starts at 1, not 0.
  • Access individual values by location using [].
  • Access slices of data using [low:high].
  • Access arbitrary sets of data using [c(...)].
  • Use logical operations and logical vectors to access subsets of data.

Content from Exploring Data Frames


Last updated on 2024-12-10

Overview

Questions

  • How can I manipulate a data frame?

Objectives

  • Add and remove rows or columns.
  • Append two data frames.
  • Display basic properties of data frames including size and class of the columns, names, and first few rows.

At this point, you’ve seen it all: in the last lesson, we toured all the basic data types and data structures in R. Everything you do will be a manipulation of those tools. But most of the time, the star of the show is the data frame—the table that we created by loading information from a csv file. In this lesson, we’ll learn a few more things about working with data frames.

Understanding our Data frame


Let’s ask some questions about this data frame to understand more about its structure:

R

str(weather)

OUTPUT

'data.frame':	3 obs. of  3 variables:
 $ island     : chr  "torgersen" "biscoe" "dream"
 $ temperature: num  1.6 1.5 -2.6
 $ snowfall   : int  0 0 1

R

summary(weather)

OUTPUT

    island           temperature         snowfall
 Length:3           Min.   :-2.6000   Min.   :0.0000
 Class :character   1st Qu.:-0.5500   1st Qu.:0.0000
 Mode  :character   Median : 1.5000   Median :0.0000
                    Mean   : 0.1667   Mean   :0.3333
                    3rd Qu.: 1.5500   3rd Qu.:0.5000
                    Max.   : 1.6000   Max.   :1.0000  

R

ncol(weather)

OUTPUT

[1] 3

R

nrow(weather)

OUTPUT

[1] 3

R

dim(weather)

OUTPUT

[1] 3 3

R

colnames(weather)

OUTPUT

[1] "island"      "temperature" "snowfall"   

R

head(weather)

OUTPUT

     island temperature snowfall
1 torgersen         1.6        0
2    biscoe         1.5        0
3     dream        -2.6        1

R

typeof(weather)

OUTPUT

[1] "list"
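Because a data frame is a list of columns, functions that work on lists also work on it. The sketch below, using a hand-built copy of the weather table, shows what length() reports for a data frame and how sapply() can apply typeof() to every column. The hand-built table is an assumption made to keep the example self-contained.

```r
weather <- data.frame(
  island = c("torgersen", "biscoe", "dream"),
  temperature = c(1.6, 1.5, -2.6),
  snowfall = c(0L, 0L, 1L)
)

# length() counts the elements of the underlying list, i.e. the columns
length(weather)
ncol(weather)

# Apply typeof() to each column in turn
column_types <- sapply(weather, typeof)
column_types["temperature"]
column_types["snowfall"]
```

So length() of a data frame is the number of columns, not the number of rows; use nrow() when you want to count observations.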

Adding columns and rows in data frames


We already learned that the columns of a data frame are vectors, so that our data are consistent in type throughout the columns. As such, if we want to add a new column, we can start by making a new vector:

R

# sunshine hours
sun <- c(10, 11, 12)
weather

OUTPUT

     island temperature snowfall
1 torgersen         1.6        0
2    biscoe         1.5        0
3     dream        -2.6        1

We can then add this as a column via:

R

cbind(weather, sun)

OUTPUT

     island temperature snowfall sun
1 torgersen         1.6        0  10
2    biscoe         1.5        0  11
3     dream        -2.6        1  12

Note that if we tried to add a vector of sunshine hours with a different number of entries than the number of rows in the data frame, it would fail:

R

sun <- c(10, 11, 12, 13)
cbind(weather, sun)

ERROR

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 4

R

sun <- c(10, 11)
cbind(weather, sun)

ERROR

Error in data.frame(..., check.names = FALSE): arguments imply differing number of rows: 3, 2

Why didn’t this work? Of course, R wants to see one element in our new column for every row in the table:

R

nrow(weather)

OUTPUT

[1] 3

R

length(sun)

OUTPUT

[1] 2

So for it to work we need to have nrow(weather) = length(sun). Let’s overwrite the content of weather with our new data frame.

R

sun <- c(10, 11, 12)
weather <- cbind(weather, sun)
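A common alternative to cbind is to assign the new vector to a column name with the $ operator. The sketch below builds the weather table by hand (an assumption, to keep the example self-contained) and compares both approaches; weather_cbind and weather_dollar are illustrative names.

```r
weather <- data.frame(
  island = c("torgersen", "biscoe", "dream"),
  temperature = c(1.6, 1.5, -2.6),
  snowfall = c(0L, 0L, 1L)
)
sun <- c(10, 11, 12)

# Approach from the lesson: bind the vector on as a new column
weather_cbind <- cbind(weather, sun)

# Alternative: assign to a column that does not exist yet
weather_dollar <- weather
weather_dollar$sun <- sun

# Compare the two results
all.equal(weather_cbind, weather_dollar)
```

The $ form lets you choose the column name explicitly, which is handy when the vector is stored under a different name than the column should have.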

Now how about adding rows? We already know that the rows of a data frame are lists:

R

new_row <- list("deception", -3.45, TRUE, 13)
weather <- rbind(weather, new_row)

Let’s confirm that our new row was added correctly.

R

weather

OUTPUT

     island temperature snowfall sun
1 torgersen        1.60        0  10
2    biscoe        1.50        0  11
3     dream       -2.60        1  12
4 deception       -3.45        1  13
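When the new row is given as an unnamed list, the values have to be in exactly the same order as the columns. A slightly more defensive sketch is to write the new row as a one-row data frame with column names, because rbind matches the columns of data frames by name. The tables below are typed in by hand for illustration, and the variable names are made up.

```r
weather_base <- data.frame(
  island = c("torgersen", "biscoe", "dream"),
  temperature = c(1.6, 1.5, -2.6),
  snowfall = c(0L, 0L, 1L),
  sun = c(10, 11, 12)
)

# The new observation as a one-row data frame with named columns
new_row <- data.frame(island = "deception", temperature = -3.45,
                      snowfall = 1L, sun = 13)
weather_named <- rbind(weather_base, new_row)

# Column order in the new row does not matter, because names are matched
shuffled_row <- data.frame(sun = 13, island = "deception",
                           snowfall = 1L, temperature = -3.45)
weather_shuffled <- rbind(weather_base, shuffled_row)

weather_named
weather_shuffled
```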

Removing rows


We now know how to add rows and columns to our data frame in R. Now let’s learn to remove rows.

R

weather

OUTPUT

     island temperature snowfall sun
1 torgersen        1.60        0  10
2    biscoe        1.50        0  11
3     dream       -2.60        1  12
4 deception       -3.45        1  13

We can ask for a data frame minus the last row:

R

weather[-4, ]

OUTPUT

     island temperature snowfall sun
1 torgersen         1.6        0  10
2    biscoe         1.5        0  11
3     dream        -2.6        1  12

Notice the comma with nothing after it to indicate that we want to drop the entire fourth row.

Note: we could also remove several rows at once by putting the row numbers inside of a vector, for example: weather[c(-3,-4), ]
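Row numbers change whenever the table changes, so it is often safer to describe the rows you want to keep with a condition. The following sketch uses a logical vector in the row position; the table is built by hand to stay self-contained, and the variable names are made up.

```r
weather <- data.frame(
  island = c("torgersen", "biscoe", "dream", "deception"),
  temperature = c(1.6, 1.5, -2.6, -3.45),
  snowfall = c(0L, 0L, 1L, 1L),
  sun = c(10, 11, 12, 13)
)

# Keep every row whose island is not "deception"
without_deception <- weather[weather$island != "deception", ]
without_deception

# Keep only the rows with temperatures above zero
above_zero <- weather[weather$temperature > 0, ]
above_zero
```

As with negative indices, the original weather data frame is unchanged unless you assign the result back to it.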

Removing columns


We can also remove columns from our data frame. What if we want to remove the column “sun”? We can remove it in two ways: by column number or by column name.

R

weather[,-4]

OUTPUT

     island temperature snowfall
1 torgersen        1.60        0
2    biscoe        1.50        0
3     dream       -2.60        1
4 deception       -3.45        1

Notice the comma with nothing before it, indicating we want to keep all of the rows.

Alternatively, we can drop the column by using its name and the %in% operator. The %in% operator goes through each element of its left argument, in this case the names of weather, and asks, “Does this element occur in the second argument?”

R

drop <- names(weather) %in% c("sun")
weather[,!drop]

OUTPUT

     island temperature snowfall
1 torgersen        1.60        0
2    biscoe        1.50        0
3     dream       -2.60        1
4 deception       -3.45        1
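Two further details are worth knowing when removing columns; the sketch below illustrates them on a hand-built copy of the table. First, assigning NULL to a column removes it in place. Second, when only one column is left after subsetting with [ , ], R simplifies the result to a vector unless drop = FALSE is given. The variable names are illustrative.

```r
weather <- data.frame(
  island = c("torgersen", "biscoe", "dream", "deception"),
  temperature = c(1.6, 1.5, -2.6, -3.45),
  snowfall = c(0L, 0L, 1L, 1L),
  sun = c(10, 11, 12, 13)
)

# Remove a column by assigning NULL to it
weather_no_sun <- weather
weather_no_sun$sun <- NULL
names(weather_no_sun)

# Selecting a single column simplifies the result to a vector ...
island_vector <- weather[, "island"]
class(island_vector)

# ... unless we ask R to keep the data frame structure
island_frame <- weather[, "island", drop = FALSE]
class(island_frame)
```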

Appending to a data frame


The key to remember when adding data to a data frame is that columns are vectors and rows are lists. We can also glue two data frames together with rbind:

R

weather <- rbind(weather, weather)
weather

OUTPUT

     island temperature snowfall sun
1 torgersen        1.60        0  10
2    biscoe        1.50        0  11
3     dream       -2.60        1  12
4 deception       -3.45        1  13
5 torgersen        1.60        0  10
6    biscoe        1.50        0  11
7     dream       -2.60        1  12
8 deception       -3.45        1  13

Saving Our Data Frame


We can use the write.table function to save our new data frame:

R

write.table(
  weather,
  file="results/prepared_weather.csv",
  sep=",", 
  quote=FALSE, 
  row.names=FALSE
)
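A quick way to check that a file was written as intended is to read it back in. The sketch below does such a round trip with write.csv (a wrapper around write.table that sets the comma separator for us) and a temporary file created by tempfile(). The temporary path is an assumption made so that the example runs anywhere; in the lesson the file goes to results/prepared_weather.csv.

```r
weather <- data.frame(
  island = c("torgersen", "biscoe", "dream", "deception"),
  temperature = c(1.6, 1.5, -2.6, -3.45),
  snowfall = c(0L, 0L, 1L, 1L),
  sun = c(10, 11, 12, 13)
)

# Write to a temporary file instead of the results folder
output_file <- tempfile(fileext = ".csv")
write.csv(weather, file = output_file, quote = FALSE, row.names = FALSE)

# Look at the raw text that was written
written_lines <- readLines(output_file)
written_lines

# Read the file back and compare it with the original
weather_reloaded <- read.csv(output_file)
all.equal(weather, weather_reloaded)

# Clean up the temporary file
unlink(output_file)
```

Note that neither write.table nor write.csv creates missing folders, so the results folder has to exist before the lesson's command is run.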

Challenge 1

You can create a new data frame right from within R with the following syntax:

R

df <- data.frame(id = c("a", "b", "c"),
                 x = 1:3,
                 y = c(TRUE, TRUE, FALSE))

Make a data frame that holds the following information for yourself:

  • first name
  • last name
  • lucky number

Then use rbind to add an entry for the people sitting beside you. Finally, use cbind to add a column with each person’s answer to the question, “Is it time for coffee break?”

R

df <- data.frame(first = c("Grace"),
                 last = c("Hopper"),
                 lucky_number = c(0))
df <- rbind(df, list("Marie", "Curie", 238) )
df <- cbind(df, coffeetime = c(TRUE,TRUE))

Key Points

  • Use cbind() to add a new column to a data frame.
  • Use rbind() to add a new row to a data frame.
  • Use str(), summary(), nrow(), ncol(), dim(), colnames(), head(), and typeof() to understand the structure of a data frame.
  • Read in a csv file using read.csv().
  • Understand what length() of a data frame represents.

Content from R Packages and Seeking Help


Last updated on 2024-12-10

Overview

Questions

  • How do I use packages in R?
  • How can I get help in R?
  • How can I manage my environment?

Objectives

  • To be able to install packages, and load them into your R session
  • To be able to use CRAN task views to identify packages to solve a problem
  • To be able to read R help files for functions and special operators
  • To be able to seek help from your peers
  • To be able to manage a workspace in an interactive R session

R packages


R packages extend the functionality of R. Over 13,000 packages have been written by others. It’s also possible to write your own packages; this can be a great way of disseminating your research and making it useful to others. A number of useful packages are installed by default with R (are part of the R core distribution). The teaching machines at the University have a number of additional packages installed by default.

We can see the packages installed on an R installation by selecting the “Packages” tab in RStudio, or by typing installed.packages() at the prompt.

In this course we will be using packages in the tidyverse to perform the bulk of our plotting and data analysis. Although we could do most of the tasks without using extra packages, the tidyverse makes it quicker and easier to perform common data analysis tasks. The tidyverse packages are already installed on the university teaching machines.

Finding and installing new packages


There are several sources of packages in R; the ones you are most likely to encounter are:

CRAN

CRAN is the main repository of packages for R. All the packages have undergone basic quality assurance when they were submitted. There are over 12,000 packages in the archive, and there is a lot of overlap between some packages. Working out which package is the most appropriate to use isn’t always straightforward.

Bioconductor

Bioconductor is a more specialised repository, which contains packages for bioinformatics. Common workflows are provided, and the packages are more thoroughly quality assured. Because of its more specialised nature we don’t focus on Bioconductor in this course.

GitHub / personal websites

Some authors distribute packages via GitHub or their own personal web pages. These packages may not have undergone any form of quality assurance. Note that many packages have their own website, but the package itself is distributed via CRAN.

Finding packages to help with your research

There are various ways of finding packages that might be useful in your research:

  • The CRAN task view provides an overview of the packages available for various research fields and methodologies.

  • rOpenSci provides packages for performing open and reproducible science. Most of these are available from CRAN or Bioconductor.

  • Journal articles should cite the R packages they used in their analysis.

  • The Journal of Statistical Software contains peer-reviewed articles for R packages (and other statistical software).

  • The R Journal contains articles about new packages in R.

Installing packages

If a package is available on CRAN, you can install it by typing:

R

install.packages("packagename")

This will automatically install any packages that the package you are installing depends on.
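Installing a package takes a while, so scripts often check first whether it is already there. The following is a minimal sketch of that idea, not an official recipe: install_if_missing is a made-up helper name, while requireNamespace() and install.packages() are base R functions. It is demonstrated with stats, a package that ships with R, so that nothing is downloaded when the example runs.

```r
# Hypothetical helper: install a package only if it cannot be found
install_if_missing <- function(package_name) {
  already_installed <- requireNamespace(package_name, quietly = TRUE)
  if (!already_installed) {
    install.packages(package_name)
  }
  # Report (invisibly) whether the package was already available
  invisible(already_installed)
}

# "stats" is part of every R installation, so no download is triggered here
was_installed <- install_if_missing("stats")
was_installed
```

In your own script you would call the helper with the package you need, for example "tidyverse".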

Installing a package doesn’t make the functions included in it available to you; to do this you must use the library() function. As we will be using the tidyverse later in the course, let’s load that now:

R

library("tidyverse")

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

The tidyverse is a collection of other packages that work well together. The tidyverse package’s main function is to load some of these other packages. We will be using some of these later in the course.

Conflicting names

You may get a warning message when loading a package that a function is “masked”. This happens when a function name has already been “claimed” by a package that’s already loaded. The most recently loaded function wins.

If you want to use the function from the other package, use packagename::function().
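
To illustrate the packagename::function() form, here is a small sketch using stats::filter(), the function that dplyr::filter() masks in the output above. The body mass values are made up for the example.

```r
# Made-up body mass values, for illustration only
body_mass_g <- c(3750, 3800, 3250, 3450, 3650)

# stats::filter() applies a linear filter; here a 3-point moving average.
# The stats:: prefix selects this version even when dplyr::filter() masks it.
moving_avg <- stats::filter(body_mass_g, rep(1 / 3, 3))
moving_avg
```

The first and last values are NA because a centred 3-point average needs a neighbour on both sides.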

Reading Help files


R, and every package, provide help files for functions. The general syntax to search for help on any function, say function_name, is:

R

?function_name
# OR
help(function_name)

This will load up a help page in RStudio (or launch a web browser, or display plain text, if you are using R without RStudio).

Each help page is broken down into sections:

  • Description: An extended description of what the function does.
  • Usage: The arguments of the function and their default values.
  • Arguments: An explanation of the data each argument is expecting.
  • Details: Any important details to be aware of.
  • Value: The data the function returns.
  • See Also: Any related functions you might find useful.
  • Examples: Some examples for how to use the function.

Different functions might have different sections, but these are the main ones you should be aware of.

Tip: Reading help files

One of the most daunting aspects of R is the large number of functions available. It would be prohibitive, if not impossible, to remember the correct usage for every function you use. Luckily, the help files mean you don’t have to!

Special Operators


To seek help on special operators, use quotes:

R

?"<-"

Getting help on packages


Many packages come with “vignettes”: tutorials and extended example documentation. Without any arguments, vignette() will list all vignettes for all installed packages; vignette(package="package-name") will list all available vignettes for package-name, and vignette("vignette-name") will open the specified vignette.

If a package doesn’t have any vignettes, you can usually find help by typing help("package-name"), or package?package-name.

When you kind of remember the function


If you’re not sure what package a function is in, or how it’s specifically spelled, you can do a fuzzy search:

R

??function_name
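
The ?? operator is shorthand for help.search(), which searches the help pages of installed packages. As a small complementary sketch, apropos() searches the names of objects in the packages that are currently loaded, which can help when you remember only part of a function name. The search pattern below is just an example.

```r
# apropos() returns the names of objects on the search path matching a pattern;
# "^read\\." means "names that start with read."
read_functions <- apropos("^read\\.")
read_functions

# help.search() is the function behind ??; uncomment to try it interactively
# help.search("read table")
```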

Citing R and R packages

If you use R in your work you should cite it, and the packages you use. The citation() command will return the appropriate citation for R itself. citation("packagename") will provide the citation for packagename; note that the package name is given as a quoted string.
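
A brief sketch of how these calls might look in practice. The package "stats" is used only because it ships with R; substitute the packages you actually used.

```r
# Citation for R itself
r_citation <- citation()
class(r_citation)

# Package names are passed as quoted strings
stats_citation <- citation("stats")

# toBibtex() converts a citation into a BibTeX entry for reference managers
toBibtex(r_citation)
```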

When your code doesn’t work: seeking help from your peers


If you’re having trouble using a function, 9 times out of 10 the question you have has already been answered on Stack Overflow. You can search using the [r] tag.

If you can’t find the answer, there are a few useful functions to help you ask a question from your peers:

R

?dput

dput() will dump the data you’re working with into a format that anyone else can copy and paste into their R session.
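
As an illustration of what this looks like, the sketch below runs dput() on a tiny data frame and then rebuilds the object from the printed text. The data frame penguin_sample and its values are made up for the example.

```r
# A small, made-up data frame to share
penguin_sample <- data.frame(
  species = c("Adelie", "Gentoo"),
  body_mass_g = c(3750, 5000)
)

# dput() prints R code that recreates the object
dput(penguin_sample)

# Anyone can paste that printed code into their session to rebuild the data.
# Here we simulate the round trip by capturing and re-evaluating the text.
dput_text <- paste(capture.output(dput(penguin_sample)), collapse = "\n")
rebuilt <- eval(parse(text = dput_text))
all.equal(penguin_sample, rebuilt)
```

When asking for help, sharing the dput() output of a small subset of your data is usually enough for others to reproduce the problem.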

Package versions


Many of the packages in R are frequently updated. This means that code written for one version of a package may not work with another version of the package (or, potentially even worse, may run but give a different result). The sessionInfo() command prints information about the system, and the names and versions of packages that have been loaded. You should include the output of sessionInfo() somewhere in your research. The renv package (the successor to the older packrat package) provides a way of keeping specific versions of packages associated with each of your projects.

R

sessionInfo()
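
One possible way to keep this information with your project is to write it to a text file. The sketch below writes to a temporary file; the file location is an assumption, and in a real project you would choose a path inside the project folder.

```r
# Write to a temporary file here; in a project you would pick a real path
info_file <- tempfile(fileext = ".txt")
writeLines(capture.output(sessionInfo()), info_file)

# packageVersion() reports the installed version of a single package
stats_version <- packageVersion("stats")
as.character(stats_version)
```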

Challenge 1

Use help to find a function (and its associated parameters) that you could use to load data from a tabular file in which columns are delimited with “\t” (tab) and the decimal point is a “.” (period). This check for decimal separator is important, especially if you are working with international colleagues, because different countries have different conventions for the decimal point (i.e. comma vs period). Hint: use ??"read table" to look up functions related to reading in tabular data.

The standard R function for reading tab-delimited files with a period decimal separator is read.delim(). You can also do this with read.table(file, sep="\t") (the period is the default decimal separator for read.table()), although you may have to change the comment.char argument as well if your data file contains hash (#) characters.
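
A small sketch comparing the two approaches. The file is created on the fly in a temporary location, and its contents are made up for the example.

```r
# Write a tiny tab-delimited file to a temporary location (made-up values)
tsv_file <- tempfile(fileext = ".tsv")
writeLines(
  c("species\tbody_mass_g", "Adelie\t3750.5", "Gentoo\t5000.25"),
  tsv_file
)

# read.delim() assumes a header row, tab separators and "." as decimal point
from_delim <- read.delim(tsv_file)

# The same settings spelled out for read.table()
from_table <- read.table(tsv_file, header = TRUE, sep = "\t", dec = ".")

from_delim$body_mass_g
from_table$body_mass_g
```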

Challenge 2

Vignettes are often useful tutorials. We will be using the dplyr package to manipulate tables of data. List the vignettes available in the package. You might want to take a look at these now, or later when we cover dplyr.

R

vignette(package="dplyr")

Shows that there are several vignettes included in the package. The dplyr vignette looks like it might be useful later. We can view this with:

R

vignette(package="dplyr", "dplyr")

Managing your environment


There are a few useful commands you can use to interact with the R session. Let’s create two variables, x and y, to demonstrate this.

R

x <- 5
y <- "birds"

ls will list all of the variables and functions stored in the global environment (your working R session):

R

ls()

OUTPUT

[1] "x" "y"

Tip: hidden objects

As in the shell, ls will hide any variables or functions starting with a “.” by default. To list all objects, type ls(all.names=TRUE) instead.

Note here that we didn’t give any arguments to ls, but we still needed to give the parentheses to tell R to call the function.

If we type ls by itself, R prints a bunch of code instead of a listing of objects.

R

ls

OUTPUT

function (name, pos = -1L, envir = as.environment(pos), all.names = FALSE,
    pattern, sorted = TRUE)
{
    if (!missing(name)) {
        pos <- tryCatch(name, error = function(e) e)
        if (inherits(pos, "error")) {
            name <- substitute(name)
            if (!is.character(name))
                name <- deparse(name)
            warning(gettextf("%s converted to character string",
                sQuote(name)), domain = NA)
            pos <- name
        }
    }
    all.names <- .Internal(ls(envir, all.names, sorted))
    if (!missing(pattern)) {
        if ((ll <- length(grep("[", pattern, fixed = TRUE))) &&
            ll != length(grep("]", pattern, fixed = TRUE))) {
            if (pattern == "[") {
                pattern <- "\\["
                warning("replaced regular expression pattern '[' by  '\\\\['")
            }
            else if (length(grep("[^\\\\]\\[<-", pattern))) {
                pattern <- sub("\\[<-", "\\\\\\[<-", pattern)
                warning("replaced '[<-' by '\\\\[<-' in regular expression pattern")
            }
        }
        grep(pattern, all.names, value = TRUE)
    }
    else all.names
}
<bytecode: 0x55c460729d60>
<environment: namespace:base>

What’s going on here?

Like everything in R, ls is the name of an object, and entering the name of an object by itself prints the contents of the object. The object x that we created earlier contains 5:

R

x

OUTPUT

[1] 5

The object ls contains the R code that makes the ls function work! We’ll talk more about how functions work and start writing our own later.

You can use rm to delete objects you no longer need:

R

rm(x)

If you have lots of things in your environment and want to delete all of them, you can pass the results of ls to the rm function:

R

rm(list = ls())

In this case we’ve combined the two. Like the order of operations, anything inside the innermost parentheses is evaluated first, and so on.

In this case we’ve specified that the results of ls should be used for the list argument in rm. When assigning values to arguments by name, you must use the = operator!

If instead we use <-, there will be unintended side effects, or you may get an error message:

R

rm(list <- ls())

ERROR

Error in rm(list <- ls()): ... must contain names or character strings
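
To see the side effect more directly, here is a small sketch using round(), chosen only because it has a named digits argument. With =, the value is handed to the argument and nothing else happens; with <-, R first creates an object in your workspace and then passes its value along by position.

```r
# Using = names the argument; no new object appears in the workspace
round(x = 3.14159, digits = 2)
exists("digits", inherits = FALSE)  # FALSE in a fresh session

# Using <- first creates an object called `digits`, then passes its value
# to round() by position (here it lands in the second argument)
round(3.14159, digits <- 2)
exists("digits", inherits = FALSE)
digits
```

The second call still returns the rounded number, which is why this kind of mistake can go unnoticed: the only trace is an unexpected object in your environment.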

Tip: Warnings vs. Errors

Pay attention when R does something unexpected! Errors, like above, are thrown when R cannot proceed with a calculation. Warnings on the other hand usually mean that the function has run, but it probably hasn’t worked as expected.

In both cases, the message that R prints out usually gives you clues about how to fix the problem.
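
The difference can be sketched with two small calls. The inputs are made up, and tryCatch() is used here only to capture the messages as text so they can be inspected; you do not need it in everyday work.

```r
# A warning: the function still returns a value (NA), but tells you why
converted <- suppressWarnings(as.numeric("five"))
is.na(converted)

# tryCatch() lets us capture the condition message instead of printing it
warning_text <- tryCatch(
  as.numeric("five"),
  warning = function(w) conditionMessage(w)
)

# An error: log() cannot proceed with text input, so no value is returned
error_text <- tryCatch(
  log("five"),
  error = function(e) conditionMessage(e)
)

warning_text
error_text
```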

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

  • Quick R
  • RStudio cheat sheets
  • Cookbook for R

Key Points

  • Use install.packages() to install packages (libraries) from CRAN
  • Use help() to get online help in R
  • Use ls() to list the variables in a program
  • Use rm() to delete objects in a program

Content from Manipulating Tibbles With Dplyr


Last updated on 2024-12-10

Overview

Questions

  • How can I manipulate data frames without repeating myself?

Objectives

  • To be able to use the five main data frame manipulation ‘verbs’ with pipes in dplyr.
  • To understand how group_by() and summarize() can be combined to summarize datasets.
  • Be able to analyze a subset of data using logical filtering.

In this episode we’ll start to work with the penguins data directly. This is a version of the Palmer Penguins dataset that has been adapted for teaching. All missing values have been replaced with the sample mean / mode for convenience. The columns bill_length_mm and bill_depth_mm are defined as illustrated below.

Cartoon illustration of bill length and depth. Artwork by @allison_horst (CC-BY-4.0)

Manipulation of data frames means many things to many researchers: we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

R

mean(penguins$body_mass_g[penguins$species == "Adelie"])

OUTPUT

[1] 3703.959

R

mean(penguins$body_mass_g[penguins$species == "Chinstrap"])

OUTPUT

[1] 3733.088

R

mean(penguins$body_mass_g[penguins$species == "Gentoo"])

OUTPUT

[1] 5068.966

But this isn’t very nice because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

The dplyr package


Luckily, the dplyr package provides a number of very useful functions for manipulating data frames in a way that will reduce the above repetition, reduce the probability of making errors, and probably even save you some typing. As an added bonus, you might even find the dplyr grammar easier to read.

Tip: Tidyverse

The dplyr package belongs to a broader family of opinionated R packages designed for data science called the “Tidyverse”. These packages are specifically designed to work harmoniously together. Some of these packages will be covered in this course, but you can find more complete information here: https://www.tidyverse.org/.

If you have not installed the tidyverse package earlier, please do so:

R

install.packages('tidyverse')

Now let’s load the package:

R

library("tidyverse")

Here we’re going to cover 5 of the most commonly used dplyr functions as well as using pipes (|>) to combine them.

  1. select()
  2. filter()
  3. group_by()
  4. summarize()
  5. mutate()

But first we need to highlight some key differences between how base R and the tidyverse handle tabular data.

Let’s make a new script for this episode, by choosing the menu options File, New File, R Script.

One key thing to note is that the tidyverse duplicates many base R functions. For example, we can load our dataset using the tidyverse function read_csv() rather than read.csv():

R

library(tidyverse)
penguins <- read_csv("data/penguins_teaching.csv", col_types = cols(year = col_character()))
penguins

OUTPUT

# A tibble: 344 × 8
   species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
   <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
 1 Adelie  Torgersen           39.1          18.7              181        3750
 2 Adelie  Torgersen           39.5          17.4              186        3800
 3 Adelie  Torgersen           40.3          18                195        3250
 4 Adelie  Torgersen           43.9          17.2              201.       4202.
 5 Adelie  Torgersen           36.7          19.3              193        3450
 6 Adelie  Torgersen           39.3          20.6              190        3650
 7 Adelie  Torgersen           38.9          17.8              181        3625
 8 Adelie  Torgersen           39.2          19.6              195        4675
 9 Adelie  Torgersen           34.1          18.1              193        3475
10 Adelie  Torgersen           42            20.2              190        4250
# ℹ 334 more rows
# ℹ 2 more variables: sex <chr>, year <chr>

Notice that we specify that the year column should be loaded as text (character data type). We do this because the dataset only contains three years’ worth of data, so we want to treat “year” as a categorical variable. It is helpful when plotting to store this categorical variable as text.
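
To see the effect of col_types in isolation, here is a minimal sketch that reads a few lines of made-up CSV text instead of a file. It assumes readr 2.0.0 or later, where wrapping a string in I() tells read_csv() to treat it as literal data rather than a file path.

```r
library(readr)

# I() marks the string as literal CSV data rather than a file path;
# the values are made up for illustration
csv_text <- I("species,year\nAdelie,2007\nGentoo,2008\n")

# Without col_types, readr guesses that year is a number
guessed <- read_csv(csv_text, show_col_types = FALSE)

# With col_types, year is read as text
as_text <- read_csv(
  csv_text,
  col_types = cols(year = col_character()),
  show_col_types = FALSE
)

class(guessed$year)
class(as_text$year)
```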

We can see that penguins consists of a 344 by 8 tibble. We see the variable names, and an abbreviation indicating what type of data is stored in each variable.

A tibble is a way of storing tabular data: a modern version of the data frame, which is part of the tidyverse. Tibbles can, for the most part, be treated like a data.frame.

Callout

R’s standard data structure for tabular data is the data.frame. In contrast, read_csv() creates a tibble (also referred to, for historic reasons, as a tbl_df). This extends the functionality of a data.frame, and can, for the most part, be treated like a data.frame.

You may find that some older functions don’t work on tibbles. A tibble can be converted to a data frame using as.data.frame(mytibble). To convert a data frame to a tibble, use as_tibble(mydataframe).

Tibbles behave more consistently than data frames when subsetting with []; this will always return another tibble. This isn’t the case when working with data.frames. You can find out more about the differences between data.frames and tibbles by typing vignette("tibble").
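
The subsetting difference can be sketched with a tiny made-up table. The object names df and tb are placeholders for this example.

```r
library(tibble)

# A small, made-up data frame and its tibble equivalent
df <- data.frame(species = c("Adelie", "Gentoo"), body_mass_g = c(3750, 5000))
tb <- as_tibble(df)

# Selecting one column with [ ] simplifies a data.frame to a plain vector...
class(df[, "body_mass_g"])

# ...but a tibble stays a tibble
class(tb[, "body_mass_g"])

# Converting back to a plain data.frame
class(as.data.frame(tb))
```

Code that expects a vector from df[, "column"] therefore needs tb$column or tb[["column"]] when given a tibble.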

Using select()


If, for example, we wanted to move forward with only a few of the variables in our data frame we could use the select() function. This will keep only the variables you select.

R

year_island_bmg <- select(penguins, year, island, body_mass_g)
head(year_island_bmg)

OUTPUT

# A tibble: 6 × 3
  year  island    body_mass_g
  <chr> <chr>           <dbl>
1 2007  Torgersen       3750
2 2007  Torgersen       3800
3 2007  Torgersen       3250
4 2007  Torgersen       4202.
5 2007  Torgersen       3450
6 2007  Torgersen       3650

Diagram illustrating use of select function to select two columns of a data frame

If we open up year_island_bmg we’ll see that it only contains the year, island and body mass (g).

Note that we can also use select to remove columns we don’t want in our dataset:

R

noyear_noisland_nobmg <- select(penguins, -year, -island, -body_mass_g)
head(noyear_noisland_nobmg)

OUTPUT

# A tibble: 6 × 5
  species bill_length_mm bill_depth_mm flipper_length_mm sex
  <chr>            <dbl>         <dbl>             <dbl> <chr>
1 Adelie            39.1          18.7              181  male
2 Adelie            39.5          17.4              186  female
3 Adelie            40.3          18                195  female
4 Adelie            43.9          17.2              201. male
5 Adelie            36.7          19.3              193  female
6 Adelie            39.3          20.6              190  male  

Above we used ‘normal’ grammar, but the strengths of dplyr lie in combining several functions using pipes. Since the pipes grammar is unlike anything we’ve seen in R before, let’s repeat what we’ve done above using pipes.

R

# before: year_island_bmg <- select(penguins, year, island, body_mass_g)
year_island_bmg <- penguins |> select(year, island, body_mass_g)
head(year_island_bmg)

OUTPUT

# A tibble: 6 × 3
  year  island    body_mass_g
  <chr> <chr>           <dbl>
1 2007  Torgersen       3750
2 2007  Torgersen       3800
3 2007  Torgersen       3250
4 2007  Torgersen       4202.
5 2007  Torgersen       3450
6 2007  Torgersen       3650 

To help you understand why we wrote that in that way, let’s walk through it step by step. First we summon the penguins data frame and pass it on, using the pipe symbol |>, to the next step, which is the select() function. In this case we don’t specify which data object we use in the select() function since it gets that from the previous pipe.

Fun Fact: There is a good chance you have encountered pipes before in the shell. In R, a pipe symbol is |> while in the shell it is | but the concept is the same!
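
The pipe is not specific to dplyr. As a minimal sketch with base R functions only (the numbers are arbitrary), the two versions below compute the same thing; the piped form simply reads left to right. The |> operator needs R 4.1.0 or later.

```r
squares <- c(1, 4, 9)

# Nested calls are read from the inside out
nested <- sum(sqrt(squares))

# The pipe passes the left-hand side as the first argument of the next call,
# so the steps read in the order they happen
piped <- squares |> sqrt() |> sum()

identical(nested, piped)
```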

Tip: Renaming data frame columns in dplyr

In Chapter 4 we covered how you can rename columns with base R by assigning a value to the output of the names() function. As with selecting columns in base R, this is a bit cumbersome, but thankfully dplyr has a rename() function.

Within a pipeline, the syntax is rename(new_name = old_name). For example, we may want to rename the island column name from our select() statement above.

R

tidy_bmg <- year_island_bmg |> rename(island_name = island)

head(tidy_bmg)

OUTPUT

# A tibble: 6 × 3
  year  island_name body_mass_g
  <chr> <chr>             <dbl>
1 2007  Torgersen         3750
2 2007  Torgersen         3800
3 2007  Torgersen         3250
4 2007  Torgersen         4202.
5 2007  Torgersen         3450
6 2007  Torgersen         3650 
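
For comparison, here is a sketch of the base R approach next to rename(), using a small made-up tibble (penguin_sample is a placeholder, not the course dataset).

```r
library(dplyr)

# A small, made-up table
penguin_sample <- tibble(
  year = c("2007", "2008"),
  island = c("Torgersen", "Dream"),
  body_mass_g = c(3750, 3900)
)

# Base R: find the position of the old name and overwrite it
base_renamed <- penguin_sample
names(base_renamed)[names(base_renamed) == "island"] <- "island_name"

# dplyr: new_name = old_name
dplyr_renamed <- penguin_sample |> rename(island_name = island)

identical(names(base_renamed), names(dplyr_renamed))
```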

Using filter()


If we now want to move forward with the above, but only with observations for the island of “Dream”, we can combine select() and filter():

R

dream_year_island_bmg <- penguins |>
    filter(island == "Dream") |>
    select(year, body_mass_g)
head(dream_year_island_bmg)

OUTPUT

# A tibble: 6 × 2
  year  body_mass_g
  <chr>       <dbl>
1 2007         3250
2 2007         3900
3 2007         3300
4 2007         3900
5 2007         3325
6 2007         4150

Notice how this code is indented and formatted over multiple lines to improve readability.

Cartoon showing three fuzzy monsters either selecting or crossing out rows of a data table. If the type of animal in the table is “otter” and the site is “bay”, a monster is drawing a purple rectangle around the row. If those conditions are not met, another monster is putting a line through the column indicating it will be excluded. Stylized text reads "dplyr::filter() - keep rows that satisfy your conditions."
filter() keeps rows that satisfy your conditions. Artwork by @allison_horst

If we now want to show body mass of penguin species on Dream island but only for a specific year (e.g., 2007), we can do as below.

R

dream_island_2007 <- penguins |>
  filter(island == "Dream", year == "2007") |> 
  select(species, body_mass_g)

head(dream_island_2007)

OUTPUT

# A tibble: 6 × 2
  species body_mass_g
  <chr>         <dbl>
1 Adelie         3250
2 Adelie         3900
3 Adelie         3300
4 Adelie         3900
5 Adelie         3325
6 Adelie         4150

Notice that 2007 is in quotes (“2007”) as the year column is stored as text (character datatype).

As with last time, first we pass the penguins data frame to the filter() function, then we pass the filtered version of the penguins data frame to the select() function. Note: The order of operations is very important in this case. If we used ‘select’ first, filter would not be able to find the variable island since we would have removed it in the previous step.
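
When filter() is given several conditions separated by commas, a row is kept only if all of them are TRUE. The sketch below, on a small made-up tibble, shows that this gives the same result as joining the conditions with &.

```r
library(dplyr)

# A small, made-up table
penguin_sample <- tibble(
  species = c("Adelie", "Adelie", "Chinstrap", "Gentoo"),
  island = c("Dream", "Dream", "Dream", "Biscoe"),
  year = c("2007", "2008", "2007", "2007"),
  body_mass_g = c(3250, 3900, 3700, 5000)
)

# Conditions separated by commas must all be TRUE...
with_comma <- penguin_sample |>
  filter(island == "Dream", year == "2007")

# ...which is the same as joining them with &
with_and <- penguin_sample |>
  filter(island == "Dream" & year == "2007")

identical(with_comma, with_and)
nrow(with_comma)
```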

Using group_by()


Now, we were supposed to be reducing the error-prone repetitiveness of what can be done with base R, but up to now we haven’t done that, since we would have to repeat the above for each island. Instead of filter(), which will only pass observations that meet your criteria (in the above: island=="Dream"), we can use group_by(), which will essentially use every unique criterion that you could have used in filter().

R

str(penguins)

OUTPUT

spc_tbl_ [344 × 8] (S3: spec_tbl_df/tbl_df/tbl/data.frame)
 $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 43.9 36.7 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 17.2 19.3 ...
 $ flipper_length_mm: num [1:344] 181 186 195 201 193 ...
 $ body_mass_g      : num [1:344] 3750 3800 3250 4202 3450 ...
 $ sex              : chr [1:344] "male" "female" "female" "male" ...
 $ year             : chr [1:344] "2007" "2007" "2007" "2007" ...
 - attr(*, "spec")=
  .. cols(
  ..   species = col_character(),
  ..   island = col_character(),
  ..   bill_length_mm = col_double(),
  ..   bill_depth_mm = col_double(),
  ..   flipper_length_mm = col_double(),
  ..   body_mass_g = col_double(),
  ..   sex = col_character(),
  ..   year = col_character()
  .. )
 - attr(*, "problems")=<externalptr> 

R

str(penguins |> group_by(island))

OUTPUT

gropd_df [344 × 8] (S3: grouped_df/tbl_df/tbl/data.frame)
 $ species          : chr [1:344] "Adelie" "Adelie" "Adelie" "Adelie" ...
 $ island           : chr [1:344] "Torgersen" "Torgersen" "Torgersen" "Torgersen" ...
 $ bill_length_mm   : num [1:344] 39.1 39.5 40.3 43.9 36.7 ...
 $ bill_depth_mm    : num [1:344] 18.7 17.4 18 17.2 19.3 ...
 $ flipper_length_mm: num [1:344] 181 186 195 201 193 ...
 $ body_mass_g      : num [1:344] 3750 3800 3250 4202 3450 ...
 $ sex              : chr [1:344] "male" "female" "female" "male" ...
 $ year             : chr [1:344] "2007" "2007" "2007" "2007" ...
 - attr(*, "spec")=
  .. cols(
  ..   species = col_character(),
  ..   island = col_character(),
  ..   bill_length_mm = col_double(),
  ..   bill_depth_mm = col_double(),
  ..   flipper_length_mm = col_double(),
  ..   body_mass_g = col_double(),
  ..   sex = col_character(),
  ..   year = col_character()
  .. )
 - attr(*, "problems")=<externalptr>
 - attr(*, "groups")= tibble [3 × 2] (S3: tbl_df/tbl/data.frame)
  ..$ island: chr [1:3] "Biscoe" "Dream" "Torgersen"
  ..$ .rows : list<int> [1:3]
  .. ..$ : int [1:168] 21 22 23 24 25 26 27 28 29 30 ...
  .. ..$ : int [1:124] 31 32 33 34 35 36 37 38 39 40 ...
  .. ..$ : int [1:52] 1 2 3 4 5 6 7 8 9 10 ...
  .. ..@ ptype: int(0)
  ..- attr(*, ".drop")= logi TRUE

You will notice that the structure of the data frame where we used group_by() (grouped_df) is not the same as the original penguins (data.frame). A grouped_df can be thought of as a list where each item in the list is a data.frame which contains only the rows that correspond to a particular value of island (at least in the example above).

Diagram illustrating how the group by function organizes a data frame into groups

Callout

You may notice this message when using summarise() after group_by():

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

The message has two parts.

`summarise()` has grouped output by 'year'.

  • This indicates that the data were grouped by more than one variable (for example with group_by(year, species)) before applying summarise(), and that the result is still grouped by year.

  • summarise() removes only the last grouping variable; any earlier grouping variables remain in place unless you change this explicitly.

You can override using the `.groups` argument.

The .groups argument in summarise() lets you control how grouping is handled in the output. By default, summarise() drops the last level of grouping and keeps the rest (i.e., the result remains grouped by any other grouping variables, if present). You can override this behaviour by specifying .groups explicitly.

The possible values for .groups include:

  • “drop_last” (default): Drops the last grouping variable (e.g., if grouped by year and month, it drops month but keeps year).
  • “drop”: Removes all grouping.
  • “keep”: Keeps all grouping variables intact.
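
A minimal sketch of the difference, using a small made-up tibble. group_vars() is a dplyr function that reports which variables a table is currently grouped by.

```r
library(dplyr)

# A small, made-up table
penguin_sample <- tibble(
  year = c("2007", "2007", "2008", "2008"),
  species = c("Adelie", "Gentoo", "Adelie", "Gentoo"),
  body_mass_g = c(3700, 5100, 3750, 5000)
)

# Default behaviour: the last grouping variable (species) is dropped,
# so the result is still grouped by year and the message is printed
still_grouped <- penguin_sample |>
  group_by(year, species) |>
  summarise(mean_body_mass_g = mean(body_mass_g))
group_vars(still_grouped)

# .groups = "drop" returns an ungrouped tibble and silences the message
ungrouped <- penguin_sample |>
  group_by(year, species) |>
  summarise(mean_body_mass_g = mean(body_mass_g), .groups = "drop")
group_vars(ungrouped)
```

Dropping the groups explicitly can prevent surprises later in a pipeline, because functions such as mutate() and filter() operate within groups when a table is still grouped.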

Using summarize()


The above was a bit on the uneventful side but group_by() is much more exciting in conjunction with summarize(). This will allow us to create new variable(s) by using functions that repeat for each of the group-specific data frames. That is to say, using the group_by() function, we split our original data frame into multiple pieces, then we can run functions (e.g. mean() or sd()) within summarize().

R

bm_byspecies <- penguins |>
    group_by(species) |>
    summarize(mean_body_mass_g = mean(body_mass_g))

head(bm_byspecies)

OUTPUT

# A tibble: 3 × 2
  species   mean_body_mass_g
  <chr>                <dbl>
1 Adelie               3704.
2 Chinstrap            3733.
3 Gentoo               5069.

Diagram illustrating the use of group by and summarize together to create a new variable

That allowed us to calculate the mean body_mass_g for each species, but it gets even better.

The function group_by() allows us to group by multiple variables. Let’s group by year and species.

R

bm_byyear_byspecies <- penguins |>
    group_by(year, species) |>
    summarize(mean_body_mass_g = mean(body_mass_g))

OUTPUT

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

R

head(bm_byyear_byspecies)

OUTPUT

# A tibble: 6 × 3
# Groups:   year [2]
  year  species   mean_body_mass_g
  <chr> <chr>                <dbl>
1 2007  Adelie               3707.
2 2007  Chinstrap            3694.
3 2007  Gentoo               5071.
4 2008  Adelie               3742
5 2008  Chinstrap            3800
6 2008  Gentoo               5020.

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in summarize().

R

bm_byyear_byspecies <- penguins |>
    group_by(year, species) |>
    summarize(
      mean_body_mass_g = mean(body_mass_g),
      sd_body_mass_g = sd(body_mass_g)
    )

OUTPUT

`summarise()` has grouped output by 'year'. You can override using the
`.groups` argument.

R

head(bm_byyear_byspecies)

OUTPUT

# A tibble: 6 × 4
# Groups:   year [2]
  year  species   mean_body_mass_g sd_body_mass_g
  <chr> <chr>                <dbl>          <dbl>
1 2007  Adelie               3707.           451.
2 2007  Chinstrap            3694.           328.
3 2007  Gentoo               5071.           583.
4 2008  Adelie               3742            455.
5 2008  Chinstrap            3800            519.
6 2008  Gentoo               5020.           515.

count() and n()


A very common operation is to count the number of observations for each group. The dplyr package comes with two related functions that help with this.

For instance, if we wanted to check the number of penguins included in the dataset for the year 2007, we can use the count() function. It takes the name of one or more columns that contain the groups we are interested in, and we can optionally sort the results in descending order by adding sort=TRUE:

R

penguins |>
    filter(year == "2007") |>
    count(island, sort = TRUE)

OUTPUT

# A tibble: 3 × 2
  island        n
  <chr>     <int>
1 Dream        46
2 Biscoe       44
3 Torgersen    20

If we need to use the number of observations in calculations, the n() function is useful. It will return the total number of observations in the current group rather than counting the number of observations in each group within a specific column. For instance, if we wanted to get the standard error of the body mass for penguins on each island:

R

penguins |>
    group_by(island) |>
    summarize(se_bm = sd(body_mass_g)/sqrt(n()))

OUTPUT

# A tibble: 3 × 2
  island    se_bm
  <chr>     <dbl>
1 Biscoe     60.3
2 Dream      37.4
3 Torgersen  61.9
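
The relationship between the two functions can be sketched on a small made-up tibble: count(island) gives the same counts as grouping by island and summarizing with n().

```r
library(dplyr)

# A small, made-up table
penguin_sample <- tibble(
  island = c("Dream", "Dream", "Biscoe", "Torgersen", "Dream"),
  body_mass_g = c(3250, 3900, 5000, 3750, 3700)
)

# count() groups, counts and ungroups in one step
by_count <- penguin_sample |> count(island)

# The longer equivalent with group_by() and n()
by_n <- penguin_sample |>
  group_by(island) |>
  summarize(n = n())

identical(by_count$island, by_n$island)
identical(by_count$n, by_n$n)
```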

You can also chain together several summary operations; in this case calculating the minimum, maximum, mean and se of body_mass_g on each island:

R

penguins |>
    group_by(island) |>
    summarize(
      mean_bm = mean(body_mass_g),
      min_bm = min(body_mass_g),
      max_bm = max(body_mass_g),
      se_bm = sd(body_mass_g)/sqrt(n())
    )

OUTPUT

# A tibble: 3 × 5
  island    mean_bm min_bm max_bm se_bm
  <chr>       <dbl>  <dbl>  <dbl> <dbl>
1 Biscoe      4713.   2850   6300  60.3
2 Dream       3713.   2700   4800  37.4
3 Torgersen   3716.   2900   4700  61.9

Using mutate()


We can also create new variables prior to (or even after) summarizing information using mutate().

R

bm_byyear_byisland_byspecies <- penguins |>
    mutate(body_mass_kg = body_mass_g/1000) |>
    group_by(year, island, species) |>
    summarize(
      mean_body_mass_kg = mean(body_mass_kg),
      sd_body_mass_kg = sd(body_mass_kg)
    )

OUTPUT

`summarise()` has grouped output by 'year', 'island'. You can override using
the `.groups` argument.

R

head(bm_byyear_byisland_byspecies)

OUTPUT

# A tibble: 6 × 5
# Groups:   year, island [4]
  year  island    species   mean_body_mass_kg sd_body_mass_kg
  <chr> <chr>     <chr>                 <dbl>           <dbl>
1 2007  Biscoe    Adelie                 3.62           0.292
2 2007  Biscoe    Gentoo                 5.07           0.583
3 2007  Dream     Adelie                 3.67           0.527
4 2007  Dream     Chinstrap              3.69           0.328
5 2007  Torgersen Adelie                 3.79           0.441
6 2008  Biscoe    Adelie                 3.63           0.478

Cartoon of cute fuzzy monsters dressed up as different X-men characters, working together to add a new column to an existing data frame. Stylized title text reads "dplyr::mutate - add columns, keep existing."
An illustration of the mutate function. Artwork by @allison_horst

Connect mutate with logical filtering: ifelse


When creating new variables, we can combine this with a logical condition. A simple combination of mutate() and ifelse() applies the condition right where it is needed: at the moment of creating something new. This easy-to-read statement is a fast and powerful way of discarding certain data (even though the overall dimension of the data frame will not change) or of updating values depending on the given condition.

R

## keeping all data but "filtering" after a certain condition
# calculate bill length / depth ratio only for Adelie penguins
bill_morph_adelie <- penguins |>
    mutate(
      bill_ratio = ifelse(
        species == "Adelie", 
        bill_length_mm/bill_depth_mm, 
        NA
      )
    )  
head(bill_morph_adelie)

OUTPUT

# A tibble: 6 × 9
  species island    bill_length_mm bill_depth_mm flipper_length_mm body_mass_g
  <chr>   <chr>              <dbl>         <dbl>             <dbl>       <dbl>
1 Adelie  Torgersen           39.1          18.7              181        3750
2 Adelie  Torgersen           39.5          17.4              186        3800
3 Adelie  Torgersen           40.3          18                195        3250
4 Adelie  Torgersen           43.9          17.2              201.       4202.
5 Adelie  Torgersen           36.7          19.3              193        3450
6 Adelie  Torgersen           39.3          20.6              190        3650
# ℹ 3 more variables: sex <chr>, year <chr>, bill_ratio <dbl>
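The same pattern can be tried on a small, made-up tibble. The sketch below is only an illustration: `toy_penguins` and its values are invented, not taken from the course data. It uses dplyr's if_else(), a stricter relative of base ifelse() that checks that both branches have compatible types, which is why the "otherwise" value is written as NA_real_ (a numeric missing value).

```r
library(dplyr)

# Toy data, invented for illustration only
toy_penguins <- tibble(
  species = c("Adelie", "Gentoo", "Adelie"),
  bill_length_mm = c(40, 48, 36),
  bill_depth_mm = c(20, 16, 18)
)

toy_ratio <- toy_penguins |>
  mutate(
    bill_ratio = if_else(
      species == "Adelie",
      bill_length_mm / bill_depth_mm,  # value used when the condition is TRUE
      NA_real_                         # numeric missing value otherwise
    )
  )

toy_ratio
```

All three rows are kept; only the Adelie rows receive a ratio, and the Gentoo row is filled with NA.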

Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a data frame with the penguins found on Torgersen island, including the columns species, body_mass_g, and year, but not for other islands. How many rows does your data frame have, and why?

You should start from the penguins tibble.

R

penguins_torgersen <- penguins |>
  filter(island == "Torgersen") |>
  select(species, body_mass_g, year)

head(penguins_torgersen)

OUTPUT

# A tibble: 6 × 3
  species body_mass_g year
  <chr>         <dbl> <chr>
1 Adelie        3750  2007
2 Adelie        3800  2007
3 Adelie        3250  2007
4 Adelie        4202. 2007
5 Adelie        3450  2007
6 Adelie        3650  2007 

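Note that head(penguins_torgersen) shows only the first six rows, so its header reads “A tibble: 6 × 3”. Printing penguins_torgersen itself reports “A tibble: 52 × 3”, i.e. a tibble with 52 rows and 3 columns: filter() kept only the penguins recorded on Torgersen island, and select() kept the three requested columns.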

Alternative solutions to find the shape of the tibble:

R

dim(penguins_torgersen)

OUTPUT

[1] 52  3

R

nrow(penguins_torgersen)

OUTPUT

[1] 52

R

ncol(penguins_torgersen)

OUTPUT

[1] 3

Challenge 2

Calculate the average body mass per species. Which species has the highest average body mass, and which has the lowest?

You should start from the penguins tibble.

R

body_mass_by_species <- penguins |>
  group_by(species) |>
  summarize(mean_body_mass = mean(body_mass_g))

body_mass_by_species |>
  filter(mean_body_mass == min(mean_body_mass) | mean_body_mass == max(mean_body_mass))

OUTPUT

# A tibble: 2 × 2
  species mean_body_mass
  <chr>            <dbl>
1 Adelie           3704.
2 Gentoo           5069.

Another way to do this is to use the dplyr function arrange(), which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has similar syntax to other functions from the dplyr package. You can use desc() inside arrange() to sort in descending order.

R

body_mass_by_species |>
  arrange(mean_body_mass) |>
  head(1)

OUTPUT

# A tibble: 1 × 2
  species mean_body_mass
  <chr>            <dbl>
1 Adelie           3704.

R

body_mass_by_species |>
  arrange(desc(mean_body_mass)) |>
  head(1)

OUTPUT

# A tibble: 1 × 2
  species mean_body_mass
  <chr>            <dbl>
1 Gentoo           5069.

arrange() also sorts text columns alphabetically; with desc() the order is reversed:

R

body_mass_by_species |>
  arrange(desc(species)) |>
  head(1)

OUTPUT

# A tibble: 1 × 2
  species mean_body_mass
  <chr>            <dbl>
1 Gentoo           5069.
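As a further alternative to arrange() followed by head(1), dplyr also provides slice_max() and slice_min(), which pick the rows with the largest or smallest value of a column directly. The sketch below uses an invented summary table (`toy_means`, with placeholder species names and values) rather than the course data.

```r
library(dplyr)

# Toy summary table, invented for illustration only
toy_means <- tibble(
  species = c("A", "B", "C"),
  mean_body_mass = c(10, 20, 30)
)

heaviest <- toy_means |> slice_max(mean_body_mass, n = 1)  # row with the largest mean
lightest <- toy_means |> slice_min(mean_body_mass, n = 1)  # row with the smallest mean

heaviest
lightest
```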

Advanced Challenge

Calculate the average body mass in 2007 of 10 randomly selected penguins for each species. Then arrange the species names in reverse alphabetical order. Hint: Use the dplyr functions arrange() and sample_n(), they have similar syntax to other dplyr functions.

R

body_mass_10penguins_byspecies <- penguins |>
  filter(year == "2007") |> 
  group_by(species) |>
  sample_n(10) |>
  summarize(mean_body_mass = mean(body_mass_g)) |>
  arrange(desc(species))

body_mass_10penguins_byspecies

OUTPUT

# A tibble: 3 × 2
  species   mean_body_mass
  <chr>              <dbl>
1 Gentoo             5405
2 Chinstrap          3648.
3 Adelie             3500.
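Because sample_n() draws rows at random, the means shown above will differ from run to run. One way to make such a result repeatable, sketched here on invented data (`toy` is not part of the course files), is to call set.seed() before sampling. The sketch also uses slice_sample(), which the dplyr documentation lists as the successor to sample_n(); the seed value 42 is an arbitrary choice.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  species = rep(c("Adelie", "Gentoo"), each = 5),
  body_mass_g = c(3500, 3600, 3700, 3800, 3900, 5000, 5100, 5200, 5300, 5400)
)

set.seed(42)  # fix the random number generator so the same rows are drawn each time

sampled <- toy |>
  group_by(species) |>
  slice_sample(n = 3) |>   # 3 random rows per species
  summarize(n_sampled = n(), mean_body_mass = mean(body_mass_g)) |>
  arrange(desc(species))

sampled
```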

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Key Points

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use group_by() and summarize() to work with subsets of data.
  • Use mutate() to create new variables.

Content from Creating Publication-Quality Graphics with ggplot2


Last updated on 2024-12-10

Overview

Questions

  • How can I create publication-quality graphics in R?

Objectives

  • To be able to use ggplot2 to generate publication-quality graphics.
  • To apply geometry, aesthetic, and statistics layers to a ggplot plot.
  • To manipulate the aesthetics of a plot using different colors, shapes, and lines.
  • To improve data visualization through transforming scales and paneling by group.
  • To save a plot created with ggplot to disk.

Let’s make a new script for this episode, by choosing the menu options File, New File, R Script.

Although we loaded the tidyverse in the previous episode, we should make our scripts self-contained, so we should include library(tidyverse) in the new script.

R

library(tidyverse)

OUTPUT

── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
✔ dplyr     1.1.4     ✔ readr     2.1.5
✔ forcats   1.0.0     ✔ stringr   1.5.1
✔ ggplot2   3.5.1     ✔ tibble    3.2.1
✔ lubridate 1.9.4     ✔ tidyr     1.3.1
✔ purrr     1.0.2
── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()
ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors

R

penguins <- read_csv("data/penguins_teaching.csv", col_types = cols(year = col_character()))

Plotting our data is one of the best ways to quickly explore it and the various relationships between variables.

There are three main plotting systems in R: the base plotting system, the lattice package, and the ggplot2 package.

Today we’ll be learning about the ggplot2 package, because it is the most effective for creating publication-quality graphics.

ggplot2 is built on the grammar of graphics, the idea that any plot can be built from the same set of components: a data set, mapping aesthetics, and graphical layers:

  • Data sets are the data that you, the user, provide.

  • Mapping aesthetics are what connect the data to the graphics. They tell ggplot2 how to use your data to affect how the graph looks, such as changing what is plotted on the X or Y axis, or the size or color of different data points.

  • Layers are the actual graphical output from ggplot2. Layers determine what kinds of plot are shown (scatterplot, histogram, etc.), the coordinate system used (rectangular, polar, others), and other important aspects of the plot. The idea of layers of graphics may be familiar to you if you have used image editing programs like Photoshop, Illustrator, or Inkscape.

Let’s start off building an example using the penguins data from earlier. First let’s create a derived data set to plot! Let’s find out if the mean body mass of the Adelie penguins has changed over time.

R

penguins_bm <- penguins |>
  filter(species == "Adelie") |>
  group_by(year, island, species) |>
  summarize(mean_body_mass = mean(body_mass_g)) |>
  ungroup()

OUTPUT

`summarise()` has grouped output by 'year', 'island'. You can override using
the `.groups` argument.
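The message above is informational rather than an error: after summarize() the result is still grouped by year and island, which is why the example ends with ungroup(). An alternative that states the intent directly is the `.groups` argument mentioned in the message. The following sketch, on an invented tibble called `toy`, uses `.groups = "drop"` and then checks that the result is no longer grouped.

```r
library(dplyr)

# Toy data, invented for illustration only
toy <- tibble(
  year = c("2007", "2007", "2008", "2008"),
  island = c("Dream", "Biscoe", "Dream", "Biscoe"),
  body_mass_g = c(3600, 3700, 3800, 3900)
)

toy_means <- toy |>
  group_by(year, island) |>
  summarize(mean_body_mass = mean(body_mass_g), .groups = "drop")  # return an ungrouped tibble

is_grouped_df(toy_means)  # FALSE when all grouping has been dropped
```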

The most basic function is ggplot, which lets R know that we’re creating a new plot. Any of the arguments we give the ggplot function are the global options for the plot: they apply to all layers on the plot.

R

library(ggplot2)
ggplot(data = penguins_bm)
Blank plot, before adding any mapping aesthetics to ggplot().

Here we called ggplot and told it what data we want to show on our figure. This is not enough information for ggplot to actually draw anything. It only creates a blank slate for other elements to be added to.

Now we’re going to add in the mapping aesthetics using the aes function. aes tells ggplot how variables in the data map to aesthetic properties of the figure, such as which columns of the data should be used for the x and y locations.

R

ggplot(data = penguins_bm, mapping = aes(x = year, y = mean_body_mass))
Plotting area with axes for a scatter plot of mean body mass vs year, with no data points visible.

Here we told ggplot we want to plot the “year” column of our data frame on the x-axis, and the “mean_body_mass” column on the y-axis. Notice that we didn’t need to explicitly pass aes these columns (e.g. x = penguins_bm[, "year"]); this is because ggplot is smart enough to know to look in the data for that column!

The final part of making our plot is to tell ggplot how we want to visually represent the data. We do this by adding a new layer to the plot using one of the geom functions.

R

ggplot(data = penguins_bm, mapping = aes(x = year, y = mean_body_mass)) +
  geom_point()
Scatter plot of mean body mass vs year, now showing the data points.

Here we used geom_point, which tells ggplot we want to visually represent the relationship between x and y as a scatterplot of points.

Notice that we use a “+” sign to add another layer to the plot, not a pipe “|>” which we’ve previously used to pass the output data from one data processing step to another step.
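The two operators can appear in the same statement: data preparation steps are chained with the pipe, and once the data reaches ggplot() the plot layers are chained with “+”. Here is a minimal sketch on an invented tibble (`toy`; names and values are placeholders).

```r
library(dplyr)
library(ggplot2)

# Toy data, invented for illustration only
toy <- tibble(
  year = c("2007", "2008", "2009", "2007", "2008", "2009"),
  species = c("Adelie", "Adelie", "Adelie", "Gentoo", "Gentoo", "Gentoo"),
  mean_body_mass = c(3600, 3700, 3650, 5000, 5050, 5100)
)

p <- toy |>
  filter(species == "Adelie") |>                          # data steps are joined with |>
  ggplot(mapping = aes(x = year, y = mean_body_mass)) +   # plot layers are joined with +
  geom_point()

p
```

The filtered tibble is passed as the first argument (data) of ggplot(), so only the Adelie rows are drawn.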

Challenge 1

Modify the example so that the figure shows how bill length (mm) varies with flipper length (mm) instead of body mass:

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = body_mass_g)) + geom_point()

Here is one possible solution:

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm)) + geom_point()
Scatter plot showing bill length (mm) versus flipper length (mm) for individual penguins. All points on the plot are coloured black.

Challenge 2

In the previous examples and challenge we’ve used the aes function to tell the scatterplot geom about the x and y locations of each point. Another aesthetic property we can modify is the point color. Modify the code from the previous challenge to color the points by the “species” column. What trends do you see in the data? Are they what you expected?

Hint: Use ?aes or ?ggplot2 to get help on these functions if needed.

The solution presented below adds color=species to the call of the aes function. The general trend seems to indicate an increased bill length with flipper length. Visual inspection suggests that the strength of this relationship is similar across all three species.

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm, color=species)) +
  geom_point()
Scatter plot of bill length (mm) vs flipper length (mm), with points color-coded by penguin species to show how bill length varies by species and flipper length, thus showing the value of the ‘aes’ function

Layers


Using a scatterplot probably isn’t the best for visualizing change over time. Instead, let’s tell ggplot to visualize the data as a line plot:

R

ggplot(data = penguins_bm, mapping = aes(x=year, y=mean_body_mass)) +
  geom_line()

Instead of adding a geom_point layer, we’ve added a geom_line layer.

However, the result doesn’t look quite as we might have expected: it seems to be jumping around a lot within each year. Let’s try to separate the data by island, plotting one line for each island:

R

ggplot(data = penguins_bm, mapping = aes(x=year, y=mean_body_mass, group=island)) +
  geom_line()

Let’s color the lines by island to make things clearer:

R

ggplot(data = penguins_bm, mapping = aes(x=year, y=mean_body_mass, group=island, color = island)) +
  geom_line()

We’ve added the group aesthetic, which tells ggplot to draw a line for each island.
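To see what the group aesthetic changes, it can help to look at the data ggplot2 actually draws. The sketch below uses ggplot2's layer_data() helper on an invented data frame (`toy`). It assumes, as in this lesson, that year is stored as text: without an explicit group, ggplot2 groups the rows by the discrete x variable, so each year forms its own group; with group = island, each island forms a group instead.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  year = rep(c("2007", "2008", "2009"), times = 2),
  island = rep(c("Dream", "Biscoe"), each = 3),
  mean_body_mass = c(3600, 3700, 3650, 3750, 3700, 3800)
)

# No group aesthetic: the discrete x variable (year) defines the groups
p_default <- ggplot(toy, aes(x = year, y = mean_body_mass)) +
  geom_line()

# Explicit group aesthetic: one line per island
p_island <- ggplot(toy, aes(x = year, y = mean_body_mass, group = island)) +
  geom_line()

# Count the groups each plot will draw
n_groups_default <- length(unique(layer_data(p_default)$group))
n_groups_island <- length(unique(layer_data(p_island)$group))
c(default = n_groups_default, by_island = n_groups_island)
```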

But what if we want to visualize both lines and points on the plot? We can add another layer to the plot:

R

ggplot(data = penguins_bm, mapping = aes(x=year, y=mean_body_mass, group=island, color=island)) +
  geom_line() + 
  geom_point()

It’s important to note that each layer is drawn on top of the previous layer. In this example, the points have been drawn on top of the lines. Here’s a demonstration:

R

ggplot(data = penguins_bm, mapping = aes(x=year, y=mean_body_mass, group=island)) +
  geom_line(mapping = aes(color=island)) + geom_point()

In this example, the aesthetic mapping of color has been moved from the global plot options in ggplot to the geom_line layer so it no longer applies to the points. Now we can clearly see that the points are drawn on top of the lines.

Tip: Setting an aesthetic to a value instead of a mapping

So far, we’ve seen how to use an aesthetic (such as color) as a mapping to a variable in the data. For example, when we use geom_line(mapping = aes(color=island)), ggplot will give a different color to each island. But what if we want to change the color of all lines to blue? You may think that geom_line(mapping = aes(color="blue")) should work, but it doesn’t, because ggplot then treats "blue" as a data value rather than as a color name. Since we don’t want to create a mapping to a specific variable, we can move the color specification outside of the aes() function, like this: geom_line(color="blue").
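The difference between mapping and setting can be checked without looking at the picture. This sketch builds both versions on an invented data frame (`toy`) and uses ggplot2's layer_data() to print the colour that will actually be drawn.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))

# "blue" inside aes() is treated as data (a one-level category), not as a color name
p_mapped <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(mapping = aes(color = "blue"))

# "blue" outside aes() is a fixed setting for the whole layer
p_set <- ggplot(toy, aes(x = x, y = y)) +
  geom_line(color = "blue")

unique(layer_data(p_mapped)$colour)  # a color picked by the default color scale
unique(layer_data(p_set)$colour)     # the color we asked for
```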

Challenge 3

Switch the order of the point and line layers from the previous example. What happened?

The lines now get drawn over the points!

R

ggplot(data = penguins_bm, mapping = aes(x=year, y=mean_body_mass, group=island)) +
  geom_point() +
  geom_line(mapping = aes(color=island)) 
Scatter plot of mean body mass (g) over time, with lines colored by island connecting the yearly values and drawn on top of the points.

Statistics


ggplot2 also makes it easy to overlay statistical models over the data. To demonstrate we’ll go back to our first challenge:

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point()

We can modify the transparency of the points with the alpha argument, for example geom_point(alpha = 0.5), which is especially helpful when you have a large amount of data that is heavily clustered.

Tip Reminder: Setting an aesthetic to a value instead of a mapping

Notice that we used geom_point(alpha = 0.5). As the previous tip mentioned, using a setting outside of the aes() function will cause this value to be used for all points, which is what we want in this case. But just like any other aesthetic setting, alpha can also be mapped to a variable in the data. For example, we can give a different transparency to each island with geom_point(mapping = aes(alpha = island)).

We can fit a simple relationship to the data by adding another layer, geom_smooth:

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(alpha = 0.5) + 
  geom_smooth(method="lm")

OUTPUT

`geom_smooth()` using formula = 'y ~ x'
Scatter plot of flipper length vs bill length with a blue trend line summarising the relationship between variables, and gray shaded area indicating 95% confidence intervals for that trend line.
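With method="lm", geom_smooth draws a straight-line fit of y on x, which is the formula 'y ~ x' reported in the message. If you want the numbers behind such a line, the same kind of model can be fitted directly with base R's lm(). The sketch below uses invented data (`toy`) that lie exactly on the line y = 1 + 2x, so the fitted intercept and slope are easy to check.

```r
# Toy data lying exactly on the line y = 1 + 2 * x (invented for illustration)
toy <- data.frame(x = c(1, 2, 3, 4, 5))
toy$y <- 1 + 2 * toy$x

fit <- lm(y ~ x, data = toy)  # same formula that geom_smooth() reports

coef(fit)  # intercept and slope of the fitted line
```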

We can make the line thicker by setting the linewidth aesthetic in the geom_smooth layer:

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(alpha = 0.5) + 
  geom_smooth(method="lm", linewidth=1.5)

OUTPUT

`geom_smooth()` using formula = 'y ~ x'
Scatter plot of flipper length vs bill length with a trend line summarising the relationship between variables. The trend line is slightly thicker than in the previous figure.

There are two ways an aesthetic can be specified. Here we set the linewidth aesthetic by passing it as an argument to geom_smooth and it is applied the same to the whole geom. Previously in the lesson we’ve used the aes function to define a mapping between data variables and their visual representation.

Challenge 4a

Modify the color and size of the points on the point layer in the plot from the Challenge 1 example.

Hint: do not use the aes function.

Hint: the equivalent of linewidth for points is size.

Here is a possible solution: Notice that the color argument is supplied outside of the aes() function. This means that it applies to all data points on the graph and is not related to a specific variable.

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(size = 2, color = "orange")
Scatter plot of bill length (mm) vs flipper length (mm), showing enlarged orange data points.

Challenge 4b

Modify your solution to Challenge 4a so that the points are now a different shape and are colored by species.

Hint: (1) The color argument can be used inside the aesthetic. (2) See this quick reference to find out more about available point shapes in R

Here is a possible solution: Notice that supplying the color argument inside the aes() function enables you to connect it to a certain variable. The shape argument, as you can see, modifies all data points the same way (it is outside the aes() call), while the color argument, which is placed inside the aes() call, modifies a point’s color based on its species value.

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point(mapping = aes(color = species), size = 2, shape = 17)
Scatter plot of flipper length (mm) against bill length (mm).

Multi-panel figures


Earlier we visualized the relationship between bill length and flipper length across all species in one plot. Alternatively, we can split this out over multiple panels by adding a layer of facet panels.

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = island)) +
  geom_point() +
  facet_wrap( ~ species) +
  theme(axis.text.x = element_text(angle = 45))

The facet_wrap layer took a “formula” as its argument, denoted by the tilde (~). This tells R to draw a panel for each unique value in the species column of the penguins dataset.

Note that we apply a “theme” definition to rotate the x-axis labels to maintain readability. Nearly everything in ggplot2 is customizable.
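The facet specification can also be written with vars() instead of a formula, and the panel layout can be adjusted with arguments such as ncol. The sketch below shows this on an invented data frame (`toy`) and counts the resulting panels with ggplot2's layer_data() helper.

```r
library(ggplot2)

# Toy data, invented for illustration only
toy <- data.frame(
  species = rep(c("Adelie", "Gentoo"), each = 3),
  flipper_length_mm = c(180, 185, 190, 210, 215, 220),
  bill_length_mm = c(38, 39, 40, 46, 47, 48)
)

p <- ggplot(toy, aes(x = flipper_length_mm, y = bill_length_mm)) +
  geom_point() +
  facet_wrap(vars(species), ncol = 1)  # vars() is an alternative to the ~ formula

# Each row of the drawn data carries the PANEL it belongs to
n_panels <- length(unique(layer_data(p)$PANEL))
n_panels
```

There is one panel per unique value of species, stacked in a single column because of ncol = 1.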

Modifying text


To clean this figure up for a publication we need to change some of the text elements. The x-axis is too cluttered, and the axis labels should read “Bill Length” and “Flipper Length”, rather than the column names in the data frame.

We can do this by adding a couple of different layers. The theme layer controls the axis text, and overall text size. Labels for the axes, plot title and any legend can be set using the labs function. Legend titles are set using the same names we used in the aes specification. Thus below the color legend title is set using color = "Island", while the title of a fill legend would be set using fill = "MyTitle".

R

ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = island)) +
  geom_point() +
  facet_wrap( ~ species) +
  labs(
    x = "Flipper Length (mm)",  # x axis title
    y = "Bill Length (mm)",     # y axis title
    title = "Figure 1",         # main title of figure
    color = "Island"            # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

Exporting the plot


The ggsave() function allows you to export a plot created with ggplot. You can specify the dimension and resolution of your plot by adjusting the appropriate arguments (width, height and dpi) to create high quality graphics for publication. In order to save the plot from above, we first assign it to a variable bill_flipper_plot, then tell ggsave to save that plot in png format to a directory called results. (Make sure you have a results/ folder in your working directory.)

R

bill_flipper_plot <- ggplot(data = penguins, mapping = aes(x = flipper_length_mm, y = bill_length_mm, color = island)) +
  geom_point() +
  facet_wrap( ~ species) +
  labs(
    x = "Flipper Length (mm)",  # x axis title
    y = "Bill Length (mm)",     # y axis title
    title = "Figure 1",         # main title of figure
    color = "Island"            # title of legend
  ) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))

ggsave(filename = "results/bill_flipper_plot.png", plot = bill_flipper_plot, width = 12, height = 10, dpi = 300, units = "cm")

There are two nice things about ggsave. First, it defaults to the last plot, so if you omit the plot argument it will automatically save the last plot you created with ggplot. Secondly, it tries to determine the format you want to save your plot in from the file extension you provide for the filename (for example .png or .pdf). If you need to, you can specify the format explicitly in the device argument.
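The following sketch puts these pieces together on an invented plot (`toy_plot`). It writes into a temporary folder rather than your project, which is an assumption made only to keep the example self-contained, and it creates the target folder first so that ggsave() does not stop or prompt because the folder is missing.

```r
library(ggplot2)

# Toy plot, invented for illustration only
toy <- data.frame(x = c(1, 2, 3), y = c(3, 1, 2))
toy_plot <- ggplot(toy, aes(x = x, y = y)) + geom_point()

# Write into a temporary folder so the sketch does not touch your project
out_dir <- file.path(tempdir(), "results")
dir.create(out_dir, showWarnings = FALSE)  # make sure the folder exists
out_file <- file.path(out_dir, "toy_plot.png")

# The .png extension tells ggsave() which format to use
ggsave(filename = out_file, plot = toy_plot, width = 12, height = 10, units = "cm", dpi = 300)

file.exists(out_file)
```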

This is a taste of what you can do with ggplot2. RStudio provides a really useful cheat sheet of the different layers available, and more extensive documentation is available on the ggplot2 website. All RStudio cheat sheets are available from the RStudio website. Finally, if you have no idea how to change something, a quick Google search will usually send you to a relevant question and answer on Stack Overflow with reusable code to modify!

Challenge 5

Generate a boxplot to compare flipper length between species.

Advanced:

  • Add axis labels
  • Hide the legend

Here is a possible solution: labs() sets the labels for the x and y axes (xlab() and ylab() would work too). The legend, like the axis title, text and ticks, is an attribute of the theme and must be modified within a theme() call.

R

ggplot(data = penguins, mapping = aes(x = species, y = flipper_length_mm, fill = species)) +
  geom_boxplot() +
  labs(
    x = "Species",             # x axis title
    y = "Flipper Length (mm)"  # y axis title
  ) +  
  theme(legend.position = "none")
Boxplot comparing flipper length (mm) across penguin species, with labeled axes showing species on the x-axis and flipper length on the y-axis, and the legend hidden for a cleaner view.

Further reading


We recommend the following resources for some additional reading on the topic of this episode:

Other Plotting Systems

Key Points

  • Use ggplot2 to create plots.
  • Think about graphics in layers: aesthetics, geometry, statistics, scale transformation, and grouping.

Content from Wrap-up


Last updated on 2024-12-10

Overview

Questions

  • What are the next steps in learning R to further your R coding skills?

Objectives

  • Reflect on the key topics covered in this course and identify areas for further study to enhance your R skills.

In this course, we have introduced you to the basics of R and RStudio and demonstrated how these tools can be used for scientific analysis. The concepts and techniques you’ve learned here are essential to getting started with R, but there is much more to explore. Learning R is a journey, and you will find additional topics and tools that are vital for working effectively in your domain.

Some of the topics we touched on briefly, or didn’t have time to explore in depth, will be critical for your continued success with R. Here, we’ll highlight a few areas that you should prioritize as you continue your R journey.

Topics to Explore Further


While this course covered the essentials, some key topics were only mentioned in passing or were not included. It is important to delve deeper into these areas to fully leverage R for your research.

Factors

  • Factors are a specific data type in R used to represent categorical variables with discrete levels. They are essential for performing statistical analysis on categorical data.
  • If you plan to analyze survey data, experimental treatments, or any other dataset involving categories, learning how to create and manipulate factors is crucial.
  • See Chapter 16 Factors in R for Data Science (2e) by Hadley Wickham.
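As a first taste of factors, the sketch below uses only base R and an invented vector of island names. It shows how a factor stores a fixed set of categories (its levels) in a chosen order.

```r
# Toy categorical data, invented for illustration only
island <- c("Dream", "Biscoe", "Dream", "Torgersen")

# Convert to a factor with an explicit level order
island_f <- factor(island, levels = c("Torgersen", "Dream", "Biscoe"))

levels(island_f)      # the allowed categories, in the order given
as.integer(island_f)  # the integer codes R stores internally
table(island_f)       # counts per level, reported in level order
```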

Missing Data


Data Structures: Matrices and Arrays

  • Matrices and arrays are foundational data structures in R, used to represent multi-dimensional data.
  • These structures are particularly important if you intend to:
    • Work with numerical computations across multi-dimensional datasets.
    • Develop your own statistical R packages or perform advanced simulations.
  • Understanding these data structures will give you greater flexibility and power in how you handle and process data.
  • See Chapter 13 - Data Types and Structures in the Software Carpentry Course “Programming with R”
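For orientation, here is a minimal base R sketch of these structures. The object names `m` and `a` are arbitrary, and the values are simply the integers 1 to 6 and 1 to 24.

```r
# A matrix with 2 rows and 3 columns, filled column by column (R's default)
m <- matrix(1:6, nrow = 2, ncol = 3)

dim(m)      # number of rows and columns
m[2, 3]     # element in row 2, column 3
t(m) %*% m  # matrix multiplication of the transpose with m

# An array generalises a matrix to more than two dimensions
a <- array(1:24, dim = c(2, 3, 4))
dim(a)
```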

Next Steps in Self-Learning


In order to build on what you’ve learned in this course, we also recommend focusing on the following topics next:

  1. Control Flow
  2. Functions
  3. Data Frame Manipulation with tidyr

Looking Ahead


The skills you’ve gained here form a strong foundation for using R in scientific research. As you continue to practice and explore R, remember that growing your coding expertise will help:

  • Your team, by making your work accessible and reproducible.
  • Your peers, by sharing scripts and workflows that others can understand and build upon.
  • The broader scientific community, by contributing to open, transparent, and impactful research.

Key Points

  • This course covered the essentials of R and RStudio for reproducible analysis, providing a strong foundation for further learning.
  • Topics such as factors, matrices and arrays, and more advanced concepts like data frame manipulation with tidyr, control flow, and functions are critical next steps.
  • Continued practice and exploration of R will benefit you, your team, and the wider research community.