Marie-Hélène Burle
             -----------
email msb2@sfu.ca  twitter @MHBurle  github prosoitos
September 05, 2018

Tips on best practices in R

Projects

Work in self-contained projects

Advantages:

  • allows collaboration
  • makes projects portable
  • facilitates version control
  • reduces the risk of ending up with broken links or missing files that the project needs

Create a directory that contains all the necessary subdirectories and files for your project

This includes data, scripts, as well as the project outputs (graphs, results, etc.)

If you use RStudio, create RStudio projects.

Example of a possible project structure

This would be a rather standard project structure:

/project_root
    /data
        /raw
        /clean
    /results
        /graphs
        /tables
    /docs
    /bin
    /src
    /ms

You are free to organize your projects in a way that works for you. But following a fairly standard layout helps collaborators find their way around your project, and consistency between projects makes it easier to adapt scripts and snippets from one project to the next.
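As an illustration, the structure above can be created from R itself. This is only a sketch: it builds the skeleton inside a temporary directory (an assumption made so that running it does not touch your real files), and the directory names are simply those of the example layout.

```r
# Hypothetical location: a temporary directory, used only for this demo
project_root <- file.path(tempdir(), "project_root")

# Subdirectories taken from the example structure
subdirs <- c(
  "data/raw", "data/clean",
  "results/graphs", "results/tables",
  "docs", "bin", "src", "ms"
)

# recursive = TRUE also creates the parent directories (data, results)
for (d in file.path(project_root, subdirs)) {
  dir.create(d, recursive = TRUE, showWarnings = FALSE)
}

# List what was created, relative to the project root
created <- list.dirs(project_root, full.names = FALSE, recursive = TRUE)
print(created)
```

Replacing tempdir() with the directory in which you keep your projects gives you the same skeleton in a permanent location.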

Paths

The problem with absolute paths

If this is how your script starts:

setwd("C:\\Users\\charlie\\some_very_personal_directory_structure\\our_project")

what do you think the odds are that the person to whom you are giving that script

  • is called Charlie and
  • is using Windows and
  • has the exact same directory structure on their computer?

In order to run your script, your friend Lucy, who uses Linux, will have to change all such paths to:

"/home/lucy/totally_different_directory_structure/our_project"

This makes for challenging collaborations.

If you move your project to another machine, the paths will break in the same way. They are also likely to break over time as you reorganize the files on your computer.

Solution: use paths relative to your project root

Once your entire project lives within one directory (the project root), as we saw in the previous section, make sure that:

all the paths in your scripts are relative to the project root

This makes your project portable, much more robust (less likely to break over time), and straightforward to share.

The way a lot of people get around this is by manually setting the working directory to the project root in RStudio (by clicking on "Set As Working Directory"). But for the script to run on another machine, the user will have to do the same before running it (unless the script is in the project root, but this is rarely the case and you certainly cannot assume that it is).

A portable script should run as is, without this manual tinkering. Why?

  1. It is easy to forget to do the tinkering and wonder why nothing works
  2. This manual tinkering prevents automation, which defeats one of the main advantages of programming

Once you realize the advantages of writing code over using GUI software, you should also realize that you have to be careful in how you use RStudio. Clicking around is fine for some tasks, but not when the click does something the script needs in order to run successfully (such as setting the working directory).

If you use RStudio projects (and if you use RStudio, you definitely should create RStudio projects), RStudio will automatically set the working directory to the project root when you open a project. This is great. But you cannot assume that everybody running your script uses RStudio: R scripts can be run directly in R, from the command line, from a shell script, or using other tools or IDEs. So the script itself should ideally contain a way to refer to the project root, independent of RStudio.

How?

Package here

There are various methods, but a wonderfully easy one is:

the package here from Kirill Müller

The function here::here() starts from the current working directory (the directory from which R was launched, if you don't change it manually in RStudio or with setwd()) and goes up the directory chain until it finds a .Rproj file (if you use RStudio projects), a .git or .svn directory (if you version control your projects), a .projectile file (if you use Emacs Projectile), or another sensible file that signifies a project root. If none of these apply to you (which is unlikely), you can create a .here file in your project root with the function set_here(), and this file will then tell here() that this is the project root.

From there on, you can refer to any file in your project with here("file/path/from/project/root").
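To make the idea of "going up the directory chain" concrete, here is a minimal base R sketch of that kind of search. It is not the implementation used by the package here: the function name find_project_root(), its arguments, and the list of marker files are assumptions chosen for illustration, and the demo project lives in a temporary directory.

```r
# Walk up from `start` until a directory holding a project marker is found.
# find_project_root() is a hypothetical helper, not part of the package here.
find_project_root <- function(start = getwd(),
                              markers = c(".here", ".git", ".svn", ".projectile")) {
  current <- normalizePath(start, winslash = "/", mustWork = TRUE)
  repeat {
    # file.exists() is TRUE for directories too, so .git and .svn are covered
    has_marker <- any(file.exists(file.path(current, markers)))
    # RStudio project files have arbitrary names ending in .Rproj
    has_rproj <- length(list.files(current, pattern = "\\.Rproj$")) > 0
    if (has_marker || has_rproj) {
      return(current)
    }
    parent <- dirname(current)
    # At the top of the file system, dirname() returns its input unchanged
    if (identical(parent, current)) {
      stop("No project root found above ", start, call. = FALSE)
    }
    current <- parent
  }
}

# Demo: a throwaway project with a .here marker and a nested script directory
demo_root <- file.path(tempdir(), "demo_project")
script_dir <- file.path(demo_root, "src", "analysis")
dir.create(script_dir, recursive = TRUE, showWarnings = FALSE)
file.create(file.path(demo_root, ".here"))

# Starting two levels down, the search climbs back up to demo_project
root <- find_project_root(script_dir)
data_path <- file.path(root, "data", "raw", "my_data.xlsx")
print(data_path)
```

In real scripts, here() does this for you; the sketch only shows why a script can sit anywhere inside the project and still find the same root.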

Example usage:

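library(tidyverse)
library(readxl)  # read_excel() lives in readxl, which library(tidyverse) does not attach
library(here)

my_data <- read_excel(here("data/raw/my_data.xlsx"))
# x and y stand in for two columns of my_data (hypothetical names)
my_plot <- ggplot(data = my_data, aes(x = x, y = y)) + geom_point()
ggsave(here("results/graphs/my_plot.png"), plot = my_plot)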

Clean session

Never set anything that might change how your code runs

In particular:

  • never save your workspace upon closing a session (beware of RStudio default settings! go edit them now),
  • restart your R session frequently to make sure that you are not running bits of code from past sessions,
  • do not add anything to your .Rprofile, .Renviron, or any other settings file that would affect the output of your code in any way, such as setting options, creating functions, loading packages, etc. This is tempting if you always use the same options or packages. But it makes your scripts non-reproducible for others who do not have those settings. It is much better to create snippets that add those lines of code very easily (even automatically) at the beginning of your scripts.
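As an illustration of such a snippet, here is a minimal sketch of a script header that states its own settings instead of relying on a settings file. The packages and option values are placeholders (base R packages are used so that the sketch runs anywhere), not recommendations.

```r
# Attach the packages the script uses (placeholders for your own list)
library(stats)
library(utils)

# Set the options the script relies on, keeping the previous values
old_options <- options(digits = 4, scipen = 10)

# Anyone running the script now gets output formatted the same way
print(pi)

# Restore the previous values once they are no longer needed
options(old_options)
```

Because the settings travel with the script, a collaborator with an empty .Rprofile gets the same behaviour as you do.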

Formatting

There is no official R formatting standard. Hadley Wickham wrote a short book on R formatting, which can be a great template to follow. A growing number of people are following his guidelines, so it is a good idea to familiarize yourself with them.

The package lintr by Jim Hester (which runs in Emacs ESS, Sublime, Vim, and Atom) and RStudio's built-in functionalities highlight where your code does not follow these formatting recommendations. They can be a great way to get used to applying the recommendations to your code until they become automatic.

But the most important pieces of advice when it comes to formatting code are:

be consistent

follow the style used by your collaborators, particularly if you edit their scripts

Things you do not want in a script

Avoid anything that will make changes to a computer

If someone runs your script, it should not install packages or make any other change to their machine. So, for instance, avoid

install.packages()
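One possible alternative, sketched below, is to check that the required packages are available and to stop with an informative message if they are not, leaving the decision to install to the user. The helper name check_packages() is hypothetical, and base R packages are used in the call so that the sketch runs anywhere.

```r
# Stop with a clear message if any required package is missing.
# check_packages() is a hypothetical helper; it installs nothing.
check_packages <- function(pkgs) {
  # requireNamespace() returns FALSE, rather than failing, for absent packages
  is_installed <- vapply(pkgs, requireNamespace, logical(1), quietly = TRUE)
  missing_pkgs <- pkgs[!is_installed]
  if (length(missing_pkgs) > 0) {
    stop(
      "Please install the following packages before running this script: ",
      paste(missing_pkgs, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(TRUE)
}

# Placeholder list: replace with the packages your script depends on
check_packages(c("stats", "utils"))
```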

I owe these better coding habits to Jenny Bryan and Hadley Wickham. Do not hesitate to look for their books, workshops, and other materials, which are very useful and open source.