Marie-Hélène Burle
email msb2@sfu.ca  twitter @MHBurle  github prosoitos
August 24, 2018

Functional programming in R (with purrr)

Introduction

What is functional programming?

Functional programming is a programming paradigm in which computation is expressed as the evaluation of functions. It contrasts with imperative programming, which describes a sequence of statements that change the state of the program. While some languages are strictly functional (e.g. Haskell), R allows both imperative code (e.g. loops) and functional code (e.g. many base functions, the apply() family, purrr).
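
As a minimal sketch of what "based on the evaluation of functions" means in practice, the base R snippet below treats functions as ordinary values. The names square, apply_twice, make_adder, and add_three are hypothetical and chosen only for illustration.

```r
# Functions are ordinary values: they can be stored in a variable ...
square <- function(v) v^2

# ... passed as arguments to other functions ...
apply_twice <- function(f, v) f(f(v))
apply_twice(square, 3)

# ... and returned by other functions
make_adder <- function(n) function(v) v + n
add_three <- make_adder(3)
add_three(1:3)
```

Functions such as apply_twice, which take another function as an argument, are the building blocks that the apply() family and purrr rely on.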

Iterations

Iterations are the repetition of a process (e.g. applying the same function to several variables, several datasets, or several files).

The classic methods in R are:

  • loops
  • the apply() family of functions
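
For comparison, here is a small sketch of both classic methods applied to the same task (computing one mean per element of a list). The list measurements is made up for this example.

```r
measurements <- list(a = c(1, 2, 3), b = c(10, 20), c = c(5, 5, 5, 5))

# Loop: preallocate the output, then fill it one element at a time
means_loop <- numeric(length(measurements))
names(means_loop) <- names(measurements)
for (nm in names(measurements)) {
  means_loop[nm] <- mean(measurements[[nm]])
}

# apply() family: lapply() always returns a list,
# sapply() tries to simplify the result to a vector
means_lapply <- lapply(measurements, mean)
means_sapply <- sapply(measurements, mean)
```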

The purrr package

One of the tidyverse core packages, purrr was written in 2015 by Lionel Henry (also the maintainer), Hadley Wickham, and RStudio, Inc.

Goal

Purrr is a set of tools allowing consistent functional programming in R in a tidyverse style (using magrittr pipes and following the same naming conventions found in other tidyverse packages).

As Hadley Wickham says, in many ways, purrr is the equivalent of dplyr, but while dplyr focuses on data frames, purrr works on vectors: it works on the elements of atomic vectors, lists, and data frames. Since R's most basic data structure is the vector, this makes purrr extremely powerful and flexible.

Logistics

Install it with:

install.packages("tidyverse")
## or
install.packages("purrr")

Load it with:

library(tidyverse)
## or
library(purrr)

As always, once the package is loaded, you can get information on the package with:

?purrr

and on any of its functions with:

## general form: ?name_of_function
## e.g. for the map function
?map

Let's dive in

Load packages

First, let's load the packages that we will use. It is always a good idea to list all the packages that you will be using at the top of the script. This helps others who use your script to know what is required to run it.

library(tidyverse)   # we will use purrr and other core packages
library(magrittr)    # we will use several types of pipes

Create some fake banding data

Let's create some imaginary bird banding data:

banding <- tibble(
  bird = paste0("bird", 1:50),
  sex = sample(c("F", "M"), 50, replace = TRUE),
  population = sample(LETTERS[1:3], 50, replace = TRUE),
  mass = rnorm(50, 43, 4) %>% round(1),
  tarsus = rnorm(50, 27, 1) %>% round(1),
  wing = rnorm(50, 112, 3) %>% round(0)
)

banding

Map: apply functions to elements of a list

Imagine that you want to calculate the mean for each of the morphometric measurements (mass, tarsus, and wing).

How would you usually do this?
Spend 5 minutes writing code you would usually use.

To apply functions to elements of a list, you can use map, one of the key functions of the purrr package.

Usage

map(.x, .f, ...)
.x     a list or atomic vector
.f     a function, formula, or atomic vector
...     additional arguments passed to .f

For every element of .x, apply .f.

What we have, in the simplest case, is:

map(list, function)
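
The following sketch shows the three parts of the signature on a small made-up list called scores: the list goes in .x, the function in .f, and anything after .f is forwarded to that function through the dots.

```r
library(purrr)

scores <- list(a = c(1, 2, NA), b = c(4, 5, 6))

# .x is the list, .f is the function: each element is passed to mean()
plain <- map(scores, mean)

# Extra arguments placed after .f are passed on to mean() for every element
no_na <- map(scores, mean, na.rm = TRUE)
```

Here plain$a is NA because of the missing value, while no_na$a is computed on the two remaining values.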

In our example

How could we use map() to calculate the means of all 3 measurement types?

A data frame is a list! It is a list of vectors.

Without running it on your computer, try to guess what the result of the following will be:

length(banding)

Now, run it. What do you get? Why?

So, back to our example, we do have a list: a list of vectors. That's what our banding data frame is! So we can apply map() to it directly.
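
The same idea can be checked on a tiny data frame (df is a hypothetical example, not the banding data):

```r
df <- data.frame(x = 1:3, y = c(2.5, 3.5, 4.5))

is.list(df)   # a data frame is a list ...
length(df)    # ... whose length is the number of columns, not of rows
names(df)     # the names of the list elements are the column names
df[["y"]]     # each element of the list is a column vector
```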

Answer

map(banding[4:6], mean)

or using a pipe

banding[4:6] %>% map(mean)

However, the output of map() is always a list. And a list as output is not really convenient here. There are other map functions which have vector or data frame outputs. To get a numeric vector as the output, we use map_dbl():

Answer

map_dbl(banding[4:6], mean)

or

banding[4:6] %>% map_dbl(mean)
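
A short sketch of the difference between the two output types, using a made-up list called cols. It also illustrates, as an aside, that the typed variants raise an error rather than silently returning something else when .f does not produce a single value per element.

```r
library(purrr)

cols <- list(mass = c(40, 44), wing = c(110, 114))

out_list <- map(cols, mean)       # a named list
out_dbl  <- map_dbl(cols, mean)   # a named double vector

# range() returns 2 values per element, so map_dbl() refuses to proceed
failed <- tryCatch(
  {
    map_dbl(cols, range)
    FALSE
  },
  error = function(e) TRUE
)
```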

Similarly, you can calculate the variance, the sum, look for the largest value, or apply any other function to our data.

Spend 2 min writing code for these.

Answer

map_dbl(banding[4:6], var)
map_dbl(banding[4:6], sum)
map_dbl(banding[4:6], max)

Stepping things up

Now, imagine that you would like to plot the relationship between tarsus and mass for each population.

How would you usually do that?
Spend 5 min writing code for this.
And feel free to chat.

Answer

You could write a for loop:

for (i in unique(banding$population)) {
  print(ggplot(banding %>% filter(population == i),
               aes(tarsus, mass)) + geom_point())
}

But this is the functional programming method:

banding %>%
  split(.$population) %>%
  map(~ ggplot(., aes(tarsus, mass)) + geom_point())

Let's save those graphs in a variable called graphs that we will use later.

graphs <-
  banding %>%
  split(.$population) %>%
  map(~ ggplot(., aes(tarsus, mass)) + geom_point())
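
To see what happens at each step of this pipeline, here is a sketch on a toy data frame (toy is hypothetical). split() returns a named list with one data frame per group, and map functions keep those names in their output.

```r
library(purrr)

toy <- data.frame(
  population = c("A", "B", "A", "B"),
  mass = c(40, 42, 44, 46)
)

# One data frame per population, named after the population
pieces <- split(toy, toy$population)
names(pieces)

# The function is applied to each piece; the names are carried over
mean_mass <- map_dbl(pieces, ~ mean(.$mass))
```

In the same way, graphs is a list of plots named after the populations, which is convenient when the plots need to be matched to file names.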

Formulas

Formulas = a shorter notation for anonymous functions

With one element

The code:

map(function(x) x + 3)

which contains the anonymous function function(x) x + 3 can be written as:

map(~ . + 3)

This code abbreviation is called a "formula".
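
The calls in this section are schematic: they leave out the first argument (the vector or list to iterate over), which must be supplied directly or through a pipe. A runnable sketch, with a made-up vector x:

```r
library(purrr)

x <- c(1, 2, 3)

# Anonymous function and formula notation give the same result
with_function <- map_dbl(x, function(x) x + 3)
with_formula  <- map_dbl(x, ~ . + 3)

identical(with_function, with_formula)
```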

Your turn: write the following anonymous function as a formula.

map(function(x) mean(x) + 3)

Answer

map(~ mean(.) + 3)

With 2 elements

The code:

map2(function(x, y) x + y)

can be shortened to:

map2(~ .x + .y)

Referring to elements

1st element   2nd element   3rd element
.
.x            .y
..1           ..2           ..3

etc.
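
The three notations can be tried side by side. This is a small sketch with made-up vectors a, b, and d:

```r
library(purrr)

a <- c(1, 2, 3)
b <- c(10, 20, 30)
d <- c(100, 200, 300)

# One input: . (or .x) is the current element
one <- map_dbl(a, ~ .x * 2)

# Two inputs: .x and .y are the current elements of each input
two <- map2_dbl(a, b, ~ .x + .y)

# Any number of inputs, passed as a list: ..1, ..2, ..3, etc.
three <- pmap_dbl(list(a, b, d), ~ ..1 + ..2 + ..3)
```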

Your turn: write the following anonymous function as a formula.

pmap(function(x1, x2, y) lm(y ~ x1 + x2))

Answer

pmap(~ lm(..3 ~ ..1 + ..2))

map_if/modify_if and map_at/modify_at

We built our data frame with tibble() which, as is the norm in the tidyverse, does not transform strings into factors:

banding <-
  tibble(
    bird = paste0("bird", 1:50),
    sex = sample(c("F", "M"), 50, replace = TRUE),
    population = sample(LETTERS[1:3], 50, replace = TRUE),
    mass = rnorm(50, 43, 4) %>% round(1),
    tarsus = rnorm(50, 27, 1) %>% round(1),
    wing = rnorm(50, 112, 3) %>% round(0)
  ) %T>% 
  str()

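Several base R functions, however, do. This was the default behaviour of data.frame() before R 4.0.0; from R 4.0.0 onwards, you need to set stringsAsFactors = TRUE to get it.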

Let's build the same data with the base R function data.frame():

banding <-
  data.frame(
    bird = paste0("bird", 1:50),
    sex = sample(c("F", "M"), 50, replace = TRUE),
    population = sample(LETTERS[1:3], 50, replace = TRUE),
    mass = rnorm(50, 43, 4) %>% round(1),
    tarsus = rnorm(50, 27, 1) %>% round(1),
    wing = rnorm(50, 112, 3) %>% round(0),
    stringsAsFactors = TRUE   # the default before R 4.0.0; must be explicit since
  ) %T>% 
  str()

The reason several base R functions transform strings into factors is historical: this used to be essential to save space. It is no longer relevant and has become somewhat of an annoyance.

If you have such a data frame, you may wish to transform the factors into characters.

How can you do this?

map() has the derivatives map_if() and map_at(), which apply a function only when a condition is met or only at certain locations. Here, we can use map_if():

banding %>%
  map_if(is.factor, as.character) %T>% 
  str()

However, map_if and map_at always return lists. If you want the output to be of the same type as the input, use modify_if and modify_at instead.

banding <-
  data.frame(
    bird = paste0("bird", 1:50),
    sex = sample(c("F", "M"), 50, replace = TRUE),
    population = sample(LETTERS[1:3], 50, replace = TRUE),
    mass = rnorm(50, 43, 4) %>% round(1),
    tarsus = rnorm(50, 27, 1) %>% round(1),
    wing = rnorm(50, 112, 3) %>% round(0),
    stringsAsFactors = TRUE   # the default before R 4.0.0; must be explicit since
  )

banding %>%
  modify_if(is.factor, as.character) %>%
  head() %T>% 
  str()
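
The difference between the two families is easy to verify on a small example. In this sketch, df is a hypothetical two-column data frame whose id column is explicitly created as a factor, so that the result does not depend on the R version.

```r
library(purrr)

df <- data.frame(
  id = factor(c("a", "b")),
  value = c(1.5, 2.5)
)

# map_if() converts the factor column but returns a plain list
as_list <- map_if(df, is.factor, as.character)

# modify_if() does the same conversion and keeps the data frame
as_df <- modify_if(df, is.factor, as.character)

class(as_list)
class(as_df)
```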

This could also be accomplished with mutate_if():

banding %>% mutate_if(is.factor, as.character)

But the map() functions also work with lists and are more flexible than mutate() and its derivatives.

Usage

modify(.x, .f, ...)
modify_if(.x, .p, .f, ...)
modify_at(.x, .at, .f, ...)
.x     a list or atomic vector
.f     a function, formula, or atomic vector
...    additional arguments passed to .f
.p     a predicate function.
       Only the elements for which .p evaluates to TRUE will be modified
.at    a character vector of names or a numeric vector of positions.
       Only the elements corresponding to .at will be modified

For every element of .x, apply .f, and return a modified version of .x.

So basically, in its simplest form, we have:

modify(list, function)
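
As an illustration of the .at argument, here is a sketch of modify_at() selecting columns first by name and then by position. The data frame df and its values are made up.

```r
library(purrr)

df <- data.frame(
  mass = c(43.21, 40.07),
  tarsus = c(27.33, 26.91),
  wing = c(112, 109)
)

# Select by name: round only mass and tarsus
by_name <- modify_at(df, c("mass", "tarsus"), round)

# Select by position: rescale only the 3rd column
by_pos <- modify_at(df, 3, ~ . / 10)
```

In both cases the result is still a data frame, and the columns that were not selected are left untouched.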

Walk: apply side effects to elements of a list

Now, we want to save the 3 graphs we previously drew into 3 files.

How would you do this?
Spend 5 minutes writing code you would usually use.

To apply side effects to elements of a list, we use the walk family of functions.

Usage

walk(.x, .f, ...)
.x     a list or atomic vector
.f     a function, formula, or atomic vector
...     additional arguments passed to .f
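
A minimal sketch of walk() on a made-up character vector. The side effect here is simply printing to the console.

```r
library(purrr)

x <- c("a", "b", "c")

# walk() calls .f on each element for its side effect (here, printing) ...
returned <- walk(x, ~ cat("processing", ., "\n"))

# ... and returns its input invisibly, unchanged
identical(returned, x)
```

Because the input is returned unchanged, walk() can sit in the middle of a pipe without interrupting it.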

Apply to our example

We already have a list of graphs: graphs. Now, we can create a list of paths where we want to save them:

paths <- paste0("population_", names(graphs), ".png")

So we want to save each element of graphs into an element of paths. The function we will use is ggsave. To apply it to all of our elements, instead of using map, we will use walk because we are not trying to create a new object.

The problem is that we have 2 lists to deal with. map and walk only iterate over one list. But map2 and walk2 iterate over 2 lists in parallel (pmap and pwalk handle any number of lists).

Here is how walk2 works (it is the same for map2):

walk2(.x, .y, .f, ...)
.x, .y   vectors of the same length.
         A vector of length 1 will be recycled.
.f       a function, formula, or atomic vector
...       additional arguments passed to .f
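
Before trying it with graphs, here is a sketch of the same pattern with plain text files. The file names and contents are hypothetical, and the files are written to a temporary directory with the base R function writeLines().

```r
library(purrr)

contents <- c("first", "second")
files <- file.path(tempdir(), c("one.txt", "two.txt"))

# Elements are paired by position:
# contents[1] goes to files[1], contents[2] goes to files[2]
walk2(contents, files, ~ writeLines(.x, .y))

# A vector of length 1 is recycled: "!" is appended to every element
labels <- map2_chr(contents, "!", paste0)
```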

Give it a try:
use walk2 to save the elements of graphs into the elements of paths using ggsave.
Don't hesitate to look up the help file for ggsave with ?ggsave if you don't remember how to use it!

Answer

walk2(paths, graphs, ggsave)

Summary of the map and walk families of functions

We use a different map function (or walk function, if we want the side effects) depending on:

- How many lists we are using in the input

number of arguments in input     purrr function
1                                map or walk
2                                map2 or walk2
more                             pmap or pwalk
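
For the last row of this table, the inputs are passed together as one list. When the elements of that list are named, pmap() matches them to the arguments of the function by name. A sketch with rnorm() and a hypothetical list called args:

```r
library(purrr)

# Each call uses one value from each element of the list:
# rnorm(n = 1, mean = 0, sd = 1), rnorm(n = 2, mean = 10, sd = 1), etc.
args <- list(
  n = c(1, 2, 3),
  mean = c(0, 10, 100),
  sd = c(1, 1, 1)
)

draws <- pmap(args, rnorm)

# The draws are random, but their lengths follow n
lengths(draws)
```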

- The class of the output we want

class we want for the output       purrr function
nothing*                           walk
list*                              map
double                             map_dbl
integer                            map_int
character                          map_chr
logical                            map_lgl
data frame (by row-binding)        map_dfr
data frame (by column-binding)     map_dfc

Results are returned predictably and consistently, which is not the case with sapply().

*As Jenny Bryan said nicely:

"walk() can be thought of as map_nothing()

map() can be thought of as map_list()"
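
The remark about sapply() can be checked with a few small inputs. In this sketch (all inputs are made up), the type of the sapply() output depends on the data, whereas map_int() returns an integer vector every time.

```r
library(purrr)

# sapply(): the output type changes with the input
counts <- sapply(list(1:3, 4:6), length)   # integer vector
empty  <- sapply(list(), length)           # empty list, not integer(0)
mat    <- sapply(list(1:2, 3:4), rev)      # matrix
lst    <- sapply(list(1:2, 3:5), rev)      # list

# map_int(): always an integer vector, even for empty input
counts_int <- map_int(list(1:3, 4:6), length)
empty_int  <- map_int(list(), length)
```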


- How we want to select the input

selecting input based on     purrr function
condition                    map_if
location                     map_at

Conclusion

These are some of the most important purrr functions. But there are many others and I encourage you to explore them by yourself.

Great resources for this are:

Have fun!!!