Functional programming in R (with purrr)
Introduction
What is functional programming?
Functional programming is a programming paradigm based on the evaluation of functions, as opposed to imperative programming, which relies on statements that change the program's state. While some languages are strictly functional (e.g. Haskell), R allows both imperative code (e.g. loops) and functional code (e.g. many base functions, the apply() family, purrr).
Iterations
Iterations are the repetition of a process (e.g. applying the same function to several variables, several datasets, or several files).
The classic methods in R are:
- loops
- the apply() family of functions
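As a minimal sketch of these two classic approaches, the example below computes the same means once with a loop and once with vapply(). The `measurements` list and its values are made up for illustration.

```r
# Three small numeric vectors to iterate over (made-up values)
measurements <- list(
  mass   = c(41.2, 44.8, 43.1),
  tarsus = c(26.5, 27.3, 27.9),
  wing   = c(110, 113, 115)
)

# Imperative style: pre-allocate the result, then fill it in a loop
means_loop <- numeric(length(measurements))
names(means_loop) <- names(measurements)
for (i in seq_along(measurements)) {
  means_loop[i] <- mean(measurements[[i]])
}

# apply() family: vapply() applies mean() to each element and
# checks that every result is a single number
means_vapply <- vapply(measurements, mean, numeric(1))

# Both approaches give the same named numeric vector
all.equal(means_loop, means_vapply)
```

The loop has to manage the index and the result container itself, while the functional version only states which function is applied to which elements.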
The purrr package
One of the tidyverse core packages, purrr was written in 2015 by Lionel Henry (also the maintainer), Hadley Wickham, and RStudio Inc.
Goal
purrr is a set of tools allowing consistent functional programming in R in a tidyverse style (using magrittr pipes and following the same naming conventions found in other tidyverse packages).
As Hadley Wickham says, in many ways, purrr is the equivalent of dplyr, but while dplyr focuses on data frames, purrr works on vectors: it works on the elements of atomic vectors, lists, and data frames. Since R's most basic data structure is the vector, this makes purrr extremely powerful and flexible.
Logistics
Install it with:
install.packages("tidyverse") ## or install.packages("purrr")
Load it with:
library(tidyverse) ## or library(purrr)
As always, once the package is loaded, you can get information on the package with:
?purrr
and on any of its functions with:
?function_name ## e.g. for the map function: ?map
Let's dive in
Load packages
First, let's load the packages that we will use. It is always a good idea to load all the packages you will be using at the top of the script. This helps others who use your script know what is required to run it.
library(tidyverse) # we will use purrr and other core packages
library(magrittr)  # we will use several types of pipes
Create some fake banding data
Let's create some imaginary bird banding data:
banding <- tibble(
  bird = paste0("bird", 1:50),
  sex = sample(c("F", "M"), 50, replace = TRUE),
  population = sample(LETTERS[1:3], 50, replace = TRUE),
  mass = rnorm(50, 43, 4) %>% round(1),
  tarsus = rnorm(50, 27, 1) %>% round(1),
  wing = rnorm(50, 112, 3) %>% round(0)
)

banding
Map: apply functions to elements of a list
Imagine that you want to calculate the mean for each of the morphometric measurements (mass, tarsus, and wing).
How would you usually do this?
Spend 5 minutes writing code you would usually use.
To apply functions to elements of a list, you can use map, one of the key functions of the purrr package.
Usage
map(.x, .f, ...)
.x    a list or atomic vector
.f    a function, formula, or atomic vector
...   additional arguments passed to .f
For every element of .x, apply .f.
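A minimal sketch of how the three arguments fit together, using a made-up list `x`: the extra argument `na.rm = TRUE` is not used by map() itself but is passed on to mean() through `...`.

```r
library(purrr)

# A small list with a missing value (made-up numbers)
x <- list(a = c(1, 2, NA), b = c(4, 5, 6))

# na.rm = TRUE is not an argument of map(): it is passed on to mean()
res <- map(x, mean, na.rm = TRUE)

# The result is a list with the same names as the input
str(res)
```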
What we have, in the simplest case, is:
map(list, function)
In our example
How could we use map() to calculate the means of all 3 measurement types?
A data frame is a list! It is a list of vectors.
Without running it on your computer, try to guess what the result of the following will be:
length(banding)
Now, run it. What do you get? Why?
So, back to our example, we do have a list: a list of vectors. That's what our banding data frame is! So no problem about applying map() to it.
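The same point can be checked on a tiny data frame. The sketch below uses a hypothetical `df` (not the banding data) and only base R functions.

```r
# A hypothetical two-column data frame
df <- data.frame(id = 1:4, mass = c(41.2, 44.8, 43.1, 40.7))

is.list(df)    # a data frame is stored as a list
length(df)     # so its length is the number of columns, not of rows
df[["mass"]]   # and each element of that list is a column vector
```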
Answer
map(banding[4:6], mean)
or using a pipe
banding[4:6] %>% map(mean)
However, the output of map() is always a list, which is not really convenient here. There are other map functions which return atomic vectors or data frames instead. To get a numeric vector as the output, we use map_dbl():
Answer
map_dbl(banding[4:6], mean)
or
banding[4:6] %>% map_dbl(mean)
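To make the difference between the two outputs concrete, here is a small sketch on a made-up data frame `df`; the object names are only for illustration.

```r
library(purrr)

# A made-up data frame with two numeric columns
df <- data.frame(mass = c(41.2, 44.8, 43.1), wing = c(110, 113, 115))

as_list   <- map(df, mean)      # a list with one element per column
as_vector <- map_dbl(df, mean)  # a named double vector

class(as_list)
class(as_vector)
```

Both results hold the same numbers and the same names; only the container differs.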
Similarly, you can calculate the variance, the sum, look for the largest value, or apply any other function to our data.
Spend 2 minutes writing code for these.
Answer
map_dbl(banding[4:6], var)
map_dbl(banding[4:6], sum)
map_dbl(banding[4:6], max)
Stepping things up
Now, imagine that you would like to plot the relationship between tarsus and mass for each population.
How would you usually do that?
Spend 5 min writing code for this.
And feel free to chat.
Answer
You could write a for loop:
for (i in unique(banding$population)) {
  print(
    ggplot(banding %>% filter(population == i), aes(tarsus, mass)) +
      geom_point()
  )
}
But this is the functional programming method:
banding %>%
  split(.$population) %>%
  map(~ ggplot(., aes(tarsus, mass)) + geom_point())
Let's save those graphs in a variable called graphs that we will use later.
graphs <- banding %>%
  split(.$population) %>%
  map(~ ggplot(., aes(tarsus, mass)) + geom_point())
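The split-then-map pattern is not specific to plots. As a sketch, assuming a small made-up data frame `toy`, the same two steps give the mean mass per population:

```r
library(purrr)

# Made-up data: five birds from three populations
toy <- data.frame(
  population = c("A", "A", "B", "B", "C"),
  mass = c(40, 42, 45, 47, 43)
)

# split() returns a named list of data frames, one per population
by_pop <- split(toy, toy$population)

# The formula is applied to each sub-data frame in turn
mean_mass <- map_dbl(by_pop, ~ mean(.x$mass))
mean_mass
```

The names of the result come from the list created by split(), which is also why the graphs list above is named by population.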
Formulas
Formulas = a shorter notation for anonymous functions
With one element
The code:
map(function(x) x + 3)
which contains the anonymous function function(x) x + 3
can be written as:
map(~ . + 3)
This code abbreviation is called a "formula".
Your turn: write the following anonymous function as a formula.
map(function(x) mean(x) + 3)
Answer
map(~ mean(.) + 3)
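The snippets above leave out the input for brevity. A runnable sketch, assuming a made-up list `x` as the input, shows that both notations give the same result:

```r
library(purrr)

# Made-up input: a list of two numeric vectors
x <- list(c(1, 2, 3), c(10, 20))

# Anonymous function and formula notation, applied to the same input
with_function <- map_dbl(x, function(v) mean(v) + 3)
with_formula  <- map_dbl(x, ~ mean(.) + 3)

identical(with_function, with_formula)
```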
With 2 elements
The code:
map2(function(x, y) x + y)
can be shortened to:
map2(~ .x + .y)
Referring to elements
1st element | 2nd element | 3rd element
---|---|---
. | |
.x | .y |
..1 | ..2 | ..3

etc.
Your turn: write the following anonymous function as a formula.
pmap(function(x1, x2, y) lm(y ~ x1 + x2))
Answer
pmap(~ lm(..3 ~ ..1 + ..2))
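As with map(), these snippets omit the inputs. Here is a runnable sketch with made-up vectors `a`, `b`, and `d`, using simple sums instead of lm() to keep the output easy to check. Note that pmap() takes its inputs as a single list.

```r
library(purrr)

# Three made-up vectors of the same length
a <- c(1, 2, 3)
b <- c(10, 20, 30)
d <- c(100, 200, 300)

# Two inputs: refer to them as .x and .y
sum2 <- map2_dbl(a, b, ~ .x + .y)

# Three inputs, passed to pmap as one list: refer to them as ..1, ..2, ..3
sum3 <- pmap_dbl(list(a, b, d), ~ ..1 + ..2 + ..3)

sum2
sum3
```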
map_if/modify_if and map_at/modify_at
We built our data frame with tibble() which, as is the norm in the tidyverse, does not transform strings into factors:
banding <- tibble(
  bird = paste0("bird", 1:50),
  sex = sample(c("F", "M"), 50, replace = TRUE),
  population = sample(LETTERS[1:3], 50, replace = TRUE),
  mass = rnorm(50, 43, 4) %>% round(1),
  tarsus = rnorm(50, 27, 1) %>% round(1),
  wing = rnorm(50, 112, 3) %>% round(0)
) %T>%
  str()
Several base R functions, however, did so by default before R 4.0.0, and still do when called with stringsAsFactors = TRUE.
Let's build the same data with the base R function data.frame() (stringsAsFactors = TRUE is set explicitly because it is no longer the default since R 4.0.0):
banding <- data.frame(
  bird = paste0("bird", 1:50),
  sex = sample(c("F", "M"), 50, replace = TRUE),
  population = sample(LETTERS[1:3], 50, replace = TRUE),
  mass = rnorm(50, 43, 4) %>% round(1),
  tarsus = rnorm(50, 27, 1) %>% round(1),
  wing = rnorm(50, 112, 3) %>% round(0),
  stringsAsFactors = TRUE
) %T>%
  str()
The reason several base R functions transform strings into factors (by default before R 4.0.0) is historical. This used to be essential to save space, but it is no longer relevant and had become somewhat of an annoyance.
If you have such a data frame, you may wish to transform the factors into characters.
How can you do this?
map() has the derivatives map_if() and map_at(), which allow you to apply functions only when a condition is met or only at certain locations. Here, we can use map_if():
banding %>% map_if(is.factor, as.character) %T>% str()
However, map_if and map_at always return lists. If you want the output to be of the same type as the input, use modify_if and modify_at instead.
banding <- data.frame(
  bird = paste0("bird", 1:50),
  sex = sample(c("F", "M"), 50, replace = TRUE),
  population = sample(LETTERS[1:3], 50, replace = TRUE),
  mass = rnorm(50, 43, 4) %>% round(1),
  tarsus = rnorm(50, 27, 1) %>% round(1),
  wing = rnorm(50, 112, 3) %>% round(0),
  stringsAsFactors = TRUE # needed since R 4.0.0 to get factors
)

banding %>%
  modify_if(is.factor, as.character) %>%
  head() %T>%
  str()
This could also be accomplished with mutate_if():
banding %>% mutate_if(is.factor, as.character)
But the map() functions also work with lists and are more flexible than mutate() and its derivatives.
Usage
modify(.x, .f, ...)
modify_if(.x, .p, .f, ...)
modify_at(.x, .at, .f, ...)
.x    a list or atomic vector
.f    a function, formula, or atomic vector
...   additional arguments passed to .f
.p    a predicate function. Only the elements for which .p evaluates to TRUE will be modified
.at   a character vector of names or a numeric vector of positions. Only the elements corresponding to .at will be modified
For every element of .x, apply .f, and return a modified version of .x.
So basically, in its simplest form, we have:
modify(list, function)
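The example earlier in this section selects columns with a predicate. As a complementary sketch, the code below selects them by name with .at, on a made-up data frame `df`, and contrasts modify_at() with map_at().

```r
library(purrr)

# A made-up data frame with one character and two numeric columns
df <- data.frame(
  bird = c("bird1", "bird2"),
  mass = c(41.23, 44.87),
  wing = c(110.4, 113.6)
)

# Round only the columns named in .at; the result is still a data frame
rounded <- modify_at(df, c("mass", "wing"), round)

# map_at() does the same computation but returns a plain list
as_list <- map_at(df, c("mass", "wing"), round)

class(rounded)
class(as_list)
```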
Walk: apply side effects to elements of a list
Now, we want to save the 3 graphs we previously drew into 3 files.
How would you do this?
Spend 5 minutes writing code you would usually use.
To apply side effects to elements of a list, we use the walk family of functions.
Usage
walk(.x, .f, ...)
.x    a list or atomic vector
.f    a function, formula, or atomic vector
...   additional arguments passed to .f
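A minimal sketch of what "side effect" means here, using printing as the side effect and a made-up vector `x`. walk() calls the function for each element and then returns its input invisibly.

```r
library(purrr)

# A made-up named vector
x <- c(a = 1, b = 2, c = 3)

# walk() calls cat() once per element for its side effect (printing)
out <- walk(x, ~ cat("value:", .x, "\n"))

# The input comes back unchanged, so walk() can sit in the middle of a pipe
identical(out, x)
```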
Apply to our example
We already have a list of graphs: graphs. Now, we can create a list of paths where we want to save them:
paths <- paste0("population_", names(graphs), ".png")
So we want to save each element of graphs into an element of paths. The function we will use is ggsave. To apply it to all of our elements, instead of using map, we will use walk because we are not trying to create a new object.
The problem is that we have 2 lists to deal with. map and walk only handle one list, but map2 and walk2 handle 2 lists (pmap and pwalk handle any number of lists).
Here is how walk2 works (it is the same for map2):
walk2(.x, .y, .f, ...)
.x, .y   vectors of the same length. A vector of length 1 will be recycled.
.f       a function, formula, or atomic vector
...      additional arguments passed to .f
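The recycling rule can be illustrated with a short sketch. The `populations` vector is made up, and map2_chr() is used instead of walk2() so that the result can be inspected.

```r
library(purrr)

# Made-up population labels
populations <- c("A", "B", "C")

# .y has length 1, so it is recycled for every element of .x
file_names <- map2_chr(populations, ".png", ~ paste0("population_", .x, .y))
file_names
```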
Give it a try: use walk2 to save the elements of graphs into the elements of paths using ggsave.
Don't hesitate to look up the help file for ggsave with ?ggsave if you don't remember how to use it!
Answer
walk2(paths, graphs, ggsave)
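The order of the two inputs matters: they are matched to the first and second arguments of the function, and the first argument of ggsave is the file name. The sketch below shows the same pattern without ggplot2, writing made-up text to hypothetical files in a temporary directory with the base function writeLines(), whose first argument is the text and second the file.

```r
library(purrr)

# Hypothetical content and file paths, written to a temporary directory
contents <- list(A = "population A", B = "population B")
paths <- file.path(tempdir(), paste0("population_", names(contents), ".txt"))

# writeLines(text, con): contents supply the 1st argument, paths the 2nd
walk2(contents, paths, writeLines)

# Check that one file per element was created
file.exists(paths)
```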
Summary of the map and walk families of functions
We will use a different map (or walk, if we want side effects) function depending on:
- How many lists we are using in the input
number of lists in input | purrr function
---|---
1 | map or walk
2 | map2 or walk2
more | pmap or pwalk
- The class of the output we want
class we want for the output | purrr function
---|---
nothing* | walk
list* | map
double | map_dbl
integer | map_int
character | map_chr
logical | map_lgl
data frame (by row-binding) | map_dfr
data frame (by column-binding) | map_dfc
Results are returned predictably and consistently, which is not the case with sapply().
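As a small sketch of what this means in practice, the code below compares sapply() and map_dbl() on made-up inputs. sapply() chooses its output type from the results it gets, while map_dbl() either returns a double vector or stops with an error.

```r
library(purrr)

# sapply() picks its output type from the data
s1 <- sapply(1:3, function(i) i * 2)        # results of length 1: a numeric vector
s2 <- sapply(1:3, function(i) seq_len(i))   # results of varying length: a list
s3 <- sapply(integer(0), function(i) i * 2) # empty input: an empty list

# map_dbl() returns a double vector...
m1 <- map_dbl(1:3, ~ .x * 2)

# ...or fails when a result is not a single number
m2 <- tryCatch(map_dbl(1:3, ~ seq_len(.x)), error = function(e) "error")

class(s1); class(s2); class(s3)
class(m1); m2
```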
*As Jenny Bryan said nicely:
"walk() can be thought of as map_nothing()
map() can be thought of as map_list()"
- How we want to select the input
selecting input based on | purrr function
---|---
condition | map_if
location | map_at
Conclusion
These are some of the most important purrr functions. But there are many others and I encourage you to explore them by yourself.
Great resources for this are:
- The iteration chapter of Hadley Wickham's book R for Data Science
- The purrr cheatsheet
- The purrr CRAN manual
- The vignettes and help files for the many purrr functions
Have fun!!!