dplyr and tidyr
dplyr introduces a consistent grammar of data manipulation for R.
The significance of this should not be underestimated: it makes the standard data manipulation that all data analysts need to do much easier.
A particularly nice feature is the ability to chain a series of commands in the same way as pipes in Unix systems. Rather than the vertical bar |, the symbol is %>%. The pipe takes input, commonly a dataframe, from the left-hand side and passes it to the function on the right-hand side.
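As a minimal sketch of the pipe, the two forms below are equivalent; the piped version reads left to right rather than inside-out (base R functions used here just for illustration):

```r
library(dplyr)

# Without the pipe: nested calls read inside-out
nested <- head(subset(mtcars, cyl == 4), 3)

# With the pipe: the dataframe flows through each step
piped <- mtcars %>%
  subset(cyl == 4) %>%
  head(3)

identical(nested, piped)
```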
As well as dplyr, Hadley Wickham has introduced another package for reshaping: tidyr. Together with ggplot2, lubridate, stringr and others, they form the ‘tidyverse’.
The basics
dplyr has, as described in the introductory documentation, “five basic data manipulation verbs that work on a single table: filter(), arrange(), select(), mutate() and summarise().”
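As a quick sketch of those verbs in a single pipeline (the column choices are arbitrary, and the unit conversion is only there to illustrate mutate()):

```r
library(dplyr)
data(mtcars)

out <- mtcars %>%
  filter(cyl == 4) %>%              # keep only 4-cylinder rows
  arrange(desc(mpg)) %>%            # sort rows by mpg, descending
  select(mpg, cyl, wt) %>%          # keep only these columns
  mutate(wt_kg = wt * 453.6) %>%    # new variable (wt is in 1000 lb)
  summarise(mean_mpg = mean(mpg),   # collapse to a one-row summary
            n = n())
out
```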
Creating new variables
New variables are created using the mutate() verb: one takes a dataframe, pipes it to mutate() and creates the new variable inside that function call.
library(dplyr)
##
## Attaching package: 'dplyr'
## The following objects are masked from 'package:stats':
##
## filter, lag
## The following objects are masked from 'package:base':
##
## intersect, setdiff, setequal, union
library(stringr)
library(tidyr)
data("mtcars")
mtcars <- mtcars %>%
mutate(cyl_2 = cyl + 1,
mpg_2 = mpg * 2)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb cyl_2 mpg_2
## 1 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 7 42.0
## 2 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 7 42.0
## 3 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 5 45.6
## 4 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 7 42.8
## 5 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 9 37.4
## 6 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 7 36.2
Detecting strings
I have a character column in a dataframe and I wish to determine whether any one of a series of strings can be detected in it. I would normally use str_detect for this, but it doesn’t work without a little trickery.
The paste(something, collapse = "|") is the important bit here.
test_data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
stringsAsFactors = FALSE)
fruit <- c("Apple", "Orange", "Pear")
test_data
## item
## 1 Apple
## 2 Bear
## 3 Orange
## 4 Pear
## 5 Two Apples
test_data <- test_data %>%
mutate(is.fruit = str_detect(item, paste(fruit, collapse = "|")))
test_data
## item is.fruit
## 1 Apple TRUE
## 2 Bear FALSE
## 3 Orange TRUE
## 4 Pear TRUE
## 5 Two Apples TRUE
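Note that str_detect() does partial (regex) matching, which is why "Two Apples" counts as fruit above. If only exact matches are wanted, %in% is simpler; a sketch with the same data:

```r
library(dplyr)

test_data <- data.frame(item = c("Apple", "Bear", "Orange", "Pear", "Two Apples"),
                        stringsAsFactors = FALSE)
fruit <- c("Apple", "Orange", "Pear")

test_data <- test_data %>%
  mutate(is.fruit.exact = item %in% fruit)  # TRUE only for exact matches
test_data
```

With this version "Two Apples" is FALSE, because it is not an exact member of the fruit vector.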
Replacing values with dplyr functions
One can use recode() or case_when().
recode() is a vectorised switch. It’s important to note that the data types must be the same when using recode(), and that a double is not the same type as an integer, which seems obvious when typed. If they are not the same type, then unspecified replacements get NA values.
mtcars %>%
mutate(new_cyl = recode(cyl, `6` = 16, `8` = 200) ) %>%
select(cyl, new_cyl) %>%
tbl_df() %>%
head()
## # A tibble: 6 x 2
## cyl new_cyl
## <dbl> <dbl>
## 1 6 16
## 2 6 16
## 3 4 4
## 4 6 16
## 5 8 200
## 6 6 16
case_when() vectorises multiple if() else() statements, and it’s a lot neater than writing nested ifelse() statements. case_when() can be used inside a mutate() call, or outside. There are times when the column needs to be specified with a .$ in front, but the situations in which this is required have changed as the package has developed.
mtcars %>%
mutate(
new_cyl = case_when(cyl == 4 ~ "Four",
cyl == 6 ~ "Six",
cyl == 8 ~ "Eight"
)) %>%
select(cyl, new_cyl) %>%
head()
## cyl new_cyl
## 1 6 Six
## 2 6 Six
## 3 4 Four
## 4 6 Six
## 5 8 Eight
## 6 6 Six
There’s an important gotcha here though. One has to make sure that each possible outcome is specified, otherwise unspecified levels are replaced with <NA> values. One can work around this as in the second example. I haven’t yet tested what will happen if numeric output is required; I expect that one would need TRUE ~ as.double(.$cyl). For numeric columns, it’s also worth remembering that integer data types are not the same as other numeric data types such as doubles, and so as.numeric() may still return <NA> values.
mtcars %>%
mutate( cyl2 = case_when(
cyl == 4 ~ "four"
)
) %>%
select(cyl, cyl2) %>%
head()
## cyl cyl2
## 1 6 <NA>
## 2 6 <NA>
## 3 4 four
## 4 6 <NA>
## 5 8 <NA>
## 6 6 <NA>
mtcars %>%
mutate( cyl2 = case_when(
cyl == 4 ~ "four",
TRUE ~ as.character(cyl)
)
) %>%
select(cyl, cyl2) %>%
head()
## cyl cyl2
## 1 6 6
## 2 6 6
## 3 4 four
## 4 6 6
## 5 8 8
## 6 6 6
This also works for detecting strings in columns. It’s a bit weird giving the TRUE ~ 0 for the non-matching cases, but TRUE here acts as the catch-all ‘else’: the left-hand sides are evaluated in order, and TRUE matches everything that remains. I tried FALSE ~ 0 and that doesn’t work, because a FALSE left-hand side never matches anything.
data(mtcars)
mtcars$name <- row.names(mtcars)
names_to_detect <- c("Mazda", "Datsun")
mtcars %>%
mutate(has_name = case_when(
str_detect(name, paste(names_to_detect, collapse = "|")) == TRUE ~ 1,
TRUE ~ 0)
)%>%
select(name, has_name) %>%
head()
## name has_name
## 1 Mazda RX4 1
## 2 Mazda RX4 Wag 1
## 3 Datsun 710 1
## 4 Hornet 4 Drive 0
## 5 Hornet Sportabout 0
## 6 Valiant 0
Working on multiple columns at once
mutate_all or summarise_all will affect all variables in the dataframe in the same way. mutate_at will mutate only the specified columns.
library("lubridate")
##
## Attaching package: 'lubridate'
## The following object is masked from 'package:base':
##
## date
dat <- data.frame(x = c(1,2,3),
y_date = c("01/01/2017", "02/01/2017", "03/01/2017"), stringsAsFactors = FALSE)
dat <- dat %>% mutate_at(vars(contains("date")), funs(lubridate::dmy))
dat
## x y_date
## 1 1 2017-01-01
## 2 2 2017-01-02
## 3 3 2017-01-03
summarise_at allows you to specify a set of columns on which to work.
data("mtcars")
mtcars %>% summarise_at(vars(mpg), funs(sum))
## mpg
## 1 642.9
mtcars %>% summarise_at(vars(-mpg, - contains("cyl")), funs(sum))
## disp hp drat wt qsec vs am gear carb
## 1 7383.1 4694 115.09 102.952 571.16 14 13 118 90
mtcars %>% summarise_at(vars(matches("mpg")), funs(sum))
## mpg
## 1 642.9
my_vars <- c("mpg", "cyl")
mtcars %>% summarise_at(vars(my_vars), funs(sum))
## mpg cyl
## 1 642.9 198
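In later dplyr releases (1.0.0 onwards), funs() and the _at/_all variants are superseded by across(). Assuming a recent dplyr, the equivalent of the last example is roughly:

```r
library(dplyr)
data(mtcars)

# across() selects columns with tidyselect syntax and applies the function
res <- mtcars %>% summarise(across(c(mpg, cyl), sum))
res
```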
Subsetting (aka filtering)
Much less verbose than test[test$var1 == "something", ]:
filter(test, var1 == "something")
But getting unique combinations is a little more awkward. group_by() is covered a little later.
test <- data.frame(id = c(1,1,1,2,2),
org = c("apple", "apple", "bear", "orange", "pear"),
test = c("S", "R","S", "R", "S"))
test
## id org test
## 1 1 apple S
## 2 1 apple R
## 3 1 bear S
## 4 2 orange R
## 5 2 pear S
out <- test %>% group_by(id, org) %>%
filter(row_number() == 1)
out
## # A tibble: 4 x 3
## # Groups: id, org [4]
## id org test
## <dbl> <fct> <fct>
## 1 1 apple S
## 2 1 bear S
## 3 2 orange R
## 4 2 pear S
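dplyr’s distinct() gives the same result more directly; with .keep_all = TRUE it keeps the first row of each combination. A sketch with the same data:

```r
library(dplyr)

test <- data.frame(id = c(1, 1, 1, 2, 2),
                   org = c("apple", "apple", "bear", "orange", "pear"),
                   test = c("S", "R", "S", "R", "S"))

# One row per unique (id, org) combination, keeping the other columns
out2 <- test %>% distinct(id, org, .keep_all = TRUE)
out2
```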
Summarising data
Uses the summarise() function. n() is a function introduced in dplyr and produces a count of the number of rows in the data set.
library(binom)
test <- data.frame(group = c(rep("A", 10), rep("B", 10)),
outcome = c(1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0))
head(test)
## group outcome
## 1 A 1
## 2 A 0
## 3 A 0
## 4 A 0
## 5 A 0
## 6 A 0
table(test$group, test$outcome, dnn = c("Group", "Outcome"))
## Outcome
## Group 0 1
## A 9 1
## B 8 2
out <-test %>% group_by(group) %>%
summarise(n = n(), successes = sum(outcome)) %>%
mutate(pc.succ = round(binom.confint(successes, n,
methods = "exact")[[4]]*100, 2),
pc.lci = round(binom.confint(successes, n,
methods = "exact")[[5]]*100, 2),
pc.uci = round(binom.confint(successes, n,
methods = "exact")[[6]]*100, 2)
)
out
## # A tibble: 2 x 6
## group n successes pc.succ pc.lci pc.uci
## <fct> <int> <dbl> <dbl> <dbl> <dbl>
## 1 A 10 1 10 0.25 44.5
## 2 B 10 2 20 2.52 55.6
Outside dplyr I would write binom.confint(x, n, methods = "exact")$mean. This doesn’t work in dplyr and returns the error "Error in binom.confint(c(6928L, 5704L, 4746L, 5166L, 6779L, 6275L, 4832L, : invalid subscript type 'closure'". However, indexing the columns does work. Also, wrapping the round() and binom.confint() calls in a function would improve the readability of the process.
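As a sketch of such a helper, here using base R’s stats::binom.test() (which also gives a Clopper-Pearson ‘exact’ interval) rather than the binom package; pc_ci is a hypothetical name:

```r
# Hypothetical helper: percentage estimate and exact 95% CI, rounded.
# Uses base binom.test() instead of binom::binom.confint().
pc_ci <- function(successes, n, digits = 2) {
  ci <- stats::binom.test(successes, n)$conf.int
  c(pc     = round(100 * successes / n, digits),
    pc.lci = round(100 * ci[1], digits),
    pc.uci = round(100 * ci[2], digits))
}

pc_ci(1, 10)   # compare with the group A row above
```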
Passing arguments to summarise_all
For example, you might want to specify removal of NA values in a sum or mean function. You can do this by:
data(mtcars)
summarise_all(mtcars, funs(sum(., na.rm = TRUE)))
Grouped operations
Grouping performs operations by levels of the grouping variable or variables. For example, I might want to get the mean mpg for each value of cylinder in mtcars:
mtcars %>% group_by(cyl) %>%
mutate(grouped_mpg = mean(mpg)) %>%
select(mpg, cyl, grouped_mpg) %>%
head()
## # A tibble: 6 x 3
## # Groups: cyl [3]
## mpg cyl grouped_mpg
## <dbl> <dbl> <dbl>
## 1 21 6 19.7
## 2 21 6 19.7
## 3 22.8 4 26.7
## 4 21.4 6 19.7
## 5 18.7 8 15.1
## 6 18.1 6 19.7
Adding a group id
Sometimes one might want to create an integer group id, i.e. all rows with the same groupings will have the same id. group_indices() goes some of the way to resolving this by creating a vector of integers corresponding to the index numbers of the groups. Unfortunately, one can’t use group_indices() in mutate(); it returns an error. Therefore, one has to sort the dataframe by the grouping variables before adding in the id. It’s important to note that the data must be sorted prior to the creation of i, otherwise mutate(group_id = i) will be in the wrong sort order.
data(mtcars)
mtcars <- mtcars %>% group_by(cyl, am) %>% arrange(cyl, am)
i <- mtcars %>% group_indices()
mtcars <- mtcars %>% ungroup() %>% mutate(group_id = i)
mtcars %>% select(cyl, am, group_id)
## # A tibble: 32 x 3
## cyl am group_id
## <dbl> <dbl> <int>
## 1 4 0 1
## 2 4 0 1
## 3 4 0 1
## 4 4 1 2
## 5 4 1 2
## 6 4 1 2
## 7 4 1 2
## 8 4 1 2
## 9 4 1 2
## 10 4 1 2
## # ... with 22 more rows
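If you have dplyr 1.0.0 or later, cur_group_id() can be used directly inside mutate(), which avoids the sort-then-assign dance above. A sketch:

```r
library(dplyr)
data(mtcars)

out <- mtcars %>%
  group_by(cyl, am) %>%
  mutate(group_id = cur_group_id()) %>%  # one integer id per group
  ungroup()

n_distinct(out$group_id)  # six cyl/am combinations in mtcars
```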
Dropping and renaming variables
Variables can be renamed with rename(). The syntax is rename(data, new.var = old.var). Note that there are no quotation marks. e.g.
rename(iris, petal_length = Petal.Length) %>% head()
## Sepal.Length Sepal.Width petal_length Petal.Width Species
## 1 5.1 3.5 1.4 0.2 setosa
## 2 4.9 3.0 1.4 0.2 setosa
## 3 4.7 3.2 1.3 0.2 setosa
## 4 4.6 3.1 1.5 0.2 setosa
## 5 5.0 3.6 1.4 0.2 setosa
## 6 5.4 3.9 1.7 0.4 setosa
Variables can be dropped using select(data, -var_name). Or, if quicker, a few can be retained from a long list of variables, e.g. select(data, var1, var2).
select(iris, -Sepal.Length) %>% head
## Sepal.Width Petal.Length Petal.Width Species
## 1 3.5 1.4 0.2 setosa
## 2 3.0 1.4 0.2 setosa
## 3 3.2 1.3 0.2 setosa
## 4 3.1 1.5 0.2 setosa
## 5 3.6 1.4 0.2 setosa
## 6 3.9 1.7 0.4 setosa
select(iris, Sepal.Length, Species) %>% head
## Sepal.Length Species
## 1 5.1 setosa
## 2 4.9 setosa
## 3 4.7 setosa
## 4 4.6 setosa
## 5 5.0 setosa
## 6 5.4 setosa
It’s also possible to select columns based on text that the column name contains:
pps <- pps %>% select(p.full.id, contains("_atc"))
thanks to: http://stackoverflow.com/questions/25923392/r-dplyr-select-columns-based-on-string
Lagged values and windows
# test windowed functions
test <- data.frame(cbind(time = c(1:10), value = rep(1, 10)),
stringsAsFactors = FALSE)
head(test)
## time value
## 1 1 1
## 2 2 1
## 3 3 1
## 4 4 1
## 5 5 1
## 6 6 1
test <- test %>%
mutate( lag1 = lag(value),
lag2 = lag(value, 2),
lag3 = lag(value, 3),
lag4 = lag(value, 4)
) %>% # next line needs to be outside first mutate call otherwise it won't work.
mutate(rolling_4 = rowSums(.[3:6], na.rm = TRUE))
test
## time value lag1 lag2 lag3 lag4 rolling_4
## 1 1 1 NA NA NA NA 0
## 2 2 1 1 NA NA NA 1
## 3 3 1 1 1 NA NA 2
## 4 4 1 1 1 1 NA 3
## 5 5 1 1 1 1 1 4
## 6 6 1 1 1 1 1 4
## 7 7 1 1 1 1 1 4
## 8 8 1 1 1 1 1 4
## 9 9 1 1 1 1 1 4
## 10 10 1 1 1 1 1 4
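An alternative sketch for the trailing sum uses base R’s stats::filter(). Note two differences from the lag-columns approach above: the window here includes the current value, and incomplete windows return NA rather than a partial sum:

```r
value <- rep(1, 10)

# Trailing window of width 4: x[t] + x[t-1] + x[t-2] + x[t-3]
rolling <- stats::filter(value, rep(1, 4), sides = 1)
rolling
```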
Wrapping dplyr in a function
Programming with dplyr has changed a lot with the release of dplyr 0.7.
Previously there was a convoluted process using lazy eval and the interp function.
Now, dplyr incorporates tidyeval and is much simpler.
The vignette on programming with dplyr provides a much longer explanation of how it works.
A simple tidyeval example
Say I want a function to sort a dataframe by a named column. I know that dplyr will already do this, but this is simply to illustrate how evaluation works when writing functions with dplyr.
data(mtcars)
select_a_column <- function(dat, column){
col_to_use <- enquo(column)
dat <- select(dat, !!col_to_use)
return(dat)
}
select_a_column(mtcars, mpg) %>% head()
## mpg
## Mazda RX4 21.0
## Mazda RX4 Wag 21.0
## Datsun 710 22.8
## Hornet 4 Drive 21.4
## Hornet Sportabout 18.7
## Valiant 18.1
Two steps here: enquo() to capture the variable of interest and !! to use it. When multiple variables are supplied, each one has to be quoted and unquoted in the same way, otherwise you get all sorts of errors.
# require(lazyeval) # I thought dplyr loaded this by default, but apparently not.
tabFun <- function(dat, y, z) {
grouping_var <- enquo(y)
summing_var <- enquo(z)
a <- dat %>% group_by(!!grouping_var) %>%
summarise(n = length(!!summing_var),
sum_z = sum(!!summing_var)) %>%
mutate(pc = round((sum_z / n) * 100, 1)
)
return(a)
}
tabFun(mtcars, cyl, am)
## # A tibble: 3 x 4
## cyl n sum_z pc
## <dbl> <int> <dbl> <dbl>
## 1 4 11 8 72.7
## 2 6 7 3 42.9
## 3 8 14 2 14.3
Phew.
Oh, and incidentally, trying to simplify this with n = n() returns the error Error in n() : This function should not be called directly. I don’t know why.
Mutate in a function
temp_dat <- data.frame(dob = as.Date("01/01/2017", format = "%d/%m/%Y"),
sampledate = as.Date("15/06/2017", format = "%d/%m/%Y"))
age_fun <- function(dat, dob_var, sampledate_var) {
var1 <- enquo(dob_var)
var2 <- enquo(sampledate_var)
out <- dat %>%
mutate(
age = as.numeric( (!!var2) - (!!var1) ) / 365.25
)
return(out)
}
# doesn't work: without enquo()/!!, the bare column names are evaluated outside the data and are not found
age_fun2 <- function(dat, dob_var, sampledate_var) {
# var1 <- enquo(dob_var)
# var2 <- enquo(sampledate_var)
out <- dat %>%
mutate(
age = as.numeric( dob_var - sampledate_var ) / 365.25
)
return(out)
}
temp_dat <- age_fun(dat = temp_dat, dob_var = dob, sampledate_var = sampledate)
temp_dat <- age_fun2(dat = temp_dat, dob_var = dob, sampledate_var = sampledate)
temp_dat
Joining
Very similar to plyr, but the function name specifies the type of join, rather than it being specified within the function.
data3 <- left_join(data1, data2, by = "common.var")
See [SQL Venn](https://blog.codinghorror.com/a-visual-explanation-of-sql-joins/) for reminders.
Usefully, one can join by more than one variable. This is useful when a single variable does not provide a unique identifier.
x <- data.frame(
var1 = c(1, 1, 2),
var2 = c("a", "b", "b"),
var3 = c("x", "y", "z")
)
y <- data.frame(var1 = c(1, 1, 2),
var2 = c("a", "b", "b"),
var4 = c(20, 21, 22)
)
x
## var1 var2 var3
## 1 1 a x
## 2 1 b y
## 3 2 b z
z <- left_join(x, y, by = c("var1", "var2"))
z
## var1 var2 var3 var4
## 1 1 a x 20
## 2 1 b y 21
## 3 2 b z 22
anti_join
Anti join retains those records which are in the left table but not the right, and keeps the columns of the left table. This is probably more useful than one might at first think.
require(dplyr)
# want those in 2, that do not appear in 1
anti1 <- data.frame(id = c(1, 1, 1, 2, 3),
abx = c("pen", "amx", "flu", "pen", "amx"))
anti2 <- data.frame( id = c(1, 1, 1, 2, 4),
abx = c("pen", "amx", "flu", "flu", "pen"),
res = c("r", "i", "s", "i", "s")
)
anti1
## id abx
## 1 1 pen
## 2 1 amx
## 3 1 flu
## 4 2 pen
## 5 3 amx
anti2
## id abx res
## 1 1 pen r
## 2 1 amx i
## 3 1 flu s
## 4 2 flu i
## 5 4 pen s
anti3 <- anti_join(anti2, anti1, by = "id")
anti3
## id abx res
## 1 4 pen s
anti3 <- anti_join(anti2, anti1, by = c("id", "abx"))
anti3
## id abx res
## 1 2 flu i
## 2 4 pen s
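semi_join() is the complement: it keeps the rows of the left table that do have a match in the right, again keeping only the left table’s columns. A sketch with the same data:

```r
library(dplyr)

anti1 <- data.frame(id = c(1, 1, 1, 2, 3),
                    abx = c("pen", "amx", "flu", "pen", "amx"))
anti2 <- data.frame(id = c(1, 1, 1, 2, 4),
                    abx = c("pen", "amx", "flu", "flu", "pen"),
                    res = c("r", "i", "s", "i", "s"))

# Rows of anti2 whose id also appears in anti1
semi3 <- semi_join(anti2, anti1, by = "id")
semi3
```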
Hypothesis tests and dplyr
Technically possible; see broom and dplyr. I haven’t (yet) managed this for a chisq.test(), but:
test <- as.data.frame(structure(
list(organism.species.name = c("ec", "ec", "kp"),
abx = c("ceph", "carb", "ceph"),
n_sus_start = c(5L, 5L, 10L),
n_res_start = c(10L, 5L, 5L),
n_sus_end = c(10L, 5L, 5L),
n_res_end = c(5L, 5L, 10L)),
.Names = c("organism.species.name", "abx", "n_sus_start", "n_res_start",
"n_sus_end", "n_res_end"), class = "data.frame", row.names = c(NA,-3L)
))
test %>% group_by(organism.species.name, abx) %>%
mutate(p.val = chisq.test(matrix(
c(n_sus_start, n_res_start, n_sus_end, n_res_end), nrow = 2))$p.value)
## # A tibble: 3 x 7
## # Groups: organism.species.name, abx [3]
## organism.specie~ abx n_sus_start n_res_start n_sus_end n_res_end p.val
## <chr> <chr> <int> <int> <int> <int> <dbl>
## 1 ec ceph 5 10 10 5 0.144
## 2 ec carb 5 5 5 5 1
## 3 kp ceph 10 5 5 10 0.144
tidyr
wide to long
stocks <-
data.frame(
time = as.Date('2009-01-01') + 0:9,
X = rnorm(10, 0, 1),
Y = rnorm(10, 0, 2),
Z = rnorm(10, 0, 4)
)
stocks
## time X Y Z
## 1 2009-01-01 -0.04585453 -1.4495466 0.2587919
## 2 2009-01-02 0.22757262 0.4319155 -1.5743721
## 3 2009-01-03 -0.39376239 -0.8340409 -2.8023892
## 4 2009-01-04 -2.00358729 5.1717332 -4.2780069
## 5 2009-01-05 2.62465922 -3.0533851 -1.0727224
## 6 2009-01-06 -0.96244722 -0.3959880 -3.5751754
## 7 2009-01-07 0.15607861 4.5448018 -3.0587058
## 8 2009-01-08 1.17352798 -0.6613273 1.0263555
## 9 2009-01-09 -0.96637365 1.0986367 3.1425586
## 10 2009-01-10 0.90033313 3.5056913 5.4665945
gather(data = stocks, key = stock, value = price, -time)
## time stock price
## 1 2009-01-01 X -0.04585453
## 2 2009-01-02 X 0.22757262
## 3 2009-01-03 X -0.39376239
## 4 2009-01-04 X -2.00358729
## 5 2009-01-05 X 2.62465922
## 6 2009-01-06 X -0.96244722
## 7 2009-01-07 X 0.15607861
## 8 2009-01-08 X 1.17352798
## 9 2009-01-09 X -0.96637365
## 10 2009-01-10 X 0.90033313
## 11 2009-01-01 Y -1.44954657
## 12 2009-01-02 Y 0.43191551
## 13 2009-01-03 Y -0.83404093
## 14 2009-01-04 Y 5.17173319
## 15 2009-01-05 Y -3.05338512
## 16 2009-01-06 Y -0.39598803
## 17 2009-01-07 Y 4.54480178
## 18 2009-01-08 Y -0.66132732
## 19 2009-01-09 Y 1.09863666
## 20 2009-01-10 Y 3.50569129
## 21 2009-01-01 Z 0.25879187
## 22 2009-01-02 Z -1.57437215
## 23 2009-01-03 Z -2.80238921
## 24 2009-01-04 Z -4.27800687
## 25 2009-01-05 Z -1.07272236
## 26 2009-01-06 Z -3.57517542
## 27 2009-01-07 Z -3.05870585
## 28 2009-01-08 Z 1.02635549
## 29 2009-01-09 Z 3.14255858
## 30 2009-01-10 Z 5.46659454
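In tidyr 1.0.0 and later, gather() is superseded by pivot_longer(). Assuming a recent tidyr, the equivalent of the above is roughly:

```r
library(tidyr)

stocks <- data.frame(time = as.Date("2009-01-01") + 0:9,
                     X = rnorm(10, 0, 1),
                     Y = rnorm(10, 0, 2),
                     Z = rnorm(10, 0, 4))

# cols selects the columns to lengthen; names_to/values_to name the outputs
long <- pivot_longer(stocks, cols = -time,
                     names_to = "stock", values_to = "price")
head(long)
```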
long to wide
spread() takes long data and makes it wide. One specifies the key column, which contains the cells that will become column headers, and then the value column, which contains the data that will go in the cells beneath the new columns.
data(infert)
long <- infert %>%
mutate(id = seq_along(education)) %>% # create integer id
gather(key = variable, value = val,-id)
## Warning: attributes are not identical across measure variables;
## they will be dropped
head(long)
## id variable val
## 1 1 education 0-5yrs
## 2 2 education 0-5yrs
## 3 3 education 0-5yrs
## 4 4 education 0-5yrs
## 5 5 education 6-11yrs
## 6 6 education 6-11yrs
long %>% spread(key = variable, value = val) %>%
head()
## id age case education induced parity pooled.stratum spontaneous stratum
## 1 1 26 1 0-5yrs 1 6 3 2 1
## 2 2 42 1 0-5yrs 1 1 1 0 2
## 3 3 39 1 0-5yrs 2 6 4 0 3
## 4 4 34 1 0-5yrs 2 4 2 0 4
## 5 5 35 1 6-11yrs 1 3 32 1 5
## 6 6 36 1 6-11yrs 2 4 36 1 6
For a more complex approach see this SO post.
See also unite() in tidyr. If spread() tells you "Error: All columns must be named", then you have NA values in your key column.
dat <- data_frame(Person = rep(c("greg", "sally", "sue"), each = 2),
Time = rep(c("Pre", "Post"), 3),
Score1 = round(rnorm(6, mean = 80, sd = 4), 0),
Score2 = round(jitter(Score1, 15), 0),
Score3 = 5 + (Score1 + Score2) / 2)
head(dat)
## # A tibble: 6 x 5
## Person Time Score1 Score2 Score3
## <chr> <chr> <dbl> <dbl> <dbl>
## 1 greg Pre 85 83 89
## 2 greg Post 91 88 94.5
## 3 sally Pre 84 84 89
## 4 sally Post 81 84 87.5
## 5 sue Pre 82 83 87.5
## 6 sue Post 80 79 84.5
dat %>%
gather(temp, score, starts_with("Score")) %>%
unite(temp1, Time, temp, sep = "_") %>%
spread(temp1, score)
## # A tibble: 3 x 7
## Person Post_Score1 Post_Score2 Post_Score3 Pre_Score1 Pre_Score2
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 greg 91 88 94.5 85 83
## 2 sally 81 84 87.5 84 84
## 3 sue 80 79 84.5 82 83
## # ... with 1 more variable: Pre_Score3 <dbl>
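Similarly, spread() is superseded by pivot_wider() in tidyr 1.0.0+. A small sketch on made-up data:

```r
library(tidyr)

long <- data.frame(id = c(1, 1, 2, 2),
                   variable = c("age", "parity", "age", "parity"),
                   val = c(26, 6, 42, 1))

# names_from supplies the new column headers, values_from the cell contents
wide <- pivot_wider(long, names_from = variable, values_from = val)
wide
```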
Expanding data to cover all permutations
There are a number of options for this.
A combination of spreading to wide, then gathering to long is one approach.
However, tidyr also provides complete and expand. I quite often need to produce summary counts by an organisation, and if one organisation hasn’t reported any cases for a period then there will be a single NA row for that organisation, rather than multiple rows of zeros. By the description in the manual, complete is
“a wrapper around expand(), dplyr::left_join() and replace_na() that’s useful for completing missing combinations of data”
Given the data below, I want to have all the rows for St James Infirmary that I also have for St Elsewhere.
dat <- structure(list(org_code = c("1A", "1A", "1A", "1A", "1A", "1A",
"1A", "1A", "1A", "1A", "1A", "1A", "2F"),
status = c("FT", "FT", "FT", "FT", "FT", "FT", "FT",
"FT", "FT", "FT", "FT", "FT", "--"),
org_name = c("St Elsewhere", "St Elsewhere", "St Elsewhere",
"St Elsewhere", "St Elsewhere", "St Elsewhere",
"St Elsewhere", "St Elsewhere", "St Elsewhere",
"St Elsewhere", "St Elsewhere", "St Elsewhere",
"St James Infirmary"),
year = c(2016L, 2016L, 2016L, 2016L, 2016L, 2016L, 2017L,
2017L, 2017L, 2017L, 2017L, 2017L, NA),
month = c(11L, 11L, 11L, 12L, 12L, 12L, 1L, 1L, 1L, 2L,
2L, 2L, NA),
measure = c("measure_a", "measure_b", "measure_c",
"measure_a", "measure_b", "measure_c",
"measure_a", "measure_b", "measure_c",
"measure_a", "measure_b", "measure_c", NA),
count = c(1L, 5L, 7L, 5L, 9L, 3L, 7L, 4L, 2L, 9L, 5L, 7L,
NA)), .Names = c("org_code", "status",
"org_name", "year", "month",
"measure", "count"),
class = "data.frame", row.names = c(NA, -13L))
dat
## org_code status org_name year month measure count
## 1 1A FT St Elsewhere 2016 11 measure_a 1
## 2 1A FT St Elsewhere 2016 11 measure_b 5
## 3 1A FT St Elsewhere 2016 11 measure_c 7
## 4 1A FT St Elsewhere 2016 12 measure_a 5
## 5 1A FT St Elsewhere 2016 12 measure_b 9
## 6 1A FT St Elsewhere 2016 12 measure_c 3
## 7 1A FT St Elsewhere 2017 1 measure_a 7
## 8 1A FT St Elsewhere 2017 1 measure_b 4
## 9 1A FT St Elsewhere 2017 1 measure_c 2
## 10 1A FT St Elsewhere 2017 2 measure_a 9
## 11 1A FT St Elsewhere 2017 2 measure_b 5
## 12 1A FT St Elsewhere 2017 2 measure_c 7
## 13 2F -- St James Infirmary NA NA <NA> NA
Using complete I get what I need. I also need to use unite and separate, otherwise I get rows for 2016 month 1 and 2016 month 2.
dat2 <- dat %>%
unite(year_month, year, month) %>%
complete(year_month, measure,
nesting(org_code, status, org_name),
fill = list(count = 0)) %>%
separate(year_month, into = c("year", "month")) %>%
# don't strictly need these lines, but they make the result more logical based on original data
select(org_code, status, org_name, year, month, measure) %>% # re-order cols
arrange(org_code, year, month, measure) # sort data
tail(dat2, n = 12)
## # A tibble: 12 x 6
## org_code status org_name year month measure
## <chr> <chr> <chr> <chr> <chr> <chr>
## 1 2F -- St James Infirmary 2017 1 measure_a
## 2 2F -- St James Infirmary 2017 1 measure_b
## 3 2F -- St James Infirmary 2017 1 measure_c
## 4 2F -- St James Infirmary 2017 1 <NA>
## 5 2F -- St James Infirmary 2017 2 measure_a
## 6 2F -- St James Infirmary 2017 2 measure_b
## 7 2F -- St James Infirmary 2017 2 measure_c
## 8 2F -- St James Infirmary 2017 2 <NA>
## 9 2F -- St James Infirmary NA NA measure_a
## 10 2F -- St James Infirmary NA NA measure_b
## 11 2F -- St James Infirmary NA NA measure_c
## 12 2F -- St James Infirmary NA NA <NA>
The manual adds a helpful explanation of the use of nesting:
“To find all unique combinations of x, y and z, including those not found in the data, supply each variable as a separate argument. To find only the combinations that occur in the data, use nest: expand(df, nesting(x, y, z)).”
crossing() is the tidy equivalent of expand.grid() and doesn’t convert strings to factors.
crossing(year = c(2016, 2017, 2018),
month = c("January", "February", "March"))
## # A tibble: 9 x 2
## year month
## <dbl> <chr>
## 1 2016 February
## 2 2016 January
## 3 2016 March
## 4 2017 February
## 5 2017 January
## 6 2017 March
## 7 2018 February
## 8 2018 January
## 9 2018 March
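For comparison, expand.grid() still defaults to stringsAsFactors = TRUE, which is the behaviour crossing() avoids. A sketch:

```r
library(tidyr)

eg <- expand.grid(year = c(2016, 2017),
                  month = c("January", "February"))
cr <- crossing(year = c(2016, 2017),
               month = c("January", "February"))

is.factor(eg$month)     # expand.grid converts strings to factors
is.character(cr$month)  # crossing keeps character columns as character
```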