Before getting started

Please open your Rstudio and install the following packages:

install.packages("dplyr")
install.packages("nycflights13")

Unix pipes

Pipes are used for combining existing commands.

wc -l *.txt

%      4 a.txt
       4 b.txt
       4 c.txt
      12 total

Recall that we can use the > operator to redirect the output to a file:

wc -l *.txt > summary.txt

sort -n is frequently used to sort the file. Notice its difference with sort.

the -n option to specify that the sort is numerical instead of alphanumerical.

We can use the following command to sort summary.txt in order:

sort -n summary.txt > summary.txt

We can make it easier by running wc and sort together

wc -l *.txt | sort -n > summary.txt

The vertical bar, |, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.

We can also add another layer of pipe to get the first to lines of the sorted line counts:

wc -l *.txt | sort -n | head -1

The redirection and pipes used in the last few commands are illustrated below:

dplyr

is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. There are five common commands:

  • mutate() adds new variables that are functions of existing variables
  • select() picks variables based on their names
  • filter() picks cases based on their values
  • summarise() reduces multiple values down to a single summary
  • arrange() changes the ordering of the rows

Prerequisites

library(nycflights13)
library(tidyverse)

nycflights13

We use the dataset flights, which contains all 336,776 flights that departed from New York City in 2013.

flights
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

<dbl> suggests the corresponding column is of class double, i.e., a double-precision floating point number.

tibble

We see that the variable type is of tibble, which is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.

We can create a tibble from an existing object with as_tibble():

as_tibble(iris)
#> # A tibble: 150 x 5
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#>           <dbl>       <dbl>        <dbl>       <dbl> <fct>  
#>  1          5.1         3.5          1.4         0.2 setosa 
#>  2          4.9         3            1.4         0.2 setosa 
#>  3          4.7         3.2          1.3         0.2 setosa 
#>  4          4.6         3.1          1.5         0.2 setosa 
#>  5          5           3.6          1.4         0.2 setosa 
#>  6          5.4         3.9          1.7         0.4 setosa 
#>  7          4.6         3.4          1.4         0.3 setosa 
#>  8          5           3.4          1.5         0.2 setosa 
#>  9          4.4         2.9          1.4         0.2 setosa 
#> 10          4.9         3.1          1.5         0.1 setosa 
#> # … with 140 more rows

Filter rows with filter():

filter() allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:

filter(flights, month == 1, day == 1)
#> # A tibble: 842 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#> #   flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> #   distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

It’s much more efficient that subsetting in data.frame.

dplyr functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-:

filter(flights, month == 1, day == 1)

The following code finds all flights that departed in November or December:

filter(flights, month == 11 | month == 12)

A useful short-hand for this problem is x %in% y. This will select every row where x is one of the values in y. We could use it to rewrite the code above:

filter(flights, month %in% c(11, 12))

Exercises

Find all flights that

  • Had an arrival delay of two or more hours
  • Arrived more than two hours late, but didn’t leave late

How many flights have a missing dep_time?

Arrange rows with arrange():

arrange() works similarly to filter() except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:

arrange(flights, year, month, day)
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     1      517            515         2      830            819
#> 2  2013     1     1      533            529         4      850            830
#> 3  2013     1     1      542            540         2      923            850
#> 4  2013     1     1      544            545        -1     1004           1022
#> 5  2013     1     1      554            600        -6      812            837
#> 6  2013     1     1      554            558        -4      740            728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Use desc() to re-order by a column in descending order:

arrange(flights, desc(dep_delay))
#> # A tibble: 336,776 x 19
#>    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#>   <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
#> 1  2013     1     9      641            900      1301     1242           1530
#> 2  2013     6    15     1432           1935      1137     1607           2120
#> 3  2013     1    10     1121           1635      1126     1239           1810
#> 4  2013     9    20     1139           1845      1014     1457           2210
#> 5  2013     7    22      845           1600      1005     1044           1815
#> 6  2013     4    10     1100           1900       960     1342           2211
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> #   carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> #   air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>

Missing values are always sorted at the end.

Exercises

Sort flights to find the fastest (highest speed) flights. (Consider columns air_time and distance)

Select columns with select()

allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.

# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#>    year month   day
#>   <int> <int> <int>
#> 1  2013     1     1
#> 2  2013     1     1
#> 3  2013     1     1
#> 4  2013     1     1
#> 5  2013     1     1
#> 6  2013     1     1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#>   dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#>      <int>          <int>     <dbl>    <int>          <int>     <dbl> <chr>  
#> 1      517            515         2      830            819        11 UA     
#> 2      533            529         4      850            830        20 UA     
#> 3      542            540         2      923            850        33 AA     
#> 4      544            545        -1     1004           1022       -18 B6     
#> 5      554            600        -6      812            837       -25 DL     
#> 6      554            558        -4      740            728        12 UA     
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> #   origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> #   minute <dbl>, time_hour <dttm>

There are a number of helper functions you can use within select():

  • starts_with("abc"): matches names that begin with “abc”.
  • ends_with("xyz"): matches names that end with “xyz”.
  • contains("ijk"): matches names that contain “ijk”.
  • matches("") allows you to use regular expressions.

Exercise

Brainstorm as many ways as possible to select dep_time and dep_delay from flights.

Add new variables with mutate()

can be used to add new columns.

mutate() always adds new columns at the end of your dataset.

We first create a narrower dataset from flights using select()

flights_sml <- select(flights, 
  year:day, 
  ends_with("delay"), 
  distance, 
  air_time
)

mutate(flights_sml,
  gain = dep_delay - arr_delay,
  speed = distance / air_time * 60
)
#> # A tibble: 336,776 x 9
#>    year month   day dep_delay arr_delay distance air_time  gain speed
#>   <int> <int> <int>     <dbl>     <dbl>    <dbl>    <dbl> <dbl> <dbl>
#> 1  2013     1     1         2        11     1400      227    -9  370.
#> 2  2013     1     1         4        20     1416      227   -16  374.
#> 3  2013     1     1         2        33     1089      160   -31  408.
#> 4  2013     1     1        -1       -18     1576      183    17  517.
#> 5  2013     1     1        -6       -25      762      116    19  394.
#> 6  2013     1     1        -4        12      719      150   -16  288.
#> # … with 336,770 more rows

If you only want to keep the new variables, use transmute():

transmute(flights,
  gain = dep_delay - arr_delay,
  hours = air_time / 60,
)
# A tibble: 336,776 x 2
#>    gain hours
#>   <dbl> <dbl>
#> 1    -9 3.78 
#> 2   -16 3.78 
#> 3   -31 2.67 
#> 4    17 3.05 
#> 5    19 1.93 
#> 6   -16 2.5 
#> # … with 336,770 more rows 

Exercise

Consider the iris dataset. How to replace the column Species by its one-hot-encoded version? (Replace Species by X1 and X2 where X1 = 1 if Species == "setosa" and X2 = 1 if Species == "versicolor").

Questions?

Grouped summaries with summarise()

which collapses a data frame to a single row:

summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 x 1
#>   delay
#>   <dbl>
#> 1  12.6

summarise() is not terribly useful unless we pair it with group_by(). This changes the unit of analysis from the complete dataset to individual groups.

by_day <- group_by(flights, year, month, day)
head(by_day)
# A tibble: 6 x 19
# Groups:   year, month, day [1]
#   year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#  <int> <int> <int>    <int>          <int>     <dbl>    <int>          <int>
# 1  2013     1     1      517            515         2      830            819
# 2  2013     1     1      533            529         4      850            830
# 3  2013     1     1      542            540         2      923            850
# 4  2013     1     1      544            545        -1     1004           1022
# 5  2013     1     1      554            600        -6      812            837
# 6  2013     1     1      554            558        -4      740            728
# … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
#   tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
#   hour <dbl>, minute <dbl>, time_hour <dttm>

The group_by() command will generate a new class called grouped_df.

summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 x 4
#> # Groups:   year, month [12]
#>    year month   day delay
#>   <int> <int> <int> <dbl>
#> 1  2013     1     1 11.5 
#> 2  2013     1     2 13.9 
#> 3  2013     1     3 11.0 
#> 4  2013     1     4  8.95
#> 5  2013     1     5  5.73
#> 6  2013     1     6  7.15
#> # … with 359 more rows

Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.

Combining multiple operations with the pipe

Using what you know about dplyr, you might write code like this:

by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
  count = n(),
  dist = mean(distance, na.rm = TRUE),
  delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "ABQ")

There’s another way to tackle the same problem with the pipe, %>%:

delays <- flights %>% 
  group_by(dest) %>% 
  summarise(
    count = n(),
    dist = mean(distance, na.rm = TRUE),
    delay = mean(arr_delay, na.rm = TRUE)
  ) %>% 
  filter(count > 20, dest != "ABQ")

Behind the scenes, x %>% f(y) turns into f(x, y), and x %>% f(y) %>% g(z) turns into g(f(x, y), z) and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code,

Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.

Exercise

For each flight, count the number which has arr_delay greater than 1 hour, and their average delay.

Cheatsheet of dplyr

https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf

Mock interview questions

  • What variance inflation factor (VIF) in linear regression? What can we infer if the VIF for one predictor is larger than 5?