Discussion 2
Before getting started
Please open your Rstudio and install the following packages:
install.packages("dplyr")
install.packages("nycflights13")
Unix pipes
Pipes are used for combining existing commands.
wc -l *.txt
% 4 a.txt
4 b.txt
4 c.txt
12 total
Recall that we can use the >
operator to redirect the output to a file:
wc -l *.txt > summary.txt
sort -n
is frequently used to sort the file. Notice its difference with sort
.
the -n
option to specify that the sort is numerical instead of alphanumerical.
We can use the following command to sort summary.txt
in order:
sort -n summary.txt > summary.txt
We can make it easier by running wc
and sort
together
wc -l *.txt | sort -n > summary.txt
The vertical bar, |
, between the two commands is called a pipe. It tells the shell that we want to use the output of the command on the left as the input to the command on the right.
We can also add another layer of pipe to get the first to lines of the sorted line counts:
wc -l *.txt | sort -n | head -1
The redirection and pipes used in the last few commands are illustrated below:
dplyr
is a grammar of data manipulation, providing a consistent set of verbs that help you solve the most common data manipulation challenges. There are five common commands:
mutate()
adds new variables that are functions of existing variablesselect()
picks variables based on their namesfilter()
picks cases based on their valuessummarise()
reduces multiple values down to a single summaryarrange()
changes the ordering of the rows
Prerequisites
library(nycflights13)
library(tidyverse)
nycflights13
We use the dataset flights
, which contains all 336,776 flights that departed from New York City in 2013.
flights
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
<dbl>
suggests the corresponding column is of class double
, i.e., a double-precision floating point number.
tibble
We see that the variable type is of tibble
, which is a modern reimagining of the data.frame, keeping what time has proven to be effective, and throwing out what is not.
We can create a tibble from an existing object with as_tibble()
:
as_tibble(iris)
#> # A tibble: 150 x 5
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species
#> <dbl> <dbl> <dbl> <dbl> <fct>
#> 1 5.1 3.5 1.4 0.2 setosa
#> 2 4.9 3 1.4 0.2 setosa
#> 3 4.7 3.2 1.3 0.2 setosa
#> 4 4.6 3.1 1.5 0.2 setosa
#> 5 5 3.6 1.4 0.2 setosa
#> 6 5.4 3.9 1.7 0.4 setosa
#> 7 4.6 3.4 1.4 0.3 setosa
#> 8 5 3.4 1.5 0.2 setosa
#> 9 4.4 2.9 1.4 0.2 setosa
#> 10 4.9 3.1 1.5 0.1 setosa
#> # … with 140 more rows
Filter rows with filter():
filter()
allows you to subset observations based on their values. The first argument is the name of the data frame. The second and subsequent arguments are the expressions that filter the data frame. For example, we can select all flights on January 1st with:
filter(flights, month == 1, day == 1)
#> # A tibble: 842 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 836 more rows, and 11 more variables: arr_delay <dbl>, carrier <chr>,
#> # flight <int>, tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>,
#> # distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
It’s much more efficient that subsetting in data.frame.
dplyr
functions never modify their inputs, so if you want to save the result, you’ll need to use the assignment operator, <-
:
filter(flights, month == 1, day == 1)
The following code finds all flights that departed in November or December:
filter(flights, month == 11 | month == 12)
A useful short-hand for this problem is x %in% y
. This will select every row where x
is one of the values in y
. We could use it to rewrite the code above:
filter(flights, month %in% c(11, 12))
Exercises
Find all flights that
- Had an arrival delay of two or more hours
- Arrived more than two hours late, but didn’t leave late
How many flights have a missing dep_time?
Arrange rows with arrange():
arrange()
works similarly to filter()
except that instead of selecting rows, it changes their order. It takes a data frame and a set of column names (or more complicated expressions) to order by. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns:
arrange(flights, year, month, day)
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 1 517 515 2 830 819
#> 2 2013 1 1 533 529 4 850 830
#> 3 2013 1 1 542 540 2 923 850
#> 4 2013 1 1 544 545 -1 1004 1022
#> 5 2013 1 1 554 600 -6 812 837
#> 6 2013 1 1 554 558 -4 740 728
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Use desc()
to re-order by a column in descending order:
arrange(flights, desc(dep_delay))
#> # A tibble: 336,776 x 19
#> year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
#> <int> <int> <int> <int> <int> <dbl> <int> <int>
#> 1 2013 1 9 641 900 1301 1242 1530
#> 2 2013 6 15 1432 1935 1137 1607 2120
#> 3 2013 1 10 1121 1635 1126 1239 1810
#> 4 2013 9 20 1139 1845 1014 1457 2210
#> 5 2013 7 22 845 1600 1005 1044 1815
#> 6 2013 4 10 1100 1900 960 1342 2211
#> # … with 336,770 more rows, and 11 more variables: arr_delay <dbl>,
#> # carrier <chr>, flight <int>, tailnum <chr>, origin <chr>, dest <chr>,
#> # air_time <dbl>, distance <dbl>, hour <dbl>, minute <dbl>, time_hour <dttm>
Missing values are always sorted at the end.
Exercises
Sort flights
to find the fastest (highest speed) flights. (Consider columns air_time
and distance
)
Select columns with select()
allows you to rapidly zoom in on a useful subset using operations based on the names of the variables.
# Select columns by name
select(flights, year, month, day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns between year and day (inclusive)
select(flights, year:day)
#> # A tibble: 336,776 x 3
#> year month day
#> <int> <int> <int>
#> 1 2013 1 1
#> 2 2013 1 1
#> 3 2013 1 1
#> 4 2013 1 1
#> 5 2013 1 1
#> 6 2013 1 1
#> # … with 336,770 more rows
# Select all columns except those from year to day (inclusive)
select(flights, -(year:day))
#> # A tibble: 336,776 x 16
#> dep_time sched_dep_time dep_delay arr_time sched_arr_time arr_delay carrier
#> <int> <int> <dbl> <int> <int> <dbl> <chr>
#> 1 517 515 2 830 819 11 UA
#> 2 533 529 4 850 830 20 UA
#> 3 542 540 2 923 850 33 AA
#> 4 544 545 -1 1004 1022 -18 B6
#> 5 554 600 -6 812 837 -25 DL
#> 6 554 558 -4 740 728 12 UA
#> # … with 336,770 more rows, and 9 more variables: flight <int>, tailnum <chr>,
#> # origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>, hour <dbl>,
#> # minute <dbl>, time_hour <dttm>
There are a number of helper functions you can use within select():
starts_with("abc")
: matches names that begin with “abc”.ends_with("xyz")
: matches names that end with “xyz”.contains("ijk")
: matches names that contain “ijk”.matches("")
allows you to use regular expressions.
Exercise
Brainstorm as many ways as possible to select dep_time
and dep_delay
from flights.
Add new variables with mutate()
can be used to add new columns.
mutate()
always adds new columns at the end of your dataset.
We first create a narrower dataset from flights
using select()
flights_sml <- select(flights,
year:day,
ends_with("delay"),
distance,
air_time
)
mutate(flights_sml,
gain = dep_delay - arr_delay,
speed = distance / air_time * 60
)
#> # A tibble: 336,776 x 9
#> year month day dep_delay arr_delay distance air_time gain speed
#> <int> <int> <int> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2013 1 1 2 11 1400 227 -9 370.
#> 2 2013 1 1 4 20 1416 227 -16 374.
#> 3 2013 1 1 2 33 1089 160 -31 408.
#> 4 2013 1 1 -1 -18 1576 183 17 517.
#> 5 2013 1 1 -6 -25 762 116 19 394.
#> 6 2013 1 1 -4 12 719 150 -16 288.
#> # … with 336,770 more rows
If you only want to keep the new variables, use transmute()
:
transmute(flights,
gain = dep_delay - arr_delay,
hours = air_time / 60,
)
# A tibble: 336,776 x 2
#> gain hours
#> <dbl> <dbl>
#> 1 -9 3.78
#> 2 -16 3.78
#> 3 -31 2.67
#> 4 17 3.05
#> 5 19 1.93
#> 6 -16 2.5
#> # … with 336,770 more rows
Exercise
Consider the iris
dataset. How to replace the column Species
by its one-hot-encoded version? (Replace Species
by X1
and X2
where X1 = 1
if Species == "setosa"
and X2 = 1
if Species == "versicolor"
).
Questions?
Grouped summaries with summarise()
which collapses a data frame to a single row:
summarise(flights, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 1 x 1
#> delay
#> <dbl>
#> 1 12.6
summarise()
is not terribly useful unless we pair it with group_by()
. This changes the unit of analysis from the complete dataset to individual groups.
by_day <- group_by(flights, year, month, day)
head(by_day)
# A tibble: 6 x 19
# Groups: year, month, day [1]
# year month day dep_time sched_dep_time dep_delay arr_time sched_arr_time
# <int> <int> <int> <int> <int> <dbl> <int> <int>
# 1 2013 1 1 517 515 2 830 819
# 2 2013 1 1 533 529 4 850 830
# 3 2013 1 1 542 540 2 923 850
# 4 2013 1 1 544 545 -1 1004 1022
# 5 2013 1 1 554 600 -6 812 837
# 6 2013 1 1 554 558 -4 740 728
# … with 11 more variables: arr_delay <dbl>, carrier <chr>, flight <int>,
# tailnum <chr>, origin <chr>, dest <chr>, air_time <dbl>, distance <dbl>,
# hour <dbl>, minute <dbl>, time_hour <dttm>
The group_by()
command will generate a new class called grouped_df
.
summarise(by_day, delay = mean(dep_delay, na.rm = TRUE))
#> # A tibble: 365 x 4
#> # Groups: year, month [12]
#> year month day delay
#> <int> <int> <int> <dbl>
#> 1 2013 1 1 11.5
#> 2 2013 1 2 13.9
#> 3 2013 1 3 11.0
#> 4 2013 1 4 8.95
#> 5 2013 1 5 5.73
#> 6 2013 1 6 7.15
#> # … with 359 more rows
Together group_by() and summarise() provide one of the tools that you’ll use most commonly when working with dplyr: grouped summaries. But before we go any further with this, we need to introduce a powerful new idea: the pipe.
Combining multiple operations with the pipe
Using what you know about dplyr, you might write code like this:
by_dest <- group_by(flights, dest)
delay <- summarise(by_dest,
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
)
delay <- filter(delay, count > 20, dest != "ABQ")
There’s another way to tackle the same problem with the pipe, %>%
:
delays <- flights %>%
group_by(dest) %>%
summarise(
count = n(),
dist = mean(distance, na.rm = TRUE),
delay = mean(arr_delay, na.rm = TRUE)
) %>%
filter(count > 20, dest != "ABQ")
Behind the scenes, x %>% f(y)
turns into f(x, y)
, and x %>% f(y) %>% g(z)
turns into g(f(x, y), z)
and so on. You can use the pipe to rewrite multiple operations in a way that you can read left-to-right, top-to-bottom. We’ll use piping frequently from now on because it considerably improves the readability of code,
Working with the pipe is one of the key criteria for belonging to the tidyverse. The only exception is ggplot2: it was written before the pipe was discovered. Unfortunately, the next iteration of ggplot2, ggvis, which does use the pipe, isn’t quite ready for prime time yet.
Exercise
For each flight
, count the number which has arr_delay
greater than 1 hour, and their average delay.
Cheatsheet of dplyr
https://rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf
Mock interview questions
- What variance inflation factor (VIF) in linear regression? What can we infer if the VIF for one predictor is larger than 5?