Setting up our working directory and saving objects vs files

R is a functional programming language, which means that most of what one does is apply functions to objects.

We will begin with a very brief introduction to R objects and how functions work, and then focus on getting data into R, manipulating that data in R, plotting, and analysis.

First, we need to make sure we are working out of the correct directory. You can see teh working directory above the console in RStudio, or use the getwd() function to see the current working directory. If you are not currently working in a good location to read and save files, you can use setwd() to change the working directory to the location of your choice.

setwd('~/Dropbox/Informatics/Workshops/Intro_to_R/')

Vectors

Let’s start by creating one of the simplest R objects, a vector of numbers.

v1<-c(1,2,3,4,5)

v1 is an object which we created by using the <- operator (less followed by a dash).

v1 now contains the output of the function c(1,2,3,4,5) (c, for combined, combines elements into a vector

Note that = is essentially the same as <-, but <- is preferred by the R community primarily for historical reasons.

v1=c(1,2,3,4,5)

Let’s display the contents of v1:

print(v1)
## [1] 1 2 3 4 5

We can also just type the name of an object (in this case v1) to display it.

Let’s make a few more objects:

x<-10
x
## [1] 10
y<-11
y
## [1] 11
some_letters<-c("a", "b", "c", "d")

With this last variable, we had to use "" to specify that we want to create a character vector. Otherwise it thinks we are looking for variables to combine called a,b,c and d.

some_letters <- c(a,b,c,d)

We might want to get a list of all the objects we’ve created at this point. In R, the function ls() returns a character vector with the names of all the objects in a specified environment (by default, the set of user-defined objects and functions).

ls()
## [1] "some_letters" "v1"           "x"            "y"

Getting Help

R has very good built-in documentation that describes what functions do.

To get help about a particular function, use ? followed by the function name, like so:

?ls

If you are not exactly sure of the function name, you can perform a search:

??mean

We can manipulate objects like so:

x+5
## [1] 15
x*2
## [1] 20
x
## [1] 10

Note that the value of x is not modified here. We did not save the output to a new object, so it is printed to the screen.

If we want to update the value of x, we need to use our assignment operator:

x<-x+5
x
## [1] 15

R handles vector math for us automatically:

v1*x
## [1] 15 30 45 60 75
v2<-v1*x
v2
## [1] 15 30 45 60 75

Vector exercises:

  1. Create a new vector, (called nums), that contains any 5 numbers you like

  2. Create a new vector, squares, that contains the elements of nums squared (hint: 2^2 = 2 squared)

  3. There are a number of functions that operate on vectors, including length(), max(), min(), and mean(), all of which do exactly what they sound like they do. So for example length(nums) should return 5, because nums is a 5-element vector. Use these functions to get the minimum, maximum, and mean value for the nums vector you created. Also try them on the some_letters vector. What happens?

  4. Remember that the function ls() returns a vector of the names of the objects you’ve created in your current R session. Use ls() and length() to determine the number of objects in your R session.

Note for this last question, you can solve this either by creating a new object (e.g. cur_obj), or by nesting the functions inside each other.

We can also remove objects from our workspace using the function rm. ?rm

ls()
## [1] "some_letters" "v1"           "v2"           "x"           
## [5] "y"
rm(x)
ls()
## [1] "some_letters" "v1"           "v2"           "y"

We might want to do this if we are working with large object that takes up a lot of memory, and we no longer need that object.

Object Types

All objects have a type. Object types are a complex topic and we are only going to scratch the surface today. To slightly simplify, all data objects in R are either atomic vectors (contain only a single type of data), or data structures that combine atomic vectors in various ways. We’ll walk through a few examples. First, let’s consider a couple of the vectors that we’ve already made.

nums <- c(33, 22, 41, 54, 91)
class(nums)
## [1] "numeric"
class(some_letters)
## [1] "character"

Numeric and character data types are two of the most common we’ll encounter, and are just what they sound like. Another useful type is logical data, as in TRUE/FALSE. We can create a logical vector directly like so:

logic1<-c(TRUE, TRUE, FALSE, FALSE)
logic1
## [1]  TRUE  TRUE FALSE FALSE
class(logic1)
## [1] "logical"

Or we can use logical tests.

logic2<-v1>2
logic2
## [1] FALSE FALSE  TRUE  TRUE  TRUE

Note that the logical test here (>2) is applied independently to each element in the vector. We’ll come back to this when we talk about data subsets.

A final thing to note about logical vectors: they can be converted to numeric with TRUE=1 and FALSE=0. So a simple way to find how many elements in a logical vector are true is sum(). Using mean() is an easy way to get the proportion of true elements in a vector.

sum(logic2)
## [1] 3
mean(logic2)
## [1] 0.6

Object type exercises:

  1. Use class() to determine the type of vector that the ls() function returns. How about the max function? Does it change depending on the input?

  2. Construct a logical vector with the same number of elements as your nums vector, that is TRUE if the corresponding element in the nums vector is less than the mean of the nums vector, and FALSE otherwise.

  3. How many elements in your nums vector are greater than the mean of the nums vector?

Data Frames

So far we’ve been talking about atomic vectors, which only contain a single data type (every element is logical, or character, or numeric). However, data sets will usually have multiple different data types: numeric for continunous data, character for categorical data and sample labels. Depending on how underlying types are combined, we can have four different “higher-level” data types in R:

Dimensions Homogeneous Heterogeneous
1-D atomic vector list
2-D matrix data frame / tibble

We’ll focus on data frames and tibbles for today, but lists and matrices can also be very powerful. A data frame is a collection of vectors, which can be (but don’t have to be) different types, but all have to have the same length. Let’s make a couple of toy data frames.

One way to do this is with the data.frame() function.

df1<-data.frame(label=c("rep1", "rep2", "rep3", "rep4"), data=c(23, 34, 15, 19))
df1
##   label data
## 1  rep1   23
## 2  rep2   34
## 3  rep3   15
## 4  rep4   19
class(df1)
## [1] "data.frame"

str() gives lots of information about the data type of the consituent parts of a data frame

str(df1)
## 'data.frame':    4 obs. of  2 variables:
##  $ label: Factor w/ 4 levels "rep1","rep2",..: 1 2 3 4
##  $ data : num  23 34 15 19

Note that label is a factor – this is a special kind of character vector to represent categorical data. By default, when creating a data frame or reading data from a file, R will convert strings (character vectors) to factors. Factors can be useful, but also introduce some pitfalls, which we can come back to if we have time. A good source of information on factors is here.

We can use the function head() to look at part of a a dataframe (or any R object). This can be very useful if you have a very large or long dataframe. You also can control how many lines of the dataframe you view with n=#, although the default is 6.

head(df1)
##   label data
## 1  rep1   23
## 2  rep2   34
## 3  rep3   15
## 4  rep4   19
head(df1, n=2)
##   label data
## 1  rep1   23
## 2  rep2   34
head(df1, n=10)
##   label data
## 1  rep1   23
## 2  rep2   34
## 3  rep3   15
## 4  rep4   19

The summary() function can also be very useful to get a snapshot of the data in your dataframe.

summary(df1)
##   label        data      
##  rep1:1   Min.   :15.00  
##  rep2:1   1st Qu.:18.00  
##  rep3:1   Median :21.00  
##  rep4:1   Mean   :22.75  
##           3rd Qu.:25.75  
##           Max.   :34.00

R packages and the tidyverse

Data frames are a very useful type of base R object. Base R means that it is a native R object type (similar to vectors and matrices). We have learned several of the R object types this morning, and used some functions that come built in to R. However, as you are probably aware, one of the reasons R is so useful and powerful is because many people have built extensions for R in the form of R packages that you can install. These packages can range from broad statistical packages to packages to analyze specific data types (e.g. comparative phylogenetics). One set of packages that are incredibly useful for working with data is the tidyverse. We will primarily be using various elements of the tidyverse from here on out, so let’s install it.

The way to install new packages you have never used before in R is with the command install.packages(), as we demonstrate below:

#install.packages("tidyverse")

Once the package is installed, we need to load the packages.

library(tidyverse)
## ── Attaching packages ────────────────────────────────── tidyverse 1.2.1 ──
## ✔ ggplot2 2.2.1     ✔ purrr   0.2.4
## ✔ tibble  1.3.4     ✔ dplyr   0.7.4
## ✔ tidyr   0.7.2     ✔ stringr 1.2.0
## ✔ readr   1.1.1     ✔ forcats 0.2.0
## ── Conflicts ───────────────────────────────────── tidyverse_conflicts() ──
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::lag()    masks stats::lag()

You only need to load an R package once per session, but you do need to re-load everytime you restart R. It is good practice to load all packages at the top of your R script.

Tibbles

Now back to data manipulation. So far, we have been working with data frames. The tidyverse uses its own version of the data frame which is similar, but has several properties that make it superior. That object type is the tibble. Let’s change our previous data frame (df1) into a tibble and see what it looks like.

tbdf1 <- as.tibble(df1)
tbdf1
## # A tibble: 4 x 2
##    label  data
##   <fctr> <dbl>
## 1   rep1    23
## 2   rep2    34
## 3   rep3    15
## 4   rep4    19

You see here that just printing the tibble to screen displays the data types in each of the columns and the dimensions. Although not apparent with this small dataset, another very handy feature of tibbles is that by default they will only print out the first 10 rows and as many columns as fit in your window. If you have ever worked with large datasets with hundreds or thousands of rows and many columns and have accidently printed to screen, you know how useful this is! Many packages will accept tibbles instead of data frames with no problems, but if you come across an older package that requires a data frame, it is easy to revert with the as.data.frame() function.

Reading files into R

So far we have been working with small objects we created by hand. A more common way to create tibbles is by reading from a file. There are a few functions to do this in R: read_delim() and read_csv() from the tidyverse readR package make this easy.

bus<-read_delim(file="http://software.rc.fas.harvard.edu/ngsdata/workshops/2015_March/mbta_bus.tsv", delim="\t")
## Parsed with column specification:
## cols(
##   route = col_character(),
##   type = col_character(),
##   cost.per.pax = col_double(),
##   ridership = col_integer(),
##   pax.per.trip = col_integer()
## )

Note that R can read data directly from the web – no need to download the file first.

You can see that readr guesses what type of data exist in the different columns, and automatically pulls the first row as the column names. If you

File reading exercise 1. Explore the bus tibble and see what it contains. Do you think that each of the columns are of the correct type? 2. Try using head() to see the first 20 rows and summary() to see some information about the data. 3. What happens when you use the function View() with the bus object?

bus
## # A tibble: 163 x 5
##    route      type cost.per.pax ridership pax.per.trip
##    <chr>     <chr>        <dbl>     <int>        <int>
##  1     1       Key         0.63     13306          416
##  2     4  Commuter         4.38       459           21
##  3     5 Community         2.99       156            5
##  4     7     Local         1.52      3893          177
##  5     8     Local         2.02      3844           85
##  6     9     Local         1.21      5980          187
##  7    10     Local         1.75      3184          398
##  8    11     Local         1.84      3312          237
##  9    14     Local         2.66      1285           22
## 10    15       Key         1.11      6227          111
## # ... with 153 more rows
head(bus, n=20)
## # A tibble: 20 x 5
##    route      type cost.per.pax ridership pax.per.trip
##    <chr>     <chr>        <dbl>     <int>        <int>
##  1     1       Key         0.63     13306          416
##  2     4  Commuter         4.38       459           21
##  3     5 Community         2.99       156            5
##  4     7     Local         1.52      3893          177
##  5     8     Local         2.02      3844           85
##  6     9     Local         1.21      5980          187
##  7    10     Local         1.75      3184          398
##  8    11     Local         1.84      3312          237
##  9    14     Local         2.66      1285           22
## 10    15       Key         1.11      6227          111
## 11    16     Local         0.99      5100         2550
## 12    17     Local         0.71      3461          115
## 13    18     Local         2.71       630           16
## 14    19     Local         1.07      3591          211
## 15    21     Local         0.57      4696           90
## 16    22       Key         1.05      8151          170
## 17    23       Key         1.00     11687          899
## 18    24     Local         1.30      1725           21
## 19    26     Local         0.80      1960           41
## 20    27     Local         1.13       929           22

head() can be used to increase the number of rows a tibble shows. Let’s also look at str()

Again, summary() is a good way to see some basic information about a tibble.

summary(bus)
##     route               type            cost.per.pax     ridership      
##  Length:163         Length:163         Min.   :0.570   Min.   :   30.0  
##  Class :character   Class :character   1st Qu.:1.350   1st Qu.:  755.5  
##  Mode  :character   Mode  :character   Median :2.150   Median : 1481.0  
##                                        Mean   :2.433   Mean   : 2427.1  
##                                        3rd Qu.:3.055   3rd Qu.: 3101.0  
##                                        Max.   :9.760   Max.   :15018.0  
##   pax.per.trip    
##  Min.   :   0.00  
##  1st Qu.:   9.00  
##  Median :  21.00  
##  Mean   :  78.55  
##  3rd Qu.:  48.00  
##  Max.   :2550.00

The view function can be useful for viewing tibbles in a spread-sheet like format.

View(bus)

Tidy Data

Tidy data is the idea that each row in your data frame should be an observation, and each column should be a variable. This simple idea can make a lot of analysis tasks much easier. In many cases, this is straightforward. For example, the output of DESeq2 (a package that computes differential expression statistics for RNA-seq data) is already in tidy format, which each row representing a gene (observation) and each column representing an observation about that gene (expression level, fold change, P-value). In other cases, this will be tricky.

There are a lot of functions in base R you may have come across to these kinds of data manipulation tasks, such as subset, the apply family, and merge. But they have inconsistent syntax and can be difficult to learn.

As an alternative, we will use the tidyr and dplyr packages written by Hadley Wickham (of ggplots and devtools fame). These tools provide a consistent grammar for data manipulation.

To install these packages and load some datasets, source this file: http://software.rc.fas.harvard.edu/ngsdata/workshops/prep_data.R

Working with tidyr

The goal of tidyr is to convert between ‘wide’ data and ‘long’ data. Long data is tidy data: each row is an observation, each column is a variable. Wide data has many columns for the same variable, one for each level of a classification variable. For example, here is some airline passenger data that comes packaged with R. This is in wide format (there is a separate column for each month).

source("prep_data.R")
## 
## The downloaded binary packages are in
##  /var/folders/83/dwxgsccn6x12_m61s_kvgljc0000gp/T//RtmpigfZ0X/downloaded_packages
airpass
##    Year Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec
## 1  1949 112 118 132 129 121 135 148 148 136 119 104 118
## 2  1950 115 126 141 135 125 149 170 170 158 133 114 140
## 3  1951 145 150 178 163 172 178 199 199 184 162 146 166
## 4  1952 171 180 193 181 183 218 230 242 209 191 172 194
## 5  1953 196 196 236 235 229 243 264 272 237 211 180 201
## 6  1954 204 188 235 227 234 264 302 293 259 229 203 229
## 7  1955 242 233 267 269 270 315 364 347 312 274 237 278
## 8  1956 284 277 317 313 318 374 413 405 355 306 271 306
## 9  1957 315 301 356 348 355 422 465 467 404 347 305 336
## 10 1958 340 318 362 348 363 435 491 505 404 359 310 337
## 11 1959 360 342 406 396 420 472 548 559 463 407 362 405
## 12 1960 417 391 419 461 472 535 622 606 508 461 390 432

The gather() function in tidyr turns wide data into long data.

airpass2<-gather(data=airpass, key=Month, value=Passengers, Jan,Feb,Mar,Apr,May,Jun,Jul,Aug,Sep,Oct,Nov,Dec)
head(airpass2, n=2)
##   Year Month Passengers
## 1 1949   Jan        112
## 2 1950   Jan        115

The gather() function takes a data frame (or data-frame-like object), a key, a value, and the columns to be “gathered”. In this case the key is month and the value is passengers. The key-value pair is the critical concept for using gather (and its counterpart, spread). The key is what the columns are in wide data, and value is the variable.

gather() and essentially all tidyr/dplyr functions have some useful features to make things easier.

  1. You can (almost) always use column names like you would use column numbers in base R. So for example, rather than typing out every month, we can just get all the columns in between Jan and Dec, inclusive:
airpass3<-gather(airpass, Month, Passengers, Jan:Dec)
head(airpass3, n=2)
##   Year Month Passengers
## 1 1949   Jan        112
## 2 1950   Jan        115
  1. You can (almost) always use negated column names to mean everything but that column (or to remove a column):
airpass4<-gather(airpass, Month, Passengers, -Year)
head(airpass4, n=2)
##   Year Month Passengers
## 1 1949   Jan        112
## 2 1950   Jan        115

Side Note: %>%

The %>% operator acts like a Unix pipe in R. This means you can pass the output of one command to another in linear fashion, as opposed to having to use either nested operations or temporary objects. This makes your code much easier to read and debug.

Compare:

airpass_long <- gather(as.tibble(airpass), Month, Passengers, -Year)
airpass_tibble <- as.tibble(airpass)
airpass_long <- gather(airpass_tibble, Month, Passengers, -Year)
airpass_long <- airpass %>% 
  as.tibble %>%
  gather(Month, Passengers, -Year)

These all do the same thing, but the last one is the easiest to read and especially to extend with more commands.

Additional tidyr commands:

  • unite() which is an easy way to make new (text) columns out of existing columns
  • separate() which splits a text column into multiple new columns
  • spread() which does the reverse of gather() and spreads a long data frame into a wide data frame

We’ll look at quick example of each using the airpass dataset, and then do some exercises with some additional datasets.

airpass_long_2 <- airpass_long %>% unite(col=Date, Year, Month, sep="-", remove=T)
airpass_long_2
## # A tibble: 144 x 2
##        Date Passengers
##  *    <chr>      <dbl>
##  1 1949-Jan        112
##  2 1950-Jan        115
##  3 1951-Jan        145
##  4 1952-Jan        171
##  5 1953-Jan        196
##  6 1954-Jan        204
##  7 1955-Jan        242
##  8 1956-Jan        284
##  9 1957-Jan        315
## 10 1958-Jan        340
## # ... with 134 more rows
airpass_long_2 %>% separate(col=Date, into=c("Year", "Month"), sep="-", remove=F)
## # A tibble: 144 x 4
##        Date  Year Month Passengers
##  *    <chr> <chr> <chr>      <dbl>
##  1 1949-Jan  1949   Jan        112
##  2 1950-Jan  1950   Jan        115
##  3 1951-Jan  1951   Jan        145
##  4 1952-Jan  1952   Jan        171
##  5 1953-Jan  1953   Jan        196
##  6 1954-Jan  1954   Jan        204
##  7 1955-Jan  1955   Jan        242
##  8 1956-Jan  1956   Jan        284
##  9 1957-Jan  1957   Jan        315
## 10 1958-Jan  1958   Jan        340
## # ... with 134 more rows

A note about separate: the thing in the sep argument is interpretated as a regular expression. The default (which often works) is to separate on any non-alphanumeric character. Usually (but not always) single character expression will do what you expect. Regular expressions in R are a whole workshop on their own. If you do need to use separate on a complicated string, it may take some trial and error.

airpass_long %>% spread(key=Month, value=Passengers)
## # A tibble: 12 x 13
##     Year   Apr   Aug   Dec   Feb   Jan   Jul   Jun   Mar   May   Nov   Oct
##  * <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
##  1  1949   129   148   118   118   112   148   135   132   121   104   119
##  2  1950   135   170   140   126   115   170   149   141   125   114   133
##  3  1951   163   199   166   150   145   199   178   178   172   146   162
##  4  1952   181   242   194   180   171   230   218   193   183   172   191
##  5  1953   235   272   201   196   196   264   243   236   229   180   211
##  6  1954   227   293   229   188   204   302   264   235   234   203   229
##  7  1955   269   347   278   233   242   364   315   267   270   237   274
##  8  1956   313   405   306   277   284   413   374   317   318   271   306
##  9  1957   348   467   336   301   315   465   422   356   355   305   347
## 10  1958   348   505   337   318   340   491   435   362   363   310   359
## 11  1959   396   559   405   342   360   548   472   406   420   362   407
## 12  1960   461   606   432   391   417   622   535   419   472   390   461
## # ... with 1 more variables: Sep <dbl>

Exercises with tidyr functions

Exercise 1

You should have a dataset loaded, called nzmodeshareschool, that has mode share data for travel to school in various regions of New Zealand (it is amazing what you can find on the internet these days). It looks like this:

nzmodeshareschool %>% as.tibble %>% print
## # A tibble: 8 x 10
##            travel_mode Northland Auckland Waikato..Bay.of.Plenty
## *                <chr>     <chr>    <chr>                  <chr>
## 1          Walk (only)       25%      38%                    14%
## 2            Passenger       36%      54%                    60%
## 3 Car passenger + walk        2%       1%                     2%
## 4              Bicycle        4%       0%                     5%
## 5     Public transport       16%       1%                     2%
## 6            Walk + PT       15%       2%                    10%
## 7   Car passenger + PT        2%       2%                     6%
## 8                Other        1%       1%                     0%
## # ... with 6 more variables: Gisborne.Hawkes.Bay <chr>,
## #   Taranaki..Manawatu.Wanganui <chr>, Wellington <chr>,
## #   West.Coast.Nelson..Marlborough.Tasman <chr>, Canterbury <chr>,
## #   Otago..Southland <chr>

Use gather to convert this to a long format. Think about what should be the key, and what should be the value. Use the %>% to first convert to a tibble. Store the new data frame in nzmodeshareschool_clean.

nzmodeshareschool_clean <- nzmodeshareschool %>% as.tibble %>% gather(location, share, -travel_mode) %>% print
## # A tibble: 72 x 3
##             travel_mode  location share
##                   <chr>     <chr> <chr>
##  1          Walk (only) Northland   25%
##  2            Passenger Northland   36%
##  3 Car passenger + walk Northland    2%
##  4              Bicycle Northland    4%
##  5     Public transport Northland   16%
##  6            Walk + PT Northland   15%
##  7   Car passenger + PT Northland    2%
##  8                Other Northland    1%
##  9          Walk (only)  Auckland   38%
## 10            Passenger  Auckland   54%
## # ... with 62 more rows

Exercise 2, using spread

There is a dataset on the count of M&Ms of various colors in a set of bags in mms (note: already in tibble format). Convert this to wide format using spread, so that we have separate columns for each color. Put it in a new tibble called mms_wide.

mms_wide <- mms %>% spread(color, count) %>% print
## # A tibble: 30 x 8
##    Weight BagID  Blue Brown Green Orange   Red Yellow
##  *  <dbl> <int> <int> <int> <int>  <int> <int>  <int>
##  1  46.22    10     3    15     9      9     9      8
##  2  46.72    26     7    19     6      6     4     14
##  3  46.94    13    13    13     9      2    12      6
##  4  47.61     5     7     4     3      9    10     22
##  5  47.67    27    11     7    12      6    10     11
##  6  47.70    28     2    16     4      9     5     18
##  7  47.98    14     7     7     7      2     9     18
##  8  48.28    24     1    19     6     12    12      6
##  9  48.33    16     9    20     6      4     4     12
## 10  48.45     9    10    18     2      6     4     18
## # ... with 20 more rows

Exercise 3, putting it all together

There is a dataset on US house prices in a data frame called housing. Look at this dataframe and make it tidy. You will need to: 1. convert to a tibble 2. use gather to reformat as long data (one observation per row, one variable per column) 3.use separate to split location into city and state columns (dropping location), and date into year and month columns (but keeping the date column)

Two hints: First, there is a National.US column that you will want to keep as a separate variable for each. Second, note that the location (column names) use . for spaces, but that some cities (e.g. Las Vegas) also have a space. So when you use separate, you’ll have to look at what to do with that extra field. Similarly, the date column uses dash to separate year-month-day, but you don’t want to keep the day part. Check ?separate and look at the extra option to figure out what to do here.

You should end up with this:

housing_clean <- housing %>% as.tibble %>%
  gather(location, local_index, -Date, -National.US) %>% 
  separate(location, c("state", "city"),extra="merge") %>%
  separate(Date, c("year", "month"), extra="drop", remove=F) %>%
  print
## # A tibble: 6,960 x 7
##          Date  year month National.US state    city local_index
##  *      <chr> <chr> <chr>       <dbl> <chr>   <chr>       <dbl>
##  1 1987-01-01  1987    01       63.75    AZ Phoenix          NA
##  2 1987-02-01  1987    02       64.15    AZ Phoenix          NA
##  3 1987-03-01  1987    03       64.49    AZ Phoenix          NA
##  4 1987-04-01  1987    04       64.99    AZ Phoenix          NA
##  5 1987-05-01  1987    05       65.57    AZ Phoenix          NA
##  6 1987-06-01  1987    06       66.24    AZ Phoenix          NA
##  7 1987-07-01  1987    07       66.80    AZ Phoenix          NA
##  8 1987-08-01  1987    08       67.29    AZ Phoenix          NA
##  9 1987-09-01  1987    09       67.64    AZ Phoenix          NA
## 10 1987-10-01  1987    10       67.92    AZ Phoenix          NA
## # ... with 6,950 more rows

Subsetting and Manipulating Data with dplyr

tidyr() does one thing well – reformats data from wide to long, or long to wide (with the helper functions separate and unite which make these tasks easier if there are multiple classification variables).

However most datasets require more than just reformating in this way. Organizing, updating, filtering, aggregating, and merging are all done with the dpylr package. In dpylr, as in tidyr, each action gets its own (verb) function – so for example filtering data by rows is done with the filter() function. All of these functions have a very similar syntax to tidyr functions.

Organizing/updating functions

arrange(), rename(), select(), mutate() are used to sort data, rename columns, and upate/create columns

We’ll use the housing dataset to look at how these functions work.

arrange() sorts by one or more columns, with subsequent columns used to break ties in earlier columns. E.g.,

housing_clean %>% arrange(year, month)
## # A tibble: 6,960 x 7
##          Date  year month National.US state          city local_index
##         <chr> <chr> <chr>       <dbl> <chr>         <chr>       <dbl>
##  1 1987-01-01  1987    01       63.75    AZ       Phoenix          NA
##  2 1987-01-01  1987    01       63.75    CA   Los.Angeles       59.33
##  3 1987-01-01  1987    01       63.75    CA     San.Diego       54.67
##  4 1987-01-01  1987    01       63.75    CA San.Francisco       46.61
##  5 1987-01-01  1987    01       63.75    CO        Denver       50.20
##  6 1987-01-01  1987    01       63.75    DC    Washington       64.11
##  7 1987-01-01  1987    01       63.75    FL         Miami       68.50
##  8 1987-01-01  1987    01       63.75    FL         Tampa       77.33
##  9 1987-01-01  1987    01       63.75    GA       Atlanta          NA
## 10 1987-01-01  1987    01       63.75    IL       Chicago       53.55
## # ... with 6,950 more rows
housing_clean %>% arrange(city, year)
## # A tibble: 6,960 x 7
##          Date  year month National.US state    city local_index
##         <chr> <chr> <chr>       <dbl> <chr>   <chr>       <dbl>
##  1 1987-01-01  1987    01       63.75    GA Atlanta          NA
##  2 1987-02-01  1987    02       64.15    GA Atlanta          NA
##  3 1987-03-01  1987    03       64.49    GA Atlanta          NA
##  4 1987-04-01  1987    04       64.99    GA Atlanta          NA
##  5 1987-05-01  1987    05       65.57    GA Atlanta          NA
##  6 1987-06-01  1987    06       66.24    GA Atlanta          NA
##  7 1987-07-01  1987    07       66.80    GA Atlanta          NA
##  8 1987-08-01  1987    08       67.29    GA Atlanta          NA
##  9 1987-09-01  1987    09       67.64    GA Atlanta          NA
## 10 1987-10-01  1987    10       67.92    GA Atlanta          NA
## # ... with 6,950 more rows
housing_clean %>% arrange(month,state)
## # A tibble: 6,960 x 7
##          Date  year month National.US state    city local_index
##         <chr> <chr> <chr>       <dbl> <chr>   <chr>       <dbl>
##  1 1987-01-01  1987    01       63.75    AZ Phoenix          NA
##  2 1988-01-01  1988    01       68.60    AZ Phoenix          NA
##  3 1989-01-01  1989    01       73.62    AZ Phoenix       67.54
##  4 1990-01-01  1990    01       76.53    AZ Phoenix       66.56
##  5 1991-01-01  1991    01       75.53    AZ Phoenix       65.26
##  6 1992-01-01  1992    01       75.70    AZ Phoenix       66.01
##  7 1993-01-01  1993    01       76.40    AZ Phoenix       66.83
##  8 1994-01-01  1994    01       78.21    AZ Phoenix       70.53
##  9 1995-01-01  1995    01       80.04    AZ Phoenix       75.11
## 10 1996-01-01  1996    01       81.46    AZ Phoenix       79.10
## # ... with 6,950 more rows
housing_clean %>% arrange(desc(year))
## # A tibble: 6,960 x 7
##          Date  year month National.US state    city local_index
##         <chr> <chr> <chr>       <dbl> <chr>   <chr>       <dbl>
##  1 2015-01-01  2015    01      166.41    AZ Phoenix      147.84
##  2 2015-02-01  2015    02      166.81    AZ Phoenix      148.22
##  3 2015-03-01  2015    03      168.28    AZ Phoenix      149.19
##  4 2015-04-01  2015    04      170.17    AZ Phoenix      150.39
##  5 2015-05-01  2015    05      172.06    AZ Phoenix      151.48
##  6 2015-06-01  2015    06      173.66    AZ Phoenix      152.83
##  7 2015-07-01  2015    07      174.71    AZ Phoenix      153.93
##  8 2015-08-01  2015    08      175.17    AZ Phoenix      154.82
##  9 2015-09-01  2015    09      175.32    AZ Phoenix      155.15
## 10 2015-10-01  2015    10      175.40    AZ Phoenix      155.92
## # ... with 6,950 more rows

rename() renames a column:

housing_clean %>% arrange(year, month, state, city) %>%
  rename(national_index = National.US)
## # A tibble: 6,960 x 7
##          Date  year month national_index state          city local_index
##         <chr> <chr> <chr>          <dbl> <chr>         <chr>       <dbl>
##  1 1987-01-01  1987    01          63.75    AZ       Phoenix          NA
##  2 1987-01-01  1987    01          63.75    CA   Los.Angeles       59.33
##  3 1987-01-01  1987    01          63.75    CA     San.Diego       54.67
##  4 1987-01-01  1987    01          63.75    CA San.Francisco       46.61
##  5 1987-01-01  1987    01          63.75    CO        Denver       50.20
##  6 1987-01-01  1987    01          63.75    DC    Washington       64.11
##  7 1987-01-01  1987    01          63.75    FL         Miami       68.50
##  8 1987-01-01  1987    01          63.75    FL         Tampa       77.33
##  9 1987-01-01  1987    01          63.75    GA       Atlanta          NA
## 10 1987-01-01  1987    01          63.75    IL       Chicago       53.55
## # ... with 6,950 more rows

select() selects columns to keep. Note that we can simulatenously rename and reorder columns:

housing_clean %>% arrange(year, month, state, city) %>%
  select(year, month, city, state, local_index, national_index = National.US)
## # A tibble: 6,960 x 6
##     year month          city state local_index national_index
##    <chr> <chr>         <chr> <chr>       <dbl>          <dbl>
##  1  1987    01       Phoenix    AZ          NA          63.75
##  2  1987    01   Los.Angeles    CA       59.33          63.75
##  3  1987    01     San.Diego    CA       54.67          63.75
##  4  1987    01 San.Francisco    CA       46.61          63.75
##  5  1987    01        Denver    CO       50.20          63.75
##  6  1987    01    Washington    DC       64.11          63.75
##  7  1987    01         Miami    FL       68.50          63.75
##  8  1987    01         Tampa    FL       77.33          63.75
##  9  1987    01       Atlanta    GA          NA          63.75
## 10  1987    01       Chicago    IL       53.55          63.75
## # ... with 6,950 more rows

distinct() is like unique(), and can be used to idnetify all unique values in your call set:

housing_clean %>%
  select(state,city) %>%
  distinct
## # A tibble: 20 x 2
##    state          city
##    <chr>         <chr>
##  1    AZ       Phoenix
##  2    CA   Los.Angeles
##  3    CA     San.Diego
##  4    CA San.Francisco
##  5    CO        Denver
##  6    DC    Washington
##  7    FL         Miami
##  8    FL         Tampa
##  9    GA       Atlanta
## 10    IL       Chicago
## 11    MA        Boston
## 12    MI       Detroit
## 13    MN   Minneapolis
## 14    NC     Charlotte
## 15    NV     Las.Vegas
## 16    NY      New.York
## 17    OH     Cleveland
## 18    OR      Portland
## 19    TX        Dallas
## 20    WA       Seattle

mutate() creates new columns, or updates existing ones, while keeping everthing else unchanged (transmute does the same but drops other columns).

housing_clean %>% arrange(year, month, state, city) %>%
  select(year, month, city, state, local_index, national_index = National.US) %>%
  mutate(month = month.abb[as.integer(month)])
## # A tibble: 6,960 x 6
##     year month          city state local_index national_index
##    <chr> <chr>         <chr> <chr>       <dbl>          <dbl>
##  1  1987   Jan       Phoenix    AZ          NA          63.75
##  2  1987   Jan   Los.Angeles    CA       59.33          63.75
##  3  1987   Jan     San.Diego    CA       54.67          63.75
##  4  1987   Jan San.Francisco    CA       46.61          63.75
##  5  1987   Jan        Denver    CO       50.20          63.75
##  6  1987   Jan    Washington    DC       64.11          63.75
##  7  1987   Jan         Miami    FL       68.50          63.75
##  8  1987   Jan         Tampa    FL       77.33          63.75
##  9  1987   Jan       Atlanta    GA          NA          63.75
## 10  1987   Jan       Chicago    IL       53.55          63.75
## # ... with 6,950 more rows

We can include several mutate calls at once:

housing_clean %>% arrange(year, month, state, city) %>%
  select(year, month, city, state, local_index, national_index = National.US) %>%
  mutate(month = month.abb[as.integer(month)], city=sub(".", "_", city, fixed=TRUE), rel_index = local_index/national_index)
## # A tibble: 6,960 x 7
##     year month          city state local_index national_index rel_index
##    <chr> <chr>         <chr> <chr>       <dbl>          <dbl>     <dbl>
##  1  1987   Jan       Phoenix    AZ          NA          63.75        NA
##  2  1987   Jan   Los_Angeles    CA       59.33          63.75 0.9306667
##  3  1987   Jan     San_Diego    CA       54.67          63.75 0.8575686
##  4  1987   Jan San_Francisco    CA       46.61          63.75 0.7311373
##  5  1987   Jan        Denver    CO       50.20          63.75 0.7874510
##  6  1987   Jan    Washington    DC       64.11          63.75 1.0056471
##  7  1987   Jan         Miami    FL       68.50          63.75 1.0745098
##  8  1987   Jan         Tampa    FL       77.33          63.75 1.2130196
##  9  1987   Jan       Atlanta    GA          NA          63.75        NA
## 10  1987   Jan       Chicago    IL       53.55          63.75 0.8400000
## # ... with 6,950 more rows

Of course, we could make a new column rel_index in the traditional data frame way, like housing\(rel_index = housing\)local_index/housing$national_index. However, this cannot be chained together with other tidyr/dplyr statements with %>% so is a lot less flexible.

Now we can chain everything together to generate our clean data from scratch in one go.

housing_clean <- housing %>% 
  as.tibble %>%
  gather(location, local_index, -Date, -National.US) %>% 
  separate(location, c("state", "city"),extra="merge") %>%
  separate(Date, c("year", "month"), extra="drop", remove=F) %>%
  select(year, month, city, state, local_index, national_index=National.US) %>%
  arrange(year, month, state, city) %>%
  mutate(year = as.integer(year), month = month.abb[as.integer(month)], city = sub(".", "_", city, fixed=TRUE), rel_index = local_index/national_index)

Exercises using dpylr functions

We’ll turn back to our bus dataset (called bus) in order to try some exercises with these functions. First take a minute just to review what is there.

Exercise 1:

What are all of the different bus types? Hint: Use distinct and select

bus %>%
  select(type) %>%
  distinct %>%
  print
## # A tibble: 5 x 1
##        type
##       <chr>
## 1       Key
## 2  Commuter
## 3 Community
## 4     Local
## 5   Express

Exercise 2:

What is the least expensive bus route to run per passenger (least cost.per.pax)? What is the most expensive bus route to run per passenger (greatest cost.per.pax)? Hint, this will reqire two separate function calls. Use arrange and select.

bus %>%
  arrange(cost.per.pax) %>%
  select(route,cost.per.pax) %>%
  head(n=1)
## # A tibble: 1 x 2
##   route cost.per.pax
##   <chr>        <dbl>
## 1    21         0.57
bus %>%
  arrange(desc(cost.per.pax)) %>%
  select(route,cost.per.pax) %>%
  head(n=1)
## # A tibble: 1 x 2
##   route cost.per.pax
##   <chr>        <dbl>
## 1   439         9.76

Writing Data

We’ve added several variables now, and we might want to write our updated dataset to a file. We do this with the write.table() function, which writes out a data.frame.

write_delim(housing_clean, path="housing_clean.tsv",delim="\t")

But where did R put the file on our computer? R does all file operations without a full path in a working directory. RStudio has a preference to set the default working directory, which is typically something like /Users/Allison/R.

Remember we set this earlier on, but to see the current working directory, use:

getwd()
## [1] "/Users/ashultz/Dropbox/Informatics/Workshops/Intro_to_R"

Again, remember that you can change the working directory with setwd().

Note that you can also save R data objects directly in an R data file (.Rdat). This can come in handy if you have a number of R objects that took a long time to compute, and you want to work on them later. This can also be useful if you are working with R objects that can’t easily be saved as a tab-delimited file.

save(housing,housing_clean,file="housing_objects.Rdat")
rm(housing)
rm(housing_clean)

Later, when you load the file, those objects will automatically be added to your environment.

load("housing_objects.Rdat")

File operators exercises:

  1. Write a copy of the original data frame to your hard drive. Hint: you will have to reload the data from the web since we’ve modified the data frame, but be sure to give it a different name! Otherwise you will overwrite all your modifications.