Practice Session: Joins

TipLearning Objectives
  • Practice joining tables together
  • Practice identifying primary and foreign keys
  • Practice using common cleaning and wrangling functions
NoteAcknowledgements

These exercises are adapted from Allison Horst’s EDS 221: Scientific Programming Essentials Course for the Bren School’s Master of Environmental Data Science program.

About the data

These exercises will be using bird survey data collected from the central Arizona-Phoenix metropolitan area by Arizona State University researchers [@warren2021].

Exercise 1: Practice Joins

TipSetup
  1. Make sure you’re in the right project (training_{USERNAME}) and use the Git workflow by Pulling to check for any changes in the remote repository (aka repository on GitHub).

  2. Create a new Quarto Document.

    1. Title it “R Practice: Tidy Data and Joins”.
    2. Save the file and name it “r-practice-tidy-data-joins” in your scripts folder.

Note: Double check that you’re in the right project. Where in RStudio can you check where you are?

  1. Load the following libraries at the top of your Quarto Document.
library(readr)
library(dplyr)

Attaching package: 'dplyr'
The following objects are masked from 'package:stats':

    filter, lag
The following objects are masked from 'package:base':

    intersect, setdiff, setequal, union
library(here)
here() starts at /Users/lopazanski/Documents/github/duke_data_sci
library(lubridate) # for bonus question

Attaching package: 'lubridate'
The following objects are masked from 'package:base':

    date, intersect, setdiff, union
# Quick question: Do you get a message after loading the libraries? What is it telling you? Talk to your neighbor about it or write a note in your qmd.
  1. Obtain data from the EDI Data Portal Ecological and social interactions in urban parks: bird surveys in local parks in the central Arizona-Phoenix metropolitan area. Download the following files:
  • 52_pp52_birds_1.csv
  • 52_pp52_surveys_1.csv
  • 52_pp52_sites_1.csv
  • 52_pp52_taxalist_1.csv

Note: It’s up to you on how you want to download and load the data! You can either use the download links (obtain by right-clicking the “Download” button and select “Copy Link Address” for each data entity) or manually download the data and then upload the files to RStudio server.

  1. Organize your Quarto Document in a meaningful way. Organization is personal - so this is up to you! Consider the different ways we’ve organized previous files using: headers, bold text, naming code chunks, comments in code chunks. What is most important is organizing and documenting the file so that your future self (or if you share this file with others!) understands it as well as your current self does right now.

  2. Use the Git workflow. After you’ve set up your project and uploaded your data go through the workflow: Stage (add) -> Commit -> Pull -> Push

    • Note: You also want to Pull when you first open a project.

1 Read in the data

NoteQuestion 1

Read in the data and store the data frames as bird_observations, sites, surveys, and taxalist (it should be clear from the raw file names which is which).

Rows: 40425 Columns: 9
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): site_id, species_id, distance, notes, direction
dbl (4): survey_id, bird_count, seen, heard

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 2004 Columns: 12
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr  (4): site_id, observer, wind_dir, notes
dbl  (4): survey_id, wind_speed, air_temp, cloud_cover
lgl  (1): temp_units
dttm (3): survey_date, time_start, time_end

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 221 Columns: 7
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (5): site_id, park_code, park_district, park_name, point_code
lgl (2): point_location, park_acreage

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
Rows: 259 Columns: 3
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (2): species_id, common_name
dbl (1): asu_itis

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# read in data from the data directory after manually downloading data 
bird_observations <- read_csv(here::here("data/52_pp52_birds_1.csv"))
surveys <- read_csv(here::here("data/52_pp52_surveys_1.csv"))
sites <- read_csv(here::here("data/52_pp52_sites_1.csv"))
taxalist <- read_csv(here::here("data/52_pp52_taxalist_1.csv"))

2 Get familiar with the data

NoteQuestion 2a

What functions can you use to explore the data you just read in? Think about which functions we’ve been using to explore the structure of the data frame, information about columns, unique observations, etc. Tip: run View(name_of_your_data_frame) in the console to see data in a spreadsheet-style viewer.

# returns dimensions of the dataframe by number of rows and number of cols
dim(bird_observations)
[1] 40425     9
# returns the top six rows of the dataframe
head(bird_observations)
# A tibble: 6 × 9
  survey_id site_id species_id distance bird_count notes  seen heard direction
      <dbl> <chr>   <chr>      <chr>         <dbl> <chr> <dbl> <dbl> <chr>    
1       144 LI-S    HOSP       5-10              4 <NA>      1     1 NE       
2       145 LI-W    HOSP       20-40            10 <NA>      0     1 E        
3       145 LI-W    AUWA       20-40             2 <NA>      0     1 SE       
4       145 LI-W    RODO       FT                2 <NA>      1     0 E        
5       145 LI-W    GTGR       >40               2 <NA>      0     1 NE       
6       145 LI-W    WCSP       20-40             3 <NA>      0     1 N        
# returns all the columns and some info about the cols
glimpse(bird_observations)
Rows: 40,425
Columns: 9
$ survey_id  <dbl> 144, 145, 145, 145, 145, 145, 145, 145, 145, 145, 145, 145,…
$ site_id    <chr> "LI-S", "LI-W", "LI-W", "LI-W", "LI-W", "LI-W", "LI-W", "LI…
$ species_id <chr> "HOSP", "HOSP", "AUWA", "RODO", "GTGR", "WCSP", "WCSP", "GT…
$ distance   <chr> "5-10", "20-40", "20-40", "FT", ">40", "20-40", "20-40", "F…
$ bird_count <dbl> 4, 10, 2, 2, 2, 3, 3, 2, 2, 3, 1, 10, 3, 1, 6, 6, 20, 12, 2…
$ notes      <chr> NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA,…
$ seen       <dbl> 1, 0, 0, 1, 0, 0, 0, 1, 1, 1, 1, 0, 1, 1, 0, 1, 1, 0, 1, 0,…
$ heard      <dbl> 1, 1, 1, 0, 1, 1, 1, 1, 0, 0, 0, 1, 1, 1, 1, 1, 0, 1, 1, 1,…
$ direction  <chr> "NE", "E", "SE", "E", "NE", "N", "E", "E", "S", "E", "NE", …
# similar to glimpse but returns some summary statistics about the cols
summary(bird_observations)
   survey_id      site_id           species_id          distance        
 Min.   :   1   Length:40425       Length:40425       Length:40425      
 1st Qu.: 570   Class :character   Class :character   Class :character  
 Median :1028   Mode  :character   Mode  :character   Mode  :character  
 Mean   :1043                                                           
 3rd Qu.:1550                                                           
 Max.   :2001                                                           
                                                                        
   bird_count          notes                seen            heard       
 Min.   :   1.000   Length:40425       Min.   :0.0000   Min.   :0.0000  
 1st Qu.:   1.000   Class :character   1st Qu.:1.0000   1st Qu.:0.0000  
 Median :   2.000   Mode  :character   Median :1.0000   Median :0.0000  
 Mean   :   2.938                      Mean   :0.8463   Mean   :0.4967  
 3rd Qu.:   3.000                      3rd Qu.:1.0000   3rd Qu.:1.0000  
 Max.   :1000.000                      Max.   :1.0000   Max.   :1.0000  
 NA's   :33                                                             
  direction        
 Length:40425      
 Class :character  
 Mode  :character  
                   
                   
                   
                   
# returns column names 
names(bird_observations)
[1] "survey_id"  "site_id"    "species_id" "distance"   "bird_count"
[6] "notes"      "seen"       "heard"      "direction" 
# returns unique values in a column. In this case we can see all the different bird species IDs
unique(bird_observations$species_id)
  [1] "HOSP" "AUWA" "RODO" "GTGR" "WCSP" "MODO" "NOMO" "EUST" "ANHU" "CBTH"
 [11] "INDO" "GIWO" "HOFI" "VERD" "GHJU" "ORJU" "KILL" "BCHU" "ABTO" "CACW"
 [21] "WEME" "CORA" "NOHA" "NOFL" "PYRR" "CANW" "GIFL" "SCJU" "BRCR" "RCKI"
 [31] "COHU" "BGGN" "SAPH" "AMKE" "HAHA" "HOLA" "LOSH" "AMGO" "BEVI" "OCWA"
 [41] "BRSP" "COYE" "SPTO" "WBNU" "noca" "BTSP" "ROWR" "PHAI" "RWBL" "FEHA"
 [51] "COHA" "RTHA" "ATFL" "BHCO" "BRBL" "UNDO" "SOVI" "MALL" "SSHA" "CEDW"
 [61] "AMRO" "WEBL" "GAQU" "LEWO" "unwa" "CAGO" "LEGO" "broc" "WWDO" "COFL"
 [71] "SOSP" "NRWS" "GBHE" "WEKI" "NAWA" "LBWO" "BTGN" "YWAR" "UDEJ" "BCNH"
 [81] "BTYW" "LUWA" "CLSW" "CAKI" "UNTA" "PFLB" "UNHA" "HRSH" "BETH" "UNWO"
 [91] "RSFL" "WIWA" "WEFL" "unhu" "GRRO" "ECDO" "BROC" "HOOR" "NOCA" "TOWA"
[101] "YHBL" "WTSW" "UNTH" "RUHU" "WETA" "AMCO" "LENI" "UYRW" "LASP" "RNSA"
[111] "UNFL" "BLPH" "MGWA" "TUVU" "UNBL" "TRES" "UNSP" "GREG" "SNEG" "UNSW"
[121] "CHSP" "WAVI"
NoteQuestion 2b

What are the primary and foreign keys for the tables bird_observations and taxalist? Recall that a primary key is a unique identifier for each observed entity, one per row. And a foreign key references to a primary key in another table (linkage).

Hint: First identify the primary keys for all the tables, then identify the foreign keys.

Answer
  • bird_observations: Primary key is a compound key made up of survey_id, site_id, and species_id. The foreign key is species_id.
  • taxalist: Primary key is species_id and does not have a foreign key that match the primary key in bird_observations.

However, we could join bird_observations and taxalist by species_id, but depending on the type of join some values would be droped or NAs would be introduce in the resulting data frame.

3 Create a subset of bird_observations

NoteQuestion 3

Write code to create a subset of bird_observations called birds_subset that only contains observations for birds with species id BHCO and RWBL, and from sites with site ID LI-W and NU-C.

Hint: What function do you use to subset data by rows?

birds_subset <- bird_observations %>% 
  filter(species_id %in% c("BHCO", "RWBL")) %>% 
  filter(site_id %in% c("LI-W", "NU-C"))

4 Use left_join() to merge birds_subset with the tables sites

NoteQuestion 4a

First, answer: what do you expect the outcome data frame when doing left_join() between birds_subset and sites to look like? What observations do you expect in the outcome data frame.

You can use paper to draw if that helps you or talk to your neighbor. Write down the steps and expected outcome in your Quarto Document.

Answer

I expect to see all columns and all observations from birds_subset and from sites, I expect to see the columns park_code, park_district, park-name, point_code, point_location and park_acreage and only observations for NU-C and LI-W because those are the only site_id values in birds_subset and in a left join only the observations matching the left table (in this case, birds_subset is the left table) will be kept.

NoteQustion 4b

Use a left join to update birds_subset so that it also includes sites information. For each join, include an explicit argument saying which key you are joining by (even if it will just assume the correct one for you). Store the updated data frame as birds_left. Make sure to look at the output - is what it contains consistent with what you expected it to contain?

# syntax using pipe
birds_left <- birds_subset %>% 
  left_join(y = sites, by = "site_id")
# don't see x = birds_subset here because piping in birds_subset means it automatically assumes birds_subset as x.

# syntax without pipe
birds_left <- left_join(x = birds_subset, y = sites, by = "site_id")

5 Use full_join() to merge birds_subset and sites tables

NoteQuestion 5a

First, answer: what do you expect a full_join() between birds_subset and sites to contain? Write this in your Quarto Document or tell a neighbor.

Answer

I expect to see all columns and all observations from birds_subset and all columns and all observations from sites to be merged into one data frame because in a full join everything is kept. NA values could be introduced.

NoteQuestions 5b

Write code to full_join() the birds_subset and sites data into a new object called birds_full. Explicitly include the variable you’re joining by. Look at the output. Is it what you expected?

# syntax using pipe
birds_full <- birds_subset %>% 
  full_join(y = sites, by = "site_id")

# syntax without pipe
birds_full <- full_join(x = birds_subset, y = sites, by = "site_id")

6 Use inner_join() to merge birds_subset and taxalist data

NoteQuestion 6a

First, answer: what do you expect an inner_join() between birds_subset and taxalist to contain? Write this in your Quarto Document or tell a neighbor.

Answer

I expect to only have data merge together based on species_id and since there is only BHCO and RWBL in birds_subset then I will only retain data related to those two species. I will also expect to see the columns from taxalist: common_name and asu_itis to be merged into the joined table.

NoteQuestion 6b

Write code to inner_join() the birds_subset and taxalist, called birds_inner. Include an argument for what variable you’ll be joining by. Make sure you check the output.

# syntax using pipe
birds_inner <- birds_subset %>% 
  inner_join(y = taxalist, by = "species_id")

# syntax without pipe
birds_inner <- inner_join(x = birds_subset, y = taxalist, by = "species_id" )
NoteQuestion 6c

What would you get if instead of inner_join() you’d used left_join() for this example? Write code for the left join and check.

# syntax using pipe
birds_inner_left <- birds_subset %>% 
  left_join(y = taxalist, by = "species_id")

# syntax without pipe
birds_inner_left <- left_join(x = birds_subset, y = taxalist, by = "species_id")
NoteQuestion 6d

Why does that make sense for this scenario? In what case would you expect the outcome to differ from an inner_join()? Write this in your Quarto Document or tell a neighbor.

Answer

You have the same resulting data set regardless of using inner_join() or left_join() to merge bird_subset and taxalist. The reasons for this are:

  • inner_join() keeps only the rows (observations) that have a matching key across both data sets - here, species_id is our key, and the only rows that match across both data sets are those where species_id equals BHCO or RWBL

  • left_join() keeps all rows from the left table (in our case, the left table is birds_subset) and merges on data with matching keys (species_id) on the right (here, the right table is taxalist). Because our left data set (birds_subset) only contains species_ids equal to BHCO or RWBL, only rows with those species will be kept from the right data set (taxalist)

You’d expect the outcome to differ from an inner_join() if birds_subset contained an observation with a species_id that was not found in taxalist. If there was an observation of a species_id in birds_subset that was not in taxalist, then that observation would be kept, and NAs would be assigned to the common_name and asu_itis columns for that observations

Exercise 2: Practice Wrangling & Joining Data

7 Wrangle bird_observations data and merge the data with all the other tables (sites, surveys, and taxalist)

NoteQuestion 7a

Starting with your object bird_observations, rename the notes column to bird_obs_notes (so this doesn’t conflict with notes in the surveys table).

bird_observations <- bird_observations %>% 
  rename(bird_obs_notes = notes)
NoteQuestion 7b
  • Create a subset that contains all observations in the birds_observations data frame,
  • then join the taxalist, sites and surveys tables to it,
  • and finally limit to only columns survey_date, common_name, park_name, bird_count, and observer.

Hint: What function do you use to subset data by columns?

bird_obs_subset <- bird_observations %>% 
  full_join(y = taxalist, by = "species_id") %>% 
  full_join(y = sites, by = "site_id") %>% 
  full_join(y = surveys, by = c("site_id", "survey_id")) %>%  
  select(survey_date, common_name, park_name, bird_count, observer)

8 Explore observer data and fix the values within this column so that all values are in the same format

NoteQuestion 8a

Continuing with bird_obs_subset, first use unique() to see the different unique values in the column observer. How many observers are there? Which value is unlike the others?

unique(bird_obs_subset$observer)
[1] "B. Rambo"   "J. Lemmer"  "D. Stuart"  "C. Putnam"  "S. Lerman" 
[6] "Josh Burns" NA          
NoteQuestion 8b

Replace “Josh Burns” with a format that matches the other observer names. Then use unique() again to check your work.

Hint: What function do you use when you are making a change to an entire column?

bird_obs_subset <- bird_obs_subset %>% 
  mutate(observer = if_else(condition = observer == "Josh Burns", 
                            true = "J. Burns", 
                            false = observer))

unique(bird_obs_subset$observer)
[1] "B. Rambo"  "J. Lemmer" "D. Stuart" "C. Putnam" "S. Lerman" "J. Burns" 
[7] NA         
ImportantSave your work and dont’s forget the Git and GitHub Workflow

After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage (add) -> Commit -> Pull -> Push

9 Bonus: Use a new package lubridate to wrangle the date data and find the total number of birds by park and month

Hint: How do you learn about a new function or package?

NoteBonus Question(s)
  1. Use lubridate::month() to add a new column to bird_obs_subset called survey_month, containing only the month number. Then, convert the month number to a factor (again within mutate()).

  2. Use dplyr::relocate() to move the new survey_month column to immediately after the survey_date column. You can do this in a separate code chunk, or pipe straight into it from your existing code.

  3. Filter to only include parks Lindo, Orme, Palomino, and Sonrisa.

  4. Find the total number of birds observed by park and month (Hint: You can use group_by() and summarize()).

bird_obs_subset <- bird_obs_subset %>% 
  mutate(survey_month = lubridate::month(survey_date)) %>% 
  mutate(survey_month = as.factor(survey_month)) %>% 
  dplyr::relocate(survey_month, .after = survey_date) %>% 
  filter(park_name %in% c("Lindo", "Orme", "Palomino", "Sonrisa")) %>% 
  group_by(park_name, survey_month) %>% 
  summarize(tot_bird_count_month = n())
`summarise()` has grouped output by 'park_name'. You can override using the
`.groups` argument.

Take a look at your final data frame. Does it give you the outcome you expected? Is it informative? How would you improve this wrangling process?