Practice Session: Collaborative Report

Learning Objectives

Practice using common cleaning and wrangling functions
Practice creating plots using common visualization functions in ggplot
Practice saving and sharing data visualizations
Practice Git and GitHub workflow and collaborating with a colleague

Acknowledgements

These exercises are adapted from Allison Horst’s EDS 221: Scientific Programming Essentials Course for the Bren School’s Master of Environmental Data Science program.

About the data

These exercises will be using data on abundance, size, and trap counts (fishing pressure) of California spiny lobster (Panulirus interruptus) and were collected along the mainland coast of the Santa Barbara Channel by Santa Barbara Coastal LTER researchers.

Your task: Collaborate on an analysis and create a report to publish using GitHub Pages.

Setup

1. Create a new repository with a partner

a.  Determine who is the Owner and who is the Collaborator
b.  The Owner creates a repository on GitHub titled with both your
    names (i.e. If Casey and Camila were partners, and Casey is the
    Owner, she would create a repo called `casey-camila`)
    i. Create a SEPARATE repo from your import-template-{USERNAME} assignments 
    ii.  When creating the repository, add a brief description (i.e.
        R Practice Session: Collaborating on, Wrangling &
        Visualizing Data), keep the repo Public, and Initialize the
        repo with a `README` file and an R `.gitignore` template.
c.  The Owner adds the Collaborator to the repo
d.  Both the Collaborator and the Owner clone the repo into their
    RStudio
e.  Both the Collaborator and the Owner run `git config pull.rebase false` in the Terminal to set the Git default strategy for pulling. -->

**Step 2 and Step 3 are meant to be completed at the same time. - Collaborator completes Step 2 - Owner completes Step 3

2. Collaborator

Create new files for exercise
a.  The Collaborator creates the following directory (folder): `analysis`
b.  After creating the directories, create the following Quarto
    Documents and store them in the listed folders:
    i.  Title it: "Owner Analysis", save it as:
        `owner-analysis.qmd`, and store in `analysis` folder
    ii. Title it: "Collaborator Analysis", save it as:
        `collaborator-analysis.qmd`, and store in `analysis` folder
    iii. Title it: "Lobster Report" and save it as:
         `lobster-report.qmd` and store in `analysis` folder
c.  After creating the files, the Collaborator will `stage (add)`,
    `commit`, write a commit message, `pull`, and `push` the files
    to the remote repository (on GitHub)
d.  The Owner `pull`s the changes and Quarto Documents into their
    local repository (their workspace)

3. Owner

Download data from the EDI Data Portal [SBC LTER: Reef:
Abundance, size and fishing effort for California Spiny Lobster
(Panulirus interruptus), ongoing since
2012](https://portal.edirepository.org/nis/mapbrowse?packageid=knb-lter-sbc.77.8).
a.  Create two new directories one called `data` and one called
    `figs`
    i.  *Note: Git does not track empty directories, so you won't
        see `figs` when you push to GitHub*
b.  Download the following data and upload them to the `data`
    folder:
    i.  Time-series of lobster abundance and size
    ii. Time-series of lobster trap buoy counts
c.  After creating the `data` folder and adding the data, the Owner
    will `stage (add)`, `commit`, write a commit message,`pull`, and
    `push` the files to the remote repository (on GitHub)
d.  The Collaborator `pull`s the changes and data into their local
    repository (their workspace)

1 Explore, clean and wrangle data

For this portion of the exercise, the - Owner will be working with the lobster abundance and size data - Collaborator will be working with the lobster trap buoy counts data

Questions 1-3 you will be working independently since you’re working with different data frames, but you’re welcome to check in with each other.

Owner
Collaborator

Setup

Open the Quarto Document owner-analysis.qmd
1. Check the YAML and add your name to the author field
2. Create a new section with a level 2 header and title it “Exercise: Explore, Clean, and Wrangle Data”
Load the following libraries at the top of your Quarto Document

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(here)

Read in the data and store the data frame as lobster_abundance

lobster_abundance <- read_csv(here::here("data/Lobster_Abundance_All_Years_20220829.csv"))

Look at your data. Take a minute to explore what your data structure looks like, what data types are in the data frame, or use a function to get a high-level summary of the data you’re working with.
Use the Git workflow: Stage (add) -> Commit -> Pull -> Push
- Note: You also want to Pull when you first open a project

1.0.1 Convert missing values using `mutate()` and `na_if()`

Exercise 1 - Question 1

The variable SIZE_MM uses -99999 as the code for missing values (see metadata or use unique()). This has the potential to cause conflicts with our analyses, so let’s convert -99999 to an NA value. Do this using mutate() and na_if(). Look up the help page to see how to use na_if(). Check your output data using unique().

Answer

lobster_abundance <- lobster_abundance %>% 
    mutate(SIZE_MM = na_if(SIZE_MM, -99999))

1.0.2 `filter()` practice

Exercise 2 - Question 2

Create and store a subset that does NOT include observations from Naples Reef (NAPL). Check your output data frame to ensure that NAPL is NOT in the data frame.

Answer

not_napl <- lobster_abundance %>% 
    filter(SITE != "NAPL")

Exercise 3 - Question 3

Create and store a subset with lobsters at Arroyo Quemado (AQUE) AND with a carapace length greater than 70 mm. Check your output.

Answer

aque_70mm <- lobster_abundance %>% 
    filter(SITE == "AQUE" & SIZE_MM >= 70)

Exercise 4 - Question 4

Find the maximum carapace length using max() and group by SITE and MONTH. Think about how you want to treat the NA values in SIZE_MM (Hint: check the arguments in max()). Check your output.

Answer

# `group_by() %>% summarize()` practice

max_lobster <- lobster_abundance %>% 
  group_by(SITE, MONTH) %>% 
  summarize(MAX_LENGTH = max(SIZE_MM, na.rm = TRUE))

Save your work and don’t forget the Git and GitHub workflow!

After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage (add) -> Commit -> Pull -> Push

Setup

Open the Quarto Document collaborator-analysis.qmd
1. Check the YAML and add your name to the author field
2. Create a new section with a level 2 header and title it “Exercise: Explore, Clean, and Wrangle Data”
Load the following libraries at the top of your Quarto Document.

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)
library(here)

Read in the data and store the data frame as lobster_traps

lobster_traps <- read_csv(here::here("data/Lobster_Trap_Counts_All_Years_20210519.csv"))

Look at your data. Take a minute to explore what your data structure looks like, what data types are in the data frame, or use a function to get a high-level summary of the data you’re working with.
Use the Git workflow: Stage (add) -> Commit -> Pull -> Push
- Note: You also want to Pull when you first open a project

1.0.3 Convert missing values using `mutate()` and `na_if()`

Exercise 5 - Question 1

The variable TRAPS uses -99999 as the code for missing values (see metadata or use unique()). This has the potential to cause conflicts with our analyses, so let’s convert -99999 to an NA value. Do this using mutate() and na_if(). Look up the help page to see how to use na_if(). Check your output data using unique().

Answer

lobster_traps <- lobster_traps %>% 
    mutate(TRAPS = na_if(TRAPS, -99999))

1.0.4 `filter()` practice

Exercise 6 - Question 2

Create and store a subset that does NOT include observations from Naples Reef (NAPL). Check your output data frame to ensure that NAPL is NOT in the data frame.

Answer

not_napl <- lobster_traps %>% 
    filter(SITE != "NAPL")

Exercise 7 - Question 3

Create and store a subset with lobsters at Carpinteria Reef (CARP) AND number of commercial trap floats is greater than 20. Check your output.

Answer

carp_20_traps <- lobster_traps %>% 
    filter(SITE == "CARP" & TRAPS > 20)

Exercise 8 - Question 4

Find the maximum number of commercial trap floats using max() and group by SITE and MONTH. Think about how you want to treat the NA values in TRAPS (Hint: check the arguments in max()). Check your output.

Answer

# `group_by() %>% summarize()` practice
max_lobster_traps <- lobster_traps %>% 
    group_by(SITE, MONTH) %>%
    summarize(MAX_TRAPS = max(TRAPS, na.rm = TRUE))

Save your work and Don’t forget the Git and GitHub workflow!

After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage (add) -> Commit -> Pull -> Push

2 Create visually appealing and informative data visualization

Owner
Collaborator

Setup

Stay in the Quarto Document owner-analysis.qmd and create a new section with a level 2 header and title it “Exercise: Data Visualization”

Structure of the data visualization exercises:

In this section, you will first have you create the necessary subsets to create the data visualizations, as well as the basic code to create a visualization.
The next step is to return to the data visualization code you’ve written and add styling code to it. For this exercise, only add styling code to the visualization you want to include in the lobster-report.qmd (start with just one plot and if there’s time add styling code to another plot).
Lastly, save the final visualizations to the figs folder before collaborating on the lobster-report.qmd.

## Run this chunk to test exercises ##

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

lobster_abundance <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.77.8&entityid=f32823fba432f58f66c06b589b7efac6") %>% 
    mutate(SIZE_MM = na_if(SIZE_MM, -99999))

Exercise 9 - Question 5

Create a multi-panel plot of lobster carapace length (SIZE_MM) using ggplot(), geom_histogram(), and facet_wrap(). Use the variable SITE in facet_wrap(). Use the object lobster_abundance.

Answer

ggplot(data = lobster_abundance, 
       aes(x = SIZE_MM)) +
    geom_histogram() +
    facet_wrap(~SITE)

Plots

Exercise 10 - Question 6

Create a line graph of the number of total lobsters observed (y-axis) by year (x-axis) in the study, grouped by SITE.

Answer

First, you’ll need to create a new dataset subset called lobsters_summarize:

Group the data by SITE AND YEAR
Calculate the total number of lobsters observed using count()

lobsters_summarize <- lobster_abundance %>% 
  group_by(SITE, YEAR) %>% 
  summarize(COUNT = n())

Next, create a line graph using ggplot() and geom_line(). Use geom_point() to make the data points more distinct, but ultimately up to you if you want to use it or not. We also want SITE information on this graph, do this by specifying the variable in the color argument. Where should the color argument go? Inside or outside of aes()? Why or why not?

# line plot
ggplot(data = lobsters_summarize, aes(x = YEAR, y = COUNT)) +
  geom_line(aes(color = SITE)) 

# line and point plot
ggplot(data = lobsters_summarize, aes(x = YEAR, y = COUNT)) +
  geom_point(aes(color = SITE)) +
  geom_line(aes(color = SITE))

Plots

Exercise 11 - Question 7

Create a bar graph that shows the amount of small and large sized carapace lobsters at each SITE from 2019-2021. Note: The small and large metrics are completely made up and are not based on any known facts.

Answer

First, you’ll need to create a new dataset subset called lobster_size_lrg:

filter() for the years 2019, 2020, and 2021
Add a new column called SIZE_BIN that contains the values “small” or “large”. A “small” carapace size is <= 70 mm, and a “large” carapace size is greater than 70 mm. Use mutate() and if_else(). Check your output
Calculate the number of “small” and “large” sized lobsters using group() and summarize(). Check your output
Remove the NA values from the subsetted data. Hint: check out drop_na(). Check your output

lobster_size_lrg <- lobster_abundance %>%
    filter(YEAR %in% c(2019, 2020, 2021)) %>%
    mutate(SIZE_BIN = if_else(SIZE_MM <= 70, true = "small", false = "large")) %>%
    group_by(SITE, SIZE_BIN) %>%
    summarize(COUNT = n()) %>%
    drop_na()

Next, create a bar graph using ggplot() and geom_bar(). Note that geom_bar() automatically creates a stacked bar chart. Try using the argument position = "dodge" to make the bars side by side. Pick which bar position you like best.

# bar plot
ggplot(data = lobster_size_lrg, aes(x = SITE, y = COUNT, fill = SIZE_BIN)) +
    geom_col()

# dodged bar plot
ggplot(data = lobster_size_lrg, aes(x = SITE, y = COUNT, fill = SIZE_BIN)) +
    geom_col(position = "dodge")

Plots

Exercise 12 - Question 8

Go back to your visualization code and add some styling code (aka make your plots pretty!). Again, start with one plot and if there’s time add styling code to additional plots. Here’s a list of functions to help you get started (this is not an exhaustive list!) or revisit the data visualization lesson:

labs(): modifying axis, legend and plot labels
theme_(): add a complete theme to your plot (i.e. theme_light())
theme(): use to customize non-data components of a plot. We’ve listed out some parameters here, but run ?theme to see the full list (there’s a lot of customization you can do!)
- axis.title.y
- panel.background
- plot.background
- panel.grid.major.*
- text
scale_*_date(): use with dates and update breaks, limits, and labels
scale_*_continuous(): use with continuous variables and update breaks, limits, and labels
scale_*_discrete(): use with discrete variables and update breaks, limits, and labels
scales package: use this within the above scale functions and you can do things like add percents to axes labels
geom_() within a geom function you can modify:
- fill: updates fill colors (e.g. column, density, violin, & boxplot interior fill color)
- color: updates point & border line colors (generally)
- shape: update point style
- alpha: update transparency (0 = transparent, 1 = opaque)
- size: point size or line width
- linetype: update the line type (e.g. “dotted”, “dashed”, “dotdash”, etc.)

Once you’re happy with how your plot looks, assign it to an object, and save it to the figs directory using ggsave()

Save your work and Don’t forget the Git and GitHub workflow!

After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage (add) -> Commit -> Pull -> Push

Setup

Stay in the Quarto Document collaborator-analysis.qmd and create a new section with a level 2 header and title it “Exercise: Data Visualization”

Structure of the data visualization exercises:

First you will create the necessary subsets to create the data visualizations, as well as the basic code to create a visualization.
Then, you will return to the data visualization code you’ve written and add styling code to it. For this exercise, only add styling code to the visualization you want to include in the lobster-report.qmd (start with just one plot and if there’s time add styling code to another plot).
Lastly, save the final visualizations to the figs folder before collaborating on the lobster-report.qmd.

## Run this chunk to test exercises ##

library(readr)
library(dplyr)
library(ggplot2)
library(tidyr)

lobster_traps <- read_csv("https://portal.edirepository.org/nis/dataviewer?packageid=knb-lter-sbc.77.8&entityid=66dd61c75bda17c23a3bce458c56ed84") %>% 
    mutate(TRAPS = na_if(TRAPS, -99999))

Exercise 13 - Question 5

Create a multi-panel plot of lobster commercial traps (TRAPS) grouped by year, using ggplot(), geom_histogram(), and facet_wrap(). Use the variable YEAR in facet_wrap(). Use the object lobster_traps.

Answer

ggplot(data = lobster_traps, aes(x = TRAPS)) +
    geom_histogram() +
    facet_wrap( ~ YEAR)

Plots

Exercise 14 - Question 6

Create a line graph of the number of total lobster commercial traps observed (y-axis) by year (x-axis) in the study, grouped by SITE.

Answer

First, you’ll need to create a new dataset subset called lobsters_traps_summarize:

Group the data by SITE AND YEAR
Calculate the total number of lobster commercial traps observed using sum(). Look up sum() if you need to. Call the new column TOTAL_TRAPS. Don’t forget about NAs here!

lobsters_traps_summarize <- lobster_traps %>% 
  group_by(SITE, YEAR) %>% 
  summarize(TOTAL_TRAPS = sum(TRAPS, na.rm = TRUE))

# line plot
ggplot(data = lobsters_traps_summarize, aes(x = YEAR, y = TOTAL_TRAPS)) +
    geom_line(aes(color = SITE))

# line and point plot
ggplot(data = lobsters_traps_summarize, aes(x = YEAR, y = TOTAL_TRAPS)) +
    geom_point(aes(color = SITE)) +
    geom_line(aes(color = SITE))

Plots

Exercise 15 - Question 7

Create a bar graph that shows the amount of high and low fishing pressure of lobster commercial traps at each SITE from 2019-2021. Note: The high and low fishing pressure metrics are completely made up and are not based on any known facts.

Answer

First, you’ll need to create a new dataset subset called lobster_traps_fishing_pressure:

filter() for the years 2019, 2020, and 2021
Add a new column called FISHING_PRESSURE that contains the values “high” or “low”. A “high” fishing pressure has exactly or more than 8 traps, and a “low” fishing pressure has less than 8 traps. Use mutate() and if_else(). Check your output
Calculate the number of “high” and “low” observations using group() and summarize(). Check your output
Remove the NA values from the subsetted data. Hint: check out drop_na(). Check your output

lobster_traps_fishing_pressure <- lobster_traps %>% 
    filter(YEAR %in% c(2019, 2020, 2021)) %>%
    mutate(FISHING_PRESSURE = if_else(TRAPS >= 8, true = "high", false = "low")) %>%
    group_by(SITE, FISHING_PRESSURE) %>%
    summarize(COUNT = n()) %>%
    drop_na()

# bar plot
ggplot(data = lobster_traps_fishing_pressure, aes(x = SITE, y = COUNT, fill = FISHING_PRESSURE)) +
    geom_col()

# dodged bar plot
ggplot(data = lobster_traps_fishing_pressure, aes(x = SITE, y = COUNT, fill = FISHING_PRESSURE)) +
    geom_col(position = "dodge")

Plots

Exercise 16 - Question 8

Go back to your visualization code and add some styling code (aka make your plots pretty!). Again, start with one plot and if there’s time add styling code to additional plots. Here’s a list of functions to help you get started (this is not an exhaustive list!) or revisit the data visualization lesson:

labs(): modifying axis, legend and plot labels
theme_(): add a complete theme to your plot (i.e. theme_light())
theme(): use to customize non-data components of a plot. We’ve listed out some parameters here, but run ?theme to see the full list (there’s a lot of customization you can do!)
- axis.title.y
- panel.background
- plot.background
- panel.grid.major.*
- text
scale_*_date(): use with dates and update breaks, limits, and labels
scale_*_continuous(): use with continuous variables and update breaks, limits, and labels
scale_*_discrete(): use with discrete variables and update breaks, limits, and labels
scales package: use this within the above scale functions and you can do things like add percents to axes labels
geom_() within a geom function you can modify:
- fill: updates fill colors (e.g. column, density, violin, & boxplot interior fill color)
- color: updates point & border line colors (generally)
- shape: update point style
- alpha: update transparency (0 = transparent, 1 = opaque)
- size: point size or line width
- linetype: update the line type (e.g. “dotted”, “dashed”, “dotdash”, etc.)

Once you’re happy with how your plot looks, assign it to an object, and save it to the figs directory using ggsave()

Save your work and Don’t forget the Git and GitHub workflow!

After you’ve completed the exercises or reached a significant stopping point, use the workflow: Stage (add) -> Commit -> Pull -> Push

3 Collaborate on a report and publish using GitHub pages

The final step! Time to work together again. Collaborate with your partner to create a report to publish to GitHub pages. You and your partner will combine some of your analysis into a single new qmd file, called lobster-report.qmd. Then you will render the file into html and make sure that it is visible using GitHub pages (see previous chapter from Week 11).

Code Review

As you’re working on the lobster-report.qmd you will be conducting two types of code reviews: (1) pair programming and (2) lightweight code review.

Pair programming is where two people develop code together at the same workstation. One person is the “driver” and one person is the “navigator”. The driver writes the code while the navigator observes the code being typed, points out any immediate quick fixes, and will also Google / troubleshoot if errors occur. Both the Owner and the Collaborator should experience both roles, so switch halfway through or at a meaningful stopping point.
A lightweight code review is brief and you will be giving feedback on code readability and code logic as you’re adding Owner and Collaborator code from their respective analysis.qmds to the lobster-report.qmd. Think of it as a walk through of your the code for the data visualizations you plan to include in the report (this includes the code you wrote to create the subset for the plot and the code to create the plot) and give quick feedback.

4 Assignment a07

Make sure your Quarto Document (lobster-report.qmd) is well organized and includes the following elements:

a title and your names
citation of the data, including the date downloaded and a working link
brief summary of the abstract (i.e. 1-2 sentences) from the EDI Portal
Owner analysis and visualizations (you choose which plots you want to include)
- Try adding alternative text to your plots (See Quarto Documentation)
- Plots can be added either with the data visualization code or with Markdown syntax (calling a saved image) - it’s up to you if you want to include the code or not.
Collaborator analysis and visualizations (you choose which plots you want to include)
- Try adding alternative text to your plots (See Quarto Documentation)
- plots can be added either with the data visualization code or with Markdown syntax (calling a saved image) - it’s up to you if you want to include the code or not.
a summary or discussion section that shows you gained some insight into lobsters through your plots
- make sure you push all your “html” files to GitHub, so the plots show correctly

Finally, publish on GitHub pages (from Owner’s repository). Refer back to Chapter 12 for steps on how to publish using GitHub pages.

Put a link to your GitHub page in the Assignment a07 on Canvas.

5 Bonus: Add marine protected area (MPA) designation to the data

The sites IVEE and NAPL are marine protected areas (MPAs). Add this designation to your data set using a new function called case_when(). Then create some new plots using this new variable. Does it change how you think about the data? What new plots or analysis can you do with this new variable?

Lobster Abundance & Size Data
Lobster Trap Buoy Counts Data

Use the object lobster_abundance and add a new column called DESIGNATION that contains “MPA” if the site is IVEE or NAPL, and “not MPA” for all other values.

lobster_mpa <- lobster_abundance %>% 
    mutate(DESIGNATION = case_when(
    SITE %in% c("IVEE", "NAPL") ~ "MPA",
    SITE %in% c("AQUE", "CARP", "MOHK") ~ "not MPA"
  ))

Use the object lobster_traps and add a new column called DESIGNATION that contains “MPA” if the site is IVEE or NAPL, and “not MPA” for all other values.

lobster_traps_mpa <- lobster_traps %>%
    mutate(DESIGNATION = case_when(
    SITE %in% c("IVEE", "NAPL") ~ "MPA",
    SITE %in% c("AQUE", "CARP", "MOHK") ~ "not MPA"
  ))

About the data

1. Create a new repository with a partner

2. Collaborator

3. Owner

1 Explore, clean and wrangle data

1.0.1 Convert missing values using mutate() and na_if()

1.0.2 filter() practice

1.0.3 Convert missing values using mutate() and na_if()

1.0.4 filter() practice

2 Create visually appealing and informative data visualization

3 Collaborate on a report and publish using GitHub pages

4 Assignment a07

5 Bonus: Add marine protected area (MPA) designation to the data

1.0.1 Convert missing values using `mutate()` and `na_if()`

1.0.2 `filter()` practice

1.0.3 Convert missing values using `mutate()` and `na_if()`

1.0.4 `filter()` practice