After completing this session, you will be able to:

- Identify some common dplyr and tidyr functions
- Define “namespace” in the context of R functions
- Explain how the pipe operator (%>%) can be used to streamline code operations
- Paraphrase how mutate() can be used to do quality control operations
- Explain what “shape” means in the context of data and how data of one shape can be converted to another
- Describe how defensive coding strategies benefit you in the long run and how rename() is a good example of such strategies
- Explain how group_by() and summarize() relate to the “Split-Apply-Combine” strategy
1 Introduction
The data we get to work with are rarely, if ever, in the format we need to do our analyses. It’s often the case that one package requires data in one format, while another package requires the data to be in another format. To be efficient analysts, we should have good tools for reformatting data for our needs so we can do further work like making plots and fitting models. The dplyr and tidyr R packages provide a fairly complete and extremely powerful set of functions for us to do this reformatting quickly. Learning these tools well will greatly increase your efficiency as an analyst.
Let’s look at two motivating examples.
2 The dplyr and tidyr Packages
The dplyr and tidyr packages are two particularly powerful and often-used packages for data wrangling. This lesson will cover examples to learn about some of the functions you’ll most commonly use from those packages. See the tabs below for a brief overview of those tools.
| Function name | Description |
|---|---|
| `mutate()` | Creates and modifies columns |
| `group_by()` | Groups data by one or more variables |
| `summarize()` | Summarizes each group down to one row |
| `select()` | Keeps or drops columns using their names |
| `filter()` | Keeps rows that match conditions |
| `arrange()` | Orders rows by one or more columns |
| `rename()` | Renames columns |
| Function name | Description |
|---|---|
| `pivot_longer()` | Reshapes data from a wide to a long format |
| `pivot_wider()` | Reshapes data from a long to a wide format |
| `unite()` | Unites multiple columns into one by pasting strings together |
| `separate_wider_delim()` | Separates one column into multiple columns based on a delimiter (e.g., `_`, `-`, `.`) |
3 Data Cleaning Fundamentals
To demonstrate, we’ll be working with a tidied up version of a data set from the Alaska Department of Fish & Game containing commercial catch data from 1878-1997. The data set and reference to the original source can be found at its public archive.
First, open a new Quarto document. Delete everything below the YAML, and add a new chunk (name it “setup”) that loads the dplyr, tidyr, and readr packages. As we learn more packages, it may become easier to simply load the tidyverse meta-package that contains all these and more - but for now, we’ll load the individual packages one by one.
```{r setup}
library(dplyr)
library(tidyr)
library(readr)
```

You may have noticed the following messages pop up when you ran your library chunk.
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
filter, lag
The following objects are masked from ‘package:base’:
intersect, setdiff, setequal, union
These important messages notify you that certain functions from the stats and base packages (which are loaded by default when you start R) are masked by functions with the same name in the dplyr package. It turns out the order in which you load packages matters: since we loaded dplyr after stats, R will assume that if you call filter(), you mean the dplyr version unless you specify otherwise.
You can easily specify which package’s version of a function you want to use by “namespacing” that function. To namespace a function, we use the syntax package_name::function_name(...). So, if we wanted to call the stats version of filter() in this Quarto document, we would use the syntax stats::filter(...).
For this lesson, we will explicitly write every function call using namespacing so you can see which package each function is coming from. But most of the time this is not necessary. Two instances where namespacing is a good idea:
- When multiple packages have a function with the same name (see the examples of masked functions above)
- When you are calling an obscure function from an obscure package, so any collaborators know the source of that function.
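As a minimal illustration of why namespacing matters (a sketch using the built-in mtcars data set; the filter coefficients here are arbitrary), the two filter() functions do entirely different things:

```{r}
library(dplyr)

### dplyr's filter() keeps rows of a data frame that match a condition
dplyr::filter(mtcars, cyl == 6)

### stats' filter() applies linear filtering to a numeric series:
### a completely different operation that happens to share the same name
stats::filter(1:5, rep(1, 2), sides = 1)
```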
Messages and warnings are important, but we might not want them in our final document. After you have read the packages in, adjust the chunk settings in your setup chunk to suppress warnings and messages by adding #| message: false or #| warning: false. These chunk options, when set to false, prevent messages or warnings respectively from appearing in the rendered file.
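With those options added, the setup chunk might look like the following (both options shown for illustration; you may only need one):

```{r setup}
#| message: false
#| warning: false

library(dplyr)
library(tidyr)
library(readr)
```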
Now that we have introduced some data wrangling packages, let’s get the data that we are going to use for this lesson.
- Go to KNB Data Package for Alaska commercial salmon catches by management region (1886-1997)
- The link is here: knb.ecoinformatics.org/view/df35b.304.2
- Find the data file “byerlySalmonByRegion.csv”. Right click the “Download” button and select “Copy Link Address”
- Create a new code chunk (call it “read data” or similar). Paste the copied URL into the read_csv() function in quotes.
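Assuming your copied link points at the same CSV object used later in this lesson (the exact URL may differ if the data package has been updated), the chunk might look like:

```{r read data}
catch_original <- readr::read_csv(file = "https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1")
```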
This data set is relatively clean and easy to interpret as-is. While it may be clean, it’s not “tidy” - each row represents observations of several different salmon species instead of (tidy-style) a single observation.
4 Data Exploration
It is always good practice to examine the data you just read in. Doing so is important to make sure the data is read as you were expecting and to familiarize yourself with the data. Some simple ways to explore your data include:
### Return the column names of my data frame
colnames(catch_original)   # names() would return the same result here

### Return first (or last) few lines of the data frame
head(catch_original); tail(catch_original, n = 10)

### Print a summary of each column of data
summary(catch_original)

### Return unique values in a column (in this case, the region)
unique(catch_original$Region)

### Open data frame in its own tab to see each row and column (run in the Console; note the capital "V"!)
View(catch_original)
5 About the Pipe Operator (%>%)
Coding in the style of the Tidyverse typically uses the pipe operator (%>%). The pipe is a powerful way to efficiently chain together operations: The pipe will take the output of a previous statement, and use it as the input to the next statement. Let’s look at an example where we want to filter rows of a dataset based on some criteria, and then select specific columns. Here are examples with and without the pipe operator:
Without using the pipe operator, you might write something like the following:
df_filtered <- filter(df, ...)
df_selected <- select(df_filtered, ...)

If you did use the pipe, you might write something like the following instead:
df_cleaned <- df %>%
dplyr::filter(...) %>%
  dplyr::select(...)

Some people like to read the pipe operator as “and then”. So the above chunk could be translated as:
Start with the original data, and then filter it, and then select columns (from the filtered data).
Using pipes to create a sequence of steps helps make code readable and concise, and avoids storing results in a bunch of intermediate variables (or overwriting original variables). However, very long sequences can be challenging to interpret, so it is good to strike a balance between writing efficient code (chaining operations), while ensuring that your code clearly communicates, both to your future self and others, what it is doing and why.
Positron and RStudio both have a keyboard shortcut for %>%
- Windows: Ctrl + Shift + M
- Mac: Cmd + Shift + M
There are actually two pipe operators in R, and they are (mostly) interchangeable. The original %>% comes from a package called magrittr and became very popular within the R community - so popular that the R Consortium added a “native” pipe operator, |>. For purposes of this lesson, we will stick with the original %>% but you may see both pipes “in the wild” as you become more experienced with coding in R.
Depending on your settings, the keyboard shortcut may create the native pipe |> instead of the magrittr pipe %>%. If this is the case, for consistency with our lesson, go into your IDE’s settings and change it to %>% (search “pipe” in Positron’s settings).
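As a quick sketch of their (mostly) interchangeable behavior, using a toy vector:

```{r}
library(magrittr)  # provides %>% (also attached when you load dplyr)

x <- c(1, 4, 9)
x %>% sqrt() %>% sum()  # magrittr pipe
x |> sqrt() |> sum()    # native pipe (R >= 4.1); same result here
```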
6 Select, Remove, and Rename Columns Using select()
We’re ready to go back to our salmon data set. The first issue is the extra columns All and notesRegCode. Let’s select only the columns we want (implicitly removing those that we do not), and assign this to a variable called catch_data.
# Select only desired columns
catch_data <- catch_original %>%
dplyr::select(Region, Year, Chinook, Sockeye, Coho, Pink, Chum)
# Check the result
head(catch_data)

# A tibble: 6 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 SSE 1886 0 5 0 0 0
2 SSE 1887 0 155 0 0 0
3 SSE 1888 0 224 16 0 0
4 SSE 1889 0 182 11 92 0
5 SSE 1890 0 251 42 0 0
6 SSE 1891 0 274 24 0 0
Much better! The select() function also allows you to instead say which columns you don’t want, by passing column names preceded by a minus sign (-):
catch_data2 <- catch_original %>%
  dplyr::select(-All, -notesRegCode)

The select() function also lets you rename columns as you select them, using the format select(new_name = old_name); its close cousin rename() lets you rename columns with no effect on selection (all columns left in place):
catch_data3 <- catch_original %>%
dplyr::select(Region, Year, Chinook, Sockeye, Coho, Pink, Chum, total = All, notes = notesRegCode)
catch_data4 <- catch_original %>%
  dplyr::rename(total = All, notes = notesRegCode)

In each use of select(), you can choose either which columns to include or which to exclude. R won’t complain if you try to do both in the same select() call, but it does not make logical sense and your results may not be what you expect.
In base R, many people use “column indexing” (based on the column number, not name) to select columns or rename them, for example:
coho_year <- catch_data[ , c(2, 5)] ### year is column 2, Coho is column 5
names(catch_data)[5] <- "Oncorhynchus_kisutch" ### rename Coho column to scientific name

Although these methods work just fine in many cases, they do have one major drawback: they rely on you knowing the exact order of columns in the data.
To illustrate why your knowledge of column order isn’t reliable enough for these operations, consider the following scenario:
Your colleague finds data on steelhead counts and adds it into the catch data as the third column (bumping all the true salmon species one column over). The same code above will now select columns 2 and 5, but 5 is no longer Coho (Sockeye is 5, Coho is now 6), and your code has no way of even knowing there is a problem! Thus using column names is safer:
coho_year <- catch_data %>%
  dplyr::select(Year, Oncorhynchus_kisutch = Coho)

This is an example of a defensive coding strategy, where you try to anticipate issues before they arise, and write your code in such a way as to keep the issues from happening.
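Another defensive pattern, sketched here with a made-up toy data frame, is selecting columns from a character vector with dplyr’s all_of(), which throws an error (rather than failing silently) if any named column is missing:

```{r}
library(dplyr)

toy <- data.frame(Year = c(1886, 1887), Coho = c(0, 16), Pink = c(0, 0))
wanted <- c("Year", "Coho")

### Errors loudly if any column named in `wanted` is absent
toy %>% dplyr::select(dplyr::all_of(wanted))
```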
7 Quality Check
Now that we have the data we are interested in using, we should do a little quality check to see that everything seems as expected. A good way of doing this is the glimpse() function from dplyr, a nice alternative to head() from base R.
# Check structure of data
dplyr::glimpse(catch_data)

Rows: 1,708
Columns: 7
$ Region <chr> "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE",…
$ Year <dbl> 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 18…
$ Chinook <chr> "0", "0", "0", "0", "0", "0", "0", "0", "0", "3", "4", "5", "9…
$ Sockeye <dbl> 5, 155, 224, 182, 251, 274, 207, 189, 253, 408, 989, 791, 708,…
$ Coho <dbl> 0, 0, 16, 11, 42, 24, 11, 1, 5, 8, 192, 161, 132, 139, 84, 107…
$ Pink <dbl> 0, 0, 0, 92, 0, 0, 8, 187, 529, 606, 996, 2218, 673, 1545, 204…
$ Chum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 1, 2, 0, 0, 0, 102, 343…
8 Change Columns Using mutate()
We can use the mutate() function to create a new column, OR change an existing column by overwriting it. First, let’s try to convert the Chinook catch values to numeric type using the base R as.numeric() function, and overwrite the old Chinook column. This kind of operation is sometimes known as “coercing” data into a different class (from character to numeric in this case).
# Make the Chinook column numeric
catch_clean <- catch_data %>%
  dplyr::mutate(Chinook = as.numeric(Chinook))

Warning: There was 1 warning in `dplyr::mutate()`.
ℹ In argument: `Chinook = as.numeric(Chinook)`.
Caused by warning:
! NAs introduced by coercion
# Check the structure
dplyr::glimpse(catch_clean)

Rows: 1,708
Columns: 7
$ Region <chr> "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE",…
$ Year <dbl> 1886, 1887, 1888, 1889, 1890, 1891, 1892, 1893, 1894, 1895, 18…
$ Chinook <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 3, 4, 5, 9, 12, 0, 3, 4, 0, 4, 9, 3…
$ Sockeye <dbl> 5, 155, 224, 182, 251, 274, 207, 189, 253, 408, 989, 791, 708,…
$ Coho <dbl> 0, 0, 16, 11, 42, 24, 11, 1, 5, 8, 192, 161, 132, 139, 84, 107…
$ Pink <dbl> 0, 0, 0, 92, 0, 0, 8, 187, 529, 606, 996, 2218, 673, 1545, 204…
$ Chum <dbl> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 7, 0, 1, 2, 0, 0, 0, 102, 343…
We get a warning "NAs introduced by coercion" which is R telling us that it couldn’t convert every value to a number and, for those values it couldn’t convert, it put an NA in its place. This is behavior we commonly experience when cleaning data sets and it’s important to have the skills to deal with it when it comes up.
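You can see the same behavior in miniature with a toy vector: only values that “look like” numbers convert cleanly, and everything else becomes NA.

```{r}
### "1" and "2" coerce fine; "I" cannot be interpreted as a number, so it becomes NA
as.numeric(c("1", "2", "I"))
```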
To investigate, let’s isolate the issue. We can find out which values are NAs with a combination of is.na() and which(), and save that to a variable called i.
(i <- which(is.na(catch_clean$Chinook)))  ### wrapping the entire line in parentheses makes the code “echo” its result in the Console!

[1] 401
It looks like there is only one problem row; let’s have a look at it in the original data, using [row, column] notation.
catch_data[i, ]

# A tibble: 1 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <chr> <dbl> <dbl> <dbl> <dbl>
1 GSE 1955 I 66 0 0 1
The problem value in Chinook is the letter I. It turns out that this data set is from a PDF which was automatically converted into a CSV and this value of I should actually be 1.
Let’s fix it by incorporating the if_else() function (also from dplyr) to our mutate() call, which will change the value of the Chinook column to 1 if the value is equal to I. Once we’ve done that, we can use as.numeric() to turn the character representations of numbers into numeric values without coercing anything to NA.
catch_clean <- catch_data %>%
dplyr::mutate(
### Use a conditional to fix the bad value
    Chinook = dplyr::if_else(condition = Chinook == "I",
                             true = "1",       # what to return if the condition is true: replace “I” with the character “1”
                             false = Chinook), # what to return otherwise: keep the value of Chinook (i.e., don’t change it!)
    ### Make the fixed column numeric
    Chinook = as.numeric(Chinook))
### Check the result
catch_clean[i, ]

# A tibble: 1 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 GSE 1955 1 66 0 0 1
Looks like that worked! Another hint that it worked is that we no longer get the "NAs introduced by coercion" warning. Note also a helpful practice in the above code: the comments in line with the piped sequence, explaining what each step does!
9 Change Shape Using pivot_longer() and pivot_wider()
The next issue is that the “shape” of the data is not appropriate for our needs! Currently the data are in “wide” format where one variable (fish species) is spread into separate columns containing fish counts (in thousands of fish, per the metadata!). We want the data to be in “long” format (a.k.a. “tidy format”) where each column is only one variable and each row is a single observation!
The function pivot_longer() from the tidyr package helps us do this conversion. If you do not remember all the arguments that go into pivot_longer() you can always check out the help page by typing ?tidyr::pivot_longer in the Console.
catch_long <- catch_clean %>%
# Pivot all columns except Region and Year into long format
tidyr::pivot_longer(cols = -c(Region, Year),
                      names_to = "species",            # name for the column that holds the old column names
                      values_to = "catch_thousands")   # name for the column that holds all the values
# Check the result
head(catch_long)

# A tibble: 6 × 4
Region Year species catch_thousands
<chr> <dbl> <chr> <dbl>
1 SSE 1886 Chinook 0
2 SSE 1886 Sockeye 5
3 SSE 1886 Coho 0
4 SSE 1886 Pink 0
5 SSE 1886 Chum 0
6 SSE 1887 Chinook 0
Note that–just like with select()–the cols argument of pivot_longer() can accept either column names to pivot or names not to pivot. The opposite of pivot_longer() is the pivot_wider() function. It works in a similar declarative fashion:
catch_wide <- catch_long %>%
# Pivot the data back into wide format
  tidyr::pivot_wider(names_from = species,         # not quoted: these are actual column names now, so normal Tidyverse style applies
                     values_from = catch_thousands,
                     values_fill = 0)               # value to use if widening creates new row/column combinations (default is NA)
# Check the result
head(catch_wide)

# A tibble: 6 × 7
Region Year Chinook Sockeye Coho Pink Chum
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 SSE 1886 0 5 0 0 0
2 SSE 1887 0 155 0 0 0
3 SSE 1888 0 224 16 0 0
4 SSE 1889 0 182 11 92 0
5 SSE 1890 0 251 42 0 0
6 SSE 1891 0 274 24 0 0
As before, if you need to remind yourself of what goes to what argument (or even just the argument names themselves), run ?tidyr::pivot_wider in the Console.
10 Add or Modify Columns Using mutate()
Now let’s use mutate() again to create a new column called catch with units of fish (instead of thousands of fish). While we’re at it, let’s also use select() to remove the column with catch in thousands because we’re not necessarily interested in it at this point. This further highlights how we can take advantage of the pipe operator to group together a similar set of statements, which all aim to clean up the catch_clean data frame.
catch_long <- catch_long %>%
# Convert to actual fish numbers
dplyr::mutate(catch = catch_thousands * 1000) %>%
# Remove the superseded column
dplyr::select(-catch_thousands)
# Check the result
dplyr::glimpse(catch_long)

Rows: 8,540
Columns: 4
$ Region <chr> "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE", "SSE",…
$ Year <dbl> 1886, 1886, 1886, 1886, 1886, 1887, 1887, 1887, 1887, 1887, 18…
$ species <chr> "Chinook", "Sockeye", "Coho", "Pink", "Chum", "Chinook", "Sock…
$ catch <dbl> 0, 5000, 0, 0, 0, 0, 155000, 0, 0, 0, 0, 224000, 16000, 0, 0, …
We’re now ready to start analyzing the data!
11 Calculate Summary Statistics with dplyr
Suppose we are now interested in getting the average catch per region. In our initial data exploration we saw there are 18 regions; we can easily see their names again:

unique(catch_long$Region)

 [1] "SSE" "NSE" "YAK" "GSE" "BER" "COP" "PWS" "CKI" "BRB" "KSK" "YUK" "NRS"
[13] "KTZ" "KOD" "CHG" "SOP" "ALU" "NOP"
Think about how we would calculate the average catch per region “by hand”. It would be something like this:
- We start with our table and notice there are multiple regions in the “Regions” column.
- We split our original table to group all observations from the same region together.
- We calculate the average catch for each of the groups we form.
- Then we combine the values for average catch per region into a single table.
Analyses like this conform to what is known as the Split-Apply-Combine strategy. This strategy follows the three steps we explained above:
- Split: Split the data into logical groups (e.g., region, species, etc.)
- Apply: Calculate some summary statistic on each group (e.g. mean catch by year, number of individuals per species)
- Combine: Combine the statistic calculated on each group back together into a single table
The dplyr library lets us easily employ the Split-Apply-Combine strategy by using the group_by() and summarize() functions!
mean_region <- catch_long %>%
# Group by region
dplyr::group_by(Region) %>%
# Calculate averages within the established groups
  dplyr::summarize(mean_catch = mean(catch, na.rm = TRUE),
                   .groups = "drop")  # specifies whether we want to keep the grouping structure in the data
# Check the result
dplyr::glimpse(mean_region)

Rows: 18
Columns: 2
$ Region <chr> "ALU", "BER", "BRB", "CHG", "CKI", "COP", "GSE", "KOD", "KS…
$ mean_catch <dbl> 40383.91, 16372.55, 2709796.49, 315487.27, 683571.43, 17922…
The .groups argument is not strictly necessary but it’s good practice to include! Without it, your data may secretly still contain groups, which can lead to some bizarre behavior as you perform further operations. You can also use dplyr::ungroup() after a group_by() %>% summarize() to achieve the same result.
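A sketch of the ungroup() alternative, using a small made-up data frame so it runs on its own:

```{r}
library(dplyr)

toy <- data.frame(Region = c("A", "A", "B"), catch = c(10, 20, 5))

toy %>%
  dplyr::group_by(Region) %>%
  dplyr::summarize(mean_catch = mean(catch)) %>%
  dplyr::ungroup()  # removes any remaining grouping structure
```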
Let’s see how the previous code implements the Split-Apply-Combine strategy:
- group_by(Region): this tells R to split the data frame and create a group for each different value in the column Region. R just keeps track of the groups; it doesn’t return separate data frames per region.
- mean(catch, na.rm = TRUE): here mean() is the function we want to apply to the column catch in each group.
- summarize(mean_catch = mean(catch, na.rm = TRUE)): the function summarize() is used to combine the results of mean() from each group into a single table. The argument mean_catch = mean(...) indicates what the column holding the results of mean() will be named.
Another common use of group_by() followed by summarize() is to count the number of rows in each group. We have to use the n() function from dplyr.
n_region <- catch_long %>%
dplyr::group_by(Region) %>%
# Count rows per region
dplyr::summarize(count = dplyr::n(),
.groups = "drop")
# Check the result
head(n_region)

# A tibble: 6 × 2
Region count
<chr> <int>
1 ALU 435
2 BER 510
3 BRB 570
4 CHG 550
5 CKI 525
6 COP 470
12 Conditionally Filter Rows Using filter()
We use the filter() function to filter our data set to only rows that match some condition. It’s similar to subset() from base R. Let’s go back to our long-form data set and do some filtering with filter().
sse_catch <- catch_long %>%
dplyr::filter(Region == "SSE")
# Check the result
unique(sse_catch$Region)

[1] "SSE"
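filter() can also combine multiple conditions: comma-separated conditions are ANDed together, while | gives OR. A minimal sketch with a made-up data frame (values chosen arbitrarily):

```{r}
library(dplyr)

toy <- data.frame(Region = c("SSE", "NSE", "SSE"), catch = c(5000, 2000, 100))

### Keep SSE rows with catch above 1000 (both conditions must hold)
toy %>% dplyr::filter(Region == "SSE", catch > 1000)
```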
13 Sort Data with arrange()
The arrange() function can be used to sort the rows of a data set. For a lot of R functions, row order does not matter; however, if you wanted to calculate a cumulative sum (e.g., with base::cumsum()) or display a table with sorted rows (e.g., in a Quarto report), it is helpful to know how to re-order your rows with code.
Let’s re-calculate mean catch by region, and then use arrange() to sort the output by mean catch.
mean_region <- catch_long %>%
### Summarize by region
dplyr::group_by(Region) %>%
dplyr::summarize(mean_catch = mean(catch, na.rm = T),
.groups = "drop") %>%
### Order by the mean catch
dplyr::arrange(mean_catch)
# Check the result
head(mean_region)

# A tibble: 6 × 2
Region mean_catch
<chr> <dbl>
1 BER 16373.
2 KTZ 18836.
3 ALU 40384.
4 NRS 51503.
5 KSK 67642.
6 YUK 68646.
The default sorting order of arrange() is to sort in ascending order. To reverse the sort order, wrap the column name inside dplyr’s desc() function:
mean_region <- catch_long %>%
# Summarize by region
dplyr::group_by(Region) %>%
dplyr::summarize(mean_catch = mean(catch, na.rm = T),
.groups = "drop") %>%
# Order by descending mean catch
dplyr::arrange(dplyr::desc(mean_catch))
# Check the result
head(mean_region)

# A tibble: 6 × 2
Region mean_catch
<chr> <dbl>
1 SSE 3184661.
2 BRB 2709796.
3 NSE 1825021.
4 KOD 1528350
5 PWS 1419237.
6 SOP 1110942.
14 Split or Combine Columns
14.1 Separate Columns with separate_wider_delim()
The separate_wider_delim() function allows us to easily split a single column into multiple columns based on some “delimiter” character in the values of that column. Its complement, the unite() function, allows us to combine multiple columns into a single one.
This can come in really handy when we need to split a column into two pieces by a consistent separator (like a dash).
Let’s make a new data.frame with fake data to illustrate this. Here we have a set of site identification codes with information about the island where the site is (the first 3 letters) and a site number (the 3 numbers). If we want to group and summarize by island, we need a column with just the island information.
sites_df <- data.frame(site = c("HAW-101", "HAW-103", "OAH-320", "OAH-219", "MAU-039"))
# Check that out
head(sites_df)

     site
1 HAW-101
2 HAW-103
3 OAH-320
4 OAH-219
5 MAU-039
# Split site by the dash between the site letters and its code
sites_df %>%
tidyr::separate_wider_delim(cols = site, delim = "-",
                               names = c("island", "site_number"), cols_remove = F)

# A tibble: 5 × 3
island site_number site
<chr> <chr> <chr>
1 HAW 101 HAW-101
2 HAW 103 HAW-103
3 OAH 320 OAH-320
4 OAH 219 OAH-219
5 MAU 039 MAU-039
Note that separate_wider_delim() has a lot of arguments you may want to use in some circumstances. Run ?tidyr::separate_wider_delim in the Console to check them out.
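For example, the too_many argument controls what happens when a value contains more delimiters than expected; with too_many = "merge", the extra pieces stay attached to the last column. A sketch with made-up site codes:

```{r}
library(tidyr)

messy_df <- data.frame(site = c("HAW-101", "OAH-320-b"))

messy_df %>%
  tidyr::separate_wider_delim(cols = site, delim = "-",
                              names = c("island", "site_number"),
                              too_many = "merge")  # "320-b" stays together in site_number
```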
14.2 Unite Columns with unite()
The unite() function does the reverse of separate_wider_delim(); if we have a data.frame that contains separate columns, we can combine these into one column where all the values in the original columns are pasted together. For example, we might have a data set with columns for year, month, and day, and we might want to unite these into a single date column.
dates_df <- data.frame(
year = rep(x = "1930", times = 3),
month = rep(x = "12", times = 3),
day = 14:16)
# Check that
dates_df

  year month day
1 1930 12 14
2 1930 12 15
3 1930 12 16
# Combine those columns into a single 'date' column
dates_df %>%
  tidyr::unite(col = date,
               year, month, day,  # note: the columns to combine are passed without an argument name!
               sep = "-",
               remove = F)        # specifies whether you want to remove the original columns

        date year month day
1 1930-12-14 1930 12 14
2 1930-12-15 1930 12 15
3 1930-12-16 1930 12 16
15 Putting it All Together!
We just ran through the various things we can do with dplyr, tidyr, and the pipe. If you’re wondering how all of this might look in a real analysis, let’s look at that now:
### Grab the data
catch_original <- readr::read_csv(file = "https://knb.ecoinformatics.org/knb/d1/mn/v2/object/df35b.302.1",
show_col_types = F)
### Do desired wrangling
mean_region <- catch_original %>%
### Drop unwanted columns
dplyr::select(-All, -notesRegCode) %>%
### Fix the typo in the Chinook data
dplyr::mutate(
Chinook = dplyr::if_else(Chinook == "I",
true = "1", false = Chinook),
Chinook = as.numeric(Chinook)) %>%
### Reshape to long format
tidyr::pivot_longer(-c(Region, Year),
names_to = "species",
values_to = "catch") %>%
### Transform catch into number of fish
dplyr::mutate(catch = catch*1000) %>%
### Summarize within region
dplyr::group_by(Region) %>%
dplyr::summarize(mean_catch = mean(catch, na.rm = T),
.groups = "drop") %>%
### Sort by mean catch in descending order
dplyr::arrange(dplyr::desc(mean_catch))
# Check the result
head(mean_region)

# A tibble: 6 × 2
Region mean_catch
<chr> <dbl>
1 SSE 3184661.
2 BRB 2709796.
3 NSE 1825021.
4 KOD 1528350
5 PWS 1419237.
6 SOP 1110942.
That is a pretty complicated series of operations, but with piping, clearly named functions from tidyr and dplyr, and helpful in-line comments to explain each step, it is pretty easy to follow!
16 Wrap up: Git workflow
Let’s wrap up this lesson by committing and syncing our changes using our Git workflow.
