1 Setup
### For data wrangling and plotting
library(dplyr)
library(tidyr)
library(ggplot2)
library(purrr) ### for iteration using map() family of functions
library(palmerpenguins) ### example data
2 Why Functions?
Many R users write code as a continuous stream of commands—paste from the console, run, repeat. That works, but breaking code into small, reusable functions pays off quickly.
The guiding principle is DRY: Don’t Repeat Yourself. If you’ve copied and pasted a block of code more than twice, it’s time to write a function. Functions give you:
- Less repetition — change logic in one piece of code, not ten
- Fewer errors — one tested implementation instead of many copies
- Better readability — a well-named function documents the coder’s intent
2.1 Function Basics
The syntax for defining a function in R is:
my_function <- function(argument1, argument2) {
  ### code that performs some well-defined task
  return(result)
}
You can call it just like any built-in function:
result <- my_function(argument1 = value1, argument2 = value2)
Ideally, function names should be short, but still clearly capture what the function does.
Best Practices from Chapter 19 Functions in R for Data Science:
- Function names should be verbs and arguments should be nouns (there are exceptions).
- Use the snake_case naming convention for function names made up of multiple words.
- For a “family” of functions, use a common prefix to indicate that they are connected. For example, most of the functions in the usethis package have the prefix use_*.
Based on these tips, is my_function() a good name for a function?
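As a minimal sketch of these naming tips, the hypothetical functions below (our own illustration, not part of the original lesson) use verb-based names, noun arguments, snake_case, and a shared calc_ prefix to mark them as a family:

```r
### Hypothetical examples: verb-based names, noun arguments, shared prefix
calc_rect_area <- function(width, height) {
  area <- width * height
  return(area)
}

calc_rect_perimeter <- function(width, height) {
  perimeter <- 2 * (width + height)
  return(perimeter)
}

rect_area      <- calc_rect_area(width = 3, height = 4)
rect_perimeter <- calc_rect_perimeter(width = 3, height = 4)
```

Compared with my_function(), each name says what the function does and which family it belongs to.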
2.2 Example: Temperature Conversion
Imagine you have some temperature data measured in degrees Fahrenheit and you want to convert that to Celsius for your analysis. You might have an R script that does this for you.
airtemps <- c(212, 30.3, 78, 32)
celsius1 <- (airtemps[1] - 32) * 5/9
celsius2 <- (airtemps[2] - 32) * 5/9
celsius3 <- (airtemps[3] - 33) * 5/9
Here we repeat the same formula three times. And note the third time, there’s a subtle mistake (33 instead of 32)! This code would be both more compact and more reliable if we didn’t repeat ourselves.
Convert Fahrenheit to Celsius
To create a function in R, we use the function function (so meta!) and assign its result to an object. Let’s create a function that calculates Celsius temperature outputs from Fahrenheit temperature inputs.
convert_f_to_c <- function(fahr) {
  celsius <- (fahr - 32) * 5/9
  return(celsius)
}
Because R operations are vectorized, this works on a single value or an entire vector:
### check single value:
celsius1a <- convert_f_to_c(airtemps[1])
celsius1a == celsius1
[1] TRUE
### calculate for whole vector:
celsius_vec <- convert_f_to_c(airtemps)
celsius_vec
[1] 100.0000000  -0.9444444  25.5555556   0.0000000
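The same pattern works in the other direction. As a sketch, convert_c_to_f() below is a hypothetical companion function (the name is our assumption; it is not defined in the text above) that inverts the formula. Converting to Celsius and back should recover the original values, up to floating-point rounding.

```r
### Fahrenheit to Celsius, as defined above
convert_f_to_c <- function(fahr) {
  celsius <- (fahr - 32) * 5/9
  return(celsius)
}

### Hypothetical inverse: Celsius to Fahrenheit
convert_c_to_f <- function(celsius) {
  fahr <- celsius * 9/5 + 32
  return(fahr)
}

airtemps <- c(212, 30.3, 78, 32)
round_trip <- convert_c_to_f(convert_f_to_c(airtemps))

### all.equal() compares with a small tolerance rather than exact equality
all.equal(round_trip, airtemps)
```

Using all.equal() rather than == avoids false alarms from tiny floating-point differences.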
3 Functions: Input, Output, Environment
3.1 Setting Argument Defaults
Function arguments can be given a default value, in which case the user can simply omit that argument when calling the function. Here, convert_temp() allows the user to specify the scale of their input, defaulting to Fahrenheit (scale = 'f').
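convert_temp <- function(t, scale = 'f') {
  if(scale == "c") {
    result <- t * 9/5 + 32
  } else {
    result <- (t - 32) * 5/9
  }
  return(round(result, 3))
}
convert_temp(t = airtemps)
[1] 100.000  -0.944  25.556   0.000
convert_temp(t = airtemps, scale = 'Celsius')
[1] 100.000  -0.944  25.556   0.000
Note that both calls return the same values: 'Celsius' does not match "c", so the function silently falls through to the Fahrenheit-to-Celsius branch.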
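A short sketch of how the default behaves in practice: omitting scale uses the default 'f', while supplying scale = 'c' (by name or by position) overrides it. The function is repeated here only so the example runs on its own; the variable names are ours.

```r
convert_temp <- function(t, scale = 'f') {
  if (scale == "c") {
    result <- t * 9/5 + 32      ### input is Celsius, return Fahrenheit
  } else {
    result <- (t - 32) * 5/9    ### input is Fahrenheit, return Celsius
  }
  return(round(result, 3))
}

temp_default     <- convert_temp(212)               ### scale defaults to 'f'
temp_override    <- convert_temp(100, scale = 'c')  ### override the default by name
temp_by_position <- convert_temp(100, 'c')          ### or by position
```

Naming the argument (scale = 'c') is usually easier to read than relying on position.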
3.2 Error handling!
Sometimes a user will include an argument that breaks the code - a character when the function expects a number, upper-case when expecting lower-case, etc. Try to anticipate simple errors and include a way to identify and handle them in the function: try to correct the error on the fly, or use stop() to return a useful error message to the user.
Consider some common mistakes a user might make with convert_temp(). How might we deal with them?
convert_temp <- function(t, scale = 'f') {
  scale <- tolower(substr(scale, 1, 1))        ### (1)
  if(!scale %in% c('c', 'f')) {                ### (2)
    stop('scale must be either "c" or "f"')
  }
  if(scale == "c") {                           ### (3)
    result <- t * 9/5 + 32
  } else {
    result <- (t - 32) * 5/9
  }
  return(round(result, 3))
}
1. What if the user enters scale = "Celsius"? substr(x, 1, 1) grabs just the first letter, and tolower() forces it to lower case.
2. If scale is not one of the expected values, generate a sensible error with stop().
3. We know the values are valid now, so we can use a simple if/else to do the correct calculation!
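One way to check this behavior is sketched below. It assumes we also want to reject non-numeric temperatures; the is.numeric() check is our addition, not part of the function above. tryCatch() captures each error so we can inspect its message instead of halting the script.

```r
convert_temp <- function(t, scale = 'f') {
  ### Assumption: also reject non-numeric input with a clear message
  if (!is.numeric(t)) {
    stop('t must be numeric')
  }
  scale <- tolower(substr(scale, 1, 1))
  if (!scale %in% c('c', 'f')) {
    stop('scale must be either "c" or "f"')
  }
  if (scale == "c") {
    result <- t * 9/5 + 32
  } else {
    result <- (t - 32) * 5/9
  }
  return(round(result, 3))
}

ok_result <- convert_temp(100, scale = 'Celsius')  ### first letter 'C' becomes 'c'

### Capture each error message rather than stopping the script
bad_scale <- tryCatch(convert_temp(100, scale = 'kelvin'),
                      error = function(e) conditionMessage(e))
bad_temp  <- tryCatch(convert_temp('hot'),
                      error = function(e) conditionMessage(e))
```

Here bad_scale and bad_temp hold the text passed to stop(), which is what a user would see in the error message.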
3.3 Returning Values
Use return(x) to end the function and hand the resulting value back to the user. To provide multiple values back to a user, consider more complex data structures like a dataframe (well structured) or a named list (very flexible).
convert_f_to_c_k <- function(f) {
  c <- (f - 32) * 5/9
  k <- c + 273.15
  out_df <- data.frame(fahr = f, celsius = c, kelvin = k)
  return(out_df)
}
temps_df <- convert_f_to_c_k(seq(-100, 100, 50))
temps_df
  fahr   celsius   kelvin
1 -100 -73.33333 199.8167
2  -50 -45.55556 227.5944
3    0 -17.77778 255.3722
4   50  10.00000 283.1500
5  100  37.77778 310.9278
convert_f_to_c_k <- function(f) {
  c <- (f - 32) * 5/9
  k <- c + 273.15
  out_list <- list(fahr = f, celsius = c, kelvin = k)
  return(out_list)
}
temps_list <- convert_f_to_c_k(seq(-100, 100, 50))
temps_list
$fahr
[1] -100  -50    0   50  100

$celsius
[1] -73.33333 -45.55556 -17.77778  10.00000  37.77778

$kelvin
[1] 199.8167 227.5944 255.3722 283.1500 310.9278
Without an explicit return() we might get unexpected results!
convert_f_to_c_k <- function(f) {
  c <- (f - 32) * 5/9
  k <- c + 273.15
  out_list <- list(fahr = f, celsius = c, kelvin = k)
  print('calculation successful!')
}
temps_oops <- convert_f_to_c_k(seq(-100, 100, 50))
[1] "calculation successful!"
temps_oops
[1] "calculation successful!"
return()
If a function has no explicit return() call, R returns the value of the last expression evaluated in the function body. For simple functions, this is fine, but for anything more than a line or two, best practice is to explicitly call return() so it is obvious what is being returned.
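To make the contrast concrete, here is a small sketch with two hypothetical functions (the names are ours). Both compute the same value, but the first relies on the last evaluated expression while the second states its return value explicitly:

```r
### Implicit: the value of the last expression is returned
convert_implicit <- function(f) {
  (f - 32) * 5/9
}

### Explicit: return() makes the output unambiguous, even when other
### steps (such as a status message) are added to the function
convert_explicit <- function(f) {
  celsius <- (f - 32) * 5/9
  message('calculation successful!')
  return(celsius)
}

implicit_out <- convert_implicit(212)
explicit_out <- convert_explicit(212)
```

In the explicit version, adding or moving the message() line cannot change what the function hands back.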
3.4 Functions and Environments
Every function is created in some environment within R; this is its parent environment (when you define a function interactively, this is the global environment). Each time the function is called, it performs all its calculations in a new temporary environment, a child of that parent. When the function completes, it returns a value to the caller, then the temporary child environment disappears, taking any intermediate values with it. This can cause confusing behavior.
a <- 1 ### create object a in parent environment
add_one <- function(x) {
  x <- x + 1 ### increment argument by 1
  a <- a + 2 ### modify a in function environment
  b <- a + 3 ### create b in function environment
  return(x)  ### only x is returned back to parent
}
add_one(a) ### give `a` as arg to the function; returns the value a + 1
[1] 2
a ### value in parent env is unchanged by calcs in the function
[1] 1
exists('b') ### objects created in the function disappear when function ends
[1] FALSE
In the above example, there is no starting value of a in the child (function) environment, but like a good parent-child relationship, the child can look “up” to the parent if it can’t find a needed value (a in parent environment) within its own environment. However, the change in a (and the creation of b) within the child environment are lost when the function ends - like a bad parent-child relationship, the child is abandoned and forgotten!
There are complexities and ways to get around this behavior, but it’s best just to be aware, and make sure any values the user needs get sent back to the parent environment using return()!
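A minimal sketch of that advice, using a hypothetical add_values() function (our name, for illustration only): rather than expecting a and b to survive in the parent environment, the function returns everything the user needs in a named list.

```r
a <- 1

add_values <- function(x) {
  x <- x + 1   ### local copy of the argument
  a <- a + 2   ### local a, built from the parent's a
  b <- a + 3   ### local b
  return(list(x = x, a = a, b = b))   ### send all three back to the caller
}

out <- add_values(a)
out$b   ### value of b computed inside the function
a       ### the parent's a is still unchanged
```

The caller can then pick out what it needs (out$x, out$a, out$b) without the function touching anything outside itself.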
4 Functions: Use Cases
Functions are amazing for reusing common code logic and for communicating the intent of a block of code by assigning a name. Functions can also make it much easier to iterate complex operations on vectors and lists, and can help you create customized plot themes that you can use across multiple plots to maintain a consistent “brand”.
4.1 Functions for Iteration
A function allows you to easily reuse a piece of code. This is especially powerful when coupled with iteration functions: *apply functions (base R) and the map_* functions from the purrr package. For example, you could apply a function to each element of a list or vector in sequence; separate a dataframe into pieces by some variable, then fit a model to each piece or generate a separate plot for each; or read in a series of separate data files and summarize the results into a complete dataframe.
The apply and map functions typically take a list or vector for their first argument, and a function as their second.
- The first argument of the function, .x or X, should generally be the variable being iterated over!
- The function argument, .f or FUN, should not have parentheses after it!
Apply the square root function sqrt() to each element of a vector of numbers. NOTE, if nums were a vector, this would be trivial since sqrt already takes advantage of vectorization; here we made nums a list just for fun.
nums <- list(1, 3, 9, 49, 101, pi)
purrr::map_vec(.x = nums, .f = sqrt)   ### (1)
base::sapply(X = nums, FUN = sqrt)     ### (2)
1. map() always returns a list, but map_vec() simplifies into a vector; see also map_df() or type-specific versions like map_int(), map_chr(), or map_dbl().
2. apply() returns various formats depending on output; sapply() returns the results simplified into a vector or matrix.
[1]  1.000000  1.732051  3.000000  7.000000 10.049876  1.772454
[1]  1.000000  1.732051  3.000000  7.000000 10.049876  1.772454
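For scripts where the output type matters, base R's vapply() plays a role similar to the type-specific map_dbl(): you declare the expected type and length of each result, and R raises an error if a result does not match. A brief sketch (the object names roots and type_check are ours):

```r
nums <- list(1, 3, 9, 49, 101, pi)

### FUN.VALUE = numeric(1) declares that each result is a single number
roots <- vapply(X = nums, FUN = sqrt, FUN.VALUE = numeric(1))

### A mismatch is caught: sqrt() returns a number, not a character string
type_check <- tryCatch(
  vapply(X = nums, FUN = sqrt, FUN.VALUE = character(1)),
  error = function(e) 'type mismatch caught'
)
```

This trades a little extra typing for an early, explicit failure when a function returns something unexpected.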
Let’s take the palmerpenguins::penguins dataset and create a linear model of bill length vs bill depth for each of the three penguin species: Chinstrap, Adelie, and Gentoo. Note that there are more efficient ways to do this; here we are spelling it out as a teaching case.
library(palmerpenguins)
data(penguins)
spp_vec <- unique(penguins$species)
calc_bill_model <- function(spp, df) {
  spp_df <- df %>%
    filter(species == spp)
  bill_mdl <- lm(bill_length_mm ~ bill_depth_mm, data = spp_df)
  return(bill_mdl)
}
map_results <- purrr::map(.x = spp_vec, .f = calc_bill_model, df = penguins)      ### (1)
broom::tidy(map_results[[3]])                                                     ### (2)
lapply_results <- base::lapply(X = spp_vec, FUN = calc_bill_model, df = penguins) ### (3)
broom::tidy(lapply_results[[3]])
1. map() always returns a list, which works well for storing a complex object like a linear model.
2. broom::tidy() summarizes a model object into a nice clean dataframe.
3. lapply() always returns a list, which works well for storing a complex object like a linear model.
# A tibble: 2 × 5
  term          estimate std.error statistic       p.value
  <chr>            <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)      13.4      5.06       2.66 0.00992
2 bill_depth_mm     1.92     0.274      7.01 0.00000000153
# A tibble: 2 × 5
  term          estimate std.error statistic       p.value
  <chr>            <dbl>     <dbl>     <dbl>         <dbl>
1 (Intercept)      13.4      5.06       2.66 0.00992
2 bill_depth_mm     1.92     0.274      7.01 0.00000000153
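One more compact alternative can be sketched in base R with split() and lapply(). To keep this example self-contained, it uses the built-in iris dataset as a stand-in for penguins (our assumption, purely for illustration), fitting one model per species:

```r
### split() breaks the data frame into a named list, one element per species
iris_by_spp <- split(iris, iris$Species)

### Fit the same model to each piece with an anonymous function
sepal_models <- lapply(X = iris_by_spp,
                       FUN = function(df) lm(Sepal.Length ~ Sepal.Width, data = df))

### Pull the slope from each model into a named numeric vector
slopes <- vapply(sepal_models,
                 FUN = function(mdl) coef(mdl)[['Sepal.Width']],
                 FUN.VALUE = numeric(1))
names(slopes)
```

Because split() names each list element by species, the results stay labeled without any extra bookkeeping.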
In the examples above, we used named functions (sqrt defined in base R, calc_bill_model defined by our own code). The map and apply functions also work well with anonymous functions - small tasks where the code is only needed once and can be discarded after. The function is created on the fly and not assigned to a named object.
Use function() without assigning it to an object, passing it directly to the FUN argument:
base::sapply(X = 1:3, FUN = function(x) x + 1)
[1] 2 3 4
R 4.1 introduced a shorthand for function(args) {expr} using a backslash: \(args) expr:
base::sapply(X = 1:3, FUN = \(x) x + 1)
[1] 2 3 4
The purrr package uses a tilde-dot syntax as placeholder for the argument. It also works with function() or the backslash shorthand.
purrr::map_vec(.x = 1:3, .f = ~ .x + 1)
[1] 2 3 4
purrr::map_vec(.x = 1:3, .f = \(x) x + 1)
[1] 2 3 4
4.2 Functions for Custom Plot Themes
If you make many similar plots, a custom theme function keeps formatting consistent and easy to update (we know this is not the greatest plot, we’re just changing various aspects that should be obvious in the resulting plot).
custom_theme <- function(base_size = 9) {
  theme(
    text = element_text(family = "serif",
                        color = "slateblue4",
                        size = base_size),
    plot.title = element_text(size = rel(1.25),
                              hjust = 0.5,
                              face = "bold"),
    panel.background = element_rect(color = 'slateblue3',
                                    fill = 'azure'),
    panel.grid.major = element_line(color = "slateblue1",
                                    linewidth = 0.25),
    ### note: ggplot2 3.5.0+ prefers legend.position = "inside"
    ### together with legend.position.inside = c(.9, .4)
    legend.position = c(.9, .4),
    axis.ticks = element_line(color = 'red')
  )
}
You can go further and wrap the entire plot in a function too:
scatterplot <- function(df, point_size = 2, font_size = 9) {
  ggplot(data = df, mapping = aes(x = fahr, y = celsius, color = kelvin)) +
    geom_point(size = point_size) +
    scale_color_viridis_c() +
    custom_theme(font_size)
}
scatterplot(temps_df, point_size = 3, font_size = 16) +
  labs(title = 'Temperature Conversions')   ### (1)
1. Since scatterplot() returns a ggplot object, we can continue to add ggplot layers: additional geoms, scales, labels, etc.

Now all plots built with scatterplot() can be reformatted by changing one function – whether you’re making 1, 10, or 100 plots.
5 Documenting Functions
Well-named functions are a start, but good documentation tells collaborators (and future you) what a function expects and what it returns. Comments in the function body are a good start. For a more standardized structure, the roxygen2 package provides a lightweight format for this. Place comments starting with #' immediately above the function definition, and use specific tags (e.g., @param) to define sections of the documentation.
#' Convert temperature from Fahrenheit to Celsius
#'
#' @param fahr Numeric value or vector in degrees Fahrenheit
#'
#' @returns Numeric value or vector in degrees Celsius
#' @export
#'
#' @examples
#' convert_f_to_c(32)
#' convert_f_to_c(c(32, 212, 72))
convert_f_to_c <- function(fahr) {
  celsius <- (fahr - 32) * 5/9
  return(celsius)
}
Key tags:
| Tag | Purpose |
|---|---|
| @param | Describes each input argument to the function |
| @returns | Describes the function’s output |
| @examples | Shows usage examples |
| @export | Makes the function available if bundled in a package |
This roxygen2 structure might seem overly complicated, but it will be useful for others interested in using your functions (including “future you”) - especially when including your functions in an R package!
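As a further sketch, the same tags can document a function with a default argument, such as convert_temp() from earlier. The wording of each tag below is our own illustration rather than part of the original lesson:

```r
#' Convert temperature between Fahrenheit and Celsius
#'
#' @param t Numeric value or vector of temperatures
#' @param scale Scale of the input: "f" (default) for Fahrenheit, "c" for
#'   Celsius. Only the first letter is used, and case is ignored.
#'
#' @returns Numeric value or vector in the opposite scale, rounded to 3 decimals
#' @export
#'
#' @examples
#' convert_temp(212)
#' convert_temp(100, scale = "c")
convert_temp <- function(t, scale = 'f') {
  scale <- tolower(substr(scale, 1, 1))
  if (!scale %in% c('c', 'f')) {
    stop('scale must be either "c" or "f"')
  }
  if (scale == "c") {
    result <- t * 9/5 + 32
  } else {
    result <- (t - 32) * 5/9
  }
  return(round(result, 3))
}
```

Each argument gets its own @param line, and the description of scale records the default and the accepted values so a user does not have to read the function body.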
For more best practices on function documentation, check out Hadley Wickham and Jennifer Bryan’s online book R Packages (2e) - Chapter 10, Section 16: Function Documentation.
6 Exercises
In this sequence of exercises, we will build up a function to calculate the weight of Chinook salmon based only on length, using a simple length-to-weight formula \(W = aL^b\).