6  Mutate Data

Author

S Perry

6.1 Introduction

Sometimes, we want to modify existing columns values or even add new ones. This is where the mutate function from the dplyr package comes in.

6.2 Add New Columns

With mutate, it’s simple to add new columns. First, let’s look at our current ones:

glimpse(df_wq)
Rows: 62
Columns: 20
$ Station            <chr> "P8", "D7", "P8", "D7", "P8", "D7", "P8", "D7", "P8…
$ Date               <date> 2020-01-16, 2020-01-22, 2020-02-14, 2020-02-20, 20…
$ Chla               <dbl> 0.64, 0.67, 1.46, 2.15, 1.40, 1.89, 4.73, 1.74, 6.4…
$ Pheophytin         <dbl> 0.50, 0.87, 0.69, 0.50, 0.56, 1.13, 1.25, 0.89, 0.8…
$ TotAlkalinity      <dbl> 98.0, 82.0, 81.0, 86.0, 80.0, 93.0, 59.0, 78.0, 63.…
$ DissAmmonia        <dbl> 0.150, 0.210, 0.250, 0.140, 0.110, 0.220, 0.050, 0.…
$ DissNitrateNitrite <dbl> 2.800, 0.490, 1.700, 0.480, 1.600, 0.380, 1.070, 0.…
$ DOC                <dbl> 3.90, 0.27, 2.80, 0.39, 2.00, 0.19, 2.80, 1.20, 3.1…
$ TOC                <dbl> 4.10, 0.32, 2.50, 0.41, 2.10, 0.20, 2.80, 1.20, 3.1…
$ DON                <dbl> NA, NA, NA, NA, NA, NA, 0.30, 0.20, 0.30, 0.10, 0.5…
$ TotPhos            <dbl> 0.310, 0.082, 0.130, 0.130, 0.190, 0.100, 0.188, 0.…
$ DissOrthophos      <dbl> 0.200, 0.071, 0.130, 0.065, 0.140, 0.082, 0.177, 0.…
$ TDS                <dbl> 380, 9500, 340, 5800, 290, 8700, 280, 7760, 227, 11…
$ TSS                <dbl> 8.9, 38.0, 2.2, 18.0, 1.4, 28.0, 6.6, 35.6, 5.3, 23…
$ TKN                <dbl> 0.520, 0.480, 0.430, 0.250, 0.400, 0.200, 0.400, 0.…
$ Depth              <dbl> 28.9, 18.8, 39.0, 7.1, 39.0, 7.2, 37.1, 5.2, 36.7, …
$ Secchi             <dbl> 116, 30, 212, 52, 340, 48, 100, 40, 160, 44, 120, 6…
$ Microcystis        <dbl> 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 4, 2, 3, 2, 2, 1, 1, …
$ SpCndSurface       <dbl> 667, 15532, 647, 11369, 530, 16257, 503, 12946, 404…
$ WTSurface          <dbl> 9.67, 9.97, 11.09, 12.51, 13.97, 13.81, 23.46, 21.1…


Now let’s add a new column, Lab, where every value is “BSA”:

df_wq %>%
  mutate(Lab = 'BSA') %>%
  glimpse()
Rows: 62
Columns: 21
$ Station            <chr> "P8", "D7", "P8", "D7", "P8", "D7", "P8", "D7", "P8…
$ Date               <date> 2020-01-16, 2020-01-22, 2020-02-14, 2020-02-20, 20…
$ Chla               <dbl> 0.64, 0.67, 1.46, 2.15, 1.40, 1.89, 4.73, 1.74, 6.4…
$ Pheophytin         <dbl> 0.50, 0.87, 0.69, 0.50, 0.56, 1.13, 1.25, 0.89, 0.8…
$ TotAlkalinity      <dbl> 98.0, 82.0, 81.0, 86.0, 80.0, 93.0, 59.0, 78.0, 63.…
$ DissAmmonia        <dbl> 0.150, 0.210, 0.250, 0.140, 0.110, 0.220, 0.050, 0.…
$ DissNitrateNitrite <dbl> 2.800, 0.490, 1.700, 0.480, 1.600, 0.380, 1.070, 0.…
$ DOC                <dbl> 3.90, 0.27, 2.80, 0.39, 2.00, 0.19, 2.80, 1.20, 3.1…
$ TOC                <dbl> 4.10, 0.32, 2.50, 0.41, 2.10, 0.20, 2.80, 1.20, 3.1…
$ DON                <dbl> NA, NA, NA, NA, NA, NA, 0.30, 0.20, 0.30, 0.10, 0.5…
$ TotPhos            <dbl> 0.310, 0.082, 0.130, 0.130, 0.190, 0.100, 0.188, 0.…
$ DissOrthophos      <dbl> 0.200, 0.071, 0.130, 0.065, 0.140, 0.082, 0.177, 0.…
$ TDS                <dbl> 380, 9500, 340, 5800, 290, 8700, 280, 7760, 227, 11…
$ TSS                <dbl> 8.9, 38.0, 2.2, 18.0, 1.4, 28.0, 6.6, 35.6, 5.3, 23…
$ TKN                <dbl> 0.520, 0.480, 0.430, 0.250, 0.400, 0.200, 0.400, 0.…
$ Depth              <dbl> 28.9, 18.8, 39.0, 7.1, 39.0, 7.2, 37.1, 5.2, 36.7, …
$ Secchi             <dbl> 116, 30, 212, 52, 340, 48, 100, 40, 160, 44, 120, 6…
$ Microcystis        <dbl> 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 4, 2, 3, 2, 2, 1, 1, …
$ SpCndSurface       <dbl> 667, 15532, 647, 11369, 530, 16257, 503, 12946, 404…
$ WTSurface          <dbl> 9.67, 9.97, 11.09, 12.51, 13.97, 13.81, 23.46, 21.1…
$ Lab                <chr> "BSA", "BSA", "BSA", "BSA", "BSA", "BSA", "BSA", "B…

We can even add columns that are combinations of other ones:

add two numeric columns together

df_wq %>%
  mutate(ChlPheo = Chla + Pheophytin) %>%
  select(Station, Date, Chla, Pheophytin, ChlPheo) %>%
  head()
# A tibble: 6 × 5
  Station Date        Chla Pheophytin ChlPheo
  <chr>   <date>     <dbl>      <dbl>   <dbl>
1 P8      2020-01-16  0.64       0.5     1.14
2 D7      2020-01-22  0.67       0.87    1.54
3 P8      2020-02-14  1.46       0.69    2.15
4 D7      2020-02-20  2.15       0.5     2.65
5 P8      2020-03-03  1.4        0.56    1.96
6 D7      2020-03-06  1.89       1.13    3.02

concatenate two character columns

df_wq %>%
  mutate(StatDate = paste(Station, Date, sep = '___')) %>%
  select(Station, Date, StatDate) %>%
  head()
# A tibble: 6 × 3
  Station Date       StatDate       
  <chr>   <date>     <chr>          
1 P8      2020-01-16 P8___2020-01-16
2 D7      2020-01-22 D7___2020-01-22
3 P8      2020-02-14 P8___2020-02-14
4 D7      2020-02-20 D7___2020-02-20
5 P8      2020-03-03 P8___2020-03-03
6 D7      2020-03-06 D7___2020-03-06
Relocate Columns

Sometimes, we want to re-arrange our column order. For example, with our new Lab column, we might want it in the beginning with the other metadata. Here, we use the relocate function:

df_wq %>%
  mutate(Lab = 'BSA') %>%
  relocate(Lab, .after = Date) %>%
  head()
# A tibble: 6 × 21
  Station Date       Lab    Chla Pheophytin TotAlkalinity DissAmmonia
  <chr>   <date>     <chr> <dbl>      <dbl>         <dbl>       <dbl>
1 P8      2020-01-16 BSA    0.64       0.5             98        0.15
2 D7      2020-01-22 BSA    0.67       0.87            82        0.21
3 P8      2020-02-14 BSA    1.46       0.69            81        0.25
4 D7      2020-02-20 BSA    2.15       0.5             86        0.14
5 P8      2020-03-03 BSA    1.4        0.56            80        0.11
6 D7      2020-03-06 BSA    1.89       1.13            93        0.22
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>

6.3 Modify Existing Columns

We can also use mutate to modify existing columns. For example, say we want to add 20 to every Chla value:

df_wq %>%
  mutate(Chla = Chla+20)
# A tibble: 62 × 20
   Station Date        Chla Pheophytin TotAlkalinity DissAmmonia
   <chr>   <date>     <dbl>      <dbl>         <dbl>       <dbl>
 1 P8      2020-01-16  20.6       0.5             98        0.15
 2 D7      2020-01-22  20.7       0.87            82        0.21
 3 P8      2020-02-14  21.5       0.69            81        0.25
 4 D7      2020-02-20  22.2       0.5             86        0.14
 5 P8      2020-03-03  21.4       0.56            80        0.11
 6 D7      2020-03-06  21.9       1.13            93        0.22
 7 P8      2020-06-11  24.7       1.25            59        0.05
 8 D7      2020-06-17  21.7       0.89            78        0.05
 9 P8      2020-07-13  26.4       0.88            63        0.05
10 D7      2020-07-16  22.8       0.85            80        0.05
# ℹ 52 more rows
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>

We can even change data types! This is useful if, say, you have character column that should be numeric, or a character column that should be factor. We do this by using the as.numeric and as.factor functions (as.character is also an option):

df_wq %>%
  mutate(Chla = as.numeric(Chla),
         Station = as.factor(Station)) %>%
  select(Station, Chla) %>%
  str()
tibble [62 × 2] (S3: tbl_df/tbl/data.frame)
 $ Station: Factor w/ 2 levels "D7","P8": 2 1 2 1 2 1 2 1 2 1 ...
 $ Chla   : num [1:62] 0.64 0.67 1.46 2.15 1.4 1.89 4.73 1.74 6.4 2.79 ...

6.4 case_when

Sometimes you only want to change certain values in a column. For example, maybe you want to add 20 to chla values, but only for rows where the station is P8. You can do this using the case_when() function for this.

case_when() lets you assign new values based on conditions. It takes each row and sequentially compared it to each condition, using the value for the first one that is true. The format looks like this:

case_when(
  condition1 ~ value1,
  condition2 ~ value2,
  TRUE ~ default_value
  )

Since every row must get a value, it’s important to end with TRUE ~ result to handle all remaining cases.

To go back to our example, the code would look like this:

df_wq %>%
  mutate(
    Chla =
      case_when(Station == 'P8' ~ Chla + 20, # if station is P8, add 20
              TRUE ~ Chla) # else, keep as previous value
  ) %>%
  select(c(Station:Date), Chla)
# A tibble: 62 × 3
   Station Date        Chla
   <chr>   <date>     <dbl>
 1 P8      2020-01-16 20.6 
 2 D7      2020-01-22  0.67
 3 P8      2020-02-14 21.5 
 4 D7      2020-02-20  2.15
 5 P8      2020-03-03 21.4 
 6 D7      2020-03-06  1.89
 7 P8      2020-06-11 24.7 
 8 D7      2020-06-17  1.74
 9 P8      2020-07-13 26.4 
10 D7      2020-07-16  2.79
# ℹ 52 more rows

Note that case_when is nested in the mutate function and that we had to specify which column we wanted to manipulate (in this case, Chla). As you may assume, since it’s part of mutate, this means we can also create new columns via this process.

6.4.0.1 Exercise

Now its your turn to try out these summarizing functions we just learned about.

  1. Use case_when to multiple Pheophytin values by 2 for station D7.

  2. Use case_when (nested in mutate) to create a new column, DON_ND, where the value is 0 when when DON is NA (ie. a non-detect)
    HINT: we can use is.na to check if a column is NA

Click below for the answer when you are done!

Code
df_wq %>%
  mutate(
    Test = 
      case_when(is.na(DON) ~ 0,
                TRUE ~ DON)
  )

6.5 Rename

Sometimes, we want to rename our columns (eg. if working with multiple data frames, we want the columns to be standardized). We can do this using the rename function from the dplyr package:

df_wq %>%
  rename(StationCode = Station)
# A tibble: 62 × 20
   StationCode Date        Chla Pheophytin TotAlkalinity DissAmmonia
   <chr>       <date>     <dbl>      <dbl>         <dbl>       <dbl>
 1 P8          2020-01-16  0.64       0.5             98        0.15
 2 D7          2020-01-22  0.67       0.87            82        0.21
 3 P8          2020-02-14  1.46       0.69            81        0.25
 4 D7          2020-02-20  2.15       0.5             86        0.14
 5 P8          2020-03-03  1.4        0.56            80        0.11
 6 D7          2020-03-06  1.89       1.13            93        0.22
 7 P8          2020-06-11  4.73       1.25            59        0.05
 8 D7          2020-06-17  1.74       0.89            78        0.05
 9 P8          2020-07-13  6.4        0.88            63        0.05
10 D7          2020-07-16  2.79       0.85            80        0.05
# ℹ 52 more rows
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>