5  Mutate Data

Author

Perry S

5.1 Introduction

Sometimes, we want to modify our already existing columns or even add new ones. This is where the mutate function from the dplyr package comes in.

5.2 Add New Columns

With mutate, it’s simple to add new columns. First, let’s look at our current ones:

glimpse(df_wq)
Rows: 62
Columns: 20
$ Station            <chr> "P8", "D7", "P8", "D7", "P8", "D7", "P8", "D7", "P8…
$ Date               <date> 2020-01-16, 2020-01-22, 2020-02-14, 2020-02-20, 20…
$ Chla               <dbl> 0.64, 0.67, 1.46, 2.15, 1.40, 1.89, 4.73, 1.74, 6.4…
$ Pheophytin         <dbl> 0.50, 0.87, 0.69, 0.50, 0.56, 1.13, 1.25, 0.89, 0.8…
$ TotAlkalinity      <dbl> 98.0, 82.0, 81.0, 86.0, 80.0, 93.0, 59.0, 78.0, 63.…
$ DissAmmonia        <dbl> 0.150, 0.210, 0.250, 0.140, 0.110, 0.220, 0.050, 0.…
$ DissNitrateNitrite <dbl> 2.800, 0.490, 1.700, 0.480, 1.600, 0.380, 1.070, 0.…
$ DOC                <dbl> 3.90, 0.27, 2.80, 0.39, 2.00, 0.19, 2.80, 1.20, 3.1…
$ TOC                <dbl> 4.10, 0.32, 2.50, 0.41, 2.10, 0.20, 2.80, 1.20, 3.1…
$ DON                <dbl> NA, NA, NA, NA, NA, NA, 0.30, 0.20, 0.30, 0.10, 0.5…
$ TotPhos            <dbl> 0.310, 0.082, 0.130, 0.130, 0.190, 0.100, 0.188, 0.…
$ DissOrthophos      <dbl> 0.200, 0.071, 0.130, 0.065, 0.140, 0.082, 0.177, 0.…
$ TDS                <dbl> 380, 9500, 340, 5800, 290, 8700, 280, 7760, 227, 11…
$ TSS                <dbl> 8.9, 38.0, 2.2, 18.0, 1.4, 28.0, 6.6, 35.6, 5.3, 23…
$ TKN                <dbl> 0.520, 0.480, 0.430, 0.250, 0.400, 0.200, 0.400, 0.…
$ Depth              <dbl> 28.9, 18.8, 39.0, 7.1, 39.0, 7.2, 37.1, 5.2, 36.7, …
$ Secchi             <dbl> 116, 30, 212, 52, 340, 48, 100, 40, 160, 44, 120, 6…
$ Microcystis        <dbl> 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 4, 2, 3, 2, 2, 1, 1, …
$ SpCndSurface       <dbl> 667, 15532, 647, 11369, 530, 16257, 503, 12946, 404…
$ WTSurface          <dbl> 9.67, 9.97, 11.09, 12.51, 13.97, 13.81, 23.46, 21.1…

Now let’s add a new column, Lab, where every value is “BSA”:

df_wq %>%
  mutate(Lab = 'BSA') %>%
  glimpse()
Rows: 62
Columns: 21
$ Station            <chr> "P8", "D7", "P8", "D7", "P8", "D7", "P8", "D7", "P8…
$ Date               <date> 2020-01-16, 2020-01-22, 2020-02-14, 2020-02-20, 20…
$ Chla               <dbl> 0.64, 0.67, 1.46, 2.15, 1.40, 1.89, 4.73, 1.74, 6.4…
$ Pheophytin         <dbl> 0.50, 0.87, 0.69, 0.50, 0.56, 1.13, 1.25, 0.89, 0.8…
$ TotAlkalinity      <dbl> 98.0, 82.0, 81.0, 86.0, 80.0, 93.0, 59.0, 78.0, 63.…
$ DissAmmonia        <dbl> 0.150, 0.210, 0.250, 0.140, 0.110, 0.220, 0.050, 0.…
$ DissNitrateNitrite <dbl> 2.800, 0.490, 1.700, 0.480, 1.600, 0.380, 1.070, 0.…
$ DOC                <dbl> 3.90, 0.27, 2.80, 0.39, 2.00, 0.19, 2.80, 1.20, 3.1…
$ TOC                <dbl> 4.10, 0.32, 2.50, 0.41, 2.10, 0.20, 2.80, 1.20, 3.1…
$ DON                <dbl> NA, NA, NA, NA, NA, NA, 0.30, 0.20, 0.30, 0.10, 0.5…
$ TotPhos            <dbl> 0.310, 0.082, 0.130, 0.130, 0.190, 0.100, 0.188, 0.…
$ DissOrthophos      <dbl> 0.200, 0.071, 0.130, 0.065, 0.140, 0.082, 0.177, 0.…
$ TDS                <dbl> 380, 9500, 340, 5800, 290, 8700, 280, 7760, 227, 11…
$ TSS                <dbl> 8.9, 38.0, 2.2, 18.0, 1.4, 28.0, 6.6, 35.6, 5.3, 23…
$ TKN                <dbl> 0.520, 0.480, 0.430, 0.250, 0.400, 0.200, 0.400, 0.…
$ Depth              <dbl> 28.9, 18.8, 39.0, 7.1, 39.0, 7.2, 37.1, 5.2, 36.7, …
$ Secchi             <dbl> 116, 30, 212, 52, 340, 48, 100, 40, 160, 44, 120, 6…
$ Microcystis        <dbl> 1, 1, 1, 1, 1, 1, 3, 2, 3, 2, 4, 2, 3, 2, 2, 1, 1, …
$ SpCndSurface       <dbl> 667, 15532, 647, 11369, 530, 16257, 503, 12946, 404…
$ WTSurface          <dbl> 9.67, 9.97, 11.09, 12.51, 13.97, 13.81, 23.46, 21.1…
$ Lab                <chr> "BSA", "BSA", "BSA", "BSA", "BSA", "BSA", "BSA", "B…

We can even add columns that are combinations of other ones:

add two numeric columns together

df_wq %>%
  mutate(ChlPheo = Chla+Pheophytin) %>%
  select(Station, Date, Chla, Pheophytin, ChlPheo) %>%
  head()
# A tibble: 6 × 5
  Station Date        Chla Pheophytin ChlPheo
  <chr>   <date>     <dbl>      <dbl>   <dbl>
1 P8      2020-01-16  0.64       0.5     1.14
2 D7      2020-01-22  0.67       0.87    1.54
3 P8      2020-02-14  1.46       0.69    2.15
4 D7      2020-02-20  2.15       0.5     2.65
5 P8      2020-03-03  1.4        0.56    1.96
6 D7      2020-03-06  1.89       1.13    3.02

concatenate two character columns

df_wq %>%
  mutate(StatDate = paste(Station, Date, sep = '___')) %>%
  select(Station, Date, StatDate) %>%
  head()
# A tibble: 6 × 3
  Station Date       StatDate       
  <chr>   <date>     <chr>          
1 P8      2020-01-16 P8___2020-01-16
2 D7      2020-01-22 D7___2020-01-22
3 P8      2020-02-14 P8___2020-02-14
4 D7      2020-02-20 D7___2020-02-20
5 P8      2020-03-03 P8___2020-03-03
6 D7      2020-03-06 D7___2020-03-06
Relocate Columns

Sometimes, we want to re-arrange our column order. For example, with our new Lab column, we might want it in the beginning with the other metadata. Here, we use the relocate function:

df_wq %>%
  mutate(Lab = 'BSA') %>%
  relocate(Lab, .after = Date) %>%
  head()
# A tibble: 6 × 21
  Station Date       Lab    Chla Pheophytin TotAlkalinity DissAmmonia
  <chr>   <date>     <chr> <dbl>      <dbl>         <dbl>       <dbl>
1 P8      2020-01-16 BSA    0.64       0.5             98        0.15
2 D7      2020-01-22 BSA    0.67       0.87            82        0.21
3 P8      2020-02-14 BSA    1.46       0.69            81        0.25
4 D7      2020-02-20 BSA    2.15       0.5             86        0.14
5 P8      2020-03-03 BSA    1.4        0.56            80        0.11
6 D7      2020-03-06 BSA    1.89       1.13            93        0.22
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>

5.3 Modify Existing Columns

We can also use mutate to modify existing columns. For example, say we want to add 20 to every Chla value:

df_wq %>%
  mutate(Chla = Chla+20)
# A tibble: 62 × 20
   Station Date        Chla Pheophytin TotAlkalinity DissAmmonia
   <chr>   <date>     <dbl>      <dbl>         <dbl>       <dbl>
 1 P8      2020-01-16  20.6       0.5             98        0.15
 2 D7      2020-01-22  20.7       0.87            82        0.21
 3 P8      2020-02-14  21.5       0.69            81        0.25
 4 D7      2020-02-20  22.2       0.5             86        0.14
 5 P8      2020-03-03  21.4       0.56            80        0.11
 6 D7      2020-03-06  21.9       1.13            93        0.22
 7 P8      2020-06-11  24.7       1.25            59        0.05
 8 D7      2020-06-17  21.7       0.89            78        0.05
 9 P8      2020-07-13  26.4       0.88            63        0.05
10 D7      2020-07-16  22.8       0.85            80        0.05
# ℹ 52 more rows
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>

We can even change types! This is useful if, say, you have character column that should be numeric, or a character column that should be factor.

df_wq %>%
  mutate(Chla = as.numeric(Chla),
         Station = as.factor(Station)) %>%
  select(Station, Chla) %>%
  str()
tibble [62 × 2] (S3: tbl_df/tbl/data.frame)
 $ Station: Factor w/ 2 levels "D7","P8": 2 1 2 1 2 1 2 1 2 1 ...
 $ Chla   : num [1:62] 0.64 0.67 1.46 2.15 1.4 1.89 4.73 1.74 6.4 2.79 ...

5.4 case_when

What if we only want to modify certain data in a column? For example, say we want to add 20 to all Chla values only for P8.

This can be done using the case_when function (also available in tidyverse). This replaces the left-hand side of the mutate statement; now, instead of only having one logical statement, you can specify multiple!

(Note: the tilde operator ~ differentiates the left- and right-hand sides of the formula.)

df_wq %>%
  mutate(
    Chla =
      case_when(Station == 'P8' ~ Chla + 20, # if station is P8, add 20
              TRUE ~ Chla) # else, keep as previous value
  ) %>%
  select(c(Station:Date), Chla)
# A tibble: 62 × 3
   Station Date        Chla
   <chr>   <date>     <dbl>
 1 P8      2020-01-16 20.6 
 2 D7      2020-01-22  0.67
 3 P8      2020-02-14 21.5 
 4 D7      2020-02-20  2.15
 5 P8      2020-03-03 21.4 
 6 D7      2020-03-06  1.89
 7 P8      2020-06-11 24.7 
 8 D7      2020-06-17  1.74
 9 P8      2020-07-13 26.4 
10 D7      2020-07-16  2.79
# ℹ 52 more rows

5.5 Rename

Sometimes, we want to rename our columns (eg. if working with multiple data frames, we want the columns to be standardized). We can do this using the rename function from the dplyr package:

df_wq %>%
  rename(StationCode = Station)
# A tibble: 62 × 20
   StationCode Date        Chla Pheophytin TotAlkalinity DissAmmonia
   <chr>       <date>     <dbl>      <dbl>         <dbl>       <dbl>
 1 P8          2020-01-16  0.64       0.5             98        0.15
 2 D7          2020-01-22  0.67       0.87            82        0.21
 3 P8          2020-02-14  1.46       0.69            81        0.25
 4 D7          2020-02-20  2.15       0.5             86        0.14
 5 P8          2020-03-03  1.4        0.56            80        0.11
 6 D7          2020-03-06  1.89       1.13            93        0.22
 7 P8          2020-06-11  4.73       1.25            59        0.05
 8 D7          2020-06-17  1.74       0.89            78        0.05
 9 P8          2020-07-13  6.4        0.88            63        0.05
10 D7          2020-07-16  2.79       0.85            80        0.05
# ℹ 52 more rows
# ℹ 14 more variables: DissNitrateNitrite <dbl>, DOC <dbl>, TOC <dbl>,
#   DON <dbl>, TotPhos <dbl>, DissOrthophos <dbl>, TDS <dbl>, TSS <dbl>,
#   TKN <dbl>, Depth <dbl>, Secchi <dbl>, Microcystis <dbl>,
#   SpCndSurface <dbl>, WTSurface <dbl>