Chapter 3 Making Plots
3.0.1 Coding goals
We will learn the basics of R as we generate a series of plots. These plots will be based on a real data set produced in our office for the Delta Smelt Resiliency Strategy Aquatic Weed Control Action
3.0.2 Study overview
The purpose of this two-year study was to use herbicides to remove invasive aquatic vegetation from potential habitat for endangered delta smelt. In this experiment there were two treated sites and two untreated reference sites. We monitored many aspects of the ecosystem to document changes in response to the herbicide applications. We have data for vegetation biomass and species composition as well as water quality, phytoplankton, zooplankton, and fishes. For now, we will just focus on the vegetation biomass data.
3.0.3 Metadata for data set
Description of columns in vegetation biomass data set.
date: Date a sample was collected rounded to the nearest month, so the data are plotted by survey month instead of exact date.
site: Study sites in the North Delta. Little Hasting (LH) was treated with herbicides to remove aquatic weeds. French Island (FI) is a site near LH that is similar in many ways but remained untreated.
rake_no: Samples of aquatic vegetation were collected using a long-handled thatch rake. There were 20-40 samples collected for each site x date combination.
mass_kg: Total wet biomass of aquatic vegetation collected from a rake sample.
3.0.4 Helpful resources
R for Data Science: This is the primary source for learning how to use the tidyverse packages. Also this whole book was created using RMarkdown! I’ll just refer to it as RDS in the lessons below.
3.0.5 Initial set up
3.0.5.1 Install and load required packages
For the exercises, you’ll need the “tidyverse” which is a suite of packages that are useful for data science, and the “lubridate” package, which is handy for formatting dates.
Packages only need to be installed once, which is why the code to do that is now commented out with the ‘#’
Make sure the names of the packages are in quotes, or the code won’t work.
You will need to load the packages you want to use during every session using the library() function. As you load the packages, you’ll get some warnings and other notifications. It’s good to take a look at them, but most of the time, you don’t need to worry about them. I turned off printing of messages and warning in my markdown, so you won’t see them below.
#install.packages("lubridate")
#install.packages("tidyverse")
library(lubridate)
library(tidyverse)
Note: Packages can also be installed from the “Packages” tab in the lower right pane.
3.0.5.2 Read the data set into R
It is often easiest to import data into R as a comma delimited file, which involves using the read.csv() function. You can import data from other types of files too, but the import function will be a little different.
The data set we will use is published on the Environmental Data Initiative website, so we can read it using the link to the csv file.
<-read.csv("https://portal.edirepository.org/nis/dataviewer?packageid=edi.1079.1&entityid=f29d1162866d4e9fb2fdd7355efd4c1e") veg_data
3.0.5.3 Examine the structure of the data set
The str() function will show you some useful information about each column including what type of column it is.
str(veg_data)
## 'data.frame': 876 obs. of 16 variables:
## $ Site : chr "DI" "DI" "DI" "DI" ...
## $ Site_Type : chr "Treated" "Treated" "Treated" "Treated" ...
## $ Date : chr "2017-05-25" "2017-06-07" "2017-06-07" "2017-05-25" ...
## $ Survey_Month : chr "2017_06" "2017_06" "2017_06" "2017_06" ...
## $ Latitude : num 38.1 38.1 38.1 38.1 38.1 ...
## $ Longitude : num -122 -122 -122 -122 -122 ...
## $ Total_Wet_Biomass_kg : num 0 0 0 0.0055 0.17 0.284 0.304 0.446 0.51 0.518 ...
## $ Egeria_densa : int 0 0 0 50 90 100 100 100 100 100 ...
## $ Elodea_canadensis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Ceratophyllum_demersum: int 0 0 0 50 0 0 0 0 0 0 ...
## $ Potamogeton_crispus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Myriophyllum_spicatum : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Stuckenia_pectinata : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Najas_guadalupensis : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Potamogeton_nodosus : int 0 0 0 0 0 0 0 0 0 0 ...
## $ Cabomba_caroliniana : int 0 0 0 0 0 0 0 0 0 0 ...
The head() function will show you the first six rows of data. Similarly, the tail() function will show you the last six rows.
head(veg_data)
## Site Site_Type Date Survey_Month Latitude Longitude
## 1 DI Treated 2017-05-25 2017_06 38.08299 -121.7239
## 2 DI Treated 2017-06-07 2017_06 38.08329 -121.7188
## 3 DI Treated 2017-06-07 2017_06 38.08331 -121.7192
## 4 DI Treated 2017-05-25 2017_06 38.08399 -121.7198
## 5 DI Treated 2017-05-25 2017_06 38.08388 -121.7162
## 6 DI Treated 2017-05-25 2017_06 38.08351 -121.7166
## Total_Wet_Biomass_kg Egeria_densa Elodea_canadensis Ceratophyllum_demersum
## 1 0.0000 0 0 0
## 2 0.0000 0 0 0
## 3 0.0000 0 0 0
## 4 0.0055 50 0 50
## 5 0.1700 90 0 0
## 6 0.2840 100 0 0
## Potamogeton_crispus Myriophyllum_spicatum Stuckenia_pectinata
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
## Najas_guadalupensis Potamogeton_nodosus Cabomba_caroliniana
## 1 0 0 0
## 2 0 0 0
## 3 0 0 0
## 4 0 0 0
## 5 0 0 0
## 6 0 0 0
tail(veg_data)
## Site Site_Type Date Survey_Month Latitude Longitude
## 871 LH Treated 2018-12-11 2018_12 38.25070 -121.6918
## 872 LH Treated 2018-12-11 2018_12 38.25174 -121.6937
## 873 LH Treated 2018-12-11 2018_12 38.24510 -121.6914
## 874 LH Treated 2018-12-11 2018_12 38.25291 -121.6921
## 875 LH Treated 2018-12-11 2018_12 38.24797 -121.6935
## 876 LH Treated 2018-12-11 2018_12 38.25250 -121.6936
## Total_Wet_Biomass_kg Egeria_densa Elodea_canadensis Ceratophyllum_demersum
## 871 0.610 0 90 0
## 872 0.746 50 50 0
## 873 0.834 100 0 0
## 874 1.026 20 0 80
## 875 1.082 90 0 0
## 876 2.562 100 0 0
## Potamogeton_crispus Myriophyllum_spicatum Stuckenia_pectinata
## 871 10 0 0
## 872 0 0 0
## 873 0 0 0
## 874 0 0 0
## 875 10 0 0
## 876 0 0 0
## Najas_guadalupensis Potamogeton_nodosus Cabomba_caroliniana
## 871 0 0 0
## 872 0 0 0
## 873 0 0 0
## 874 0 0 0
## 875 0 0 0
## 876 0 0 0
3.0.5.4 Simplying the data set a bit for our first plotting exercises
<- veg_data %>%
veg_data_north filter(Site == "LH" | Site == "FI")
3.0.6 Formatting the data set and making the first plot
3.0.6.1 Format the date column
The column type looks fine for all the columns except the date. We need to change it from factor to date. This is where the ‘lubridate’ package comes in handy. The original format of the date in our tibble is month-day-year, so we use the mdy() function.
$Date<-ymd(veg_data_north$Date)
veg_data_northglimpse(veg_data_north)
## Rows: 536
## Columns: 16
## $ Site <chr> "FI", "FI", "FI", "FI", "FI", "FI", "FI", "FI",~
## $ Site_Type <chr> "Reference", "Reference", "Reference", "Referen~
## $ Date <date> 2017-06-09, 2017-05-26, 2017-06-09, 2017-06-07~
## $ Survey_Month <chr> "2017_06", "2017_06", "2017_06", "2017_06", "20~
## $ Latitude <dbl> 38.27472, 38.26994, 38.27605, 38.27073, 38.2732~
## $ Longitude <dbl> -121.7002, -121.6993, -121.6971, -121.7005, -12~
## $ Total_Wet_Biomass_kg <dbl> 0.0000, 0.0840, 0.0917, 0.1780, 0.2420, 0.3080,~
## $ Egeria_densa <int> 0, 50, 0, 0, 70, 20, 70, 0, 90, 0, 0, 0, 80, 20~
## $ Elodea_canadensis <int> 0, 0, 50, 20, 10, 40, 10, 80, 0, 70, 50, 100, 0~
## $ Ceratophyllum_demersum <int> 0, 0, 0, 0, 0, 0, 20, 0, 0, 30, 0, 0, 20, 0, 30~
## $ Potamogeton_crispus <int> 0, 0, 0, 40, 10, 30, 0, 20, 0, 0, 0, 0, 0, 0, 0~
## $ Myriophyllum_spicatum <int> 0, 50, 50, 40, 0, 10, 0, 0, 0, 0, 50, 0, 0, 80,~
## $ Stuckenia_pectinata <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Najas_guadalupensis <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Potamogeton_nodosus <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
## $ Cabomba_caroliniana <int> 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0,~
Note that the dollar sign is used to specify a column within a data frame or tibble (i.e., dataframe$column)
3.0.6.2 Start exploring the vegetation biomass data by plotting them as a histogram
ggplot(data=veg_data, aes(Total_Wet_Biomass_kg)) + geom_histogram()
Note: R warns you that six values were removed because they are ‘NA’ which is not a number. Keep in mind that there are NAs in this data set. It will be important later.
3.0.7 Summarize the data in new tibble
Create a new tibble that summarizes the mass data. Calculate sample count, mean, and standard deviation for each date at each site. Don’t forget to account for the “NA” values when doing your summary stats.
The final data set should have columns: date, site, count, mean, sd.
Try using the pipe (%>%) to do all of this in one block of code. The keyboard shortcut for pipe: press and hold Shift + Control + M. See “Help” menu at the top for more RStudio keyboard shortcuts.
<-veg_data_north %>% #name of the source tibble
veg_data_north_statsgroup_by(Site, Date,Survey_Month) %>% #columns of tibble to group by
summarize(
count = n(), #counts number of samples for each site x date combo
mean = mean(Total_Wet_Biomass_kg, na.rm=T), #calculates means
sd = sd(Total_Wet_Biomass_kg, na.rm=T) #calculates standard deviations
)
3.0.8 Plot time series for each study site separately
Now, we will make plots with the summary stats we generated. The plot will show the mean vegetation biomass through time. Plot the mean values as points and connect the means with lines to make the pattern easier to see.
For general background, see RDS sections 3 (Data Visualization) and 7 (Exploratory Data Analysis). Neither section will show you how to make this specific plot though. Check out the example below for more relevant examples. http://www.sthda.com/english/wiki/ggplot2-line-plot-quick-start-guide-r-software-and-data-visualization For this, use the tibble you made in Exercise #3 with the summary stats for both sites.
Also, add the error bars indicating the standard deviations around each mean.
Make the points and lines associated with each of the two sites different colors so they can be easily distinguished. For color blind folks, also use different point types (e.g., circles vs. triangles) and different line types (e.g., solid vs dashed).
To see all the point and line type options check out this webpage
For color options, check out this webpage
In this plot, the line types, point types, and colors are different between sites but just using defaults
<- ggplot(veg_data_north_stats, aes(x=Date, y=mean, group=Site, color=Site, shape=Site)) +
(veg_plot geom_line(aes(linetype=Site))+
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 1, position=position_dodge(2)) +
geom_point()
)
This is the same plot but with custom line types, point types, and colors
<-ggplot(veg_data_north_stats, aes(x=Date, y=mean, group=Site, shape=Site, color=Site, fill=Site)) +
(veg_plot_custom geom_line(aes(linetype=Site))+
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 1, position=position_dodge(2)) +
geom_point()+
scale_color_manual(values =c("midnightblue","darkorange3"), aesthetics = c("colour", "fill"))+
scale_shape_manual(values = c(21, 25))+
scale_linetype_manual(values = c(2,3))
)
We can also plot the time series for sites in separate panels
ggplot(veg_data_north_stats, aes(x=Date, y=mean, group=Site, shape=Site, color=Site, fill=Site)) +
geom_line(aes(linetype=Site))+
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 1, position=position_dodge(2)) +
geom_point()+
scale_color_manual(values =c("midnightblue","darkorange3"), aesthetics = c("colour", "fill"))+
scale_shape_manual(values = c(21, 25))+
scale_linetype_manual(values = c(2,3))+
facet_grid(Site~.)
If we prefer, barplots we can create those.
ggplot(veg_data_north_stats, aes(x=Date, y=mean, group=Site, shape=Site, color=Site, fill=Site)) +
geom_bar(stat="identity",width=15)+
geom_errorbar(aes(ymin=mean-sd, ymax=mean+sd), width = 1) +
scale_color_manual(values =c("midnightblue","darkorange3"), aesthetics = c("colour", "fill"))+
facet_grid(Site~.)
Or boxplots
ggplot(veg_data_north_stats, aes(x=Date, y=mean, group=Site, shape=Site, color=Site, fill=Site)) +
geom_boxplot()+
scale_color_manual(values =c("midnightblue","darkorange3"), aesthetics = c("colour", "fill"))+
facet_grid(Site~.)
Next, let’s look at the correlations in vegetation biomass between sites.
You can export these plots as imaging using code. I exported them as PNG files, but there are other options.
#French Island plot
#ggsave(plot = veg_plot_custom #tell ggsave which plot to export
#, filename = "VegBiomass_TimeSeries_FrenchIsland.png" #provide the name for the image file
# , width = 6, height =4, units = "in" #dimensions of the exported image
# , dpi = 300 #resolution of the image (dots per inch)
# )