SEM Template

Author

Perry

Published

December 29, 2025

Analysis Example

This is a quick example of SEM analysis, covering data exploration + model creation + model interpretation.

My models will be:

\[ \text{log(Chlorophyll)} \sim \text{log(DissAmmonia)} + \text{...} + \text{RE(Region)} \]

\[ \text{log(DissAmmonia)} \sim \text{...} + \text{RE(Region)} \]

I used the monthly nutrient data for this.

Code

library(tidyverse)
library(emmeans)
library(car)
library(lme4)
library(lmerTest)
library(performance)
library(piecewiseSEM)
library(lavaan)
library(semPlot)
library(mgcv)
library(here)
library(conflicted)
theme_set(theme_bw())
options(contrasts = c('contr.sum', 'contr.poly'))

# Declare package conflict preferences
conflicts_prefer(dplyr::filter(), dplyr::lag(), lme4::lmer())

Code

# Define WQ parameters included in this example
wq_param <- c(
  "DissAmmonia",
  "DissNitrate",
  "DissNitrateNitrite",
  "DissOrthophos",
  "DON",
  "TKN",
  "TotPhos",
  "Temperature",
  "Salinity",
  "Turbidity",
  "DissolvedOxygen",
  "Chlorophyll",
  "pH",
  "Microcystis",
  "Sac_Inflow",
  "SJR_Inflow",
  "Total_Inflow"
)

df_monthly <- readRDS(here("data/processed/monthly_values.rds")) %>%
  mutate(Region = factor(Region), Date = make_date(Year, Month, 1)) %>%
  select(MonthYear, Year, Month, Date, Region, all_of(wq_param)) %>%
  arrange(Region, Date)

df_monthly_long <- df_monthly %>%
  pivot_longer(
    cols = all_of(wq_param),
    names_to = "Variable",
    values_to = "Value"
  )

Exploratory Analysis (Data)

First we investigate our individual variables.

Balance check

When you have categorical variables (like Region), it’s best to have balanced data – the same number per category.

How many samples do we have per Region?

Code

df_monthly %>%
  count(Region)

Region	n
Confluence	174
North	174
SouthCentral	174
Suisun Bay	174
Suisun Marsh	174

Basically even, but how many are non-NAs for relevant variables?

Code

df_monthly_long %>%
  drop_na(Value) %>%
  count(Region, Variable) %>%
  pivot_wider(names_from = Region, values_from = n, values_fill = 0)

Variable	Confluence	North	SouthCentral	Suisun Bay	Suisun Marsh
Chlorophyll	174	173	174	173	167
DON	169	170	165	166	80
DissAmmonia	170	172	173	169	83
DissNitrate	114	168	131	0	0
DissNitrateNitrite	170	170	173	169	83
DissOrthophos	170	170	173	169	83
DissolvedOxygen	174	174	174	174	174
Microcystis	143	138	144	130	130
SJR_Inflow	174	174	174	174	174
Sac_Inflow	174	174	174	174	174
Salinity	174	174	174	174	174
TKN	170	170	173	169	83
Temperature	174	174	174	174	174
TotPhos	170	170	173	169	83
Total_Inflow	174	174	174	174	174
Turbidity	174	174	174	174	174
pH	172	174	174	153	152

There are missing values for certain analytes. To group them:

balanced: Temperature, Salinity, Turbidity, DissolvedOxygen, Inflows

relatively balanced (>= 75%): Chlorophyll, pH, MVI

missing data (Suisun Marsh): DissAmmonia, DissNitrateNitrite, DissOrthophos, DON, TKN, TotPhos

missing a lot of data: DissNitrate

We can break this down by month to see if the missing data is interpolable or not:

Code

df_missing <- df_monthly_long %>%
  mutate(
    Missing = factor(
      if_else(is.na(Value), 1, 0),
      levels = c(0, 1),
      labels = c('Present', 'Missing')
    )
  )

plots <- df_missing %>%
  group_by(Variable) %>%
  group_split() %>%
  lapply(function(df) {
    ggplot(df, aes(x = Date, y = Region, fill = Missing)) +
      geom_tile(color = 'grey20', linewidth = 0.3, width = 35) +
      scale_fill_manual(
        values = c('Present' = 'lightblue', 'Missing' = 'white')
      ) +
      scale_x_date(
        name = "Month",
        date_breaks = '1 year',
        date_labels = '%Y',
        expand = expansion(mult = 0)
      ) +
      scale_y_discrete(expand = expansion(add = 0)) +
      theme_bw() +
      labs(
        title = paste0('Missing Data: ', unique(df$Variable)),
        fill = 'P/A'
      ) +
      theme(
        axis.text.x = element_text(angle = 90, hjust = 1, vjust = 0.5),
        legend.position = "top",
        panel.grid = element_blank()
      )
  })

walk(plots, print)

Findings:

No data until 02-2017 (for Suisun Marsh): DON, DissAmmonia, DissNitrateNitrite, DissOrthophos, TKN, TotPhos

rest of missing data is interpolable

Lots of missing data: DissNitrate

Large gaps until 06-2015: MVI

all 5 month gaps between Jan-May

Missing after 01-2023 for Suisun Marsh/Suisun Bay: pH

Small interpolable gaps: Chlorophyll

No gaps: Temperature, Salinity, Turbidity, DissolvedOxygen, Inflows

Discussion

A few analytes can be used for the entire 01-2010 through 06-2024 time range:

Fully ready: Temperature, Salinity, Turbidity, DissolvedOxygen, Inflows
Possibly needs minor interpolation: Chlorophyll

A few could be used for the entire range (with minor interpolation) for all regions except Suisun Marsh. The options are:

cut the analysis range to begin on 02-2017
remove Suisun Marsh from primary analyses, keep full range
- could run separately on narrower time frame
analyze with full dataset, interpret with unbalanced design in mind

pH is missing data after 01-2023 for two regions; this feels like an error to me, so we might want to double check the original data. If not, the options are similar to above:

cut the analysis range to end at 01-2023
remove Suisun Marsh/Suisun Bay from primary analyses
analyze with full dataset, interpret with unbalanced design in mind

MVI has multiple data gaps and some large, non-interpolable gaps. Options are similar to above. We also have chlorophyll data as a response variable to ground truth our results if analyses look off with MVI.

DissNitrate is missing large amounts of data and should be excluded.

Visualize distributions

Next, I’m removing columns I won’t be using for simplicity.

Code

df_monthly_rm <- df_monthly %>%
  select(-c(DissNitrate, DON, Microcystis, SJR_Inflow, Sac_Inflow, TKN, TotPhos))

Now I look at the shape of variable distributions to get a sense of what the data looks like.

Code

df_plt <- df_monthly_rm %>%
  pivot_longer(
    cols = DissAmmonia:Total_Inflow,
    names_to = 'Analyte',
    values_to = 'Value'
  )

ggplot(df_plt, aes(x = Value)) +
  geom_histogram(
    color = 'black',
    fill = 'lightgray'
  ) +
  facet_wrap(~Analyte, scales = 'free')

They’re pretty much all right-skewed to various degrees, except for Temperature (which almost appears bivariate) and pH, which is slightly left skewed. A few graphs, such as chlorophyll and pH, indicate outliers exist.

Visualize time series

If your data has a time component, it’s also useful to look at a time series of the analytes.

Code

# no points
for (a in unique(df_plt$Analyte)) {
  p <- df_plt %>%
    filter(Analyte == a) %>%
    ggplot(aes(x = Date, y = Value)) +
    geom_line(color = 'gray50', linewidth = 0.8, na.rm = TRUE) +
    facet_wrap(~Region, scales = 'free_y', ncol = 2) +
    labs(title = a)

  print(p)
}

Code

# with points
for (a in unique(df_plt$Analyte)) {
  p <- df_plt %>%
    filter(Analyte == a) %>%
    ggplot(aes(x = Date, y = Value)) +
    geom_line(color = 'gray50', linewidth = 0.8, na.rm = TRUE) +
    geom_point(shape = 21, na.rm = TRUE) +
    facet_wrap(~Region, scales = 'free_y', ncol = 2) +
    labs(title = a)

  print(p)
}

These are especially useful for identifying seasonal patterns within the data.

Visualize outliers

Next, I like to note extreme outliers. Even if they’re real data, they can negatively influence regressions either by influencing the coefficients or inflating the error.

If later on (eg. after model diagnostics) I decide to remove any of these, this is where I’d do it.

Code

df_plt <- df_plt %>%
  group_by(Analyte) %>%
  mutate(
    Q1  = quantile(Value, 0.25, na.rm = TRUE),
    Q3  = quantile(Value, 0.75, na.rm = TRUE),
    IQR = Q3 - Q1,
    is_outlier = Value < (Q1 - 1.5 * IQR) | Value > (Q3 + 1.5 * IQR)
  ) %>%
  ungroup()

ggplot(df_plt, aes(x = 1, y = Value)) +
  geom_jitter(
    aes(color = ifelse(is_outlier, 'Outlier', 'Normal')),
    shape = 21,
    width = 0.1,
    na.rm = TRUE
  ) +
  scale_color_manual(
    values = c('Normal' = 'lightgray', 'Outlier' = 'red')
  ) +
  geom_boxplot(
    outlier.shape = NA,
    na.rm = TRUE
  ) +
  facet_wrap(~Analyte, scales = 'free_y') +
  theme(legend.position = 'none')

There’s a few outliers, though none I would remove outright. However, further on I decided I wanted to remove a few extreme DissAmmonia, pH, Chlorophyll, and DissOrthophos values (7 total), which I’ll do here:

Code

df_monthly_rm <- df_monthly_rm %>%
  filter(!(DissAmmonia > 0.4 | pH < 6.5 | Chlorophyll > 35 | DissOrthophos < 0.01) |
           is.na(DissAmmonia) |
           is.na(pH) |
           is.na(Chlorophyll) |
           is.na(DissOrthophos))

Impute NAs (optional)

Next we’ll impute missing data using a linear approximation. If methods don’t require a continuous, equally-spaced time series, this step is optional. However, I like doing it (when feasible) to offset unbalanced design issues.

If your data has non-detects, make sure to uniformly impute those values between (slightly larger than) 0 and the reporting limit.

Code

df_monthly_imputed <- df_monthly_rm %>%
  arrange(Region, Date) %>%
  group_by(Region) %>%
  mutate(
    across(
      DissAmmonia:Total_Inflow,
      ~ is.na(.x),
      .names = 'imp_{.col}'
    ),
    across(
      DissAmmonia:Total_Inflow,
      ~ zoo::na.approx(.x, na.rm = FALSE),
      .names = '{.col}'
    )
  ) %>%
  ungroup()

values_long <- df_monthly_imputed %>%
  pivot_longer(
    cols = DissAmmonia:Total_Inflow,
    names_to = 'variable',
    values_to = 'value'
  )

flags_long <- df_monthly_imputed %>%
  pivot_longer(
    cols = starts_with('imp_'),
    names_to = 'flag_var',
    values_to = 'imp'
  ) %>%
  mutate(variable = sub('imp_', '', flag_var)) %>%
  select(Date, Region, variable, imp)

df_plot <- values_long %>%
  left_join(flags_long, by = c('Date', 'Region', 'variable'))

vars_imputed <- df_plot %>%
  group_by(variable) %>%
  summarise(any_imp = any(imp, na.rm = TRUE)) %>%
  filter(any_imp) %>%
  pull(variable)

regions <- unique(df_plot$Region)

for (r in regions) {
  p <- df_plot %>%
    filter(
      Region == r,
      variable %in% vars_imputed
    ) %>%
    ggplot(aes(Date, value)) +
      geom_line(color = 'gray40', linewidth = 0.5, na.rm = TRUE) +
      geom_point(aes(color = imp), size = 1.5, shape = 21, stroke = 0.8, na.rm = TRUE) +
      scale_color_manual(
        values = c('FALSE' = 'black', 'TRUE' = 'red'),
        labels = c('Observed', 'Imputed'),
        name = ''
      ) +
    labs(title = r) +
    facet_wrap(~ variable, scales = 'free', ncol = 2)

  print(p)
}

Code

# clean dataframe
df_monthly_imputed <- df_monthly_imputed %>%
  select(-starts_with('imp_'))

Exploratory Analysis (Relationships)

Next we can explore the relationships between our variables, primarily the response and its possible predictors, to determine what should go in our models.

Since SEMs are comprised of multiple “sub models”, this step takes a while, since it’s important to do this for every response variable. For simplicity, since log_Chlorophyll is my main response, I’ll only examine it in these examples.

Transform/modify data (optional)

If you know you’re likely to transform/modify the data, you can do it here so you have those columns for further exploratory analysis. Here I:

add log transforms: helps stabilize variance, fits certain data patterns
add AR1 lag terms: this is so the previous time point can be used in regressions. I just do this for response variables (log_Chlorophyll and (for later) log_DissAmmonia).

Code

# log data
df_monthly_log <- df_monthly_imputed %>%
  mutate(
    across(
      DissAmmonia:Total_Inflow,
      ~ log(.x),
      .names = 'log_{.col}'
    )
  )

# add lags
df_monthly_log <- df_monthly_log %>%
  arrange(Region, Date) %>%
  group_by(Region) %>%
  mutate(
    lag_logChlorophyll = lag(log_Chlorophyll),
    lag_Chlorophyll = lag(Chlorophyll),
    lag_logDissAmmonia = lag(log_DissAmmonia)
  ) %>%
  ungroup() %>%
  relocate(c(Chlorophyll, log_Chlorophyll, Date),
         .after = last_col())

Correlations

First we examine if any of our data is highly correlated (or not); high correlation between variables is an issue for causal inference models.

Code

corrplot::corrplot(
  cor(df_monthly_log %>% select(c(Chlorophyll,DissAmmonia:Total_Inflow)), use = 'pairwise.complete.obs'),
  method = 'color',
  type = 'lower',
  diag = TRUE
)

Code

corrplot::corrplot(
  cor(df_monthly_log %>% select(c(Chlorophyll,DissAmmonia:Total_Inflow)), use = 'pairwise.complete.obs'),
  method = 'number',
  number.cex = 0.6,
  type = 'lower',
  diag = TRUE
)

DissolvedOxygen and Temperature are strongly negatively correlated; DO is also correlated with DissNitrateNitrite. Temp is also slightly correlated with DissAmmonia and, more strongly, DissNitrateNitrite. DON/DissNitrateNitrite and Turbidity/Total Inflow are moderately correlated, which makes sense.

The strong DissolvedOxygen/Temperature relationship likely warrants further investigation (ie. determining their relationship/causal structure with respect to other variables).

Bivariate Plots (Continuous)

Next we graph our explanatory variables against our response variable. I like to look at the raw points (to get a true sense of the data) and interpolated splines to understand the functional relationship.

I ended up using log_Chlorophyll as my response, but kept the Chlorophyll graphs for reference.

Code

df_plt <- df_monthly_log %>%
  pivot_longer(
    cols = c(DissAmmonia:Total_Inflow, lag_Chlorophyll),
    names_to = 'Analyte',
    values_to = 'Value'
  )

df_plt_log <- df_monthly_log %>%
  pivot_longer(
    cols = c(log_DissAmmonia:log_Total_Inflow, lag_logChlorophyll),
    names_to = 'Analyte',
    values_to = 'Value'
  )

Scatterplots

Code

ggplot(df_plt) +
  geom_point(aes(Value, Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt_log) +
  geom_point(aes(Value, Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt) +
  geom_point(aes(Value, log_Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt_log) +
  geom_point(aes(Value, log_Chlorophyll, color = Region), shape = 21, alpha = 0.5) +
  facet_wrap(~Analyte, scales = 'free_x')

Splines (Global)

Code

ggplot(df_plt) +
  geom_smooth(aes(Value, Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt_log) +
  geom_smooth(aes(Value, Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt) +
  geom_smooth(aes(Value, log_Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt_log) +
  geom_smooth(aes(Value, log_Chlorophyll)) +
  facet_wrap(~Analyte, scales = 'free_x')

From these, I decided to keep log_DissAmmonia, log_Turbidity, and Temperature in my model. I also added log_DissOrthophosphate (since I think it’s considered an important variable) and the response’s lag term (lag_logChlorophyll) to account for autocorrelation. Note that Temperature appears to have a non-linear, almost cubic relationship with log_Chlorophyll.

I’m also adding log_Salinity, which is probably wrong, but it’s useful for the SEM examples.

Bivariate (Categorical)

It’s also worth seeing how chlorophyll correlates with categorical variables, in this case Region:

Code

ggplot(df_plt) +
  geom_boxplot(aes(Region, Chlorophyll, fill = Region))

Code

ggplot(df_plt) +
  geom_violin(aes(Region, Chlorophyll, fill = Region))

South Central has the most variance, followed by Suisun Marsh. The other regions seem similarly distributed. Overall, the means are similar.

Trivariate (Cat./Cont.)

You can also visualize potential interaction effects between categorical and continuous variables (or cat:cat, though cont:cont is much more difficult; it’s easier to choose those via background knowledge+model selection). If the relationship between the response and continuous predictor appears to change based on the categorical one, the interaction may be significant.

Code

ggplot(df_plt) +
  geom_smooth(aes(Value, Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt_log) +
  geom_smooth(aes(Value, Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt) +
  geom_smooth(aes(Value, log_Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Code

ggplot(df_plt_log) +
  geom_smooth(aes(Value, log_Chlorophyll, color = Region)) +
  facet_wrap(~Analyte, scales = 'free_x')

Here, no interaction effect particularly stands out. This makes sense, since Region is more of a grouping variable (which is why we’ll likely use it as a random effect).

Individual Models

Since SEMs are a series of related models (with shared covariance), it’s worth checking each sub-model individually. For traditional SEMs, only lm (fixed effects) models can be used, though piecewiseSEM can handle other model types. Here, I used linear mixed effects models.

For simplicity, I subset the data from 02-01-2017 to 12-31-2022 to escape issues with unbalanced regional data. I also added three new variables: three basis splines for Temperature (essentially \(T^1\), \(T^2\), and \(T^3\)). This is so I can correctly model Temperature’s likely non-linear dependence on Chlorophyll; with three terms, I’m assuming it’s cubic. This is called polynomial regression.

I also remove some extreme outliers I found in the model.

Code

df_monthly_subset <- df_monthly_log %>%
  filter(Year < 2023 & Date > '2017-02-01')

df_monthly_subset <- df_monthly_subset %>%
  slice(-c(5, 337))

df_monthly_subset <- df_monthly_subset %>%
  mutate(
    Temp_poly = poly(Temperature, 3),
    Temp1 = Temp_poly[,1],
    Temp2 = Temp_poly[,2],
    Temp3 = Temp_poly[,3]
    )

# clean dataframe
df_monthly_subset <- df_monthly_subset %>%
  select(-contains('_poly'))

df_monthly_subset <- df_monthly_subset %>%
  drop_na(log_Chlorophyll, log_Salinity, log_DissAmmonia, log_DissOrthophos, log_Turbidity, lag_logChlorophyll, lag_logDissAmmonia, log_pH, log_Temperature)

Note: Model Comparisons

Usually you help select an appropriate model via model comparison procedures such as anova(mod1, mod2) and looking at statistics such as AIC/BIC/log likelihood and/or doing likelihood ratio tests (as well as using background knowledge). I didn’t here, though.

LMM - Chla

Testing/Training

Usually not needed for causal inference, mostly just doing as a quick check that the model makes sense. Will re-run with full dataset.

First I’ll create a model based on training data and see how well it predicts my testing data. Temperature looked cubic to me, so I added cubic basis splines.

Code

# -- split testing/training --
sorted_dates <- sort(unique(df_monthly_subset$Date))

cut_idx <- floor(0.8 * length(sorted_dates))

train_dates <- sorted_dates[1:cut_idx]
test_dates  <- sorted_dates[(cut_idx + 1):length(sorted_dates)]

df_train <- df_monthly_subset %>% filter(Date %in% train_dates)
df_test  <- df_monthly_subset %>% filter(Date %in% test_dates)

# -- training model --
fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_train,
  na.action = na.omit
)

summary(fit_chl)

Linear mixed model fit by REML ['lmerMod']
Formula: 
log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos +  
    log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3 +  
    (1 | Region)
   Data: df_train

REML criterion at convergence: 262

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-2.6782 -0.6343 -0.0389  0.5485  3.0145 

Random effects:
 Groups   Name        Variance Std.Dev.
 Region   (Intercept) 0.02142  0.1464  
 Residual             0.13622  0.3691  
Number of obs: 278, groups:  Region, 5

Fixed effects:
                    Estimate Std. Error t value
(Intercept)        -0.648463   0.258606  -2.508
log_Salinity        0.003173   0.022728   0.140
log_DissAmmonia    -0.118220   0.046442  -2.546
log_DissOrthophos  -0.063448   0.071647  -0.886
log_Turbidity       0.248301   0.051955   4.779
lag_logChlorophyll  0.417455   0.051700   8.075
Temp1               1.159657   0.583910   1.986
Temp2               1.106531   0.448444   2.467
Temp3               2.270153   0.462919   4.904

Correlation of Fixed Effects:
            (Intr) lg_Sln lg_DsA lg_DsO lg_Trb lg_lgC Temp1  Temp2 
log_Salinty -0.298                                                 
log_DssAmmn  0.354 -0.152                                          
lg_DssOrthp  0.635 -0.159 -0.079                                   
log_Turbdty -0.580  0.249  0.062 -0.057                            
lg_lgChlrph  0.034 -0.043  0.120  0.123 -0.158                     
Temp1       -0.032 -0.009  0.510 -0.274  0.257 -0.341              
Temp2       -0.034 -0.066 -0.037 -0.064 -0.096  0.168 -0.082       
Temp3       -0.079  0.145 -0.150  0.037  0.120 -0.180  0.053 -0.013

Code

# -- predicted model --
df_test <- df_test %>%
  drop_na(log_Chlorophyll, log_Temperature, log_DissAmmonia, log_Total_Inflow)

df_test <- df_test %>%
  mutate(pred_log_Chl = predict(fit_chl, newdata = df_test))

# -- plot --
rng <- range(c(df_test$pred_log_Chl, df_test$log_Chlorophyll), na.rm = TRUE)

ggplot(df_test, aes(x = pred_log_Chl, y = log_Chlorophyll)) +
  geom_point(alpha = 0.6) +
  geom_smooth(method = 'lm', se = FALSE, linewidth = 0.7) +
  geom_abline(intercept = 0, slope = 1, linetype = 'dashed') +
  coord_equal(xlim = rng, ylim = rng) +
  labs(
    x = 'Predicted',
    y = 'Observed',
    title = 'Predicted vs Observed'
  ) +
  theme_minimal()

Looks pretty good! It does overestimate, but it’s a consistent offset and not horribly off.

Next is model diagnostics:

Code

RMSE <- sqrt(mean((df_test$log_Chlorophyll - df_test$pred_log_Chl)^2))
MAE  <- mean(abs(df_test$log_Chlorophyll - df_test$pred_log_Chl))
bias <- mean(df_test$pred_log_Chl - df_test$log_Chlorophyll)

print(RMSE); print(MAE); print(bias)

[1] 0.4276552

[1] 0.341872

[1] 0.1401536

Code

varcomp <- as.data.frame(VarCorr(fit_chl))
var_re  <- varcomp$vcov[varcomp$grp == 'Region']
var_res <- varcomp$vcov[varcomp$grp == 'Residual']

ICC <- var_re / (var_re + var_res)

print(ICC)

[1] 0.135882

Code

plot(fit_chl)

Code

res <- residuals(fit_chl)

df_res <- df_train
df_res$resid <- res

regions <- unique(df_res$Region)
n <- length(regions)

for (r in regions) {
  tmp <- df_res$resid[df_res$Region == r]
  acf(tmp,
      main = paste(r),
      na.action = na.pass)
}

Code

infl <- check_outliers(fit_chl)
infl

OK: No outliers detected.
- Based on the following method and threshold: cook (0.929).
- For variable: (Whole model)

Code

res <- residuals(fit_chl)

qqnorm(res)
qqline(res, col = 'red', lwd = 2)

Model diagnostics look pretty good; a few minor lag spikes, and there’s some right-tail skew, but neither are particularly noteworthy. A few acf graphs do show some seasonality but it’s not significant.

Full model

Now we run the full model:

Code

fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_monthly_subset,
  na.action = na.omit
)

summary(fit_chl)

Linear mixed model fit by REML ['lmerMod']
Formula: 
log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos +  
    log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3 +  
    (1 | Region)
   Data: df_monthly_subset

REML criterion at convergence: 340

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-2.58303 -0.63664 -0.06386  0.54888  2.81484 

Random effects:
 Groups   Name        Variance Std.Dev.
 Region   (Intercept) 0.02514  0.1586  
 Residual             0.14297  0.3781  
Number of obs: 347, groups:  Region, 5

Fixed effects:
                    Estimate Std. Error t value
(Intercept)        -0.614994   0.245478  -2.505
log_Salinity       -0.002593   0.021635  -0.120
log_DissAmmonia    -0.088624   0.038233  -2.318
log_DissOrthophos  -0.037385   0.067214  -0.556
log_Turbidity       0.281013   0.049134   5.719
lag_logChlorophyll  0.399493   0.046634   8.567
Temp1               1.448116   0.517629   2.798
Temp2               0.566255   0.400806   1.413
Temp3               2.353127   0.402061   5.853

Correlation of Fixed Effects:
            (Intr) lg_Sln lg_DsA lg_DsO lg_Trb lg_lgC Temp1  Temp2 
log_Salinty -0.250                                                 
log_DssAmmn  0.310 -0.098                                          
lg_DssOrthp  0.631 -0.130 -0.150                                   
log_Turbdty -0.604  0.232  0.019 -0.080                            
lg_lgChlrph -0.004 -0.021 -0.001  0.074 -0.217                     
Temp1       -0.069  0.001  0.492 -0.290  0.263 -0.401              
Temp2       -0.023 -0.066  0.001 -0.050 -0.081  0.206 -0.063       
Temp3       -0.064  0.112 -0.143  0.057  0.119 -0.159  0.008  0.001

The coefficients are similar, which is expected.

Model diagnostics:

Code

varcomp <- as.data.frame(VarCorr(fit_chl))
var_re  <- varcomp$vcov[varcomp$grp == 'Region']
var_res <- varcomp$vcov[varcomp$grp == 'Residual']

ICC <- var_re/(var_re + var_res)

print(ICC)

[1] 0.1495563

Code

plot(fit_chl)

Code

res <- residuals(fit_chl)

df_res <- df_monthly_subset
df_res$resid <- res

regions <- unique(df_res$Region)
n <- length(regions)

for (r in regions) {
  tmp <- df_res$resid[df_res$Region == r]
  acf(tmp,
      main = paste(r),
      na.action = na.pass)
}

Code

infl <- check_outliers(fit_chl)
infl

OK: No outliers detected.
- Based on the following method and threshold: cook (0.929).
- For variable: (Whole model)

Code

vif(fit_chl)

      log_Salinity    log_DissAmmonia  log_DissOrthophos      log_Turbidity 
          1.104629           1.450255           1.127412           1.172911 
lag_logChlorophyll              Temp1              Temp2              Temp3 
          1.355640           1.865876           1.058316           1.070728

Code

mf <- model.frame(fit_chl)
row_map <- as.numeric(rownames(mf))
res <- residuals(fit_chl)

df_resrow <- tibble(
  row = row_map,
  resid = res,
  fitted = fitted(fit_chl)
)

df_res <- df_monthly_subset %>%
  mutate(row = row_number()) %>%
  left_join(df_resrow, by = 'row')

Code

res <- residuals(fit_chl)

qqnorm(res)
qqline(res, col = 'red', lwd = 2)

Looks similar to the test model, so we’re good.

LMM - DissAmmonia

Since SEMs ideally have at least two “sub models”, I’ll throw together a quick model for DissolvedAmmonia. I didn’t bother with intense model diagnostics here, but they should definitely be done for real models.

Code

df_plt_log <- df_monthly_log %>%
  mutate(lag_logDissAmmonia = lag(log(DissAmmonia))) %>%
  pivot_longer(
    cols = c(log_DissNitrateNitrite:log_Total_Inflow, lag_logChlorophyll, lag_logDissAmmonia),
    names_to = 'Analyte',
    values_to = 'Value'
  )

Code

ggplot(df_plt) +
  geom_smooth(aes(Value, log_DissAmmonia)) +
  facet_wrap(~Analyte, scales = 'free_x') +
  theme_bw()

Code

ggplot(df_plt_log) +
  geom_smooth(aes(Value, log_DissAmmonia)) +
  facet_wrap(~Analyte, scales = 'free_x') +
  theme_bw()

Code

fit_amm <- lmer(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    Temp1 +
    log_Turbidity +
    (1|Region),
  data = df_monthly_subset,
  na.action = na.omit
)

summary(fit_amm)

Linear mixed model fit by REML ['lmerMod']
Formula: log_DissAmmonia ~ lag_logDissAmmonia + Temp1 + log_Turbidity +  
    (1 | Region)
   Data: df_monthly_subset

REML criterion at convergence: 489.3

Scaled residuals: 
    Min      1Q  Median      3Q     Max 
-4.6814 -0.6246  0.0659  0.6315  3.3702 

Random effects:
 Groups   Name        Variance Std.Dev.
 Region   (Intercept) 0.06061  0.2462  
 Residual             0.22566  0.4750  
Number of obs: 347, groups:  Region, 5

Fixed effects:
                   Estimate Std. Error t value
(Intercept)        -0.96356    0.25568  -3.769
lag_logDissAmmonia  0.49271    0.04699  10.484
Temp1              -3.50570    0.55817  -6.281
log_Turbidity      -0.13655    0.05981  -2.283

Correlation of Fixed Effects:
            (Intr) lg_lDA Temp1 
lg_lgDssAmm  0.560              
Temp1        0.141  0.487       
log_Turbdty -0.747 -0.088  0.144

Code

plot(fit_amm)

SEMs

Linear Mixed Effects Model

Piecewise

Now we can fit our models into an SEM. Since they’re both LMMs, we have to use piecewiseSEM.

Code

fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_monthly_subset)

fit_amm <- lmer(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    Temp1 +
    log_Turbidity +
    (1|Region),
  data = df_monthly_subset)


psem_out <- psem(fit_chl, fit_amm, data = df_monthly_subset)
summary(psem_out, standardize = 'scale')


  |                                                                            
  |                                                                      |   0%


  |                                                                            
  |============                                                          |  17%
  |                                                                            
  |=======================                                               |  33%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |===============================================                       |  67%
  |                                                                            
  |==========================================================            |  83%
  |                                                                            
  |======================================================================| 100%


Structural Equation Model of psem_out 

Call:
  log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos + log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3
  log_DissAmmonia ~ lag_logDissAmmonia + Temp1 + log_Turbidity

    AIC
 863.279

---
Tests of directed separation:

                              Independ.Claim Test.Type       DF Crit.Value
        log_DissAmmonia ~ log_Salinity + ...      coef  38.7998     6.7305
   log_DissAmmonia ~ log_DissOrthophos + ...      coef 339.1377     2.4154
  log_DissAmmonia ~ lag_logChlorophyll + ...      coef 341.9718     0.8992
               log_DissAmmonia ~ Temp2 + ...      coef 341.1981     0.7206
               log_DissAmmonia ~ Temp3 + ...      coef 338.9824     5.8138
  log_Chlorophyll ~ lag_logDissAmmonia + ...      coef 336.4220     5.0180
  P.Value  
   0.0133 *
   0.1211  
   0.3437  
   0.3966  
   0.0164 *
   0.0257 *

--
Global goodness-of-fit:

Chi-Squared = 6.016 with P-value = 0.421 and on 6 degrees of freedom
Fisher's C = 32.385 with P-value = 0.001 and on 12 degrees of freedom

---
Coefficients:

         Response          Predictor Estimate Std.Error       DF Crit.Value
  log_Chlorophyll       log_Salinity  -0.0026    0.0216  55.3931     0.0121
  log_Chlorophyll    log_DissAmmonia  -0.0886    0.0382 336.6293     5.2865
  log_Chlorophyll  log_DissOrthophos  -0.0374    0.0672 306.3545     0.2959
  log_Chlorophyll      log_Turbidity   0.2810    0.0491  69.9676    28.1085
  log_Chlorophyll lag_logChlorophyll   0.3995    0.0466 334.4428    71.6796
  log_Chlorophyll              Temp1   1.4481    0.5176 332.8132     7.6704
  log_Chlorophyll              Temp2   0.5663    0.4008 337.1051     1.9700
  log_Chlorophyll              Temp3   2.3531    0.4021 335.1774    34.1892
  log_DissAmmonia lag_logDissAmmonia   0.4927    0.0470 342.8033   108.9828
  log_DissAmmonia              Temp1  -3.5057    0.5582 342.6721    39.1401
  log_DissAmmonia      log_Turbidity  -0.1365    0.0598 232.7812     4.8258
  P.Value Std.Estimate    
   0.9129      -0.0080    
   0.0221      -0.1149   *
   0.5868      -0.0260    
   0.0000       0.3658 ***
   0.0000       0.4042 ***
   0.0059       0.1433  **
   0.1614       0.0560    
   0.0000       0.2328 ***
   0.0000       0.4879 ***
   0.0000      -0.2676 ***
   0.0290      -0.1371   *

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method Marginal Conditional
  log_Chlorophyll   none     0.43        0.51
  log_DissAmmonia   none     0.40        0.53

There seems to be a bug, since it says it can’t test things that show up in the subsequent dataframes. Nevertheless, we note:

AIC: this is only meaningful compared to other models (with the same RE structure)
- (relatively) smaller the better
Chi-squared (Goodness of Fit): not significant, which means our model fits the observed data well
Fisher’s C: significant, there’s some missing paths in our causal structure
- we can look at the independence claims (first df) to see which (eg. log_DissAmmonia and log_Salinity)
the second dataframe gives the coefficients (standardized)
the third dataframe gives the \(R^2\) values
- marginal are without the random effects while the conditional ones are

Also note that the coefficients can be interpreted (1 SD change in x corresponds to \(\beta\) SD change in y, conditional on other variables) except for the three Temperature ones. Together, they define a nonlinear chl–temperature curve, which is not constant (but can be graphed).

Linear Fixed Effects Model

Piecewise

For comparison, we can run the same model without the random effect; this makes it comparable to traditional SEM.

Code

fit_chl <- lm(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3,
  data = df_monthly_subset)

fit_amm <- lm(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    Temp1 +
    log_DissOrthophos +
    log_Turbidity,
    data = df_monthly_subset)

psem_out <- psem(fit_chl, fit_amm, data = df_monthly_subset)
summary(psem_out, standardize = 'scale')


  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |==============                                                        |  20%
  |                                                                            
  |============================                                          |  40%
  |                                                                            
  |==========================================                            |  60%
  |                                                                            
  |========================================================              |  80%
  |                                                                            
  |======================================================================| 100%


Structural Equation Model of psem_out 

Call:
  log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos + log_Turbidity + lag_logChlorophyll + Temp1 + Temp2 + Temp3
  log_DissAmmonia ~ lag_logDissAmmonia + Temp1 + log_DissOrthophos + log_Turbidity

    AIC
 848.324

---
Tests of directed separation:

                              Independ.Claim Test.Type  DF Crit.Value P.Value
        log_DissAmmonia ~ log_Salinity + ...      coef 341     3.8375  0.0001
  log_DissAmmonia ~ lag_logChlorophyll + ...      coef 341     0.4192  0.6753
               log_DissAmmonia ~ Temp2 + ...      coef 341    -1.8023  0.0724
               log_DissAmmonia ~ Temp3 + ...      coef 341     1.9241  0.0552
  log_Chlorophyll ~ lag_logDissAmmonia + ...      coef 337     1.8446  0.0660
     
  ***
     
     
     
     

--
Global goodness-of-fit:

Chi-Squared = 25.971 with P-value = 0 and on 5 degrees of freedom
Fisher's C = 34.904 with P-value = 0 and on 10 degrees of freedom

---
Coefficients:

         Response          Predictor Estimate Std.Error  DF Crit.Value P.Value
  log_Chlorophyll       log_Salinity  -0.0083    0.0145 338    -0.5715  0.5680
  log_Chlorophyll    log_DissAmmonia  -0.1177    0.0382 338    -3.0803  0.0022
  log_Chlorophyll  log_DissOrthophos   0.0260    0.0604 338     0.4306  0.6670
  log_Chlorophyll      log_Turbidity   0.2212    0.0363 338     6.0935  0.0000
  log_Chlorophyll lag_logChlorophyll   0.5050    0.0442 338    11.4285  0.0000
  log_Chlorophyll              Temp1   0.7501    0.5149 338     1.4566  0.1462
  log_Chlorophyll              Temp2   0.9254    0.4076 338     2.2702  0.0238
  log_Chlorophyll              Temp3   2.4551    0.4164 338     5.8967  0.0000
  log_DissAmmonia lag_logDissAmmonia   0.5352    0.0478 342    11.1842  0.0000
  log_DissAmmonia              Temp1  -3.1621    0.5874 342    -5.3832  0.0000
  log_DissAmmonia  log_DissOrthophos   0.1461    0.0740 342     1.9753  0.0490
  log_DissAmmonia      log_Turbidity   0.0796    0.0411 342     1.9369  0.0536
  Std.Estimate    
       -0.0256    
       -0.1525  **
        0.0180    
        0.2879 ***
        0.5110 ***
        0.0742    
        0.0916   *
        0.2429 ***
        0.5300 ***
       -0.2414 ***
        0.0783   *
        0.0800    

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method R.squared
  log_Chlorophyll   none      0.49
  log_DissAmmonia   none      0.52

Traditional

Code

model <- '
    log_Chlorophyll ~ log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3

    log_DissAmmonia ~ lag_logDissAmmonia +
    Temp1 +
    log_DissOrthophos +
    log_Turbidity
'

fit <- sem(model, data = df_monthly_subset)

summary(fit, fit.measures = TRUE, standardized = TRUE)

lavaan 0.6-20 ended normally after 2 iterations

  Estimator                                         ML
  Optimization method                           NLMINB
  Number of model parameters                        14

  Number of observations                           347

Model Test User Model:
                                                      
  Test statistic                                25.971
  Degrees of freedom                                 5
  P-value (Chi-square)                           0.000

Model Test Baseline Model:

  Test statistic                               510.446
  Degrees of freedom                                17
  P-value                                        0.000

User Model versus Baseline Model:

  Comparative Fit Index (CFI)                    0.958
  Tucker-Lewis Index (TLI)                       0.856

Loglikelihood and Information Criteria:

  Loglikelihood user model (H0)               -408.162
  Loglikelihood unrestricted model (H1)       -395.177
                                                      
  Akaike (AIC)                                 844.324
  Bayesian (BIC)                               898.214
  Sample-size adjusted Bayesian (SABIC)        853.802

Root Mean Square Error of Approximation:

  RMSEA                                          0.110
  90 Percent confidence interval - lower         0.071
  90 Percent confidence interval - upper         0.153
  P-value H_0: RMSEA <= 0.050                    0.008
  P-value H_0: RMSEA >= 0.080                    0.899

Standardized Root Mean Square Residual:

  SRMR                                           0.023

Parameter Estimates:

  Standard errors                             Standard
  Information                                 Expected
  Information saturated (h1) model          Structured

Regressions:
                    Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
  log_Chlorophyll ~                                                      
    log_Salinity      -0.008    0.014   -0.591    0.554   -0.008   -0.026
    log_DissAmmoni    -0.118    0.037   -3.205    0.001   -0.118   -0.152
    log_DssOrthphs     0.026    0.059    0.440    0.660    0.026    0.018
    log_Turbidity      0.221    0.036    6.092    0.000    0.221    0.287
    lg_lgChlrphyll     0.505    0.044   11.585    0.000    0.505    0.510
    Temp1              0.750    0.503    1.491    0.136    0.750    0.074
    Temp2              0.925    0.402    2.302    0.021    0.925    0.091
    Temp3              2.455    0.408    6.019    0.000    2.455    0.242
  log_DissAmmonia ~                                                      
    lag_logDssAmmn     0.535    0.048   11.266    0.000    0.535    0.530
    Temp1             -3.162    0.583   -5.422    0.000   -3.162   -0.241
    log_DssOrthphs     0.146    0.073    1.990    0.047    0.146    0.078
    log_Turbidity      0.080    0.041    1.951    0.051    0.080    0.080

Variances:
                   Estimate  Std.Err  z-value  P(>|z|)   Std.lv  Std.all
   .log_Chlorphyll    0.151    0.011   13.172    0.000    0.151    0.512
   .log_DissAmmoni    0.238    0.018   13.172    0.000    0.238    0.481

Code

modindices(fit, sort = TRUE)

	lhs	op	rhs	mi	epc	sepc.lv	sepc.all	sepc.nox
54	log_DissAmmonia	~	log_Salinity	14.3653045	0.0661755	0.0661755	0.1576313	0.0941093
68	log_DissOrthophos	~	log_DissAmmonia	4.5340375	0.2662180	0.2662180	0.4969079	0.4969079
57	log_DissAmmonia	~	Temp3	3.7267608	0.9639944	0.9639944	0.0735945	1.3709130
59	log_Salinity	~	log_DissAmmonia	3.4500263	0.1463177	0.1463177	0.0614259	0.0614259
51	log_Chlorophyll	~~	log_DissAmmonia	3.4289810	-0.0366593	-0.0366593	-0.1931149	-0.1931149
52	log_Chlorophyll	~	lag_logDissAmmonia	3.4289810	0.0824490	0.0824490	0.1055732	0.1516052
56	log_DissAmmonia	~	Temp2	3.2743916	-0.8991260	-0.8991260	-0.0686422	-1.2786625
58	log_Salinity	~	log_Chlorophyll	2.5503856	-0.8101295	-0.8101295	-0.2630362	-0.2630362
67	log_DissOrthophos	~	log_Chlorophyll	1.8942228	-0.2135538	-0.2135538	-0.3082850	-0.3082850
94	Temp1	~	log_Chlorophyll	1.5926568	0.0129093	0.0129093	0.1307791	0.1307791
77	log_Turbidity	~	log_DissAmmonia	1.5157448	-0.0712911	-0.0712911	-0.0709825	-0.0709825
113	Temp3	~	log_DissAmmonia	1.1365886	0.0028511	0.0028511	0.0373452	0.0373452
112	Temp3	~	log_Chlorophyll	0.8568775	-0.0157100	-0.0157100	-0.1591520	-0.1591520
121	lag_logDissAmmonia	~	log_Chlorophyll	0.6779539	0.0318754	0.0318754	0.0248935	0.0248935
104	Temp2	~	log_DissAmmonia	0.4806518	-0.0019021	-0.0019021	-0.0249153	-0.0249153
76	log_Turbidity	~	log_Chlorophyll	0.3055162	-0.0714940	-0.0714940	-0.0550544	-0.0550544
122	lag_logDissAmmonia	~	log_DissAmmonia	0.2053313	-0.1007060	-0.1007060	-0.1016905	-0.1016905
85	lag_logChlorophyll	~	log_Chlorophyll	0.1814990	0.0704217	0.0704217	0.0697551	0.0697551
53	log_DissAmmonia	~	log_Chlorophyll	0.1789647	-0.0307024	-0.0307024	-0.0237453	-0.0237453
55	log_DissAmmonia	~	lag_logChlorophyll	0.1787199	0.0224721	0.0224721	0.0175461	0.0319580
95	Temp1	~	log_DissAmmonia	0.1355445	-0.0023122	-0.0023122	-0.0302873	-0.0302873
103	Temp2	~	log_Chlorophyll	0.1099250	0.0057720	0.0057720	0.0584736	0.0584736
86	lag_logChlorophyll	~	log_DissAmmonia	0.0058513	-0.0018456	-0.0018456	-0.0023637	-0.0023637
20	log_Salinity	~~	Temp2	0.0000000	0.0000000	0.0000000	NA	0.0000000
45	Temp2	~~	Temp2	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
21	log_Salinity	~~	Temp3	0.0000000	0.0000000	0.0000000	NA	0.0000000
48	Temp3	~~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
39	lag_logChlorophyll	~~	Temp3	0.0000000	0.0000000	0.0000000	NA	0.0000000
78	log_Turbidity	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
119	Temp3	~	Temp2	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
32	log_Turbidity	~~	Temp1	0.0000000	0.0000000	0.0000000	NA	0.0000000
69	log_DissOrthophos	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
23	log_DissOrthophos	~~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
30	log_Turbidity	~~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
26	log_DissOrthophos	~~	Temp1	0.0000000	0.0000000	0.0000000	NA	0.0000000
96	Temp1	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
16	log_Salinity	~~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	NA	0.0000000
29	log_DissOrthophos	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
22	log_Salinity	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
38	lag_logChlorophyll	~~	Temp2	0.0000000	0.0000000	0.0000000	NA	0.0000000
125	lag_logDissAmmonia	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
24	log_DissOrthophos	~~	log_Turbidity	0.0000000	0.0000000	0.0000000	NA	0.0000000
50	lag_logDissAmmonia	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
65	log_Salinity	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
19	log_Salinity	~~	Temp1	0.0000000	0.0000000	0.0000000	NA	0.0000000
79	log_Turbidity	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
28	log_DissOrthophos	~~	Temp3	0.0000000	0.0000000	0.0000000	NA	0.0000000
73	log_DissOrthophos	~	Temp2	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
123	lag_logDissAmmonia	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
18	log_Salinity	~~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	NA	0.0000000
17	log_Salinity	~~	log_Turbidity	0.0000000	0.0000000	0.0000000	NA	0.0000000
82	log_Turbidity	~	Temp2	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
81	log_Turbidity	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
33	log_Turbidity	~~	Temp2	0.0000000	0.0000000	0.0000000	NA	0.0000000
27	log_DissOrthophos	~~	Temp2	0.0000000	0.0000000	0.0000000	NA	0.0000000
66	log_Salinity	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
43	Temp1	~~	Temp3	0.0000000	0.0000000	0.0000000	NA	0.0000000
46	Temp2	~~	Temp3	0.0000000	0.0000000	0.0000000	NA	0.0000000
129	lag_logDissAmmonia	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
98	Temp1	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
75	log_DissOrthophos	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
105	Temp2	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
74	log_DissOrthophos	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
42	Temp1	~~	Temp2	0.0000000	0.0000000	0.0000000	NA	0.0000000
101	Temp1	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
62	log_Salinity	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
49	Temp3	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
97	Temp1	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
83	log_Turbidity	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
106	Temp2	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
41	Temp1	~~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
61	log_Salinity	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
88	lag_logChlorophyll	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
71	log_DissOrthophos	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
70	log_DissOrthophos	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
34	log_Turbidity	~~	Temp3	0.0000000	0.0000000	0.0000000	NA	0.0000000
114	Temp3	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
118	Temp3	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
40	lag_logChlorophyll	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
124	lag_logDissAmmonia	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
115	Temp3	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
111	Temp2	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
80	log_Turbidity	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
47	Temp2	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
100	Temp1	~	Temp2	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
44	Temp1	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
72	log_DissOrthophos	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
63	log_Salinity	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
25	log_DissOrthophos	~~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	NA	0.0000000
92	lag_logChlorophyll	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
89	lag_logChlorophyll	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
36	lag_logChlorophyll	~~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
84	log_Turbidity	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
107	Temp2	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
93	lag_logChlorophyll	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
108	Temp2	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
102	Temp1	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
99	Temp1	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
60	log_Salinity	~	log_DissOrthophos	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
120	Temp3	~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
37	lag_logChlorophyll	~~	Temp1	0.0000000	0.0000000	0.0000000	NA	0.0000000
109	Temp2	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
117	Temp3	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
127	lag_logDissAmmonia	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
116	Temp3	~	log_Turbidity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
87	lag_logChlorophyll	~	log_Salinity	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
110	Temp2	~	Temp3	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
126	lag_logDissAmmonia	~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
90	lag_logChlorophyll	~	Temp1	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000
31	log_Turbidity	~~	lag_logChlorophyll	0.0000000	0.0000000	0.0000000	NA	0.0000000
35	log_Turbidity	~~	lag_logDissAmmonia	0.0000000	0.0000000	0.0000000	NA	0.0000000
128	lag_logDissAmmonia	~	Temp2	0.0000000	0.0000000	0.0000000	0.0000000	0.0000000

We note that the coefficients are very similar; the main difference is in interpretation. Here, the high RMSEA, especially, points to missing structural paths (as does the somewhat low TLI). By looking at the modification indices, we see that (similar to the d-separation tests from psem) a log_DissAmmonia ~ log_Salinity link is likely missing.

Note that traditional SEM can also evaluate residual covariance between response variables, while psem cannot (the log_Chlorophyll ~~ log_DissAmmonia row in the MI table).

Also note that, if we include latent variables, we must use traditional SEM. See the stats examples SEM file for that code.

Traditional: Plot

We can also plot traditional SEMs, just for fun:

Code

semPaths(
  fit,
  what = 'std',
  whatLabels = 'std',
  layout = 'tree2',
  nCharNodes = 5,
  edge.label.cex = 1.1,
  sizeMan = 5,
  edge.width = 1,
  esize = 1,
  residuals = FALSE,
  mar = c(7, 7, 7, 7)
)

General Additive Model (GAM)

Piecewise SEMs can also handle GAMs, which are another type of linear model. The advantage is how they handle non-linear relationships.

In the previous models, we identified the relationship between Chlorophyll and Temperature as non-linear, so we added polynomial basis functions to our formulas. GAMs essentially do the exact same thing but use splines instead, which are more sophisticated.

(I won’t go into model diagnostics for GAMs here.)

LMM-like GAM

We can essentially re-create the LMM model by switching out the Temperature terms for a smooth function and keeping the others as linear predictors:

Code

# default basis spline = thin plate
gam_chl <- gam(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    s(Temperature, k = 6) +
    s(Region, bs = 're'), # random effect (ridge penalty)
  data = df_monthly_subset,
  method = 'REML'
)

gam_amm <- gam(
  log_DissAmmonia ~
    lag_logDissAmmonia +
    log_Turbidity +
    s(Temperature, k = 6) +
    s(Region, bs = 're'),
  data = df_monthly_subset,
  method = 'REML'
)

psem_gam <- psem(gam_chl, gam_amm, data = df_monthly_subset)
summary(psem_gam, standardize = 'scale')


  |                                                                            
  |                                                                      |   0%


  |                                                                            
  |==================                                                    |  25%


  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |====================================================                  |  75%


  |                                                                            
  |======================================================================| 100%


Structural Equation Model of psem_gam 

Call:
  log_Chlorophyll ~ log_Salinity + log_DissAmmonia + log_DissOrthophos + log_Turbidity + lag_logChlorophyll + s(Temperature, k = 6) + s(Region, bs = "re")
  log_DissAmmonia ~ lag_logDissAmmonia + log_Turbidity + s(Temperature, k = 6) + s(Region, bs = "re")

    AIC
 802.565

---
Tests of directed separation:

                              Independ.Claim Test.Type  DF Crit.Value P.Value   
        log_DissAmmonia ~ log_Salinity + ...      coef 347     3.1598  0.0017 **
   log_DissAmmonia ~ log_DissOrthophos + ...      coef 347     1.7240  0.0856   
  log_DissAmmonia ~ lag_logChlorophyll + ...      coef 347     0.9532  0.3412   
  log_Chlorophyll ~ lag_logDissAmmonia + ...      coef 347     2.1399  0.0331  *

--
Global goodness-of-fit:

Chi-Squared = 13.168 with P-value = 0.013 and on 4.242 degrees of freedom
Fisher's C = 26.612 with P-value = 0.001 and on 8 degrees of freedom

---
Coefficients:

         Response          Predictor Estimate Std.Error       DF Crit.Value
  log_Chlorophyll       log_Salinity  -0.0053    0.0218 347.0000    -0.2448
  log_Chlorophyll    log_DissAmmonia  -0.0874    0.0381 347.0000    -2.2951
  log_Chlorophyll  log_DissOrthophos  -0.0413     0.067 347.0000    -0.6165
  log_Chlorophyll      log_Turbidity   0.2783    0.0491 347.0000     5.6733
  log_Chlorophyll lag_logChlorophyll    0.404    0.0467 347.0000     8.6581
  log_Chlorophyll     s(Temperature)        -         -   4.8849     8.5360
  log_Chlorophyll          s(Region)        -         -   4.0000     7.4867
  log_DissAmmonia lag_logDissAmmonia   0.4968    0.0471 347.0000    10.5459
  log_DissAmmonia      log_Turbidity  -0.1321    0.0596 347.0000    -2.2159
  log_DissAmmonia     s(Temperature)        -         -   2.9887    14.2789
  log_DissAmmonia          s(Region)        -         -   4.0000     6.6010
  P.Value Std.Estimate    
   0.8068            -    
   0.0223            -   *
   0.5380            -    
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0274            -   *
   0.0000            - ***
   0.0000            - ***

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method R.squared
  log_Chlorophyll   none      0.52
  log_DissAmmonia   none      0.55

Our results are very similar, which makes sense:

\(R^2\) values are almost the same (though the GAM is slightly better)
linear coefficient and p-values are almost the same

However, the big difference is that Temperature is now included as a smooth function instead of multiple polynomial terms. This changes the shape of its partial effect curve:

Code

df_temp_plt <- df_monthly_subset %>%
  mutate(
    poly_temp = poly(Temperature, 3)
  )

fit_chl <- lmer(
  log_Chlorophyll ~
    log_Salinity +
    log_DissAmmonia +
    log_DissOrthophos +
    log_Turbidity +
    lag_logChlorophyll +
    Temp1 +
    Temp2 +
    Temp3 +
    (1|Region),
  data = df_temp_plt
)

poly_obj <- attr(df_temp_plt$poly_temp, 'coefs')

temp_seq <- seq(
  min(df_monthly_subset$Temperature, na.rm = TRUE),
  max(df_monthly_subset$Temperature, na.rm = TRUE),
  length.out = 300
)

new_poly <- poly(temp_seq, degree = 3, coefs = poly_obj)

new_dat <- data.frame(
  log_Salinity = mean(df_monthly_subset$log_Salinity, na.rm = TRUE),
  log_DissAmmonia = mean(df_monthly_subset$log_DissAmmonia, na.rm = TRUE),
  log_DissOrthophos = mean(df_monthly_subset$log_DissOrthophos, na.rm = TRUE),
  log_Turbidity = mean(df_monthly_subset$log_Turbidity, na.rm = TRUE),
  lag_logChlorophyll = mean(df_monthly_subset$lag_logChlorophyll, na.rm = TRUE),
  Temp1 = new_poly[,1],
  Temp2 = new_poly[,2],
  Temp3 = new_poly[,3]
)

new_dat$partial <- predict(fit_chl, newdata = new_dat, re.form = NA)

new_dat$partial_centered <- new_dat$partial - mean(new_dat$partial)

ggplot(new_dat, aes(x = temp_seq, y = partial_centered)) +
  geom_line() +
  labs(
    x = 'Temperature',
    y = 'Partial Effect',
  )

Code

plot(gam_chl, select = 1)

They’re similar, but the GAM function is more flexible (it also more closely fits our original log_Chlorophyll ~ Temperature plot; this makes sense if Temperature is an influential predictor). We also note that our chi-square goodness of fit test is now significant, meaning this model (with a more correctly modeled Temperature) is a poorer fit, possibly due to missing variables.

This all being said, for practical purposes, the LMM gives us a similar overall picture to the GAM; both are valid if you correctly specify the model structure.

All Smooths

You could also make all variables smooth terms if you suspect other relationships may be non-linear (though for truly linear relationships it’s better not to).

Code

gam_chl <- gam(
  log_Chlorophyll ~
    s(log_Salinity, k = 6) +
    s(log_DissAmmonia, k = 6) +
    s(log_DissOrthophos, k = 6) +
    s(log_Turbidity, k = 6) +
    s(lag_logChlorophyll, k = 6) +
    s(Temperature, k = 6) +
    s(Region, bs = 're'),
  data = df_monthly_subset,
  method = 'REML'
)

gam_amm <- gam(
  log_DissAmmonia ~
    s(lag_logDissAmmonia, k = 6) +
    s(log_Turbidity, k = 6) +
    s(Temperature, k = 6) +
    s(Region, bs = 're'),
  data = df_monthly_subset,
  method = 'REML'
)

psem_gam <- psem(gam_chl, gam_amm, data = df_monthly_subset)
summary(psem_gam, standardize = 'scale')


  |                                                                            
  |                                                                      |   0%


  |                                                                            
  |==================                                                    |  25%


  |                                                                            
  |===================================                                   |  50%


  |                                                                            
  |====================================================                  |  75%


  |                                                                            
  |======================================================================| 100%


Structural Equation Model of psem_gam 

Call:
  log_Chlorophyll ~ s(log_Salinity, k = 6) + s(log_DissAmmonia, k = 6) + s(log_DissOrthophos, k = 6) + s(log_Turbidity, k = 6) + s(lag_logChlorophyll, k = 6) + s(Temperature, k = 6) + s(Region, bs = "re")
  log_DissAmmonia ~ s(lag_logDissAmmonia, k = 6) + s(log_Turbidity, k = 6) + s(Temperature, k = 6) + s(Region, bs = "re")

    AIC
 777.996

---
Tests of directed separation:

                                        Independ.Claim Test.Type     DF
        log_DissAmmonia ~ s(log_Salinity, k = 6) + ...     anova 2.0358
   log_DissAmmonia ~ s(log_DissOrthophos, k = 6) + ...     anova 1.0007
  log_DissAmmonia ~ s(lag_logChlorophyll, k = 6) + ...     anova 2.1390
  log_Chlorophyll ~ s(lag_logDissAmmonia, k = 6) + ...     anova 1.0001
  Crit.Value P.Value  
      3.4922  0.0285 *
      3.8124  0.0517  
      0.8399  0.4843  
      5.0868  0.0248 *

--
Global goodness-of-fit:

Chi-Squared = 14.835 with P-value = 0.033 and on 6.736 degrees of freedom
Fisher's C = 21.886 with P-value = 0.005 and on 8 degrees of freedom

---
Coefficients:

         Response             Predictor Estimate Std.Error     DF Crit.Value
  log_Chlorophyll       s(log_Salinity)        -         - 1.0004     0.0001
  log_Chlorophyll    s(log_DissAmmonia)        -         - 3.0864     3.6556
  log_Chlorophyll  s(log_DissOrthophos)        -         - 1.0001     0.7318
  log_Chlorophyll      s(log_Turbidity)        -         - 1.2119    22.7294
  log_Chlorophyll s(lag_logChlorophyll)        -         - 1.0005    70.6783
  log_Chlorophyll        s(Temperature)        -         - 4.9104     9.6466
  log_Chlorophyll             s(Region)        -         - 4.0000     8.2700
  log_DissAmmonia s(lag_logDissAmmonia)        -         - 4.7537    29.7694
  log_DissAmmonia      s(log_Turbidity)        -         - 3.2500     3.6661
  log_DissAmmonia        s(Temperature)        -         - 1.0031    31.4752
  log_DissAmmonia             s(Region)        -         - 4.0000     5.3363
  P.Value Std.Estimate    
   0.9960            -    
   0.0115            -   *
   0.3929            -    
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0000            - ***
   0.0125            -   *
   0.0000            - ***
   0.0001            - ***

  Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05

---
Individual R-squared:

         Response method R.squared
  log_Chlorophyll   none      0.53
  log_DissAmmonia   none      0.58

Our \(R^2\) improved, though our interpretations of significance didn’t change much, which makes sense. The biggest advantage is that, by examining the individual GAMs, we can visualize/interpret any underlying non-linear relationships in a way that LMMs can’t.