1 Overview

This document summarizes bacteria exceedance prediction models for the Mystic River Recreational Flagging project.

1.1 Methodology

A logistic regression model was fit to each of three locations (Mystic River @ Rt 16 Bridge, Lower Malden River @ Rt 16 Bridge, Shannon Beach @ Upper Mystic Lake). Robust models could not be fit to the other candidate locations (Blessing of the Bay Boathouse, Wedge Pond), either because good climatic and hydrologic predictor variables were not available or because the exceedance frequency was too low.

At each location, the data were split into training (75%) and testing (25%) subsets using stratified random sampling (i.e. each subset contains approximately the same percentage of exceedances). The training subsets were then fed into multiple classification models, including logistic regression (using all variables, and with stepwise feature selection based on AIC), elastic net regression, random forest, gradient boosting, and support vector machines. Using the caret package, each type of model was optimized over its tuning parameters. The variable importance functions of each model were then used to identify which predictors were most common among the different types of models. In general, the stepwise logistic regression model minimizing AIC performed very well and yielded the set of strongest predictors at each site. Various combinations of these predictors were then fit with a standard logistic regression model using 10-fold cross-validation repeated 3 times. The final model was based on the set of predictors that performed best.
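The stratified split described above can be sketched in a few lines. The following pure-Python illustration reproduces the per-class 75/25 sampling (the actual analysis used R's caret package; this is only a sketch):

```python
# Pure-Python sketch of a stratified 75/25 split (illustration only; the
# actual analysis used R's caret package).
import random

def stratified_split(labels, test_frac=0.25, seed=1):
    """Split indices so each subset keeps roughly the same class percentages."""
    rng = random.Random(seed)
    train, test = [], []
    for cls in set(labels):
        idx = [i for i, y in enumerate(labels) if y == cls]
        rng.shuffle(idx)
        n_test = round(len(idx) * test_frac)
        test.extend(idx[:n_test])
        train.extend(idx[n_test:])
    return sorted(train), sorted(test)

# e.g. the Mystic River site: 21 exceedances out of 183 samples
labels = ["yes"] * 21 + ["no"] * 162
train, test = stratified_split(labels)
print(len(train), len(test))  # 138 45 -- the same split sizes as Section 2.1
```

Because the split is drawn within each class, both subsets retain approximately the same exceedance percentage as the full dataset.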

1.2 Performance Metrics

Each model can be evaluated by various metrics. Descriptions and interpretations of each metric are as follows:

  • Null Deviance: Deviance of a logistic regression model with only an intercept term (i.e. no other predictors)
  • Model Deviance: Remaining deviance of the fitted logistic regression model with predictors
  • No Information Rate: The percent of non-exceedances in the dataset (i.e. the probability of any given sample being a non-exceedance regardless of other information)
  • Accuracy: The percent of samples correctly predicted ((# True Positives + # True Negatives) / Total #). Accuracy should be greater than the No Information Rate. For Shannon Beach, the accuracy is slightly lower than the No Information Rate; however, the other metrics suggest it is still a credible model. This metric can be misleading when there is large class imbalance (i.e. a much higher % of non-exceedances than exceedances).
  • Kappa: Overall measure of agreement that is more robust than Accuracy. Ranges from -1 (predicts the opposite of the true values) through 0 (no agreement) to +1 (predicts the true values exactly). Kappa values of roughly 0.3-0.5 or greater are generally considered acceptable.
  • ROC AUC: Area under the Receiver Operating Characteristic (ROC) curve. A value of 0.5 indicates a model no better than random guessing; values greater than 0.9 are considered “excellent”.
  • Sensitivity: True positive rate (# True Positive / (# True Positive + # False Negative)). In other words, how many of the true positive (exceedance) samples were correctly identified.
  • Specificity: True negative rate (# True Negative / (# True Negative + # False Positive)). How many of the true negative (non-exceedance) samples were correctly identified. Because of the class imbalance, this number tends to be high (it is easy to predict non-exceedance).
  • Positive Predictive Value (PPV): The percent of predicted positive samples (exceedances) that were in fact positive (# True Positive / (# True Positive + # False Positive)). Also known as the precision.
  • Negative Predictive Value (NPV): The percent of predicted negative samples (non-exceedances) that were in fact negative (# True Negative / (# True Negative + # False Negative)).
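As a concrete illustration, these metrics can be computed directly from the four cells of a confusion matrix. The example below uses the Mystic River training results (TP = 10, FN = 6, FP = 7, TN = 115) reported later in this document (Python sketch; the analysis itself was done in R):

```python
# Confusion-matrix metrics as defined above, using the Mystic River
# training results (TP=10, FN=6, FP=7, TN=115) as an example.
def confusion_metrics(tp, fn, fp, tn):
    total = tp + fn + fp + tn
    accuracy    = (tp + tn) / total
    sensitivity = tp / (tp + fn)          # true positive rate
    specificity = tn / (tn + fp)          # true negative rate
    ppv         = tp / (tp + fp)          # precision
    npv         = tn / (tn + fn)
    nir         = (tn + fp) / total       # always predict "non-exceedance"
    # Cohen's kappa: agreement beyond what is expected by chance
    p_yes = ((tp + fn) / total) * ((tp + fp) / total)
    p_no  = ((tn + fp) / total) * ((tn + fn) / total)
    p_e   = p_yes + p_no
    kappa = (accuracy - p_e) / (1 - p_e)
    return dict(accuracy=accuracy, kappa=kappa, sensitivity=sensitivity,
                specificity=specificity, ppv=ppv, npv=npv, nir=nir)

m = confusion_metrics(tp=10, fn=6, fp=7, tn=115)
print(round(m["accuracy"], 4), round(m["kappa"], 4))  # 0.9058 0.5526
```

These values match the caret `confusionMatrix` output in Section 2.5.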

Following the guidance of Francy et al. (2013), model acceptance was based on the following goals:

  • Accuracy > 0.80
  • Sensitivity > 0.5
  • Specificity > 0.85

The following table summarizes the performance of each model on the training datasets. All three models meet each of these criteria.

Model ID            Samples  % Exceed  Null Dev  Model Dev   NIR  Accuracy  Kappa  ROC AUC  Sens  Spec   PPV   NPV
MALDENLOWER_ECOLI       237     11.39    168.10      90.73  0.89      0.91   0.59     0.92  0.70  0.94  0.59  0.96
MYSTIC_ECOLI            138     11.59     99.02      66.26  0.88      0.91   0.55     0.89  0.62  0.94  0.59  0.95
SHANNON_ENT             232      4.31     82.45      50.77  0.96      0.94   0.41     0.92  0.60  0.95  0.35  0.98
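As a sanity check, the acceptance goals can be applied programmatically. The metric values below are copied from the training table above (Python is used here purely for illustration):

```python
# Check each model's training metrics against the Francy et al. (2013)
# acceptance goals; values copied from the training summary table.
GOALS = {"accuracy": 0.80, "sensitivity": 0.5, "specificity": 0.85}

models = {
    "MALDENLOWER_ECOLI": {"accuracy": 0.91, "sensitivity": 0.70, "specificity": 0.94},
    "MYSTIC_ECOLI":      {"accuracy": 0.91, "sensitivity": 0.62, "specificity": 0.94},
    "SHANNON_ENT":       {"accuracy": 0.94, "sensitivity": 0.60, "specificity": 0.95},
}

for name, metrics in models.items():
    ok = all(metrics[k] > v for k, v in GOALS.items())
    print(name, "PASS" if ok else "FAIL")  # all three print PASS
```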

This table shows the same metrics based on the testing subset for each model (it excludes the deviance metrics, which are not relevant for the testing subset). The Mystic and Malden models continue to meet the criteria. The Shannon Beach model, however, performs relatively poorly. The overall frequency of exceedances is much lower at this site than at the others: the testing dataset included only 3 exceedances (out of 77 samples total), so these relatively low performance metrics are to be expected.

Model ID            Samples  % Exceed   NIR  Accuracy  Kappa  ROC AUC  Sens  Spec   PPV   NPV
MALDENLOWER_ECOLI        77     10.39  0.90      0.92   0.62     0.93  0.75  0.94  0.60  0.97
MYSTIC_ECOLI             45     11.11  0.89      0.89   0.55     0.91  0.80  0.90  0.50  0.97
SHANNON_ENT              77      3.90  0.96      0.92   0.21     0.88  0.33  0.95  0.20  0.97

2 Mystic River at Rt 16 Bridge (E. coli)

2.1 Dataset

Number of exceedance (yes) and non-exceedance (no) samples by project and monitoring location.

##                   yes  no Total
## CSORWM - MWRA177   15  89   104
## RECFLAG - MYR0435   6  73    79
## Total              21 162   183

Dataset was split into training (75%) and testing (25%) subsets.

##          yes  no Total
## Training  16 122   138
## Testing    5  40    45
## Total     21 162   183

2.2 Prediction Model

The logistic regression model for the Mystic River is based on 3 predictors:

  • precip_sum_p24hr_lag0hr: total rainfall (inches) over 24 hours prior to sample timestamp
  • precip_sum_p24hr_lag24hr: total rainfall (inches) between 24 and 48 hours prior to sample timestamp
  • hours_since_050in_precip_event: number of hours since last 0.5 inch rainfall event

Note that precip_sum_p24hr_lag0hr + precip_sum_p24hr_lag24hr equals the 48-hour antecedent rainfall. However, splitting that total into two separate variables resulted in a better model. The two variables are also not strongly correlated.
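For illustration, the two lagged rainfall predictors can be computed from an hourly precipitation record as follows (the hourly series here is hypothetical):

```python
# Compute the two 24-hour rainfall windows named above from hourly data;
# the hourly series is a made-up illustration.
def lagged_24hr_sums(hourly_precip):
    """hourly_precip is hourly rainfall (inches), oldest first, ending at
    the sample timestamp."""
    precip_sum_p24hr_lag0hr = sum(hourly_precip[-24:])      # 0-24 hr before sample
    precip_sum_p24hr_lag24hr = sum(hourly_precip[-48:-24])  # 24-48 hr before sample
    return precip_sum_p24hr_lag0hr, precip_sum_p24hr_lag24hr

# 48 hourly values with a 0.6 in storm roughly 30 hours before the sample
hourly = [0.0] * 15 + [0.2, 0.3, 0.1] + [0.0] * 30
lag0, lag24 = lagged_24hr_sums(hourly)
print(lag0, round(lag24, 2))  # 0.0 0.6 -> their sum is the 48-hr antecedent rainfall
```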

All predictors were centered and scaled prior to fitting.
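The center-and-scale (z-score) transform can be sketched as follows; the input values are arbitrary illustrations (the actual preprocessing was done in R):

```python
# Center-and-scale (z-score) transform applied to each predictor before
# fitting; the input values here are arbitrary illustrations.
def center_scale(x):
    mean = sum(x) / len(x)
    sd = (sum((v - mean) ** 2 for v in x) / (len(x) - 1)) ** 0.5  # sample sd
    return [(v - mean) / sd for v in x]

z = center_scale([0.0, 0.1, 0.5, 1.2, 0.2])
print([round(v, 2) for v in z])  # mean ~0, sample sd ~1
```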

Diagnostic output of the logistic regression model:

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7030   0.2280   0.2758   0.3597   2.0812  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      2.5727     0.3804   6.762 1.36e-11 ***
## precip_sum_p24hr_lag0hr         -1.5384     0.3761  -4.090 4.31e-05 ***
## precip_sum_p24hr_lag24hr        -0.4526     0.2748  -1.647   0.0996 .  
## hours_since_050in_precip_event  -0.4421     0.3643  -1.214   0.2249    
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 99.018  on 137  degrees of freedom
## Residual deviance: 66.262  on 134  degrees of freedom
## AIC: 74.262
## 
## Number of Fisher Scoring iterations: 6

This model explains 33.1% of the total deviance (pseudo-R^2 of 0.33).
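This pseudo-R^2 follows directly from the deviance figures in the model output above:

```python
# McFadden pseudo-R^2 from the deviances reported in the glm output above
null_deviance = 99.018      # intercept-only model
residual_deviance = 66.262  # fitted model with the 3 predictors
pseudo_r2 = 1 - residual_deviance / null_deviance
print(round(pseudo_r2, 3))  # 0.331
```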

Relationships between E. coli concentration and value of each predictor.


2.3 ROC Curves

Data Subset   Area Under Curve (AUC)
Testing       0.910000
Training      0.892418

2.4 Cutoff Optimization

The logistic regression model predicts the probability of an exceedance. To classify that probability as an exceedance or not, we need to determine an optimal cutoff probability. By default, logistic regression classifiers use 0.5 as the cutoff. However, when the dataset has a large class imbalance (exceedances are far less frequent than non-exceedances), the optimal cutoff probability can be higher or lower than 0.5. One way to determine the optimal cutoff is to compute the accuracy (proportion of correct predictions), specificity (true negative rate), and sensitivity (true positive rate) over a range of cutoff thresholds. The following plots show these three metrics for both the training and testing subsets. The distance metric is the combined distance of the sensitivity and specificity from a perfect value of 1. The goal is to minimize the distance metric without sacrificing accuracy. The optimal cutoff for this model was determined to be about 0.2.
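The cutoff search described above can be sketched as follows; the predicted probabilities and labels below are a small hypothetical example, not actual model output:

```python
# Sweep candidate cutoffs and minimize the distance of (sensitivity,
# specificity) from the perfect point (1, 1); hypothetical example data.
import math

def best_cutoff(p_exceed, labels, cutoffs):
    best = None
    for c in cutoffs:
        pred = [p >= c for p in p_exceed]
        tp = sum(1 for pr, y in zip(pred, labels) if pr and y)
        fn = sum(1 for pr, y in zip(pred, labels) if not pr and y)
        fp = sum(1 for pr, y in zip(pred, labels) if pr and not y)
        tn = sum(1 for pr, y in zip(pred, labels) if not pr and not y)
        sens = tp / (tp + fn) if tp + fn else 0.0
        spec = tn / (tn + fp) if tn + fp else 0.0
        dist = math.hypot(1 - sens, 1 - spec)
        if best is None or dist < best[1]:
            best = (c, dist)
    return best

p = [0.05, 0.10, 0.15, 0.25, 0.40, 0.70, 0.08, 0.30]  # predicted P(exceedance)
y = [0, 0, 0, 1, 1, 1, 0, 0]                          # true exceedance flags
cutoff, dist = best_cutoff(p, y, [i / 100 for i in range(5, 95, 5)])
print(cutoff)  # 0.2 for this toy example
```

In practice the search is run on the training and testing subsets separately, and accuracy is inspected alongside the distance metric before settling on a cutoff.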


2.5 Confusion Matrix

For the training dataset, the following shows various metrics using the confusion matrix (i.e. tabulation of true positive, false negative, true negative, and false positive). The overall accuracy should be greater than the No Information Rate, which is simply the proportion of non-exceedance samples in the dataset. If the accuracy is less than the No Information Rate, then a naive model based solely on the proportion of exceedances and non-exceedances is a better predictor than the model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes  10   7
##        no    6 115
##                                           
##                Accuracy : 0.9058          
##                  95% CI : (0.8443, 0.9489)
##     No Information Rate : 0.8841          
##     P-Value [Acc > NIR] : 0.2594          
##                                           
##                   Kappa : 0.5526          
##  Mcnemar's Test P-Value : 1.0000          
##                                           
##             Sensitivity : 0.62500         
##             Specificity : 0.94262         
##          Pos Pred Value : 0.58824         
##          Neg Pred Value : 0.95041         
##              Prevalence : 0.11594         
##          Detection Rate : 0.07246         
##    Detection Prevalence : 0.12319         
##       Balanced Accuracy : 0.78381         
##                                           
##        'Positive' Class : yes             
## 

For the test dataset, the confusion matrix and associated summary metrics are:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes no
##        yes   4  4
##        no    1 36
##                                           
##                Accuracy : 0.8889          
##                  95% CI : (0.7595, 0.9629)
##     No Information Rate : 0.8889          
##     P-Value [Acc > NIR] : 0.6162          
##                                           
##                   Kappa : 0.5545          
##  Mcnemar's Test P-Value : 0.3711          
##                                           
##             Sensitivity : 0.80000         
##             Specificity : 0.90000         
##          Pos Pred Value : 0.50000         
##          Neg Pred Value : 0.97297         
##              Prevalence : 0.11111         
##          Detection Rate : 0.08889         
##    Detection Prevalence : 0.17778         
##       Balanced Accuracy : 0.85000         
##                                           
##        'Positive' Class : yes             
## 

2.6 Cross-Validation

During model development, the logistic regression model was fitted to 30 random resamples of the training subset using 10-fold cross-validation repeated 3 times. For each resample, the coefficients were fitted and then ROC statistics computed using a cutoff probability of 0.2. This boxplot shows the distribution of each statistic across the 30 resamples (red circle = mean). This figure gives some indication as to how sensitive the model is to the specific dataset. The final model was fitted to all of the data in the training subset.
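The resampling scheme can be sketched as follows (a pure-Python illustration of repeated k-fold cross-validation; the per-resample model fitting and ROC statistics are omitted):

```python
# Pure-Python sketch of 10-fold cross-validation repeated 3 times
# (30 resamples), as used during model development.
import random

def repeated_kfold(n_samples, k=10, repeats=3, seed=1):
    """Yield (train_idx, held_out_idx) pairs for k * repeats resamples."""
    rng = random.Random(seed)
    for _ in range(repeats):
        idx = list(range(n_samples))
        rng.shuffle(idx)
        folds = [idx[i::k] for i in range(k)]
        for held_out in folds:
            held = set(held_out)
            yield [i for i in idx if i not in held], held_out

resamples = list(repeated_kfold(138))  # 138 training samples at the Mystic site
print(len(resamples))  # 30
```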

3 Lower Malden River at Rt 16 Bridge (E. coli)

3.1 Dataset

Number of exceedance (yes) and non-exceedance (no) samples by project and monitoring location.

##                   yes  no Total
## CSORWM - MWRA176   26 210   236
## RECFLAG - MAR0065   9  69    78
## Total              35 279   314

Dataset was split into training (75%) and testing (25%) subsets.

##          yes  no Total
## Training  27 210   237
## Testing    8  69    77
## Total     35 279   314

3.2 Prediction Model

The logistic regression model for the Lower Malden River is based on 5 predictors:

  • precip_sum_p24hr_lag0hr: total rainfall (inches) over 24 hours prior to sample timestamp
  • precip_sum_p24hr_lag24hr: total rainfall (inches) between 24 and 48 hours prior to sample timestamp
  • hours_since_100in_precip_event: number of hours since last 1.0 inch rainfall event
  • pressure_change_p72hr: change in barometric pressure (inHg) from 72 hours prior to sample timestamp (when negative, indicates storms moving in)
  • temp_change_p72hr: change in air temperature (degF) from 72 hours prior to sample timestamp

Note that precip_sum_p24hr_lag0hr + precip_sum_p24hr_lag24hr equals the 48-hour antecedent rainfall. However, splitting that total into two separate variables resulted in a better model. The two variables are also not strongly correlated.

All predictors were centered and scaled prior to fitting.

Diagnostic output of the logistic regression model:

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -2.7908   0.1088   0.2001   0.3294   1.5014  
## 
## Coefficients:
##                                Estimate Std. Error z value Pr(>|z|)    
## (Intercept)                      3.1521     0.4176   7.549 4.39e-14 ***
## precip_sum_p24hr_lag0hr         -1.2656     0.2905  -4.357 1.32e-05 ***
## precip_sum_p24hr_lag24hr        -0.7792     0.2425  -3.214  0.00131 ** 
## hours_since_100in_precip_event  -0.6612     0.2669  -2.477  0.01324 *  
## pressure_change_p72hr            1.1378     0.3525   3.228  0.00125 ** 
## temp_change_p72hr                0.7718     0.3494   2.209  0.02717 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 168.100  on 236  degrees of freedom
## Residual deviance:  90.726  on 231  degrees of freedom
## AIC: 102.73
## 
## Number of Fisher Scoring iterations: 6

This model explains 46% of the total deviance (pseudo-R^2 of 0.46).

Relationships between E. coli concentration and value of each predictor.


3.3 ROC Curves

Data Subset   Area Under Curve (AUC)
Testing       0.9311594
Training      0.9179894

3.4 Cutoff Optimization

The logistic regression model predicts the probability of an exceedance. To classify that probability as an exceedance or not, we need to determine an optimal cutoff probability. By default, logistic regression classifiers use 0.5 as the cutoff. However, when the dataset has a large class imbalance (exceedances are far less frequent than non-exceedances), the optimal cutoff probability can be higher or lower than 0.5. One way to determine the optimal cutoff is to compute the accuracy (proportion of correct predictions), specificity (true negative rate), and sensitivity (true positive rate) over a range of cutoff thresholds. The following plots show these three metrics for both the training and testing subsets. The distance metric is the combined distance of the sensitivity and specificity from a perfect value of 1. The goal is to minimize the distance metric without sacrificing accuracy. The optimal cutoff for this model was determined to be about 0.2.


3.5 Confusion Matrix

For the training dataset, the following shows various metrics using the confusion matrix (i.e. tabulation of true positive, false negative, true negative, and false positive). The overall accuracy should be greater than the No Information Rate, which is simply the proportion of non-exceedance samples in the dataset. If the accuracy is less than the No Information Rate, then a naive model based solely on the proportion of exceedances and non-exceedances is a better predictor than the model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes  19  13
##        no    8 197
##                                           
##                Accuracy : 0.9114          
##                  95% CI : (0.8677, 0.9443)
##     No Information Rate : 0.8861          
##     P-Value [Acc > NIR] : 0.1285          
##                                           
##                   Kappa : 0.5939          
##  Mcnemar's Test P-Value : 0.3827          
##                                           
##             Sensitivity : 0.70370         
##             Specificity : 0.93810         
##          Pos Pred Value : 0.59375         
##          Neg Pred Value : 0.96098         
##              Prevalence : 0.11392         
##          Detection Rate : 0.08017         
##    Detection Prevalence : 0.13502         
##       Balanced Accuracy : 0.82090         
##                                           
##        'Positive' Class : yes             
## 

For the test dataset, the confusion matrix and associated summary metrics are:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes no
##        yes   6  4
##        no    2 65
##                                           
##                Accuracy : 0.9221          
##                  95% CI : (0.8381, 0.9709)
##     No Information Rate : 0.8961          
##     P-Value [Acc > NIR] : 0.2999          
##                                           
##                   Kappa : 0.6232          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.75000         
##             Specificity : 0.94203         
##          Pos Pred Value : 0.60000         
##          Neg Pred Value : 0.97015         
##              Prevalence : 0.10390         
##          Detection Rate : 0.07792         
##    Detection Prevalence : 0.12987         
##       Balanced Accuracy : 0.84601         
##                                           
##        'Positive' Class : yes             
## 

3.6 Cross-Validation

During model development, the logistic regression model was fitted to 30 random resamples of the training subset using 10-fold cross-validation repeated 3 times. For each resample, the coefficients were fitted and then ROC statistics computed using a cutoff probability of 0.2. This boxplot shows the distribution of each statistic across the 30 resamples (red circle = mean). This figure gives some indication as to how sensitive the model is to the specific dataset. The final model was fitted to all of the data in the training subset.

4 Shannon Beach at Upper Mystic Lake (Ent)

4.1 Dataset

Number of exceedance (yes) and non-exceedance (no) samples by project and monitoring location.

##                   yes  no Total
## DCRBCH - UPLSHBM   10 224   234
## RECFLAG - UPLSHBC   3  72    75
## Total              13 296   309

Dataset was split into training (75%) and testing (25%) subsets.

##          yes  no Total
## Training  10 222   232
## Testing    3  74    77
## Total     13 296   309

4.2 Prediction Model

The logistic regression model for Shannon Beach is based on 1 predictor:

  • aberjona_logflow_p1d: daily mean streamflow (log-10 transformed) at Aberjona River gauge (Station ID: 01102500) on the day before sample was collected

All predictors were centered and scaled prior to fitting.

Diagnostic output of the logistic regression model:

## 
## Call:
## NULL
## 
## Deviance Residuals: 
##      Min        1Q    Median        3Q       Max  
## -2.89322   0.05378   0.10094   0.20426   1.39868  
## 
## Coefficients:
##                      Estimate Std. Error z value Pr(>|z|)    
## (Intercept)            5.0049     0.8355    5.99  2.1e-09 ***
## aberjona_logflow_p1d  -2.2503     0.5161   -4.36  1.3e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 82.446  on 231  degrees of freedom
## Residual deviance: 50.768  on 230  degrees of freedom
## AIC: 54.768
## 
## Number of Fisher Scoring iterations: 8

This model explains 38.4% of the total deviance (pseudo-R^2 of 0.38).

Relationships between Enterococcus concentration and value of each predictor.

4.3 ROC Curves

Data Subset   Area Under Curve (AUC)
Testing       0.8828829
Training      0.9247748

4.4 Cutoff Optimization

The logistic regression model predicts the probability of an exceedance. To classify that probability as an exceedance or not, we need to determine an optimal cutoff probability. By default, logistic regression classifiers use 0.5 as the cutoff. However, when the dataset has a large class imbalance (exceedances are far less frequent than non-exceedances), the optimal cutoff probability can be higher or lower than 0.5. One way to determine the optimal cutoff is to compute the accuracy (proportion of correct predictions), specificity (true negative rate), and sensitivity (true positive rate) over a range of cutoff thresholds. The following plots show these three metrics for both the training and testing subsets. The distance metric is the combined distance of the sensitivity and specificity from a perfect value of 1. The goal is to minimize the distance metric without sacrificing accuracy. The optimal cutoff for this model was determined to be about 0.2.


4.5 Confusion Matrix

For the training dataset, the following shows various metrics using the confusion matrix (i.e. tabulation of true positive, false negative, true negative, and false positive). The overall accuracy should be greater than the No Information Rate, which is simply the proportion of non-exceedance samples in the dataset. If the accuracy is less than the No Information Rate, then a naive model based solely on the proportion of exceedances and non-exceedances is a better predictor than the model.

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes  no
##        yes   6  11
##        no    4 211
##                                           
##                Accuracy : 0.9353          
##                  95% CI : (0.8956, 0.9634)
##     No Information Rate : 0.9569          
##     P-Value [Acc > NIR] : 0.9550          
##                                           
##                   Kappa : 0.4126          
##  Mcnemar's Test P-Value : 0.1213          
##                                           
##             Sensitivity : 0.60000         
##             Specificity : 0.95045         
##          Pos Pred Value : 0.35294         
##          Neg Pred Value : 0.98140         
##              Prevalence : 0.04310         
##          Detection Rate : 0.02586         
##    Detection Prevalence : 0.07328         
##       Balanced Accuracy : 0.77523         
##                                           
##        'Positive' Class : yes             
## 

For the test dataset, the confusion matrix and associated summary metrics are:

## Confusion Matrix and Statistics
## 
##           Reference
## Prediction yes no
##        yes   1  4
##        no    2 70
##                                           
##                Accuracy : 0.9221          
##                  95% CI : (0.8381, 0.9709)
##     No Information Rate : 0.961           
##     P-Value [Acc > NIR] : 0.9695          
##                                           
##                   Kappa : 0.2116          
##  Mcnemar's Test P-Value : 0.6831          
##                                           
##             Sensitivity : 0.33333         
##             Specificity : 0.94595         
##          Pos Pred Value : 0.20000         
##          Neg Pred Value : 0.97222         
##              Prevalence : 0.03896         
##          Detection Rate : 0.01299         
##    Detection Prevalence : 0.06494         
##       Balanced Accuracy : 0.63964         
##                                           
##        'Positive' Class : yes             
## 

4.6 Cross-Validation

During model development, the logistic regression model was fitted to 30 random resamples of the training subset using 10-fold cross-validation repeated 3 times. For each resample, the coefficients were fitted and then ROC statistics computed using a cutoff probability of 0.2. This boxplot shows the distribution of each statistic across the 30 resamples (red circle = mean). This figure gives some indication as to how sensitive the model is to the specific dataset. The final model was fitted to all of the data in the training subset.

The sensitivity of this model is relatively low. However, this is to be expected given that only 4.21% of all samples were exceedances.