1 Background

1.1 Objective

The objective of both scripts is to semi-automatically obtain the species/genus of trees (>15 m height) from remote sensing data. The classes are Norway spruce (Picea abies, PA), European silver fir (Abies alba, AA), Scots pine (Pinus sylvestris, PS), European larch (Larix decidua, LD), Douglas fir (Pseudotsuga menziesii, PM), unspecified deciduous broadleaved species (DB) and snags (WD).

This Script 1 sets up the model parameters of a Random Forest classification model using 1132 treetop areas identified through visual interpretation of stereoscopic aerial photos; the model is then applied to the remote sensing data in Script 2. A feature selection approach is applied to reduce the initial 129 features to 11 features without significantly reducing classification accuracy.

1.1.1 Original input data and derived datasets

Original remote sensing data used in this study are: 1) aerial stereoscopic photos (RGBI, 10 cm, acquired July 2014, leaf-on)
2) aerial stereoscopic photos (RGBI, 10 cm, acquired April 2015, leaf-off)
3) airborne laser scanning (ALS) data (25 beams/m², acquired April/May 2015, leaf-off)

From these original data the following datasets were derived as input data for this classification process:
1) TDOP: The aerial stereoscopic photos were converted into true digital orthophotos (TDOP).
2) Ratio images: Each of the spectral bands of the 2015 TDOP was normalized by the sum of all four spectral bands.
3) NDVI: For both datasets, TDOP_2014 and TDOP_2015, the Normalized Difference Vegetation Index (NDVI) was calculated from the red and near infrared spectral bands (items 2 and 3 are illustrated in the sketch after this list).
4) Derivative NIR: For the near infrared band of the 2015 dataset the first derivative was calculated.
5) Tree crown outline polygons: By selecting exclusively the first targets of each ALS beam, a canopy height model (CHM) was interpolated and tree crown outline polygons were calculated. Additional features including treetop xyz-coordinates and tree crown area are provided.
6) Altitude-normalized points: By selecting exclusively the last targets of each ALS beam, a digital terrain model (DTM) was interpolated in order to altitude-normalize the original ALS targets, so that their z-values reflect height above ground.
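Purely as an illustration of items 2) and 3), the following minimal sketch shows how the ratio images and the NDVI could be derived from a four-band TDOP raster with the raster package; the file name, band order and object names are assumptions and not part of the actual preprocessing chain.

# minimal sketch, assuming a 4-band TDOP file with band order Red, Green, Blue, NIR
library(raster)
tdop_2015 <- brick("TDOP_2015.tif")               # hypothetical file name
names(tdop_2015) <- c("Red", "Green", "Blue", "NIR")

# ratio images: each band normalized by the sum of all four spectral bands
band_sum   <- sum(tdop_2015)                      # cell-wise sum over the four bands
ratio_2015 <- tdop_2015 / band_sum                # one ratio layer per spectral band

# NDVI from the red and near infrared band
ndvi_2015 <- (tdop_2015[["NIR"]] - tdop_2015[["Red"]]) /
             (tdop_2015[["NIR"]] + tdop_2015[["Red"]])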

1.1.2 Working steps

  1. Calculation of model input features by extraction of raster values and calculation of ALS target metrics for the 1132 reference treetop buffers.
  2. Identification of best method for feature selection.

2 General settings

2.1 Define and set control switches

Switches allow skipping time-consuming calculations during test runs; the required data are then read from RDS objects instead.

R.version.string
## [1] "R version 4.1.1 (2021-08-10)"
sw_test_run <- T              # TRUE: run test dataset 
                                # (use different directories for input/output)
sw_save_results_as_rds <- T   # TRUE: save results as RDS-file (useful to
                                # avoid repeating time consuming processes)
sw_find_field_order <- F # TRUE: find field order in the feature table that 
                                # minimizes OOB error in Random Forest 
                                # (which is time consuming)
sw_find_ntree <- F # TRUE: find suitable value for Random Forest parameter ntree 
                                # (which is time consuming)
sw_find_mtry  <- F # TRUE: find suitable value for Random Forest parameter mtry
                                # (which is time consuming)
sw_markup_plots <- T          # TRUE: plots will be drawn in markup 
                                # (which is time consuming)
options(stringsAsFactors = F) # FALSE: strings are not factors

2.2 Include R-packages

# general purpose function by Anish Sugathan to install not-yet installed packages and load them 
usePackage <- function(p) {
  # load a package if installed, else load after installation.
  # Args:  p: package name in quotes
  if (!is.element(p, installed.packages()[,1])) install.packages(p, dep = TRUE)
  require(p, character.only = TRUE)
}

usePackage('raster')        # read and manipulate raster data
usePackage('rgdal')         # read and manipulate vector data (GDAL)
usePackage('sf')            # read and manipulate vector data (simple features)
usePackage('dplyr')         # efficiently manipulating datasets
usePackage('ggplot2')       # advanced visualization, ggplot (with two 'g's)
usePackage('rasterVis')     # visualization of raster data, gplot (with one 'g') 
usePackage('gridExtra')     # arrange multiple grid-based plots on a page
usePackage('randomForest')  # random forest model functions
usePackage('caret')         # optimize and validate models
usePackage('RStoolbox')     # remote sensing data processing  
usePackage('foreign')       # read dbf-files
usePackage('rgeos')         # interface to Geometry Engine - Open Source 
usePackage('magrittr')      # provides the use of pipes %>%
usePackage('readr')         # parsing a flat file into a tibble
usePackage('psych')         # calculate kappa and its variance

2.3 Set data paths for input and output

if(sw_test_run == TRUE) {
  path_input_data  <- "data_markup/data_input/"
  path_output_data <- "data_markup/data_output/"
  
} else {
  path_input_data  <- "input_data/"
  path_output_data <- "output_data/"
}

2.4 Import the reference data

# ground reference points (trees)
sf_ref_points <- st_read(paste0(path_input_data, "reference_data/ref_points.geojson"))
st_crs(sf_ref_points)

# polygon of 1000m buffer of convex hull around National Park boundary
sf_nlp_1000m_boundary_poly <- st_read(paste0(path_input_data,
                    "project_boundary/NLP_ConvexHull_AllBuffer_1000m.geojson"))

# polygon of National Park boundary
sf_nlp_boundary_poly <- st_read(paste0(path_input_data,
                    "project_boundary/NLP_boundary.geojson"))

ggplot() + 
  geom_sf(data = st_geometry(sf_nlp_1000m_boundary_poly), 
          aes(colour = "Convex hull 1km buffer")) +
  geom_sf(data = st_geometry(sf_nlp_boundary_poly), 
          aes(colour = "National Park boundary")) +
  geom_sf(data = sf_ref_points)
Reference trees with National Park boundary and its 1000m buffer convex hull outline

3 Calculation of model input features

This is the list of the 129 input features that have been calculated for the 1132 treetop buffer areas. Both the raster value extraction and the lidar metrics calculation are described in detail in the markup of Script 2, hence these steps are skipped here to avoid duplication. Instead, the result of those time-consuming processes is imported from an RDS file.

df_treetop_buffer_features <- 
     readRDS(paste0(path_input_data, "reference_data/treetop_buffer_features.rds"))

paste(names(df_treetop_buffer_features), collapse="  ")
## [1] "ID  EASTTOP  NORTHTOP  HEIGHTTOP  HEIGHTDTM  AREACROWN  RAD1ELLIP  RAD2ELLIP  DIRELLIP  Class  Feature_ID  TDOP_Red_2015.SD  TDOP_Green_2015.SD  TDOP_Blue_2015.SD  TDOP_NIR_2015.SD  TDOP_Red_2015.Mean  TDOP_Green_2015.Mean  TDOP_Blue_2015.Mean  TDOP_NIR_2015.Mean  TDOP_Red_2015.Min  TDOP_Green_2015.Min  TDOP_Blue_2015.Min  TDOP_NIR_2015.Min  TDOP_Red_2015.Max  TDOP_Green_2015.Max  TDOP_Blue_2015.Max  TDOP_NIR_2015.Max  NDVI_2014.SD  NDVI_2015.SD  NDVI_2014.Mean  NDVI_2015.Mean  NDVI_2014.Min  NDVI_2015.Min  NDVI_2014.Max  NDVI_2015.Max  NDVI_Diff_2014_2015.Mean  NDVI_nDiff_2014_2015.Mean  Ratio_2015_Red.SD  Ratio_2015_Green.SD  Ratio_2015_Blue.SD  Ratio_2015_NIR.SD  Ratio_2015_Red.Mean  Ratio_2015_Green.Mean  Ratio_2015_Blue.Mean  Ratio_2015_NIR.Mean  Ratio_2015_Red.Min  Ratio_2015_Green.Min  Ratio_2015_Blue.Min  Ratio_2015_NIR.Min  Ratio_2015_Red.Max  Ratio_2015_Green.Max  Ratio_2015_Blue.Max  Ratio_2015_NIR.Max  PCA_1_2015.SD  PCA_2_2015.SD  PCA_3_2015.SD  PCA_1_2015.Mean  PCA_2_2015.Mean  PCA_3_2015.Mean  PCA_1_2015.Min  PCA_2_2015.Min  PCA_3_2015.Min  PCA_1_2015.Max  PCA_2_2015.Max  PCA_3_2015.Max  CHM_2015.SD  CHM_2015.Mean  CHM_2015.Min  CHM_2015.Max  DSM_2015.SD  DSM_2015.Mean  DSM_2015.Min  DSM_2015.Max  DSM_Roughness.SD  DSM_Roughness.Mean  DSM_Roughness.Min  DSM_Roughness.Max  CHM_Roughness.SD  CHM_Roughness.Mean  CHM_Roughness.Min  CHM_Roughness.Max  DSM_Slope_1.SD  DSM_Slope_1.Mean  DSM_Slope_1.Min  DSM_Slope_1.Max  DSM_Slope_2.SD  DSM_Slope_2.Mean  DSM_Slope_2.Min  DSM_Slope_2.Max  CHM_Neigung.SD  CHM_Neigung.Mean  CHM_Neigung.Min  CHM_Neigung.Max  DSM_Curvature.SD  DSM_Curvature.Mean  DSM_Curvature.Min  DSM_Curvature.Max  CHM_Curvature.SD  CHM_Curvature.Mean  CHM_Curvature.Min  CHM_Curvature.Max  Deri_1_NIR_2015.SD  Deri_1_NIR_2015.Mean  Deri_1_NIR_2015.Min  Deri_1_NIR_2015.Max  EVI_2015.Mean  Mean_Intensity_all  Mean_TargetNum_all  Mean_NumTargets_all  Sum_Targets_all  Mean_Intensity_top150cm  Mean_TargetNum_top150cm  Mean_NumTargets_top150cm  Sum_Targets_top150cm  Ratio_Targets_top150cm_to_all  Mean_Intensity_1st_Targets_top150cm  Mean_NumTargets_1st_Targets_top150cm  Sum_Targets_1st_Targets_top150cm  Sum_1st_Targets_all  Ratio_Targets_1st_Targets_top150cm_to_1st_Targets_all  Ratio_Targets_1st_Targets_top150cm_to_all_top150cm  Mean_Intensity_no_1st_Targets_top300cm  Mean_TargetNum_no_1st_Targets  Mean_NumTargets_no_1st_Targets_top300cm  Sum_Targets_no_1st_Targets_top300cm  Sum_Targets_no_1st_Targets_all  Sum_Targets_top300cm  Ratio_Targets_no_1st_Targets_top300cm_to_no_1st_Targets_all  Ratio_Targets_no_1st_Targets_top300cm_to_all_top300cm"

4 Feature selection

The preliminary Random Forest model is run with all 129 input features listed above to create the reference model. As classifying the estimated 4 million trees in Script 2 using all features would require excessive computation, the number of features needs to be reduced without compromising model accuracy.

Two feature selection methods are compared:
1) the automatic Recursive Feature Elimination (RFE), and 2) a manual selection of features that score high variable importance values.

For each feature selection method a Random Forest model is generated and evaluated. Finally, the best model is compared to the reference model.

4.1 Calculate reference Random Forest model including all features

4.1.1 Function to calculate Confusion Matrix, Accuracies, Kappa, Z-Value

RFevaluate <- function(model, ref) {
  
  # calculate confusion matrix
  confmatrix <- confusionMatrix(model$predicted,
                                reference = ref)
  
    v_predicted <- levels(model$predicted)
  
  # Producer's and User's Accuracy 
  list_acc_user <- list()
  list_acc_prod <- list()

  for (i in 1:length(v_predicted)) {
    list_acc_user[i] <- confmatrix$table[i,i] / sum(confmatrix$table[i,])
    list_acc_prod[i] <- confmatrix$table[i,i] / sum(confmatrix$table[,i])
  }

  df_accuracies <- data.frame(v_predicted, 
                                 unlist(list_acc_prod), 
                                 unlist(list_acc_user)) %>%
                     set_colnames(c("Predicted", "Producer's Accuracy", 
                                    "User's Accuracy"))
  
  # Cohen's Kappa: compares an Observed Accuracy with an Expected Accuracy
  # values 0.81–1.00 are considered almost perfect agreement.
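  #   kappa = (observed accuracy - expected accuracy) / (1 - expected accuracy)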
  co_kappa <- cohen.kappa(confmatrix$table)
  
  # Z-value: significance of difference to random classification
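  # (rule of thumb: |z| > 1.96 means the result differs significantly from random at the 5% level)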
  z_value <- co_kappa$kappa / sqrt(co_kappa$var.kappa)
  
  df_v <- data.frame(metric=character(),value=numeric())
  df_v[1,] <- c("Overall Accuracy", round(as.numeric(confmatrix$overall[1]),6))
  df_v[2,] <- c("Balanced Accuracy",round(mean(confmatrix$byClass[,11]),6))
  df_v[3,] <- c("Cohen's Kappa", round(co_kappa$kappa, 6))
  df_v[4,] <- c("var Kappa", format(co_kappa$var.kappa, scientific=F))
  df_v[5,] <- c("Z-Value", round(z_value, 6))
  
 list_output <- list(ConfusionMatrix = as.matrix(confmatrix), 
                  Accuracies = df_accuracies, 
                  Metrics = df_v)
 return(list_output)
 
} # end function RFevaluate 

4.1.2 General settings

# Species classes extracted from the reference data  
v_species_classes <- sort(unique(df_treetop_buffer_features$Class))

print(paste0("Species classes: ",v_species_classes[1]," (Abies alba), ",
             v_species_classes[2]," (Deciduous broadleaved), ",
             v_species_classes[3]," (Larix decidua), ", 
             v_species_classes[4]," (Picea abies), ", 
             v_species_classes[5]," (Pseudotsuga menziesii), ", 
             v_species_classes[6]," (Pinus sylvestris), ", 
             v_species_classes[7]," (snags)"))
## [1] "Species classes: AA (Abies alba), DB (Deciduous broadleaved), LD (Larix decidua), PA (Picea abies), PM (Pseudotsuga menziesii), PS (Pinus sylvestris), WD (snags)"
# Fixing a seed so that random objects can be reproduced
seed <- 1
set.seed(seed)

4.1.3 Random Forest model parameter tuning

4.1.3.1 Establish error rates for variation in field order inside the feature table

The OOB error estimate changes with the order of fields within the feature table.
Repeated model runs with randomly shuffled field order identify the order that results in the minimum error value.

if (sw_find_field_order == FALSE) {
  # change order of the fields in the feature table
  df_treetop_buffer_features <- 
             df_treetop_buffer_features[,
                readRDS(paste0(path_output_data,"v_features_field_order.rds"))]
} else {
   
   num_runs      <-  25                   # number of loop runs
   v_oob         <- c()                   # save oob  of each loop run
   m_field_order <- matrix(0,             # save field names of each loop run
                           ncol = ncol(df_treetop_buffer_features), 
                           nrow = num_runs)   
   
   for (k in 1:num_runs)  {
      # randomize the dataframe's column order
      m_field_order[k,] <- sample(names(df_treetop_buffer_features))
      df_treetop_buffer_features <- 
                          df_treetop_buffer_features[,m_field_order[k,]]
      
      set.seed(seed) 
      RF_ref_model <- 
        randomForest(Class ~ . -ID -Feature_ID,
                     data = df_treetop_buffer_features, 
                     importance = TRUE, 
                     proximity = TRUE, 
                     oob.prox = TRUE 
      )
    
      v_oob[k] <- RF_ref_model$err.rate[500,1] # save the OOB error value 
   } # for
   
   # demonstrate the variability of results caused by sequence of fields
   boxplot(v_oob, notch=F, names=names(v_oob),
        ylab = "OOB error rate",
        ylim = c(0.05,0.07), 
        main = "OOB error rate variability caused by sequence of fields"
        )

   if (sw_save_results_as_rds== TRUE) {
     # save the field order of the feature table that yields the minimum OOB error
     saveRDS(m_field_order[which.min(v_oob),],
                  file = paste0(path_output_data,"v_features_field_order.rds"))
   }
   # ... and apply this order to the feature table
   df_treetop_buffer_features <- 
                  df_treetop_buffer_features[,m_field_order[which.min(v_oob),]]

} # else

set.seed(seed)
RF_ref_model <- randomForest(Class ~ . , 
                                   data = df_treetop_buffer_features,
                                   importance=TRUE, 
                                   proximity = TRUE,
                                   oob.prox = TRUE
   )

eval_RF_ref_model <- RFevaluate(model=RF_ref_model, 
                                ref=df_treetop_buffer_features$Class)
eval_RF_ref_model
## $ConfusionMatrix
##     AA  DB  LD  PA PM  PS  WD
## AA 245   0   0   2  8   1   0
## DB   0 155   3   0  0   0   0
## LD   0   4 124   0  0   0   1
## PA   1   0   0 259 18  14   1
## PM   0   0   0   3 13   0   0
## PS   3   2   0   8  2 117   0
## WD   0   2   1   1  0   0 144
## 
## $Accuracies
##   Predicted Producer's Accuracy User's Accuracy
## 1        AA           0.9839357       0.9570312
## 2        DB           0.9509202       0.9810127
## 3        LD           0.9687500       0.9612403
## 4        PA           0.9487179       0.8839590
## 5        PM           0.3170732       0.8125000
## 6        PS           0.8863636       0.8863636
## 7        WD           0.9863014       0.9729730
## 
## $Metrics
##              metric        value
## 1  Overall Accuracy     0.933746
## 2 Balanced Accuracy     0.925724
## 3     Cohen's Kappa     0.919582
## 4         var Kappa 0.0000799014
## 5           Z-Value   102.875835

4.1.3.2 ntree: the number of decision trees to be generated

The default value for ntree is 500. This procedure checks whether a different value results in a significantly lower OOB error.

# as calculation of error rates is time consuming, 
# during test runs read the results from an RDS file 
if (sw_find_ntree == FALSE) {
  df_error_rates <- 
     readRDS(paste0(path_output_data, "df_error_rates.rds"))
} else {

  # identify the ntree (number of decision trees) 
  # that minimises the OOB estimate of error rate
  v_ntree_options <- c(250,500,750,1000,1500)
  # number of tests runs
  ntests <- 10
  # create empty dataframe to take up the results of the for-loop runs
  df_error_rates <- as.data.frame(matrix(nrow = ntests, 
                                         ncol=length(v_ntree_options)))
  
  for (i in 1:length(v_ntree_options)){ 
    names(df_error_rates)[i] <- 
                   paste(as.character(v_ntree_options[i]),"trees",sep=" ")
    set.seed(seed) # render results reproducible
    for (j in 1:ntests){ 
      
      rfmodel <- randomForest(Class ~ .-ID -Feature_ID, 
                          data = df_treetop_buffer_features,
                          ntree = v_ntree_options[i], 
                          importance = TRUE)
      
      # save error rates for each run
      df_error_rates[j,i] <- rfmodel$err.rate[v_ntree_options[i]]
    } 
  } 
  if (sw_save_results_as_rds== TRUE) {
    saveRDS(df_error_rates, 
                 file = paste0(path_output_data, "df_error_rates.rds"))
  } # if
} # else

# plot OOB (out-of-bag) error rates for different ntree values
boxplot(df_error_rates, notch=F,names=names(df_error_rates),
        ylab = "OOB error rate",
        ylim = c(0.05,0.07), 
        main = "OOB (out of box) error rates for different ntree values"
        )
grid() # adds a rectangular grid to the plot

# As the error rates for all ntree values are very similar
# we choose the standard value 500
ntree <-  500

4.1.3.3 mtry: number of random selected features used in the construction of each tree

The default value for mtry is sqrt(number of features). This procedure tries to identify a value that provides a higher accuracy.

# as calculation of mtry is time consuming, 
# during test runs read the results from an RDS file 
if (sw_find_mtry == FALSE) {
  v_accuracy_mtry <- 
     readRDS(paste0(path_output_data, "accuracy_mtry.rds"))
} else {
  
  v_accuracy_mtry = c() # vector to be filled by loop
  
  for (i in 1:50) {                 # values for mtry
    set.seed(seed) 
    rf <- randomForest(Class ~ .-ID -Feature_ID, 
                       data = df_treetop_buffer_features,
                       ntree = ntree, 
                       mtry = i, 
                       importance = TRUE)
    
    v_accuracy_mtry[i] <- 
      mean(rf$predicted == df_treetop_buffer_features$Class)
  } # for
  
  if (sw_save_results_as_rds== TRUE) {
    saveRDS(v_accuracy_mtry, 
           file = paste0(path_output_data, "accuracy_mtry.rds"))
  } # if
} # else

mtry <- which.max(v_accuracy_mtry)
print(paste0("The max value for mtry is ",
             round(max(v_accuracy_mtry),4), 
             " for mtry = ", mtry))
## [1] "The max value for mtry is 0.9461 for mtry = 31"

4.1.4 Set up the reference model

set.seed(seed)
RF_ref_model <- randomForest(Class ~ . , 
                                   data = df_treetop_buffer_features,
                                   mtry = mtry, 
                                   ntree = ntree,
                                   importance = TRUE, 
                                   proximity = TRUE,
                                   oob.prox = TRUE )

eval_RF_ref_model <- RFevaluate(model=RF_ref_model, 
                                ref=df_treetop_buffer_features$Class)
eval_RF_ref_model
## $ConfusionMatrix
##     AA  DB  LD  PA PM  PS  WD
## AA 246   0   0   3  7   1   0
## DB   0 155   3   0  0   0   0
## LD   0   4 124   0  0   0   1
## PA   1   0   0 259 11   6   1
## PM   0   0   0   2 19   2   0
## PS   2   2   0   8  4 123   0
## WD   0   2   1   1  0   0 144
## 
## $Accuracies
##   Predicted Producer's Accuracy User's Accuracy
## 1        AA           0.9879518       0.9571984
## 2        DB           0.9509202       0.9810127
## 3        LD           0.9687500       0.9612403
## 4        PA           0.9487179       0.9316547
## 5        PM           0.4634146       0.8260870
## 6        PS           0.9318182       0.8848921
## 7        WD           0.9863014       0.9729730
## 
## $Metrics
##              metric         value
## 1  Overall Accuracy       0.94523
## 2 Balanced Accuracy      0.940821
## 3     Cohen's Kappa      0.933687
## 4         var Kappa 0.00006657626
## 5           Z-Value    114.430433

4.1.5 Analysis of the reference model

Plot error rate versus number of decision trees

# set colors for the error-rate lines: OOB error (column 1) plus the 7 species classes
pal_genus_classes <- c("yellow","red","green","blue",  "grey",
                          "violet", "orange",  "black")

layout(matrix(c(1,2), nrow = 1),  width=c(4,1))
par(mar=c(5,4,4,0)) # no right padding

plot(RF_ref_model, log = "y", col = pal_genus_classes, lty = 1,
     main= "Reference model: error rate versus number of decision trees")
grid()
par(mar=c(5,0,4,2)) # no left padding

# add legend
plot(c(0,1),type = "n", axes = F, xlab = "", ylab = "")
legend("top", colnames(RF_ref_model$err.rate),
       col = pal_genus_classes, lty = 1)

4.2 Feature selection methods

4.2.1 Recursive Feature Elimination (RFE)

RFE automatically identifies the minimum set of features that meets the specified quality standard.

# rfFuncs is a set of functions used by rfControl
# some of these functions will be adjusted in the following
RFE_adjusted <- rfFuncs

# function "fit" requires adjustment otherwise rerank=TRUE will not work
RFE_adjusted$fit <- function (x, y, first, last, ...) {
  loadNamespace("randomForest")
  randomForest::randomForest(x, y, ...)
}

# adjustment of function "selectSize"
# - determines which set of features is selected eventually
# - use pickSizeTolerance rather than pickSizeBest
#   as pickSizeBest will always select the full set of features as "best",
#   whereas pickSizeTolerance will choose the minimum feature set that meets 
#   the performance within the set tolerance 
#   (tol=1 meaning 1% from "best", i.e. 99%)
RFE_adjusted$selectSize <- function (x, metric, tol = 1, maximize)
{                                # code originates from print(pickSizeTolerance)
  if (!maximize) {
    best <- min(x[, metric])
    perf <- (x[, metric] - best)/best * 100
    flag <- perf <= tol
  }
  else {
    best <- max(x[, metric])
    perf <- (best - x[, metric])/best * 100
    flag <- perf <= tol
  }
  min(x[flag, "Variables"])
}

# function for model fitting within rfe
ctrl_rerank <- rfeControl(functions = RFE_adjusted,
                          method = "cv",  # use cross-validation
                          number = 5,         
                          rerank = TRUE   # recalc importance after each subset
)

# RFE (RandomForest is being executed within RFE)
set.seed(seed)
this_RFE <- rfe(x = select(df_treetop_buffer_features, -ID, -Feature_ID, -Class), 
                y = df_treetop_buffer_features$Class,
                sizes = c(5:25),
                rfeControl = ctrl_rerank)
ggplot(data = this_RFE, mapping = NULL,
       metric = this_RFE$metric[1], output = "layered",
       environment = NULL)

# display features identified by RFE
RFE_predictors <- predictors(this_RFE)
RFE_predictors
##  [1] "Ratio_2015_NIR.Mean"           "NDVI_2015.Mean"               
##  [3] "Ratio_2015_Green.Mean"         "Deri_1_NIR_2015.Mean"         
##  [5] "Ratio_2015_Red.Mean"           "NDVI_2014.Min"                
##  [7] "PCA_1_2015.Mean"               "NDVI_2014.Mean"               
##  [9] "Ratio_2015_Blue.Mean"          "NDVI_Diff_2014_2015.Mean"     
## [11] "Ratio_Targets_top150cm_to_all"
# subset selected features from treetop buffer feature data frame  
df_features_selected <- 
  df_treetop_buffer_features[, names(df_treetop_buffer_features) %in% RFE_predictors]

# sort columns to match the order of RFE_predictors (and thus the order of the plots)
df_features_selected <- df_features_selected[ , RFE_predictors]

# add column Class to the dataframe
df_features_selected$Class <- df_treetop_buffer_features$Class 

col_class <- c("AA"="red",
               "DB"="green",
               "LD"="blue",
               "PA"="grey",
               "PM"="violet",
               "PS"="orange", 
               "WD"="black")

violin_plots <- lapply(1:length(RFE_predictors), function(i) {
    ggplot(df_features_selected) + 
    scale_fill_manual( values=col_class) + # defines the fill colors of the classes 
    theme_classic() + 
    theme(legend.position = "none",  
          panel.grid.major = element_line(color= "black", size = 0.1), 
          panel.grid.minor = element_line(color = "black", size = 0.05), 
          panel.border = element_rect(color = "black", size = 1, fill = NA),
          plot.title = element_text(hjust = 0.5, size = 11), 
          axis.title = element_text(size = 9)) + 
    geom_violin(aes(x=Class, y=df_features_selected[,i], fill = Class)) + 
    # exclude outliers by limiting values to the 0.1-99.9% quantile range
    ylim(quantile(df_features_selected[,i], .001), 
         quantile(df_features_selected[,i], .999))+
    labs(title= RFE_predictors[i], y= '', x='') 
  })

grid.arrange(grobs = violin_plots, ncol = 3)

RF_RFE <- this_RFE$fit 

eval_RF_RFE <- RFevaluate(model=RF_RFE, 
                          ref=df_treetop_buffer_features$Class)
eval_RF_RFE
## $ConfusionMatrix
##     AA  DB  LD  PA PM  PS  WD
## AA 243   0   0   3  8   2   0
## DB   0 154   4   0  0   0   1
## LD   0   5 122   0  0   0   1
## PA   3   0   0 259 12  13   3
## PM   1   0   0   3 16   1   0
## PS   2   2   0   7  5 116   0
## WD   0   2   2   1  0   0 141
## 
## $Accuracies
##   Predicted Producer's Accuracy User's Accuracy
## 1        AA           0.9759036       0.9492188
## 2        DB           0.9447853       0.9685535
## 3        LD           0.9531250       0.9531250
## 4        PA           0.9487179       0.8931034
## 5        PM           0.3902439       0.7619048
## 6        PS           0.8787879       0.8787879
## 7        WD           0.9657534       0.9657534
## 
## $Metrics
##              metric         value
## 1  Overall Accuracy      0.928445
## 2 Balanced Accuracy      0.926408
## 3     Cohen's Kappa       0.91322
## 4         var Kappa 0.00008562746
## 5           Z-Value     98.689075

4.2.2 Variable Importance Plot (VIP)

In order to find a feature selection method that gives improved results and/or requires fewer features, the Variable Importance Plot is analysed. The Variable Importance Plot ranks features according to their importance for the model, expressed by MeanDecreaseAccuracy and MeanDecreaseGini.

varImpPlot(RF_ref_model, sort =TRUE, n.var = 20, lcolor = "grey",
           cex = 0.6, main = "Variable Importance Plot for Reference Model",
           cex.main = 0.9)

As RFE has identified 11 features, the top-ranking 11 features in a combination of MeanDecreaseAccuracy and MeanDecreaseGini are selected.

# save ranks of importance in dataframe, saving row names as field "feature"
# save rank position for MeanDecreaseAccuracy and MeanDecreaseGini as new fields
df_ref_model_feature_importance <- 
                  as.data.frame(round(importance(RF_ref_model),2)) %>%
                  tibble::rownames_to_column("feature") %>%
                  arrange(desc(MeanDecreaseGini)) %>%
                  tibble::rownames_to_column("Gini_Ranking") %>%
                  mutate(Gini_Ranking = as.numeric(Gini_Ranking)) %>%
                  arrange(desc(MeanDecreaseAccuracy)) %>%
                  tibble::rownames_to_column("Acc_Ranking") %>%
                  mutate(Acc_Ranking = as.numeric(Acc_Ranking)) 
str(df_ref_model_feature_importance)
## 'data.frame':    128 obs. of  12 variables:
##  $ Acc_Ranking         : num  1 2 3 4 5 6 7 8 9 10 ...
##  $ Gini_Ranking        : num  5 2 1 7 6 17 3 8 22 4 ...
##  $ feature             : chr  "Deri_1_NIR_2015.Mean" "NDVI_2015.Mean" "Ratio_2015_NIR.Mean" "NDVI_2014.Min" ...
##  $ AA                  : num  4.33 16.1 17.71 2.57 5.46 ...
##  $ DB                  : num  4.85 15.09 15 12.56 14.77 ...
##  $ LD                  : num  2.5 16.9 15.7 14 17.5 ...
##  $ PA                  : num  42.87 15.91 17.31 3.7 5.69 ...
##  $ PM                  : num  8.93 5.42 5.95 1.46 0.36 3.53 -0.7 2.11 5.21 3.69 ...
##  $ PS                  : num  26.37 19.76 18.47 11.28 9.32 ...
##  $ WD                  : num  4.59 15.92 16.75 17.99 14.49 ...
##  $ MeanDecreaseAccuracy: num  42.9 22.2 22.2 20.2 19.6 ...
##  $ MeanDecreaseGini    : num  41.2 89.6 89.8 40.3 41.1 ...
# SELECT features that rank in the top 7 of MeanDecreaseAccuracy 
# OR in the top 10 of MeanDecreaseGini
df_ref_model_feature_VIP <- filter(df_ref_model_feature_importance, 
                                    Acc_Ranking <= 7 | Gini_Ranking <= 10) %>%
                               select(.,-c("Acc_Ranking","Gini_Ranking"))

# select the identified set of features from the original features
df_treetop_buffer_features_VIP <- 
              select(df_treetop_buffer_features,
                      all_of(df_ref_model_feature_VIP$feature)) %>%
              mutate(Class = df_treetop_buffer_features$Class)


if (sw_find_field_order == FALSE) {
  # change order of the fields in the feature table
  df_treetop_buffer_features_VIP <- 
          df_treetop_buffer_features_VIP[,
            readRDS(paste0(path_output_data,"v_features_field_order_VIP.rds"))]
} else {
   
   # the OOB error estimate changes with the field order in the data table;
   # multiple runs with random field order are carried out
   # to identify the order that produces the minimum error value
   num_runs      <-  25                   # number of loop runs
   v_oob         <- c()                    # save oob  of each loop run
   m_field_order <- matrix(0,              # save field names of each loop run
                           ncol = ncol(df_treetop_buffer_features_VIP), 
                           nrow = num_runs)   
   
   for (k in 1:num_runs)  {
     
      # randomize the field order of the feature table, while saving the order
      m_field_order[k,] <- sample(names(df_treetop_buffer_features_VIP))
      # ... and apply this order to the feature table  
      df_treetop_buffer_features_VIP <- 
                          df_treetop_buffer_features_VIP[,m_field_order[k,]]

      set.seed(seed)
      RF_VIP <- randomForest(Class ~ . , 
                                      data = df_treetop_buffer_features_VIP,
                                      mtry=3, ntree=500,
                                      importance=TRUE, 
                                      proximity = TRUE,
                                      oob.prox = TRUE
      )
      v_oob[k] <- RF_VIP$err.rate[500,1] # save the OOB error value 
   } # for
   
   boxplot(v_oob, notch=F, names=names(v_oob),
        ylab = "OOB error rate",
        ylim = c(0.05,0.07), 
        main = "OOB error rate variability caused by sequence of fields"
        )

   if (sw_save_results_as_rds== TRUE) {
      # save the field order of the feature table that yields the minimum OOB error
     saveRDS(m_field_order[which.min(v_oob),],
              file = paste0(path_output_data,"v_features_field_order_VIP.rds"))
   }
   # ... and apply this order to the feature table
   df_treetop_buffer_features_VIP <- 
              df_treetop_buffer_features_VIP[,m_field_order[which.min(v_oob),]]

} # else

# run RandomForest with the min OOB feature order
set.seed(seed)
RF_VIP <- randomForest(Class ~ . , 
                                   data = df_treetop_buffer_features_VIP,
                                   mtry=3, ntree=500,
                                   importance=TRUE, 
                                   proximity = TRUE,
                                   oob.prox = TRUE
   )

eval_RF_VIP <- RFevaluate(model=RF_VIP, 
                             ref=df_treetop_buffer_features$Class)
eval_RF_VIP
## $ConfusionMatrix
##     AA  DB  LD  PA PM  PS  WD
## AA 238   0   0   3 12   1   0
## DB   0 155   3   0  0   0   1
## LD   0   5 123   0  0   0   1
## PA   4   1   0 256 10  28   3
## PM   7   0   0   1 11   2   0
## PS   0   0   0  12  8 101   0
## WD   0   2   2   1  0   0 141
## 
## $Accuracies
##   Predicted Producer's Accuracy User's Accuracy
## 1        AA           0.9558233       0.9370079
## 2        DB           0.9509202       0.9748428
## 3        LD           0.9609375       0.9534884
## 4        PA           0.9377289       0.8476821
## 5        PM           0.2682927       0.5238095
## 6        PS           0.7651515       0.8347107
## 7        WD           0.9657534       0.9657534
## 
## $Metrics
##              metric       value
## 1  Overall Accuracy    0.905477
## 2 Balanced Accuracy    0.906328
## 3     Cohen's Kappa    0.885207
## 4         var Kappa 0.000110333
## 5           Z-Value   84.273752

4.3 Comparison of the feature selection models to the reference model

Based on the quality metrics (Balanced Accuracy, Kappa, Z-value), the 11 features selected by Recursive Feature Elimination (RF_RFE) perform clearly better than the 11 features combined from the top-ranking features for MeanDecreaseAccuracy and MeanDecreaseGini (RF_VIP). Compared to the reference model, RF_RFE also shows a much lower Kappa difference, calculated as (Kappa1 - Kappa2) / sqrt(var.Kappa1 + var.Kappa2), i.e. its accuracy deviates less from the reference model. The RFE feature set is therefore applied to the entire National Park in Script 2.

# kappa difference reference model to RFE model
diff_kappa_ref_model_RFE <- (as.numeric(eval_RF_ref_model$Metrics[3,2]) -
                                as.numeric(eval_RF_RFE$Metrics[3,2])) /
                            sqrt(as.numeric(eval_RF_ref_model$Metrics[4,2]) +
                                 as.numeric(eval_RF_RFE$Metrics[4,2]))
print(paste0("The kappa difference between reference model (129 features)", 
             " and RF_RFE (", length(predictors(this_RFE)),
             " features) is: ", round(diff_kappa_ref_model_RFE,3)))
## [1] "The kappa difference between reference model (129 features) and RF_RFE (11 features) is: 1.659"
# kappa difference reference model to VIP model
diff_kappa_ref_model_VIP <- (as.numeric(eval_RF_ref_model$Metrics[3,2]) -
                                as.numeric(eval_RF_VIP$Metrics[3,2])) /
                            sqrt(as.numeric(eval_RF_ref_model$Metrics[4,2]) +
                                 as.numeric(eval_RF_VIP$Metrics[4,2]))
print(paste0("The kappa difference between reference model (129 features)", 
             " and RF_VIP (",ncol(df_treetop_buffer_features_VIP)-1,
             " features) is: ", round(diff_kappa_ref_model_VIP,3)))
## [1] "The kappa difference between reference model (129 features) and RF_VIP (11 features) is: 3.645"
   if (sw_save_results_as_rds== TRUE) {
     # save the final RandomForest model to be used in Script 2
     saveRDS(RF_RFE,  file = paste0(path_output_data,"RF_final.rds"))
   }