Objective of both scripts is to semi-automatically obtain the species/genus for trees (>15 m height) from remote sensing data. Classes are Norway spruce (Picea abies, PA), European silver fir (Abies alba, AA), Scots Pine (Pinus sylvestris, PS), European Larch (Larix decidua, LD), Douglas fir (Pseudotsugea menziesii, PM), unspecified deciduous broadleaved species (DB) and snags (WD).
This present Script 1 sets up the model parameter of a Random Forest classification model using 1132 treetop areas identified through visual interpretation of stereoscopic aerial photos that will be applied to the remote sensing data in Script 2. A feature selection approach is applied to reduce the initial 129 features to 11 features without significantly reducing classification accuracy.
Original remote sensing data used in this study are: 1) aerial stereoscopic photos (RGBI, 10cm, acquired Juli 2014, leaf-on)
2) aerial stereoscopic photos (RGBI, 10cm, acquired April 2015, leaf-off)
3) airborne laserscan (ALS) data (25 beams/m², acquired April/May 2015, leaf-off)
From these original data the following datasets were derived as input data for this classification process: 1) TDOP: The aerial stereoscopic photos were converted into true digital orthophoto (TDOP). 2) Ratio images: Each of the spectral bands of the 2015 TDOP was normalized by the sum of all four sectral bands. 3) NDVI: for both data sets, TDOP_2014 and TDOP_2015, the Normalized Differential Vegetation Index (NDVI) was calculated from the red and near infrared spectral band. 4) Derivation NIR: for the near infrared band of the 2015 dataset the derivation was calculated 5) Tree crown outline polygons: By selecting exclusivly the first targets of each ALS beam, a canopy height model (CHM) was interpolated and tree crown outline polygons have been calculated. Additional features including treetop xyz-coordinates tree crown area are provided. 6) Altitude-normalized points: By selecting exclusivly the last targets of each ALS beam, a digital terrain model (DTM) was interpolated in order to altitude-normalize the original ALS targets. Hence their z-values reflect height above ground.
Switches trigger the skipping of time consuming calculations during test runs. The required data are read from RDS-objects.
## [1] "R version 4.1.1 (2021-08-10)"
sw_test_run <- T # TRUE: run test dataset
# (use different directories for input/output)
sw_save_results_as_rds <- T # TRUE: save results as RDS-file (useful to
# avoid repeating time consuming processes)
sw_find_field_order <- F # TRUE: find field order in the feature table that
# mimimizes OOB error in Random Forest
# (which is time consuming)
sw_find_ntree <- F # TRUE: find suitable value for Random Forest parameter ntree
# (which is time consuming)
sw_find_mtry <- F # TRUE: find suitable value for Random Forest parameter mytry
# (which is time consuming)
sw_markup_plots <- T # TRUE: plots will be drawn in markup
# (which is time consuming)
options(stringsAsFactors = F) # FALSE: strings are not factors
# general purpose function by Anish Sugathan to install not-yet installed packages and load them
usePackage <- function(p) {
# load a package if installed, else load after installation.
# Args: p: package name in quotes
if (!is.element(p, installed.packages()[,1])) install.packages(p, dep = TRUE)
require(p, character.only = TRUE)
}
usePackage('raster') # read and manipiulate raster data
usePackage('rgdal') # read and manipulate vector data (GDAL)
usePackage('sf') # read and manipulate vector data (simple features)
usePackage('dplyr') # efficiently manipulating datasets
usePackage('ggplot2') # advanced visualization, ggplot (with two 'g's)
usePackage('rasterVis') # visualize of raster data, gplot (with one 'g')
usePackage('gridExtra') # arrange multiple grid-based plots on a page
usePackage('randomForest') # random forest model functions
usePackage('caret') # optimize and validiate models
usePackage('RStoolbox') # remote sensing data processing
usePackage('foreign') # read dbf-files
usePackage('rgeos') # interface to Geometry Engine - Open Source
usePackage('magrittr') # provides the use of pipes %>%
usePackage('readr') # parsing a flat file into a tibble
usePackage('psych') # calculate kappa and its variance
# ground reference points (trees)
sf_ref_points <- st_read(paste0(path_input_data, "reference_data/ref_points.geojson"))
st_crs(sf_ref_points)
# polygon of 1000m buffer of convexhull around National Park boundary
sf_nlp_1000m_boundary_poly <- st_read(paste0(path_input_data,
"project_boundary/NLP_ConvexHull_AllBuffer_1000m.geojson"))
# polygon of National Park boundary
sf_nlp_boundary_poly <- st_read(paste0(path_input_data,
"project_boundary/NLP_boundary.geojson"))
ggplot() +
geom_sf(data = st_geometry(sf_nlp_1000m_boundary_poly),
aes(colour = "Convex hull 1km buffer")) +
geom_sf(data = st_geometry(sf_nlp_boundary_poly),
aes(colour = "National Park boundary")) +
geom_sf(data = sf_ref_points)
Reference trees with National Park boundary and its 1000m buffer convex hull outline
This is the list of the 129 input features that have been calculated for the 1132 treetop buffer areas. Both the raster extract process and lidar metrics calculation process is described in detail in the script markup of script 2, hence these steps are skipped here to avoid replication. Instead the result of those time consuming processes is imported from RDS-file.
df_treetop_buffer_features <-
readRDS(paste0(path_input_data, "reference_data/treetop_buffer_features.rds"))
paste(names(df_treetop_buffer_features), collapse=" ")
## [1] "ID EASTTOP NORTHTOP HEIGHTTOP HEIGHTDTM AREACROWN RAD1ELLIP RAD2ELLIP DIRELLIP Class Feature_ID TDOP_Red_2015.SD TDOP_Green_2015.SD TDOP_Blue_2015.SD TDOP_NIR_2015.SD TDOP_Red_2015.Mean TDOP_Green_2015.Mean TDOP_Blue_2015.Mean TDOP_NIR_2015.Mean TDOP_Red_2015.Min TDOP_Green_2015.Min TDOP_Blue_2015.Min TDOP_NIR_2015.Min TDOP_Red_2015.Max TDOP_Green_2015.Max TDOP_Blue_2015.Max TDOP_NIR_2015.Max NDVI_2014.SD NDVI_2015.SD NDVI_2014.Mean NDVI_2015.Mean NDVI_2014.Min NDVI_2015.Min NDVI_2014.Max NDVI_2015.Max NDVI_Diff_2014_2015.Mean NDVI_nDiff_2014_2015.Mean Ratio_2015_Red.SD Ratio_2015_Green.SD Ratio_2015_Blue.SD Ratio_2015_NIR.SD Ratio_2015_Red.Mean Ratio_2015_Green.Mean Ratio_2015_Blue.Mean Ratio_2015_NIR.Mean Ratio_2015_Red.Min Ratio_2015_Green.Min Ratio_2015_Blue.Min Ratio_2015_NIR.Min Ratio_2015_Red.Max Ratio_2015_Green.Max Ratio_2015_Blue.Max Ratio_2015_NIR.Max PCA_1_2015.SD PCA_2_2015.SD PCA_3_2015.SD PCA_1_2015.Mean PCA_2_2015.Mean PCA_3_2015.Mean PCA_1_2015.Min PCA_2_2015.Min PCA_3_2015.Min PCA_1_2015.Max PCA_2_2015.Max PCA_3_2015.Max CHM_2015.SD CHM_2015.Mean CHM_2015.Min CHM_2015.Max DSM_2015.SD DSM_2015.Mean DSM_2015.Min DSM_2015.Max DSM_Roughness.SD DSM_Roughness.Mean DSM_Roughness.Min DSM_Roughness.Max CHM_Roughness.SD CHM_Roughness.Mean CHM_Roughness.Min CHM_Roughness.Max DSM_Slope_1.SD DSM_Slope_1.Mean DSM_Slope_1.Min DSM_Slope_1.Max DSM_Slope_2.SD DSM_Slope_2.Mean DSM_Slope_2.Min DSM_Slope_2.Max CHM_Neigung.SD CHM_Neigung.Mean CHM_Neigung.Min CHM_Neigung.Max DSM_Curvature.SD DSM_Curvature.Mean DSM_Curvature.Min DSM_Curvature.Max CHM_Curvature.SD CHM_Curvature.Mean CHM_Curvature.Min CHM_Curvature.Max Deri_1_NIR_2015.SD Deri_1_NIR_2015.Mean Deri_1_NIR_2015.Min Deri_1_NIR_2015.Max EVI_2015.Mean Mean_Intensity_all Mean_TargetNum_all Mean_NumTargets_all Sum_Targets_all Mean_Intensity_top150cm Mean_TargetNum_top150cm Mean_NumTargets_top150cm Sum_Targets_top150cm Ratio_Targets_top150cm_to_all Mean_Intensity_1st_Targets_top150cm Mean_NumTargets_1st_Targets_top150cm Sum_Targets_1st_Targets_top150cm Sum_1st_Targets_all Ratio_Targets_1st_Targets_top150cm_to_1st_Targets_all Ratio_Targets_1st_Targets_top150cm_to_all_top150cm Mean_Intensity_no_1st_Targets_top300cm Mean_TargetNum_no_1st_Targets Mean_NumTargets_no_1st_Targets_top300cm Sum_Targets_no_1st_Targets_top300cm Sum_Targets_no_1st_Targets_all Sum_Targets_top300cm Ratio_Targets_no_1st_Targets_top300cm_to_no_1st_Targets_all Ratio_Targets_no_1st_Targets_top300cm_to_all_top300cm"
The preleminary Random Forest model is run including all 129 input features listed above to create the reference model. As calculating the estimated 4 mio trees using all features exceeds calculation effort, the number of features need to be reduced without compromising in model accuracy.
Two feature selection methods are compared:
1) the automatic Recursive Feature Elimination (RFE), and 2) a manual selection of features that score high variable importance values.
For each feature selection method a Random Forest model ist generatend and evaluated. Finally, the best model is compared to the reference model.
RFevaluate <- function(model, ref) {
# calculate confusion matrix
confmatrix <- confusionMatrix(model$predicted,
reference = ref)
v_predicted <- levels(model$predicted)
# Producer's und User's Accuracy
list_acc_user <- list()
list_acc_prod <- list()
for (i in 1:length(v_predicted)) {
list_acc_user[i] <- confmatrix$table[i,i] / sum(confmatrix$table[i,])
list_acc_prod[i] <- confmatrix$table[i,i] / sum(confmatrix$table[,i])
}
df_accuracies <- data.frame(v_predicted,
unlist(list_acc_prod),
unlist(list_acc_user)) %>%
set_colnames(c("Predicted", "Producer's Accuracy",
"User's Accuracy"))
# Cohen's Kappa: compares an Observed Accuracy with an Expected Accuracy
# values 0.81–1.00 are considered almost perfect agreement.
co_kappa <- cohen.kappa(confmatrix$table)
# Z-value: significance of difference to random classification
z_value <- co_kappa$kappa / sqrt(co_kappa$var.kappa)
df_v <- data.frame(metric=character(),value=numeric())
df_v[1,] <- c("Overall Accuracy", round(as.numeric(confmatrix$overall[1]),6))
df_v[2,] <- c("Balanced Accuracy",round(mean(confmatrix$byClass[,11]),6))
df_v[3,] <- c("Cohen's Kappa", round(co_kappa$kappa, 6))
df_v[4,] <- c("var Kappa", format(co_kappa$var.kappa, scientific=F))
df_v[5,] <- c("Z-Value", round(z_value, 6))
list_output <- list(ConfusionMatrix = as.matrix(confmatrix),
Accuracies = df_accuracies,
Metrics = df_v)
return(list_output)
} # end function RFevaluate
# Species classes extracted from the reference data
v_species_classes <- sort(unique(df_treetop_buffer_features$Class))
print(paste0("Species classes: ",v_species_classes[1]," (Abies alba), ",
v_species_classes[2]," (Deciduous broadleaved), ",
v_species_classes[3]," (Larix decidua), ",
v_species_classes[4]," (Picea abies), ",
v_species_classes[5]," (Pseudotsuga menziesii), ",
v_species_classes[6]," (Pinus sylvestris), ",
v_species_classes[7]," (snags)"))
## [1] "Species classes: AA (Abies alba), DB (Deciduous broadleaved), LD (Larix decidua), PA (Picea abies), PM (Pseudotsuga menziesii), PS (Pinus sylvestris), WD (snags)"
The values of OOB error estimate change with the sequence of fields within the feature table.
Repeated model runs with randomly change of field sequence identify the sequence that results in the minimum error values.
if (sw_find_field_order == FALSE) {
# change order of the fields in the feature table
df_treetop_buffer_features <-
df_treetop_buffer_features[,
readRDS(paste0(path_output_data,"v_features_field_order.rds"))]
} else {
num_runs <- 25 # number of loop runs
v_oob <- c() # save oob of each loop run
m_field_order <- matrix(0, # save field names of each loop run
ncol = ncol(df_treetop_buffer_features),
nrow = num_runs)
for (k in 1:num_runs) {
# randomize the dataframe's column order
m_field_order[k,] <- sample(names(df_treetop_buffer_features))
df_treetop_buffer_features <-
df_treetop_buffer_features[,m_field_order[k,]]
set.seed(seed)
RF_ref_model <-
randomForest(Class ~ . -ID -Feature_ID,
data = df_treetop_buffer_features,
importance = TRUE,
proximity = TRUE,
oob.prox = TRUE
)
v_oob[k] <- RF_ref_model$err.rate[500,1] # save the OOB error value
} # for
# demonstrate the variability of results caused by sequence of fields
boxplot(v_oob, notch=F, names=names(v_oob),
ylab = "OOB error rate",
ylim = c(0.05,0.07),
main = "OOB error rate variability caused by sequence of fields"
)
if (sw_save_results_as_rds== TRUE) {
# save order of the fields in the feature table to the minimum OOB error
saveRDS(m_field_order[which.min(v_oob),],
file = paste0(path_output_data,"v_features_field_order.rds"))
}
# ... and apply this order to the feature table
df_treetop_buffer_features <-
df_treetop_buffer_features[,m_field_order[which.min(v_oob),]]
} # else
set.seed(seed)
RF_ref_model <- randomForest(Class ~ . ,
data = df_treetop_buffer_features,
importance=TRUE,
proximity = TRUE,
oob.prox = TRUE
)
eval_RF_ref_model <- RFevaluate(model=RF_ref_model,
ref=df_treetop_buffer_features$Class)
eval_RF_ref_model
## $ConfusionMatrix
## AA DB LD PA PM PS WD
## AA 245 0 0 2 8 1 0
## DB 0 155 3 0 0 0 0
## LD 0 4 124 0 0 0 1
## PA 1 0 0 259 18 14 1
## PM 0 0 0 3 13 0 0
## PS 3 2 0 8 2 117 0
## WD 0 2 1 1 0 0 144
##
## $Accuracies
## Predicted Producer's Accuracy User's Accuracy
## 1 AA 0.9839357 0.9570312
## 2 DB 0.9509202 0.9810127
## 3 LD 0.9687500 0.9612403
## 4 PA 0.9487179 0.8839590
## 5 PM 0.3170732 0.8125000
## 6 PS 0.8863636 0.8863636
## 7 WD 0.9863014 0.9729730
##
## $Metrics
## metric value
## 1 Overall Accuracy 0.933746
## 2 Balanced Accuracy 0.925724
## 3 Cohen's Kappa 0.919582
## 4 var Kappa 0.0000799014
## 5 Z-Value 102.875835
The default value for ntree is 500. This procedure checks, whether a different value results in significantly lower OOB error.
# as calculation of error rates is time consuming,
# during test runs read the results from a RDS file
if (sw_find_ntree == FALSE) {
df_error_rates <-
readRDS(paste0(path_output_data, "df_error_rates.rds"))
} else {
# identify the ntree (number of decision trees)
# that minimises the OOB estimate of error rate
v_ntree_options <- c(250,500,750,1000,1500)
# number of tests runs
ntests <- 10
# create empty dataframe to take up the results of the for-loop runs
df_error_rates <- as.data.frame(matrix(nrow = ntests,
ncol=length(v_ntree_options)))
for (i in 1:length(v_ntree_options)){
names(df_error_rates)[i] <-
paste(as.character(v_ntree_options[i]),"trees",sep=" ")
set.seed(seed) # render results reproducible
for (j in 1:ntests){
rfmodel <- randomForest(Class ~ .-ID -Feature_ID,
data = df_treetop_buffer_features,
ntree = v_ntree_options[i],
importance = TRUE)
# save error rates for each run
df_error_rates[j,i] <- rfmodel$err.rate[v_ntree_options[i]]
}
}
if (sw_save_results_as_rds== TRUE) {
saveRDS(df_error_rates,
file = paste0(path_output_data, "df_error_rates.rds"))
} # if
} # else
# plot OOB (out of box) error rates for different ntree values
boxplot(df_error_rates, notch=F,names=names(df_error_rates),
ylab = "OOB error rate",
ylim = c(0.05,0.07),
main = "OOB (out of box) error rates for different ntree values"
)
grid() # adds a rectangular grid to the plot
The default value for mtry is sqrt(number of features). This procedure tries to identify a value that provides a higher accuracy.
# as calculation of mtry is time consuming,
# during test runs read the results from a RDS file
if (sw_find_mtry == FALSE) {
v_accuracy_mtry <-
readRDS(paste0(path_output_data, "accuracy_mtry.rds"))
} else {
v_accuracy_mtry = c() # vector to be filled by loop
for (i in 1:50) { # values for mtry
set.seed(seed)
rf <- randomForest(Class ~ .-ID -Feature_ID,
data = df_treetop_buffer_features,
ntree = ntree,
mtry = i,
importance = TRUE)
v_accuracy_mtry[i] <-
mean(rf$predicted == df_treetop_buffer_features$Class)
} # for
if (sw_save_results_as_rds== TRUE) {
saveRDS(v_accuracy_mtry,
file = paste0(path_output_data, "accuracy_mtry.rds"))
} # if
} # else
mtry <- which.max(v_accuracy_mtry)
print(paste0("The max value for mtry is ",
round(max(v_accuracy_mtry),4),
" for mtry = ", mtry))
## [1] "The max value for mtry is 0.9461 for mtry = 31"
set.seed(seed)
RF_ref_model <- randomForest(Class ~ . ,
data = df_treetop_buffer_features,
mtry = mtry,
ntree = ntree,
importance = TRUE,
proximity = TRUE,
oob.prox = TRUE )
eval_RF_ref_model <- RFevaluate(model=RF_ref_model,
ref=df_treetop_buffer_features$Class)
eval_RF_ref_model
## $ConfusionMatrix
## AA DB LD PA PM PS WD
## AA 246 0 0 3 7 1 0
## DB 0 155 3 0 0 0 0
## LD 0 4 124 0 0 0 1
## PA 1 0 0 259 11 6 1
## PM 0 0 0 2 19 2 0
## PS 2 2 0 8 4 123 0
## WD 0 2 1 1 0 0 144
##
## $Accuracies
## Predicted Producer's Accuracy User's Accuracy
## 1 AA 0.9879518 0.9571984
## 2 DB 0.9509202 0.9810127
## 3 LD 0.9687500 0.9612403
## 4 PA 0.9487179 0.9316547
## 5 PM 0.4634146 0.8260870
## 6 PS 0.9318182 0.8848921
## 7 WD 0.9863014 0.9729730
##
## $Metrics
## metric value
## 1 Overall Accuracy 0.94523
## 2 Balanced Accuracy 0.940821
## 3 Cohen's Kappa 0.933687
## 4 var Kappa 0.00006657626
## 5 Z-Value 114.430433
Plot error rate versus number of decision trees
# set colors for the genus_classes (OOB-error = 1)
pal_genus_classes <- c("yellow","red","green","blue", "grey",
"violet", "orange", "black")
layout(matrix(c(1,2), nrow = 1), width=c(4,1))
par(mar=c(5,4,4,0)) # no right padding
plot(RF_ref_model, log = "y", col = pal_genus_classes, lty = 1,
main= "Reference model: error rate versus number of decision trees")
grid()
par(mar=c(5,0,4,2)) # no left padding
# add legend
plot(c(0,1),type = "n", axes = F, xlab = "", ylab = "")
legend("top", colnames(RF_ref_model$err.rate),
col = pal_genus_classes, lty = 1)
RFE automatically identifies the minimum set of features that meet the set quality standards.
# rfFuncs is a set of functions used by rfControl
# some of these functions will be adjusted in the following
RFE_adjusted <- rfFuncs
# function "fit" requires adjustment otherwise rerank=TRUE will not work
RFE_adjusted$fit <- function (x, y, first, last, ...) {
loadNamespace("randomForest")
randomForest::randomForest(x, y, ...)
}
# adjustment of function "selectSize"
# - determins which set of features is selected eventually
# - use pickSizeTolerance rather than pickSizeBest
# as pickSizeBest will always select the full set of features as "best",
# whereas pickSizeTolerance will choose the minimum feature set that meets
# the performance within the set tolerance
# (tol=1 meaning 1% from "best", i.e. 99%)
RFE_adjusted$selectSize <- function (x, metric, tol = 1, maximize)
{ # code origins from print(pickSizeTolerance)
if (!maximize) {
best <- min(x[, metric])
perf <- (x[, metric] - best)/best * 100
flag <- perf <= tol
}
else {
best <- max(x[, metric])
perf <- (best - x[, metric])/best * 100
flag <- perf <= tol
}
min(x[flag, "Variables"])
}
# function for model fitting within rfe
ctrl_rerank <- rfeControl(functions = RFE_adjusted,
method = "cv", # use cross-validation
number = 5,
rerank = TRUE # recalc importance after each subset
)
# RFE (RandomForest is being executed within RFE)
set.seed(seed)
this_RFE <- rfe(x = select(df_treetop_buffer_features, -ID, -Feature_ID, -Class),
y = df_treetop_buffer_features$Class,
sizes = c(5:25),
rfeControl = ctrl_rerank)
ggplot(data = this_RFE, mapping = NULL,
metric = this_RFE$metric[1], output = "layered",
environment = NULL)
## [1] "Ratio_2015_NIR.Mean" "NDVI_2015.Mean"
## [3] "Ratio_2015_Green.Mean" "Deri_1_NIR_2015.Mean"
## [5] "Ratio_2015_Red.Mean" "NDVI_2014.Min"
## [7] "PCA_1_2015.Mean" "NDVI_2014.Mean"
## [9] "Ratio_2015_Blue.Mean" "NDVI_Diff_2014_2015.Mean"
## [11] "Ratio_Targets_top150cm_to_all"
# subset selected features from treetop buffer feature data frame
df_features_selected <-
df_treetop_buffer_features[, names(df_treetop_buffer_features) %in% RFE_predictors]
# sort columns to reflect the order of vector name in the order of plots
df_features_selected <- df_features_selected[ , RFE_predictors]
# add columns Class to the dataframe
df_features_selected$Class <- df_treetop_buffer_features$Class
col_class <- c("AA"="red",
"DB"="green",
"LD"="blue",
"PA"="grey",
"PM"="violet",
"PS"="orange",
"WD"="black")
violin_plots <- lapply(1:length(RFE_predictors), function(i) {
ggplot(df_features_selected) +
scale_fill_manual( values=col_class) + # defines the color display of the points
theme_classic() +
theme(legend.position = "none",
panel.grid.major = element_line(color= "black", size = 0.1),
panel.grid.minor = element_line(color = "black", size = 0.05),
panel.border = element_rect(color = "black", size = 1, fill = NA),
plot.title = element_text(hjust = 0.5, size = 11),
axis.title = element_text(size = 9)) +
geom_violin(aes(x=Class, y=df_features_selected[,i], fill = Class)) +
# exclude outlier by limiting values to range 0.1-99.9%
ylim(quantile(df_features_selected[,i], .001),
quantile(df_features_selected[,i], .999))+
labs(title= RFE_predictors[i], y= '', x='')
})
grid.arrange(grobs = violin_plots, ncol = 3)
RF_RFE <- this_RFE$fit
eval_RF_RFE <- RFevaluate(model=RF_RFE,
ref=df_treetop_buffer_features$Class)
eval_RF_RFE
## $ConfusionMatrix
## AA DB LD PA PM PS WD
## AA 243 0 0 3 8 2 0
## DB 0 154 4 0 0 0 1
## LD 0 5 122 0 0 0 1
## PA 3 0 0 259 12 13 3
## PM 1 0 0 3 16 1 0
## PS 2 2 0 7 5 116 0
## WD 0 2 2 1 0 0 141
##
## $Accuracies
## Predicted Producer's Accuracy User's Accuracy
## 1 AA 0.9759036 0.9492188
## 2 DB 0.9447853 0.9685535
## 3 LD 0.9531250 0.9531250
## 4 PA 0.9487179 0.8931034
## 5 PM 0.3902439 0.7619048
## 6 PS 0.8787879 0.8787879
## 7 WD 0.9657534 0.9657534
##
## $Metrics
## metric value
## 1 Overall Accuracy 0.928445
## 2 Balanced Accuracy 0.926408
## 3 Cohen's Kappa 0.91322
## 4 var Kappa 0.00008562746
## 5 Z-Value 98.689075
In order to find a feature selection method that gives improved results and/or requires less features, the Variable Importance Plot is analysed. The Variable Importance Plot ranks features according to their importance for the model expressed by MeanDecreaseAccuracy and MeanDecreaseGini.
varImpPlot(RF_ref_model, sort =TRUE, n.var = 20, lcolor = "grey",
cex = 0.6, main = "Variable Importance Plot for Reference Model",
cex.main = 0.9)
AS RFE has identified 11 features, the top ranking 11 features in a combination of MeanDecreaseAccuracy and MeanDecreaseGini are selected.
# save ranks of importance in dataframe, saving row names as field "feature"
# save rank position for MeanDecreaseAccuracy and MeanDecreaseGini as new fields
df_ref_model_feature_importance <-
as.data.frame(round(importance(RF_ref_model),2)) %>%
tibble::rownames_to_column("feature") %>%
arrange(desc(MeanDecreaseGini)) %>%
tibble::rownames_to_column("Gini_Ranking") %>%
mutate(Gini_Ranking = as.numeric(Gini_Ranking)) %>%
arrange(desc(MeanDecreaseAccuracy)) %>%
tibble::rownames_to_column("Acc_Ranking") %>%
mutate(Acc_Ranking = as.numeric(Acc_Ranking))
str(df_ref_model_feature_importance)
## 'data.frame': 128 obs. of 12 variables:
## $ Acc_Ranking : num 1 2 3 4 5 6 7 8 9 10 ...
## $ Gini_Ranking : num 5 2 1 7 6 17 3 8 22 4 ...
## $ feature : chr "Deri_1_NIR_2015.Mean" "NDVI_2015.Mean" "Ratio_2015_NIR.Mean" "NDVI_2014.Min" ...
## $ AA : num 4.33 16.1 17.71 2.57 5.46 ...
## $ DB : num 4.85 15.09 15 12.56 14.77 ...
## $ LD : num 2.5 16.9 15.7 14 17.5 ...
## $ PA : num 42.87 15.91 17.31 3.7 5.69 ...
## $ PM : num 8.93 5.42 5.95 1.46 0.36 3.53 -0.7 2.11 5.21 3.69 ...
## $ PS : num 26.37 19.76 18.47 11.28 9.32 ...
## $ WD : num 4.59 15.92 16.75 17.99 14.49 ...
## $ MeanDecreaseAccuracy: num 42.9 22.2 22.2 20.2 19.6 ...
## $ MeanDecreaseGini : num 41.2 89.6 89.8 40.3 41.1 ...
# SELECT features that rank top x in MeanDecreaseAccuracy
# OR top y in MeanDecreaseGini
df_ref_model_feature_VIP <- filter(df_ref_model_feature_importance,
Acc_Ranking <= 7 | Gini_Ranking <= 10) %>%
select(.,-c("Acc_Ranking","Gini_Ranking"))
# select the identified set of features from the original features
df_treetop_buffer_features_VIP <-
select(df_treetop_buffer_features,
all_of(df_ref_model_feature_VIP$feature)) %>%
mutate(Class = df_treetop_buffer_features$Class)
if (sw_find_field_order == FALSE) {
# change order of the fields in the feature table
df_treetop_buffer_features_VIP <-
df_treetop_buffer_features_VIP[,
readRDS(paste0(path_output_data,"v_features_field_order_VIP.rds"))]
} else {
# values OOB error estimate change for each field order in the data table
# multiple runs with random order of fields are carried out
# to identify the order that produces the the minimum error value
num_runs <- 25 # number of loop runs
v_oob <- c() # save oob of each loop run
m_field_order <- matrix(0, # save field names of each loop run
ncol = ncol(df_treetop_buffer_features_VIP),
nrow = num_runs)
for (k in 1:num_runs) {
# randomize the field order of the feature table, while saving the order
m_field_order[k,] <- sample(names(df_treetop_buffer_features_VIP))
# ... and apply this order to the feature table
df_treetop_buffer_features_VIP <-
df_treetop_buffer_features_VIP[,m_field_order[k,]]
set.seed(seed)
RF_VIP <- randomForest(Class ~ . ,
data = df_treetop_buffer_features_VIP,
mtry=3, ntrees=500,
importance=TRUE,
proximity = TRUE,
oob.prox = TRUE
)
v_oob[k] <- RF_VIP$err.rate[500,1] # save the OOB error value
} # for
boxplot(v_oob, notch=F, names=names(v_oob),
ylab = "OOB error rate",
ylim = c(0.05,0.07),
main = "OOB error rate variability caused by sequence of fields"
)
if (sw_save_results_as_rds== TRUE) {
# save order of the fields in the feature table to the minimum OOB error
saveRDS(m_field_order[which.min(v_oob),],
file = paste0(path_output_data,"v_features_field_order_VIP.rds"))
}
# ... and apply this order to the feature table
df_treetop_buffer_features_VIP <-
df_treetop_buffer_features_VIP[,m_field_order[which.min(v_oob),]]
} # else
# run RandomForest with the min OOB feature order
set.seed(seed)
RF_VIP <- randomForest(Class ~ . ,
data = df_treetop_buffer_features_VIP,
mtry=3, ntrees=500,
importance=TRUE,
proximity = TRUE,
oob.prox = TRUE
)
eval_RF_VIP <- RFevaluate(model=RF_VIP,
ref=df_treetop_buffer_features$Class)
eval_RF_VIP
## $ConfusionMatrix
## AA DB LD PA PM PS WD
## AA 238 0 0 3 12 1 0
## DB 0 155 3 0 0 0 1
## LD 0 5 123 0 0 0 1
## PA 4 1 0 256 10 28 3
## PM 7 0 0 1 11 2 0
## PS 0 0 0 12 8 101 0
## WD 0 2 2 1 0 0 141
##
## $Accuracies
## Predicted Producer's Accuracy User's Accuracy
## 1 AA 0.9558233 0.9370079
## 2 DB 0.9509202 0.9748428
## 3 LD 0.9609375 0.9534884
## 4 PA 0.9377289 0.8476821
## 5 PM 0.2682927 0.5238095
## 6 PS 0.7651515 0.8347107
## 7 WD 0.9657534 0.9657534
##
## $Metrics
## metric value
## 1 Overall Accuracy 0.905477
## 2 Balanced Accuracy 0.906328
## 3 Cohen's Kappa 0.885207
## 4 var Kappa 0.000110333
## 5 Z-Value 84.273752
Based on the quality metrics (Balanced Accuracy, Kappa, Z-value) the 11 features selected by the Recursive Feature Elimination (RF_RFE) perform clearly better than 11 features combined of top ranking features for MeanDecreaseAccuracy and MeanDecreaseGini (RF_VIP). In comparison to the reference model RF_RFE shows a much lower Kappa difference (Kappa1 - Kappa2) / sqrt(Var.Kappa1 + Var.Kappa2) RFE is therefore applied as the model to the entire National Park (Script 2).
# kappa difference reference model to RFE model
diff_kappa_ref_model_RFE <- (as.numeric(eval_RF_ref_model$Metrics[3,2]) -
as.numeric(eval_RF_RFE$Metrics[3,2])) /
sqrt(as.numeric(eval_RF_ref_model$Metrics[4,2]) +
as.numeric(eval_RF_RFE$Metrics[4,2]))
print(paste0("The kappa difference between reference model (129 features)",
" and RF_RFE (", length(predictors(this_RFE)),
" features) is: ", round(diff_kappa_ref_model_RFE,3)))
## [1] "The kappa difference between reference model (129 features) and RF_RFE (11 features) is: 1.659"
# kappa difference reference model to VIP model
diff_kappa_ref_model_VIP <- (as.numeric(eval_RF_ref_model$Metrics[3,2]) -
as.numeric(eval_RF_VIP$Metrics[3,2])) /
sqrt(as.numeric(eval_RF_ref_model$Metrics[4,2]) +
as.numeric(eval_RF_VIP$Metrics[4,2]))
print(paste0("The kappa difference between reference model (129 features)",
" and RF_VIP (",ncol(df_treetop_buffer_features_VIP)-1,
" features) is: ", round(diff_kappa_ref_model_VIP,3)))
## [1] "The kappa difference between reference model (129 features) and RF_VIP (11 features) is: 3.645"