% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Felix Kleinsteuber
% Title: Anomaly Detection in Camera Trap Images
% File: chap04/chap04.tex
% Part: experimental results, evaluation
% Description:
% summary of the content in this chapter
% Version: 16.05.2022
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Experimental results and evaluation}
\label{chap:evaluation}
% chapter intro, which applications do you analyze, perhaps summarize your results beforehand
\begin{table}
\centering
\begin{tabular}[c]{|l|l|l|l|}
\hline
\textbf{Session} & Beaver\_01 & Marten\_01 & GFox\_03 \aster \\
\hline
\textbf{Images Lapse / Motion} & 1734 / 695 & 2462 / 3105 & 1210 / 2738 \\
\hline
\textbf{Identical Lapse Duplicates} & 0 & 621 & N/A \\
\hline
\textbf{Inconsistent Lapse Duplicates} & 0 & 108 & N/A \\
\hline
\textbf{Motion \% Anomalous} & 89 \% & 24 \% & 6 \% \\
\hline
\end{tabular}
\caption[Sessions used for experiments]{Sessions used for experiments. The sessions differ in the number of images and the frequency of anomalous images. \aster Note that GFox\_03 is a generated set with no real lapse data.}
\label{tab:sessions}
\end{table}
All four proposed approaches were implemented in Python using NumPy \cite{Harris20:NumPy}, OpenCV \cite{Bradski00:OpenCV} and scikit-learn \cite{Pedregosa11:scikit-learn}. The code is organized in multiple Jupyter notebooks \cite{Kluyver16:Jupyter}. Graphs were generated using Matplotlib \cite{Hunter07:Matplotlib}. The autoencoder approach was implemented using PyTorch \cite{Paszke19:PyTorch}. To generate meaningful scores, three sessions with different characteristics (see \autoref{tab:sessions}) were used for evaluation. Beaver\_01 and Marten\_01 were fully annotated using the tool described in \autoref{sec:labeling}.
\paragraph{Beaver\_01} can be considered the easiest of the three sessions since there is only a single camera position and 89 \% of the images contain interesting content. With 695 motion images, it is relatively small and serves as a baseline set.
\paragraph{Marten\_01} only contains 24 \% anomalous images, and there are three different camera positions. Moreover, there are 108 inconsistent duplicates in the lapse set, meaning that for 108 timestamps, there exist two lapse images showing different camera positions. For 621 additional timestamps, there exist two identical lapse images. It is not apparent how this circumstance came about, and there is no obvious way to bring the lapse set back to a consistent state. The inconsistency only affects approach 1, as approach 2 does not use the lapse set and approaches 3 and 4 ignore the timestamps of lapse images. Approach 1 is implemented such that it selects the first lapse image it finds; for inconsistent duplicates, this image is not guaranteed to show the right background.
\paragraph{Fox\_03} represents a special case: The provided lapse images were not taken every hour but only every day, making the lapse set unusable because it contains too few images and lacks temporal proximity to the motion set. However, labels were provided for this session. Therefore, Fox\_03's lapse set was discarded and the new set GFox\_03 was generated from Fox\_03's motion set using the following procedure (a code sketch follows the list):
\begin{enumerate}
\item Find a new set $S$ of consecutively taken motion images in Fox\_03.
\item If all images in $S$ are labeled as normal, move them to the lapse set of GFox\_03.
\item Repeat steps 1 and 2 until all consecutive image sets have been processed.
\item Move the remaining images to the motion set of GFox\_03.
\end{enumerate}
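The following minimal Python sketch illustrates this procedure. It is illustrative only: the \texttt{(timestamp, filename, is\_normal)} tuple layout and the grouping threshold \texttt{gap} are hypothetical stand-ins, since the actual criterion for ``consecutively taken'' depends on the session metadata.
\begin{verbatim}
# Illustrative sketch of the GFox_03 generation procedure.
# `motion` is a non-empty list of (timestamp, filename, is_normal)
# tuples sorted by timestamp; `gap` (hypothetical) is the maximum
# time difference in seconds for images to count as consecutive.
def consecutive_sets(motion, gap=10):
    group = [motion[0]]
    for prev, cur in zip(motion, motion[1:]):
        if cur[0] - prev[0] <= gap:
            group.append(cur)         # still the same consecutive set
        else:
            yield group
            group = [cur]             # start a new consecutive set
    yield group

def generate_session(motion, gap=10):
    lapse, new_motion = [], []
    for group in consecutive_sets(motion, gap):
        if all(is_normal for _, _, is_normal in group):
            lapse.extend(group)       # steps 1-2: all-normal sets -> lapse
        else:
            new_motion.extend(group)  # step 4: the rest stays motion
    return lapse, new_motion
\end{verbatim}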
A generated session is not equivalent to a real session since the data shift between the generated lapse and motion sets is expected to be much smaller. Moreover, the diversity of generated lapse images is usually lower than that of real lapse images. For these reasons, the results of approaches 3 and 4 on this session cannot be fairly compared to the results on other sessions. Approach 1 still cannot be evaluated on GFox\_03 since, in most cases, there are no lapse images in close temporal proximity. Approach 2 is not affected by the lapse set and can be fairly compared across all sessions. As intended, the images in Fox\_03 show only a single camera position, with slight camera movements and occasional blurred leaves at the edge of the image.
% -------------------------------------------------------------------
\section{Hyperparameter search}
\label{sec:experiments}
For each approach, several hyperparameters need to be selected. In the following section, different hyperparameter configurations are evaluated on the Beaver\_01 session, which was chosen because it could be annotated quickly. The insights from these first experiments are then used in the next section to evaluate and compare the approaches on multiple sessions.
\subsection{Approach 1 - Lapse Frame Differencing}
There are three options to consider in this approach:
\paragraph{Difference function} Experiments show a slight increase in performance when using the squared pixel-wise difference instead of the absolute difference. This increase is only significant when combined with Gaussian filtering. A possible explanation is that the square function gives higher weight to large pixel differences (as often caused by visual anomalies) while almost ignoring small differences (as often caused by lighting changes).
\paragraph{Difference image metric} As a metric for the similarity of lapse and motion image, the mean and the variance of the difference image are considered. Here, experiments show that the variance is the better choice. This can be explained by the spatial limitation of true visual anomalies: An animal introduces high pixel differences in specific image parts, therefore increasing the variance more than the mean value. A simple lighting change, on the other hand, alters the mean value but not the variance.
\paragraph{Gaussian filtering} Experiments confirm the hypothesis that filtering the images with a Gaussian filter beforehand greatly improves the classifier's performance. Introducing such a filter with $\sigma=2$ already increases the AUC score by $0.1362$ on average, by $0.1485$ for $\sigma=4$, and by $0.1531$ for $\sigma=6$. A $\sigma$-value close to $6$ therefore seems to be a good choice. However, choosing too large a value for $\sigma$ can completely blur smaller anomalies such as birds, making accurate classifications in such cases virtually impossible.
\begin{table}[h]
\centering
\begin{tabular}[c]{|r|l|l|r|r|r|r|}
\hline
$\sigma$ & Diff & Metric & AUC & $\TNR_{\TPR \geq 0.9}$ & $\TNR_{\TPR \geq 0.95}$ & $\TNR_{\TPR \geq 0.99}$ \\
\hline
$0$ & abs & mean & $0.7308$ & $0.4189$ & $0.3514$ & $0.1622$ \\
\hline
$0$ & abs & var & $0.7414$ & $0.4865$ & $0.4189$ & $0.2432$ \\
\hline
$0$ & sq & mean & $0.7336$ & $0.4189$ & $0.4054$ & $0.2162$ \\
\hline
$0$ & sq & var & $0.7296$ & $0.4189$ & $0.4189$ & $0.2568$ \\
\hline
\hline
$2$ & abs & mean & $0.8230$ & $0.6486$ & $0.4459$ & $0.2973$ \\
\hline
$2$ & abs & var & $0.8941$ & $0.6486$ & $0.5946$ & $0.5000$ \\
\hline
$2$ & sq & mean & $0.8645$ & $0.6486$ & $0.5811$ & $0.4324$ \\
\hline
$2$ & sq & var & $0.8986$ & $0.7162$ & $0.6081$ & $0.5270$ \\
\hline
\hline
$4$ & abs & mean & $0.8294$ & $0.6351$ & $0.4459$ & $0.2973$ \\
\hline
$4$ & abs & var & $0.9068$ & $0.6622$ & $0.5811$ & $0.5270$ \\
\hline
$4$ & sq & mean & $0.8777$ & $0.6486$ & $0.6081$ & $0.4324$ \\
\hline
$4$ & sq & var & $0.9156$ & $0.7973$ & $0.6486$ & $0.5676$ \\
\hline
\hline
$6$ & abs & mean & $0.8337$ & $0.5946$ & $0.5270$ & $0.2838$ \\
\hline
$6$ & abs & var & $0.9109$ & $0.6622$ & $0.6351$ & $0.5270$ \\
\hline
$6$ & sq & mean & $0.8816$ & $0.6622$ & $0.5811$ & $0.4324$ \\
\hline
$\mathbf{6}$ & \textbf{sq} & \textbf{var} & $\mathbf{0.9214}$ & $\mathbf{0.7973}$ & $\mathbf{0.6351}$ & $\mathbf{0.5811}$ \\
\hline
\end{tabular}
\caption[Evaluation of approach 1 with different hyperparameters on Beaver\_01]{Evaluation of approach 1 with different hyperparameters on Beaver\_01.}
\label{tab:eval_approach1_beaver01}
\end{table}
The optimal configuration of approach 1 is highlighted in \autoref{tab:eval_approach1_beaver01}. It applies Gaussian filtering with $\sigma = 6$ to both images and then thresholds the variance of the squared pixel difference image to find anomalies, reaching a remarkable AUC score of $0.9214$ on the Beaver\_01 set. It also beats most of the other configurations regarding elimination rates: Even for a TPR of 99 \%, 58 \% of all empty images can be eliminated.
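A minimal sketch of this winning configuration, assuming grayscale NumPy arrays as input and using OpenCV's Gaussian blur; the snippet is an illustration under these assumptions rather than the exact thesis implementation:
\begin{verbatim}
import cv2
import numpy as np

def anomaly_score(lapse_img, motion_img, sigma=6.0):
    """Approach 1: variance of the squared pixel-wise difference
    of Gaussian-filtered images (higher score = more anomalous)."""
    lapse = cv2.GaussianBlur(lapse_img.astype(np.float64), (0, 0), sigma)
    motion = cv2.GaussianBlur(motion_img.astype(np.float64), (0, 0), sigma)
    diff = (motion - lapse) ** 2   # squared difference image
    return diff.var()              # variance as the similarity metric

# Thresholding this score yields the classifier; the threshold is
# the free parameter swept over in the ROC analysis.
\end{verbatim}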
\subsection{Approach 2 - Median Frame Differencing}
In contrast to approach 1, the temporal median image is used as the background representation; all subsequent operations are identical. Indeed, experiments confirm that the same choices regarding difference function, metric, and Gaussian filtering should be made. In this case, $\sigma = 4$ and $\sigma = 6$ perform almost equivalently.
\begin{table}[h]
\centering
\begin{tabular}[c]{|r|l|l|r|r|r|r|}
\hline
$\sigma$ & Diff & Metric & AUC & $\TNR_{\TPR \geq 0.9}$ & $\TNR_{\TPR \geq 0.95}$ & $\TNR_{\TPR \geq 0.99}$ \\
\hline
$0$ & sq & mean & $0.7794$ & $0.6351$ & $0.4730$ & $0.1757$ \\
\hline
$0$ & sq & var & $0.7897$ & $0.6622$ & $0.5946$ & $0.2703$ \\
\hline
\hline
$2$ & sq & mean & $0.8475$ & $0.6622$ & $0.5811$ & $0.3378$ \\
\hline
$2$ & sq & var & $0.8735$ & $0.7973$ & $0.7162$ & $0.4865$ \\
\hline
\hline
$4$ & sq & mean & $0.8509$ & $0.6081$ & $0.5270$ & $0.2568$ \\
\hline
$\mathbf{4}$ & \textbf{sq} & \textbf{var} & $\mathbf{0.8776}$ & $\mathbf{0.7838}$ & $\mathbf{0.7027}$ & $\mathbf{0.4459}$ \\
\hline
\hline
$6$ & sq & mean & $0.8509$ & $0.6081$ & $0.5270$ & $0.2297$ \\
\hline
$6$ & sq & var & $0.8766$ & $0.7973$ & $0.6757$ & $0.3919$ \\
\hline
\end{tabular}
\caption[Evaluation of approach 2 with different hyperparameters on Beaver\_01]{Evaluation of approach 2 with different hyperparameters on Beaver\_01.}
\label{tab:eval_approach2_beaver01}
\end{table}
The optimal configuration of approach 2 is highlighted in \autoref{tab:eval_approach2_beaver01}. It applies Gaussian filtering with $\sigma = 4$ to the motion and median images and then thresholds the variance of the squared pixel difference image to find anomalies, reaching an AUC score of $0.8776$ on the Beaver\_01 set. As expected, approach 2 is less accurate than approach 1 since it cannot benefit from the added information value of the lapse images. However, this might not hold in general since the approaches have very different preconditions: Approach 1 requires lapse images of high temporal proximity and similar lighting, whereas approach 2 requires that, for every pixel, the majority of values across a consecutive set of motion images show normal background.
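The only difference to approach 1 is the background image. A short sketch of the median background computation, assuming a consecutive set of grayscale motion images stacked as a NumPy array:
\begin{verbatim}
import numpy as np

def median_background(motion_stack):
    """Temporal median over a consecutive motion image set of shape
    (n_images, height, width); meaningful if, for every pixel, the
    majority of values show normal background."""
    return np.median(np.asarray(motion_stack, dtype=np.float64), axis=0)

# The result replaces the lapse image in anomaly_score() above;
# all subsequent steps are identical to approach 1.
\end{verbatim}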
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{.48\textwidth}
\includegraphics[width=\textwidth]{images/results/approach1_Beaver_01_sqvar_sigma4.pdf}
\caption{Approach 1.}
\label{fig:approach1_roc}
\end{subfigure}
\begin{subfigure}[b]{.48\textwidth}
\includegraphics[width=\textwidth]{images/results/approach2_Beaver_01_sqvar_sigma4.pdf}
\caption{Approach 2.}
\label{fig:approach2_roc}
\end{subfigure}
\caption[ROC curve of approaches 1 and 2 with identical configuration on Beaver\_01]{ROC curve of approaches 1 and 2 with identical configuration on Beaver\_01.}
\label{fig:eval_approach12_beaver01_roc}
\end{figure}
\autoref{fig:eval_approach12_beaver01_roc} provides insight into the quality of the approach 1 and 2 classifiers by comparing their ROC curves. In the context of empty image elimination, we require a TPR of $0.9$ or higher to keep most of the interesting images. When looking only at the top part of the ROC graph where this holds, the curves look roughly similar.
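Both the AUC and the elimination rates reported throughout this chapter can be read directly off the ROC curve. A small sketch using scikit-learn, where the helper \texttt{tnr\_at\_tpr} and the toy label/score arrays are illustrative (labels: 1 for anomalous, 0 for normal):
\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

def tnr_at_tpr(labels, scores, min_tpr=0.95):
    """Highest TNR (= 1 - FPR) achievable while keeping TPR >= min_tpr."""
    fpr, tpr, _ = roc_curve(labels, scores)
    return 1.0 - fpr[tpr >= min_tpr].min()

# Toy example with made-up scores:
labels = np.array([0, 0, 0, 1, 1, 1])
scores = np.array([0.1, 0.4, 0.2, 0.35, 0.8, 0.9])
print("AUC:", roc_auc_score(labels, scores))
print("TNR@TPR>=0.95:", tnr_at_tpr(labels, scores))
\end{verbatim}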
\subsection{Approach 3 - Bag of Visual Words}
The local feature approach allows for various options regarding feature generation and clustering. The following options were considered:
\paragraph{Clustering algorithm} The most common clustering method is $k$-means, which is very time-consuming for large datasets. Experiments show that choosing the prototype vectors randomly achieves accuracies similar to $k$-means while being much faster; the AUC score for random prototypes is on average $0.0131$ higher than for $k$-means clustering. Results acquired using the $k$-means algorithm are listed in \autoref{tab:eval_approach3_beaver01}; results of random clustering are illustrated as boxplots in \autoref{fig:eval_approach3_random_boxplot}. The colored lines mark the mean values over 10 tests, and the box extends from the first to the third quartile (interquartile range). The whiskers extend to the last datum within $1.5$ times the interquartile range from the box; data beyond the whiskers are considered outliers and marked as points.
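A sketch of the vocabulary construction under both clustering variants, using scikit-learn's \texttt{KMeans}; the descriptor matrix is assumed to hold all SIFT descriptors extracted from the lapse images:
\begin{verbatim}
import numpy as np
from sklearn.cluster import KMeans

def build_vocabulary(descriptors, k=2048, method="random", seed=0):
    """Return k prototype vectors (visual words) from an
    (n_descriptors, 128) SIFT descriptor matrix."""
    if method == "random":
        # Random prototypes: pick k descriptors without replacement.
        rng = np.random.default_rng(seed)
        idx = rng.choice(len(descriptors), size=k, replace=False)
        return descriptors[idx]
    # k-means alternative: much slower for large descriptor sets.
    km = KMeans(n_clusters=k, n_init=1, random_state=seed)
    return km.fit(descriptors).cluster_centers_
\end{verbatim}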
\begin{table}[h]
\centering
\begin{tabular}[c]{|r|r|l|l|r|r|r|r|}
\hline
$k$ & $s$ & Clustering & M & AUC & $\TNR_{\TPR \geq 0.9}$ & $\TNR_{\TPR \geq 0.95}$ & $\TNR_{\TPR \geq 0.99}$ \\
\hline
1024 & 30 & kmeans & N & 0.7698 & 0.3929 & 0.3800 & 0.0757 \\
\hline
2048 & 30 & kmeans & N & 0.7741 & 0.4976 & 0.3382 & 0.0564 \\
\hline
4096 & 30 & kmeans & N & 0.7837 & 0.5797 & 0.2866 & 0.0451 \\
\hline
2048 & 40 & kmeans & N & 0.7611 & 0.3317 & 0.1610 & 0.1320 \\
\hline
\hline
1024 & 30 & kmeans & Y & 0.7056 & 0.2432 & 0.2222 & 0.0821 \\
\hline
2048 & 30 & kmeans & Y & 0.7390 & 0.3172 & 0.3092 & 0.0612 \\
\hline
4096 & 30 & kmeans & Y & 0.7542 & 0.3768 & 0.2963 & 0.0515 \\
\hline
\end{tabular}
\caption[Evaluation of approach 3 with different hyperparameters on Beaver\_01 using $k$-means clustering]{Evaluation of approach 3 with different hyperparameters on Beaver\_01 using $k$-means clustering. $k$ is the vocabulary size, $s$ is the keypoint and step size. $M$ specifies whether the motion features were included in clustering (yes/no).}
\label{tab:eval_approach3_beaver01}
\end{table}
\paragraph{Including motion features} Descriptors of (unlabeled) motion images can be included in the unsupervised clustering process. Even though this can enhance feature richness, experiments show that both the AUC score and the elimination rates decrease, while the computation time increases significantly when the number of motion images is high. In conclusion, only lapse descriptors should be used for clustering.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{\textwidth}
\includegraphics[width=\textwidth]{images/results/approach3_boxplot_random.pdf}
\caption{AUC}
\label{fig:eval_approach3_random_boxplot_auc}
\end{subfigure}
\begin{subfigure}[b]{\textwidth}
\includegraphics[width=\textwidth]{images/results/approach3_boxplot_random_tnr95.pdf}
\caption{$\TNR_{\TPR \geq 0.95}$}
\label{fig:eval_approach3_random_boxplot_tnr95}
\end{subfigure}
\caption[Evaluation of approach 3 using random prototypes]{Evaluation of approach 3 using random prototypes. Ten different vocabularies with randomly selected prototype vectors were tested per configuration.}
\label{fig:eval_approach3_random_boxplot}
\end{figure}
\paragraph{Vocabulary size} The optimal vocabulary size can vary greatly depending on session complexity, the number of images, and the keypoint step size. Larger sessions often require more prototype vectors to accurately distinguish between images, and a small step size produces more keypoints, therefore requiring more visual words. Experiments verify these considerations and show that a vocabulary size $k$ between $1000$ and $5000$ is appropriate.
For the sake of clarity, only $\TNR_{\TPR \geq 0.95}$ is reported as the ``elimination rate'' in boxplots. A higher required $\TPR$ leaves no margin for dataset errors and annotation mistakes, while a lower $\TPR$ is of no practical relevance.
\paragraph{Keypoint size} Keypoint size and step size can be chosen independently; however, rudimentary first experiments show that choosing them identically provides a good compromise between image coverage and the number of visual words. The keypoint size controls the resolution at which the model ``scans'' the image: very small keypoints can lead to high noise sensitivity, while too large keypoints wash out potentially important details. The step size $s$ influences the number of visual words and therefore the complexity of obtaining visual words and clustering: for $s=40$, there are 364 keypoints per image, 629 keypoints for $s=30$, and 1456 keypoints for $s=20$. As these values show, $s$ should not be smaller than $20$ because collecting visual words becomes expensive, but also not larger than $40$ in order to obtain enough keypoints per image. For the Beaver\_01 dataset, $s=20$ proves to be the most accurate configuration, particularly when looking at the elimination rates (see \autoref{fig:eval_approach3_random_boxplot_tnr95}). Note that when the step size is cut in half, the clustering complexity remains constant (when using random prototypes) but the number of local feature descriptors to compute quadruples. Therefore, the training process for $s=20$ takes approximately twice as long as for $s=30$ and four times as long as for $s=40$. Moreover, more local features require more memory for clustering.
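A sketch of the dense SIFT sampling with keypoint size equal to step size, using OpenCV; placing the grid origin at $s/2$ is an illustrative assumption:
\begin{verbatim}
import cv2

def dense_sift(gray_img, s=20):
    """Densely sampled SIFT on an 8-bit grayscale image: keypoints
    on a regular grid with keypoint size = step size = s."""
    h, w = gray_img.shape
    keypoints = [cv2.KeyPoint(float(x), float(y), float(s))
                 for y in range(s // 2, h, s)
                 for x in range(s // 2, w, s)]
    sift = cv2.SIFT_create()
    _, descriptors = sift.compute(gray_img, keypoints)
    return descriptors  # one 128-dimensional descriptor per keypoint
\end{verbatim}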
\subsection{Approach 4 - Autoencoder}
For neural networks such as autoencoders, there are several hyperparameters that affect the quality of the trained model. The learning rate $\alpha$ was found to be optimal around $10^{-4}$, where the model converges after around 200 epochs of training. With the larger learning rate of $10^{-3}$, the model converges more quickly after 100 epochs but performs slightly worse regarding both AUC ($-0.021$) and elimination rates ($-0.169$) when tested on the default configuration with 512 latent features. The optimizer used was Adam \cite{Kingma14:Adam} with a weight decay of $10^{-5}$. Each configuration was trained 10 times with different random initializations.
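A condensed PyTorch sketch of this training setup; \texttt{Autoencoder} and \texttt{train\_loader} are hypothetical placeholders for the architecture and lapse-image data pipeline described earlier:
\begin{verbatim}
import torch

# `Autoencoder` and `train_loader` are placeholders; the model is
# assumed to return (reconstruction, latent_features).
model = Autoencoder(latent_features=512)
optimizer = torch.optim.Adam(model.parameters(),
                             lr=1e-4, weight_decay=1e-5)

for epoch in range(200):            # convergence after ~200 epochs
    for batch in train_loader:      # lapse images only
        optimizer.zero_grad()
        reconstruction, latent = model(batch)
        loss = torch.nn.functional.mse_loss(reconstruction, batch)
        loss.backward()
        optimizer.step()
\end{verbatim}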
\paragraph{Early stopping} Especially in anomaly detection scenarios and when the risk of overfitting is high, it can be beneficial to stop training before the model converges. Experiments show that early stopping is not beneficial here as the model does not tend to overfit easily.
\paragraph{Anomaly metric} Anomalies can either be detected by measuring the reconstruction loss as the mean squared error between input and output image or by estimating the distribution of the latent features via Kernel Density Estimation (KDE). Experiments show that low log-likelihood of the latent features with respect to the KDE-estimated distribution is a much more reliable indicator than a high reconstruction loss, achieving $0.338$ higher AUC and $0.304$ higher elimination rates on average. All following graphs were thus created using KDE.
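A sketch of the KDE-based anomaly metric using scikit-learn; the Gaussian kernel and the bandwidth value are illustrative assumptions:
\begin{verbatim}
import numpy as np
from sklearn.neighbors import KernelDensity

def kde_anomaly_scores(lapse_latent, motion_latent, bandwidth=1.0):
    """Fit a KDE to the latent features of the (normal) lapse images;
    a low log-likelihood of a motion image's latent features marks
    it as anomalous."""
    kde = KernelDensity(kernel="gaussian", bandwidth=bandwidth)
    kde.fit(lapse_latent)
    # Negate the log-likelihood so that, as in the other approaches,
    # higher scores mean "more anomalous".
    return -kde.score_samples(motion_latent)
\end{verbatim}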
\paragraph{Bottleneck size} The size of the bottleneck layer controls how much the autoencoder needs to compress the input information. Bottlenecks that are too small cause too great a loss of information, while autoencoders with large bottlenecks usually do not eliminate enough redundancies. Experiments show that a bottleneck size of $16$ can already represent enough information to reach a mean AUC of $0.859$ (see \autoref{fig:eval_approach4_latentfeatures}). Optimal scores are achieved with around $512$ latent features (AUC $0.893$). As the diversity of scenes in Beaver\_01 is relatively low compared to other sessions, it is advisable to choose the larger value of $512$ latent features as a general rule of thumb. Noticeably, configurations with fewer latent features mostly achieve slightly higher elimination rates, which indicates that it might be useful to impose a sparsity constraint.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{.49\textwidth}
\includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_latentfeatures_auc.pdf}
\caption{AUC}
\label{fig:eval_approach4_latentfeatures_auc}
\end{subfigure}
\begin{subfigure}[b]{.49\textwidth}
\includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_latentfeatures_tnr95.pdf}
\caption{$\TNR_{\TPR \geq 0.95}$}
\label{fig:eval_approach4_latentfeatures_tnr95}
\end{subfigure}
\caption[Evaluation of approach 4 for different bottleneck sizes]{Evaluation of approach 4 for different bottleneck sizes.}
\label{fig:eval_approach4_latentfeatures}
\end{figure}
\paragraph{Denoising autoencoder} Denoising autoencoders aim to make the latent representation more robust towards small input changes and noise, preventing overfitting by adding noise as a dataset augmentation method. Experiments show that AUC and elimination rates are only weakly affected by the noise; only for strong noise with $\sigma > 0.3$ do the scores decrease. As shown in \autoref{fig:eval_approach4_extensions_tnr95}, the mean elimination rate increases by $7.3$ percentage points when adding Gaussian noise with a low $\sigma = 0.1$.
\paragraph{Sparse autoencoder} Sparse autoencoders force the latent feature vector to be sparse; in this implementation, this is achieved by imposing an L1 penalty on the bottleneck activations. Sparsity is a form of regularization that further decreases the dimensionality of the latent space by favoring the elimination of more redundancies, thereby preventing overfitting. To a certain degree, it can have effects similar to decreasing the bottleneck size, but it adapts better to the dataset by using a continuous penalty rather than discrete bottleneck sizes. In experiments, the sparsity constraint did not achieve any significant improvement over the base model (see \autoref{fig:eval_approach4_extensions}). As expected, choosing $\lambda$ too high leads to dropping scores.
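Both extensions amount to small changes in the training loss. A combined sketch, again assuming a hypothetical model that returns \texttt{(reconstruction, latent)}:
\begin{verbatim}
import torch
import torch.nn.functional as F

def training_loss(model, batch, noise_sigma=0.1, sparsity_lambda=0.0):
    """Denoising: reconstruct the clean input from a noisy version.
    Sparse: add an L1 penalty on the bottleneck activations."""
    noisy = batch + noise_sigma * torch.randn_like(batch)
    reconstruction, latent = model(noisy)
    loss = F.mse_loss(reconstruction, batch)  # reconstruct clean input
    if sparsity_lambda > 0:
        loss = loss + sparsity_lambda * latent.abs().mean()
    return loss
\end{verbatim}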
\begin{figure}[tb]
\centering
\begin{subfigure}[b]{.9\textwidth}
\includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_denoising_and_sparse_auc.pdf}
\caption{AUC}
\label{fig:eval_approach4_extensions_auc}
\end{subfigure}
\begin{subfigure}[b]{.9\textwidth}
\includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_denoising_and_sparse_tnr95.pdf}
\caption{$\TNR_{\TPR \geq 0.95}$}
\label{fig:eval_approach4_extensions_tnr95}
\end{subfigure}
\caption[Evaluation of approach 4 for denoising and sparse extensions]{Evaluation of approach 4 for denoising and sparse extensions, respectively. $\sigma$ is the standard deviation of the Gaussian noise for the denoising autoencoder; $\lambda$ is the multiplier for the sparsity constraint of the sparse autoencoder.}
\label{fig:eval_approach4_extensions}
\end{figure}
\subsection{Summary}
Although the choice is not always clear-cut, a configuration can be selected for every approach that is optimal on the Beaver\_01 session. These configurations are listed in \autoref{tab:eval_summary1} and are used in the second series of experiments to compare the performance of the approaches on other sessions.
\begin{table}
\centering
\begin{tabular}[c]{|l|l|}
\hline
\textbf{Approach} & \textbf{Optimal configuration} \\
\hline
1 - Lapse Frame Differencing & Variance of squared difference image, \\
& Gaussian filtering $\sigma=6$ \\
\hline
2 - Median Frame Differencing & Variance of squared difference image, \\
& Gaussian filtering $\sigma=4$ \\
\hline
3 - Bag of Visual Words & 4096 clusters, densely sampled SIFT features \\
& with keypoint size = step size = 20, random \\
& prototypes, motion features not included in \\
& training \\
\hline
4 - Autoencoder & Bottleneck size 512, Gaussian noise on \\
& input $\sigma = 0.1$, Kernel Density Estimation \\
\hline
\end{tabular}
\caption[Best-performing configurations for the proposed approaches]{Best-performing configurations for the proposed approaches.}
\label{tab:eval_summary1}
\end{table}
% -------------------------------------------------------------------
\section{Benchmarking the approaches}
After picking a good configuration for every approach, the methods were additionally tested on the two sessions Marten\_01 and GFox\_03. Both sessions have particular characteristics and irregularities that were described at the beginning of \autoref{sec:experiments}. The results are listed in \autoref{tab:eval_summary2}.
\begin{table}
\centering
\begin{tabular}[c]{|ll|r|r|r|r|}
\hline
\textbf{Dataset} & \textbf{Metric} & 1 LapseFD & 2 MedianFD & 3 BoVW & 4 AE \\
\hline \hline
Beaver\_01 & AUC & 0.9214 & 0.8776 & 0.8127 & 0.8928 \\
& $\TNR_{\TPR \geq 0.95}$ & 0.6351 & 0.7027 & 0.5322 & 0.5743 \\
\hline
Marten\_01 & AUC & \aster 0.8012 & 0.8740 & 0.5913 & 0.7189 \\
& $\TNR_{\TPR \geq 0.95}$ & \aster 0.2474 & 0.4024 & 0.1906 & 0.0556 \\
\hline
GFox\_03 & AUC & \aster N/A & 0.9812 & \aster 0.9730 & \aster 0.9739 \\
& $\TNR_{\TPR \geq 0.95}$ & \aster N/A & 0.9510 & \aster 0.8150 & \aster 0.8782 \\
\hline
\end{tabular}
\caption[Scores of all four approaches on all annotated datasets]{Scores of all four approaches on all annotated datasets. Results marked with \aster cannot be fairly compared due to irregularities in the dataset; approach 1 is not applicable for GFox\_03 since there is no real lapse set (see beginning of \autoref{sec:experiments}).}
\label{tab:eval_summary2}
\end{table}
Overall, the classic approaches 1 and 2 achieve the best elimination rates. Contrary to expectations, approach 2 consistently beats approach 1 in terms of elimination rates. For the generated GFox\_03 set, approach 2 achieves higher AUC and elimination rates than approaches 3 and 4, even though approach 2 is completely unsupervised, whereas approaches 3 and 4 were provided with some annotated normal motion images in the form of the generated lapse set.
Noteworthy is the comparatively low elimination rate of the autoencoder on Marten\_01. A possible explanation is that the general reconstruction accuracy drops due to the high diversity of backgrounds, some of which are strongly underrepresented in the lapse set. Even though Marten\_01 represents an extreme case of an irregular dataset, this shows that the autoencoder becomes unreliable more quickly when the background changes frequently.
Of course, it should be noted that the selected configurations might not be ideal for every camera session. However, as the practical aim of this work is to find a generally applicable method, it makes sense to choose the same configuration for all sessions. GFox\_03 is the most realistic available session, with a low frequency of animals and only a single camera position. The high elimination rates on this session, even though they are biased because the lapse images were generated, therefore demonstrate the capacity and usefulness of the approaches in a practical setting.