
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Felix Kleinsteuber
% Title: Anomaly Detection in Camera Trap Images
% File: chap02/chap02.tex
% Part: theoretical background
% Description:
% summary of the content in this chapter
% Version: 16.05.2022
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Theoretical background}
\label{chap:background}
This chapter presents the most important theoretical concepts employed in this work: machine learning, classification, anomaly detection, image comparison methods, local features, and deep learning methods. In addition, the statistical evaluation of binary threshold classifiers is detailed.
% -------------------------------------------------------------------
\section{Basics of machine learning}
\label{sec:theory_ml}
First, some basic terms and notation are defined. This section is based on \cite{Goodfellow16:DeepLearning}.
\textbf{Machine learning (ML)} algorithms learn from data without being explicitly instructed, by finding patterns and drawing inferences from data. They make it possible to solve tasks that are too difficult for fixed, hand-crafted algorithms by analyzing \textbf{examples}. An example $\bm{x} \in \R^n$ is a collection of $n$ \textbf{features} encoded as real numbers. To be able to find patterns in the data, the algorithm requires a large set of training examples (the \textbf{training set}). We denote the number of examples as $m$ and the $i$-th training example as $\bm{x}^{(i)} \in \R^n$, $1 \leq i \leq m$.
Most machine learning algorithms are optimization problems minimizing some kind of \textbf{loss} or \textbf{error function}. A loss function maps the desired and actual outputs of an ML model to a real value, the loss. Intuitively, when the actual output matches the desired one, the loss value should be minimal.
\subsection{Classification}
In classification tasks, the model is asked to predict which of $k$ categories some input belongs to. In other words, the algorithm learns a function $f : \R^n \to \{ 1, \dots, k \}$. When the output is $y = f(\bm{x})$, the model assigns the input to the $y$-th category.
To be able to evaluate the abilities of a machine learning algorithm, a performance measure is required. For classification tasks, we often use the \textbf{accuracy}, which is defined as the proportion of examples for which the model produces the correct output.
When fitting an ML algorithm to a problem, we want it to generalize well, i.e., to perform well on data it has not seen before. Therefore, to assess its performance in the real world, we evaluate performance measures such as the accuracy on a \textbf{test set} that is distinct from the training set.
\subsection{Supervised and unsupervised tasks}
ML algorithms can be roughly categorized as \emph{supervised} or \emph{unsupervised}:
\begin{itemize}
\item In \textbf{supervised learning} problems, the training data is labeled, i.e., the task is to learn the function $f$, given training examples $\bm{x}^{(i)}$ and their respective outputs $y^{(i)}$.
\item In \textbf{unsupervised learning} problems, the training examples are unlabeled. The goal is to infer labels from the natural structure of the training set. An example is \emph{clustering}, which groups similar examples with respect to some distance measure.
\end{itemize}
\subsection{Support vector machines}
In their simplest form, \textbf{support vector machines (SVMs)} tackle binary classification tasks by finding a linear hyperplane that separates both classes with maximal \emph{margin}. The margin is defined as the distance between the separating hyperplane and the closest data points, which are called \emph{support vectors}. To find this hyperplane, a constrained nonlinear optimization problem is solved \cite{Kecman05:SVMs}.
\paragraph{Soft-margin SVMs} When the given classes cannot be linearly separated, such a hyperplane does not exist. However, a linear separating hyperplane might still be a good, well-generalizing solution. Soft-margin SVMs allow data points within a \emph{soft margin} around the separating hyperplane to be misclassified.
\paragraph{Nonlinear SVMs} When the given classes cannot be separated well by a linear hyperplane, the data points are first transformed into a higher-dimensional feature space using a nonlinear mapping. By choosing a good mapping, the data points can become linearly separable in the feature space. This transformation can be carried out very efficiently by replacing the inner products of mapped points with a kernel function $K(\bm{x}, \bm{x}')$ evaluated directly in the input space (\textbf{kernel trick}). A popular kernel function, which is used in all SVMs in this work, is the \textbf{radial basis function (RBF) kernel} defined as
\begin{equation}
K(\bm{x}, \bm{x}') = \exp(-\gamma \| \bm{x} - \bm{x}' \|^2 )
\end{equation}
with a free parameter $\gamma$ \cite{Chang10:RBF}. The feature space of this kernel has an infinite number of dimensions. Thus, using the kernel trick, we can construct an SVM that operates in an infinite-dimensional space (which would not be possible if we had to compute the mappings explicitly) \cite{Kecman05:SVMs}.
By training multiple SVMs, it is possible to perform multi-class classification. Let $k$ be the number of classes. In a \emph{one-vs-all} scenario, $k$ SVMs are trained to distinguish each class from the rest. In a \emph{one-vs-one} scenario, $\frac{k (k-1)}{2}$ SVMs are trained to separate all possible pairs of classes. Although \emph{one-vs-one} requires significantly more SVMs to be trained, both options are equally suitable. Thus, the choice of method depends on personal preference and the properties of the dataset \cite{Gidudu:SVMsMultiClass}.
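The following sketch illustrates an RBF-kernel SVM and the two multi-class strategies using scikit-learn. The toy data and the parameter values ($\gamma$, $C$) are assumptions chosen for illustration only and are not the settings used in this work.
\begin{verbatim}
# Sketch: RBF-kernel SVM and multi-class strategies with scikit-learn.
# Data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import SVC
from sklearn.multiclass import OneVsRestClassifier

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))                      # 300 examples, 2 features
y = (np.linalg.norm(X, axis=1) > 1.0).astype(int)  # toy binary labels

# Soft-margin SVM with RBF kernel K(x, x') = exp(-gamma * ||x - x'||^2)
clf = SVC(kernel="rbf", gamma=0.5, C=1.0).fit(X, y)

# Multi-class: SVC itself uses one-vs-one internally; an explicit
# one-vs-all (one-vs-rest) classifier can be obtained by wrapping it.
ovr = OneVsRestClassifier(SVC(kernel="rbf", gamma=0.5, C=1.0))
\end{verbatim}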
\section{Anomaly detection}
\label{sec:theory_anomalydetection}
In general, an anomaly is a deviation from a rule or from what is regarded as regular or normal. According to \cite{Jiang22:VisualSensoryADSurvey}, a distinction can be made between semantic anomalies and sensory anomalies:
\paragraph{Semantic anomalies} refer to anomalies at the label level, i.e., the semantic meaning of the sample has changed. This kind of anomaly always refers to the sample as a whole rather than to specific parts. For example, for a model trained only on cat images, the image of a dog would be a semantic anomaly. Such anomalies can be detected using approaches such as \emph{one-class classification (OCC)} (see \autoref{sec:occ}) or \emph{out-of-distribution detection (OOD)} (see \autoref{sec:odd}).
\paragraph{Sensory anomalies} are anomalies in a specific part of the sample and can occur at three different levels: At \emph{object level}, a sensory anomaly is some defect in an otherwise normal object, e.g., a tumor in an image of an organ. At \emph{scene level}, novel objects occur in an otherwise normal scene, e.g., an unknown object on the road in an autonomous driving scenario. At \emph{video level}, abnormal events or actions occur in a video. This is often the case in video surveillance settings. Compared to semantic anomalies, sensory defects occur more regularly and naturally in most research fields. In this work, we consider anomalies at \emph{scene level}, where animals and humans are the novel objects.
\subsection{Outlier vs.\ novelty detection}
Anomaly detection is a term that is often used ambiguously. Depending on the available training data, it can refer to either outlier or novelty detection.
\paragraph{Outlier detection}
In this scenario, the training data contains both normal and anomalous data, often due to mechanical faults, human error, instrument error, or fraudulent behavior \cite{Hodge04:OutlierDetectionSurvey}. The model seeks to find the distribution of the normal data, isolating the anomalous data points. Training is often done in an unsupervised manner.
\paragraph{Novelty detection}
Here, the model receives only normal data as input. It must then detect outliers at prediction time. As the model is only fed annotated normal data, this concept is a form of supervised learning and equivalent to \emph{one-class classification}, where normal data makes up the positive class. Novelty detection is often employed when the amount of available anomalous data is insufficient \cite{Pimentel14:NoveltyDetection}.
\subsection{One-class classification}
\label{sec:occ}
\textbf{One-class classification (OCC)} is a special case of supervised classification where only data from a single positive class is observed during training \cite{Perera21:OCCSurvey}. The goal of the classifier is to recognize positive samples during inference while separating them from negative samples of other semantic classes.
\begin{figure}[tbhp]
\centering
\includegraphics[width=.8\textwidth]{images/occ.pdf}
\caption[Comparing multi-class to one-class methods]{Comparing multi-class to one-class methods \cite{Perera21:OCCSurvey}. In both multi-class settings, training data from different classes is available. In OCC, training data from only a single class is given.}
\label{fig:occ}
\end{figure}
\autoref{fig:occ} illustrates the differences between multi-class and one-class settings. In \emph{multi-class classification}, training data from multiple classes is required. The classes are separated by multiple decision boundaries. In a multi-class or \emph{one-vs-rest detection} setting, a single decision boundary is learned using training data from multiple classes, separating one normal class from the rest. In \emph{one-class classification}, training data from only the positive class is observed, and a single decision boundary is learned to separate it from the unseen negative class.
Consequently, OCC is a harder problem than the multi-class tasks, since only training data from a single positive class is available instead of training data from all classes.
A popular one-class classifier is the one-class support vector machine (OCSVM) \cite{Schoelkopf99:OneClassSVM}, a modification of the support vector machine. In binary SVMs, the optimization goal is to find a hyperplane separating both classes with the maximum margin. Similarly, the one-class SVM tries to separate the positive class from the origin of the feature space with the largest possible margin.
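As a sketch of how an OCSVM can be applied in practice, the following example fits scikit-learn's \texttt{OneClassSVM} on normal data only; the toy data and the parameters ($\nu$, $\gamma$) are illustrative assumptions.
\begin{verbatim}
# Sketch: one-class classification with an OCSVM (scikit-learn).
# Data and parameters are illustrative assumptions.
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
X_normal = rng.normal(size=(500, 8))                     # only normal training data
X_test = np.vstack([rng.normal(size=(50, 8)),            # unseen normal samples
                    rng.normal(loc=5.0, size=(50, 8))])  # anomalous samples

occ = OneClassSVM(kernel="rbf", gamma="scale", nu=0.05).fit(X_normal)
scores = occ.decision_function(X_test)   # larger values = more normal
pred = occ.predict(X_test)               # +1 = normal, -1 = anomaly
\end{verbatim}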
\subsection{Density estimation}
\label{sec:odd}
\textbf{Density estimation (DE)} methods attempt to model the distribution of the normal training data. Unseen test data can then be separated by thresholding the likelihood of the test data points under this distribution. Normal data is assumed to have a high likelihood, since it was generated from the same underlying distribution as the training data, whereas anomalous data is expected to have a lower likelihood \cite{Yang21:OODSurvey}. The term \textbf{out-of-distribution detection} refers to the classification of unseen test samples as part of vs.\ out of this distribution.
A distinction can be made between parametric and non-parametric DE methods. The former fit a parametric distribution such as a (multivariate) Gaussian to the training data by maximizing the overall likelihood. The latter make no assumptions about the underlying distribution and estimate the likelihood directly from the data points. The simplest example of non-parametric DE is a discrete histogram. A more sophisticated and often more accurate method is \textbf{kernel density estimation (KDE)} \cite{Rosenblatt56:KDE1,Parzen62:KDE2}, which, in contrast to the histogram, provides a smooth likelihood curve. It requires a \textbf{kernel function} such as the Gaussian kernel, defined by the density function of a standard normal distribution. Given $m$ samples $\bm{x}^{(i)}$, the estimated distribution is defined by
\begin{equation}
\hat{f}(\bm{x}) = \frac{1}{mh} \sum_{i=1}^m K\left( \frac{\bm{x} - \bm{x}^{(i)}}{h} \right)
\end{equation}
where $K(\bm{x})$ is the kernel function and $h$ is the \textbf{bandwidth} \cite{Silverman86:DensityEstimation}. The kernel estimator therefore estimates the distribution as a sum of identical kernel curves (``bumps''), each centered around one data point (see \autoref{fig:kde_example}).
\begin{figure}[tb]
\centering
\includegraphics[width=.9\textwidth]{images/kde.pdf}
\caption[One-dimensional kernel estimate example with Gaussian kernels]{One-dimensional kernel estimate example from 7 samples with Gaussian kernels, $h = 0.4$.}
\label{fig:kde_example}
\end{figure}
The bandwidth $h$, also intuitively called the smoothing parameter, controls the smoothness of the estimated distribution: larger values of $h$ widen and flatten the kernel bumps. Provided that $K$ is a non-negative probability density function, $\hat{f}$ is also a non-negative probability density function.
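A minimal sketch of this estimator in Python is shown below; the sample values are arbitrary, and the bandwidth $h = 0.4$ merely mirrors the setting of \autoref{fig:kde_example}.
\begin{verbatim}
# Sketch: one-dimensional kernel density estimation with a Gaussian kernel.
# Sample values are arbitrary assumptions.
import numpy as np
from scipy.stats import norm

samples = np.array([-2.1, -1.3, -0.4, 1.9, 5.1, 6.2, 6.4])  # m = 7 samples
h = 0.4                                                      # bandwidth

def kde(x, samples, h):
    # f_hat(x) = 1 / (m * h) * sum_i K((x - x_i) / h), K = standard normal pdf
    return norm.pdf((x - samples[:, None]) / h).sum(axis=0) / (len(samples) * h)

xs = np.linspace(-5, 10, 500)
density = kde(xs, samples, h)   # smooth estimate, one Gaussian 'bump' per sample
\end{verbatim}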
Classic density estimation methods are well suited for low-dimensional data but are often computationally expensive for high-dimensional data such as images \cite{Yang21:OODSurvey}. This problem can be addressed by performing dimensionality reduction, e.g., by extracting local features (see \autoref{sec:theory_localfeatures}) or by using deep models such as autoencoders (see \autoref{sec:theory_autoencoders}).
\section{Comparing images}
\label{sec:theory_comparingimages}
To detect sensory anomalies such as animals, many approaches rely on comparing the possibly anomalous image to a reference background image. Suppose we have a reference background image for every unclassified observed image.
\subsection{Frame differencing}
\label{sec:theory_comparingimages_fd}
The simplest approach to detect an anomalous object in an image $\tilde{I}$ is to calculate the pixel-wise absolute difference between $\tilde{I}$ and a normal reference image $I$:
\begin{equation}
\Delta = \left| \tilde{I} - I \right|
\end{equation}
If an anomalous object is present in $\tilde{I}$, the corresponding pixels in the \textbf{difference image} $\Delta$ have high values. This can be measured using the mean or variance of the pixels of $\Delta$: the higher the mean and variance, the higher the likelihood that an anomalous object is present.
This approach can only work reliably if the reference image is very close in time to $\tilde{I}$ and indeed shows only the empty scene. Furthermore, since two images are compared pixel by pixel, it is susceptible to noise, which can be partly suppressed by applying a Gaussian filter beforehand and thresholding the difference image $\Delta$.
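A possible implementation of this scoring step is sketched below; the Gaussian filter width and the noise threshold are illustrative assumptions.
\begin{verbatim}
# Sketch: frame differencing with Gaussian pre-filtering and thresholding.
# sigma and threshold are illustrative assumptions.
import numpy as np
from scipy.ndimage import gaussian_filter

def difference_score(img, reference, sigma=1.0, threshold=0.05):
    """img, reference: float arrays in [0, 1] of identical shape.
    Returns mean and variance of the thresholded difference image."""
    delta = np.abs(gaussian_filter(img, sigma) - gaussian_filter(reference, sigma))
    delta[delta < threshold] = 0.0   # suppress small, noise-induced differences
    return delta.mean(), delta.var()
\end{verbatim}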
\subsection{Background estimation using temporal median filtering}
\label{sec:theory_comparingimages_be}
When no highly similar reference image is available, a normal background image can be extracted from multiple (possibly anomalous) images. Given a series of images that are close together in time, we can assume that the background has not changed. In the context of camera trap images, we also assume that an anomalous object (i.e., an animal) would have moved over time. Therefore, every pixel has seen the background at some point and, in most cases, at the majority of times. Under these assumptions, we can calculate the median of every pixel over all images in the time series to obtain the \textbf{median image}, which contains just the background. Anomalous objects can then be detected using frame differencing against the median image.
This approach is more robust towards noise, focus, and exposure changes. However, the assumption that the anomalous object moves over time is often not fulfilled, resulting in artifacts in the background reconstruction. Furthermore, the background can change even within this short timeframe due to wind and lighting conditions, which also leads to a poor reconstruction.
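The median image computation itself reduces to a per-pixel median over the image stack, as in the following sketch (the shape convention of the stack is an assumption):
\begin{verbatim}
# Sketch: background estimation via a temporal median filter.
# `series` is assumed to be a stack of temporally close grayscale frames.
import numpy as np

def median_background(series):
    """series: array of shape (num_frames, height, width) -> median image."""
    return np.median(series, axis=0)

# Anomalies can then be scored by frame differencing against the median image:
# delta = np.abs(frame - median_background(series))
\end{verbatim}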
\section{Local features for anomaly detection}
\label{sec:theory_localfeatures}
\subsection{SIFT}
\textbf{SIFT (scale-invariant feature transform)} \cite{Lowe04:SIFT} is an algorithm to calculate local image features that are invariant to scale, rotation, noise, and lighting. The features are designed to be highly distinctive so that they can be matched against a large database of features from many images. Such features can be used for keypoint matching but also for image classification.
The first three steps of the SIFT algorithm aim to find meaningful and descriptive keypoints in the image. Keypoints are defined by their center position, scale, and orientation. The fourth and last step then calculates the keypoint descriptor by processing the local image gradients at the selected scale and region. A single SIFT descriptor is a real-valued vector of length $128$.
An alternative strategy often employed in image classification is \textbf{Dense SIFT} \cite{Bosch06:DSIFT1,Bosch07:DSIFT2,Tuytelaars10:DenseInterestPoints}, where the first three steps are skipped. Instead, keypoints are sampled on a dense image grid with a fixed size and orientation. This creates significantly more keypoints, introducing redundancy in the set of descriptors while also increasing its descriptiveness \cite{Chavez12:DSIFT}.
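Both variants can be sketched with OpenCV as follows; the image path is a placeholder, and the grid step and keypoint size for the dense variant are illustrative assumptions.
\begin{verbatim}
# Sketch: standard and dense SIFT descriptors with OpenCV.
# Image path, grid step, and keypoint size are illustrative assumptions.
import cv2

img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image path
sift = cv2.SIFT_create()

# Standard SIFT: detect keypoints, then compute 128-dimensional descriptors.
keypoints, descriptors = sift.detectAndCompute(img, None)

# Dense SIFT: skip detection and place keypoints on a regular grid instead.
step, size = 8, 16
grid = [cv2.KeyPoint(float(x), float(y), size)
        for y in range(0, img.shape[0], step)
        for x in range(0, img.shape[1], step)]
_, dense_descriptors = sift.compute(img, grid)
\end{verbatim}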
\subsection{Bag of visual words}
To perform image classification from local image features, a \textbf{codebook} or \textbf{vocabulary} of a fixed size $k$ is constructed from some or all of the extracted local features. This is done by clustering the set of local features into $k$ clusters called \textbf{visual words}, usually using $k$-means clustering or by randomly choosing $k$ cluster centers. Each local feature can then be assigned to one of the $k$ visual words by choosing the closest cluster center. The choice of distance measure depends on the local feature descriptors; in the case of SIFT, the Euclidean distance is used. The codebook reduces the numerous local features to a tractable number of distinct possible features, which are observed across different images \cite{Chavez12:DSIFT}.
For classification, each image is represented by its histogram of visual words (i.e., a $k$-dimensional vector with the occurrence counts of all $k$ words). Note that this reduction eliminates all information about the positions of the local features and the geometric relations between them. Different extensions exist where some spatial information is kept.
Next, these histograms can be used as feature vectors for classification. In the case of anomaly detection, the normal images are used both for the generation of the codebook and for the calculation of normal feature vectors. These normal features can then be used to fit a one-class classifier such as a one-class SVM.
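The following sketch outlines this pipeline; the codebook size and the placeholder descriptor arrays are assumptions for illustration.
\begin{verbatim}
# Sketch: bag-of-visual-words features for one-class anomaly detection.
# Codebook size and descriptor placeholders are illustrative assumptions.
import numpy as np
from sklearn.cluster import KMeans

def build_codebook(normal_descriptors, k=256):
    # Cluster local features of normal images into k visual words.
    return KMeans(n_clusters=k, n_init=10, random_state=0).fit(normal_descriptors)

def bovw_histogram(image_descriptors, codebook):
    # Assign each descriptor to its closest visual word and count occurrences.
    words = codebook.predict(image_descriptors)
    hist = np.bincount(words, minlength=codebook.n_clusters).astype(float)
    return hist / max(hist.sum(), 1.0)   # normalize for varying descriptor counts

# The histograms of normal images can then be used to fit, e.g., a OneClassSVM.
\end{verbatim}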
\section{Deep learning}
\label{sec:theory_deep_learning}
In deep learning, machine learning models with multiple layers are employed that learn representations of data at multiple levels of abstraction \cite{LeCun15:DeepLearning}. The fundamental deep model is the \emph{feedforward neural network}.
\subsection{Neural networks}
\label{sec:theory_neural_networks}
\textbf{Neural networks} are parametric models that define a highly non-linear mapping $\bm{y} = f(\bm{x} \mid \bm{\theta})$ and learn the parameters $\bm{\theta}$ that result in the best approximation, i.e., that minimize the loss between actual and desired outputs. \textbf{Feedforward neural networks} are composed of $l$ \textbf{layers} that model functions $f^{(1)}, f^{(2)}, \dots, f^{(l)}$. The network is then evaluated by calculating $f(\bm{x}) = f^{(l)}(\dots f^{(2)} ( f^{(1)} (\bm{x}) ) )$, i.e., information flows in one direction and there are no feedback connections. $l$ is called the \textbf{depth of the model}. The final layer $f^{(l)}$ is called the \textbf{output layer}; all other layers are \textbf{hidden layers} \cite{Goodfellow16:DeepLearning}.
\paragraph{Dense layers} A dense or fully connected layer is the default layer type and consists of several neurons, which are considered the smallest computation units in a neural network. The output of a neuron is called its \textbf{activation}. The $i$-th neuron in layer $k$ computes its activation $f^{(k)}_i$ using the activations of the previous layer $f^{(k-1)}$, the \textbf{weight} vector $\bm{W}^{(k)}_{i}$, the \textbf{bias} $b^{(k)}_i \in \R$, and the \textbf{activation function} $g^{(k)}$:
\begin{equation}
f^{(k)}_i = g^{(k)} \left( {\bm{W}^{(k)}_{i}}^\top f^{(k-1)} + b^{(k)}_i \right)
\end{equation}
The vector of the activations $f^{(k)}_i$ of all neurons in layer $k$ is the output $f^{(k)}$ of layer $k$, which is then fed as input to the next layer, and so on. $f^{(0)}$ is defined as the input data $\bm{x}$. The activations of the output layer form the predicted output $\bm{\hat{y}}$. The activation function $g: \R \to \R$ is required to introduce a nonlinearity into the term. An often-employed activation function is \textbf{ReLU} (\emph{Rectified Linear Unit}, see \autoref{fig:activation_functions_relu}) \cite{Nair10:ReLU}, which remains very close to linear:
\begin{equation}
\text{ReLU}(x) = \max \{ 0, x \}
\end{equation}
The codomain of ReLU is $[0, \infty)$. However, especially for the output layer, it can be desirable that the activation function has a finite value range. The hyperbolic tangent (see \autoref{fig:activation_functions_tanh}) is also a suitable activation function and has the finite codomain $(-1, 1)$. It is defined as follows:
\begin{equation}
\tanh(x) = 1 - \frac{2}{e^{2x} + 1}
\end{equation}
\begin{figure}[tb]
\centering
\begin{subfigure}[b]{.47\textwidth}
\includegraphics[width=\textwidth]{images/relu.pdf}
\caption{ReLU.}
\label{fig:activation_functions_relu}
\end{subfigure}
\begin{subfigure}[b]{.5\textwidth}
\includegraphics[width=\textwidth]{images/tanh.pdf}
\caption{$\tanh$.}
\label{fig:activation_functions_tanh}
\end{subfigure}
\caption[The ReLU and hyperbolic tangent activation functions]{The ReLU and hyperbolic tangent activation functions.}
\label{fig:activation_functions}
\end{figure}
\paragraph{Gradient descent} To train a neural network, we define a loss function $L(\bm{\hat{y}}, \bm{y})$ on the output layer. The parameters $\bm{\theta}$ (consisting of the weights and biases of all neurons) are then iteratively adjusted such that the loss function is minimized. In every iteration, the gradient of the loss function $L$ with respect to the parameters $\bm{\theta}$ is averaged over all training examples \cite{Goodfellow16:DeepLearning}:
\begin{equation}
\bm{g} = \frac{1}{m} \nabla_{\bm{\theta}} \sum_{i=1}^m L(f(\bm{x}^{(i)} \mid \bm{\theta}), \bm{y}^{(i)})
\end{equation}
Next, a step is performed in the direction of the negative gradient:
\begin{equation}
\bm{\theta} \leftarrow \bm{\theta} - \alpha \bm{g}
\end{equation}
The variable $\alpha > 0$ is the \textbf{learning rate}, which controls the step size. The gradients with respect to the weights and biases of hidden layers can be computed by repeatedly applying the chain rule in an algorithm called \textbf{backpropagation}. Since the training set is often large, estimates of the gradient can be computed over smaller subsets of the training data (\emph{mini-batches}). This algorithm is called \textbf{stochastic gradient descent}.
Note that gradient descent converges towards local minima and the closest local minimum depends on the initialization of the parameters. Different weight and bias initializations can therefore yield different results.
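The update rule can be illustrated with a short sketch for a linear model with squared loss; the data, batch size, and learning rate are illustrative assumptions.
\begin{verbatim}
# Sketch: mini-batch stochastic gradient descent for a linear model with MSE loss.
# Data, batch size, and learning rate are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 5))
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + 0.1 * rng.normal(size=1000)

theta = np.zeros(5)     # parameters
alpha = 0.1             # learning rate
batch_size = 32

for step in range(200):
    idx = rng.choice(len(X), size=batch_size, replace=False)  # draw a mini-batch
    X_b, y_b = X[idx], y[idx]
    grad = 2.0 / batch_size * X_b.T @ (X_b @ theta - y_b)  # gradient of the MSE loss
    theta -= alpha * grad                                  # step along the negative gradient
\end{verbatim}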
\subsection{Convolutional neural networks}
\label{sec:theory_cnns}
\textbf{Convolutional neural networks (CNNs)} \cite{LeCun89:CNN} are specialized neural networks for grid-like data such as time series and images that employ the \textbf{convolution} operation. For images, we consider a two-dimensional discrete domain. The convolution operation then works as follows: Each input pixel $I(x, y)$ and its surrounding pixels are weighted using the entries of the \textbf{kernel} $K(m, n)$, which is centered around $(x, y)$. These weighted values are summed to produce the output $S(x, y)$, also called a \textbf{feature map} \cite{Goodfellow16:DeepLearning}:
\begin{equation}
S(x, y) = (K \ast I)(x, y) = \sum_{m,n} I(x - m, y - n) K(m, n)
\end{equation}
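As a sketch, this operation corresponds to a standard discrete 2D convolution, e.g.\ as implemented in SciPy; the kernel below is an illustrative edge filter, not one used in this work.
\begin{verbatim}
# Sketch: the 2D convolution from the equation above, computed with SciPy.
# The kernel is an illustrative edge filter.
import numpy as np
from scipy.signal import convolve2d

I = np.random.default_rng(0).random((32, 32))    # toy grayscale image
K = np.array([[-1.0, 0.0, 1.0],
              [-2.0, 0.0, 2.0],
              [-1.0, 0.0, 1.0]])                 # Sobel-like kernel

S = convolve2d(I, K, mode="same")                # feature map S = K * I
\end{verbatim}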
\paragraph{Convolutional layers} employ the convolution operation instead of the matrix multiplication used in dense layers. Therefore, the kernel values take the place of the weight matrix as learnable parameters. A neural network with at least one convolutional layer is a CNN.
CNNs have computational advantages: In a traditional neural network with dense layers, every output unit interacts with every input unit. Since the kernel is smaller than the input, this is not the case for convolutional layers. Therefore, CNNs require fewer parameters and fewer operations \cite{Goodfellow16:DeepLearning} and are much more efficient when processing images.
Moreover, by applying filters, CNNs make use of the spatial information between pixels, which would be lost in dense models. By repeatedly applying filter operations, deeper convolutional layers learn to detect increasingly abstract patterns in images. While the first layers often detect simple shapes like lines and edges, deeper layers detect complex shapes like letters or faces.
To reduce the dimensions of convolutional layer activations, there are two options: The \textbf{stride} can be set to a value greater than one, which skips some values of $S(x, y)$ and thereby reduces the output size. Alternatively, \textbf{pooling layers} can be employed, which reduce several pixels to a single value, e.g., by taking their maximum or average.
\subsection{Autoencoders}
\label{sec:theory_autoencoders}
Autoencoders \cite{Rumelhart86:Autoencoders} are neural networks consisting of two (often symmetrical) parts: an encoder and a decoder (see \autoref{fig:autoencoder}). The encoder maps the input to a compressed and meaningful representation. The decoder takes this \emph{latent representation} and reconstructs the original input \cite{Bank20:Autoencoders}. Autoencoders have a \emph{bottleneck architecture}, where the smallest layer holding the latent representation is the bottleneck. As the optimization goal is to minimize the reconstruction error over the training set, the underlying network has to find a good compressed representation for all training samples. Therefore, mapping the input samples to their compressed representations is a form of dimensionality reduction. In fact, autoencoders can be interpreted as a nonlinear generalization of principal component analysis (PCA) \cite{Hinton06:Autoencoders}.
\begin{figure}[tb]
\centering
\includegraphics[width=.8\textwidth]{images/autoencoder.PNG}
\caption[Architecture of an autoencoder]{Architecture of an autoencoder \cite{Bank20:Autoencoders}. The input image is encoded to the compressed latent representation and then decoded.}
\label{fig:autoencoder}
\end{figure}
\textbf{Convolutional autoencoders} use convolutional encoder and decoder models and are often employed to represent images. The reconstruction loss is usually the \textbf{mean squared error (MSE)} between the original and the reconstructed image.
Autoencoders can easily overfit, thus not providing a meaningful latent representation. To avoid this, various regularization techniques can be applied. The simplest is to choose an appropriate bottleneck size depending on the complexity of the dataset, encoder, and decoder.
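A minimal sketch of such a convolutional autoencoder in PyTorch is given below; the architecture (channel counts, bottleneck size) is an illustrative assumption and not the model used in this work.
\begin{verbatim}
# Sketch: a small convolutional autoencoder with MSE reconstruction loss (PyTorch).
# The architecture is an illustrative assumption.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(                               # input: 1 x 64 x 64
            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),    # -> 16 x 32 x 32
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),   # -> 32 x 16 x 16 (bottleneck)
        )
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(32, 16, 3, stride=2, padding=1, output_padding=1), nn.ReLU(),
            nn.ConvTranspose2d(16, 1, 3, stride=2, padding=1, output_padding=1), nn.Tanh(),
        )

    def forward(self, x):
        z = self.encoder(x)       # latent representation
        return self.decoder(z)    # reconstruction

model = ConvAutoencoder()
x = torch.randn(8, 1, 64, 64)     # a toy batch of grayscale images
loss = nn.MSELoss()(model(x), x)  # reconstruction loss
\end{verbatim}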
\subsubsection{Sparse autoencoders}
\label{sec:sparse_ae}
Another form of regularization is to enforce sparsity on the latent representation, i.e., forcing neurons to be inactive most of the time. This can be implemented using a penalty term on the bottleneck activations such as the $L_1$-norm \cite{Bank20:Autoencoders}. Another sparsity constraint, based on the Kullback-Leibler (KL) divergence, models each bottleneck neuron as a Bernoulli random variable with mean
\begin{equation}
\hat{\rho}_j = \frac{1}{m} \sum_{i=1}^m a_j (\bm{x}^{(i)})
\end{equation}
for the $j$-th bottleneck neuron, where $a_j (\bm{x}^{(i)})$ is the activation of the $j$-th bottleneck neuron for the $i$-th training sample. In other words, $\hat{\rho}_j$ is set to the average activation of the $j$-th neuron. The penalty term is then set to
\begin{equation}
\sum_{j=1}^b \text{KL}(\rho \,\|\, \hat{\rho}_j),
\end{equation}
where $b$ is the number of bottleneck neurons and
\begin{equation}
\text{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}
\end{equation}
is the KL divergence between two Bernoulli random variables with means $\rho$ and $\hat{\rho}_j$, respectively \cite{Ng11:SparseAutoencoder}. $\rho$ is a hyperparameter expressing the ideal mean of the bottleneck activations and should be set close to $0$. Note that the average activations $\hat{\rho}_j$ are computed over all training samples, which makes this penalty possibly more accurate but less efficient than the $L_1$ penalty if the training set is too large to fit in memory.
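The penalty can be sketched as follows (PyTorch); the value of $\rho$ and the assumption that the bottleneck activations lie in $(0, 1)$, e.g., after a sigmoid, are illustrative.
\begin{verbatim}
# Sketch: KL sparsity penalty on bottleneck activations (PyTorch).
# rho and the (0, 1) activation range are illustrative assumptions.
import torch

def kl_sparsity_penalty(bottleneck_activations, rho=0.05, eps=1e-8):
    """bottleneck_activations: tensor of shape (num_samples, num_bottleneck_neurons),
    assumed to lie in (0, 1), e.g., after a sigmoid."""
    rho_hat = bottleneck_activations.mean(dim=0)   # average activation per neuron
    kl = (rho * torch.log(rho / (rho_hat + eps))
          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat + eps)))
    return kl.sum()                                # summed over all bottleneck neurons
\end{verbatim}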
\subsubsection{Denoising autoencoders}
\label{sec:denoising_ae}
In denoising autoencoders, the input is corrupted by noise, usually Gaussian noise or erasures. The autoencoder is expected to reconstruct the original version of the input image. Thus, the reconstruction loss is computed between the uncorrupted input and the output image. Effectively, this extension prevents overfitting while additionally making the model more robust to noise and even usable for error correction \cite{Bank20:Autoencoders}.
\section{Evaluating binary classifiers}
In a binary classification setting, there are only two possible prediction classes: negative and positive. To describe the predicted class and the confidence of the prediction, a single real number $x$ suffices as output: for positive $x$, the positive class is predicted, and for negative $x$ the negative class. The larger the absolute value of $x$, the higher the confidence. The function mapping input data to $x$ is called the \emph{decision function}.
\subsection{Confusion matrix}
Suppose we want to evaluate a classifier on an annotated test set (i.e., the true classes are known for all samples). Assuming the classifier is not ideal, some predicted classes differ from the true classes. In fact, there are four possible outcomes for every prediction:
\begin{enumerate}[label=\alph*)]
\item \textbf{True positive (TP):} The sample is labeled as positive and was correctly classified as positive.
\item \textbf{False negative (FN):} The sample is labeled as positive but was falsely classified as negative.
\item \textbf{True negative (TN):} The sample is labeled as negative and was correctly classified as negative.
\item \textbf{False positive (FP):} The sample is labeled as negative but was falsely classified as positive.
\end{enumerate}
To keep a record of correct and incorrect predictions, the \emph{confusion matrix} is introduced (see \autoref{fig:confusion_matrix}), which contains the absolute frequencies of the above prediction cases. Correct classifier decisions lie on the main diagonal of the table, while the other elements represent the misclassifications \cite{Majnik13:ROCAnalysis}.
\begin{figure}[tbhp]
\centering
\begin{tikzpicture}
% table
\draw[fill=red!40] (0, 0) rectangle (2, 2) node[pos=.5] {FN};
\draw[fill=green!60] (2, 0) rectangle (4, 2) node[pos=.5] {TN};
\draw[fill=green!60] (0, 2) rectangle (2, 4) node[pos=.5] {TP};
\draw[fill=red!40] (2, 2) rectangle (4, 4) node[pos=.5] {FP};
% positive/negative labels on the left
\path (-1, 0) rectangle (0, 2) node[pos=.5] {N'};
\path (-1, 2) rectangle (0, 4) node[pos=.5] {P'};
% positive/negative labels on the top
\path (0, 4) rectangle (2, 5) node[pos=.5] {P};
\path (2, 4) rectangle (4, 5) node[pos=.5] {N};
% outer labels
\path (0, 5) rectangle (4, 6) node[pos=.5] {True class};
\path (-2, 0) rectangle (-1, 4) node[pos=.5,rotate=90] {Predicted class};
\end{tikzpicture}
\caption[Confusion matrix]{Confusion matrix. The green cells indicate successful predictions, the red cells are prediction failures.}
\label{fig:confusion_matrix}
\end{figure}
From these absolute frequencies, relative rates can be derived. Let TP, FN, TN, FP denote the absolute frequencies from the confusion matrix, $\text{P} = \text{TP} + \text{FN}$ the number of positively labeled samples, and $\text{N} = \text{FP} + \text{TN}$ the number of negatively labeled samples. Then, the following rates can be defined to describe the classifier's performance \cite{Majnik13:ROCAnalysis}:
\begin{enumerate}[label=\alph*)]
\item \textbf{True positive rate (TPR), sensitivity:} $\text{TPR} = \frac{\text{TP}}{\text{P}} = 1 - \text{FNR}$.
\item \textbf{True negative rate (TNR), specificity:} $\text{TNR} = \frac{\text{TN}}{\text{N}} = 1 - \text{FPR}$.
\item \textbf{False positive rate (FPR):} $\text{FPR} = \frac{\text{FP}}{\text{N}} = 1 - \text{TNR}$.
\item \textbf{False negative rate (FNR):} $\text{FNR} = \frac{\text{FN}}{\text{P}} = 1 - \text{TPR}$.
\end{enumerate}
An ideal classifier would have a sensitivity and specificity of $1$. Real-world classifiers, however, often require a trade-off between high sensitivity and high specificity, made by choosing a suitable threshold for the decision function.
\subsection{ROC curves}
\label{sec:roc_curves}
\begin{figure}[tb]
\centering
\begin{tikzpicture}
% plot + labels
\draw[opacity=0.5] (0, 0) rectangle (10, 10);
\path (0, -2) rectangle (10, 0) node[pos=.5] {FPR};
\path (-2, 10) rectangle (0, 0) node[pos=.5,rotate=90] {TPR};
% tick bottom left
\draw[opacity=0.5] (0,0) -- (-0.2,-0.2);
\node at (-0.4,-0.4) {0};
% tick bottom right
\draw[opacity=0.5] (10,0) -- (10,-0.2);
\node at (10,-0.4) {1};
% tick top left
\draw[opacity=0.5] (0,10) -- (-0.2,10);
\node at (-0.4,10) {1};
% diagonal
\draw[densely dotted,opacity=0.5] (0,0) -- (10,10);
% always negative
\draw[fill=blue!80!black] (0,0) circle (0.1) node[above right,text=blue!80!black] {always negative};
% always positive
\draw[fill=blue!80!black] (10,10) circle (0.1) node[below left,text=blue!80!black] {always positive};
% always correct
\draw[fill=green!60!black] (0,10) circle (0.1) node[below right,text=green!60!black] {always correct};
% always incorrect
\draw[fill=red!70!black] (10,0) circle (0.1) node[above left,text=red!70!black] {always incorrect};
% real classifiers
\draw[densely dotted,fill=violet,opacity=0.8] (7,1.4) -- (3,8.6);
\draw[fill=violet] (7,1.4) circle (0.1) node[above right,text=violet] {$A$};
\draw[fill=violet] (3,8.6) circle (0.1) node[above left,text=violet] {$\overline{A}$};
% random guesses
\draw[fill=orange!80!black] (5,5) circle (0.1) node[right,text=orange!80!black] {random guessing $p=0.5$};
\draw[fill=orange!80!black] (2,2) circle (0.1) node[below right,text=orange!80!black] {random guessing $p=0.2$};
\end{tikzpicture}
\caption[Characteristic points on a ROC graph]{Characteristic points on a ROC graph. Classifier $A$ performs significantly worse than random guessing, suggesting that its decision function should be inverted. This yields the much better classifier $\overline{A}$.}
\label{fig:roc_curve}
\end{figure}
To compare binary classifiers, a fair benchmark independent of the chosen threshold is required. The ROC (receiver operating characteristic) curve of a binary classifier plots the FPR on the x-axis against the TPR on the y-axis for all possible thresholds. In other words, each point on the ROC graph represents a single threshold for the decision function. There are three characteristic points, illustrated in \autoref{fig:roc_curve}:
\begin{itemize}
\item The classifier at $(0, 0)$ always predicts the negative class, therefore producing neither true positives nor false positives.
\item The classifier at $(1, 1)$ always predicts the positive class, therefore producing only true and false positives.
\item The classifier at $(0, 1)$ is the ideal classifier that always predicts the correct class.
\end{itemize}
A classifier at $(1, 0)$ would always predict the opposite of the true class. Furthermore, a classifier that guesses randomly would land close to the ascending diagonal line. Therefore, all useful classifiers must lie above the diagonal. However, even a classifier significantly below the diagonal must contain useful information; otherwise, it could not perform significantly worse than random guessing. Such cases indicate that the decision function should be inverted, effectively mirroring the classifier through the midpoint $(0.5, 0.5)$ of the ROC graph \cite{Flach05:ROCCurves}. An example of this is shown in \autoref{fig:roc_curve} with classifier $A$.
To compare different classifiers, one ROC curve per classifier is computed over all possible thresholds. The quality of each classifier can then be measured by the area under the curve (AUC), i.e., the area under the ROC curve. Consequently, a good classifier has an AUC score close to $1$, whereas a poor classifier has an AUC score of around $0.5$ (see \autoref{fig:roc_examples}). Again, an AUC much lower than $0.5$ suggests that the decision function should be inverted, whereas an AUC of $1$ shows that there exists a threshold for the decision function with $100\,\%$ accuracy.
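The computation of ROC curves and AUC scores is sketched below with scikit-learn; the labels and decision values are toy assumptions.
\begin{verbatim}
# Sketch: ROC curve and AUC score with scikit-learn.
# Labels and decision values are toy assumptions.
import numpy as np
from sklearn.metrics import roc_curve, roc_auc_score

y_true = np.array([0, 0, 1, 1, 0, 1, 1, 0])                      # ground-truth classes
scores = np.array([-0.8, -0.3, 0.2, 1.1, 0.4, 0.9, -0.1, -1.2])  # decision function values

fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) point per threshold
auc = roc_auc_score(y_true, scores)               # area under the ROC curve

# Inverting the decision function mirrors the curve: the AUC becomes 1 - AUC.
auc_inverted = roc_auc_score(y_true, -scores)
\end{verbatim}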
\begin{figure}[tb]
\centering
\begin{subfigure}[b]{0.32\textwidth}
\centering
\includegraphics[width=\textwidth]{images/roc_random.pdf}
\caption{}
\label{fig:roc_random}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
\centering
\includegraphics[width=\textwidth]{images/roc_goodclf.pdf}
\caption{}
\label{fig:roc_goodclf}
\end{subfigure}
\begin{subfigure}[b]{0.32\textwidth}
\centering
\includegraphics[width=\textwidth]{images/roc_goodclf_inv.pdf}
\caption{}
\label{fig:roc_goodclf_inv}
\end{subfigure}
\caption[Different ROC curves and AUC scores]{Different ROC curves and AUC scores. (a) Random guessing yields AUC $\approx 0.5$. (b) A good classifier $C$ has an AUC score close to 1. (c) If we invert the decision function of $C$, the AUC score changes to $1 - $ AUC.}
\label{fig:roc_examples}
\end{figure}
% -------------------------------------------------------------------
\section{Already existing solutions}
\label{sec:existingSolutions}
Several animal species detection solutions \cite{Norouzzadeh18:Solution1,Willi19:Solution2} used a separate CNN for the task of detecting the presence of an animal in an image. To train such CNNs, a large and diverse annotated dataset is required: \cite{Norouzzadeh18:Solution1} used a balanced dataset of 1.4 million training and 105,000 test images. The images were selected from a total of 3.2 million images generated by 225 continuously running camera traps in the Serengeti National Park, Tanzania, since 2011. \cite{Willi19:Solution2} used the same dataset as well as three additional ones from Africa and North America, each with around 500,000 images.
\cite{Norouzzadeh18:Solution1} found that splitting the tasks of animal detection and species classification outperformed a combined solution. They compared different CNN architectures and an ensemble of multiple CNNs in terms of accuracy. While the ensemble often slightly improved the accuracy compared to the best single model, it also increased training and inference time.
\cite{Willi19:Solution2} argued that a combined model would have a strong bias towards the majority class of empty images. This effect is very significant, since empty images make up more than 50\,\% of most camera trap datasets. Additionally, an increase in accuracy can be expected by learning more specific target tasks. \cite{Willi19:Solution2} trained a CNN on the large Snapshot Serengeti dataset and then employed transfer learning to fine-tune the model to the smaller datasets, thereby avoiding overfitting. Their experiments showed that this increases the overall accuracy on all datasets.
\cite{Yang21:SolutionEnsemble} described another method for removing empty camera trap images using ensemble learning. They employed an ensemble of three CNNs with different architectures. To account for the class imbalance, they constructed a balanced and an unbalanced training set and trained each model architecture on each training set, thereby creating a pool of six models. Using a conservative voting mechanism, they removed between 50.78\,\% and 77.51\,\% of the empty images with omission errors of 0.70\,\% to 2.54\,\%. However, they also used a large Chinese dataset with 268,484 motion images (of which 77.86\,\% were empty), and training six deep CNN models presumably requires a significant amount of time (which is not specified in the paper). In this work, such deep models are not applicable since there is far less training data (1,000--5,000 images per camera station) and annotated test data (500--3,000 images per camera station) available.
% -------------------------------------------------------------------
% insert further sections if necessary