% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Felix Kleinsteuber
% Title: Anomaly Detection in Camera Trap Images
% File: chap03/chap03.tex
% Part: methods
% Description:
% summary of the content in this chapter
% Version: 16.05.2022
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Methods}
\label{chap:contribution}
In the following, the dataset is analyzed with regard to its structure, inconsistencies, and preprocessing. Afterwards, four approaches are proposed based on the theoretical considerations of the previous chapter.
\section{Dataset}
We will call a set of images generated by a single camera at a single position in the Bavarian Forest a \emph{session}. The images of a session were taken over a period of one to three months. To attract animals, the researchers placed an animal cadaver in front of the camera, which is visible in the images.
The dataset is organized into 32 sessions, each of which has its own folder identified by the cadaver type and a session number between one and five. Overall, there are 10 different cadaver types. Each session folder contains three subfolders, \emph{Lapse}, \emph{Motion}, and \emph{Full}. The \emph{Lapse} folder contains images taken at regular time intervals (usually one hour), regardless of whether motion was detected in front of the camera. In contrast, the \emph{Motion} folder contains only images that were captured when the motion sensor was triggered. When a movement is registered, the camera takes five images at an interval of one second. This process is repeated as long as the movement continues. \emph{Motion} images are therefore organized in sets of at least five consecutively taken images, referred to as \emph{capture sets}. The \emph{Full} folder contains a subset of the \emph{Motion} images, pre-selected by humans, that actually contain a moving object. The \emph{Full} files can be used to aid annotation but are not referenced further since they are not part of the classification process.
A total of 203,887 images are available, of which 82 \% are \emph{Motion}, 15 \% are \emph{Lapse}, and 2 \% are \emph{Full} images.
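The grouping of \emph{Motion} images into capture sets can be reconstructed from the EXIF timestamps. The following is a minimal Python sketch of this grouping; the file layout, the use of Pillow for EXIF access, and the gap threshold are illustrative assumptions and do not necessarily match the actual implementation.
\begin{verbatim}
from datetime import datetime, timedelta
from pathlib import Path
from PIL import Image

def exif_datetime(path):
    """Read the EXIF DateTimeOriginal tag (0x9003) of a JPEG image."""
    exif = Image.open(path)._getexif()
    return datetime.strptime(exif[0x9003], "%Y:%m:%d %H:%M:%S")

def capture_sets(motion_dir, max_gap=timedelta(seconds=2)):
    """Group consecutively taken Motion images into capture sets."""
    stamped = sorted((exif_datetime(p), p) for p in Path(motion_dir).glob("*.JPG"))
    sets = []
    for time, path in stamped:
        # Start a new capture set if the gap to the previous image is too large.
        if sets and time - previous <= max_gap:
            sets[-1].append(path)
        else:
            sets.append([path])
        previous = time
    return sets
\end{verbatim}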
\subsection{Challenges}
The ratio between the numbers of files in the three folders varies significantly. For instance, the \emph{Roedeer\_01} session contains 1,380 \emph{Lapse} samples and 38,820 \emph{Motion} samples, of which only 18 have been pre-selected. In contrast, the \emph{Beaver\_01} session contains 1,734 \emph{Lapse} samples but only 695 \emph{Motion} samples, of which 200 have been pre-selected. This shows a great variance in the relative frequency of anomalous images.
An analysis of the distribution of EXIF image dates for each session shows that nine of the 32 sessions contain duplicates (i.e., two or more images taken at the same time), of which six contain inconsistent duplicates (i.e., the duplicate images show different content). While consistent duplicates can be eliminated easily, inconsistent duplicates indicate an error in the dataset that cannot be fixed easily. Moreover, in a few cases, the number of images in a capture set is not a multiple of five. The origin of this error could not be traced.
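Building on the \texttt{exif\_datetime} helper above, duplicate timestamps within a folder can be detected with a simple counting sketch (again only an illustration):
\begin{verbatim}
from collections import Counter

def duplicate_timestamps(image_dir):
    """All EXIF timestamps that occur more than once within one folder."""
    counts = Counter(exif_datetime(p) for p in Path(image_dir).glob("*.JPG"))
    return {t: n for t, n in counts.items() if n > 1}
\end{verbatim}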
\subsection{Preprocessing}
The original image size of around 8 megapixels is too large for efficient execution of the proposed methods. In addition, the raw input images have black status bars at the top and bottom which do not contain any information useful for classification. Therefore, in a preprocessing step, the black bars are cropped off and the image is rescaled to 30 \% of its original size for approaches 1--3 and to $256 \times 256$ pixels for approach 4. This process is demonstrated in \autoref{fig:preprocessing} for a sample image.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{.8\textwidth}
\centering
\includegraphics[width=\textwidth]{images/sample_0_5.jpg}
\caption{}
\label{fig:sample_image_raw}
\end{subfigure}
\begin{subfigure}[b]{.52\textwidth}
\centering
\includegraphics[width=\textwidth]{images/sample_preprocessed.jpg}
\caption{}
\label{fig:sample_image_pre}
\end{subfigure}
\begin{subfigure}[b]{.38\textwidth}
\centering
\includegraphics[width=\textwidth]{images/sample_preprocessed256.jpg}
\caption{}
\label{fig:sample_image_pre256}
\end{subfigure}
\caption[Cropping and resizing input images]{Cropping and resizing input images. (a) Raw input image (size $3776 \times 2124$). (b) Cropped and rescaled to 30 \% (size $1133 \times 613$) for approaches 1--3. (c) Cropped and resized to $256 \times 256$ for approach 4.}
\label{fig:preprocessing}
\end{figure}
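The following is a minimal sketch of this preprocessing step using Pillow; the assumed height of the status bars is an illustrative value, not the exact crop margin used in this work.
\begin{verbatim}
from PIL import Image

def preprocess(path, target=None, scale=0.3, bar_height=100):
    """Crop the black status bars and rescale a raw camera trap image.

    bar_height is the assumed height of the top/bottom status bars in pixels.
    If target is given (e.g. (256, 256)), the crop is resized to that size;
    otherwise it is rescaled by the given factor (default 30 %).
    """
    img = Image.open(path)
    w, h = img.size
    img = img.crop((0, bar_height, w, h - bar_height))
    if target is not None:
        return img.resize(target)                 # approach 4
    return img.resize((round(w * scale),          # approaches 1-3
                       round((h - 2 * bar_height) * scale)))
\end{verbatim}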
\subsection{Labeling}
\label{sec:labeling}
To annotate sessions for testing, a script shows the annotator one image at a time. The annotator can then classify the image as normal or anomalous by pressing the `1' or `2' key. The generated labels can be quickly exported and saved for automated testing. Experiments show that, for the tested sessions, between 60 and 100 images can be annotated per minute this way, depending on the frequency of images with content of interest.
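A minimal sketch of such an annotation loop using OpenCV is given below; the key bindings follow the description above, while the file handling and output format are assumptions.
\begin{verbatim}
import csv
import cv2
from pathlib import Path

def annotate(image_dir, out_csv="labels.csv"):
    """Show images one at a time; '1' = normal, '2' = anomalous, 'q' = quit."""
    rows = []
    for path in sorted(Path(image_dir).glob("*.JPG")):
        cv2.imshow("annotate", cv2.imread(str(path)))
        key = cv2.waitKey(0) & 0xFF
        if key == ord("q"):
            break
        rows.append((path.name, 0 if key == ord("1") else 1))
    cv2.destroyAllWindows()
    with open(out_csv, "w", newline="") as f:
        csv.writer(f).writerows(rows)
\end{verbatim}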
\section{Approaches}
A total of four approaches are evaluated on the available sessions. Since the sessions were created independently of each other, each approach is evaluated on one session at a time, with only the lapse and motion images of that session as input information.
\subsection{Lapse frame differencing}
Using frame differencing (see \autoref{sec:theory_comparingimages_fd}), a motion image can be compared to the lapse image closest in time. For this approach to work, lapse images have to be taken at least every hour, using the same camera setup as for motion images, such that the closest lapse image closely resembles the background of the motion image. Mean and variance of the difference image (see \autoref{fig:approach1_example}) are determined and independently thresholded to distinguish between animal images with a high mean and variance, and empty images with a low mean and variance.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_motion.pdf}
\caption{}
\label{fig:approach1_motion}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_lapse.pdf}
\caption{}
\label{fig:approach1_lapse}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_sqdiff.pdf}
\caption{}
\label{fig:approach1_sqdiff}
\end{subfigure}
\caption[Demonstration of lapse frame differencing on an anomalous motion image]{Demonstration of lapse frame differencing on an anomalous motion image. (a) Motion image taken at 00:22:46. (b) Closest lapse image taken at 00:00:00. (c) Squared pixel-wise difference ($\mu = 0.060$, $\sigma^2 = 0.033$).}
\label{fig:approach1_example}
\end{figure}
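A minimal sketch of this scoring step is given below, assuming preprocessed grayscale images stored as floating-point arrays and a precomputed index of lapse images with their timestamps; all function names are illustrative.
\begin{verbatim}
import numpy as np

def closest_lapse(motion_time, lapse_index):
    """lapse_index: list of (timestamp, grayscale image) pairs."""
    return min(lapse_index, key=lambda entry: abs(entry[0] - motion_time))[1]

def difference_score(motion_img, lapse_img):
    """Mean and variance of the squared pixel-wise difference image."""
    diff = (motion_img.astype(np.float32) - lapse_img.astype(np.float32)) ** 2
    return diff.mean(), diff.var()

# An image is classified as anomalous if its mean or variance exceeds a threshold.
\end{verbatim}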
The algorithm is extended by applying Gaussian filtering to both input images with standard deviations $\sigma \in \{ 2, 4, 6 \}$ (see \autoref{fig:approach1_example2}). This makes the approach more robust to noise and small object movements (leaves, flies, dust particles, etc.). Note that this approach can only be evaluated on sessions with lapse images captured every hour.
\begin{figure}[htbp]
\centering
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_ex2_motion.pdf}
\caption{}
\label{fig:approach1_ex2_motion}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_ex2_lapse.pdf}
\caption{}
\label{fig:approach1_ex2_lapse}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_ex2_sqdiff.pdf}
\caption{}
\label{fig:approach1_ex2_sqdiff}
\end{subfigure}
\begin{subfigure}[b]{0.48\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach1a_ex2_sigma4_sqdiff.pdf}
\caption{}
\label{fig:approach1_ex2_sqdiff_sigma4}
\end{subfigure}
\caption[Extending lapse frame differencing using Gaussian filtering]{Extending lapse frame differencing using Gaussian filtering ($\sigma = 4$) beforehand. (a) Motion image. (b) Lapse image. (c) Squared pixel-wise difference ($\mu = 1.258$, $\sigma^2 = 4.113$). (d) Squared pixel-wise difference of the Gaussian-filtered images ($\mu = 0.160$, $\sigma^2 = 0.330$).}
\label{fig:approach1_example2}
\end{figure}
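The extension amounts to smoothing both images before differencing, for example with SciPy (a sketch building on \texttt{difference\_score} from above):
\begin{verbatim}
from scipy.ndimage import gaussian_filter

def filtered_difference_score(motion_img, lapse_img, sigma=4):
    """Frame differencing on Gaussian-filtered (smoothed) images."""
    return difference_score(gaussian_filter(motion_img, sigma),
                            gaussian_filter(lapse_img, sigma))
\end{verbatim}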
\subsection{Median frame differencing}
If the above condition is not met, i.e., no or too few lapse images are available, a substitute for lapse images can be found. Since the motion set is organized in subsets of at least five consecutively taken images (\emph{capture sets}), background estimation using the temporal median image (see \autoref{sec:theory_comparingimages_be}) can be applied to estimate a background image directly from the motion images of a single capture set. In the best case, such an image is a better background estimate than the closest lapse image since there is no time difference between the images (see \autoref{fig:approach2_good_median}). However, this method often fails, particularly when the foreground object remains in the same place in the image for an extended period of time, which is not unusual behavior for animals (see \autoref{fig:approach2_bad_median}). The main advantage of this approach is its low requirements, since only motion images are used in the algorithm. As before, the accuracy should improve by filtering both the median image and the motion image with a Gaussian filter with $\sigma \in \{ 2, 4, 6 \}$.
\begin{figure}[tb]
\centering
\begin{subfigure}[b]{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach2_good_example_imgs.png}
\caption{}
\label{fig:approach2_good_imgs}
\end{subfigure}
\begin{subfigure}[b]{0.38\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach2_good_example_median.png}
\caption{}
\label{fig:approach2_good_median}
\end{subfigure}
\begin{subfigure}[b]{0.6\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach2_bad_example_imgs.png}
\caption{}
\label{fig:approach2_bad_imgs}
\end{subfigure}
\begin{subfigure}[b]{0.38\textwidth}
\centering
\includegraphics[width=\textwidth]{images/approach2_bad_example_median.png}
\caption{}
\label{fig:approach2_bad_median}
\end{subfigure}
\caption[Demonstration of background estimation using temporal median filtering]{Demonstration of background estimation using temporal median filtering. (a) Convenient set of motion images; for every pixel, the majority of values shows background. (b) Resulting good background estimate. (c) Inconvenient set of motion images; the deer stays in the center of the image. (d) Resulting poor background estimate.}
\label{fig:approach2_example}
\end{figure}
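A minimal sketch of the background estimation and scoring for a single capture set, reusing \texttt{difference\_score} from above (the images of the set are assumed to be loaded as floating-point arrays):
\begin{verbatim}
def median_difference_scores(capture_set_imgs):
    """Estimate the background of one capture set as the temporal median and
    score every image of the set against it via frame differencing."""
    stack = np.stack(capture_set_imgs)      # shape: (number of images, H, W)
    background = np.median(stack, axis=0)   # pixel-wise temporal median
    return [difference_score(img, background) for img in capture_set_imgs]
\end{verbatim}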
A possible weakness of both approaches 1 and 2 is their high sensitivity to movements of the camera or of background objects. As soon as the camera, trees, or leaves move by even a few pixels, high difference values occur, which distort the statistical properties of the difference image and thus make it harder to distinguish between normal and anomalous images. Approach 3 tries to eliminate this weakness by relying on local image features.
\subsection{Bag of Visual Words}
In the training process, a visual vocabulary is generated in the following way: First, SIFT descriptors are calculated on densely sampled keypoints for all lapse images (see \autoref{fig:approach3}). Then, the descriptors are clustered into $k$ groups using the k-means algorithm to create a visual vocabulary $V$. The best combination of hyperparameters, including the keypoint step size $s$, the keypoint size, and the vocabulary size $k$, is determined experimentally. The training features are derived by computing the Bag of Visual Words histogram of every lapse image with respect to the vocabulary $V$. They can then be used to fit a one-class classifier (here, a one-class SVM).
\begin{figure}[htbp]
\centering
\includegraphics[width=.6\textwidth]{images/approach3_keypoints.pdf}
\caption[Densely sampled keypoints]{Densely sampled keypoints. Keypoint size 30 pixels, no spacing ($s = 30$).}
\label{fig:approach3}
\end{figure}
For evaluation on the motion images, SIFT descriptors are computed on the same dense keypoint grid that was used in training. The descriptors are used to derive the Bag of Visual Words histogram, to which a score can be assigned using the trained one-class classifier. This score can then be thresholded to distinguish between normal and anomalous images. As an alternative to k-means clustering, picking random prototypes is also evaluated.
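A minimal sketch of this pipeline using OpenCV and scikit-learn is given below; the grid parameters, vocabulary size, and one-class SVM settings are illustrative assumptions and may differ from the actual implementation.
\begin{verbatim}
import cv2
import numpy as np
from sklearn.cluster import KMeans
from sklearn.svm import OneClassSVM

def dense_sift(gray_img, step=30, size=30):
    """SIFT descriptors on a dense keypoint grid (gray_img: uint8 grayscale)."""
    sift = cv2.SIFT_create()
    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
                 for y in range(step // 2, gray_img.shape[0], step)
                 for x in range(step // 2, gray_img.shape[1], step)]
    _, descriptors = sift.compute(gray_img, keypoints)
    return descriptors

def bovw_histogram(gray_img, vocabulary, k):
    """Normalized Bag of Visual Words histogram of one image."""
    words = vocabulary.predict(dense_sift(gray_img))
    hist = np.bincount(words, minlength=k).astype(np.float64)
    return hist / hist.sum()

def fit_bovw(lapse_imgs, k=512):
    """Build a visual vocabulary from lapse images and fit a one-class SVM."""
    all_desc = np.vstack([dense_sift(img) for img in lapse_imgs])
    vocabulary = KMeans(n_clusters=k).fit(all_desc)
    histograms = np.stack([bovw_histogram(img, vocabulary, k) for img in lapse_imgs])
    return vocabulary, OneClassSVM(nu=0.1).fit(histograms)

# Scoring a motion image:
# svm.decision_function([bovw_histogram(img, vocabulary, k)])
\end{verbatim}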
\subsection{Autoencoder}
An autoencoder neural network is trained on normal lapse images and then evaluated on motion images, as described in \autoref{sec:theory_autoencoders}. Two metrics are considered separately to distinguish between normal and anomalous images: The reconstruction loss measures the mean squared error between the input and the reconstructed image. The second metric is the log likelihood of the input image under the distribution of the bottleneck activations, estimated using Kernel Density Estimation (KDE) with a Gaussian kernel. An image is considered anomalous if its reconstruction loss is \emph{above} a certain threshold or its log likelihood is \emph{below} a certain threshold. Again, the best combination of hyperparameters, including learning rate, batch size, and dropout rate, is determined experimentally. \autoref{fig:approach4_reconstructions} demonstrates that anomalous images yield a high reconstruction error while normal images are reconstructed much more accurately.
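A minimal sketch of both anomaly scores is given below, assuming a trained PyTorch model whose \texttt{encode} method returns the bottleneck activations; this interface is an assumption for illustration.
\begin{verbatim}
import torch
from sklearn.neighbors import KernelDensity

@torch.no_grad()
def anomaly_scores(model, kde, motion_batch):
    """Per-image reconstruction loss and KDE log likelihood.

    model(x) is assumed to return the reconstruction and model.encode(x) the
    bottleneck activations; kde was fitted beforehand on the flattened
    bottleneck activations of the (normal) lapse images, e.g.
    kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(lapse_codes).
    """
    reconstruction = model(motion_batch)
    # Mean squared error per image (averaged over channels and pixels).
    recon_loss = ((motion_batch - reconstruction) ** 2).mean(dim=(1, 2, 3))
    codes = model.encode(motion_batch).flatten(start_dim=1).cpu().numpy()
    log_likelihood = kde.score_samples(codes)
    return recon_loss.cpu().numpy(), log_likelihood
\end{verbatim}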
\begin{figure}[htbp]
\centering
\begin{subfigure}[t]{.48\textwidth}
\includegraphics[width=\textwidth]{images/approach4_normal_reconstruction.png}
\caption{}
\end{subfigure}
\begin{subfigure}[t]{.48\textwidth}
\includegraphics[width=\textwidth]{images/approach4_anomalous_reconstruction.png}
\caption{}
\end{subfigure}
\caption[Autoencoder inputs and reconstructions]{Autoencoder inputs (left) and reconstructions (right). (a) A normal input image is reconstructed well. (b) An anomalous image is reconstructed poorly.}
\label{fig:approach4_reconstructions}
\end{figure}
\begin{figure}[hbtp]
\centering
\includegraphics[width=.36\textwidth,angle=90]{images/approach4_architecture.pdf}
\caption[Architecture of the convolutional autoencoder]{Architecture of the convolutional autoencoder. All convolutional layers use dropout with $p=0.05$ and the ReLU activation function. The very last layer uses the $\tanh$ activation function to provide a value range of $(-1, 1)$ identical to the input. The output dimensions of each layer are given in brackets.}
\label{fig:approach4_architecture}
\end{figure}
\subsubsection{Architecture}
To keep the number of parameters small, a fully convolutional architecture with no dense layers is employed (see \autoref{fig:approach4_architecture}). The encoder and decoder parts are mirror-symmetric and consist of seven convolutional layers each. Input and output images are downscaled to $(3, 256, 256)$ with the color channel in the first dimension. The image size should not be smaller than $256 \times 256$ so that even small anomalous objects remain visible. A larger image size is possible but requires a change of architecture. The value range of both input and output images is $(-1, 1)$.
The bottleneck layer has shape $(n, 4, 4)$. The variable $n$ therefore controls the size of the latent representation in multiples of 16. The optimal value for $n$ depends on the characteristics of the session. The number of trainable parameters varies with the bottleneck size: for $n = 32$ (512 latent features), there are a total of $1{,}076{,}771$ parameters, of which $388{,}192$ belong to the encoder and $688{,}579$ to the decoder.
Using an existing network architecture was considered; however, most available architectures are simply too large for such small training sets. The architecture follows two basic design principles:
\begin{enumerate}
\item In the encoder, the image size gradually decreases while the number of channels gradually increases (except for the bottleneck layer).
\item The decoder mirrors the encoder.
\end{enumerate}
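A schematic PyTorch sketch of an encoder following these principles is given below; the concrete channel counts, kernel sizes, and strides are illustrative assumptions and do not necessarily reproduce the parameter counts stated above.
\begin{verbatim}
import torch.nn as nn

def conv_block(in_ch, out_ch, stride):
    """Convolution + dropout + ReLU, halving the spatial size for stride 2."""
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, stride=stride, padding=1),
        nn.Dropout(0.05),
        nn.ReLU(),
    )

def make_encoder(n=32):
    """(3, 256, 256) -> (n, 4, 4); channels grow while the image shrinks."""
    channels = [3, 16, 32, 64, 64, 128, 128]
    layers = [conv_block(channels[i], channels[i + 1], stride=2) for i in range(6)]
    layers.append(nn.Conv2d(channels[-1], n, kernel_size=3, stride=1, padding=1))
    return nn.Sequential(*layers)

# The decoder mirrors this structure with transposed convolutions
# (nn.ConvTranspose2d) and a final tanh activation for the (-1, 1) output range.
\end{verbatim}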
\subsubsection{Training and evaluation}
The model parameters are optimized using the Adam optimizer \cite{Kingma14:Adam} in PyTorch \cite{Paszke19:PyTorch}; the model is trained for 200 epochs. After training is completed, the reconstruction losses and log likelihoods are calculated for all lapse and motion images. Both metrics are then thresholded as before.
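A minimal sketch of this training loop is given below; the learning rate and data loading are illustrative assumptions.
\begin{verbatim}
import torch

def train(model, train_loader, epochs=200, lr=1e-3, device="cuda"):
    """Train the autoencoder on (normal) lapse images with an MSE loss."""
    model = model.to(device)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)
    criterion = torch.nn.MSELoss()
    for epoch in range(epochs):
        for batch in train_loader:          # batches of lapse images in (-1, 1)
            batch = batch.to(device)
            loss = criterion(model(batch), batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return model
\end{verbatim}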
\subsubsection{Extensions}
In a denoising autoencoder (see \autoref{sec:denoising_ae}), Gaussian noise with different standard deviations $\sigma$ is added to the input images during training. In a sparse autoencoder (see \autoref{sec:sparse_ae}), an L1 penalty on the bottleneck activations is added to the loss function with different sparsity multipliers $\lambda$. A KL divergence-based penalty was not examined due to time constraints. The experiments are repeated for all configurations of the two extensions, as well as for a combination of both.
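Within the training loop sketched above, both extensions only change the loss computation, as illustrated below (\texttt{model.encode} as assumed before; the $\sigma$ and $\lambda$ values are placeholders):
\begin{verbatim}
def extended_loss(model, batch, criterion, sigma=0.1, lam=1e-4):
    """Denoising (Gaussian input noise) and sparse (L1 bottleneck penalty) loss."""
    noisy = batch + sigma * torch.randn_like(batch)       # denoising: corrupt input
    reconstruction = model(noisy)
    loss = criterion(reconstruction, batch)                # reconstruct clean target
    loss = loss + lam * model.encode(noisy).abs().mean()   # sparsity: L1 penalty
    return loss
\end{verbatim}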
\section{Evaluation}
All approaches yield a function that maps every motion image to a real number, which then has to be thresholded. As the reliability of the algorithm depends on the chosen threshold, ROC curves (see \autoref{sec:roc_curves}) mapping the false positive rate (FPR) to the true positive rate (TPR) are generated for every approach and every session. The AUC score is used as the main comparison metric. However, the AUC score often does not describe well how suitable an approach is for the problem of empty image elimination. Therefore, elimination rates are introduced as an additional metric: The elimination rate $\TNR_{\TPR \geq x}$ describes the highest possible true negative rate (TNR) for a true positive rate of at least $x$. Here, we choose $x \in \{ 0.9, 0.95, 0.99 \}$. Descriptively speaking: if we want to keep at least 90 \% (95 \%, 99 \%) of the interesting images, what percentage of empty images can we eliminate? This metric is more relevant to our problem than the AUC score, since we prioritize keeping a large number of interesting images.
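Both metrics can be computed directly from the ROC curve, for example with scikit-learn (a sketch; the scores are assumed to be oriented such that higher values indicate more anomalous images):
\begin{verbatim}
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

def evaluate(labels, scores, min_tpr=0.9):
    """AUC and elimination rate TNR_{TPR >= min_tpr} for one session.

    labels: 1 for anomalous (interesting), 0 for normal (empty) motion images.
    scores: higher values indicate more anomalous images.
    """
    fpr, tpr, _ = roc_curve(labels, scores)
    # Highest TNR (= 1 - FPR) among all thresholds with TPR of at least min_tpr.
    elimination_rate = (1 - fpr[tpr >= min_tpr]).max()
    return roc_auc_score(labels, scores), elimination_rate
\end{verbatim}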