
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Felix Kleinsteuber
% Title: Anomaly Detection in Camera Trap Images
% File: chap01-introduction/chap01-introduction.tex
% Part: introduction
% Description:
% summary of the content in this chapter
% Version: 16.05.2022
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Introduction} % \chapter{Einf\"{u}hrung}
\label{chap:introduction}
Over the past three decades, there has been a surge of interest in the study of Earth's biodiversity loss. Major international research projects and hundreds of experiments have shown that biodiversity plays a crucial role in the efficiency and stability of ecosystems \cite{Cardinale12:BiodiversityLoss}. Since it is a significant indicator of environmental health and the functioning of ecosystems \cite{Bianchi22:BiodiversityMonitoring}, researchers continuously monitor wildlife as a quantifier of biodiversity. To aid these efforts, national parks have installed camera traps to keep track of the appearances of different species. The resulting images are often analyzed manually by professionals. However, camera traps produce vast amounts of image data, making this process slow, cumbersome, and inefficient. Yet, to obtain meaningful statistics, large numbers of images must be analyzed. Over the past few years, several approaches have been proposed to support this image analysis using computer vision.
Analyzing a camera trap image can be divided into two smaller problems: first, determining whether the image contains any animals at all, and second, identifying the species of the animals. While the second problem is hard even for trained professionals, especially when the image quality is poor or the animal is partially occluded, solving the first task does not require any professional knowledge.
Still, eliminating empty images with sufficiently high accuracy can increase the efficiency of the whole process enormously, since empty images often make up the majority of all images. False triggering is caused by a multitude of events, such as moving plants (e.g., grass or tree branches), camera movement, rapid lighting changes, or weather conditions. Especially in a forest environment, such events occur frequently and can influence the image content in very different ways. For this reason, finding features that are distinctive for false triggers is not an easy task.
% -------------------------------------------------------------------
\section{Purpose of this work} % \section{Aufgabenstellung}
\label{sec:task}
The purpose of this work is to distinguish camera trap images with content of interest (i.e., animals or humans) from falsely triggered images without such content. The latter are to be eliminated from further processing. To this end, different anomaly detection methods are employed and compared in terms of accuracy and efficiency. It is neither desired to distinguish between images of humans and images of animals, nor to filter out animal images that do not qualify for further analysis (e.g., because the image quality is poor or the animal is too small or partially occluded).
In the context of anomaly detection, we treat images depicting only background as \emph{normal}, whereas images with a foreground object are considered \emph{anomalous}.
Different approaches using traditional computer vision methods as well as deep learning are proposed, implemented in Python, and compared on image data from three camera trap stations provided by the Bavarian Forest National Park in Germany. The data from each station contains images triggered by motion and, additionally, images taken every hour, which can serve as a baseline of normal images (neglecting the unlikely case that an animal is in the image at that particular time).
A particular focus lies on keeping the number of false eliminations low: it is preferable to retain some empty images rather than to eliminate images with animals and thereby possibly distort the wildlife statistics. Additionally, because only a few annotations were available at the time of writing, a small annotation tool was included in the implementation.
% -------------------------------------------------------------------
\section{Related work} % \section{Literatur\"{u}berblick}
\label{sec:relatedWork}
\subsection{Frame differencing}
\label{subsec:related_framedifferencing}
Frame differencing is a method commonly used in video processing and refers to the simple procedure of computing the difference image between two video frames. This difference image can then be used to detect moving objects in video surveillance settings \cite{Collins00:VideoSurveillance} or in video conferencing systems where the camera automatically follows the presenter \cite{Gupta07:FrameDifferencing}. This basic approach has two shortcomings: it often fails under lighting changes, and for very slow-moving objects, the difference image will be nearly zero. Moreover, it is affected by moving background objects, such as rain, snow, moving leaves or grass, and changing shadows. To filter out such distractions and thereby increase the reliability of frame differencing, thresholding and morphological erosion operations can be applied to the difference image \cite{Gupta07:FrameDifferencing}.
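The thresholding and erosion steps can be sketched in a few lines of NumPy. This is a minimal illustration rather than the exact pipeline of \cite{Gupta07:FrameDifferencing}; the function name, the threshold value, and the $3 \times 3$ structuring element are illustrative choices.

```python
import numpy as np

def frame_difference_mask(prev, curr, thresh=25, erode_iters=1):
    """Binary motion mask via frame differencing, thresholding, and a
    3x3 morphological erosion (implemented with shifted array views)."""
    # Absolute difference image between consecutive frames
    diff = np.abs(curr.astype(np.int16) - prev.astype(np.int16))
    mask = diff > thresh
    for _ in range(erode_iters):
        # Erosion: a pixel survives only if its full 3x3 neighbourhood is set
        padded = np.pad(mask, 1, constant_values=False)
        mask = np.ones_like(mask)
        for dy in (0, 1, 2):
            for dx in (0, 1, 2):
                mask &= padded[dy:dy + mask.shape[0], dx:dx + mask.shape[1]]
    return mask
```

The erosion removes isolated above-threshold pixels, such as those caused by sensor noise or small moving leaves, while a coherent object region keeps at least its interior pixels.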
\subsection{Deep learning methods for anomaly detection}
\label{subsec:related_deepad}
Deep learning is a form of representation learning employing machine learning models with multiple layers that learn representations of data at multiple levels of abstraction \cite{LeCun15:DeepLearning}. Deep convolutional networks are the state-of-the-art models for image processing and can be used for classification, object detection, and segmentation.
In object segmentation tasks, it is assumed that all classes encountered at test time have already been observed during training. Novel objects, however, appear only at test time and by definition cannot be mapped to one of the existing normal labels \cite{Jiang22:VisualSensoryADSurvey}. To detect novel objects, the prediction confidence of the segmentation model is often leveraged by marking regions with high uncertainty as anomalous. However, \cite{Lis19:ADImageResynthesis} argue that low prediction confidence is not a reliable indicator of anomalies since it yields many false positives. Instead, they propose an approach in which a generative model resynthesizes the original image from the segmentation map; regions with a high reconstruction loss are then considered anomalous. The rationale behind this approach is that segmentation models produce spurious label outputs in anomalous regions, which translate into poor reconstructions after resynthesis. As \cite{DiBiase21:PixelwiseAD} have shown, uncertainty and image resynthesis carry complementary information and can be combined into an even more robust approach for pixel-wise anomaly detection.
Often, the location of a visual anomaly is not important. In such cases, models only need to choose between two classes: normal and anomalous. Autoencoders, as explained in \autoref{sec:theory_autoencoders}, learn a good compressed representation of known (normal) samples. \cite{Japkowicz99:FirstAE} first applied an autoencoder to anomaly detection by measuring the reconstruction loss. The rationale here is that the redundant information in normal data differs from that in anomalous data; that is, a model that compresses normal data well by eliminating redundant information compresses anomalous data poorly. For this reason, the reconstruction error is considered a good measure of anomaly.
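As a minimal sketch of reconstruction-error scoring, the following uses a linear encoder/decoder obtained via PCA as a stand-in for a trained autoencoder. The data layout, the two-dimensional bottleneck, and the threshold choice are all illustrative assumptions, not part of the cited method.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" training data: samples lying close to a 2-D subspace of R^10
basis = rng.normal(size=(2, 10))
X_train = rng.normal(size=(200, 2)) @ basis + 0.01 * rng.normal(size=(200, 10))

# Linear "autoencoder": encoder/decoder given by the top principal components
mean = X_train.mean(axis=0)
_, _, Vt = np.linalg.svd(X_train - mean, full_matrices=False)
W = Vt[:2]                        # encoder; W.T acts as the decoder

def reconstruction_error(x):
    """Per-sample squared error after encoding to 2-D and decoding back."""
    z = (x - mean) @ W.T          # encode (bottleneck representation)
    x_hat = z @ W + mean          # decode (reconstruction)
    return ((x - x_hat) ** 2).sum(axis=-1)

# Normal samples reconstruct well; off-subspace (anomalous) samples do not
threshold = reconstruction_error(X_train).max()   # illustrative threshold
normal = rng.normal(size=(5, 2)) @ basis
anomalous = rng.normal(size=(5, 10))
is_anomalous = reconstruction_error(anomalous) > threshold
```

A trained nonlinear autoencoder replaces the PCA projection with learned encoder and decoder networks, but the scoring principle, i.e., thresholding the reconstruction error, is the same.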
Another approach to deep anomaly detection combines deep classifiers with one-class classification methods. \cite{Perera19:DeepOCC} proposed a method based on transfer learning in which descriptive features are extracted from a pretrained Convolutional Neural Network (CNN) and classified by thresholding the distance to the $k$ nearest neighbours. \cite{Oza19:OCCNN} first presented an end-to-end CNN for one-class classification, which also allows for transfer learning.
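The distance-thresholding step of such a one-class method can be sketched as follows. The CNN feature extractor is omitted here, and the function name, feature dimensionality, and data are illustrative assumptions.

```python
import numpy as np

def knn_anomaly_score(train_feats, query_feats, k=3):
    """Anomaly score = mean Euclidean distance to the k nearest
    training features (distance-based one-class classification)."""
    # Pairwise distance matrix of shape (n_query, n_train)
    d = np.linalg.norm(query_feats[:, None, :] - train_feats[None, :, :],
                       axis=-1)
    # Average over the k smallest distances per query sample
    knn = np.sort(d, axis=1)[:, :k]
    return knn.mean(axis=1)

# Illustrative features: normal samples cluster tightly in feature space
rng = np.random.default_rng(1)
train = rng.normal(scale=0.1, size=(50, 4))
normal_q = rng.normal(scale=0.1, size=(3, 4))
anomalous_q = normal_q + 5.0      # far away from the training cluster
scores_n = knn_anomaly_score(train, normal_q)
scores_a = knn_anomaly_score(train, anomalous_q)
```

Samples whose score exceeds a chosen threshold are declared anomalous; in the method of \cite{Perera19:DeepOCC}, the features would come from a pretrained CNN rather than being synthetic.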
% -------------------------------------------------------------------
% You can insert additional sections like notations or something else
% -------------------------------------------------------------------
\section{Overview} % \section{Aufbau der Arbeit}
\label{sec:overview}
This work is divided into six chapters. The following chapter details the theory of anomaly detection, image comparison, and relevant computer vision frameworks; some existing solutions to similar problems are compiled and compared to the proposed approaches. In Chapter 3, the dataset is analyzed and four different approaches for empty-image detection are proposed. These approaches are then evaluated and compared in Chapter 4. Chapters 5 and 6 conclude this thesis with a summary of the key takeaways and ideas for future work.
% -------------------------------------------------------------------