							% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author:   Felix Kleinsteuber
% Title:    Anomaly Detection in Camera Trap Images
% File:     conclusions/conclusions.tex
% Part:     conclusions
% Description:
%         summary of the content in this chapter
% Version:  16.05.2022
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Conclusions} % \chap{Zusammenfassung}
\label{chap:conclusions}

% -------------------------------------------------------------------

In this thesis, four approaches for eliminating empty camera trap images were examined. A particular challenge in this setting is keeping the false negative rate (FNR) low, i.e., not falsely eliminating images that contain animals. Even a classifier with a good AUC score can be useless in this context if it cannot eliminate a substantial share of empty images while the FNR stays low.
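
One way to make this precondition concrete is sketched below. The notation ($s$, $\theta$, $\alpha$, $\mathcal{A}$, $\mathcal{E}$) is introduced here purely for illustration, and the elimination rate is read as the fraction of empty images removed, which is an assumption of this sketch. Let $s(x)$ be the anomaly score of an image $x$, let $\mathcal{A}$ be the set of images containing animals and $\mathcal{E}$ the set of empty images, and suppose that an image is eliminated whenever $s(x) < \theta$. Then
\[
\mathrm{FNR}(\theta) = \frac{|\{x \in \mathcal{A} : s(x) < \theta\}|}{|\mathcal{A}|},
\qquad
\mathrm{TNR}(\theta) = \frac{|\{x \in \mathcal{E} : s(x) < \theta\}|}{|\mathcal{E}|},
\]
and the quantity of practical interest is the best elimination rate under a bound $\alpha$ on the false negative rate,
\[
\max_{\theta} \; \mathrm{TNR}(\theta) \quad \mbox{subject to} \quad \mathrm{FNR}(\theta) \le \alpha .
\]
The AUC, by contrast, aggregates over all operating points, which is why a high AUC does not guarantee a high elimination rate for small $\alpha$.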

The lapse frame differencing approach showed that even simple pixel-wise comparison methods can distinguish between normal and anomalous images. Although only some images can be safely eliminated without driving up the false negative rate, the accuracy achieved is high enough to noticeably accelerate the image analysis process. The success of this method suggests that camera trap operators should keep generating lapse images, i.e., additionally trigger the camera once an hour, so that this method can be deployed. Lapse frame differencing is a simple, easy-to-understand approach and, moreover, the fastest one, as it does not require any training process. Furthermore, it adapts quickly to changing weather and lighting conditions or different camera settings since it only refers to the temporally closest lapse image.
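
As a minimal sketch of such a pixel-wise comparison, assume grayscale images of size $H \times W$ and the mean absolute difference as the measure; both choices, as well as the symbols $x$, $\ell$, and $d$, are illustrative assumptions and not necessarily the exact variant evaluated in this thesis. For an image $x$ under test and its temporally closest lapse image $\ell$, a score could be
\[
d(x, \ell) = \frac{1}{HW} \sum_{i=1}^{H} \sum_{j=1}^{W} \left| x_{ij} - \ell_{ij} \right| ,
\]
where a small $d(x, \ell)$ suggests an empty image and a large value suggests an anomaly. No parameters are fitted, which reflects why the method needs no training.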

However, especially in older existing datasets, lapse images in close temporal proximity often do not exist. To address this issue, the median frame differencing approach was proposed and compared to lapse frame differencing. As expected, in direct comparison, its results are slightly below those of lapse frame differencing since the median image does not accurately represent the background of every image. This happens when the foreground object does not move enough or when the lighting conditions change, often because the aperture is adjusted in response to darkening caused by the animal. Still, this approach is just as fast, requires no training process, adapts quickly, and can be used to eliminate a significant portion of empty images even when no lapse images are available. Even outside of this use case, it outperformed lapse frame differencing in terms of elimination rates for both eligible datasets.
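
A minimal sketch of the median image, under the assumption that it is computed pixel-wise over a set of $n$ images $x^{(1)}, \dots, x^{(n)}$ taken close together in time (the symbols and the choice of set are illustrative and not taken from the thesis):
\[
m_{ij} = \mathrm{median}\left\{ x^{(1)}_{ij}, \dots, x^{(n)}_{ij} \right\} .
\]
Each image $x^{(k)}$ is then compared against $m$ in place of a lapse image. If the animal covers pixel $(i,j)$ in most of the $n$ images, $m_{ij}$ tends to reflect the animal rather than the background, which corresponds to the failure case of a foreground object that does not move enough.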

When it comes to comparing image contents, local features have successfully been used for classification tasks for many years. In a Bag of Visual Words approach, local features are densely sampled from lapse images, clustered into a vocabulary, and then used for fitting a one-class support vector machine. In contrast to the first two approaches, a training process is required, controlled by several hyperparameters.
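
The pipeline above can be summarized in formulas. The following is a sketch under assumptions: hard assignment of features to their nearest visual word and the standard one-class SVM decision function are used for illustration, and all symbols are introduced here rather than taken from the thesis. Given a vocabulary $v_1, \dots, v_K$ and local features $f_1, \dots, f_N$ sampled from an image $x$, a normalized histogram is
\[
h_k(x) = \frac{1}{N} \left| \left\{ n : \arg\min_{j} \| f_n - v_j \| = k \right\} \right| , \qquad k = 1, \dots, K ,
\]
and a one-class SVM fitted on the histograms $h(x_1), \dots, h(x_M)$ of $M$ lapse images scores a new image by
\[
g(x) = \sum_{m=1}^{M} \alpha_m \, \kappa\bigl( h(x_m), h(x) \bigr) - \rho ,
\]
with kernel $\kappa$, learned coefficients $\alpha_m \ge 0$, and offset $\rho$. Images with $g(x) < 0$ lie outside the region learned from the lapse images and are treated as anomalous. The vocabulary size $K$, the kernel, and the one-class SVM parameter $\nu$ are typical examples of such hyperparameters.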

The most successful configuration achieves lower AUC scores than frame differencing but similar elimination rates at low FNR. Yet, it has conceptual advantages: First, objects with forest-like brightness but different textures are hard to detect by frame differencing, whereas the local features are affected by the texture change. Second, large anomalous objects cause large image differences and therefore high anomaly scores in frame differencing. This is not necessarily the case for local feature-based approaches: Here, anomalies manifest themselves in the absence of normal features or the presence of abnormal ones.

The biggest disadvantage of using more complex approaches is training time: Both the accuracy and the computational effort of the local feature approach increase with the number of keypoints. However, experiments showed that choosing random prototypes for the vocabulary instead of clustering performs equivalently while being much faster, with a constant running time.

Lastly, the autoencoder approach slightly outperforms the local feature approach on two of the datasets. A possible reason for the performance gain is the use of color information. On Marten\_01, however, the multiple camera position changes cause the model's elimination rate to decline. The autoencoder approach (approach 4) therefore appears to be less reliable in the experiments. Extensions such as the denoising autoencoder and the sparse autoencoder have little effect on the scores. Presumably, the architecture of the autoencoder model can be improved.

In summary, the experiments show that it is often possible to eliminate the majority of empty images from camera trap datasets with relatively simple methods. The proposed methods work in a weakly supervised manner or, in the case of approach 2, even in a completely unsupervised manner, and can thus minimize human annotation effort on the test set.

% -------------------------------------------------------------------

% insert further sections if necessary