% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Felix Kleinsteuber
% Title: Anomaly Detection in Camera Trap Images
% File: furtherWork/furtherWork.tex
% Part: ideas for further work
% Description:
% summary of the content in this chapter
% Version: 16.05.2022
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Further work} % \chap{Ausblick}
\label{chap:furtherWork}
% \begin{itemize}
% \item explain ideas for further work
% \item what is not yet solved?
% \item possible extensions/improvements
% \item you can use different sections (and even subsections) for different ideas, problems
% \item a single text, e.g. with paragraphs, is also sufficient
% \end{itemize}
% -------------------------------------------------------------------
There are several improvements and extensions to the individual approaches that were considered but not implemented in this work. The following paragraphs discuss some of these possible improvements.
\paragraph{Lapse Frame Differencing} For simplicity, the images were converted to grayscale before taking the difference image. Additional RGB color information could improve the accuracy for daytime images. The approach could also benefit from switching to a different color space such as HSV (hue, saturation, value) or L*a*b* (lightness $L^*$, color plane $(a^*, b^*)$). For example, \cite{Haensch14:ColorSpacesForGraphCut} show that graph-cut segmentations based on L*a*b* are of higher quality than those in any other major color space, whereas segmentations in RGB are mostly worse than in any other space. It is conceivable that these findings transfer to the frame differencing approach.
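As an illustration, the following sketch computes a per-pixel difference in L*a*b* space instead of grayscale. It assumes OpenCV and aligned images of equal size; the function name and the choice of anomaly score are illustrative, not part of the implementation in this work.
\begin{verbatim}
# Sketch: frame differencing in L*a*b* instead of grayscale (assumes OpenCV).
import cv2
import numpy as np

def lab_difference(motion_img, lapse_img):
    """Per-pixel Euclidean distance between two aligned BGR images in L*a*b*."""
    lab_motion = cv2.cvtColor(motion_img, cv2.COLOR_BGR2LAB).astype(np.float32)
    lab_lapse = cv2.cvtColor(lapse_img, cv2.COLOR_BGR2LAB).astype(np.float32)
    # Combining all three channels yields a single difference map
    return np.linalg.norm(lab_motion - lab_lapse, axis=2)

# e.g. use the mean (or a high percentile) of the difference map as a score:
# score = lab_difference(motion, lapse).mean()
\end{verbatim}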
\paragraph{Median Frame Differencing} The main problem with taking the median image is that it often fails to resemble the background accurately. Future work could examine whether such failure cases can be detected automatically. Additionally, in scenarios where lapse images are available, Lapse and Median Frame Differencing could be combined to achieve higher reliability.
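If lapse images are available, one conceivable heuristic for detecting a poor median estimate is to compare it against a temporally close lapse image. The sketch below follows this idea; the comparison measure and the threshold are illustrative assumptions, not results of this work.
\begin{verbatim}
# Sketch: detecting a poor median background estimate by comparing it to a
# temporally close lapse image (measure and threshold are illustrative).
import numpy as np

def median_estimate_is_unreliable(median_img, lapse_img, threshold=20.0):
    """Flag the median estimate if it deviates strongly from the lapse image."""
    deviation = np.abs(median_img.astype(np.float32)
                       - lapse_img.astype(np.float32)).mean()
    return deviation > threshold
\end{verbatim}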
  23. \paragraph{Bag of Visual Words} In the current implementation, standard SIFT descriptors \cite{Lowe04:SIFT} are used. For daylight images, it is presumably beneficial to also employ color information. Different colored extensions to SIFT exist such as CSIFT \cite{Abdel-Hakim06:CSIFT}. Alternatively, a single channel from another color space could be used as described in the Lapse Frame Differencing paragraph.
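As a sketch of the latter alternative, SIFT descriptors could be computed on a single channel of another color space, for example the hue channel. OpenCV 4.4 or newer is assumed; the choice of channel is illustrative.
\begin{verbatim}
# Sketch: SIFT descriptors on the hue channel instead of grayscale
# (assumes OpenCV >= 4.4, where SIFT is part of the main module).
import cv2

def sift_on_hue(bgr_img):
    hue = cv2.cvtColor(bgr_img, cv2.COLOR_BGR2HSV)[:, :, 0]
    sift = cv2.SIFT_create()
    _, descriptors = sift.detectAndCompute(hue, None)
    return descriptors  # (n_keypoints, 128), input to the k-means vocabulary
\end{verbatim}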
\paragraph{Autoencoder} There are numerous extensions to explore. The KL-divergence-based sparsity constraint (as explained in \autoref{sec:sparse_ae}) has not been implemented. It is possible that this extension could slightly improve the autoencoder's performance. Moreover, it could be beneficial to adjust the model architecture. Currently, a fully convolutional model with few parameters is used for simplicity and efficiency. An autoencoder with dense layers can model global connections that a fully convolutional one cannot, but requires training significantly more parameters. It is also possible that a model with a larger input image size (such as $512 \times 512$) would be better at detecting smaller anomalies: many of the images where the autoencoder fails contain small birds or just the eyes of an animal. Another modification of autoencoders is the Variational Autoencoder (VAE) \cite{Kingma13:VAE}, which has also been applied successfully to anomaly detection \cite{Xu18:VAEforAD,Krajsic21:VAEforAD}. Other deep learning models, such as Generative Adversarial Networks (GANs) \cite{Goodfellow14:GANs}, have been applied successfully in the weakly supervised anomaly detection setting, but they are much harder to train and often require more data.
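A minimal sketch of the KL-divergence sparsity term is given below, assuming a PyTorch implementation with latent activations in $(0, 1)$, e.g.\ after a sigmoid; the target sparsity $\rho$ and the weighting factor are illustrative choices.
\begin{verbatim}
# Sketch: KL-divergence sparsity penalty (assumes PyTorch and latent
# activations in (0, 1); rho and beta are illustrative).
import torch

def kl_sparsity_penalty(latent, rho=0.05):
    """KL divergence between target sparsity rho and mean unit activation."""
    rho_hat = latent.mean(dim=0).clamp(1e-6, 1 - 1e-6)
    kl = rho * torch.log(rho / rho_hat) \
         + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat))
    return kl.sum()

# loss = reconstruction_loss + beta * kl_sparsity_penalty(encoder_output)
\end{verbatim}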
\paragraph{Evaluation} A particular challenge in creating this work was ordering, preprocessing, and annotating the dataset. Unfortunately, at the time of writing, no other annotated sessions were available. In future work, annotating and evaluating more sessions could lead to clearer results. It could also help answer the question of whether the different approaches fail on different image contents. If this is the case, combining the approaches in a majority voting scheme could further improve reliability.
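Such a majority vote over the individual threshold decisions could look as follows; the per-approach scores, thresholds, and the voting rule are illustrative assumptions.
\begin{verbatim}
# Sketch: majority voting over several threshold classifiers
# (approach names, scores and thresholds are illustrative).
import numpy as np

def majority_vote(scores, thresholds):
    """scores/thresholds: dicts mapping an approach name to per-image scores
    and to its chosen threshold; returns a boolean decision per image."""
    votes = np.stack([scores[name] > thresholds[name] for name in scores])
    return votes.mean(axis=0) >= 0.5
\end{verbatim}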
Since all proposed methods operate as threshold classifiers, a suitable threshold must be found before they can be applied in practice. Further research could bring the approaches into practical use and develop heuristics for choosing threshold values.
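One conceivable heuristic, sketched below, is to fix the tolerated false-positive rate on a validation set of known empty images; the target rate of 5\% is only an illustrative choice, not a recommendation derived from this work.
\begin{verbatim}
# Sketch: choosing a threshold by fixing the false-positive rate on a
# validation set of known empty (normal) images; 5% is illustrative.
import numpy as np

def threshold_for_fpr(normal_scores, target_fpr=0.05):
    """Smallest threshold flagging at most target_fpr of the normal images."""
    return float(np.quantile(normal_scores, 1.0 - target_fpr))
\end{verbatim}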