Felix Kleinsteuber 2 years ago
parent
commit
73ffc72968

+ 220 - 0
.gitignore

@@ -0,0 +1,220 @@
+## Core latex/pdflatex auxiliary files:
+*.aux
+*.lof
+*.log
+*.lot
+*.fls
+*.out
+*.toc
+*.fmt
+*.fot
+*.cb
+*.cb2
+
+## Intermediate documents:
+*.dvi
+*-converted-to.*
+# these rules might exclude image files for figures etc.
+# *.ps
+# *.eps
+# *.pdf
+
+## Generated if empty string is given at "Please type another file name for output:"
+.pdf
+
+## Bibliography auxiliary files (bibtex/biblatex/biber):
+*.bbl
+*.bcf
+*.blg
+*-blx.aux
+*-blx.bib
+*.run.xml
+
+## Build tool auxiliary files:
+*.fdb_latexmk
+*.synctex
+*.synctex(busy)
+*.synctex.gz
+*.synctex.gz(busy)
+*.pdfsync
+
+## Auxiliary and intermediate files from other packages:
+# algorithms
+*.alg
+*.loa
+
+# achemso
+acs-*.bib
+
+# amsthm
+*.thm
+
+# beamer
+*.nav
+*.pre
+*.snm
+*.vrb
+
+# changes
+*.soc
+
+# cprotect
+*.cpt
+
+# elsarticle (documentclass of Elsevier journals)
+*.spl
+
+# endnotes
+*.ent
+
+# fixme
+*.lox
+
+# feynmf/feynmp
+*.mf
+*.mp
+*.t[1-9]
+*.t[1-9][0-9]
+*.tfm
+
+#(r)(e)ledmac/(r)(e)ledpar
+*.end
+*.?end
+*.[1-9]
+*.[1-9][0-9]
+*.[1-9][0-9][0-9]
+*.[1-9]R
+*.[1-9][0-9]R
+*.[1-9][0-9][0-9]R
+*.eledsec[1-9]
+*.eledsec[1-9]R
+*.eledsec[1-9][0-9]
+*.eledsec[1-9][0-9]R
+*.eledsec[1-9][0-9][0-9]
+*.eledsec[1-9][0-9][0-9]R
+
+# glossaries
+*.acn
+*.acr
+*.glg
+*.glo
+*.gls
+*.glsdefs
+
+# gnuplottex
+*-gnuplottex-*
+
+# gregoriotex
+*.gaux
+*.gtex
+
+# hyperref
+*.brf
+
+# knitr
+*-concordance.tex
+# TODO Comment the next line if you want to keep your tikz graphics files
+*.tikz
+*-tikzDictionary
+
+# listings
+*.lol
+
+# makeidx
+*.idx
+*.ilg
+*.ind
+*.ist
+
+# minitoc
+*.maf
+*.mlf
+*.mlt
+*.mtc[0-9]*
+*.slf[0-9]*
+*.slt[0-9]*
+*.stc[0-9]*
+
+# minted
+_minted*
+*.pyg
+
+# morewrites
+*.mw
+
+# nomencl
+*.nlo
+
+# pax
+*.pax
+
+# pdfpcnotes
+*.pdfpc
+
+# sagetex
+*.sagetex.sage
+*.sagetex.py
+*.sagetex.scmd
+
+# scrwfile
+*.wrt
+
+# sympy
+*.sout
+*.sympy
+sympy-plots-for-*.tex/
+
+# pdfcomment
+*.upa
+*.upb
+
+# pythontex
+*.pytxcode
+pythontex-files-*/
+
+# thmtools
+*.loe
+
+# TikZ & PGF
+*.dpth
+*.md5
+*.auxlock
+
+# todonotes
+*.tdo
+
+# easy-todo
+*.lod
+
+# xindy
+*.xdy
+
+# xypic precompiled matrices
+*.xyc
+
+# endfloat
+*.ttt
+*.fff
+
+# Latexian
+TSWLatexianTemp*
+
+## Editors:
+# WinEdt
+*.bak
+*.sav
+
+# Texpad
+.texpadtmp
+
+# Kile
+*.backup
+
+# KBibTeX
+*~[0-9]*
+
+# auto folder when using emacs and auctex
+/auto/*
+
+# expex forward references with \gathertags
+*-tags.tex

+ 3 - 0
.vscode/settings.json

@@ -0,0 +1,3 @@
+{
+    "editor.wordWrap": "on"
+}

+ 44 - 0
Makefile

@@ -0,0 +1,44 @@
+-include Makefile.cfg
+
+ifeq "$(COMPRESSION)" "0"
+DVIPDF_ARG+=-dAutoFilterColorImages=false -dColorImageFilter=/FlateEncode
+endif
+
+ifeq "$(LETTER)" "1"
+DVIPS_ARG+=-t letter
+endif
+
+#LATEX=latex
+LATEX=pdflatex
+
+.PRECIOUS:%.aux %.bbl
+
+%.dvi:%.tex
+
+%.aux:%.tex
+	$(LATEX) $<
+
+%.bbl:%.tex %.bib %.aux
+	bibtex $*
+	$(LATEX) $<
+	bibtex $*
+
+%.bbl:%.tex %.aux
+	@echo WARNING: no $*.bib found... assuming you are not using BibTeX
+	touch $@
+
+%.dvi:%.tex %.bbl
+	$(LATEX) $<
+
+%.ps:%.dvi
+	dvips -j0 -P generic $(DVIPS_ARG) $< -o $@
+
+#this is the old version using dvipdf, which cannot handle the letter paper size
+#%.pdf:%.dvi
+#	dvipdf $(DVIPDF_ARG) $<
+
+#this is the new version, going manually via dvips and ps2pdf
+#this is exactly what the dvipdf script does
+%.pdf:%.ps
+	ps2pdf14 $(DVIPDF_ARG) $< $@
+

+ 9 - 0
abstract.tex

@@ -0,0 +1,9 @@
+\begin{center}{\sectfont\LARGE {\"U}berblick}\end{center}
+
+In der Biodiversit{\"a}tsforschung werden Kamerafallen zur {\"U}berwachung von Tierpopulationen eingesetzt. Diese Vorrichtungen produzieren riesige Datenmengen, wodurch es f{\"u}r Spezialisten sehr m{\"u}hsam ist, die abgelichteten Tierarten zu klassifizieren. Au{\ss}erdem f{\"u}hren Beleuchtungs{\"a}nderungen, Wind und wechselnde Wetterbedingungen zu einer gro{\ss}en Anzahl leerer Bilder ohne Tiere. In dieser Arbeit werden mehrere Ans{\"a}tze aus der Anomalieerkennung untersucht, die teilweise auf traditionellen Bildverarbeitungsmethoden (Differenzbild, Bag of Visual Words) und teilweise auf Methoden des tiefen Lernens (Autoencoder) basieren, um leere Bilder zu eliminieren. So wird die Artenklassifizierung vereinfacht und dadurch der Annotationsaufwand f{\"u}r die Spezialisten minimiert. Verschiedene Konfigurationen dieser Ans{\"a}tze werden erforscht und bewertet. Anschlie{\ss}end werden die Ans{\"a}tze f{\"u}r Bilddaten aus unterschiedlichen Kamerafallen verglichen, die im Bayerischen Wald aufgestellt wurden. Experimente zeigen, dass alle Ans{\"a}tze f{\"u}r die meisten Kamerafallen eine signifikante Anzahl leerer Bilder eliminieren k{\"o}nnen. Allerdings erweisen sich die traditionellen Methoden, die auf Differenzbildern basieren, als die zuverl{\"a}ssigsten. Abschlie{\ss}end werden f{\"u}r jeden Ansatz m{\"o}gliche Erweiterungen zur Verbesserung der Genauigkeit diskutiert.
+
+\vspace{\fill}
+
+\begin{center}{\sectfont\LARGE Abstract}\end{center}
+
+In biodiversity research, camera traps are used to monitor wildlife populations. Such devices produce vast amounts of data, making the process of species recognition cumbersome. Moreover, lighting changes, wind, and changing weather conditions result in a high proportion of images without animals. In this work, several approaches based on anomaly detection are proposed using traditional computer vision methods (frame differencing, Bag of Visual Words) as well as deep learning (autoencoders) to eliminate such empty images, hence aiding the subsequent species classification step and minimizing the human annotation workload. Different configurations of these approaches are explored and evaluated. Next, the approaches are compared on image data from multiple camera traps installed in the Bavarian Forest National Park in Germany. Experiments show that all proposed approaches are able to eliminate a significant number of empty images for most of the utilized image sets. However, the traditional methods based on frame differencing prove to be the most reliable. Finally, possible extensions to improve accuracy are discussed for every approach.

+ 59 - 0
chapters/appendix/appendix.tex

@@ -0,0 +1,59 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     appendix/appendix.tex
+% Part:     appendix
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\chapter{Mathematical Details}
+\label{app:mathDetails}
+
+chapter intro\newline
+
+% -------------------------------------------------------------------
+
+\section{Further Explanations}
+\label{app:explanations}
+
+Explain some complicated issues in this section.\newline
+
+% -------------------------------------------------------------------
+
+\section{Fancy Proof}
+\label{app:proof}
+
+Prove a statement here.\newline
+
+%---------------------------------------------------------------------------
+% -------------------------------------------------------------------
+
+\chapter{Additional Results}
+\label{app:results}
+
+chapter intro\newline
+
+% -------------------------------------------------------------------
+
+\section{Nice Plots of Experiment 1}
+\label{app:exp1}
+
+include additional figures
+
+% -------------------------------------------------------------------
+
+\section{Nice Plots of Experiment 2}
+\label{app:exp2}
+
+include additional figures
+
+% -------------------------------------------------------------------
+
+\section{Nice Plots of Experiment 3}
+\label{app:exp3}
+
+include additional figures
+
+%---------------------------------------------------------------------------

+ 64 - 0
chapters/chap01-introduction/chap01-introduction.tex

@@ -0,0 +1,64 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     chap01-introduction/chap01-introduction.tex
+% Part:     introduction
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+
+\chapter{Introduction} % \chapter{Einf\"{u}hrung}
+\label{chap:introduction}
+
+Over the past three decades, there has been a surge of interest in the study of Earth's biodiversity loss. Major international research projects and hundreds of experiments have shown that biodiversity plays a crucial role in the efficiency and stability of ecosystems \cite{Cardinale12:BiodiversityLoss}. As it is a significant indicator of environmental health and the functioning of ecosystems \cite{Bianchi22:BiodiversityMonitoring}, researchers constantly monitor wildlife as a quantifier of biodiversity. To aid these efforts, national parks have installed camera traps to keep track of different species' appearances. The resulting images are often manually analyzed by professionals. However, camera traps produce vast amounts of image data, making this process slow, cumbersome, and inefficient. Yet, to obtain meaningful statistics, large numbers of images need to be analyzed. Several approaches have been proposed over the past few years to aid this image analysis using computer vision.
+
+Analyzing a camera trap image can be divided into two smaller problems: First, determining if the image contains any animals at all. Second, identifying the species of animals. While the second problem is hard even for trained professionals, especially when the image quality is poor or the animal is partially occluded, solving the first task does not require any professional knowledge.
+
+Still, eliminating empty images with sufficiently high accuracy can increase the efficiency of the whole process enormously. This is due to the fact that empty images often make up the majority of all images. False triggering is caused by a multitude of events, such as moving plants like grass or tree branches, camera movements, rapid lighting changes, or weather conditions. Especially in a forest environment, such events occur frequently and can influence the image content in very different manners. For this reason, it is not an easy task to find features that are distinctive for false triggers.
+
+% -------------------------------------------------------------------
+
+\section{Purpose of this work} % \section{Aufgabenstellung}
+\label{sec:task}
+
+The purpose of this work is to distinguish camera trap images with content of interest (i.e., animals or humans) from falsely triggered images without content of interest. The latter are to be eliminated prior to further processing. To achieve this, different methods of anomaly detection are employed and compared in terms of accuracy and efficiency. It is neither intended to distinguish images of humans from images of animals nor to filter out animal images that do not qualify for further analysis (e.g., because the image quality is poor or the animal is too small or partially occluded).
+
+In the context of anomaly detection, we treat images only depicting background as \emph{normal}, whereas images with a foreground object are considered \emph{anomalous}. 
+
+Different approaches using traditional computer vision methods as well as deep learning are proposed, implemented in Python, and compared on image data from three camera trap stations, provided by the Bavarian Forest National Park in Germany. The data from each station contains images triggered by movement and, additionally, images taken every hour, which can be used as a baseline for normal images (neglecting the unlikely cases when an animal is in the image at this particular time).
+
+A particular focus lies on keeping the number of false eliminations low. After all, it is preferable to keep some empty images compared to eliminating images with animals, thus possibly distorting the wildlife statistics. Additionally, because only a few annotations were provided at the time of writing, a small annotation tool was included in the implementation.
+
+% -------------------------------------------------------------------
+
+\section{Related work} % \section{Literatur\"{u}berblick}
+\label{sec:relatedWork}
+
+\subsection{Frame Differencing}
+\label{subsec:related_framedifferencing}
+Frame Differencing is a method commonly used in video processing and refers to the simple procedure of calculating the difference image between two video frames. This difference image can then be used to detect moving objects in video surveillance settings \cite{Collins00:VideoSurveillance} or for video conference systems where the camera automatically follows the presenter \cite{Gupta07:FrameDifferencing}. This basic approach has two shortcomings: it often fails under lighting changes, and for very slow-moving objects the difference image is nearly zero. Moreover, it is affected by moving background objects, such as rain, snow, moving leaves or grass, and changing shadows. To filter such distractions and hence increase the reliability of frame differencing, thresholding and morphological erosion operations can be performed on the difference image \cite{Gupta07:FrameDifferencing}.
+
+\subsection{Deep learning methods for anomaly detection}
+\label{subsec:related_deepad}
+Deep learning is a form of representation learning employing machine learning models with multiple layers that learn representations of data at multiple levels of abstraction \cite{LeCun15:DeepLearning}. Deep convolutional nets are the state-of-the-art models for image processing and can be used for classification, object detection, and segmentation.
+
+In object segmentation tasks, it is assumed that all classes found during test time have already been observed during training. Novel objects, however, appear only at test time and by definition cannot be mapped to one of the existing normal labels \cite{Jiang22:VisualSensoryADSurvey}. To find novel objects, the prediction confidence of the segmentation model is often leveraged by marking regions with high uncertainty as anomalous. However, \cite{Lis19:ADImageResynthesis} argue that low prediction confidence is not a reliable indicator of anomalies since it yields a lot of false positives. Instead, they propose an approach where a generative model is used to resynthesize the original image from the segmentation map. Regions with a high reconstruction loss are then considered anomalous. The rationale behind this approach is that segmentation models produce spurious label outputs in anomalous regions that translate to poor reconstructions after resynthesis. As \cite{DiBiase21:PixelwiseAD} have shown, the concepts of uncertainty and image resynthesis contain complementary information and can be combined into an even more robust approach for pixel-wise anomaly detection.
+
+Often, the location of a visual anomaly is not important. In such cases, models only need to choose between two classes: normal and anomalous. Autoencoders, as will be explained in \autoref{sec:theory_autoencoders}, learn to find a good compressed representation for known (normal) samples. \cite{Japkowicz99:FirstAE} first applied an autoencoder for anomaly detection by measuring the reconstruction loss. The rationale here is that redundant information in normal data differs from the redundant information in anomalous data, i.e., a model that compresses normal data well by eliminating redundant information compresses anomalous data poorly. For this reason, the reconstruction error is considered a good measure of anomalies.
+
+Another approach to deep anomaly detection combines deep classifiers with one-class classification methods. \cite{Perera19:DeepOCC} proposed a method based on transfer learning where descriptive features are extracted from a pretrained Convolutional Neural Network (CNN) and classified by thresholding the distance to the $k$ nearest neighbours. \cite{Oza19:OCCNN} first presented an end-to-end CNN for one-class classification, which also allows for transfer learning.
+
+% -------------------------------------------------------------------
+
+% You can insert additional sections like notations or something else
+
+% -------------------------------------------------------------------
+
+\section{Overview} % \section{Aufbau der Arbeit}
+\label{sec:overview}
+
+This work is divided into six chapters: In the following chapter, the theory of anomaly detection, image comparison, and relevant computer vision frameworks is detailed. Some existing solutions to similar problems are compiled and compared to the proposed approaches. In Chapter 3, the dataset is analyzed and four different approaches for empty image detection are proposed. These approaches are then evaluated and compared in Chapter 4. Chapters 5 and 6 conclude this thesis with a summary of the key takeaways and ideas for future work.
+
+
+% -------------------------------------------------------------------

+ 401 - 0
chapters/chap02/chap02.tex

@@ -0,0 +1,401 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     chap02/chap02.tex
+% Part:     theoretical background
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapter{Theoretical background}
+\label{chap:background}
+
+In the following chapter, the most important theoretical concepts employed in this work are presented, such as machine learning, classification, anomaly detection, image comparison methods, local features, and deep learning methods. In addition, the statistical evaluation of binary threshold classifiers is detailed.
+
+% -------------------------------------------------------------------
+
+\section{Basics of machine learning}
+\label{sec:theory_ml}
+First, some basic terms and notation will be defined. This section is based on \cite{Goodfellow16:DeepLearning}.
+
+\textbf{Machine learning (ML)} algorithms learn from data without being explicitly instructed, by finding patterns and drawing inferences in data. They make it possible to solve tasks that are too difficult to solve with fixed algorithms by analyzing \textbf{examples}. An example $\bm{x} \in \R^n$ is a collection of $n$ \textbf{features} encoded as real numbers. To be able to find patterns in the data, the algorithm requires a large set of training examples (\textbf{training set}). We denote the number of examples as $m$ and the $i$-th training example as $\bm{x}^{(i)} \in \R^n, 1 \leq i \leq m$.
+
+Most machine learning algorithms are optimization problems minimizing some kind of \textbf{loss} or \textbf{error function}. A loss function maps the desired and actual outputs of a ML model to a real value, the loss. Intuitively, when the actual output matches the desired one, the loss value should be minimal.
+
+\subsection{Classification}
+
+In classification tasks, the model is asked to predict which of $k$ categories some input belongs to. In other words, the algorithm learns a function $f : \R^n \to \{ 1, \dots, k \}$. When the output is $y = f(\bm{x})$, the model assigns the input to the $y$-th category.
+
+To be able to evaluate the abilities of a machine learning algorithm, a performance measure is required. For classification tasks, we often use the \textbf{accuracy}, which is defined as the proportion of examples for which the model produces the correct output.
+
+When fitting a ML algorithm to a problem, we want it to generalize well, i.e., to perform well on data it has not seen before. Therefore, to assess its performance in the real world, we evaluate performance measures such as the accuracy on a \textbf{test set} that is distinct from the training set.
+
+\subsection{Supervised and unsupervised tasks}
+
+ML algorithms can be roughly categorized as \emph{supervised} or \emph{unsupervised}:
+
+\begin{itemize}
+  \item In \textbf{supervised learning} problems, the training data is labeled, i.e., the task is to learn the function $f$, given training examples $\bm{x}^{(i)}$ and their respective outputs $y^{(i)}$.
+
+  \item In \textbf{unsupervised learning} problems, the training examples are unlabeled. The goal is to infer labels from the natural structure of the training set. Examples are \emph{clustering} algorithms, which group similar examples with respect to some distance measure.
+\end{itemize}
+
+\subsection{Support vector machines}
+In their simplest form, \textbf{support vector machines (SVMs)} tackle binary classification tasks by finding a linear hyperplane that separates both classes with maximal \emph{margin}. The margin is defined as the distance between the separating hyperplane and the closest data points, which are called \emph{support vectors}. To find this hyperplane, a constrained nonlinear optimization problem is solved \cite{Kecman05:SVMs}.
+
+\paragraph{Soft-margin SVMs} When the given classes cannot be linearly separated, such a hyperplane does not exist. However, a linear separating hyperplane might still be a good, generalized solution. Soft-margin SVMs therefore allow data points within a \emph{soft margin} of the separating hyperplane to be misclassified, at the cost of a penalty in the optimization objective.
+
+\paragraph{Nonlinear SVMs} When the given classes cannot be linearly separated well, the data points are first transformed into a higher dimensional feature space using a nonlinear mapping. By choosing a good mapping, the data points can become linearly separable in the feature space. The transformation can be done very efficiently by substituting the mapping for a kernel function $K(x, x')$ directly in the input space (\textbf{kernel trick}). A popular kernel function, which is used in all SVMs in this work, is the \textbf{radial basis function (RBF) kernel} defined as
+\begin{equation}
+  K(x, x') = \exp(-\gamma \| x - x' \|^2 )
+\end{equation}
+with a free parameter $\gamma$ \cite{Chang10:RBF}. The feature space of this kernel has an infinite number of dimensions. Thus, using the kernel trick, we can construct an SVM that operates in an infinite dimensional space (which would not be possible if we had to explicitly compute the mappings) \cite{Kecman05:SVMs}.
+
+By training multiple SVMs, it is possible to perform multi-class classification. Let $k$ be the number of classes. In a \emph{one-vs-all} scenario, $k$ SVMs are trained to distinguish each class from the rest. In a \emph{one-vs-one} scenario, $\frac{k (k-1)}{2}$ SVMs are trained to separate between all possible combinations of two classes. Though \emph{one-vs-one} requires significantly more SVMs to be trained, both options are equally suitable. Thus, the choice of method depends on personal preference and the properties of the dataset \cite{Gidudu:SVMsMultiClass}.
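+
+For illustration, the following minimal Python sketch (using scikit-learn and NumPy; the toy data, parameter values, and variable names are arbitrary and not taken from this work) fits a soft-margin SVM with an RBF kernel to two classes that are not linearly separable in the input space:
+\begin{verbatim}
+import numpy as np
+from sklearn.svm import SVC
+
+# Toy data: class 0 clusters around the origin, class 1 lies on a ring,
+# so the classes cannot be separated by a line in the input space.
+rng = np.random.default_rng(0)
+inner = rng.normal(scale=0.5, size=(100, 2))
+angles = rng.uniform(0, 2 * np.pi, 100)
+outer = np.stack([3 * np.cos(angles), 3 * np.sin(angles)], axis=1)
+X = np.vstack([inner, outer])
+y = np.array([0] * 100 + [1] * 100)
+
+# Soft-margin SVM with RBF kernel; gamma is the kernel parameter from above,
+# C controls how strongly margin violations are penalized.
+clf = SVC(kernel="rbf", gamma=0.5, C=1.0)
+clf.fit(X, y)
+print(clf.score(X, y))  # training accuracy
+\end{verbatim}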
+
+\section{Anomaly detection}
+\label{sec:theory_anomalydetection}
+In general, an anomaly is a deviation from a rule or from what is regarded regular or normal. According to \cite{Jiang22:VisualSensoryADSurvey}, a distinction can be made between semantic anomalies and sensory anomalies:
+
+\paragraph{Semantic anomalies} refer to anomalies at the label level, i.e., the semantic meaning of the sample has changed. This kind of anomaly always refers to the sample as a whole rather than specific parts. For example, for a model only trained on cat images, the image of a dog would be a semantic anomaly. Such anomalies can be detected using supervised approaches such as \emph{one-class classification (OCC)} (see \autoref{sec:occ}) or \emph{out-of-distribution (OOD) detection} (see \autoref{sec:odd}).
+
+\paragraph{Sensory anomalies} are anomalies at a specific part of the sample and can occur at three different levels: At \emph{object level}, a sensory anomaly is some defect in the otherwise normal object, e.g., a tumor in an image of an organ. At \emph{scene level}, novel objects occur in an otherwise normal scene, e.g., an unknown object on the road in an autonomous driving scenario. At \emph{video level}, abnormal events or actions occur in a video. This is often the case in video surveillance settings. Compared to semantic anomalies, sensory defects occur more regularly and naturally in most research fields. In this work, we consider anomalies at \emph{scene level}, where animals and humans are the novel objects.
+
+\subsection{Outlier vs novelty detection}
+Anomaly detection is a term that is often used ambiguously. Depending on the available training data, it can refer to either outlier or novelty detection.
+
+\paragraph{Outlier detection}
+In this scenario, the training data contains both normal and anomalous data, often due to mechanical faults, human error, instrument error, or fraudulent behavior \cite{Hodge04:OutlierDetectionSurvey}. The model seeks to find the distribution of the normal data, isolating the anomalous data points. Training is often done in an unsupervised manner.
+
+\paragraph{Novelty detection}
+Here, the model receives only normal data as input. It must then detect outliers at prediction time. As the model is only fed annotated normal data, this concept is a form of supervised learning and equivalent to \emph{one-class classification} where normal data makes up the positive class. Novelty detection is often employed when the amount of available anomalous data is insufficient \cite{Pimentel14:NoveltyDetection}.
+
+\subsection{One-Class Classification}
+\label{sec:occ}
+\textbf{One-class classification (OCC)} is a special case of supervised classification, where only data from a single positive class is observed during training \cite{Perera21:OCCSurvey}. The goal of the classifier is to recognize positive samples during inference while being able to separate them from negative samples from other semantic classes.
+
+\begin{figure}[tbhp]
+  \centering
+  \includegraphics[width=.8\textwidth]{images/occ.pdf}
+  \caption[Comparing multi-class to one-class methods]{Comparing multi-class to one-class methods \cite{Perera21:OCCSurvey}. In both multi-class settings, training data from different classes is available. In OCC, training data from only a single class is given.}
+  \label{fig:occ}
+\end{figure}
+
+\autoref{fig:occ} illustrates the differences between multi-class and one-class settings. In \emph{multi-class classification}, training data from multiple classes is required. The classes are separated by multiple decision boundaries. In a multi-class or \emph{one-vs-rest detection} setting, a single decision boundary is learned using training data from multiple classes, separating one normal class from the rest. In \emph{one-class classification}, training data from only the positive class is observed and a single decision boundary is learned to separate it from the unseen negative class.
+
+Consequently, OCC is a harder problem than the multi-class tasks since only training data from a single positive class is available instead of training data from all classes.
+
+A popular one-class classifier is the one-class support vector machine (OCSVM) \cite{Schoelkopf99:OneClassSVM}, which is a modification of the support vector machine. In binary SVMs, the optimization goal is to find a hyperplane separating both classes with the maximum margin. Similarly, the one-class SVM tries to separate the positive class from the origin of the feature space with the largest possible margin.
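+
+As an illustrative sketch (not the exact configuration used later in this work), a one-class SVM with an RBF kernel can be fitted in Python with scikit-learn; the feature matrices below are random placeholders:
+\begin{verbatim}
+import numpy as np
+from sklearn.svm import OneClassSVM
+
+# placeholder feature vectors of normal training images (one row per image)
+features_normal = np.random.rand(500, 128)
+
+# nu upper-bounds the fraction of training errors; gamma is the RBF parameter
+ocsvm = OneClassSVM(kernel="rbf", nu=0.1, gamma="scale")
+ocsvm.fit(features_normal)
+
+# decision_function: positive = inside the learned region (normal),
+# negative = outside (anomalous)
+features_test = np.random.rand(10, 128)
+scores = ocsvm.decision_function(features_test)
+\end{verbatim}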
+
+\subsection{Density estimation}
+\label{sec:odd}
+\textbf{Density estimation (DE)} methods attempt to model the distribution of normal training data. Unseen test data can then be separated by thresholding the likelihood of test data points under this distribution. Normal data is assumed to have a high likelihood since it was generated from the same underlying distribution as the training data whereas anomalous data is expected to have a lower likelihood \cite{Yang21:OODSurvey}. The term \textbf{out-of-distribution detection} refers to the classification of unseen test samples as part of vs. out of the distribution.
+
+A distinction can be made between parametric and non-parametric DE methods. The former fit a parametric distribution such as a (multi-variate) Gaussian to the training data by maximizing the overall likelihood. The latter make no assumptions about the underlying distribution and estimate the likelihood directly from the data points. The simplest example of non-parametric DE is a discrete histogram. A more sophisticated and often more accurate method is \textbf{Kernel Density Estimation (KDE)} \cite{Rosenblatt56:KDE1,Parzen62:KDE2}, which, in contrast to the histogram, provides a smooth likelihood curve. It requires a \textbf{kernel function} such as the Gaussian kernel defined by the density function of a standard normal distribution. Given $m$ samples $\bm{x}^{(i)}$, the estimated distribution is defined by
+\begin{equation}
+  \hat{f}(\bm{x}) = \frac{1}{mh} \sum_{i=1}^m K\left( \frac{\bm{x} - \bm{x}^{(i)}}{h} \right)
+\end{equation}
+where $K(x)$ is the kernel function, and $h$ is the \textbf{bandwidth} \cite{Silverman86:DensityEstimation}. The kernel estimator therefore models the distribution as a sum of identical kernel curves (``bumps''), each centered around one data point (see \autoref{fig:kde_example}).
+
+\begin{figure}[tb]
+  \centering
+  \includegraphics[width=.9\textwidth]{images/kde.pdf}
+  \caption[One-dimensional kernel estimate example with Gaussian kernels]{One-dimensional kernel estimate example from 7 samples with Gaussian kernels, $h = 0.4$.}
+  \label{fig:kde_example}
+\end{figure}
+
+The bandwidth $h$, also intuitively called the smoothing parameter, controls the smoothness of the estimated distribution by dampening the kernel function for high values of $h$. Provided that $K$ is a non-negative probability density function, $\hat{f}$ is also a non-negative probability density function.
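+
+The estimator can be written down directly from the formula above. The following NumPy sketch (one-dimensional data, Gaussian kernel, arbitrary sample values) is purely illustrative and not an optimized implementation:
+\begin{verbatim}
+import numpy as np
+
+def gaussian_kernel(u):
+    # density of a standard normal distribution
+    return np.exp(-0.5 * u ** 2) / np.sqrt(2 * np.pi)
+
+def kde(x, samples, h):
+    # kernel density estimate at x from 1-D samples with bandwidth h
+    return gaussian_kernel((x - samples) / h).sum() / (len(samples) * h)
+
+samples = np.array([0.2, 0.5, 0.9, 1.4, 1.6, 2.3, 3.0])  # arbitrary values
+print(kde(1.0, samples, h=0.4))
+\end{verbatim}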
+
+Classic density estimation methods are well suited for low-dimensional data but are often computationally expensive for high dimensional data such as images \cite{Yang21:OODSurvey}. This problem can be solved by performing dimensionality reduction, e.g., by finding local features (see \autoref{sec:theory_localfeatures}) or using deep models such as autoencoders (see \autoref{sec:theory_autoencoders}).
+
+\section{Comparing images}
+\label{sec:theory_comparingimages}
+To detect sensory anomalies such as animals, many approaches rely on comparing the anomalous image to a reference background image. Suppose we have a reference background image for every unclassified observed image.
+
+\subsection{Frame Differencing}
+\label{sec:theory_comparingimages_fd}
+
+The simplest approach to detect an anomalous object in an image $\tilde{I}$ is to calculate the pixel-wise absolute difference of $\tilde{I}$ and a normal reference image $I$:
+\begin{equation}
+  \Delta = \left| \tilde{I} - I \right|
+\end{equation}
+If there is an anomalous object present in $\tilde{I}$, the corresponding pixels in the \textbf{difference image} $\Delta$ would have high values. This can be measured using the mean or variance of the pixels of $\Delta$. The higher the mean and variance, the higher the likelihood that an anomalous object is present.
+
+This approach can only work reliably if the reference image is very close in time to $\tilde{I}$ and indeed only shows the empty scene. Furthermore, since two images are compared pixel by pixel, it is susceptible to noise, which can be partly eliminated using a Gaussian filter beforehand and a threshold on the difference image $\Delta$.
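+
+A minimal Python sketch of this procedure (using NumPy and OpenCV for the Gaussian filter; the function and parameter values are illustrative only):
+\begin{verbatim}
+import cv2
+import numpy as np
+
+def difference_score(img, ref, blur_ksize=5, threshold=None):
+    # Gaussian filtering reduces the influence of pixel noise
+    img = cv2.GaussianBlur(img, (blur_ksize, blur_ksize), 0).astype(np.float32)
+    ref = cv2.GaussianBlur(ref, (blur_ksize, blur_ksize), 0).astype(np.float32)
+    delta = np.abs(img - ref)               # difference image
+    if threshold is not None:
+        delta[delta < threshold] = 0        # suppress small differences
+    return delta.mean(), delta.var()        # high values suggest an object
+\end{verbatim}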
+
+\subsection{Background estimation using temporal median filtering}
+\label{sec:theory_comparingimages_be}
+When no highly similar reference image is available, a normal background image can be extracted from multiple (possibly anomalous) images. Given a series of images that are close together in time, we can assume that the background has not changed. In the context of camera trap images, we also assume that an anomalous object (i.e., an animal) would have moved over time. Therefore, every pixel shows the background at some point in time and, in most cases, in the majority of frames. Under these assumptions, we can calculate the median for every pixel over all images in the time series to obtain the \textbf{median image} containing just the background. Anomalous objects can now be detected using frame differencing against the median image.
+
+This approach is more robust towards noise as well as focus and exposure changes. However, the assumption that the anomalous object moves over time is often not fulfilled, resulting in artifacts in the background reconstruction. Moreover, the background itself can change even within this short timeframe due to wind and lighting conditions, which also leads to a poor reconstruction.
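+
+A short NumPy sketch of the median image computation (the dummy frames below stand in for a real capture set):
+\begin{verbatim}
+import numpy as np
+
+# dummy capture set: five equally sized grayscale frames (placeholders)
+rng = np.random.default_rng(0)
+images = [rng.integers(0, 256, size=(480, 640)).astype(np.float32)
+          for _ in range(5)]
+
+stack = np.stack(images, axis=0)           # shape: (num_images, height, width)
+median_image = np.median(stack, axis=0)    # pixel-wise temporal median
+
+# frame differencing of one frame against the estimated background
+delta = np.abs(images[0] - median_image)
+\end{verbatim}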
+
+\section{Local features for anomaly detection}
+\label{sec:theory_localfeatures}
+
+\subsection{SIFT}
+\textbf{SIFT (scale-invariant feature transform)} \cite{Lowe04:SIFT} provides an algorithm to calculate local image features invariant to scale, rotation, noise, and lighting. The features are designed to be highly distinctive to be matched against a large database of features from many images. Such features may be used for keypoint matching but also for image classification.
+
+The first three steps of the SIFT algorithm aim to find meaningful and descriptive keypoints in the image. Keypoints are defined by their center position, scale, and orientation. The fourth and last step then calculates the keypoint descriptor by processing the local image gradients at the selected scale and region. A single SIFT descriptor is a real vector of length $128$.
+
+An alternative strategy often employed in image classification is \textbf{Dense SIFT} \cite{Bosch06:DSIFT1,Bosch07:DSIFT2,Tuytelaars10:DenseInterestPoints} where the first three steps are skipped. Instead, keypoints are sampled in a dense image grid with a fixed size and orientation. This creates significantly more keypoints, therefore introducing redundancy in the set of descriptors while also increasing its descriptiveness \cite{Chavez12:DSIFT}.
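+
+With OpenCV, both variants can be sketched in a few lines of Python (the file name is a placeholder; grid step and keypoint size are arbitrary example values):
+\begin{verbatim}
+import cv2
+
+img = cv2.imread("example.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder image
+sift = cv2.SIFT_create()
+
+# (a) standard SIFT: detect keypoints, then compute 128-dimensional descriptors
+keypoints, descriptors = sift.detectAndCompute(img, None)
+
+# (b) dense SIFT: place keypoints on a regular grid with fixed size instead
+step, size = 30, 30
+grid = [cv2.KeyPoint(float(x), float(y), float(size))
+        for y in range(0, img.shape[0], step)
+        for x in range(0, img.shape[1], step)]
+grid, dense_descriptors = sift.compute(img, grid)
+\end{verbatim}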
+
+\subsection{Bag of Visual Words}
+To perform image classification from image features, a \textbf{codebook} or \textbf{vocabulary} of a fixed size $k$ is constructed from some or all of the extracted local features. This is done by clustering the set of local features into $k$ clusters called \textbf{visual words}, usually using $k$-means clustering or by randomly choosing $k$ cluster centers. Each local feature can now be assigned to one of the $k$ visual words by choosing the closest cluster center. The choice of distance measure depends on the local feature descriptors. In the case of SIFT, the Euclidean distance is used. The codebook reduces the numerous local features to a tractable number of distinct possible features, which are observed across different images \cite{Chavez12:DSIFT}.
+
+For classification, each image is represented by its histogram of visual words (i.e., a $k$-dimensional vector with occurrence counts of all $k$ words). Note that this reduction eliminates all information about the position of the local features and geometric relations between them. Different extensions exist where some spatial information is kept.
+
+Next, these histograms can be used as feature vectors for classification. In the case of anomaly detection, the normal images are used for both the generation of the codebook and the calculation of normal feature vectors. These normal features can then be used to fit a one-class classifier such as a one-class SVM.
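+
+The following Python sketch illustrates the two steps, codebook construction and histogram computation, using scikit-learn's $k$-means implementation; the descriptor arrays are random placeholders and the codebook size is an arbitrary example value:
+\begin{verbatim}
+import numpy as np
+from sklearn.cluster import KMeans
+
+k = 256  # codebook size (example value)
+rng = np.random.default_rng(0)
+
+# placeholder for SIFT descriptors collected from all normal training images
+all_descriptors = rng.random((10000, 128)).astype(np.float32)
+
+# 1. build the codebook by clustering the descriptors into k visual words
+codebook = KMeans(n_clusters=k, n_init=10, random_state=0).fit(all_descriptors)
+
+# 2. represent a single image by its normalized histogram of visual words
+def bovw_histogram(descriptors, codebook, k):
+    words = codebook.predict(descriptors)     # index of closest cluster center
+    hist = np.bincount(words, minlength=k).astype(np.float32)
+    return hist / hist.sum()
+
+image_descriptors = rng.random((300, 128)).astype(np.float32)  # placeholder
+feature_vector = bovw_histogram(image_descriptors, codebook, k)
+\end{verbatim}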
+
+\section{Deep learning}
+\label{sec:theory_deep_learning}
+In deep learning, machine learning models with multiple layers are employed that learn representations of data at multiple levels of abstraction \cite{LeCun15:DeepLearning}. The fundamental deep model is the \emph{feedforward neural network}.
+
+\subsection{Neural networks}
+\label{sec:theory_neural_networks}
+\textbf{Neural networks} are parametric models that define a highly non-linear mapping $\bm{y} = f(\bm{x} \mid \bm{\theta})$ and learn the parameters $\bm{\theta}$ that result in the best approximation such that the loss between actual and desired outputs is minimized. \textbf{Feedforward neural networks} are composed of $l$ \textbf{layers} that model functions $f^{(1)}, f^{(2)}, \dots, f^{(l)}$. The network is then evaluated by calculating $f(\bm{x}) = f^{(l)}(\dots f^{(2)} ( f^{(1)} (\bm{x}) ) )$, i.e., information flows in one direction and there are no feedback connections. $l$ is called the \textbf{depth of the model}. The final layer $f^{(l)}$ is called \textbf{output layer}; all other layers are \textbf{hidden layers} \cite{Goodfellow16:DeepLearning}.
+
+\paragraph{Dense layers} A dense or fully connected layer is the default layer type and consists of several neurons, which are considered the smallest computation units in a neural network. The output of a neuron is called \textbf{activation}. The $i$-th neuron in layer $k$ computes its activation $f^{(k)}_i$ using the activations of the previous layer $f^{(k-1)}$, the \textbf{weight} vector $\bm{W}^{(k)}_{i}$, the \textbf{bias} $b^{(k)}_i \in \R$, and the \textbf{activation function} $g^{(k)}$:
+
+\begin{equation}
+  f^{(k)}_i = g^{(k)} \left( {\bm{W}^{(k)}_{i}}^\top f^{(k-1)} + b^{(k)}_i \right)
+\end{equation}
+
+The vector of the activations $f^{(k)}_i$ of all neurons in layer $k$ is the output $f^{(k)}$ of layer $k$, which is then fed as input to the next layer, and so on. $f^{(0)}$ is defined as the input data $\bm{x}$. The activations of the output layer are the predicted output $\bm{\hat{y}}$. The activation function $g: \R \to \R$ is required to introduce a nonlinearity into the model. An often employed activation function is \textbf{ReLU} (\emph{Rectified Linear Unit}, see \autoref{fig:activation_functions_relu}) \cite{Nair10:ReLU}, which remains very close to linear:
+
+\begin{equation}
+  \text{ReLU}(x) = \max \{ 0, x \}
+\end{equation}
+
+The codomain of ReLU is $[0, \infty)$. However, especially for the output layer, it can be desirable that the activation function has a finite value range. The hyperbolic tangent (see \autoref{fig:activation_functions_tanh}) is also a suitable activation function and has a finite codomain of $(-1, 1)$. It is defined as follows:
+
+\begin{equation}
+  \tanh(x) = 1 - \frac{2}{e^{2x} + 1}
+\end{equation}
+
+\begin{figure}[tb]
+  \centering
+  \begin{subfigure}[b]{.47\textwidth}
+    \includegraphics[width=\textwidth]{images/relu.pdf}
+    \caption{ReLU.}
+    \label{fig:activation_functions_relu}
+  \end{subfigure}
+  \begin{subfigure}[b]{.5\textwidth}
+    \includegraphics[width=\textwidth]{images/tanh.pdf}
+    \caption{$\tanh$.}
+    \label{fig:activation_functions_tanh}
+  \end{subfigure}
+  \caption[The ReLU and hyperbolic tangent activation functions]{The ReLU and hyperbolic tangent activation functions.}
+  \label{fig:activation_functions}
+\end{figure}
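+
+As a small illustration of the dense layer equation and the two activation functions, consider the following NumPy sketch (all values are random placeholders, and the layer sizes are arbitrary):
+\begin{verbatim}
+import numpy as np
+
+def relu(x):
+    return np.maximum(0.0, x)
+
+def dense_layer(a_prev, W, b, g=relu):
+    # activations of one dense layer: g(W_i^T a_prev + b_i) for all neurons i
+    return g(W.T @ a_prev + b)
+
+rng = np.random.default_rng(0)
+a_prev = rng.standard_normal(4)    # activations of the previous layer (4 neurons)
+W = rng.standard_normal((4, 3))    # column i is the weight vector of neuron i
+b = rng.standard_normal(3)         # one bias per neuron
+print(dense_layer(a_prev, W, b))           # ReLU activations
+print(dense_layer(a_prev, W, b, np.tanh))  # tanh activations
+\end{verbatim}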
+
+\paragraph{Gradient descent} To train a neural network, we define a loss function $L(\bm{\hat{y}}, \bm{y})$ on the output layer. The parameters $\bm{\theta}$ (consisting of the weights and biases of all neurons) are then iteratively adjusted such that the loss function is minimized. For every iteration, the gradient of the loss function $L$ with respect to the parameters $\bm{\theta}$ is averaged over all training examples \cite{Goodfellow16:DeepLearning}:
+\begin{equation}
+  \bm{g} = \frac{1}{m} \nabla_{\bm{\theta}} \sum_{i=1}^m L(f(\bm{x}^{(i)} \mid \bm{\theta}), \bm{y}^{(i)})
+\end{equation}
+Next, a step is performed in the direction of the negative gradient:
+\begin{equation}
+  \bm{\theta} \leftarrow \bm{\theta} - \alpha \bm{g}
+\end{equation}
+The variable $\alpha > 0$ is the \textbf{learning rate}, which controls the step size in the direction of the gradient. The gradient with respect to weights and biases of hidden layers can be computed by repeatedly applying the chain rule in an algorithm called \textbf{backpropagation}. Since the training set is often large, estimates of the gradient can be computed over smaller subsets of the training data (\emph{mini-batches}). This algorithm is called \textbf{stochastic gradient descent}.
+
+Note that gradient descent converges towards local minima and the closest local minimum depends on the initialization of the parameters. Different weight and bias initializations can therefore yield different results.
+
+\subsection{Convolutional neural networks}
+\label{sec:theory_cnns}
+\textbf{Convolutional neural networks (CNNs)} \cite{LeCun89:CNN} are specialized neural networks for grid-like data such as time series and images employing the \textbf{convolution} operation. For images, we consider a two-dimensional discrete domain. Then, the convolution operation works as follows: Each input pixel $I(x, y)$ and its surrounding pixels are weighted using the entries of the \textbf{kernel} $K(m, n)$ which is centered around $(x, y)$. These weighted values are summed to produce the output $S(x, y)$, also called \textbf{feature map} \cite{Goodfellow16:DeepLearning}:
+
+\begin{equation}
+  S(x, y) = (K \ast I)(x, y) = \sum_{m,n} I(x - m, y - n) K(m, n)
+\end{equation}
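+
+A naive NumPy sketch of this operation (``valid'' output size, unoptimized, purely for illustration; deep learning frameworks implement this far more efficiently):
+\begin{verbatim}
+import numpy as np
+
+def conv2d(I, K):
+    # 2-D discrete convolution S = K * I (kernel flipped, valid region only)
+    kh, kw = K.shape
+    K_flipped = K[::-1, ::-1]
+    out = np.zeros((I.shape[0] - kh + 1, I.shape[1] - kw + 1))
+    for y in range(out.shape[0]):
+        for x in range(out.shape[1]):
+            out[y, x] = np.sum(I[y:y + kh, x:x + kw] * K_flipped)
+    return out
+
+I = np.arange(25, dtype=float).reshape(5, 5)  # toy input image
+K = np.array([[1.0, 0.0], [0.0, -1.0]])       # toy 2x2 kernel
+print(conv2d(I, K))                           # 4x4 feature map
+\end{verbatim}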
+
+\paragraph{Convolutional layers} employ the convolution operation instead of the matrix multiplication in dense layers. Therefore, instead of the weights and biases, the kernel values are learnable parameters. A neural network with at least one convolutional layer is a CNN.
+
+CNNs have computational advantages: In a traditional neural network with dense layers, every output unit interacts with every input unit. Since the kernel is smaller than the input, this is not the case for convolutional layers. Therefore, CNNs require fewer parameters and fewer operations to be computed \cite{Goodfellow16:DeepLearning} and are much more efficient when processing images.
+
+Moreover, by applying filters, CNNs make use of spatial information between pixels which would be lost in dense models. By repeatedly applying filter operations, deeper convolutional layers learn to detect increasingly abstract patterns in images. While the first layers often detect simple shapes like lines and edges, deeper layers detect complex shapes like letters or faces.
+
+To reduce the dimensions of convolutional layer activations, there are two options: The \textbf{stride} can be set higher than one which would skip some values of $S(x, y)$, thereby reducing the output size. Alternatively, \textbf{pooling layers} can be employed which reduce several pixels to a single value, e.g., by taking their maximum or average value.
+
+\subsection{Autoencoders}
+\label{sec:theory_autoencoders}
+
+Autoencoders \cite{Rumelhart86:Autoencoders} are neural networks consisting of two (often symmetrical) parts: an encoder and a decoder (see \autoref{fig:autoencoder}). The encoder maps the input to a compressed and meaningful representation. The decoder takes this \emph{latent representation} and reconstructs the original input \cite{Bank20:Autoencoders}. Autoencoders have a \emph{bottleneck architecture} where the smallest layer with the latent representation is the bottleneck. As the optimization goal is to minimize the reconstruction error over the training set, the underlying network has to find a good compressed representation for all training samples. Therefore, mapping the input samples to their compressed representations is a form of dimensionality reduction. In fact, autoencoders can be interpreted as a nonlinear generalization of principal component analysis (PCA) \cite{Hinton06:Autoencoders}.
+
+\begin{figure}[tb]
+  \centering
+  \includegraphics[width=.8\textwidth]{images/autoencoder.PNG}
+  \caption[Architecture of an autoencoder]{Architecture of an autoencoder \cite{Bank20:Autoencoders}. The input image is encoded to the compressed latent representation and then decoded.}
+  \label{fig:autoencoder}
+\end{figure}
+
+\textbf{Convolutional autoencoders} use convolutional encoder and decoder models and are often employed to represent images. The reconstruction loss is usually set to be the \textbf{mean squared error (MSE)} between the original and reconstructed image.
+
+Autoencoders can easily overfit, thus not providing a meaningful latent representation. To avoid this, various regularization techniques can be applied. The simplest is to choose an appropriate bottleneck size depending on the complexities of the dataset, encoder, and decoder.
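+
+A toy convolutional autoencoder can be sketched in PyTorch as follows. This is an illustrative architecture only (made-up layer sizes for $1 \times 64 \times 64$ grayscale inputs), not the model evaluated later in this work:
+\begin{verbatim}
+import torch
+import torch.nn as nn
+
+class ConvAutoencoder(nn.Module):
+    def __init__(self, bottleneck=32):
+        super().__init__()
+        self.encoder = nn.Sequential(
+            nn.Conv2d(1, 16, 3, stride=2, padding=1), nn.ReLU(),   # 16x32x32
+            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),  # 32x16x16
+            nn.Flatten(),
+            nn.Linear(32 * 16 * 16, bottleneck),                   # latent code
+        )
+        self.decoder = nn.Sequential(
+            nn.Linear(bottleneck, 32 * 16 * 16), nn.ReLU(),
+            nn.Unflatten(1, (32, 16, 16)),
+            nn.ConvTranspose2d(32, 16, 2, stride=2), nn.ReLU(),    # 16x32x32
+            nn.ConvTranspose2d(16, 1, 2, stride=2), nn.Sigmoid(),  # 1x64x64
+        )
+
+    def forward(self, x):
+        return self.decoder(self.encoder(x))
+
+model = ConvAutoencoder()
+x = torch.rand(8, 1, 64, 64)                 # batch of 8 dummy images
+loss = nn.functional.mse_loss(model(x), x)   # MSE reconstruction loss
+loss.backward()
+\end{verbatim}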
+
+\subsubsection{Sparse autoencoders}
+\label{sec:sparse_ae}
+A different way of regularization is to enforce sparsity on the latent representation, i.e., forcing neurons to be inactive most of the time. This can be implemented using a penalty term on the bottleneck activations such as the $L_1$-norm \cite{Bank20:Autoencoders}. Another sparsity constraint based on the Kullback-Leibler (KL) divergence models each bottleneck neuron as a Bernoulli random variable with mean
+\begin{equation}
+  \hat{\rho_j} = \frac{1}{m} \sum_{i=1}^m \left[ a_j (x^{(i)}) \right]
+\end{equation}
+for the $j$-th bottleneck neuron, where $a_j (x^{(i)})$ is the activation of the $j$-th bottleneck neuron for the $i$-th training sample. In other words, $\hat{\rho_j}$ is set to the average activation of the $j$-th neuron. The penalty term is then set to
+\begin{equation}
+  \sum_{j=1}^b \text{KL}(\rho || \hat{\rho_j}),
+\end{equation}
+where $b$ is the number of bottleneck neurons and
+\begin{equation}
+  \text{KL}(\rho || \hat{\rho_j}) = \rho \log \frac{\rho}{\hat{\rho_j}} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho_j}}
+\end{equation}
+is the KL divergence between two Bernoulli random variables with mean $\rho$ and $\hat{\rho_j}$, respectively \cite{Ng11:SparseAutoencoder}. $\rho$ is a hyperparameter expressing the ideal mean of bottleneck activations and should be set close to $0$. Note that the KL term is computed over all training samples, making it a possibly more accurate but less efficient penalty term if the training set is too large to fit in memory.
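+
+The KL-based penalty can be sketched in PyTorch as follows; strictly, $\hat{\rho_j}$ is an average over all training samples, while this illustrative snippet approximates it over one mini-batch of (randomly generated) bottleneck activations:
+\begin{verbatim}
+import torch
+
+def kl_sparsity_penalty(bottleneck_activations, rho=0.05, eps=1e-8):
+    # bottleneck_activations: (batch_size, b) tensor with values in (0, 1)
+    rho_hat = bottleneck_activations.mean(dim=0).clamp(eps, 1 - eps)
+    kl = (rho * torch.log(rho / rho_hat)
+          + (1 - rho) * torch.log((1 - rho) / (1 - rho_hat)))
+    return kl.sum()
+
+a = torch.sigmoid(torch.randn(16, 32))  # dummy activations, b = 32 neurons
+penalty = kl_sparsity_penalty(a)        # added to the reconstruction loss
+\end{verbatim}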
+
+\subsubsection{Denoising autoencoders}
+\label{sec:denoising_ae}
+In denoising autoencoders, the input is disrupted by noise, usually Gaussian noise or erasures. The autoencoder is expected to reconstruct the original version of the input image. Thus, the reconstruction loss is computed between the undisrupted input and output image. Effectively, this extension prevents overfitting while additionally making the model more robust to noise and even usable for error correction \cite{Bank20:Autoencoders}.
+
+\section{Evaluating binary classifiers}
+In a binary classification setting, there are only two possible prediction classes: negative and positive. To describe the predicted class and the confidence value, only a single real number $x$ is needed as output: for positive $x$, the positive class is predicted, and for negative $x$, the negative class. The larger the absolute value of $x$, the higher the confidence. The function mapping input data to $x$ is called the \emph{decision function}.
+
+\subsection{Confusion matrix}
+Suppose we want to evaluate a classifier on an annotated test set (i.e., the true classes are known for all samples). Assuming the classifier is not ideal, the predicted classes differ from the true classes. In fact, there are four possible outcomes for every prediction:
+
+\begin{enumerate}[label=\alph*)]
+  \item \textbf{True positive (TP):} The sample is labeled as positive and was correctly classified as positive.
+  \item \textbf{False negative (FN):} The sample is labeled as positive but was falsely classified as negative.
+  \item \textbf{True negative (TN):} The sample is labeled as negative and was correctly classified as negative.
+  \item \textbf{False positive (FP):} The sample is labeled as negative but was falsely classified as positive.
+\end{enumerate}
+
+To keep a record of wrong and successful predictions, the \emph{confusion matrix} is introduced (see \autoref{fig:confusion_matrix}) containing the absolute frequencies of the above prediction cases. Consequently, correct classifier decisions lie on the main diagonal of the table while the other elements represent the misclassifications \cite{Majnik13:ROCAnalysis}.
+
+\begin{figure}[tbhp]
+  \centering
+  \begin{tikzpicture}
+    % table
+    \draw[fill=red!40] (0, 0) rectangle (2, 2) node[pos=.5] {FN};
+    \draw[fill=green!60] (2, 0) rectangle (4, 2) node[pos=.5] {TN};
+    \draw[fill=green!60] (0, 2) rectangle (2, 4) node[pos=.5] {TP};
+    \draw[fill=red!40] (2, 2) rectangle (4, 4) node[pos=.5] {FP};
+    % positive/negative labels on the left
+    \path (-1, 0) rectangle (0, 2) node[pos=.5] {N'};
+    \path (-1, 2) rectangle (0, 4) node[pos=.5] {P'};
+    % positive/negative labels on the top
+    \path (0, 4) rectangle (2, 5) node[pos=.5] {P};
+    \path (2, 4) rectangle (4, 5) node[pos=.5] {N};
+    % outer labels
+    \path (0, 5) rectangle (4, 6) node[pos=.5] {True class};
+    \path (-2, 0) rectangle (-1, 4) node[pos=.5,rotate=90] {Predicted class};
+  \end{tikzpicture}
+  \caption[Confusion matrix]{Confusion matrix. The green cells indicate successful predictions, the red cells are prediction failures.}
+  \label{fig:confusion_matrix}
+\end{figure}
+
+Out of these absolute values, relative ones can be derived. Let TP, FN, TN, FP denote absolute frequencies from the confusion matrix, $\text{P} = \text{TP} + \text{FN}$ be the number of positively labeled samples and $\text{N} = \text{FP} + \text{TN}$ the number of negatively labeled samples. Then, the following rates can be defined to describe the classifier's performance \cite{Majnik13:ROCAnalysis}:
+
+\begin{enumerate}[label=\alph*)]
+  \item \textbf{True positive rate (TPR), sensitivity:} $\text{TPR} = \frac{\text{TP}}{\text{P}} = 1 - \text{FNR}$.
+  \item \textbf{True negative rate (TNR), specificity:} $\text{TNR} = \frac{\text{TN}}{\text{N}} = 1 - \text{FPR}$.
+  \item \textbf{False positive rate (FPR):} $\text{FPR} = \frac{\text{FP}}{\text{N}} = 1 - \text{TNR}$.
+  \item \textbf{False negative rate (FNR):} $\text{FNR} = \frac{\text{FN}}{\text{P}} = 1 - \text{TPR}$.
+\end{enumerate}
+
+An ideal classifier would have a sensitivity and specificity of $1$. Real-world classifiers, however, often require making trade-offs between high sensitivity and high specificity by choosing a suitable threshold for its decision function.
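+
+For illustration, these rates follow directly from the confusion matrix counts (the numbers below are arbitrary placeholders):
+\begin{verbatim}
+# placeholder confusion matrix counts
+TP, FN, TN, FP = 80, 20, 150, 50
+
+P = TP + FN   # positively labeled samples
+N = FP + TN   # negatively labeled samples
+
+TPR = TP / P  # sensitivity
+TNR = TN / N  # specificity
+FPR = FP / N  # equals 1 - TNR
+FNR = FN / P  # equals 1 - TPR
+\end{verbatim}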
+
+\subsection{ROC curves}
+\label{sec:roc_curves}
+
+\begin{figure}[tb]
+  \centering
+  \begin{tikzpicture}
+    % plot + labels
+    \draw[opacity=0.5] (0, 0) rectangle (10, 10);
+    \path (0, -2) rectangle (10, 0) node[pos=.5] {FPR};
+    \path (-2, 10) rectangle (0, 0) node[pos=.5,rotate=90] {TPR};
+    % tick bottom left
+    \draw[opacity=0.5] (0,0) -- (-0.2,-0.2);
+    \node at (-0.4,-0.4) {0};
+    % tick bottom right
+    \draw[opacity=0.5] (10,0) -- (10,-0.2);
+    \node at (10,-0.4) {1};
+    % tick top left
+    \draw[opacity=0.5] (0,10) -- (-0.2,10);
+    \node at (-0.4,10) {1};
+
+    % diagonal
+    \draw[densely dotted,opacity=0.5] (0,0) -- (10,10);
+
+    % always negative
+    \draw[fill=blue!80!black] (0,0) circle (0.1) node[above right,text=blue!80!black] {always negative};
+
+    % always positive
+    \draw[fill=blue!80!black] (10,10) circle (0.1) node[below left,text=blue!80!black] {always positive};
+
+    % always correct
+    \draw[fill=green!60!black] (0,10) circle (0.1) node[below right,text=green!60!black] {always correct};
+
+    % always incorrect
+    \draw[fill=red!70!black] (10,0) circle (0.1) node[above left,text=red!70!black] {always incorrect};
+
+    % real classifiers
+    \draw[densely dotted,fill=violet,opacity=0.8] (7,1.4) -- (3,8.6);
+    \draw[fill=violet] (7,1.4) circle (0.1) node[above right,text=violet] {$A$};
+    \draw[fill=violet] (3,8.6) circle (0.1) node[above left,text=violet] {$\overline{A}$};
+
+    % random guesses
+    \draw[fill=orange!80!black] (5,5) circle (0.1) node[right,text=orange!80!black] {random guessing $p=0.5$};
+    \draw[fill=orange!80!black] (2,2) circle (0.1) node[below right,text=orange!80!black] {random guessing $p=0.2$};
+  \end{tikzpicture}
+  \caption[Characteristic points on a ROC graph]{Characteristic points on a ROC graph. Classifier $A$ performs significantly worse than random guessing, suggesting that its decision function should be inverted. This yields the much better classifier $\overline{A}$.}
+  \label{fig:roc_curve}
+\end{figure}
+
+To compare binary classifiers, a fair benchmark independent of the chosen threshold is required. The ROC (receiver operating characteristic) curve of a binary classifier plots the TPR on the y-axis against the FPR on the x-axis for all possible thresholds. In other words, each point on the ROC graph represents a single threshold for the decision function. There are three characteristic points, illustrated in \autoref{fig:roc_curve}:
+
+\begin{itemize}
+  \item The classifier at $(0, 0)$ always predicts the negative class, therefore producing neither true positives nor false positives.
+  \item The classifier at $(1, 1)$ always predicts the positive class, therefore producing only true and false positives.
+  \item The classifier at $(0, 1)$ is the ideal classifier that always predicts the correct class.
+\end{itemize}
+A classifier at $(1, 0)$ would always predict the opposite of the true class. Furthermore, a classifier that guesses randomly would land close to the ascending diagonal line. Therefore, all useful classifiers must lie above the diagonal. However, even a classifier significantly below the diagonal must contain useful information, since it could not otherwise perform significantly worse than random guessing. Such cases indicate that the decision function might have to be inverted, effectively mirroring it against the midpoint $(0.5, 0.5)$ on the ROC graph \cite{Flach05:ROCCurves}. An example of this is shown in \autoref{fig:roc_curve} with the classifier $A$.
+
+To compare different classifiers, one ROC curve per classifier is computed over all possible thresholds. The quality of each classifier can then be measured by the Area Under Curve (AUC) score, i.e., the area under the ROC curve. Consequently, a good classifier would have an AUC score close to $1$ whereas a poor classifier would have an AUC score of around $0.5$ (see \autoref{fig:roc_examples}). Again, an AUC much lower than $0.5$ suggests that the decision function should be inverted whereas an AUC of $1$ shows that there exists a threshold for the decision function with $100 \%$ accuracy.
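+
+ROC curves and AUC scores can be computed from a decision function's raw scores, e.g., with scikit-learn (illustrative sketch with made-up labels and scores):
+\begin{verbatim}
+import numpy as np
+from sklearn.metrics import roc_curve, roc_auc_score
+
+# made-up ground truth (1 = anomalous, 0 = normal) and decision function values
+y_true = np.array([0, 0, 1, 1, 0, 1, 0, 1])
+scores = np.array([-1.2, -0.4, 0.8, 0.3, 0.1, 1.5, -0.9, -0.2])
+
+fpr, tpr, thresholds = roc_curve(y_true, scores)  # one (FPR, TPR) per threshold
+auc = roc_auc_score(y_true, scores)
+
+# an AUC far below 0.5 suggests inverting the decision function:
+auc_inverted = roc_auc_score(y_true, -scores)     # equals 1 - auc
+\end{verbatim}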
+
+\begin{figure}[tb]
+  \centering
+  \begin{subfigure}[b]{0.32\textwidth}
+      \centering
+      \includegraphics[width=\textwidth]{images/roc_random.pdf}
+      \caption{}
+      \label{fig:roc_random}
+  \end{subfigure}
+  \begin{subfigure}[b]{0.32\textwidth}
+      \centering
+      \includegraphics[width=\textwidth]{images/roc_goodclf.pdf}
+      \caption{}
+      \label{fig:roc_goodclf}
+  \end{subfigure}
+  \begin{subfigure}[b]{0.32\textwidth}
+      \centering
+      \includegraphics[width=\textwidth]{images/roc_goodclf_inv.pdf}
+      \caption{}
+      \label{fig:roc_goodclf_inv}
+  \end{subfigure}
+  \caption[Different ROC curves and AUC scores]{Different ROC curves and AUC scores. (a) Random guessing yields AUC $\approx 0.5$. (b) A good classifier $C$ has an AUC score close to 1. (c) If we invert the decision function of $C$, the AUC score changes to $1 - $ AUC.}
+  \label{fig:roc_examples}
+\end{figure}
+
+% -------------------------------------------------------------------
+
+\section{Already existing solutions}
+\label{sec:existingSolutions}
+Several animal species detection solutions \cite{Norouzzadeh18:Solution1,Willi19:Solution2} used a separate CNN for the task of detecting the presence of an animal in an image. To train such CNNs, a large and diverse annotated dataset is required: \cite{Norouzzadeh18:Solution1} used a balanced dataset of 1.4 million training and 105,000 test images. The images were selected from a total of 3.2 million images generated by 225 continuously running camera traps in the Serengeti National Park, Tanzania, since 2011. \cite{Willi19:Solution2} used the same dataset as well as three additional ones from Africa and North America, each with around 500,000 images.
+
+\cite{Norouzzadeh18:Solution1} found that splitting the tasks of animal detection and species classification outperformed a combined solution. They compared different CNN architectures and an ensemble of multiple CNNs in terms of accuracy. While the ensemble often slightly enhanced the accuracy compared to the best single model, it also increased training and inference time.
+
+\cite{Willi19:Solution2} argued that a combined model would have a strong bias towards the majority class of empty images. This effect is very significant since empty images make up more than 50 \% in most camera trap datasets. Additionally, an increase in accuracy can be expected by learning more specific target tasks. \cite{Willi19:Solution2} trained a CNN on the large Snapshot Serengeti dataset and then employed transfer learning to fine-tune the model to the smaller datasets to avoid overfitting. Their experiments showed that this increases the overall accuracy on all datasets.
+
+\cite{Yang21:SolutionEnsemble} have described another method for removing empty camera trap images using ensemble learning. They employed an ensemble of three CNNs with different architectures. In consideration of the class imbalance, they constructed a balanced and an unbalanced training set and trained each model architecture on each training set, therefore creating a pool of six models. Using a conservative voting mechanism, they removed between 50.78 \% and 77.51 \% of the empty images with omission errors of 0.70 \% to 2.54 \%. However, they also used a large Chinese dataset with 268,484 motion images (of which 77.86 \% were empty) and training six deep CNN models presumably requires a significant amount of time (which is not specified in the paper). In this work, such deep models are not applicable since there is much less training data (1,000 - 5,000 images per camera station) and annotated test data (500 - 3,000 images per camera station) available.
+
+% -------------------------------------------------------------------
+
+% insert further sections if necessary
+

+ 201 - 0
chapters/chap03/chap03.tex

@@ -0,0 +1,201 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     chap03/chap03.tex
+% Part:     methods
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapter{Methods}
+\label{chap:contribution}
+In the following, the dataset is analyzed regarding its structure, inconsistencies, and preprocessing. After that, four approaches are proposed based on the theoretical considerations of the previous chapter.
+
+\section{Dataset}
+We will call a set of images generated from a single camera at a single position in the Bavarian Forest a \emph{session}. The images in a session were taken within a timeframe of one to three months. To attract animals, the researchers laid out an animal cadaver in front of the camera, which is visible in the images.
+
+The dataset is organized in 32 sessions, each of which has its own folder identified by the cadaver type and a session number between one and five. Overall, there are 10 different cadaver types. Each session folder contains three subfolders, \emph{Lapse}, \emph{Motion}, and \emph{Full}. The \emph{Lapse} folder contains images taken at regular time intervals (usually one hour), regardless of whether motion was detected in front of the camera. In contrast, the \emph{Motion} folder contains only images that were captured when the motion sensor was triggered. When a movement is registered, the camera takes five images at an interval of one second. This process is repeated as long as the movement continues. Therefore, \emph{Motion} images are organized in sets of consecutively taken images with at least five images, referred to as \emph{capture sets}. The \emph{Full} folder is a subset of the \emph{Motion} folder and contains images, pre-selected by humans, that actually contain a moving object. The \emph{Full} files can be used to aid annotation but are not further referenced as they are not part of the classification process.
+
+A total of 203,887 images are available, of which 82 \% are \emph{Motion}, 15 \% are \emph{Lapse}, and 2 \% are \emph{Full} images.
+
+\subsection{Challenges}
+
+The ratio between the number of files in the three folders varies significantly. For instance, the \emph{Roedeer\_01} session contains 1380 \emph{Lapse} samples and 38,820 \emph{Motion} samples, of which only 18 have been pre-selected. In contrast, the \emph{Beaver\_01} session contains 1734 \emph{Lapse} samples but only 695 \emph{Motion} samples, of which 200 have been pre-selected. This shows a great variance in the relative frequency of anomalous images.
+
+The analysis of the distribution of EXIF image dates for each session shows that nine of the 32 sessions contain duplicates (i.e., two or more images taken at the same time), of which six contain inconsistent duplicates (i.e., the duplicate images show different content). While consistent duplicates can be eliminated easily, the inconsistent duplicates indicate an error in the dataset that cannot be fixed easily. Moreover, in a few cases, the number of images in a capture set is not a multiple of five. The origin of this error could not be traced.
+
+\subsection{Preprocessing}
+The original image size of around 8 megapixels is too large for efficient execution of the proposed methods. In addition, the raw input images have black status bars at the top and bottom, which do not contain any useful information for classification. Therefore, in a preprocessing step, the black strips are cut off and the image is rescaled to 30 \% of its original size for approaches 1-3 and resized to 256 x 256 for approach 4. This process is demonstrated in \autoref{fig:preprocessing} for a sample image.
+
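+The following short sketch outlines this preprocessing step with OpenCV; the crop margins and the file name are illustrative assumptions rather than the exact values used in the implementation.
+
+\begin{verbatim}
+import cv2
+
+# Assumed crop margins (pixels) for the black status bars.
+TOP, BOTTOM = 30, 100
+
+img = cv2.imread("sample_motion_image.jpg")     # hypothetical file name
+cropped = img[TOP:img.shape[0] - BOTTOM, :]     # cut off the status bars
+
+# Approaches 1-3: rescale to 30 % of the original size.
+small = cv2.resize(cropped, None, fx=0.3, fy=0.3,
+                   interpolation=cv2.INTER_AREA)
+
+# Approach 4: resize to a fixed 256 x 256 autoencoder input.
+square = cv2.resize(cropped, (256, 256), interpolation=cv2.INTER_AREA)
+\end{verbatim}
+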
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[b]{.8\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/sample_0_5.jpg}
+        \caption{}
+        \label{fig:sample_image_raw}
+    \end{subfigure}
+    \begin{subfigure}[b]{.52\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/sample_preprocessed.jpg}
+        \caption{}
+        \label{fig:sample_image_pre}
+    \end{subfigure}
+    \begin{subfigure}[b]{.38\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/sample_preprocessed256.jpg}
+        \caption{}
+        \label{fig:sample_image_pre256}
+    \end{subfigure}
+    \caption[Cropping and resizing input images]{Cropping and resizing input images. (a) Raw input image (size 3776 x 2124). (b) Cropped and rescaled to 30 \% (size 1133 x 613) for approaches 1-3. (c) Cropped and resized to 256 x 256 for approach 4.}
+    \label{fig:preprocessing}
+\end{figure}
+
+\subsection{Labeling}
+\label{sec:labeling}
+To annotate sessions for testing, a script shows the annotator one image at a time. The annotator can then classify the image as normal or anomalous using the `1' and `2' keys. The generated labels can be quickly exported and saved for automated testing. Experiments show that for the tested sessions, between 60 and 100 images can be annotated this way per minute, depending on the frequency of images with content of interest.
+
+\section{Approaches}
+A total of four approaches are evaluated on the available sessions. As the sessions were created independently of each other, each approach is evaluated on a single session at a time, with only the lapse and motion images of that session as input information.
+
+\subsection{Lapse frame differencing}
+Using frame differencing (see \autoref{sec:theory_comparingimages_fd}), the motion image can be compared to the closest available lapse image in time. For this approach to work, lapse images have to be taken at least every hour using the same camera setup as for motion images such that the closest lapse image closely resembles its background. Mean and variance of the difference image (see \autoref{fig:approach1_example}) are determined and independently thresholded to distinguish between animal images with a high mean and variance, and empty images with a low mean and variance.
+
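+A minimal sketch of this comparison is given below (NumPy/SciPy); it assumes that the motion image and its closest lapse image have already been loaded as grayscale arrays, and it already includes the optional Gaussian filtering introduced in the following paragraph.
+
+\begin{verbatim}
+import numpy as np
+from scipy.ndimage import gaussian_filter
+
+def difference_scores(motion, lapse, sigma=4):
+    """Mean and variance of the squared pixel-wise difference between a
+    motion image and its closest lapse image (grayscale float arrays).
+    sigma > 0 applies the Gaussian smoothing described below."""
+    if sigma > 0:
+        motion = gaussian_filter(motion, sigma)
+        lapse = gaussian_filter(lapse, sigma)
+    diff = (motion - lapse) ** 2
+    return diff.mean(), diff.var()
+
+# Example with random placeholder images; in practice the preprocessed
+# camera trap images are used. Thresholding either score yields the
+# classifier, and sweeping the threshold yields the ROC curve.
+motion_img = np.random.rand(613, 1133)
+lapse_img = np.random.rand(613, 1133)
+mean_score, var_score = difference_scores(motion_img, lapse_img, sigma=4)
+\end{verbatim}
+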
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_motion.pdf}
+        \label{fig:approach1_motion}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_lapse.pdf}
+        \label{fig:approach1_lapse}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_sqdiff.pdf}
+        \label{fig:approach1_sqdiff}
+    \end{subfigure}
+    \caption[Demonstration of lapse frame differencing on an anomalous motion image]{Demonstration of lapse frame differencing on an anomalous motion image. (a) Motion image taken at 00:22:46. (b) Closest lapse image taken at 00:00:00. (c) Squared pixel-wise difference ($\mu = 0.060, \sigma^2 = 0.033$).}
+    \label{fig:approach1_example}
+\end{figure}
+
+The algorithm is extended by applying Gaussian filtering on both input images with standard deviations of $\sigma \in \{ 2, 4, 6 \}$ (see \autoref{fig:approach1_example2}). Thus, the approach becomes more robust towards noise and small object movements (leaves, flies, dust particles, etc.). Note that this approach can only be evaluated on sessions with lapse images captured every hour.
+
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_ex2_motion.pdf}
+        \label{fig:approach1_ex2_motion}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_ex2_lapse.pdf}
+        \label{fig:approach1_ex2_lapse}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_ex2_sqdiff.pdf}
+        \label{fig:approach1_ex2_sqdiff}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.48\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach1a_ex2_sigma4_sqdiff.pdf}
+        \label{fig:approach1_ex2_sqdiff_sigma4}
+    \end{subfigure}
+    \caption[Extending lapse frame differencing using Gaussian filtering]{Extending lapse frame differencing using Gaussian filtering ($\sigma = 4$) beforehand. (a) Motion image. (b) Lapse image. (c) Squared pixel-wise difference ($\mu = 1.258, \sigma^2 = 4.113$). (d) Squared pixel-wise difference of Gaussian filtered images ($\mu = 0.160, \sigma^2 = 0.330$).}
+    \label{fig:approach1_example2}
+\end{figure}
+
+\subsection{Median frame differencing}
+If the above condition is not met, i.e., no or too few lapse images are available, a substitute for lapse images can be found. Since the motion set is organized in subsets of at least five consecutively taken images (\emph{capture sets}), background estimation using the temporal median image (see \autoref{sec:theory_comparingimages_be}) can be applied to estimate a background image directly from the motion images of a single capture set. In the best case, such an image can be a better background estimation than the closest lapse image since there is no time difference between the images (see \autoref{fig:approach2_good_median}). However, this method often fails, specifically when the foreground object remains in the same place in the image for an extended period of time. In the case of animals, this is not unusual behavior (see \autoref{fig:approach2_bad_median}). The main advantage of this approach is its low requirements, since only motion images are used in the algorithm. As before, the accuracy should improve by filtering both the motion image and the median image using a Gaussian filter with $\sigma \in \{ 2, 4, 6 \}$.
+
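+The background estimation step itself reduces to a pixel-wise median over the capture set, as the following sketch illustrates; the capture set is assumed to be given as a list of equally sized grayscale arrays.
+
+\begin{verbatim}
+import numpy as np
+
+def median_background(capture_set):
+    """Temporal median over a capture set (list of equally sized
+    grayscale arrays); each output pixel is the median of that pixel's
+    values across all images in the set."""
+    return np.median(np.stack(capture_set, axis=0), axis=0)
+
+# The estimated background then takes the place of the lapse image in
+# the frame differencing procedure of the previous approach.
+\end{verbatim}
+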
+\begin{figure}[tb]
+    \centering
+    \begin{subfigure}[b]{0.6\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach2_good_example_imgs.png}
+        \caption{}
+        \label{fig:approach2_good_imgs}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.38\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach2_good_example_median.png}
+        \caption{}
+        \label{fig:approach2_good_median}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.6\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach2_bad_example_imgs.png}
+        \caption{}
+        \label{fig:approach2_bad_imgs}
+    \end{subfigure}
+    \begin{subfigure}[b]{0.38\textwidth}
+        \centering
+        \includegraphics[width=\textwidth]{images/approach2_bad_example_median.png}
+        \caption{}
+        \label{fig:approach2_bad_median}
+    \end{subfigure}
+    \caption[Demonstration of background estimation using temporal median filtering]{Demonstration of background estimation using temporal median filtering. (a) Convenient set of motion images; the majority of values show background for all pixels. (b) Resulting good background estimation. (c) Inconvenient set of motion images; the deer stays in the center of the image. (d) Resulting poor background estimation.}
+    \label{fig:approach2_example}
+\end{figure}
+
+A possible fault of both approaches 1 and 2 is the high sensitivity to movement of the camera or background objects. As soon as the camera, trees, or leaves move around by even a few pixels, high difference values occur which distort the statistical properties of the difference image, thus making it harder to distinguish between normal and anomalous images. Approach 3 tries to eliminate this fault by relying on local image features. 
+
+\subsection{Bag of Visual Words}
+In the training process, a visual vocabulary is generated in the following way: First, SIFT descriptors are calculated on densely sampled keypoints for all lapse images (see \autoref{fig:approach3}). Then, the features are clustered into $k$ groups using the k-Means algorithm to create a visual vocabulary $V$. The best combination of hyperparameters including the keypoint step size $s$, keypoint size, and vocabulary size $k$ is determined experimentally. The training features are derived by computing the Bag of Words histogram for every lapse image with respect to the vocabulary $V$. They can then be used to fit a one-class classifier (here: one-class SVM).
+
+\begin{figure}[htbp]
+    \centering
+    \includegraphics[width=.6\textwidth]{images/approach3_keypoints.pdf}
+    \caption[Densely sampled keypoints.]{Densely sampled keypoints. Keypoint size 30 pixels, no spacing ($s = 30$).}
+    \label{fig:approach3}
+\end{figure}
+
+For evaluation on the motion images, SIFT descriptors are computed on the same dense keypoint grid that was used in training. The descriptors are used to derive the bag of words histogram to which a score can be assigned using the trained one-class classifier. This score can then be thresholded to distinguish between normal and anomalous images. Picking random prototypes is tried as an alternative to k-Means clustering.
+
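+The following condensed sketch outlines this training and scoring pipeline using OpenCV and scikit-learn; the dummy lapse images, the vocabulary size, and the SVM parameters are illustrative assumptions and not the exact configuration used in the experiments.
+
+\begin{verbatim}
+import numpy as np
+import cv2
+from scipy.spatial.distance import cdist
+from sklearn.svm import OneClassSVM
+
+def dense_sift(gray, step=30, size=30):
+    """SIFT descriptors on a dense keypoint grid of a uint8 image."""
+    sift = cv2.SIFT_create()
+    keypoints = [cv2.KeyPoint(float(x), float(y), float(size))
+                 for y in range(size // 2, gray.shape[0], step)
+                 for x in range(size // 2, gray.shape[1], step)]
+    _, descriptors = sift.compute(gray, keypoints)
+    return descriptors
+
+def bow_histogram(descriptors, vocabulary):
+    """Normalized histogram of nearest prototypes (visual words)."""
+    words = cdist(descriptors, vocabulary).argmin(axis=1)
+    hist = np.bincount(words, minlength=len(vocabulary)).astype(float)
+    return hist / hist.sum()
+
+# Placeholder lapse images; in practice the preprocessed images are used.
+lapse_images = [np.random.randint(0, 256, (613, 1133), dtype=np.uint8)
+                for _ in range(5)]
+lapse_descriptors = [dense_sift(img) for img in lapse_images]
+all_descriptors = np.vstack(lapse_descriptors)
+
+# Random prototypes as a fast alternative to k-Means clustering.
+k = 1024
+vocabulary = all_descriptors[
+    np.random.choice(len(all_descriptors), k, replace=False)]
+
+train_histograms = np.array(
+    [bow_histogram(d, vocabulary) for d in lapse_descriptors])
+classifier = OneClassSVM(gamma="scale").fit(train_histograms)
+
+# At test time, a motion image is scored via its histogram; lower
+# decision function values indicate anomalies:
+# score = classifier.decision_function(
+#     [bow_histogram(dense_sift(motion_image), vocabulary)])
+\end{verbatim}
+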
+\subsection{Autoencoder}
+An autoencoder neural network is trained on normal lapse images and then evaluated on motion images as described in \autoref{sec:theory_autoencoders}. Two metrics are considered separately to distinguish between normal and anomalous images: The reconstruction loss measures the mean squared error between the input and reconstructed image. The other metric is the log likelihood of the input image under the distribution of the bottleneck activations, estimated using Kernel Density Estimation (KDE) with a Gaussian kernel. If the reconstruction loss is \emph{above} a certain threshold or the log likelihood is \emph{below} a threshold, respectively, the image is considered anomalous. Again, the best combination of hyperparameters including learning rate, batch size and dropout rate is determined experimentally. \autoref{fig:approach4_reconstructions} demonstrates that anomalous images yield a high reconstruction error while normal images are reconstructed much more accurately.
+
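+The KDE-based score can be computed as in the following sketch (scikit-learn); the latent feature arrays and the bandwidth are hypothetical placeholders for the activations of the trained encoder.
+
+\begin{verbatim}
+import numpy as np
+from sklearn.neighbors import KernelDensity
+
+# Hypothetical flattened bottleneck activations of the trained encoder:
+lapse_latents = np.random.randn(1000, 512)    # normal training images
+motion_latents = np.random.randn(50, 512)     # images to be classified
+
+# Estimate the density of normal latent codes with a Gaussian kernel.
+kde = KernelDensity(kernel="gaussian", bandwidth=1.0).fit(lapse_latents)
+
+# A low log likelihood under this density indicates an anomalous image.
+log_likelihood = kde.score_samples(motion_latents)
+
+# The reconstruction loss is the per-image mean squared error, e.g. in
+# PyTorch: ((x - x_hat) ** 2).mean(dim=(1, 2, 3))
+\end{verbatim}
+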
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[t]{.48\textwidth}
+        \includegraphics[width=\textwidth]{images/approach4_normal_reconstruction.png}
+        \caption{}
+    \end{subfigure}
+    \begin{subfigure}[t]{.48\textwidth}
+        \includegraphics[width=\textwidth]{images/approach4_anomalous_reconstruction.png}
+        \caption{}
+    \end{subfigure}
+    \caption[Autoencoder inputs and reconstructions]{Autoencoder inputs (left) and reconstructions (right). (a) A normal input image is reconstructed well. (b) An anomalous image is reconstructed poorly.}
+    \label{fig:approach4_reconstructions}
+\end{figure}
+\begin{figure}[hbtp]
+    \centering
+    \includegraphics[width=.36\textwidth,angle=90]{images/approach4_architecture.pdf}
+    \caption[Architecture of the convolutional autoencoder]{Architecture of the convolutional autoencoder. All convolutional layers use dropout with $p=0.05$ and the ReLU activation function. The very last layer uses the $\tanh$ activation function to provide a value range of $(-1, 1)$ identical to the input. The output dimensions of each layer are given in brackets.}
+    \label{fig:approach4_architecture}
+\end{figure}
+
+\subsubsection{Architecture}
+To keep the number of parameters small, a fully convolutional architecture with no dense layers is employed (see \autoref{fig:approach4_architecture}). The encoder and decoder part are mirror-symmetric and consist of seven convolutional layers each. Input and output images are downscaled to $(3, 256, 256)$ with the color channel in the first dimension. The image size should not be smaller than 256 x 256 for even small anomalous objects to be visible. A larger image size is possible but requires a change of architecture. The value range of both input and output image is $(-1, 1)$.
+
+The bottleneck layer has shape $(n, 4, 4)$. The variable $n$ therefore controls the size of the latent representation in multiples of 16. The optimal value for $n$ depends on the characteristics of the session. Depending on the bottleneck size, the number of trainable parameters varies. For $n = 32$ (512 latent features), there are a total of $1{,}076{,}771$ parameters of which $388{,}192$ belong to the encoder and $688{,}579$ belong to the decoder.
+
+Using an existing network architecture was considered; however, most available architectures are simply too large for such small training sets. The architecture follows two basic design principles (a minimal sketch illustrating them is given after the list):
+\begin{enumerate}
+    \item In the encoder, the image size gradually decreases while the number of channels gradually increases (except for the bottleneck layer).
+    \item The decoder mirrors the encoder.
+\end{enumerate}
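+
+The following PyTorch sketch illustrates these two principles; the channel widths, kernel sizes, and number of layers are illustrative assumptions and do not reproduce the exact seven-layer encoder and decoder described above.
+
+\begin{verbatim}
+import torch
+import torch.nn as nn
+
+class ConvAutoencoderSketch(nn.Module):
+    """Illustrative fully convolutional autoencoder: each encoder layer
+    halves the spatial size and (mostly) increases the channel count,
+    and the decoder mirrors the encoder."""
+
+    def __init__(self, n=32, p=0.05):
+        super().__init__()
+        channels = [3, 16, 32, 64, 64, 128, n]   # assumed channel widths
+        encoder, decoder = [], []
+        for c_in, c_out in zip(channels[:-1], channels[1:]):
+            encoder += [nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
+                        nn.Dropout2d(p), nn.ReLU()]
+        for c_in, c_out in zip(channels[::-1][:-1], channels[::-1][1:]):
+            decoder += [nn.ConvTranspose2d(c_in, c_out, 4, stride=2,
+                                           padding=1),
+                        nn.Dropout2d(p), nn.ReLU()]
+        decoder[-1] = nn.Tanh()                  # output range (-1, 1)
+        self.encoder = nn.Sequential(*encoder)
+        self.decoder = nn.Sequential(*decoder)
+
+    def forward(self, x):
+        z = self.encoder(x)                      # bottleneck: (n, 4, 4)
+        return self.decoder(z), z
+
+model = ConvAutoencoderSketch(n=32)
+x = torch.randn(1, 3, 256, 256)           # inputs lie in (-1, 1) in practice
+x_hat, z = model(x)                       # x_hat has shape (1, 3, 256, 256)
+\end{verbatim}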
+
+\subsubsection{Training and evaluation}
+The model parameters are optimized using the Adam optimizer \cite{Kingma14:Adam} in PyTorch \cite{Paszke19:PyTorch} and trained for 200 epochs. After the training is completed, the reconstruction losses and log likelihoods are calculated for all lapse and motion images. Both of these metrics are then thresholded as before.
+
+\subsubsection{Extensions}
+In a denoising autoencoder (see \autoref{sec:denoising_ae}), Gaussian noise with different standard deviations $\sigma$ is added to the input images during training. In a sparse autoencoder (see \autoref{sec:sparse_ae}), an L1 penalty on the bottleneck activations is added to the loss function with different sparsity multipliers $\lambda$. A KL divergence-based penalty was not examined due to time constraints. The experiments are repeated for all configurations of the two extensions as well as for a combination of both.
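+
+Both extensions amount to a small modification of the training objective, as the following sketch illustrates; the model interface (returning reconstruction and bottleneck activations) matches the sketch above, and the concrete value of the sparsity multiplier is an arbitrary placeholder.
+
+\begin{verbatim}
+import torch
+
+def training_loss(model, x, sigma=0.1, lam=1e-4):
+    """Training objective combining both extensions: Gaussian input
+    noise (denoising) and an L1 penalty on the bottleneck activations
+    (sparsity). sigma = 0 and lam = 0 recover the plain autoencoder;
+    the value of lam here is an arbitrary placeholder."""
+    x_noisy = x + sigma * torch.randn_like(x) if sigma > 0 else x
+    x_hat, z = model(x_noisy)              # model returns (output, code)
+    reconstruction = torch.mean((x_hat - x) ** 2)   # reconstruct clean input
+    sparsity = lam * z.abs().mean()
+    return reconstruction + sparsity
+
+# Optimization as described above, e.g.:
+# optimizer = torch.optim.Adam(model.parameters(), lr=1e-4,
+#                              weight_decay=1e-5)
+# loss = training_loss(model, batch)
+# optimizer.zero_grad(); loss.backward(); optimizer.step()
+\end{verbatim}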
+
+\section{Evaluation}
+All approaches yield a function that maps every motion image to a real number. This function then has to be thresholded. As the reliability of the algorithm depends on the chosen threshold, ROC curves (see \autoref{sec:roc_curves}) mapping the false positive rate (FPR) to the true positive rate (TPR) are generated for every approach and every session. The AUC score is used as the main comparison metric. However, the AUC score alone often does not describe well how suitable an approach is for the problem of empty image elimination. Therefore, elimination rates are introduced as an additional metric: The elimination rate $\TNR_{\TPR \geq x}$ describes the highest possible true negative rate (TNR) for a true positive rate of at least $x$. Here, we choose $x \in \{ 0.9, 0.95, 0.99 \}$. Intuitively, if we want to keep at least 90 \% (95 \%, 99 \%) of interesting images, what percentage of empty images can we eliminate? This metric is more relevant to our problem than the AUC score, since we prioritize keeping a large number of interesting images.
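+
+The elimination rate can be read directly from the ROC curve, as the following Python sketch with scikit-learn illustrates; the label and score arrays are hypothetical.
+
+\begin{verbatim}
+import numpy as np
+from sklearn.metrics import roc_curve
+
+def elimination_rate(y_true, scores, min_tpr=0.95):
+    """Highest TNR (= 1 - FPR) achievable with a TPR of at least
+    min_tpr, where label 1 marks anomalous (interesting) images."""
+    fpr, tpr, _ = roc_curve(y_true, scores)
+    return 1.0 - fpr[tpr >= min_tpr].min()
+
+# Hypothetical labels and anomaly scores for one session:
+y = np.array([1, 0, 0, 1, 0, 1, 0, 0])
+s = np.array([0.9, 0.2, 0.4, 0.8, 0.1, 0.3, 0.6, 0.2])
+rate = elimination_rate(y, s, min_tpr=0.9)
+\end{verbatim}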

+ 324 - 0
chapters/chap04/chap04.tex

@@ -0,0 +1,324 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     chap04/chap04.tex
+% Part:     experimental results, evaluation
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapter{Experimental results and evaluation}
+\label{chap:evaluation}
+% chapter intro, which applications do you analyze, perhaps summarize your results beforehand
+
+\begin{table}
+    \centering
+    \begin{tabular}[c]{|l|l|l|l|}
+        \hline
+        \textbf{Session} & Beaver\_01 & Marten\_01 & GFox\_03 \aster \\
+        \hline
+        \textbf{Images Lapse / Motion} & 1734 / 695 & 2462 / 3105 & 1210 / 2738 \\
+        \hline
+        \textbf{Identical Lapse Duplicates} & 0 & 621 & N/A \\
+        \hline
+        \textbf{Inconsistent Lapse Duplicates} & 0 & 108 & N/A \\
+        \hline
+        \textbf{Motion \% Anomalous} & 89 \% & 24 \% & 6 \% \\
+        \hline
+    \end{tabular}
+    \caption[Sessions used for experiments]{Sessions used for experiments. The sessions differ in the number of images and the frequency of anomalous images. \aster Note that GFox\_03 is a generated set with no real lapse data.}
+    \label{tab:sessions}
+\end{table}
+
+All four proposed approaches were implemented in Python using NumPy \cite{Harris20:NumPy}, OpenCV \cite{Bradski00:OpenCV} and scikit-learn \cite{Pedregosa11:scikit-learn}. The code is organized in multiple Jupyter notebooks \cite{Kluyver16:Jupyter}. Graphs were generated using Matplotlib \cite{Hunter07:Matplotlib}. The autoencoder approach was implemented using PyTorch \cite{Paszke19:PyTorch}. To generate meaningful scores, three sessions with different characteristics (see \autoref{tab:sessions}) were used for evaluation. Beaver\_01 and Marten\_01 were annotated fully using the tool described in \autoref{sec:labeling}. 
+
+\paragraph{Beaver\_01} can be considered the easiest of datasets since there is only a single camera position and 89 \% of images contain interesting content. With 695 motion images, it is relatively small. It is used as a baseline set.
+
+\paragraph{Marten\_01} only contains 24 \% anomalous images. There are three different camera positions. Moreover, there are 108 inconsistent duplicates in the lapse set. This means that for 108 timestamps, there exist two lapse images that show different camera positions. For 621 additional timestamps, there exist two identical lapse images. It is not apparent how this circumstance came about and there is no obvious way to bring the lapse set back to a consistent state. The inconsistency only affects approach 1, as approach 2 does not use the lapse set and approaches 3 and 4 do not take notice of the timestamps of lapse images. Approach 1 is implemented such that it selects the first lapse image it finds. For inconsistent duplicates, this image is not guaranteed to show the right background.
+
+\paragraph{Fox\_03} represents a special case: The provided lapse images were not taken every hour but every day, making the lapse set ineligible because of the insufficient number of images and the lack of temporal proximity to the motion set. However, labels were provided for this session. Therefore, Fox\_03's lapse set was discarded and the new set GFox\_03 was generated from Fox\_03's motion set using the following procedure (a sketch is given after the list):
+
+\begin{enumerate}
+    \item Find a new set $S$ of consecutively taken motion images in Fox\_03.
+    \item If all images in $S$ are labeled as normal, move them to the lapse set of GFox\_03.
+    \item Repeat from step 1 until all consecutive image sets have been processed.
+    \item Move the remaining images to the motion set of GFox\_03.
+\end{enumerate}
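+
+The following sketch illustrates this procedure; the representation of capture sets and labels is a hypothetical simplification.
+
+\begin{verbatim}
+def generate_session(capture_sets, labels):
+    """capture_sets: list of capture sets, each a list of file names.
+    labels: dict mapping file name -> True if the image is normal.
+    Returns the generated lapse and motion sets."""
+    lapse, motion = [], []
+    for capture_set in capture_sets:
+        if all(labels[name] for name in capture_set):
+            lapse.extend(capture_set)    # fully normal sets become lapse
+        else:
+            motion.extend(capture_set)   # everything else stays motion
+    return lapse, motion
+\end{verbatim}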
+
+A generated session is not equivalent to a real session since the data shift between the generated lapse and motion sets is expected to be much smaller. Moreover, the diversity of generated lapse images is usually lower than that of real lapse images. For these reasons, the results of approaches 3 and 4 on this session cannot be fairly compared to the results of other sessions. Approach 1 still cannot be evaluated on GFox\_03 since, in most cases, there are no lapse images in close temporal proximity. Approach 2 is not affected by the lapse set and can be fairly compared across all sessions. The images in Fox\_03 show only a single camera setting, as intended, with slight camera movements and occasional blurred leaves at the edge of the image.
+
+% -------------------------------------------------------------------
+
+\section{Hyperparameter search}
+\label{sec:experiments}
+
+For each approach, several hyperparameters need to be selected. In the following section, different configurations of hyperparameters are evaluated on the Beaver\_01 session. Beaver\_01 was chosen because it could be quickly annotated. The insights of these first experiments are then used in the next section to evaluate and compare the approaches on multiple sessions.
+
+\subsection{Approach 1 - Lapse Frame Differencing}
+There are three options to consider in this approach:
+
+\paragraph{Difference function} Experiments show a slight increase in performance when using the squared pixel-wise difference over the absolute difference. This increase is only significant when combined with Gaussian filtering. A possible explanation is that the square function gives higher weight to significant pixel differences (as often caused by visual anomalies) while almost ignoring small differences (as often caused by lighting changes).
+
+\paragraph{Difference image metric} As a metric for the similarity of lapse and motion image, the mean and variance of the difference image are considered. Here, experiments show that the variance is the better choice. This can be explained by the spatial limitation of true visual anomalies: An animal would introduce high pixel differences in specific image parts, therefore increasing the variance more than the mean value. On the other hand, a simple lighting change would alter the mean value but not the variance.
+
+\paragraph{Gaussian filtering} Experiments confirm the hypothesis that filtering the image using a Gaussian filter beforehand greatly improves the classifier's performance. Introducing such a filter with $\sigma=2$ already increases the AUC score by $0.1362$ on average, by $0.1485$ for $\sigma=4$, and by $0.1531$ for $\sigma=6$. A $\sigma$-value close to $6$ therefore seems to be a good choice. However, choosing too large a value for $\sigma$ can completely blur smaller anomalies such as birds, making accurate classifications in such cases virtually impossible.
+
+\begin{table}[h]
+    \centering
+    \begin{tabular}[c]{|r|l|l|r|r|r|r|}
+        \hline
+        $\sigma$ & Diff & Metric & AUC & $\TNR_{\TPR \geq 0.9}$ & $\TNR_{\TPR \geq 0.95}$ & $\TNR_{\TPR \geq 0.99}$ \\
+        \hline
+        $0$ & abs & mean & $0.7308$ & $0.4189$ & $0.3514$ & $0.1622$ \\
+        \hline
+        $0$ & abs & var & $0.7414$ & $0.4865$ & $0.4189$ & $0.2432$ \\
+        \hline
+        $0$ & sq & mean & $0.7336$ & $0.4189$ & $0.4054$ & $0.2162$ \\
+        \hline
+        $0$ & sq & var & $0.7296$ & $0.4189$ & $0.4189$ & $0.2568$ \\
+        \hline
+        \hline
+        $2$ & abs & mean & $0.8230$ & $0.6486$ & $0.4459$ & $0.2973$ \\
+        \hline
+        $2$ & abs & var & $0.8941$ & $0.6486$ & $0.5946$ & $0.5000$ \\
+        \hline
+        $2$ & sq & mean & $0.8645$ & $0.6486$ & $0.5811$ & $0.4324$ \\
+        \hline
+        $2$ & sq & var & $0.8986$ & $0.7162$ & $0.6081$ & $0.5270$ \\
+        \hline
+        \hline
+        $4$ & abs & mean & $0.8294$ & $0.6351$ & $0.4459$ & $0.2973$ \\
+        \hline
+        $4$ & abs & var & $0.9068$ & $0.6622$ & $0.5811$ & $0.5270$ \\
+        \hline
+        $4$ & sq & mean & $0.8777$ & $0.6486$ & $0.6081$ & $0.4324$ \\
+        \hline
+        $4$ & sq & var & $0.9156$ & $0.7973$ & $0.6486$ & $0.5676$ \\
+        \hline
+        \hline
+        $6$ & abs & mean & $0.8337$ & $0.5946$ & $0.5270$ & $0.2838$ \\
+        \hline
+        $6$ & abs & var & $0.9109$ & $0.6622$ & $0.6351$ & $0.5270$ \\
+        \hline
+        $6$ & sq & mean & $0.8816$ & $0.6622$ & $0.5811$ & $0.4324$ \\
+        \hline
+        $\mathbf{6}$ & \textbf{sq} & \textbf{var} & $\mathbf{0.9214}$ & $\mathbf{0.7973}$ & $\mathbf{0.6351}$ & $\mathbf{0.5811}$ \\
+        \hline
+    \end{tabular}
+    \caption[Evaluation of approach 1 with different hyperparameters on Beaver\_01]{Evaluation of approach 1 with different hyperparameters on Beaver\_01.}
+    \label{tab:eval_approach1_beaver01}
+\end{table}
+
+The optimal configuration of approach 1 is highlighted in \autoref{tab:eval_approach1_beaver01}. It applies Gaussian filtering with $\sigma = 6$ to both images and then thresholds the variance of the squared pixel difference image to find anomalies, reaching a remarkable AUC score of $0.9214$ on the Beaver\_01 set. It also beats most of the other configurations regarding elimination rates: Even for a TPR of 99 \%, 58 \% of all empty images can be eliminated.
+
+\subsection{Approach 2 - Median Frame Differencing}
+In contrast to approach 1, the temporal median image is used as a background representation. Besides that, all following operations are the same. Indeed, experiments confirm that the same choices regarding difference function, metric and Gaussian filtering should be made. In this case, $\sigma = 4$ and $\sigma = 6$ perform almost equivalently.
+
+\begin{table}[h]
+    \centering
+    \begin{tabular}[c]{|r|l|l|r|r|r|r|}
+        \hline
+        $\sigma$ & Diff & Metric & AUC & $\TNR_{\TPR \geq 0.9}$ & $\TNR_{\TPR \geq 0.95}$ & $\TNR_{\TPR \geq 0.99}$ \\
+        \hline
+        $0$ & sq & mean & $0.7794$ & $0.6351$ & $0.4730$ & $0.1757$ \\
+        \hline
+        $0$ & sq & var & $0.7897$ & $0.6622$ & $0.5946$ & $0.2703$ \\
+        \hline
+        \hline
+        $2$ & sq & mean & $0.8475$ & $0.6622$ & $0.5811$ & $0.3378$ \\
+        \hline
+        $2$ & sq & var & $0.8735$ & $0.7973$ & $0.7162$ & $0.4865$ \\
+        \hline
+        \hline
+        $4$ & sq & mean & $0.8509$ & $0.6081$ & $0.5270$ & $0.2568$ \\
+        \hline
+        $\mathbf{4}$ & \textbf{sq} & \textbf{var} & $\mathbf{0.8776}$ & $\mathbf{0.7838}$ & $\mathbf{0.7027}$ & $\mathbf{0.4459}$ \\
+        \hline
+        \hline
+        $6$ & sq & mean & $0.8509$ & $0.6081$ & $0.5270$ & $0.2297$ \\
+        \hline
+        $6$ & sq & var & $0.8766$ & $0.7973$ & $0.6757$ & $0.3919$ \\
+        \hline
+    \end{tabular}
+    \caption[Evaluation of approach 2 with different hyperparameters on Beaver\_01]{Evaluation of approach 2 with different hyperparameters on Beaver\_01.}
+    \label{tab:eval_approach2_beaver01}
+\end{table}
+
+The optimal configuration of approach 2 is highlighted in \autoref{tab:eval_approach2_beaver01}. It applies Gaussian filtering with $\sigma = 4$ to the motion and median image and then thresholds the variance of the squared pixel difference image to find anomalies, reaching an AUC score of $0.8776$ on the Beaver\_01 set. As expected, approach 2 is less accurate than approach 1 since it cannot benefit from the added information value of the lapse images. However, this might not be the case in general since the approaches have very different preconditions: Approach 1 requires lapse images of high temporal proximity and similar lighting, whereas approach 2 requires that, for every pixel, the majority of values across a capture set show normal background.
+
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[b]{.48\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach1_Beaver_01_sqvar_sigma4.pdf}
+        \caption{Approach 1.}
+        \label{fig:approach1_roc}
+    \end{subfigure}
+    \begin{subfigure}[b]{.48\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach2_Beaver_01_sqvar_sigma4.pdf}
+        \caption{Approach 2.}
+        \label{fig:approach2_roc}
+    \end{subfigure}
+    \caption[ROC curve of approaches 1 and 2 with identical configuration on Beaver\_01]{ROC curve of approaches 1 and 2 with identical configuration on Beaver\_01.}
+    \label{fig:eval_approach12_beaver01_roc}
+\end{figure}
+
+\autoref{fig:eval_approach12_beaver01_roc} provides insight into the quality of approach 1 and 2 classifiers by comparing their ROC curves. In the context of empty image elimination, we expect the TPR to be at least $0.9$ to keep most of the interesting images. When looking only at the top part of the ROC graph where this holds, the curves look roughly similar.
+
+\subsection{Approach 3 - Bag of Visual Words}
+The local feature approach allows for various options regarding feature generation and clustering. The following options were considered:
+
+\paragraph{Clustering algorithm} The most common clustering method is $k$-Means, which is very time-consuming for large datasets. Experiments show that choosing the prototype vectors randomly achieves similar accuracies to k-Means and is much faster. The mean AUC score for random prototypes is on average $0.0131$ higher than for k-Means clustering. Results acquired using the k-Means algorithm are listed in \autoref{tab:eval_approach3_beaver01}, results of random clustering are illustrated in a boxplot in \autoref{fig:eval_approach3_random_boxplot}. The colored lines mark the mean values over 10 tests, the box extends from the first to the third quartile (interquartile range). The whiskers extend to the last datum within $1.5$ times the interquartile range from the box; data beyond the whiskers are considered outliers and marked as points.
+
+\begin{table}[h]
+    \centering
+    \begin{tabular}[c]{|r|r|l|l|r|r|r|r|}
+        \hline
+        $k$ & $s$ & Clustering & M & AUC & $\TNR_{\TPR \geq 0.9}$ & $\TNR_{\TPR \geq 0.95}$ & $\TNR_{\TPR \geq 0.99}$ \\
+        \hline
+        1024 & 30 & kmeans & N & 0.7698 & 0.3929 & 0.3800 & 0.0757 \\
+        \hline
+        2048 & 30 & kmeans & N & 0.7741 & 0.4976 & 0.3382 & 0.0564 \\
+        \hline
+        4096 & 30 & kmeans & N & 0.7837 & 0.5797 & 0.2866 & 0.0451 \\
+        \hline
+        2048 & 40 & kmeans & N & 0.7611 & 0.3317 & 0.1610 & 0.1320 \\
+        \hline
+        \hline
+        1024 & 30 & kmeans & Y & 0.7056 & 0.2432 & 0.2222 & 0.0821 \\
+        \hline
+        2048 & 30 & kmeans & Y & 0.7390 & 0.3172 & 0.3092 & 0.0612 \\
+        \hline
+        4096 & 30 & kmeans & Y & 0.7542 & 0.3768 & 0.2963 & 0.0515 \\
+        \hline
+    \end{tabular}
+    \caption[Evaluation of approach 3 with different hyperparameters on Beaver\_01 using k-Means clustering]{Evaluation of approach 3 with different hyperparameters on Beaver\_01 using k-Means clustering. $k$ is the vocabulary size, $s$ is the keypoint and step size. $M$ specifies whether the motion features were included in clustering (yes/no).}
+    \label{tab:eval_approach3_beaver01}
+\end{table}
+
+\paragraph{Including motion features} Descriptors of (unlabeled) motion images can be included in the unsupervised clustering process. Even though this can enhance the feature richness, experiments show that the AUC score as well as the elimination rates decrease while the computation time increases significantly when the number of motion images is high. In conclusion, only lapse descriptors should be used for clustering.
+
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[b]{\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach3_boxplot_random.pdf}
+        \caption{AUC}
+        \label{fig:eval_approach3_random_boxplot_auc}
+    \end{subfigure}
+    \begin{subfigure}[b]{\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach3_boxplot_random_tnr95.pdf}
+        \caption{$\TNR_{\TPR \geq 0.95}$}
+        \label{fig:eval_approach3_random_boxplot_tnr95}
+    \end{subfigure}
+    \caption[Evaluation of approach 3 using random prototypes]{Evaluation of approach 3 using random prototypes. Ten different vocabularies with randomly selected prototype vectors were tested per configuration.}
+    \label{fig:eval_approach3_random_boxplot}
+\end{figure}
+
+\paragraph{Vocabulary size} The optimal vocabulary size can vary greatly depending on session complexity and number of images as well as keypoint step size. Larger sessions often require more prototype vectors to accurately distinguish between images. A small step size produces more keypoints, therefore requiring more visual words. Experiments verify these considerations and show that a vocabulary size $k$ between $1000$ and $5000$ is appropriate.
+
+For the sake of clarity, when using boxplots to visualize scores, only $\TNR_{\TPR \geq 0.95}$ is given as the `elimination rate'. Choosing a higher required $\TPR$ does not leave a margin for dataset errors and annotation mistakes, while a lower $\TPR$ is not of practical relevance.
+
+\paragraph{Keypoint size} Keypoint size and step size can be chosen independently; however, rudimentary first experiments show that choosing them identically provides a good compromise between image coverage and number of visual words. The keypoint size controls the resolution at which the model `scans' the image - very small keypoints can lead to high noise sensitivity while too large keypoints wash out potentially important details. The step size $s$ influences the number of visual words and therefore the complexity of obtaining visual words and clustering - for $s=40$, there are 364 keypoints per image, 629 keypoints for $s=30$, and 1456 keypoints for $s=20$. As these values show, $s$ should not be smaller than $20$ because of expensive visual words collection, but also not larger than $40$ to obtain enough keypoints per image. For the Beaver\_01 dataset, $s=20$ proves to be the most accurate configuration, particularly when looking at the elimination rates (see \autoref{fig:eval_approach3_random_boxplot_tnr95}). Note that when the step size is cut in half, the clustering complexity remains constant (when using random clustering) but the number of local feature descriptors to compute quadruples. Therefore, the training process for $s=20$ takes approximately twice as long as for $s=30$ and four times as long as for $s=40$. Moreover, more local features require more memory for clustering.
+
+\subsection{Approach 4 - Autoencoder}
+For neural networks such as an autoencoder, there are several hyperparameters which affect the quality of the trained model. The learning rate $\alpha$ was found to be optimal around $10^{-4}$ where the model converges after around 200 epochs of training. For the larger learning rate of $10^{-3}$, the model converges more quickly after 100 epochs but performs slightly worse regarding both AUC ($-0.021$) and elimination rates ($-0.169$) when tested on the default configuration with 512 latent features. The optimizer used was Adam \cite{Kingma14:Adam} with a weight decay of $10^{-5}$. Each configuration was trained 10 times with different random initializations.
+
+\paragraph{Early stopping} Especially in anomaly detection scenarios and when the risk of overfitting is high, it can be beneficial to stop training before the model converges. Experiments show that early stopping is not beneficial here as the model does not tend to overfit easily.
+
+\paragraph{Anomaly metric} Anomalies can either be detected by measuring the reconstruction loss as the mean squared error between input and output image or by estimating the distribution of the latent features via Kernel Density Estimation (KDE). Experiments show that low log-likelihood of the latent features with respect to the KDE-estimated distribution is a much more reliable indicator than a high reconstruction loss, achieving $0.338$ higher AUC and $0.304$ higher elimination rates on average. All following graphs were thus created using KDE.
+
+\paragraph{Bottleneck size} The size of the bottleneck layer controls how much the autoencoder needs to compress the input information. Too small bottleneck sizes cause too great a loss of information, while autoencoders with large bottlenecks usually do not eliminate enough redundancies. Experiments show that a bottleneck size of $16$ can already represent enough information to reach a mean AUC of $0.859$ (see \autoref{fig:eval_approach4_latentfeatures}). Optimal scores are achieved with around $512$ latent features (AUC $0.893$). As the diversity of scenes in Beaver\_01 is relatively low compared to other sessions, it is advisable to choose the larger value of $512$ latent features as a general rule of thumb. It is noticeable that configurations with fewer latent features mostly have slightly higher elimination rates. This indicates that it might be useful to impose a sparsity constraint.
+
+\begin{figure}[htbp]
+    \centering
+    \begin{subfigure}[b]{.49\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_latentfeatures_auc.pdf}
+        \caption{AUC}
+        \label{fig:eval_approach4_latentfeatures_auc}
+    \end{subfigure}
+    \begin{subfigure}[b]{.49\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_latentfeatures_tnr95.pdf}
+        \caption{$\TNR_{\TPR \geq 0.95}$}
+        \label{fig:eval_approach4_latentfeatures_tnr95}
+    \end{subfigure}
+    \caption[Evaluation of approach 4 for different bottleneck sizes]{Evaluation of approach 4 for different bottleneck sizes.}
+    \label{fig:eval_approach4_latentfeatures}
+\end{figure}
+
+\paragraph{Denoising autoencoder} Denoising autoencoders aim to make the latent representation more robust towards small input changes and noise. This prevents overfitting by adding noise as a dataset augmentation method. Experiments show that AUC and elimination rates are only weakly affected by the noise; the scores decrease only for strong noise with $\sigma > 0.3$. As shown in \autoref{fig:eval_approach4_extensions_tnr95}, the mean elimination rate increases by $7.3$ when adding Gaussian noise with a low $\sigma = 0.1$.
+
+\paragraph{Sparse autoencoder} Sparse autoencoders force the latent feature vector to be sparse. In this implementation, this is done by imposing an L1 penalty on the bottleneck activations. Sparsity is a form of regularization that further decreases the dimensionality of the latent space by favoring the elimination of more redundancies, thereby preventing overfitting. To a certain degree, it can have similar effects as decreasing the bottleneck size, but is more adapted to the dataset by using a continuous penalty rather than discrete bottleneck sizes. In experiments, the sparsity constraint did not achieve any significant improvements compared to the base model (see \autoref{fig:eval_approach4_extensions}). As expected, choosing $\lambda$ too high leads to dropping scores.
+
+\begin{figure}[tb]
+    \centering
+    \begin{subfigure}[b]{.9\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_denoising_and_sparse_auc.pdf}
+        \caption{AUC}
+        \label{fig:eval_approach4_extensions_auc}
+    \end{subfigure}
+    \begin{subfigure}[b]{.9\textwidth}
+        \includegraphics[width=\textwidth]{images/results/approach4_boxplot_kde_denoising_and_sparse_tnr95.pdf}
+        \caption{$\TNR_{\TPR \geq 0.95}$}
+        \label{fig:eval_approach4_extensions_tnr95}
+    \end{subfigure}
+    \caption[Evaluation of approach 4 for denoising and sparse extensions]{Evaluation of approach 4 for denoising and sparse extensions, respectively. $\sigma$ is the standard deviation of the Gaussian noise for the denoising autoencoder; $\lambda$ is the multiplier for the sparsity constraint of the sparse autoencoder.}
+    \label{fig:eval_approach4_extensions}
+\end{figure}
+
+\subsection{Summary}
+Although the choice is not always clear-cut, a configuration can be selected for every approach that is optimal on the Beaver\_01 session. These configurations are listed in \autoref{tab:eval_summary1} and will be used in the second series of experiments to compare the performance of the different approaches on other sessions.
+
+\begin{table}
+    \centering
+    \begin{tabular}[c]{|l|l|}
+        \hline
+        \textbf{Approach} & \textbf{Optimal configuration} \\
+        \hline
+        1 - Lapse Frame Differencing & Variance of squared difference image, \\
+        & Gaussian filtering $\sigma=6$ \\
+        \hline
+        2 - Median Frame Differencing & Variance of squared difference image, \\
+        & Gaussian filtering $\sigma=4$ \\
+        \hline
+        3 - Bag of Visual Words & 4096 clusters, densely sampled SIFT features \\
+        & with keypoint size = step size = 20, random \\
+        & prototypes, motion features not included in \\
+        & training \\
+        \hline
+        4 - Autoencoder & Bottleneck size 512, Gaussian noise on \\
+        & input $\sigma = 0.1$, Kernel Density Estimation \\
+        \hline
+    \end{tabular}
+    \caption[Best-performing configurations for the proposed approaches]{Best-performing configurations for the proposed approaches.}
+    \label{tab:eval_summary1}
+\end{table}
+
+% -------------------------------------------------------------------
+
+\section{Benchmarking the approaches}
+After picking a good configuration for every approach, the methods were additionally tested on the two sessions Marten\_01 and GFox\_03. Both sessions have particular characteristics and irregularities that were described at the beginning of \autoref{sec:experiments}. The results are listed in \autoref{tab:eval_summary2}.
+
+\begin{table}
+    \centering
+    \begin{tabular}[c]{|ll|r|r|r|r|}
+        \hline
+        \textbf{Dataset}     & \textbf{Metric} & 1 LapseFD & 2 MedianFD & 3 BoVW & 4 AE \\
+        \hline \hline
+        Beaver\_01  & AUC                     & 0.9214 & 0.8776 & 0.8127 & 0.8928 \\
+                    & $\TNR_{\TPR \geq 0.95}$ & 0.6351 & 0.7027 & 0.5322 & 0.5743 \\
+        \hline
+        Marten\_01  & AUC                     & \aster 0.8012 & 0.8740 & 0.5913 & 0.7189 \\
+                    & $\TNR_{\TPR \geq 0.95}$ & \aster 0.2474 & 0.4024 & 0.1906 & 0.0556 \\
+        \hline
+        GFox\_03    & AUC                     & \aster N/A  & 0.9812 & \aster 0.9730 & \aster 0.9739 \\
+                    & $\TNR_{\TPR \geq 0.95}$ & \aster N/A  & 0.9510 & \aster 0.8150 & \aster 0.8782 \\
+        \hline
+    \end{tabular}
+    \caption[Scores of all four approaches on all annotated datasets]{Scores of all four approaches on all annotated datasets. Results marked with \aster cannot be fairly compared due to irregularities in the dataset; approach 1 is not applicable for GFox\_03 since there is no real lapse set (see beginning of \autoref{sec:experiments}).}
+    \label{tab:eval_summary2}
+\end{table}
+
+Overall, the classic approaches 1 and 2 achieve the best elimination rates. Contrary to expectations, approach 2 consistently beats approach 1 regarding elimination rates. For the generated GFox\_03 set, approach 2 achieves higher AUC and elimination rates than approaches 3 and 4, even though approach 2 is completely unsupervised, whereas approaches 3 and 4 were provided with some annotated normal motion images in the form of the generated lapse set.
+
+Noteworthy is the comparatively low elimination rate of the autoencoder on Marten\_01. A possible explanation is that the general reconstruction accuracy drops due to the high diversity of backgrounds, some of which are strongly underrepresented in the lapse set. Even though Marten\_01 represents an extreme case of an irregular dataset, this shows that the autoencoder becomes unreliable more quickly when the background changes frequently.
+
+Of course, it should be noted that the selected configurations might not be ideal for every camera session. However, as the practical aim of this work is to find a generally applicable method, it makes sense to choose the same configuration for all sessions. GFox\_03 is the most realistic available session, with a low frequency of animals and only a single camera position. The high elimination rates on this session, even though they are biased because the lapse images were generated, therefore demonstrate the capacity and usefulness of the approaches in a practical setting.

+ 33 - 0
chapters/conclusions/conclusions.tex

@@ -0,0 +1,33 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     conclusions/conclusions.tex
+% Part:     conclusions
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapter{Conclusions} % \chap{Zusammenfassung}
+\label{chap:conclusions}
+
+% -------------------------------------------------------------------
+
+In this thesis, four different approaches for eliminating empty camera trap images were examined. A particular challenge in the setting of camera trap images is keeping the false negative rate low, i.e. not falsely eliminating images with animals. Even a classifier with a good AUC score can be useless in this context if it does not perform well under this precondition.
+
+The lapse frame differencing approach showed that even simple pixel-wise comparison methods can distinguish between normal and anomalous images. Although only a portion of the images can be safely eliminated without driving up the false negative rate, the achieved accuracy is high enough to noticeably accelerate the image analysis process. The success of this method shows that it is advisable for camera trap operators to keep generating lapse images, i.e., to additionally trigger the camera once an hour, in order to deploy this method. Lapse frame differencing is a simple, easy-to-understand approach and, moreover, the fastest, as it does not require any training process. Furthermore, it can quickly adapt to changing weather and lighting conditions or different camera settings since it only refers to the closest lapse image.
+
+However, especially when dealing with older existing datasets, lapse images with close temporal proximity often do not exist. To address this issue, the Median Frame Differencing approach was proposed and compared to the previous one. As expected, its results are slightly below those of the previous approach in direct comparison, since the median image does not always accurately resemble the background. This is the case when the foreground object does not move enough or the lighting conditions change, often because of the adjustment of the aperture due to the darkening caused by the animal. Still, this approach is just as fast, requires no training process, adapts quickly, and can be utilized to eliminate a significant portion of empty images even when there are no lapse images available. Even outside of this use case, it outperformed lapse frame differencing regarding elimination rates on both eligible datasets.
+
+When it comes to comparing image contents, local features have successfully been used for classification tasks for many years. In a Bag of Visual Words approach, local features are densely sampled from lapse images, clustered into a vocabulary, and then used for fitting a one-class support vector machine. In contrast to the first two approaches, a training process is required, controlled by several hyperparameters.
+
+The most successful configuration achieves lower AUC scores than frame differencing but similar elimination rates for low FNR. Yet, it has conceptual advantages: First, objects with forest-like brightness but different textures are hard to detect by frame differencing, whereas the local features are affected by the texture change. Second, large anomalous objects cause large image differences and therefore high anomaly scores in frame differencing. This is not necessarily the case for local feature-based approaches: Here, anomalies manifest themselves in the absence of normal features or the presence of abnormal ones.
+
+The biggest disadvantage of using more complex approaches is training speed: Both accuracy and computational effort of the local feature approach increase when using more keypoints. However, experiments showed that choosing random prototypes for the vocabulary performs equivalently while being much faster with a constant running time.
+
+Lastly, the autoencoder approach slightly outperforms the local feature approach on two of the datasets. A possible reason for the performance gain is the usage of color information. On Marten\_01, however, the multiple camera position changes cause the model's elimination rate to decline. Approach 4, therefore, appears to be less reliable in experiments. Different extensions such as Denoising Autoencoder and Sparse Autoencoder have little effect on the scores. Presumably, the architecture of the autoencoder model can be improved.
+
+In summary, the experiments show that it is often possible to eliminate the majority of empty images from camera trap data with relatively simple methods. The proposed methods work in a weakly supervised or, in the case of approach 2, even in a completely unsupervised manner, and can thus minimize human annotation effort on the test set.
+
+% -------------------------------------------------------------------
+
+% insert further sections if necessary

+ 35 - 0
chapters/furtherWork/furtherWork.tex

@@ -0,0 +1,35 @@
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+% Author:   Felix Kleinsteuber
+% Title:    Anomaly Detection in Camera Trap Images
+% File:     furtherWork/furtherWork.tex
+% Part:     ideas for further work
+% Description:
+%         summary of the content in this chapter
+% Version:  16.05.2022
+% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
+\chapter{Further work} % \chap{Ausblick}
+\label{chap:furtherWork}
+
+% \begin{itemize}
+%   \item explain ideas for further work
+%   \item what is not yet solved?
+%   \item possible extensions/improvements
+%   \item you can use different sections (and even subsections) for different ideas, % problems
+%   \item a single text, e.g. with paragraphs, is also sufficient
+% \end{itemize}
+
+% -------------------------------------------------------------------
+
+There are several improvements and extensions for single approaches that were considered but not implemented in this work. In the following paragraphs, some possible improvements are discussed.
+
+\paragraph{Lapse Frame Differencing} For simplicity, the images were converted to grayscale before taking the difference image. Additional RGB color information could improve the accuracy for daytime images. The approach could also benefit from switching to a different color space such as HSV (hue, saturation, value) or L*a*b* (lightness $L^*$, color plane $(a^*, b^*)$). For example, \cite{Haensch14:ColorSpacesForGraphCut} show that graph-cut segmentations based on L*a*b* are of higher quality than in any other major color space, whereas segmentations in RGB are mostly worse than in any other space. It is conceivable that these findings can be transferred to the frame differencing approach.
+
+\paragraph{Median Frame Differencing} The main problem with taking the median image is that it often fails to resemble the background accurately. Future work could examine the question of whether it is possible to detect such cases. Additionally, in scenarios where lapse images are available, Lapse and Median Frame Differencing could be combined to achieve higher reliability.
+
+\paragraph{Bag of Visual Words} In the current implementation, standard SIFT descriptors \cite{Lowe04:SIFT} are used. For daylight images, it is presumably beneficial to also employ color information. Color extensions of SIFT exist, such as CSIFT \cite{Abdel-Hakim06:CSIFT}. Alternatively, a single channel from another color space could be used, as described in the Lapse Frame Differencing paragraph.
+
+\paragraph{Autoencoder} There are many possible extensions to explore. The KL divergence-based sparsity constraint (as explained in \autoref{sec:sparse_ae}) has not been implemented. It is possible that this extension could slightly improve the autoencoder quality. Moreover, it could be beneficial to make adjustments to the model architecture. Currently, a fully convolutional model with few parameters is used for simplicity and efficiency. An autoencoder with dense layers can make global connections that a fully convolutional one cannot, but requires training significantly more parameters. It is also possible that a model with a larger input image size (such as 512 x 512) would be better at detecting smaller anomalies: Many of the images where the autoencoder fails contain small birds or just the eyes of an animal. Another modification of autoencoders is the Variational Autoencoder (VAE) \cite{Kingma13:VAE}, which has also successfully been applied in anomaly detection \cite{Xu18:VAEforAD,Krajsic21:VAEforAD}. There are other deep learning models that have successfully been applied in the weakly supervised anomaly detection field, such as Generative Adversarial Networks (GANs) \cite{Goodfellow14:GANs}, which are much harder to train and often require more data.
+
+\paragraph{Evaluation} A particular challenge in creating this work was ordering, preprocessing, and annotating the dataset. Unfortunately, no other annotated sessions were available at the time of writing. In future work, annotating and evaluating more sessions could lead to more conclusive results. It could also help answer the question of whether the different approaches fail on different image contents. If so, combining the approaches in a majority voting scheme could further improve reliability.
+
+Since all proposed methods work as threshold classifiers, a suitable threshold must be chosen before they can be applied in practice. Further research could bring the approaches into practical use and derive heuristics for choosing threshold values.
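+
+One simple heuristic, sketched below under the assumption that a small labeled subset is available, is to pick the threshold that maximizes Youden's $J$ statistic ($\TPR - \FPR$) on that subset using scikit-learn (variable names are placeholders):
+
+\begin{verbatim}
+# Sketch: threshold selection via Youden's J on a labeled subset.
+# "scores" are anomaly scores, "labels" are 1 for anomalous images.
+import numpy as np
+from sklearn.metrics import roc_curve
+
+fpr, tpr, thresholds = roc_curve(labels, scores)
+best = int(np.argmax(tpr - fpr))   # Youden's J = TPR - FPR
+threshold = thresholds[best]
+
+# a new image is flagged as anomalous if its score exceeds the threshold
+is_anomalous = new_score > threshold
+\end{verbatim}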

+ 211 - 0
customdbvthesis.cls

@@ -0,0 +1,211 @@
+%%
+%% Some changes were made to this file.
+%% Original file: `dbvthesis.cls'
+%%
+%% IMPORTANT NOTICE:
+%% 
+%% You are not allowed to change this file.  You may however copy this file
+%% to a file with a different name and then change the copy.
+%% 
+%% You are NOT ALLOWED to distribute this file alone.  You are NOT ALLOWED
+%% to take money for the distribution or use of this file (or a changed
+%% version) except for a nominal charge for copying etc.
+%% 
+%% 
+\def\filedate{2012/01/02}
+\def\fileversion{1.0}
+\def\filename{customdbvthesis.cls}
+\NeedsTeXFormat{LaTeX2e}[1997/06/01]
+
+\ProvidesClass{customdbvthesis}[\filedate\space v\fileversion\space customdbvthesis class]
+
+\def\BaseClass{scrbook}
+
+%%%%%%%%%%%%%%% Option Part %%%%%%%%%%%%%%%%%%%%
+\RequirePackage{ifthen}
+
+\DeclareOption{11pt}{\def\@fontsize{1}}
+\DeclareOption{12pt}{\def\@fontsize{2}}
+
+\DeclareOption{english}{\def\@language{english}}
+\DeclareOption{german}{\def\@language{german}}
+\DeclareOption{ngerman}{\def\@language{ngerman}}
+\def\engl{english}
+
+\ifx \@language\engl
+\ExecuteOptions{english} \ProcessOptions
+\else
+\ExecuteOptions{german} \ProcessOptions
+\fi
+
+\newif\ifpredefineddeclaration
+\DeclareOption{predefineddeclaration}{\predefineddeclarationtrue}
+
+\DeclareOption*{\PassOptionsToClass{\CurrentOption}{\BaseClass}}
+\ProcessOptions\relax
+
+%%%%%%%%%%%%%%% Option Part End %%%%%%%%%%%%%%%%
+
+    \LoadClass[1\@fontsize pt,a4paper,\@language,twoside,openright,listof=totoc,bibliography=totoc,headings=twolinechapter,numbers=noenddot]{\BaseClass}[1997/04/16]
+
+    \RequirePackage{setspace}
+    \onehalfspacing
+
+    \RequirePackage{graphicx}
+
+    \def\dbvthesisTitlepageAndDeclaration{
+
+	\bgroup
+	\def\baselinestretch{1.0}%
+
+	\def\Title##1{\def\title{##1}} \def\title{}
+	\def\Subtitle##1{\def\subtitle{##1}} \def\subtitle{}
+	\def\ThesisType##1{\def\thesisType{##1}} \def\thesisType{}
+
+	\def\FirstName##1{\def\firstName{##1}} \def\firstName{}
+	\def\LastName##1{\def\lastName{##1}} \def\lastName{}
+	\def\DateOfBirth##1##2##3{
+	    \ifx \@language\engl \def\dateOfBirth{\ifcase##2\or January\or February\or March\or April\or May\or June\or July\or August\or September\or October\or November\or December\fi \space
+						  \ifcase##1\or 1\or 2\or 3\or 4 \or 5\or 6\or 7\or 8\or 9\else ##1\fi ,~##3} 
+	    \else \def\dateOfBirth{##1.##2.##3}
+	    \fi	} \def\dateOfBirth{}
+	\def\Birthplace##1{\def\birthplace{##1}} \def\birthplace{}
+
+	\def\Supervisor##1{\def\supervisor{##1}} \def\supervisor{}
+	\def\Advisor##1{\def\advisor{##1}} \def\advisor{}
+
+	\def\ThesisStart##1##2##3{
+	    \ifx \@language\engl \def\thesisStart{\ifcase##2\or January\or February\or March\or April\or May\or June\or July\or August\or September\or October\or November\or December\fi \space
+						  \ifcase##1\or 1\or 2\or 3\or 4 \or 5\or 6\or 7\or 8\or 9\else ##1\fi ,~##3} 
+	    \else \def\thesisStart{##1.##2.##3}
+	    \fi 
+	} \def\thesisStart{}
+
+	\def\ThesisEnd##1##2##3{
+	    \ifx \@language\engl \def\thesisEnd{\ifcase##2\or January\or February\or March\or April\or May\or June\or July\or August\or September\or October\or November\or December\fi \space
+						\ifcase##1\or 1\or 2\or 3\or 4 \or 5\or 6\or 7\or 8\or 9\else ##1\fi ,~##3} 
+	    \else \def\thesisEnd{##1.##2.##3}
+	    \fi
+	    \def\declarationDate{##1.~\ifcase##2\or Januar\or Februar\or M\"arz\or April\or Mai\or Juni\or Juli\or August\or September\or Oktober\or November\or Dezember\fi \space##3}
+	} {\def\thesisEnd{} \def\declarationDate{}}
+
+	\def\SecondInstitute##1{\def\secondInstitute{##1}} \def\secondInstitute{}
+
+    }
+
+    \ifx \@language\engl
+
+	\def\chair{Computer Vision Group}
+	\def\department{Department of Mathematics and Computer Science}
+	\def\university{Friedrich-Schiller-Universit\"at Jena}
+
+	\def\supervisorText{Supervisor:}
+	\def\advisorText{Advisors:}
+
+	\def\thesisStartText{Started:}
+	\def\thesisEndText{Finished:}
+
+    \else
+
+	\def\chair{Lehrstuhl f\"ur Digitale Bildverarbeitung}
+	\def\department{Fakult\"at f\"ur Mathematik und Informatik}
+	\def\university{Friedrich-Schiller-Universit\"at Jena}
+
+	\def\supervisorText{Gutachter:}
+	\def\advisorText{Betreuer:}
+
+	\def\thesisStartText{Beginn der Arbeit:}
+	\def\thesisEndText{Ende der Arbeit:}
+
+    \fi
+
+    \def\enddbvthesisTitlepageAndDeclaration{%
+
+	\addtolength{\oddsidemargin}{1cm}
+	\enlargethispage{4cm}
+
+	{ % titlepage
+	  \thispagestyle{empty}
+
+	  \vfill
+
+	  \begin{center}
+	    \ifx \@language\engl \includegraphics[width=50mm]{images/UniJena_BildWortMarke_black.pdf} \else \includegraphics[width=50mm]{images/UniJena_BildWortMarke_black.pdf} \fi
+	    \vfill
+	    {\Huge \textbf{\title} \\}
+	    \onehalfspacing
+	    \ifx \subtitle\empty \else \textbf{\Large \subtitle \\} \fi
+	    \vfill
+	    \textbf{\large \thesisType \ifx \@language\engl \ in Computer Science \else \ im Fach Informatik \fi}
+	    \vfill
+	    \normalsize
+	    \ifx \@language\engl submitted by \else vorgelegt von \fi \\
+	    \textbf{\firstName~\lastName} \\
+	    \textbf{\ifx \@language\engl born~\dateOfBirth~in~\birthplace \else geboren~am~\dateOfBirth~in~\birthplace \fi} \\
+	    \vfill
+	    \ifx \@language\engl written at \else angefertigt am \fi \\
+	    \textbf{\chair \\ \department \\ \university} \\
+	    \ifx \secondInstitute\empty \else \vfill \ifx \@language\engl in cooperation with \else in Zusammenarbeit mit \fi \\ \textbf{\secondInstitute} \\ \fi
+	  \end{center}
+
+	  \vfill
+
+	  \begin{flushleft}
+	      \begin{tabular}{ll}
+		\ifx \supervisor\empty \else  \supervisorText & \supervisor \\ \fi
+		\advisorText & \advisor \\
+		\thesisStartText & \thesisStart \\
+		\thesisEndText & \thesisEnd \\
+	      \end{tabular}
+	  \end{flushleft}
+      
+	  \vfill
+	}
+
+	\cleardoublepage
+
+	\addtolength{\oddsidemargin}{-1cm}
+	
+	\ifpredefineddeclaration
+
+	    \setcounter{page}{1}
+
+	    \begin{center}{\sectfont\LARGE Erkl{\"a}rung}\end{center}
+		\noindent
+		Ich versichere, dass ich die vorliegende Arbeit (bei Gruppenarbeiten die entsprechend gekennzeichneten Anteile) selbstst{\"a}ndig verfasst und keine anderen als die angegebenen Hilfsmittel und Quellen benutzt habe.
+		Zitate und gedankliche {\"U}bernahmen aus fremden Quellen (einschlie{\ss}lich elektronischer Quellen) habe ich kenntlich gemacht.
+		Die eingereichte Arbeit wurde bisher keiner anderen Pr{\"u}fungsbeh{\"o}rde vorgelegt und wurde auch nicht ver{\"o}ffentlicht.
+		Mir ist bekannt, dass eine unwahre Erkl{\"a}rung rechtliche Folgen haben und insbesondere dazu f{\"u}hren kann, dass die Arbeit als nicht bestanden bewertet wird.
+		Die Richtlinien des Lehrstuhls f{\"u}r Examensarbeiten habe ich gelesen und anerkannt.\\
+		Seitens des Verfassers bestehen keine Einw{\"a}nde, die vorliegende Examensarbeit f{\"u}r die {\"o}ffentliche Benutzung zur Verf{\"u}gung zu stellen.\\[25mm]
+		\par \noindent
+		Jena, den \declarationDate \hfill \firstName~\lastName
+
+	\else
+
+		\include{declaration}
+
+	\fi
+
+	\egroup
+    }
+
+    \ifx \@language\engl
+    \IfFileExists{babel.sty}
+    {\RequirePackage[\@language]{babel}[1997/01/23] }
+    {\IfFileExists{english.sty}
+      {\RequirePackage{english}[1997/05/01]}
+      {\ClassError{dbvthesis}
+	{Neither babel nor english.sty installed !!!}
+	{Get babel or english.sty !!!}}}
+    \else
+    \IfFileExists{babel.sty}
+      {\RequirePackage[\@language]{babel}[1997/01/23]}
+      {\ClassError{dbvthesis}
+	{Babel not installed !!!}
+	{Get babel package !!!}}
+    \fi
+
+\endinput
+%%
+%% End of file `customdbvthesis.cls'.

+ 13 - 0
declaration.tex

@@ -0,0 +1,13 @@
+\begin{center}{\sectfont\LARGE Erkl{\"a}rung}\end{center}
+
+	\noindent
+	Ich versichere, dass ich die vorliegende Arbeit (bei Gruppenarbeiten die entsprechend gekennzeichneten Anteile) selbstst{\"a}ndig verfasst und keine anderen als die angegebenen Hilfsmittel und Quellen benutzt habe.
+	Zitate und gedankliche {\"U}bernahmen aus fremden Quellen (einschlie{\ss}lich elektronischer Quellen) habe ich kenntlich gemacht.
+	Die eingereichte Arbeit wurde bisher keiner anderen Pr{\"u}fungsbeh{\"o}rde vorgelegt und wurde auch nicht ver{\"o}ffentlicht.
+	Mir ist bekannt, dass eine unwahre Erkl{\"a}rung rechtliche Folgen haben und insbesondere dazu f{\"u}hren kann, dass die Arbeit als nicht bestanden bewertet wird.
+	Die Richtlinien des Lehrstuhls f{\"u}r Examensarbeiten habe ich gelesen und anerkannt.\\
+	Seitens des Verfassers/der Verfasserin bestehen Einw{\"a}nde, die vorliegende Examensarbeit f{\"u}r die {\"o}ffentliche Benutzung zur Verf{\"u}gung zu stellen.\\[25mm]
+	\par \noindent
+	Jena, den 01. Januar 2050 \hfill  Richard Roe
+
+\cleardoublepage

+ 53 - 0
header.tex

@@ -0,0 +1,53 @@
+% file header.tex
+%
+% You may want to use these packages or edit some options
+% Feel free to insert more packages, if you need them
+
+% choose appropriate inputenc
+%\usepackage[utf8x]{inputenc}
+\usepackage[latin1]{inputenc}
+
+\usepackage[T1]{fontenc}
+
+% some useful packages; if you do not need them, comment them out
+\usepackage{babel}
+\usepackage{amsfonts}
+\usepackage{amsmath}
+\usepackage{amssymb}
+\usepackage{amsthm}
+\usepackage{graphicx}
+\usepackage[hang]{caption}
+\usepackage{subcaption}
+\usepackage{booktabs}
+\usepackage{makeidx}
+%\usepackage{subfigure}
+\usepackage{xcolor}
+
+\usepackage{array}
+\usepackage[
+    plainpages=false,
+    pdfpagelayout=TwoPageRight,
+    pdfborder={0 0 0},
+    hyperfootnotes=false
+  ]{hyperref}
+
+\usepackage{longtable} % for tables larger than one page
+\usepackage{pdflscape} % for pages in landscape format
+\usepackage{rotating} % for rotated tables on landscape pages using sidewaystable
+\usepackage{multirow}
+\usepackage{tabularx}
+\usepackage{enumitem}
+\usepackage{tikz}
+\usepackage{cite}
+\usepackage{bm}
+\usepackage{xspace} % required by the \ie, \eg, \etal macros in macros.tex
+\usepackage{dsfont} % provides \mathds, used by the \N macro in macros.tex
+
+% footnote packages
+\usepackage{footnote}
+\usepackage{chngcntr}% necessary for next command
+\counterwithout{footnote}{chapter}% consecutive numbering of footnotes throughout the whole document, no renumeration in each chapter
+
+\usepackage{setspace}
+\onehalfspacing
+
+

+ 19 - 0
macros.tex

@@ -0,0 +1,19 @@
+%your own macros, abbreviations, etc.
+
+\newcommand\todo[1]{\textcolor{red}{TODO: #1}}
+\newcommand\fixme[1]{\textcolor{green}{FIXME: #1}}
+
+\newcommand\ie{\textit{i.e.}\xspace}
+\newcommand\eg{\textit{e.g.}\xspace}
+\newcommand\etal{\textit{et al.}\xspace}
+
+\DeclareMathOperator{\TNR}{TNR}
+\DeclareMathOperator{\TPR}{TPR}
+\DeclareMathOperator{\FNR}{FNR}
+\DeclareMathOperator{\FPR}{FPR}
+
+\newcommand\aster{\textcolor{gray}{($\ast$)\:}}
+
+\newcommand{\N}{\mathds{N}}
+\newcommand{\R}{\mathbb{R}}
+\newcommand{\I}{\mathbb{I}}

+ 690 - 0
thesis.bib

@@ -0,0 +1,690 @@
+% introduction
+
+@article{Cardinale12:BiodiversityLoss,
+author = {Cardinale, Bradley and Duffy, J. and Gonzalez, Andrew and Hooper, David and Perrings, Charles and Venail, Patrick and Narwani, Anita and Tilman, David and Wardle, David and Kinzig, Ann and Daily, Gretchen and Loreau, Michel and Grace, James and Larigauderie, Anne and Srivastava, Diane and Naeem, Shahid},
+year = {2012},
+month = {06},
+pages = {59-67},
+title = {Biodiversity loss and its impact on humanity},
+volume = {486},
+journal = {Nature},
+doi = {10.1038/nature11148}
+}
+
+
+@Article{Bianchi22:BiodiversityMonitoring,
+AUTHOR = {Bianchi, Carlo Nike and Azzola, Annalisa and Cocito, Silvia and Morri, Carla and Oprandi, Alice and Peirano, Andrea and Sgorbini, Sergio and Montefalcone, Monica},
+TITLE = {Biodiversity Monitoring in Mediterranean Marine Protected Areas: Scientific and Methodological Challenges},
+JOURNAL = {Diversity},
+VOLUME = {14},
+YEAR = {2022},
+NUMBER = {1},
+ARTICLE-NUMBER = {43},
+URL = {https://www.mdpi.com/1424-2818/14/1/43},
+ISSN = {1424-2818},
+DOI = {10.3390/d14010043}
+}
+
+% related work
+
+@article{Collins00:VideoSurveillance,
+author = {Collins, Robert and Lipton, Alan and Kanade, Takeo and Fujiyoshi, Hironobu and Duggins, David and Tsin, Yanghai and Tolliver, David and Enomoto, Nobuyoshi and Hasegawa, Osamu and Burt, Peter},
+year = {2000},
+month = {06},
+pages = {},
+title = {A System for Video Surveillance and Monitoring},
+volume = {5},
+journal = {Robot. Inst.}
+}
+
+@inproceedings{Gupta07:FrameDifferencing,
+author = {Gupta, Karan and Kulkarni, Anjali},
+year = {2007},
+month = {01},
+pages = {245-250},
+title = {Implementation of an Automated Single Camera Object Tracking System Using Frame Differencing and Dynamic Template Matching},
+isbn = {978-1-4020-8740-0},
+doi = {10.1007/978-1-4020-8741-7_44}
+}
+
+@misc{Lis19:ADImageResynthesis,
+  doi = {10.48550/ARXIV.1904.07595},
+  url = {https://arxiv.org/abs/1904.07595},
+  author = {Lis, Krzysztof and Nakka, Krishna and Fua, Pascal and Salzmann, Mathieu},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences, I.4.6; I.4.8},
+  title = {Detecting the Unexpected via Image Resynthesis},
+  publisher = {arXiv},
+  year = {2019},
+  copyright = {Creative Commons Attribution Share Alike 4.0 International}
+}
+
+@misc{DiBiase21:PixelwiseAD,
+  doi = {10.48550/ARXIV.2103.05445},
+  url = {https://arxiv.org/abs/2103.05445},
+  author = {Di Biase, Giancarlo and Blum, Hermann and Siegwart, Roland and Cadena, Cesar},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Pixel-wise Anomaly Detection in Complex Driving Scenes},
+  publisher = {arXiv},
+  year = {2021},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
+@article{Japkowicz99:FirstAE,
+author = {Japkowicz, Nathalie and Myers, Catherine and Gluck, Mark},
+year = {1999},
+month = {10},
+pages = {},
+title = {A Novelty Detection Approach to Classification},
+journal = {Proceedings of the Fourteenth Joint Conference on Artificial Intelligence}
+}
+
+@article{Perera19:DeepOCC,
+	doi = {10.1109/tip.2019.2917862},
+	url = {https://doi.org/10.1109%2Ftip.2019.2917862},
+	year = 2019,
+	month = {nov},
+	publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
+	volume = {28},
+	number = {11},
+	pages = {5450--5463},
+	author = {Pramuditha Perera and Vishal M. Patel},
+	title = {Learning Deep Features for One-Class Classification},
+	journal = {{IEEE} Transactions on Image Processing}
+}
+
+@article{LeCun15:DeepLearning,
+author = {LeCun, Yann and Bengio, Y. and Hinton, Geoffrey},
+year = {2015},
+month = {05},
+pages = {436-44},
+title = {Deep Learning},
+volume = {521},
+journal = {Nature},
+doi = {10.1038/nature14539}
+}
+
+@article{LeCun89:CNN,
+title={Backpropagation applied to handwritten zip code recognition},
+author={LeCun, Yann and Boser, Bernhard and Denker, John S and Henderson, Donnie and Howard, Richard E and Hubbard, Wayne and Jackel, Lawrence D},
+journal={Neural computation},
+volume={1},
+number={4},
+pages={541--551},
+year={1989},
+publisher={MIT Press}
+}
+
+@article{Oza19:OCCNN,
+	doi = {10.1109/lsp.2018.2889273},
+	url = {https://doi.org/10.1109%2Flsp.2018.2889273},
+	year = 2019,
+	month = {feb},
+	publisher = {Institute of Electrical and Electronics Engineers ({IEEE})},
+	volume = {26},
+	number = {2},
+	pages = {277--281},
+	author = {Poojan Oza and Vishal M. Patel},
+	title = {One-Class Convolutional Neural Network},
+	journal = {{IEEE} Signal Processing Letters}
+}
+
+% theory
+
+@inbook{Kecman05:SVMs,
+  author = {Kecman, Vojislav},
+  year = {2005},
+  month = {05},
+  pages = {605--605},
+  title = {Support Vector Machines - An Introduction},
+  volume = {177},
+  isbn = {978-3-540-24388-5},
+  journal = {Support Vector Machines: Theory and Applications},
+  doi = {10.1007/10984697_1}
+}
+
+@article{Gidudu:SVMsMultiClass,
+  author = {Gidudu, Anthony and Hulley, Gregory and Marwala, Tshilidzi},
+  year = {2007},
+  month = {11},
+  pages = {},
+  title = {Image classification using SVMs: One-Against-One Vs One-against-All},
+  volume = {abs/0711.2914},
+  journal = {CoRR}
+}
+
+@article{Chang10:RBF,
+  author  = {Yin-Wen Chang and Cho-Jui Hsieh and Kai-Wei Chang and Michael Ringgaard and Chih-Jen Lin},
+  title   = {Training and Testing Low-degree Polynomial Data Mappings via Linear SVM},
+  journal = {Journal of Machine Learning Research},
+  year    = {2010},
+  volume  = {11},
+  number  = {48},
+  pages   = {1471--1490},
+  url     = {http://jmlr.org/papers/v11/chang10a.html}
+}
+
+@misc{Chalapathy19:DeepLearningADSurvey,
+  doi = {10.48550/ARXIV.1901.03407},
+  url = {https://arxiv.org/abs/1901.03407},
+  author = {Chalapathy, Raghavendra and Chawla, Sanjay},
+  keywords = {Machine Learning (cs.LG), Machine Learning (stat.ML), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Deep Learning for Anomaly Detection: A Survey},
+  publisher = {arXiv},
+  year = {2019},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
+@Article{Kiran18:ADInVideo,
+  AUTHOR = {Kiran, B. Ravi and Thomas, Dilip Mathew and Parakkal, Ranjith},
+  TITLE = {An Overview of Deep Learning Based Methods for Unsupervised and Semi-Supervised Anomaly Detection in Videos},
+  JOURNAL = {Journal of Imaging},
+  VOLUME = {4},
+  YEAR = {2018},
+  NUMBER = {2},
+  ARTICLE-NUMBER = {36},
+  URL = {https://www.mdpi.com/2313-433X/4/2/36},
+  ISSN = {2313-433X},
+  DOI = {10.3390/jimaging4020036}
+}
+
+@misc{Jiang22:VisualSensoryADSurvey,
+  doi = {10.48550/ARXIV.2202.07006},
+  url = {https://arxiv.org/abs/2202.07006},
+  author = {Jiang, Xi and Xie, Guoyang and Wang, Jinbao and Liu, Yong and Wang, Chengjie and Zheng, Feng and Jin, Yaochu},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {A Survey of Visual Sensory Anomaly Detection},
+  publisher = {arXiv},
+  year = {2022},
+  copyright = {Creative Commons Attribution Non Commercial No Derivatives 4.0 International}
+}
+
+@article{Perera21:OCCSurvey,
+  title={One-Class Classification: A Survey},
+  author={Pramuditha Perera and Poojan Oza and Vishal M. Patel},
+  journal={ArXiv},
+  year={2021},
+  volume={abs/2101.03064}
+}
+
+@article{Rosenblatt56:KDE1,
+author = {Murray Rosenblatt},
+title = {{Remarks on Some Nonparametric Estimates of a Density Function}},
+volume = {27},
+journal = {The Annals of Mathematical Statistics},
+number = {3},
+publisher = {Institute of Mathematical Statistics},
+pages = {832 -- 837},
+year = {1956},
+doi = {10.1214/aoms/1177728190},
+URL = {https://doi.org/10.1214/aoms/1177728190}
+}
+
+@article{Parzen62:KDE2,
+author = {Emanuel Parzen},
+title = {{On Estimation of a Probability Density Function and Mode}},
+volume = {33},
+journal = {The Annals of Mathematical Statistics},
+number = {3},
+publisher = {Institute of Mathematical Statistics},
+pages = {1065 -- 1076},
+year = {1962},
+doi = {10.1214/aoms/1177704472},
+URL = {https://doi.org/10.1214/aoms/1177704472}
+}
+
+@book{Goodfellow16:DeepLearning,
+  title={Deep Learning},
+  author={Ian Goodfellow and Yoshua Bengio and Aaron Courville},
+  publisher={MIT Press},
+  note={\url{http://www.deeplearningbook.org}},
+  year={2016}
+}
+
+@inproceedings{Nair10:ReLU,
+  author = {Nair, Vinod and Hinton, Geoffrey},
+  year = {2010},
+  month = {06},
+  pages = {807-814},
+  title = {Rectified Linear Units Improve Restricted Boltzmann Machines},
+  volume = {27},
+  journal = {Proceedings of ICML}
+}
+
+@article{Bank20:Autoencoders,
+  title={Autoencoders},
+  author={Dor Bank and Noam Koenigstein and Raja Giryes},
+  journal={ArXiv},
+  year={2020},
+  volume={abs/2003.05991}
+}
+
+@inproceedings{Rumelhart86:Autoencoders,
+  title={Learning internal representations by error propagation},
+  author={David E. Rumelhart and Geoffrey E. Hinton and Ronald J. Williams},
+  year={1986}
+}
+
+@misc{Butt20:FrameDifferencing,
+  doi = {10.48550/ARXIV.2012.10708},
+  url = {https://arxiv.org/abs/2012.10708},
+  author = {Butt, Waqqas-ur-Rehman and Servin, Martin},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Static object detection and segmentation in videos based on dual foregrounds difference with noise filtering},
+  publisher = {arXiv},
+  year = {2020},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
+@article{Tekeli19:EliminationOfUselessImages,
+  author = {Tekeli, Ula{\c{s}} and Bastanlar, Yalin},
+  year = {2019},
+  month = {07},
+  pages = {2395-2411},
+  title = {Elimination of useless images from raw camera-trap data},
+  volume = {27},
+  journal = {Turkish Journal of Electrical Engineering \& Computer Sciences},
+  doi = {10.3906/elk-1808-130}
+}
+
+@article{Hinton06:Autoencoders,
+author = {G. E. Hinton  and R. R. Salakhutdinov },
+title = {Reducing the Dimensionality of Data with Neural Networks},
+journal = {Science},
+volume = {313},
+number = {5786},
+pages = {504-507},
+year = {2006},
+doi = {10.1126/science.1127647},
+URL = {https://www.science.org/doi/abs/10.1126/science.1127647},
+eprint = {https://www.science.org/doi/pdf/10.1126/science.1127647},
+abstract = {High-dimensional data can be converted to low-dimensional codes by training a multilayer neural network with a small central layer to reconstruct high-dimensional input vectors. Gradient descent can be used for fine-tuning the weights in such “autoencoder” networks, but this works well only if the initial weights are close to a good solution. We describe an effective way of initializing the weights that allows deep autoencoder networks to learn low-dimensional codes that work much better than principal components analysis as a tool to reduce the dimensionality of data.}}
+
+@inproceedings{Schoelkopf99:OneClassSVM,
+ author = {Sch\"{o}lkopf, Bernhard and Williamson, Robert C and Smola, Alex and Shawe-Taylor, John and Platt, John},
+ booktitle = {Advances in Neural Information Processing Systems},
+ editor = {S. Solla and T. Leen and K. M\"{u}ller},
+ pages = {},
+ publisher = {MIT Press},
+ title = {Support Vector Method for Novelty Detection},
+ url = {https://proceedings.neurips.cc/paper/1999/file/8725fb777f25776ffa9076e44fcfd776-Paper.pdf},
+ volume = {12},
+ year = {1999}
+}
+
+@misc{Yang21:OODSurvey,
+  doi = {10.48550/ARXIV.2110.11334},
+  url = {https://arxiv.org/abs/2110.11334},
+  author = {Yang, Jingkang and Zhou, Kaiyang and Li, Yixuan and Liu, Ziwei},
+  keywords = {Computer Vision and Pattern Recognition (cs.CV), Artificial Intelligence (cs.AI), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Generalized Out-of-Distribution Detection: A Survey},
+  publisher = {arXiv},
+  year = {2021},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
+@book{Silverman86:DensityEstimation,
+  title={Density estimation for statistics and data analysis},
+  author={Silverman, Bernard W},
+  year={1986},
+  publisher={Chapman and Hall}
+}
+
+@article{Pimentel14:NoveltyDetection,
+  title = {A review of novelty detection},
+  journal = {Signal Processing},
+  volume = {99},
+  pages = {215-249},
+  year = {2014},
+  issn = {0165-1684},
+  doi = {https://doi.org/10.1016/j.sigpro.2013.12.026},
+  url = {https://www.sciencedirect.com/science/article/pii/S016516841300515X},
+  author = {Marco A.F. Pimentel and David A. Clifton and Lei Clifton and Lionel Tarassenko},
+  keywords = {Novelty detection, One-class classification, Machine learning},
+  abstract = {Novelty detection is the task of classifying test data that differ in some respect from the data that are available during training. This may be seen as “one-class classification”, in which a model is constructed to describe “normal” training data. The novelty detection approach is typically used when the quantity of available “abnormal” data is insufficient to construct explicit models for non-normal classes. Application includes inference in datasets from critical systems, where the quantity of available normal data is very large, such that “normality” may be accurately modelled. In this review we aim to provide an updated and structured investigation of novelty detection research papers that have appeared in the machine learning literature during the last decade.}
+}
+
+@article{Lowe04:SIFT,
+author = {Lowe, David},
+year = {2004},
+month = {11},
+pages = {91--110},
+title = {Distinctive Image Features from Scale-Invariant Keypoints},
+volume = {60},
+journal = {International Journal of Computer Vision},
+doi = {10.1023/B:VISI.0000029664.99615.94}
+}
+
+@book{Chavez12:DSIFT,
+  title={Image classification with dense SIFT sampling: an exploration of optimal parameters},
+  author={Chavez, Aaron J},
+  year={2012},
+  publisher={Kansas State University}
+}
+
+@InProceedings{Bosch06:DSIFT1,
+  author="Bosch, Anna and Zisserman, Andrew and Mu{\~{n}}oz, Xavier",
+  editor="Leonardis, Ale{\v{s}} and Bischof, Horst and Pinz, Axel",
+  title="Scene Classification Via pLSA",
+  booktitle="Computer Vision -- ECCV 2006",
+  year="2006",
+  publisher="Springer Berlin Heidelberg",
+  address="Berlin, Heidelberg",
+  pages="517--530",
+  isbn="978-3-540-33839-0"
+}
+
+@INPROCEEDINGS{Bosch07:DSIFT2,
+  author={Bosch, Anna and Zisserman, Andrew and Mu{\~{n}}oz, Xavier},
+  booktitle={2007 IEEE 11th International Conference on Computer Vision},
+  title={Image Classification using Random Forests and Ferns},
+  year={2007},
+  volume={},
+  number={},
+  pages={1-8},
+  doi={10.1109/ICCV.2007.4409066}
+}
+
+@inproceedings{Tuytelaars10:DenseInterestPoints,
+author = {Tuytelaars, Tinne},
+year = {2010},
+month = {06},
+pages = {2281-2288},
+title = {Dense Interest Points},
+doi = {10.1109/CVPR.2010.5539911}
+}
+
+@article{Ng11:SparseAutoencoder,
+  title={Sparse autoencoder},
+  author={Ng, Andrew and others},
+  journal={CS294A Lecture notes},
+  volume={72},
+  number={2011},
+  pages={1--19},
+  year={2011},
+  url={https://web.stanford.edu/class/cs294a/sparseAutoencoder.pdf}
+}
+
+@article{Majnik13:ROCAnalysis,
+author = {Majnik, Matjaž and Bosnic, Zoran},
+year = {2013},
+month = {05},
+pages = {531-558},
+title = {ROC analysis of classifiers in machine learning: A survey},
+volume = {17},
+journal = {Intelligent Data Analysis},
+doi = {10.3233/IDA-130592}
+}
+
+@article{Flach05:ROCCurves,
+author = {Flach, Peter and Wu, Shaomin},
+year = {2005},
+month = {01},
+pages = {},
+title = {Repairing Concavities in ROC Curves},
+journal = {Reading}
+}
+
+@article{Hodge04:OutlierDetectionSurvey,
+author = {Hodge, Victoria and Austin, Jim},
+year = {2004},
+month = {10},
+pages = {85-126},
+title = {A Survey of Outlier Detection Methodologies},
+volume = {22},
+journal = {Artificial Intelligence Review},
+doi = {10.1023/B:AIRE.0000045502.10941.a9}
+}
+
+% Already existing solutions
+
+@article{Norouzzadeh18:Solution1,
+author = {Mohammad Sadegh Norouzzadeh  and Anh Nguyen  and Margaret Kosmala  and Alexandra Swanson  and Meredith S. Palmer  and Craig Packer  and Jeff Clune },
+title = {Automatically identifying, counting, and describing wild animals in camera-trap images with deep learning},
+journal = {Proceedings of the National Academy of Sciences},
+volume = {115},
+number = {25},
+pages = {E5716-E5725},
+year = {2018},
+doi = {10.1073/pnas.1719367115},
+URL = {https://www.pnas.org/doi/abs/10.1073/pnas.1719367115},
+eprint = {https://www.pnas.org/doi/pdf/10.1073/pnas.1719367115}}
+
+@article{Willi19:Solution2,
+author = {Willi, Marco and Pitman, Ross T. and Cardoso, Anabelle W. and Locke, Christina and Swanson, Alexandra and Boyer, Amy and Veldthuis, Marten and Fortson, Lucy},
+title = {Identifying animal species in camera trap images using deep learning and citizen science},
+journal = {Methods in Ecology and Evolution},
+volume = {10},
+number = {1},
+pages = {80-91},
+keywords = {animal identification, camera trap, citizen science, convolutional neural networks, deep learning, machine learning},
+doi = {https://doi.org/10.1111/2041-210X.13099},
+url = {https://besjournals.onlinelibrary.wiley.com/doi/abs/10.1111/2041-210X.13099},
+eprint = {https://besjournals.onlinelibrary.wiley.com/doi/pdf/10.1111/2041-210X.13099},
+year = {2019}
+}
+
+@article{Yang21:SolutionEnsemble,
+author = {Yang, Deng-Qi and Tan, Kun and Huang, Zhi-Pang and Li, Xiao-Wei and Chen, Ben-Hui and Ren, Guo-Peng and Xiao, Wen},
+title = {An automatic method for removing empty camera trap images using ensemble learning},
+journal = {Ecology and Evolution},
+volume = {11},
+number = {12},
+pages = {7591-7601},
+keywords = {artificial intelligence, camera trap images, convolutional neural networks, deep learning, ensemble learning},
+doi = {https://doi.org/10.1002/ece3.7591},
+url = {https://onlinelibrary.wiley.com/doi/abs/10.1002/ece3.7591},
+eprint = {https://onlinelibrary.wiley.com/doi/pdf/10.1002/ece3.7591},
+year = {2021}
+}
+
+% Experiments
+
+@misc{Kingma14:Adam,
+  doi = {10.48550/ARXIV.1412.6980},
+  url = {https://arxiv.org/abs/1412.6980},
+  author = {Kingma, Diederik P. and Ba, Jimmy},
+  keywords = {Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Adam: A Method for Stochastic Optimization},
+  publisher = {arXiv},
+  year = {2014},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
+@conference{Kluyver16:Jupyter,
+Title = {Jupyter Notebooks -- a publishing format for reproducible computational workflows},
+Author = {Thomas Kluyver and Benjamin Ragan-Kelley and Fernando P{\'e}rez and Brian Granger and Matthias Bussonnier and Jonathan Frederic and Kyle Kelley and Jessica Hamrick and Jason Grout and Sylvain Corlay and Paul Ivanov and Dami{\'a}n Avila and Safia Abdalla and Carol Willing},
+Booktitle = {Positioning and Power in Academic Publishing: Players, Agents and Agendas},
+Editor = {F. Loizides and B. Schmidt},
+Organization = {IOS Press},
+Pages = {87 - 90},
+Year = {2016}
+}
+
+@Article{Harris20:NumPy,
+ title         = {Array programming with {NumPy}},
+ author        = {Charles R. Harris and K. Jarrod Millman and St{\'{e}}fan J.
+                 van der Walt and Ralf Gommers and Pauli Virtanen and David
+                 Cournapeau and Eric Wieser and Julian Taylor and Sebastian
+                 Berg and Nathaniel J. Smith and Robert Kern and Matti Picus
+                 and Stephan Hoyer and Marten H. van Kerkwijk and Matthew
+                 Brett and Allan Haldane and Jaime Fern{\'{a}}ndez del
+                 R{\'{i}}o and Mark Wiebe and Pearu Peterson and Pierre
+                 G{\'{e}}rard-Marchant and Kevin Sheppard and Tyler Reddy and
+                 Warren Weckesser and Hameer Abbasi and Christoph Gohlke and
+                 Travis E. Oliphant},
+ year          = {2020},
+ month         = sep,
+ journal       = {Nature},
+ volume        = {585},
+ number        = {7825},
+ pages         = {357--362},
+ doi           = {10.1038/s41586-020-2649-2},
+ publisher     = {Springer Science and Business Media {LLC}},
+ url           = {https://doi.org/10.1038/s41586-020-2649-2}
+}
+
+@article{Bradski00:OpenCV,
+    author = {Bradski, G.},
+    citeulike-article-id = {2236121},
+    journal = {Dr. Dobb's Journal of Software Tools},
+    keywords = {bibtex-import},
+    posted-at = {2008-01-15 19:21:54},
+    priority = {4},
+    title = {{The OpenCV Library}},
+    year = {2000}
+}
+
+@article{Pedregosa11:scikit-learn,
+ title={Scikit-learn: Machine Learning in {P}ython},
+ author={Pedregosa, F. and Varoquaux, G. and Gramfort, A. and Michel, V.
+         and Thirion, B. and Grisel, O. and Blondel, M. and Prettenhofer, P.
+         and Weiss, R. and Dubourg, V. and Vanderplas, J. and Passos, A. and
+         Cournapeau, D. and Brucher, M. and Perrot, M. and Duchesnay, E.},
+ journal={Journal of Machine Learning Research},
+ volume={12},
+ pages={2825--2830},
+ year={2011}
+}
+
+@Article{Hunter07:Matplotlib,
+  Author    = {Hunter, J. D.},
+  Title     = {Matplotlib: A 2D graphics environment},
+  Journal   = {Computing in Science \& Engineering},
+  Volume    = {9},
+  Number    = {3},
+  Pages     = {90--95},
+  abstract  = {Matplotlib is a 2D graphics package used for Python for
+  application development, interactive scripting, and publication-quality
+  image generation across user interfaces and operating systems.},
+  publisher = {IEEE COMPUTER SOC},
+  doi       = {10.1109/MCSE.2007.55},
+  year      = 2007
+}
+
+@incollection{Paszke19:PyTorch,
+title = {PyTorch: An Imperative Style, High-Performance Deep Learning Library},
+author = {Paszke, Adam and Gross, Sam and Massa, Francisco and Lerer, Adam and Bradbury, James and Chanan, Gregory and Killeen, Trevor and Lin, Zeming and Gimelshein, Natalia and Antiga, Luca and Desmaison, Alban and Kopf, Andreas and Yang, Edward and DeVito, Zachary and Raison, Martin and Tejani, Alykhan and Chilamkurthy, Sasank and Steiner, Benoit and Fang, Lu and Bai, Junjie and Chintala, Soumith},
+booktitle = {Advances in Neural Information Processing Systems 32},
+editor = {H. Wallach and H. Larochelle and A. Beygelzimer and F. d Alch\'{e}-Buc and E. Fox and R. Garnett},
+pages = {8024--8035},
+year = {2019},
+publisher = {Curran Associates, Inc.},
+url = {http://papers.neurips.cc/paper/9015-pytorch-an-imperative-style-high-performance-deep-learning-library.pdf}
+}
+
+% Further work
+
+@inproceedings{Haensch14:ColorSpacesForGraphCut,
+author = {H{\"a}nsch, Ronny and Wang, Xi and Hellwich, Olaf},
+year = {2014},
+month = {01},
+pages = {},
+title = {Comparison of different Color Spaces for Image Segmentation using Graph-Cut},
+volume = {1},
+journal = {VISAPP 2014 - Proceedings of the 9th International Conference on Computer Vision Theory and Applications}
+}
+
+@article{Xu18:VAEforAD,
+  author    = {Haowen Xu and
+               Wenxiao Chen and
+               Nengwen Zhao and
+               Zeyan Li and
+               Jiahao Bu and
+               Zhihan Li and
+               Ying Liu and
+               Youjian Zhao and
+               Dan Pei and
+               Yang Feng and
+               Jie Chen and
+               Zhaogang Wang and
+               Honglin Qiao},
+  title     = {Unsupervised Anomaly Detection via Variational Auto-Encoder for Seasonal
+               KPIs in Web Applications},
+  journal   = {CoRR},
+  volume    = {abs/1802.03903},
+  year      = {2018},
+  url       = {http://arxiv.org/abs/1802.03903},
+  eprinttype = {arXiv},
+  eprint    = {1802.03903},
+  timestamp = {Wed, 05 Feb 2020 18:01:26 +0100},
+  biburl    = {https://dblp.org/rec/journals/corr/abs-1802-03903.bib},
+  bibsource = {dblp computer science bibliography, https://dblp.org}
+}
+
+@inproceedings{Krajsic21:VAEforAD,
+author = {Krajsic, Philippe and Franczyk, Bogdan},
+year = {2021},
+month = {02},
+pages = {},
+title = {Variational Autoencoder for Anomaly Detection in Event Data in Online Process Mining},
+doi = {10.5220/0010375905670574}
+}
+
+@inproceedings{Abdel-Hakim06:CSIFT,
+author = {Abdel-Hakim, Alaa and Farag, Aly},
+year = {2006},
+month = {02},
+pages = {1978 - 1983},
+title = {CSIFT: A SIFT descriptor with color invariant characteristics},
+volume = {2},
+isbn = {0-7695-2597-0},
+journal = {Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition},
+doi = {10.1109/CVPR.2006.95}
+}
+
+@misc{Goodfellow14:GANs,
+  doi = {10.48550/ARXIV.1406.2661},
+  url = {https://arxiv.org/abs/1406.2661},
+  author = {Goodfellow, Ian J. and Pouget-Abadie, Jean and Mirza, Mehdi and Xu, Bing and Warde-Farley, David and Ozair, Sherjil and Courville, Aaron and Bengio, Yoshua},
+  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Generative Adversarial Networks},
+  publisher = {arXiv},
+  year = {2014},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}
+
+@misc{Kingma13:VAE,
+  doi = {10.48550/ARXIV.1312.6114},
+  url = {https://arxiv.org/abs/1312.6114},
+  author = {Kingma, Diederik P and Welling, Max},
+  keywords = {Machine Learning (stat.ML), Machine Learning (cs.LG), FOS: Computer and information sciences, FOS: Computer and information sciences},
+  title = {Auto-Encoding Variational Bayes},
+  publisher = {arXiv},
+  year = {2013},
+  copyright = {arXiv.org perpetual, non-exclusive license}
+}

+ 104 - 0
thesis.tex

@@ -0,0 +1,104 @@
+%%%%% Explanation of class options
+%
+% font size: YOU HAVE TO CHOOSE one of the following: 11pt, 12pt
+% language: YOU HAVE TO CHOOSE one of the following: english, german, ngerman
+% predefineddeclaration: if you use option predefineddeclaration, a predefined declaration (always in German!) will be automatically included between titlepage and abstract using \FirstName, \LastName, \ThesisEnd, 
+% 			(1) if you do not use this option, the file "declaration.tex" will be included automatically between titlepage and abstract instead
+%       (2) if you do not agree to make your thesis publicly available, you CAN NOT USE this option, but have to use "declaration.tex" instead
+%       (3) "declaration.tex" needs to be modified manually with respect to name and date!
+%
+% additional options will be passed to the base class "scrbook"
+%
+%%%%% End of explanations
+\documentclass[12pt,english,predefineddeclaration,BCOR=10mm]{customdbvthesis}
+
+\include{./header} % put all the required packages and stuff in file header.tex
+\usepackage{blindtext}
+\include{./macros} % define your own abbreviations,commands, etc. in file macros.tex
+
+% widow and club penalty
+\widowpenalty = 10000
+\clubpenalty = 10000
+\displaywidowpenalty = 10000
+
+\begin{document}
+
+  \pagenumbering{Roman}
+
+  \begin{dbvthesisTitlepageAndDeclaration}
+
+    % Specify the title and a possible subtitle of your thesis, if you do not have a subtitle, comment the Subtitle command out
+    \Title{Anomaly Detection in Camera Trap Images} % mandatory
+    % \Subtitle{It is possible, that the subtitle is also too long for a single line} % optional
+
+    % Specify the type of your thesis, e.g. Diploma Thesis, Student Research Project, Bachelor Thesis, Master Thesis / Diplomarbeit, Studienarbeit, Bachelorarbeit, Masterarbeit 
+    \ThesisType{Bachelor Thesis} % mandatory
+
+    % Specify your first and last name as well as your date of birth and birthplace
+    \FirstName{Felix} % mandatory
+    \LastName{Kleinsteuber} % mandatory
+    \DateOfBirth{10}{12}{2000} % mandatory, format: {DD}{MM}{YYYY} (include leading zeros if necessary)
+    \Birthplace{Halle (Saale)} % mandatory
+
+    % Specify names of your supervisor and advisor(s): you can either split into supervisor and advisor or name everybody in the advisor command. in the last case, comment the supervisor command out
+    \Supervisor{Prof. Dr.-Ing. Joachim Denzler} % optional
+    \Advisor{B. Sc. Daphne Auer, Dr.-Ing. Paul Bodesheim} % mandatory
+
+    % Specify start and end of your thesis
+    \ThesisStart{11}{04}{2022} % mandatory, format: {DD}{MM}{YYYY} (include leading zeros if necessary)
+    \ThesisEnd{25}{08}{2022}   % mandatory, format: {DD}{MM}{YYYY} (include leading zeros if necessary)
+
+    % Specify a second institute, company, etc. if you do not have a second institute or a company, comment the following command out
+    % \SecondInstitute{Carl Zeiss AG \\ 07743 Jena \\ Germany} % optional
+
+    % note that the declaration (always in german!) will be automatically included between titlepage and abstract using \FirstName, \LastName, \ThesisEnd
+    % you can modify the declaration: if you do not use the class option predefineddeclaration, you can edit the file "declaration.tex", which will be included automatically between titlepage and abstract
+
+  \end{dbvthesisTitlepageAndDeclaration}
+
+  \include{./abstract}
+
+  \tableofcontents
+  \cleardoublepage
+  \pagenumbering{arabic}
+
+  \include{./chapters/chap01-introduction/chap01-introduction}
+  %--------------------------------------------------
+  %--------------------------------------------------
+  \include{./chapters/chap02/chap02}
+  %--------------------------------------------------
+  %--------------------------------------------------
+  \include{./chapters/chap03/chap03}
+  %--------------------------------------------------
+  %--------------------------------------------------
+  \include{./chapters/chap04/chap04}
+  %--------------------------------------------------
+  %--------------------------------------------------
+  % include more chapters if you want or have to
+  %--------------------------------------------------
+  %--------------------------------------------------
+  \include{./chapters/conclusions/conclusions}
+  %--------------------------------------------------
+  %--------------------------------------------------
+  \include{./chapters/furtherWork/furtherWork}
+  %--------------------------------------------------
+  %--------------------------------------------------
+
+  \appendix
+
+  % if you do not have appendix sections, comment this include command out
+  % \include{./chapters/appendix/appendix}
+
+  \singlespacing
+
+  \interlinepenalty10000 % so no bib-entry will be separated by a pagebreak
+  \bibliography{thesis}
+  \bibliographystyle{apalike} % change the bib-style if you want to
+
+  \listoffigures
+  \begingroup
+  \let\clearpage\relax
+  \listoftables
+  \endgroup
+
+\end{document}