							\title{Literature Notes: Continual Learning}
\author{Julia Boehlke}
\date{\today}

\documentclass[12pt]{article}

\begin{document}
\maketitle


\section{Requirements of CL Literature}
\begin{itemize}
\itemsep0em 
  \item Avoid forgetting (*)
  \item Fixed memory and compute
  \item Enable forward transfer
  \item Enable backward transfer (*)
  \item Do not store examples
\end{itemize}


\section{Survey Papers / Meta Papers}
\paragraph{GDumb: A Simple Approach that Questions Our Progress in Continual Learning} \cite{Prabhu2020GDumbAS}
\begin{itemize}
\itemsep0em 
\item GDumb = Greedy Sampler and Dumb Learner (class-balanced, fixed-size memory buffer; the model is retrained from scratch using the samples in the buffer)
\item{Simplifying Assumptions in CL}
\begin{enumerate}
         \item Disjoint task formulation: during any given period, the data stream provides samples of only one task. Sometimes this assumption also entails that there is only \emph{one} period in which data for a specific task is streamed. This means there is no backward transfer.
         \item Task-incremental (TI-CL): in addition to the disjoint task assumption, the task information (or id) is passed during training and inference (multi-head). In class-incremental continual learning (CI-CL), no such task information is given.
         \item Online CL: the learner may use each sample only once to update parameters (unless it is stored in the buffer). In offline CL, there is unrestricted access to the entire current dataset for training over multiple epochs.
\end{enumerate}
\item Online CL is preferable in situations with a fast data stream.
\item GDumb is found to outperform most methods by a large margin.
\item Table 1 gives a great overview/categorization of methods and assumptions
\item None of the reviewed papers seems to match our assumptions/requirements exactly: non-disjoint, class-incremental, offline.
\end{itemize}

\paragraph{CVPR 2020 Continual Learning in Computer Vision Competition: Approaches,
Results, Current Challenges and Future Directions} \cite{Lomonaco2020CVPR2C}
\begin{itemize}
\itemsep0em 
\item CVPR Continual Learning challenge on the CORe50 dataset including three different tasks: New Instances (NI; 8 batches containing all classes, i.e., focus on backward transfer); Multi-Task New Classes (multi-head); New Instances and Classes (NIC; each batch contains examples of a single class, which may be previously seen or new, i.e., a disjoint setting focused on improving on seen classes with single-head classification)
\item Submissions are evaluated on a weighted sum of scores for accuracy, disk usage, RAM usage, and time.
\item Baselines include naive fine-tuning, rehearsal with growing memory (20 images of each batch stored), and AR1* with latent replay \cite{Pellegrini2020LatentRF} (described below).
\item The winning team uses a replay method for NIC and divides the network outputs by the prior probability of each class to handle class imbalance (Buda et al. 2018).
\item The top-4 solutions employ rehearsal-based techniques.
\item On the NI challenge, the UT\_LG team performs rehearsal training at batch level instead of mini-batch level (for every epoch, one memory batch and the current new batch are concatenated) and introduces a review step (with a lower learning rate) before testing, in which only memory data is used.
\item code available for all submissions 
\end{itemize}
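The imbalance correction used by the winning team can be written as a simple rescaling of the network output. This is only a sketch in notation introduced here, not taken from the report: $p(c \mid x)$ is assumed to be the softmax output for class $c$ and $\pi_c$ the empirical frequency of class $c$ in the training data.
\[
  \tilde{p}(c \mid x) = \frac{p(c \mid x) / \pi_c}{\sum_{c'} p(c' \mid x) / \pi_{c'}}
\]
Classes that were rare during training have a small $\pi_c$, so their scores are scaled up before the final decision $\arg\max_c \tilde{p}(c \mid x)$ is taken.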


\section{Papers}

\subsection{Rehearsal-Based Methods}
Rehearsal methods allow at least some data to be stored and used to \emph{rehearse} previously learned knowledge. This is also known as Experience Replay (ER). When no data can be stored, rehearsal is often performed using generated images (\cite{shin2017continual}), so that previously learned knowledge is stored indirectly.

\paragraph{Online Continual Learning with Maximally Interfered Retrieval} (MIR) \cite{Aljundi2019OnlineCL}
\begin{itemize}
\itemsep0em 
\item CI-CL, online, disjoint, rehearsal-based approach.
\item Sample criterion for the controlled selection of rehearsal samples from the buffer: pick those whose predictions will be most negatively impacted by the foreseen parameter update. Their research question: which samples should be replayed from the previous history when new samples are received?
\item Most negatively impacted = the loss increases most when updating on the new data (estimated for a subset of the buffer data).
\item also applicable to generative replay approaches 
\item With a relatively small total number of classes/tasks (MNIST Split), their approach (ER+MIR) is significantly better (87.6\%) than ER with random sampling (82.1\%). In other scenarios, their approach outperforms by a smaller margin.
\item (I don't really understand why they restrict themselves to a disjoint setting; this should also work in a non-disjoint situation.)
\end{itemize}
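The MIR selection criterion described above can be sketched as follows. The notation is chosen here for illustration: $\theta$ are the current parameters, $\alpha$ the learning rate, $(X_t, Y_t)$ the incoming batch, $f_\theta$ the model and $\ell$ the loss. First, the foreseen (virtual) update is computed,
\[
  \theta^{v} = \theta - \alpha \nabla_{\theta}\, \ell\bigl(f_{\theta}(X_t), Y_t\bigr),
\]
and then each candidate $(x, y)$ from a random subset of the buffer is scored by its loss increase,
\[
  s(x, y) = \ell\bigl(f_{\theta^{v}}(x), y\bigr) - \ell\bigl(f_{\theta}(x), y\bigr).
\]
The candidates with the largest $s(x, y)$ are replayed together with the incoming batch. Note that $\theta^{v}$ is only used for scoring; the actual update is performed afterwards on the combined batch.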

\paragraph{Gradient based sample selection for online continual learning} \cite{Aljundi2019GradientBS}
\begin{itemize}
\itemsep0em 
\item CI-CL, online, non-disjoint; expands the GEM approach to situations where task boundaries are not available.
\item Formulate the replay buffer population problem as a constrained minimization of the solid angle. Use a surrogate objective that maximizes the diversity of samples, using the parameter gradients of samples instead of feature representations.
\item Indirectly addresses the issue of class imbalance.
\item The replay buffer is re-evaluated once a so-called \emph{recent} buffer is full.
\item Also propose a cheap greedy alternative for sample selection with large buffers (removes the overhead of computing gradients for all samples and solving the constrained optimization). Idea: compute a score based on the maximum cosine similarity of the current sample's gradient with the gradients of a randomly selected subset of the buffer. When a new sample arrives, compute its score, randomly select a candidate for replacement (with probability given by the normalized scores), and compare the two scores to decide. When the buffer is large, the constrained optimization is replaced by a soft regularization equivalent to rehearsal.
\item Experiments are performed on low-resolution datasets such as MNIST and CIFAR10.
\item Compared with random, clustering-based, and reservoir buffer population methods, their approaches show merit, especially the greedy approach using rehearsal instead of constrained optimization.
\item code available
\end{itemize}
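The greedy variant can be sketched as follows. The symbols are assumptions made for these notes: $g$ is the parameter gradient of the incoming sample, $G_1, \dots, G_n$ are the gradients of a randomly selected subset of the buffer, and $C_i$ is the score stored with buffer sample $i$. The score of the incoming sample is
\[
  c = \max_{i} \frac{\langle g, G_i \rangle}{\|g\| \, \|G_i\|} + 1 ,
\]
where the offset keeps all scores non-negative. Once the buffer is full, a replacement candidate $i$ is drawn with probability $C_i / \sum_j C_j$, and the new sample takes its place with probability
\[
  \frac{C_i}{C_i + c} .
\]
Samples whose gradients are similar to many others (high score) are therefore more likely to be evicted, which favours a diverse buffer. The exact acceptance rule should be checked against the paper before relying on it.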

\paragraph{Random Sampling with a Reservoir}
\begin{itemize}
\itemsep0em
\item 1985 algorithm designed to sample uniformly from a data stream whose total number of elements is unknown in advance.
\item This algorithm can be used to continuously update a fixed-size buffer with samples from a stream while ensuring that, once the stream ends, every sample has the same probability of (buffer size)/(total stream length) of being in the buffer.
\end{itemize}
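The uniformity claim follows from a short induction. In notation introduced here, let $k$ be the buffer size and $x_t$ the $t$-th element of the stream. The first $k$ elements fill the buffer; for $t > k$, element $x_t$ is stored with probability $k/t$ and then replaces a uniformly chosen buffer slot. Assume every earlier element is in the buffer with probability $k/(t-1)$ before step $t$. Such an element is evicted at step $t$ with probability $(k/t) \cdot (1/k) = 1/t$, so it remains in the buffer with probability
\[
  \frac{k}{t-1} \cdot \Bigl( 1 - \frac{1}{t} \Bigr) = \frac{k}{t-1} \cdot \frac{t-1}{t} = \frac{k}{t} .
\]
The new element is also present with probability $k/t$, so after $n$ elements every element is in the buffer with probability $k/n$, without $n$ having to be known in advance.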

\paragraph{More Is Better: An Analysis of Instance Quantity/Quality Trade-off in Rehearsal-based Continual Learning} \cite{Pelosin2021MoreIB}
\begin{itemize}
\itemsep0em
\item CI-CL, disjoint
\item State that rehearsal-based methods are `emerging as the most effective methodology to tackle CL' and refer to \cite{Knoblauch2020OptimalCL} for theoretical justification (optimal CL would require perfect memory).
\item Investigate several dimensionality reductions (deep encoders, variational autoencoders, random projections). They compare their methods to GDumb (Greedy Sampler and Dumb Learner), which does not use any clever selection strategy for the buffer or training approach.
\item Evaluated on final accuracy with several datasets (MNIST, CIFAR, ImageNet, CORe50). Given a fixed memory size, different numbers of samples can be stored depending on the parameters of the reduction (peak performance is achieved when filling the memory with 8x8 pixel images).
\item Only performed experiments for the disjoint setting, i.e., where the data stream shows each task once during training.
\item code available
\end{itemize}


\paragraph{Latent Replay for Real-Time Continual Learning} \cite{Pellegrini2020LatentRF}
\begin{itemize}
\itemsep0em
\item Store representations from some intermediate layer of the network instead of images in input space to reduce the memory requirement. To keep the stored representations valid, they propose slowed-down learning for the layers below the latent replay layer.
\item `a robot should be able to incrementally improve its object recognition capabilities while being exposed to new instances of both known and completely new classes (denoted as NIC setting - New Instances and Classes)'
\item This paper aims at improving the overall accuracy of non-rehearsal-based methods such as AR1 and CWR \cite{Maltoni2019ContinuousLI} (described below).
\end{itemize}

\subsection{Knowledge Distillation}
This category is based on the distillation loss. Basically, the output of the model for old samples becomes the new desired output when new data is available for updating. Especially in a multi-task/multi-head scenario, the logits on the heads for previously seen data should not change much when a new head is learned. The best-known original introduction of the distillation loss in continual learning was made by \cite{li2017learning}; it does not enable any backward transfer of knowledge and requires task knowledge at inference.

\paragraph{iCaRL: Incremental Classifier and Representation Learning} \cite{rebuffi2017icarl}
\begin{itemize}
\itemsep0em 
\item CI-CL, offline, disjoint and assumes that samples from each task (a batch of classes) are only present at one point in time of the data stream. 
\item Assumes that a fixed-size memory is available to store examples from previous classes.
\item Uses a nearest-mean-of-exemplars classifier (on the learned representations) for inference. At training time, the sample memory buffer and the model parameters are updated. When samples for a new class are available, a new training set is constructed from the new and stored data. The outputs of the current network for all stored images of previous classes are stored, since they are needed for the distillation loss. The model is updated with the cross-entropy loss for samples from the new class, while being encouraged to reproduce the previously stored outputs (distillation loss) for the old samples.
\item When new classes are introduced and weights are added to the network, some samples in the buffer are dropped to make room for samples from the new class. The set of examples for each class is selected based on the current class mean of the feature vectors.
\item Evaluated on the CIFAR100 and ImageNet datasets, showing impressive results compared to previous methods for the disjoint task formulation.
\item (I think the idea of using the distillation loss for previously stored samples could be applicable in a non-disjoint task formulation. However, the distillation loss is designed to preserve previously inferred knowledge in a model and to allow forward transfer. In our situation, backward transfer is one of the most important requirements, which the distillation loss is not designed for. I don't think it would be wise in our scenario to penalize changes in the model outputs for previously seen data, since such changes might be necessary to improve the classification boundaries.)
\end{itemize}
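The training objective and the classifier described above can be sketched as follows. The notation is assumed here: $g_y(x) \in (0,1)$ is the network's sigmoid output for class $y$, $q_i^{y}$ is the output for class $y$ on sample $x_i$ stored before the update, $\mathcal{C}_{\mathrm{new}}$ and $\mathcal{C}_{\mathrm{old}}$ are the sets of new and old classes, $\varphi$ is the feature extractor and $\mu_y$ the mean feature vector of the stored examples of class $y$. The loss is the sum $\ell = \ell_{\mathrm{cls}} + \ell_{\mathrm{dist}}$ of a classification term on the new classes,
\[
  \ell_{\mathrm{cls}} = - \sum_{i} \sum_{y \in \mathcal{C}_{\mathrm{new}}}
    \Bigl[ \delta_{y = y_i} \log g_y(x_i) + \delta_{y \neq y_i} \log\bigl(1 - g_y(x_i)\bigr) \Bigr],
\]
and a distillation term on the old classes,
\[
  \ell_{\mathrm{dist}} = - \sum_{i} \sum_{y \in \mathcal{C}_{\mathrm{old}}}
    \Bigl[ q_i^{y} \log g_y(x_i) + \bigl(1 - q_i^{y}\bigr) \log\bigl(1 - g_y(x_i)\bigr) \Bigr].
\]
At inference time, a sample is assigned to the class with the nearest mean in feature space:
\[
  y^{*} = \arg\min_{y} \bigl\| \varphi(x) - \mu_y \bigr\| .
\]
The distillation term makes the concern in the last item explicit: it pulls $g_y(x_i)$ towards the stored $q_i^{y}$, regardless of whether the old output was correct.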

\subsection{Regularization Approaches}
The basic idea behind regularization-based approaches is to penalize a model for changing \emph{too much} on newly seen data and to find a sensible trade-off between plasticity and stability of the network over time.
Most influential in this category is the Elastic Weight Consolidation (EWC) approach proposed by \cite{kirkpatrick2017overcoming}. Each parameter's importance for the classification of previous tasks is estimated using the Fisher information (related to the curvature of the loss function). Updates to \emph{important} parameters are penalized proportionally in the loss function when new tasks are learned. This approach is designed for task-incremental learning and does not allow backward transfer of knowledge.
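
For two tasks, the EWC penalty can be written out as follows. The symbols are chosen here for illustration: $\theta^{*}_{A}$ are the parameters after training on the previous task $A$, $L_B$ is the loss on the new task $B$, $F_i$ is the $i$-th diagonal entry of the Fisher information matrix estimated on task $A$, and $\lambda$ weights the old task against the new one.
\[
  L(\theta) = L_B(\theta) + \frac{\lambda}{2} \sum_{i} F_i \, \bigl( \theta_i - \theta^{*}_{A,i} \bigr)^2
\]
Parameters with a large $F_i$ are kept close to their old values, while parameters with $F_i \approx 0$ remain free to adapt to task $B$.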


\paragraph{Riemannian Walk for Incremental Learning: Understanding Forgetting and Intransigence} 
\cite{chaudhry2018riemannian}
\begin{itemize}
\itemsep0em 
\item CI-CL, offline, disjoint
\item RWalk is a generalization of EWC to the CI setting. They use KL-divergence-based regularization over the conditional likelihood $p(y|x)$ and a parameter importance score based on the sensitivity of the loss to movement on the Riemannian manifold (induced by the Fisher information) to mitigate catastrophic forgetting. By accumulating parameter importance over the entire training trajectory, their approach allows class-incremental learning.
\item Define task-wise measures for forgetting (difference between maximum knowledge and current knowledge) and intransigence, the inability of a network to learn new tasks (difference between a model trained on the entire data and the incrementally learned model trained up to a specific task).
\item They show that for a small number of samples their approach has a much greater impact than when large datasets are available.
\item Suggest entropy-based sampling for creating the buffer of old examples: samples for which the output of the neural network has a higher entropy are more likely to be picked.
\item (while this approach allows for single-head classification, it still heavily relies on the disjoint dataset assumption. The basic idea is still that specific parameters have more importance for specific tasks and updating them when training new tasks should be avoided/reduced. For our application goals, this regularized loss could be used for a brief duration of the training when a new class is introduced. The first task would be defined as all previously known classes and the second task would consist of one new class only. This could be used in a strategy to focus learning of the new class while mitigating forgetting of the previous classes)
\end{itemize}
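The two task-wise measures can be written out as follows. This is a sketch in notation assumed here: $a_{k,j}$ is the accuracy on task $j$ after training incrementally up to task $k$, and $a^{*}_{k}$ is the accuracy on task $k$ of a reference model trained on the data of all tasks up to $k$ at once. Forgetting of task $j$ after task $k$ is the gap between the best accuracy ever reached on task $j$ and the current one,
\[
  f_j^{k} = \max_{l \in \{1, \dots, k-1\}} a_{l,j} - a_{k,j}, \qquad j < k ,
\]
and intransigence on task $k$ is the gap to the reference model,
\[
  I_k = a^{*}_{k} - a_{k,k} .
\]
A strongly regularized model typically has low forgetting but high intransigence, which is why both measures are reported together.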

\paragraph{Gradient Episodic Memory for Continual Learning} (GEM) \cite{lopez2017gradient}
\begin{itemize}
\itemsep0em 
\item TI-CL, online, rehearsal (+regularization) based approach
\item They introduce metrics for evaluating backward and forward transfer
\item No assumptions on the number of tasks are made. 
\item Use a memory buffer to constrain updates when training on new tasks
\item Constraint: the gradient of each past task (estimated with the memory) has a non-negative dot product with the gradient from the batch (of the new task).
\item Disadvantage: slow optimization with the constraint, and task-incremental
\end{itemize}
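The constraint and the two transfer metrics can be written out as follows. The notation is assumed here: $g$ is the gradient on the current batch of task $t$, $g_k$ the gradient on the memory of past task $k$, $R_{i,j}$ the test accuracy on task $j$ after training up to task $i$, $b_i$ the test accuracy of a randomly initialized model on task $i$, and $T$ the total number of tasks. The update is accepted unchanged if
\[
  \langle g, g_k \rangle \ge 0 \quad \mathrm{for\ all}\ k < t .
\]
Otherwise $g$ is replaced by the closest vector (in the Euclidean norm) that satisfies all constraints, which requires solving a quadratic program and explains the slow optimization. The transfer metrics are
\[
  \mathrm{BWT} = \frac{1}{T-1} \sum_{i=1}^{T-1} \bigl( R_{T,i} - R_{i,i} \bigr), \qquad
  \mathrm{FWT} = \frac{1}{T-1} \sum_{i=2}^{T} \bigl( R_{i-1,i} - b_i \bigr) .
\]
A negative BWT indicates forgetting, while a positive BWT means that learning later tasks improved performance on earlier ones.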

\paragraph{Continuous Learning in Single-Incremental-Task Scenarios} \cite{Maltoni2019ContinuousLI}
\begin{itemize} 
\itemsep0em
\item CI-CL, disjoint; introduce CWR and AR1 for NC (new class) learning, where each batch can contain new classes, but argue that this could be adapted to NIC (new instances and classes) learning.
\item Main idea: for the final layer, keep one set of consolidated weights used for inference and a set of temporary weights that is reset to 0 for each batch; the temporary weights are used to update the subset of weights in the consolidated weight matrix relevant to the classes seen in the current batch (CWR).
\item While CWR uses fixed representations extracted from a model, AR1 allows end-to-end CL by letting the feature-extraction model be trained simultaneously in a controlled manner using a regularized loss. They use Synaptic Intelligence (a variant of EWC \cite{zenke2017continual}).
\item \cite{Lomonaco2020RehearsalFreeCL} expanded on this approach for the NIC task by updating weights for a class already seen using a weighted sum of past and current weights for the consolidation step
\item The results of this approach were further improved using the latent replay method \cite{Pellegrini2020LatentRF}.
\item \cite{Lomonaco2020RehearsalFreeCL} also provides a benchmark protocol for the CORe50 dataset on GitHub for a NIC task

\end{itemize}
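The consolidation step with a weighted sum can be sketched generically. All symbols are hypothetical and introduced only for illustration: $\mathit{cw}_j$ are the consolidated output weights of class $j$, $\mathit{tw}_j$ the temporary weights of class $j$ after training on the current batch, and $\omega_j \ge 0$ a weight that grows with the amount of data already seen for class $j$.
\[
  \mathit{cw}_j \leftarrow \frac{\omega_j \, \mathit{cw}_j + \mathit{tw}_j}{\omega_j + 1}
\]
For a class seen for the first time, $\omega_j = 0$ and the rule reduces to copying the temporary weights, as in the NC case. The exact definition of $\omega_j$ and any normalization of $\mathit{tw}_j$ are not reproduced here and should be taken from the papers.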


\subsection{Parameter Isolation}
Generally, this approach to continual learning was originally designed for, and generally relies on, task-incremental learning. The main idea is to generate binary masks over the parameters for each task, indicating their importance for that task. Subsequent tasks are learned using only the leftover parameters of the network.

\paragraph{Conditional Channel Gated Networks for Task-Aware Continual Learning} 
\begin{itemize}
\itemsep0em 
\item CI-CL, offline, disjoint (assumes the stream produces samples of one task for a period of time, but no task information is provided during inference).
\item Original parameter isolation methods are not designed for class-incremental learning. This paper tries to generalize the formulation to class-incremental learning scenarios using some rehearsal.
\item main idea: jointly predict task and class label.
\item Use a gating module for each convolutional layer that decides, based on the input feature, which kernels in the layer should be applied (binary decision). The gating module consists of a very shallow neural network trained with a sparsity objective such that the smallest possible number of kernels is applied. After the training of a task, the most important parameters are frozen, i.e., their gradients are zeroed out during updates for subsequent tasks.
\item (I don't see the advantage of parameter isolation methods for class-incremental learning. This approach practically splits the network into subsets for each task.)
\end{itemize}
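The gating mechanism can be sketched for a single convolutional layer. The notation is an assumption of these notes, not the paper's: $h$ is the input feature map, $C$ the number of kernels (output channels), $G$ the shallow gating network, and $\mathrm{pool}$ a global average pooling over the spatial dimensions.
\[
  g = \mathbf{1}\bigl[ G(\mathrm{pool}(h)) > 0 \bigr] \in \{0,1\}^{C}, \qquad
  h' = g \odot \mathrm{conv}(h)
\]
Here $\odot$ multiplies each output channel by its gate, so channels with $g_c = 0$ are switched off. A sparsity objective of the form $\lambda \, \|g\|_1 / C$ penalizes the fraction of active kernels. Since the hard threshold is not differentiable, some relaxation is needed during training; the details are left to the paper.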


\subsection{Continual Learning with Focus on Imbalanced Data}

\paragraph{On Handling Class Imbalance in Continual Learning based
Network Intrusion Detection Systems}
\begin{itemize}
\item Application Domain: Anomaly-based Network intrusion detection
\item Class incremental setting
\item Sample replay with class balancing reservoir sampling (CBRS) 
\item Compare with a data augmentation strategy used for inflating small classes in imbalanced datasets
\item The related work section contains a great overview of strategies for handling class imbalance in standard, non-continual settings
\end{itemize}

\bibliographystyle{abbrv}

\bibliography{life_long_leaning}

\end{document}