Browse source

finish adding comments

FlipediFlop 11 months ago
parent
commit
eb8439bbca
4 changed files with 98 additions and 68 deletions
  1. chapters/chap02/chap02.tex (+74 -67)
  2. thesis.bbl (+10 -1)
  3. thesis.bib (+14 -0)
  4. thesis.pdf (BIN)

+ 74 - 67
chapters/chap02/chap02.tex

@@ -323,7 +323,7 @@ The SIR model makes a number of assumptions that are intended to reduce the
 model's overall complexity while simultaneously increasing its divergence from
 actual reality. One such assumption is that the size of the population, $N$,
 remains constant, as the daily change is negligible relative to the total population.
-This depiction is not an accurate representation of the actual relations
+This depiction is not an accurate representation of the actual relations \todo{other assumptions in a bad light?}
 observed in the real world, as the size of a population is subject to a number
 of factors that can contribute to change. The population is increased by the
 occurrence of births and decreased by the occurrence of deaths. Other examples
@@ -365,19 +365,20 @@ represents the number of susceptible individuals, that one infectious individual
 infects at the onset of the pandemic. In light of the effects of $\beta$ and
 $\alpha$ (see~\Cref{sec:pandemicModel:sir}), $\RO > 1$ indicates that the
 pandemic is emerging. In this scenario $\alpha$ is relatively low due to the
-limited number of infections resulting from $I(t_0) << S(t_0)$. When $\RO < 1$,
-the disease is spreading rapidly across the population, with an increase in $I$
-occurring at a high rate. Nevertheless, $\RO$ does not cover the entire time
-span. For this reason, Millevoi \etal~\cite{Millevoi2023} introduce $\Rt$
-which has the same interpretation as $\RO$, with the exception that $\Rt$ is
-dependent on time. The definition of the time-dependent reproduction number on
-the time interval $\mathcal{T}$ with the population size $N$,
+limited number of infections resulting from $I(t_0) \ll S(t_0)$.\\ Conversely,
+$\RO < 1$ indicates that the disease is receding, with $I$ decreasing over
+time. Nevertheless, $\RO$ does not cover
+the entire time span. For this reason, Millevoi \etal~\cite{Millevoi2023}
+introduce $\Rt$ which has the same interpretation as $\RO$, with the exception
+that $\Rt$ is dependent on time. The time-dependent reproduction number is
+defined as,
 \begin{equation}\label{eq:repr_num}
-  \Rt=\frac{\beta(t)}{\alpha(t)}\cdot\frac{S(t)}{N}
+  \Rt=\frac{\beta(t)}{\alpha(t)}\cdot\frac{S(t)}{N},
 \end{equation}
-includes the rates of change for information about the spread of the disease and
-information of the decrease of the ratio of susceptible individuals in the
-population. In contrast to $\beta$ and $\alpha$, $\Rt$ is not a parameter but
+on the time interval $\mathcal{T}$. This definition combines the transition
+rates, which carry information about the spread of the disease, with the
+decreasing ratio of susceptible individuals in the population. In contrast
+to $\beta$ and $\alpha$, $\Rt$ is not a parameter but \todo{Sai comment - earlier?}
a state variable of the model, which enables the following reduction of the SIR
model.\\
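
To make the definition concrete, a minimal numeric sketch in Python (all values are hypothetical, chosen only for illustration):

N = 1_000_000              # total population (assumed)
beta, alpha = 0.30, 0.10   # transmission and recovery rate at time t (assumed)
S = 900_000                # susceptible individuals at time t (assumed)

R_t = (beta / alpha) * (S / N)   # the definition above
print(R_t)                 # 2.7 -> R_t > 1, the pandemic is still emerging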
 
@@ -390,12 +391,12 @@ $S$ and $I$, with the term $R(t)=N-S(t)-I(t)$. Thus,
   \end{split}
 \end{equation}
 is the reduction of~\Cref{eq:sir} on the time interval $\mathcal{T}$ using this
-characteristic and the reproduction number \Rt (see ~\Cref{eq:repr_num}).
+characteristic and the reproduction number $\Rt$ (see~\Cref{eq:repr_num}).
 Another issue that Millevoi \etal~\cite{Millevoi2023} seek to address is the
-extensive range of values that the SIR groups can assume, spanning from $0$ to
-$10^7$. Accordingly, they initially scale the time interval $\mathcal{T}$ using
-its borders to calculate the scaled time $t_s = \frac{t - t_0}{t_f - t_0}\in
-  [0, 1]$. Subsequently, they calculate the scaled groups,
+extensive range of values that the SIR groups can assume. Accordingly, they
+initially scale the time interval $\mathcal{T}$ using its endpoints to calculate
+the scaled time $t_s = \frac{t - t_0}{t_f - t_0}\in[0, 1]$. Subsequently, they
+calculate the scaled groups,
 \begin{equation}
   S_s(t_s) = \frac{S(t)}{C},\quad I_s(t_s) = \frac{I(t)}{C},\quad R_s(t_s) = \frac{R(t)}{C},
 \end{equation}
@@ -404,11 +405,11 @@ variable $I$, results in,
 \begin{equation}
   \frac{dI_s}{dt_s} = \alpha(t_f - t_0)(\Rt - 1)I_s(t_s),
 \end{equation}
-a further reduced version of~\Cref{eq:sir} results in a more streamlined and
-efficient process, as it entails the elimination of a parameter($\beta$) and two
-state variables ($S$ and $R$), while adding the state variable $\Rt$. This is a
-crucial aspect for the automated resolution of such differential equation
-systems, as we describe in~\Cref{sec:mlp}.
+which is a further reduced version of~\Cref{eq:sir}. This simpler
+differential equation is easier to solve, as it eliminates one parameter
+($\beta$) and the two state variables ($S$ and $R$). Due to its fewer input
+variables, the reduced SIR model is also better suited to applications with
+poor data availability.
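
A minimal forward-Euler sketch of this reduced equation in Python (all constants are hypothetical, chosen only to illustrate the mechanics):

import numpy as np

# dI_s/dt_s = alpha * (t_f - t_0) * (R_t - 1) * I_s  on t_s in [0, 1]
alpha, t0, tf = 0.1, 0.0, 100.0
n = 1000
ts = np.linspace(0.0, 1.0, n)
dt = ts[1] - ts[0]
Rt = np.linspace(2.5, 0.8, n)   # assumed declining reproduction number
I = np.empty(n)
I[0] = 1e-4                     # scaled initial infections I_s(0)
for k in range(n - 1):
    I[k + 1] = I[k] + dt * alpha * (tf - t0) * (Rt[k] - 1.0) * I[k]
# I_s rises while Rt > 1 and decays once Rt drops below 1.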
 
 % -------------------------------------------------------------------
 
@@ -419,16 +420,17 @@ equations in systems, illustrating how they can be utilized to elucidate the
 impact of a specific parameter on the system's behavior.
 In~\Cref{sec:epidemModel}, we show specific applications of differential
 equations in an epidemiological context. The final objective is to solve these
-equations. For this problem, there are multiple methods to achieve this goal. On
-such method is the \emph{Multilayer Perceptron} (MLP)~\cite{Hornik1989}. In the
-following section, we provide a brief overview of the structure and training of
-these \emph{neural networks}. For reference, we use the book \emph{Deep Learning}
-by Goodfellow \etal~\cite{Goodfellow-et-al-2016} as a foundation for our
-explanations.\\
+equations by finding a function that fits. Approximating such a function by
+fitting it to measured data points is one of several methods to achieve this
+goal. The \emph{Multilayer Perceptron} (MLP)~\cite{Rumelhart1986} is a
+data-driven function approximator. In the following section, we provide a brief
+overview of the structure and training of these \emph{neural networks}. For
+reference, we use the book \emph{Deep Learning} by Goodfellow
+\etal~\cite{Goodfellow-et-al-2016} as a foundation for our explanations.\\
 
 The objective is to develop an approximation method for any function $f^{*}$,
-which could be a mathematical function or a mapping of an input vector to a
-class or category. Let $\boldsymbol{x}$ be the input vector and $\boldsymbol{y}$
+which could be a mathematical function or a mapping of an input vector to the
+desired output. Let $\boldsymbol{x}$ be the input vector and $\boldsymbol{y}$
 the label, class, or result. Then, $\boldsymbol{y} = f^{*}(\boldsymbol{x})$
 is the function to approximate. In the year 1958,
 Rosenblatt~\cite{Rosenblatt1958} proposed the perceptron modeling the concept of
@@ -440,14 +442,15 @@ Papert~\cite{Minsky1972} demonstrate, the perceptron is only capable of
 approximating a specific class of functions. Consequently, there is a necessity
 for an expansion of the perceptron.\\
 
-As Goodfellow \etal proceed, the solution to this issue is to decompose $f$ into
+As Goodfellow \etal~\cite{Goodfellow-et-al-2016} explain, the solution to this issue is to decompose $f$ into
 a chain structure of the form,
 \begin{equation} \label{eq:mlp_char}
   f(\boldsymbol{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x}))).
 \end{equation}
-This converts a perceptron, which has only two layers (an input and an output
-layer), into a multilayer perceptron. Each sub-function, designated as $f^{(i)}$,
-is represented in the structure of an MLP as a \emph{layer}. A multitude of
+This nested version of a perceptron is a multilayer perceptron. Each
+sub-function, designated as $f^{(i)}$, is represented in the structure of an
+MLP as a \emph{layer}, which contains a linear mapping and a nonlinear mapping
+in the form of an \emph{activation function}. A multitude of
 \emph{Units} (also \emph{neurons}) compose each layer. A neuron performs the
 same vector-to-scalar calculation as the perceptron does. Subsequently, a
 nonlinear activation function transforms the scalar output into the activation
@@ -457,27 +460,30 @@ input vector $\boldsymbol{x}$ is provided to each unit of the first layer
 $f^{(1)}$, which then gives the results to the units of the second layer
 $f^{(2)}$, and so forth. The final layer is the \emph{output layer}. The
 intervening layers, situated between the first and the output layers are the
-\emph{hidden layers}. The alternating structure of linear and nonlinear
-calculation enables MLP's to approximate any function. As Hornik
-\etal~\cite{Hornik1989} demonstrate, MLP's are universal approximators.\\
+\emph{hidden layers}. The term \emph{forward propagation} describes the
+process of information flowing through the network from the input layer to
+the output layer, producing the network's prediction. The alternating
+structure of linear and nonlinear calculation enables MLPs to approximate any
+continuous function. As Hornik \etal~\cite{Hornik1989} prove, MLPs are
+universal approximators.\\
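
The layered composition $f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x})))$ can be made concrete with a small numpy sketch (shapes and activations are arbitrary assumptions, not the thesis' configuration):

import numpy as np

rng = np.random.default_rng(0)

def layer(x, W, b, activation=np.tanh):
    # one layer: linear mapping followed by a nonlinear activation
    return activation(W @ x + b)

# arbitrary sizes: 3 inputs, two hidden layers of 5 units, 1 output
W1, b1 = rng.standard_normal((5, 3)), np.zeros(5)
W2, b2 = rng.standard_normal((5, 5)), np.zeros(5)
W3, b3 = rng.standard_normal((1, 5)), np.zeros(1)

x = rng.standard_normal(3)
# forward propagation: f(x) = f3(f2(f1(x))), identity activation on the output
y_hat = layer(layer(layer(x, W1, b1), W2, b2), W3, b3, activation=lambda s: s)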
 
 \begin{figure}[h]
   \centering
   \includegraphics[scale=0.87]{MLP.pdf}
-  \caption{A visualization of the SIR model, illustrating $N$ being split in the
-    three groups $S$, $I$ and $R$.}
+  \caption{An illustration of an MLP with two hidden layers. Each neuron of a layer
+    is connected to every neuron of the neighboring layers. The arrow indicates
+    the direction of the forward propagation.}
   \label{fig:mlp_example}
 \end{figure}
-\todo{caption}
+
 The term \emph{training} describes the process of optimizing the parameters
 $\theta$. In order to undertake training, it is necessary to have a set of
 \emph{training data}, which is a set of pairs (also called training points) of
 the input data $\boldsymbol{x}$ and its corresponding true solution
 $\boldsymbol{y}$ of the function $f^{*}$. For the training process we must
-define the \emph{loss function} $\Loss{ }$, using the model prediction
+define a \emph{loss function} $\Loss{ }$, using the model prediction
 $\hat{\boldsymbol{y}}$ and the true value $\boldsymbol{y}$, which will act as a
 metric for evaluating the extent to which the model deviates from the correct
-answer. One of the most common loss function is the \emph{mean square error}
+answer. One common loss function is the \emph{mean square error}
 (MSE) loss function. Let $N$ be the number of points in the set of training
 data. Then,
 \begin{equation} \label{eq:mse}
@@ -486,41 +492,43 @@ data. Then,
 calculates the squared difference between each model prediction and true value
 of a training point and takes the mean across the whole training data. \\
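
For concreteness, a minimal Python sketch of this loss (the data values are made up):

import numpy as np

def mse(y_hat, y):
    # mean square error over the N training points
    return np.mean((y_hat - y) ** 2)

y     = np.array([1.0, 2.0, 3.0])   # true values (assumed)
y_hat = np.array([1.1, 1.9, 3.2])   # model predictions (assumed)
print(mse(y_hat, y))                # ~0.02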
 
-In the context of neural networks, \emph{forward propagation} describes the
-process of information flowing through the network from the input layer to the
-output layer, resulting in a scalar loss. Ultimately, the objective is to
-utilize this information to optimize the parameters, in order to minimize the
+Ultimately, the objective is to optimize the parameters in order to minimize the
loss. One of the most fundamental optimization strategies is \emph{gradient
  descent}. In this process, the derivatives are employed to identify the location
-of local or global minima within a function. Given that a positive gradient
+of local or global minima within a function, which lie where the gradient is
+zero. Given that a positive gradient
 signifies ascent and a negative gradient indicates descent, we must move the
-variable by a constant \emph{learning rate} (step size) in the opposite
+variable by a \emph{learning rate} (step size) in the opposite
 direction to that of the gradient. The calculation of the derivatives with respect
 to the parameters is a complex task, since our function is a composition of
 many functions (one for each layer). We can address this issue by taking advantage
 of~\Cref{eq:mlp_char} and employing the chain rule of calculus. Let
-$\hat{\boldsymbol{y}} = f(w; \theta)$ be the model prediction with
+$\hat{\boldsymbol{y}} = f(\boldsymbol{x}; \theta)$ be the model prediction with the
+decomposed version $f(\boldsymbol{x}; \theta) = f^{(3)}(w; \theta_3)$ with
 $w = f^{(2)}(z; \theta_2)$ and $z = f^{(1)}(\boldsymbol{x}; \theta_1)$.
-$\boldsymbol{x}$ is the input vector and $\theta_1, \theta_2\subset\theta$.
+$\boldsymbol{x}$ is the input vector and $\theta_3, \theta_2, \theta_1\subset\theta$.
 Then,
 \begin{equation}\label{eq:backprop}
-  \nabla_{\theta_1} \Loss{ } = \frac{d\mathcal{L}}{d\hat{\boldsymbol{y}}}\frac{d\hat{\boldsymbol{y}}}{df^{(2)}}\frac{df^{(2)}}{df^{(1)}}\nabla_{\theta_1}f^{(1)},
+  \nabla_{\theta_3} \Loss{ } = \frac{d\mathcal{L}}{d\hat{\boldsymbol{y}}}\frac{d\hat{\boldsymbol{y}}}{df^{(3)}}\nabla_{\theta_3}f^{(3)},
 \end{equation}
-is the gradient of $\Loss{ }$ in respect of the parameters $\theta_1$. The name
-of this method in the context of neural networks is \emph{back propagation}. \todo{Insert source}\\
+is the gradient of $\Loss{ }$ with respect to the parameters $\theta_3$. To obtain
+$\nabla_{\theta_2} \Loss{ }$, we extend the chain by the factor
+$\frac{df^{(3)}}{df^{(2)}}$ and replace $\nabla_{\theta_3}f^{(3)}$ with
+$\nabla_{\theta_2}f^{(2)}$. The name of this method in the context of neural
+networks is \emph{back propagation}~\cite{Rumelhart1986}, as it propagates the
+error backwards through the neural network.\\
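
A scalar toy version of one back-propagation pass and gradient-descent step, following the decomposition $z = f^{(1)}(x)$, $w = f^{(2)}(z)$, $\hat{y} = f^{(3)}(w)$ (the tanh layers and all values are assumptions for illustration):

import numpy as np

theta1, theta2, theta3, lr = 0.5, -0.3, 0.8, 0.1
x, y = 1.5, 2.0

# forward propagation
z = np.tanh(theta1 * x)        # f1
w = np.tanh(theta2 * z)        # f2
y_hat = theta3 * w             # f3
loss = (y_hat - y) ** 2        # squared-error loss

# backward pass: chain rule from the loss toward the first layer
dL_dyhat = 2 * (y_hat - y)
grad3 = dL_dyhat * w                          # dL/dtheta3
dL_dw = dL_dyhat * theta3
grad2 = dL_dw * (1 - w ** 2) * z              # dL/dtheta2
dL_dz = dL_dw * (1 - w ** 2) * theta2
grad1 = dL_dz * (1 - z ** 2) * x              # dL/dtheta1

# gradient descent: move each parameter against its gradient
theta1, theta2, theta3 = (theta1 - lr * grad1,
                          theta2 - lr * grad2,
                          theta3 - lr * grad3)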
 
 In practical applications, an optimizer often accomplishes the optimization task
-by executing gradient descent in the background. Furthermore, modifying  the
-learning rate during training can be advantageous. For instance, making larger
+by executing back propagation in the background. Furthermore, modifying the
+learning rate during training can be advantageous. For instance, making larger \todo{leave whole paragraph out? - Niklas}
 steps at the beginning and minor adjustments at the end. Schedulers are
 algorithms that implement diverse learning-rate adjustment
 strategies.\\
 
-This section provides an overview of basic concepts of neural networks. For a
-deeper understanding, we direct the reader to the book \emph{Deep Learning} by
-Goodfellow \etal~\cite{Goodfellow-et-al-2016}. The next section will demonstrate
-the application of neural networks in approximating solutions to differential
-systems.
+For a more in-depth discussion of practical considerations and additional
+details like regularization, we direct the reader to the book
+\emph{Deep Learning} by Goodfellow \etal~\cite{Goodfellow-et-al-2016}. The next
+section will demonstrate the application of neural networks in approximating
+solutions to systems of differential equations.
 
 % -------------------------------------------------------------------
 
@@ -537,15 +545,15 @@ differential equations. The \emph{physics-informed neural network} (PINN)
 learns the system of differential equations during training, as it optimizes
 its output to align with the equations.\\
 
-In contrast to standard MLP's, the loss term of a PINN comprises two
-components. The first term incorporates the aforementioned prior knowledge to pertinent the problem. As Raissi
+In contrast to standard MLPs, PINNs are not only data-driven. The loss term of a PINN comprises two
+components. The first term incorporates the equations of the aforementioned prior knowledge pertinent to the problem. As Raissi
 \etal~\cite{Raissi2017} propose, the residual of each differential equation in
 the system must be minimized in order for the model to optimize its output in accordance with the theory.
 We obtain the residual $r_i$, with $i\in\{1, ...,N_d\}$, by rearranging the differential equation and
 calculating the difference between the left-hand side and the right-hand side
 of the equation. $N_d$ is the number of differential equations in a system. As
 Raissi \etal~\cite{Raissi2017} propose the \emph{physics
-  loss} of a PINN,\todo{check source again}
+  loss} of a PINN,
 \begin{equation}
   \mathcal{L}_{physics}(\boldsymbol{x},\hat{\boldsymbol{y}}) = \frac{1}{N_d}\sum_{i=1}^{N_d} ||r_i(\boldsymbol{x},\hat{\boldsymbol{y}})||^2,
 \end{equation}
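
A minimal PyTorch sketch of this physics loss, assuming a toy ODE $du/dt + u = 0$ (residual $r = du/dt + u$) and an arbitrary small network rather than the thesis' model:

import torch

net = torch.nn.Sequential(
    torch.nn.Linear(1, 16), torch.nn.Tanh(), torch.nn.Linear(16, 1))

t = torch.linspace(0.0, 1.0, 50).reshape(-1, 1).requires_grad_(True)
u = net(t)
# derivative of the network output with respect to its input
du_dt = torch.autograd.grad(u, t, torch.ones_like(u), create_graph=True)[0]
r = du_dt + u                          # residual of the toy ODE
physics_loss = torch.mean(r ** 2)      # mean squared residual
physics_loss.backward()                # gradients for the optimizer
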
@@ -560,14 +568,13 @@ denote the number of training points. Then,
 \end{equation}\\
 represents the comprehensive loss function of a physics-informed neural network. \\
 
-\todo{check for correctness}
 Given the nature of residuals, calculating the loss term of
 $\mathcal{L}_{physics}(\boldsymbol{x},\hat{\boldsymbol{y}})$ requires the
 calculation of the derivative of the output with respect to the input of
 the neural network. As we outline in~\Cref{sec:mlp}, during the process of
 back-propagation we calculate the gradients of the loss term with respect to a
 layer-specific set of parameters denoted by $\theta_l$, where $l$ represents
-the index of the \todo{check for consistency} respective layer. By employing
+the index of the respective layer. By employing
 the chain rule of calculus, the algorithm progresses from the output layer
 through each hidden layer, ultimately reaching the first layer in order to
 compute the respective gradients. The term,
@@ -603,7 +610,7 @@ which should ultimately yield an approximation of the true value.\\
   \label{fig:spring}
 \end{figure}
 One illustrative example of a potential application for PINNs is the
-\emph{damped harmonic oscillator}~\cite{Tenenbaum1985}. In this problem, we \todo{check source for wording}
+\emph{damped harmonic oscillator}~\cite{Demtroeder2021}. In this problem, we
 displace a body, which is attached to a spring, from its resting position. The
 body is subject to three forces: firstly, the inertial force resulting from the
 displacement $u$, which points in the direction of the displacement $u$; secondly
@@ -613,7 +620,7 @@ direction of the movement. In accordance with Newton's second law and the
 combined influence of these forces, the body exhibits oscillatory motion around
 its position of rest. The system is influenced by $m$, the mass of the body,
 $\mu$, the coefficient of friction, and $k$, the spring constant, indicating the
-stiffness of the spring. The residual of the differential equation, \todo{check in book}
+stiffness of the spring. The residual of the differential equation,
 \begin{equation}
   m\frac{d^2u}{dx^2}+\mu\frac{du}{dx}+ku=0,
 \end{equation}
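
As a numeric sanity check (illustrative constants, not taken from the source), the known underdamped solution should drive this residual to zero:

import numpy as np

# u(t) = exp(-d t) * cos(w t) with d = mu/(2m), w = sqrt(k/m - d^2)
m, mu, k = 1.0, 0.4, 4.0
d = mu / (2 * m)
w = np.sqrt(k / m - d ** 2)

t = np.linspace(0.0, 10.0, 10_001)
u = np.exp(-d * t) * np.cos(w * t)
du = np.gradient(u, t)                  # finite-difference u'
ddu = np.gradient(du, t)                # finite-difference u''
residual = m * ddu + mu * du + k * u
print(np.max(np.abs(residual[1:-1])))   # ~0 up to discretization error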

+ 10 - 1
thesis.bbl

@@ -1,4 +1,4 @@
-\begin{thebibliography}{HSW89}
+\begin{thebibliography}{RHW86}
 
 % this bibliography is generated by alphadin.bst [8.2] from 2005-12-21
 
@@ -95,6 +95,15 @@
 \newblock ISBN 3--540--63720--6. --
 \newblock Description based on publisher supplied metadata and other sources.
 
+\bibitem[RHW86]{Rumelhart1986}
+\textsc{Rumelhart}, David~E. ; \textsc{Hinton}, Geoffrey~E.  ;
+  \textsc{Williams}, Ronald~J.:
+\newblock Learning representations by back-propagating errors.
+\newblock {In: }\emph{Nature} 323 (1986), Oktober, Nr. 6088, S. 533--536.
+\newblock \url{http://dx.doi.org/10.1038/323533a0}. --
+\newblock DOI 10.1038/323533a0. --
+\newblock ISSN 1476--4687
+
 \bibitem[Ros58]{Rosenblatt1958}
 \textsc{Rosenblatt}, F.:
 \newblock The perceptron: A probabilistic model for information storage and

+ 14 - 0
thesis.bib

@@ -227,4 +227,18 @@
   publisher = {Institute of Electrical and Electronics Engineers (IEEE)},
 }
 
+@Article{Rumelhart1986,
+  author    = {Rumelhart, David E. and Hinton, Geoffrey E. and Williams, Ronald J.},
+  journal   = {Nature},
+  title     = {Learning representations by back-propagating errors},
+  year      = {1986},
+  issn      = {1476-4687},
+  month     = oct,
+  number    = {6088},
+  pages     = {533--536},
+  volume    = {323},
+  doi       = {10.1038/323533a0},
+  publisher = {Springer Science and Business Media LLC},
+}
+
 @Comment{jabref-meta: databaseType:bibtex;}

BIN
thesis.pdf