|
@@ -448,7 +448,7 @@ the label, class, or result. Then, $\boldsymbol{y} = f^{*}(\boldsymbol{x})$,
|
|
|
is the function to approximate. In the year 1958,
|
|
|
Rosenblatt~\cite{Rosenblatt1958} proposed the perceptron modeling the concept of
|
|
|
a neuron in a neuroscientific sense. The perceptron takes in the input vector
|
|
|
-$\boldsymbol{x}$ performs an operation and produces a scalar result. This model
|
|
|
+$\boldsymbol{x}$, performs an operation and produces a scalar result. This model
|
|
|
optimizes its parameters $\theta$ to be able to calculate $\boldsymbol{y} =
|
|
|
f(\boldsymbol{x}; \theta)$ as accurately as possible. As Minsky and
|
|
|
Papert~\cite{Minsky1972} demonstrate, the perceptron is only capable of
|
|
@@ -460,19 +460,19 @@ a chain structure of the form,
|
|
|
\begin{equation} \label{eq:mlp_char}
|
|
|
f(\boldsymbol{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x}))).
|
|
|
\end{equation}
|
|
|
-This nested version of a perceptron is a multilayer perceptron. Each
|
|
|
+This nested version of a perceptron is called a multilayer perceptron. Each
|
|
|
sub-function, designated as $f^{(i)}$, is represented in the structure of an
|
|
|
MLP as a \emph{layer}, which contains a linear mapping and a nonlinear mapping
|
|
|
in the form of an \emph{activation function}. A multitude of
|
|
|
-\emph{Units} (also \emph{neurons}) compose each layer. A neuron performs the
|
|
|
+\emph{units} (also \emph{neurons}) compose each layer. A neuron performs the
|
|
|
same vector-to-scalar calculation as the perceptron does. Subsequently, a
|
|
|
nonlinear activation function transforms the scalar output into the activation
|
|
|
of the unit. The layers are staggered in the neural network, with each layer
|
|
|
-being connected to its neighbors, as illustrated in~\Cref{fig:mlp_example}. The
|
|
|
-input vector $\boldsymbol{x}$ is provided to each unit of the first layer
|
|
|
+being connected to its neighboring layers, as illustrated in~\Cref{fig:mlp_example}. The
|
|
|
+input vector $\boldsymbol{x}$ is provided to each unit of the first layer (input layer)
|
|
|
$f^{(1)}$, which then gives the results to the units of the second layer
|
|
|
$f^{(2)}$, and so forth. The final layer is the \emph{output layer}. The
|
|
|
-intervening layers, situated between the first and the output layers are the
|
|
|
+intervening layers, situated between the input and the output layers, are the
|
|
|
\emph{hidden layers}. The term \emph{forward propagation} describes the
|
|
|
process of information flowing through the network from the input layer to the
|
|
|
output layer, resulting in a scalar loss. The alternating structure of linear
|
|
@@ -503,7 +503,7 @@ data. Then,
|
|
|
\Loss{MSE} = \frac{1}{N}\sum_{i=1}^{N} ||\hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2,
|
|
|
\end{equation}
|
|
|
calculates the squared difference between each model prediction and the true value
|
|
|
-of a training and takes the mean across the whole training data. \\
|
|
|
+of a training data point and takes the mean across the whole training data. \\
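+
+To make forward propagation and the data loss concrete, the following minimal
+sketch implements a small three-layer MLP in the sense of~\Cref{eq:mlp_char}
+together with the mean squared error in NumPy. The layer sizes, the tanh
+activation, and the random weights are illustrative assumptions and not
+prescribed by the text.
+\begin{verbatim}
+# Minimal sketch (illustrative sizes and weights): a three-layer MLP
+# f(x) = f3(f2(f1(x))) and the MSE loss over N training pairs.
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def hidden_layer(x, W, b):
+    # one layer: linear mapping followed by a nonlinear activation
+    return np.tanh(W @ x + b)
+
+W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)   # layer f1
+W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)   # layer f2
+W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)   # output layer f3 (linear)
+
+def f(x):
+    # forward propagation through the chain f3(f2(f1(x)))
+    return W3 @ hidden_layer(hidden_layer(x, W1, b1), W2, b2) + b3
+
+def mse(y_hat, y):
+    # squared difference per training pair, averaged over the data set
+    return np.mean(np.sum((y_hat - y) ** 2, axis=-1))
+
+X = rng.normal(size=(16, 2))             # 16 input vectors
+Y = rng.normal(size=(16, 1))             # their true values
+Y_hat = np.stack([f(x) for x in X])      # model predictions
+print(mse(Y_hat, Y))
+\end{verbatim}
+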
|
|
|
|
|
|
Ultimately, the objective is to utilize this information to optimize the parameters, in order to minimize the
|
|
|
loss. One of the most fundamental and seminal optimization strategies is \emph{gradient
|
|
@@ -513,8 +513,8 @@ zero. Given that a positive gradient
|
|
|
signifies ascent and a negative gradient indicates descent, we must move the
|
|
|
variable by a \emph{learning rate} (step size) in the opposite
|
|
|
direction to that of the gradient. The calculation of the derivatives with respect
|
|
|
-to the parameters is a complex task, since our functions is a composition of
|
|
|
-many functions (one for each layer). We can address this issue taking advantage
|
|
|
+to the parameters is a complex task, since our function is a composition of
|
|
|
+many functions (one for each layer). We can address this issue by taking advantage
|
|
|
of~\Cref{eq:mlp_char} and employing the chain rule of calculus. Let
|
|
|
$\hat{\boldsymbol{y}} = f(\boldsymbol{x}; \theta)$ be the model prediction with the
|
|
|
decomposed version $f(\boldsymbol{x}; \theta) = f^{(3)}(w; \theta_3)$ with
|
|
@@ -527,14 +527,14 @@ Then,
|
|
|
is the gradient of $\Loss{ }$ with respect to the parameters $\theta_3$. To obtain
|
|
|
$\nabla_{\theta_2} \Loss{ }$, we apply the chain rule again and propagate the gradient
|
|
|
one layer further back through $f^{(3)}$ to $\theta_2$. The name of this method in the context of neural
|
|
|
-networks is \emph{back propagation}~\cite{Rumelhart1986}, as it propagates the
|
|
|
+networks is \emph{backpropagation}~\cite{Rumelhart1986}, as it propagates the
|
|
|
error backwards through the neural network.\\
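+
+As a small illustration of the chain rule behind backpropagation, the following
+sketch computes the gradients of a squared-error loss for a two-layer scalar
+composition by hand. The chosen functions, the tanh activation, and all numeric
+values are arbitrary and only serve this example.
+\begin{verbatim}
+# Minimal scalar sketch of backpropagation for y_hat = f2(f1(x)) with
+# f1(x) = tanh(w1 * x), f2(h) = w2 * h and a squared-error loss.
+import numpy as np
+
+x, y = 0.5, 1.0           # one training pair
+w1, w2 = 0.3, -0.8        # parameters theta_1 and theta_2
+
+# forward propagation
+h = np.tanh(w1 * x)       # activation of the hidden layer f1
+y_hat = w2 * h            # output layer f2
+loss = (y_hat - y) ** 2
+
+# backward pass: apply the chain rule from the output layer inwards
+dL_dyhat = 2.0 * (y_hat - y)
+dL_dw2 = dL_dyhat * h                   # gradient w.r.t. theta_2
+dL_dh = dL_dyhat * w2                   # error propagated back to f1
+dL_dw1 = dL_dh * (1.0 - h ** 2) * x     # tanh'(z) = 1 - tanh(z)^2
+\end{verbatim}
+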
|
|
|
|
|
|
In practical applications, an optimizer often accomplishes the optimization task
|
|
|
-by executing back propagation in the background. Furthermore, modifying the
|
|
|
+by executing backpropagation in the background. Furthermore, modifying the
|
|
|
learning rate during training can be advantageous, for instance by taking larger
|
|
|
steps at the beginning and only minor ones at the end. Schedulers
|
|
|
-are implementations algorithms that employ diverse learning rate alteration
|
|
|
+are algorithms that implement such learning rate adjustment
|
|
|
strategies.\\
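+
+The following minimal sketch shows gradient descent combined with a simple
+exponential learning-rate schedule. The toy objective, the initial learning
+rate, and the decay factor are arbitrary illustrative choices and do not
+correspond to any particular optimizer or scheduler implementation.
+\begin{verbatim}
+# Gradient descent on a toy objective with a decaying learning rate.
+def loss(theta):
+    return (theta - 3.0) ** 2        # toy objective, minimum at theta = 3
+
+def grad(theta):
+    return 2.0 * (theta - 3.0)       # analytic gradient of the objective
+
+theta = 0.0                          # initial parameter value
+lr = 0.5                             # initial learning rate (step size)
+decay = 0.95                         # scheduler: shrink the step each epoch
+
+for epoch in range(100):
+    theta -= lr * grad(theta)        # step against the gradient
+    lr *= decay                      # scheduler reduces the learning rate
+
+print(theta)                         # approaches the minimum at 3.0
+\end{verbatim}
+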
|
|
|
|
|
|
For a more in-depth discussion of practical considerations and additional
|
|
@@ -549,7 +549,7 @@ solutions to differential systems.
|
|
|
\label{sec:pinn}
|
|
|
|
|
|
In~\Cref{sec:mlp}, we describe the structure and training of MLPs, which are
|
|
|
-wildely recognized tools for approximating any kind of function. In 1997
|
|
|
+widely recognized tools for approximating any kind of function. In 1997,
|
|
|
Lagaris \etal~\cite{Lagaris1998} provided a method that utilizes gradient
|
|
|
descent to solve ODEs and PDEs. Building on this approach, Raissi
|
|
|
\etal~\cite{Raissi2019} introduced the methodology with the name
|
|
@@ -577,14 +577,14 @@ fitted to the data through the mean square error data loss $\mathcal{L}_{\text{d
|
|
|
Moreover, the data loss function may include additional terms for initial and boundary
|
|
|
conditions. Furthermore, the physics is incorporated through an additional loss
|
|
|
term, the physics loss $\mathcal{L}_{\text{physics}}$, which includes the
|
|
|
-differential equation through its residual $r=\boldsymbol{y} - \mathcal{D}(\boldsymbol{x})$.
|
|
|
+differential equation through its residual $r=\nicefrac{d\boldsymbol{y}}{d\boldsymbol{x}} - \mathcal{D}(\boldsymbol{x})$.
|
|
|
This leads to the PINN loss function,
|
|
|
\begin{align}\label{eq:PINN_loss}
|
|
|
\mathcal{L}_{\text{PINN}}(\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) & = & & \mathcal{L}_{\text{data}} (\boldsymbol{y},\hat{\boldsymbol{y}}) & + & \quad \mathcal{L}_{\text{physics}} (\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) & \\
|
|
|
& = & & \frac{1}{N_t}\sum_{i=1}^{N_t} || \hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2 & + & \quad\frac{1}{N_d}\sum_{i=1}^{N_d} || r_i(\boldsymbol{x},\hat{\boldsymbol{y}})||^2 & ,
|
|
|
\end{align}
|
|
|
with $N_d$ the number of differential equations in a system and $N_t$ the
|
|
|
-number of training samples used for training. Utilizing~\Cref{eq:PINN_loss}, the
|
|
|
+number of training samples. Utilizing $\mathcal{L}_{\text{PINN}}$, the
|
|
|
PINN simultaneously optimizes its parameters $\theta$ to minimize both the data
|
|
|
loss and the physics loss. This makes it a multi-objective optimization problem.\\
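+
+As an illustration of~\Cref{eq:PINN_loss}, the following hedged sketch
+assembles the data loss and the physics loss for the toy ODE
+$\frac{dy}{dx} = -ky$ with the residual $r = \frac{d\hat{y}}{dx} + k\hat{y}$.
+It assumes that PyTorch is available for the network and for automatic
+differentiation; the network size, the constant $k$, and the training and
+collocation points are illustrative choices only.
+\begin{verbatim}
+# Hedged sketch of a PINN loss for the toy ODE dy/dx = -k*y.
+import torch
+
+net = torch.nn.Sequential(               # small MLP, y_hat = f(x; theta)
+    torch.nn.Linear(1, 32), torch.nn.Tanh(),
+    torch.nn.Linear(32, 32), torch.nn.Tanh(),
+    torch.nn.Linear(32, 1))
+
+k = 2.0
+x_data = torch.linspace(0.0, 1.0, 20).reshape(-1, 1)   # measurement points
+y_data = torch.exp(-k * x_data)                        # observed values
+x_col = torch.linspace(0.0, 1.0, 100).reshape(-1, 1).requires_grad_(True)
+
+def pinn_loss():
+    # data loss: mean squared error on the observations
+    loss_data = torch.mean((net(x_data) - y_data) ** 2)
+    # physics loss: squared ODE residual at the collocation points,
+    # with dy/dx obtained by automatic differentiation
+    y_col = net(x_col)
+    dy_dx = torch.autograd.grad(y_col, x_col,
+                                grad_outputs=torch.ones_like(y_col),
+                                create_graph=True)[0]
+    loss_physics = torch.mean((dy_dx + k * y_col) ** 2)
+    return loss_data + loss_physics
+
+optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
+for _ in range(5000):                     # minimize both terms jointly
+    optimizer.zero_grad()
+    pinn_loss().backward()
+    optimizer.step()
+\end{verbatim}
+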
|
|
|
|
|
@@ -592,7 +592,7 @@ Given the nature of differential equations, calculating the loss term of
|
|
|
$\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$ requires the
|
|
|
calculation of the derivative of the output with respect to the input of
|
|
|
the neural network. As we outline in~\Cref{sec:mlp}, during the process of
|
|
|
-back-propagation we calculate the gradients of the loss term in respect to a
|
|
|
+backpropagation, we calculate the gradients of the loss term with respect to a
|
|
|
layer-specific set of parameters denoted by $\theta_l$, where $l$ represents
|
|
|
the index of the respective layer. By employing
|
|
|
the chain rule of calculus, the algorithm progresses from the output layer
|
|
@@ -602,7 +602,7 @@ compute the respective gradients. The term,
|
|
|
\nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} = \frac{d\hat{\boldsymbol{y}}}{df^{(2)}}\frac{df^{(2)}}{df^{(1)}}\nabla_{\boldsymbol{x}}f^{(1)},
|
|
|
\end{equation}
|
|
|
illustrates that, in contrast to the procedure described in~\Cref{eq:backprop},
|
|
|
-this procedure the \emph{automatic differentiation} goes one step further and
|
|
|
+this procedure, called \emph{automatic differentiation}, goes one step further and
|
|
|
calculates the gradient of the output with respect to the input
|
|
|
$\boldsymbol{x}$. In order to calculate the second derivative
|
|
|
$\frac{d^2\hat{\boldsymbol{y}}}{d\boldsymbol{x}^2}=\nabla_{\boldsymbol{x}} (\nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} ),$
|
|
@@ -621,16 +621,9 @@ parameters within the neural network. This enables the network to utilize a
|
|
|
specific value that actively influences the physics loss
|
|
|
$\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$. During the
|
|
|
training phase, the optimizer aims to minimize the physics loss, which should
|
|
|
-ultimately yield an approximation of the true parameter value fitting the
|
|
|
+ultimately yield an approximation of the true parameter value that fits the
|
|
|
observations.\\
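+
+A hedged sketch of this inverse setting is given below: the unknown coefficient
+of a toy ODE $\frac{dy}{dx} = -ky$ is wrapped in a learnable parameter
+$\hat{k}$ and handed to the optimizer together with the network weights. The
+initial guess and the optimizer settings are arbitrary, and PyTorch is assumed
+to be available.
+\begin{verbatim}
+# Treating an unknown physical coefficient as a learnable parameter.
+import torch
+
+net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
+                          torch.nn.Linear(32, 1))
+k_hat = torch.nn.Parameter(torch.tensor(0.5))  # unknown coefficient, initial guess
+
+x_col = torch.linspace(0.0, 1.0, 100).reshape(-1, 1).requires_grad_(True)
+
+def physics_loss():
+    y_col = net(x_col)
+    dy_dx = torch.autograd.grad(y_col, x_col,
+                                grad_outputs=torch.ones_like(y_col),
+                                create_graph=True)[0]
+    # the learnable k_hat enters the residual and receives gradients as well
+    return torch.mean((dy_dx + k_hat * y_col) ** 2)
+
+# network weights and the physical parameter are optimized jointly;
+# a data loss on observations would be added in the same way as before
+optimizer = torch.optim.Adam(list(net.parameters()) + [k_hat], lr=1e-3)
+\end{verbatim}
+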
|
|
|
|
|
|
-\begin{figure}[t]
|
|
|
- \centering
|
|
|
- \includegraphics[width=\textwidth]{oscilator.pdf}
|
|
|
- \caption{Illustration of of the movement of an oscillating body in the
|
|
|
- underdamped case. With $m=1kg$, $\mu=4\frac{Ns}{m}$ and $k=200\frac{N}{m}$.}
|
|
|
- \label{fig:spring}
|
|
|
-\end{figure}
|
|
|
In order to illustrate how a PINN works, we use the example of a
|
|
|
\emph{damped harmonic oscillator} taken from~\cite{Moseley}. In this problem, we
|
|
|
displace a body, which is attached to a spring, from its resting position. The
|
|
@@ -646,7 +639,16 @@ stiffness of the spring. The residual of the differential equation,
|
|
|
\begin{equation}
|
|
|
m\frac{d^2u}{dx^2}+\mu\frac{du}{dx}+ku=0,
|
|
|
\end{equation}
|
|
|
-shows relation of these parameters in reference to the problem at hand. As
|
|
|
+
|
|
|
+\begin{figure}[t]
|
|
|
+ \centering
|
|
|
+ \includegraphics[width=\textwidth]{oscilator.pdf}
|
|
|
+ \caption{Illustration of the movement of an oscillating body in the
|
|
|
+ underdamped case, with $m=1kg$, $\mu=4\frac{Ns}{m}$ and $k=200\frac{N}{m}$.}
|
|
|
+ \label{fig:spring}
|
|
|
+\end{figure}
|
|
|
+
|
|
|
+shows the relation between these parameters for the problem at hand. As
|
|
|
Tenenbaum and Morris~\cite{Tenenbaum1985} show, there are three possible solutions to this
|
|
|
equation. However, only the \emph{underdamped case} results in an oscillating
|
|
|
movement of the body, as illustrated in~\Cref{fig:spring}. In order to apply a
|
|
@@ -664,7 +666,7 @@ not know the value of the friction $\mu$. In this case the loss function,
|
|
|
\end{equation}
|
|
|
includes the boundary conditions, the residual, in which $\hat{\mu}$ is a learnable
|
|
|
parameter and the data loss. By minimizing $\mathcal{L}_{\text{osc}}$ and
|
|
|
-solving the inverse problem the PINN is able to find the missing parameter
|
|
|
+solving the inverse problem, the PINN is able to find the missing parameter
|
|
|
$\mu$. This shows the methodology by which PINNs are capable of learning the
|
|
|
parameters of physical systems, such as the damped harmonic oscillator. In the
|
|
|
following section, we present the approach of Shaier \etal~\cite{Shaier2021} to
|
|
@@ -674,8 +676,8 @@ find the transmission rate and recovery rate of the SIR model using PINNs.
|
|
|
|
|
|
\subsection{Disease-Informed Neural Networks}
|
|
|
\label{sec:pinn:dinn}
|
|
|
-In the preceding section, we present a data-driven methodology, as described by Lagaris
|
|
|
-\etal~\cite{Lagaris1998}, for solving systems of differential equations by employing
|
|
|
+In the preceding section, we present a data-driven methodology, as described by Raissi
|
|
|
+\etal~\cite{Raissi2019}, for solving systems of differential equations by employing
|
|
|
PINNs. In~\Cref{sec:pandemicModel:sir}, we describe the SIR model, which models
|
|
|
the relations of susceptible, infectious and removed individuals and simulates
|
|
|
the progress of a disease in a population with a constant size. A system of
|
|
@@ -695,24 +697,24 @@ would calculate the initial transmission rate using the initial size of the
|
|
|
susceptible group $S_0$ and the infectious group $I_0$. The recovery rate could
|
|
|
then be defined using the number of days $d$ a person spends between the point of
|
|
|
infection and the start of isolation, $\alpha = \frac{1}{d}$. The analytical
|
|
|
-solutions to the SIR models often use heuristic methods and require knowledge
|
|
|
+solutions to the SIR model often use heuristic methods and require prior knowledge
|
|
|
like the sizes $S_0$ and $I_0$. A data-driven approach such as the one that
|
|
|
Shaier \etal~\cite{Shaier2021} propose does not suffer from these problems, since the
|
|
|
model learns the parameters $\alpha$ and $\beta$ while fitting the training
|
|
|
data, which consists of the time points $\boldsymbol{t}$ and the corresponding
|
|
|
-measured sizes of the groups $\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}$.
|
|
|
-Let $\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}$ be the
|
|
|
+measured sizes of the groups $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$.
|
|
|
+Let $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$ be the
|
|
|
model predictions of the groups and
|
|
|
-$r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\beta \hat{\boldsymbol{S}}\hat{\boldsymbol{I}},
|
|
|
- r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\beta \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\alpha \hat{\boldsymbol{I}}$
|
|
|
-and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \alpha \hat{\boldsymbol{I}}$ the
|
|
|
+$r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}},
|
|
|
+ r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\hat{\alpha} \hat{\boldsymbol{I}}$
|
|
|
+and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \hat{\alpha} \hat{\boldsymbol{I}}$ the
|
|
|
residuals of each differential equation using the model predictions. Then,
|
|
|
\begin{equation}
|
|
|
\begin{split}
|
|
|
- \mathcal{L}_{SIR}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
|
|
|
+ \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
|
|
|
+    &+\frac{1}{N_t}\sum_{i=1}^{N_t} \left(||\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}||^2 + ||\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}||^2 + ||\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}||^2\right),
|
|
|
\end{split}
|
|
|
\end{equation}
|
|
|
-is the loss function of a DINN, with $\alpha$ and $\beta$ being learnable
|
|
|
+is the loss function of a DINN, with $\hat{\alpha}$ and $\hat{\beta}$ being learnable
|
|
|
parameters, which appear in the residuals of the ODEs.
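+
+The following sketch assembles this loss with learnable rates $\hat{\alpha}$
+and $\hat{\beta}$. It again assumes PyTorch; the network architecture, the
+initial guesses, and the tensor shapes are illustrative and not taken from
+Shaier \etal~\cite{Shaier2021}.
+\begin{verbatim}
+# Hedged sketch of a DINN loss with learnable alpha_hat and beta_hat.
+import torch
+
+net = torch.nn.Sequential(               # maps a time point t to (S, I, R)
+    torch.nn.Linear(1, 64), torch.nn.Tanh(),
+    torch.nn.Linear(64, 3))
+
+alpha_hat = torch.nn.Parameter(torch.tensor(0.1))  # learnable recovery rate
+beta_hat = torch.nn.Parameter(torch.tensor(0.5))   # learnable transmission rate
+
+def dinn_loss(t, S, I, R):
+    # t, S, I, R: column tensors of shape (N_t, 1) with the observations
+    t = t.detach().clone().requires_grad_(True)
+    S_hat, I_hat, R_hat = net(t).split(1, dim=1)
+
+    def ddt(y):
+        # time derivative of a prediction via automatic differentiation
+        return torch.autograd.grad(y, t, grad_outputs=torch.ones_like(y),
+                                   create_graph=True)[0]
+
+    # residuals of the three ODEs of the SIR model
+    r_S = ddt(S_hat) + beta_hat * S_hat * I_hat
+    r_I = ddt(I_hat) - beta_hat * S_hat * I_hat + alpha_hat * I_hat
+    r_R = ddt(R_hat) - alpha_hat * I_hat
+    loss_physics = (r_S**2).mean() + (r_I**2).mean() + (r_R**2).mean()
+
+    # data loss on the observed group sizes
+    loss_data = ((S_hat - S)**2).mean() + ((I_hat - I)**2).mean() \
+                + ((R_hat - R)**2).mean()
+    return loss_physics + loss_data
+
+# alpha_hat and beta_hat are optimized jointly with the network weights
+optimizer = torch.optim.Adam(list(net.parameters()) + [alpha_hat, beta_hat],
+                             lr=1e-3)
+\end{verbatim}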
|
|
|
% -------------------------------------------------------------------
|