corr no 4

FlipediFlop, 9 months ago
parent commit bd065b68e7
2 changed files with 39 additions and 37 deletions
  1. chapters/chap02/chap02.tex (+39 −37)
  2. thesis.pdf (BIN)

chapters/chap02/chap02.tex (+39 −37)

@@ -448,7 +448,7 @@ the label, class, or result. Then, $\boldsymbol{y} = f^{*}(\boldsymbol{x})$,
 is the function to approximate. In the year 1958,
 Rosenblatt~\cite{Rosenblatt1958} proposed the perceptron, modeling the concept of
 a neuron in a neuroscientific sense. The perceptron takes in the input vector
-$\boldsymbol{x}$ performs an operation and produces a scalar result. This model
+$\boldsymbol{x}$, performs an operation and produces a scalar result. This model
 optimizes its parameters $\theta$ to be able to calculate $\boldsymbol{y} =
   f(\boldsymbol{x}; \theta)$ as accurately as possible. As Minsky and
 Papert~\cite{Minsky1972} demonstrate, the perceptron is only capable of
@@ -460,19 +460,19 @@ a chain structure of the form,
 \begin{equation} \label{eq:mlp_char}
   f(\boldsymbol{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x}))).
 \end{equation}
-This nested version of a perceptron is a multilayer perceptron. Each
+This nested version of a perceptron is called a multilayer perceptron. Each
 sub-function, designated as $f^{(i)}$, is represented in the structure of an
 MLP as a \emph{layer}, which contains a linear mapping and a nonlinear mapping
 in the form of an \emph{activation function}. A multitude of
-\emph{Units} (also \emph{neurons}) compose each layer. A neuron performs the
+\emph{units} (also \emph{neurons}) compose each layer. A neuron performs the
 same vector-to-scalar calculation as the perceptron does. Subsequently, a
 nonlinear activation function transforms the scalar output into the activation
 of the unit. The layers are staggered in the neural network, with each layer
-being connected to its neighbors, as illustrated in~\Cref{fig:mlp_example}. The
-input vector $\boldsymbol{x}$ is provided to each unit of the first layer
+being connected to its neighboring layers, as illustrated in~\Cref{fig:mlp_example}. The
+input vector $\boldsymbol{x}$ is provided to each unit of the first layer (input layer)
 $f^{(1)}$, which then passes its results to the units of the second layer
 $f^{(2)}$, and so forth. The final layer is the \emph{output layer}. The
-intervening layers, situated between the first and the output layers are the
+intervening layers, situated between the input and the output layers are the
 \emph{hidden layers}. The term \emph{forward propagation} describes the
 process of information flowing through the network from the input layer to the
 output layer, resulting in a scalar loss. The alternating structure of linear
@@ -503,7 +503,7 @@ data. Then,
   \Loss{MSE} = \frac{1}{N}\sum_{i=1}^{N} ||\hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2,
 \end{equation}
 calculates the squared difference between each model prediction and true value
-of a training and takes the mean across the whole training data. \\
+of a training data point and takes the mean across the whole training data. \\
 
 
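As an editorial sketch (not the thesis code), forward propagation through a two-layer MLP followed by the MSE loss can be written in a few lines of Python; all weights and training values below are arbitrary illustrative assumptions:

```python
import math

def layer(x, weights, biases):
    # One MLP layer: linear map followed by a tanh activation per unit.
    return [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(weights, biases)]

def forward(x):
    # f(x) = f3(f2(f1(x))): two hidden layers and a linear output unit.
    h1 = layer(x, [[0.5, -0.2], [0.1, 0.4]], [0.0, 0.1])
    h2 = layer(h1, [[0.3, 0.7], [-0.6, 0.2]], [0.05, -0.05])
    return sum(w * h for w, h in zip([1.0, -1.0], h2))  # output layer

def mse(preds, targets):
    # Mean squared error across the whole training data.
    return sum((p - t) ** 2 for p, t in zip(preds, targets)) / len(preds)

xs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]   # made-up training inputs
ys = [0.2, -0.1, 0.3]                       # made-up true values
loss = mse([forward(x) for x in xs], ys)
```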
 Ultimately, the objective is to utilize this information to optimize the parameters in order to minimize the
 loss. One of the most fundamental and seminal optimization strategies is \emph{gradient
@@ -513,8 +513,8 @@ zero. Given that a positive gradient
 signifies ascent and a negative gradient indicates descent, we must move the
 variable by a \emph{learning rate} (step size) in the opposite
 direction to that of the gradient. The calculation of the derivatives with respect
-to the parameters is a complex task, since our functions is a composition of
-many functions (one for each layer). We can address this issue taking advantage
+to the parameters is a complex task, since our function is a composition of
+many functions (one for each layer). We can address this issue by taking advantage
 of~\Cref{eq:mlp_char} and employing the chain rule of calculus. Let
 $\hat{\boldsymbol{y}} = f(\boldsymbol{x}; \theta)$ be the model prediction with the
 decomposed version $f(\boldsymbol{x}; \theta) = f^{(3)}(w; \theta_3)$ with
@@ -527,14 +527,14 @@ Then,
 is the gradient of $\Loss{ }$ with respect to the parameters $\theta_3$. To obtain
 $\nabla_{\theta_2} \Loss{ }$, we have to derive $\nabla_{\theta_3} \Loss{ }$ with
 respect to $\theta_2$. The name of this method in the context of neural
-networks is \emph{back propagation}~\cite{Rumelhart1986}, as it propagates the
+networks is \emph{backpropagation}~\cite{Rumelhart1986}, as it propagates the
 error backwards through the neural network.\\
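The chain-rule updates described here can be sketched for a toy model $f(x;\theta)=\theta_2\tanh(\theta_1 x)$ with a squared-error loss; the initial values and learning rate are illustrative assumptions, not the thesis setup:

```python
import math

# Toy model f(x; theta) = theta2 * tanh(theta1 * x) with squared-error loss.
# Backpropagation applies the chain rule from the output back to the input.

def loss_and_grads(theta1, theta2, x, y):
    a = theta1 * x                   # linear part of the first layer
    h = math.tanh(a)                 # activation of the first layer
    y_hat = theta2 * h               # linear output layer
    loss = (y_hat - y) ** 2
    dL_dyhat = 2 * (y_hat - y)       # dL/d y_hat
    g2 = dL_dyhat * h                # dL/d theta2
    dL_dh = dL_dyhat * theta2        # error propagated backwards
    g1 = dL_dh * (1 - h ** 2) * x    # tanh'(a) = 1 - tanh(a)**2
    return loss, g1, g2

theta1, theta2, lr = 0.5, 0.3, 0.1   # lr is the learning rate (step size)
x, y = 1.2, 0.7
losses = []
for _ in range(200):
    loss, g1, g2 = loss_and_grads(theta1, theta2, x, y)
    losses.append(loss)
    theta1 -= lr * g1                # move against the gradient direction
    theta2 -= lr * g2
```

Each update moves both parameters a small step opposite to their gradients, so the loss shrinks toward zero.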

 In practical applications, an optimizer often accomplishes the optimization task
-by executing back propagation in the background. Furthermore, modifying the
+by executing backpropagation in the background. Furthermore, modifying the
 learning rate during training can be advantageous, for instance making larger
 steps at the beginning and minor adjustments at the end. Therefore, schedulers
-are implementations algorithms that employ diverse learning rate alteration
+are implementations of algorithms that employ diverse learning rate alteration
 strategies.\\
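A scheduler of this kind can be as simple as an exponential decay of the learning rate; the constants below are illustrative assumptions:

```python
def exponential_decay(lr0, decay_rate, step):
    # Larger steps at the beginning of training, smaller adjustments later.
    return lr0 * decay_rate ** step

# Learning rate over the first five steps.
lrs = [exponential_decay(0.1, 0.9, s) for s in range(5)]
```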

 For a more in-depth discussion of practical considerations and additional
@@ -549,7 +549,7 @@ solutions to differential systems.
 \label{sec:pinn}

 In~\Cref{sec:mlp}, we describe the structure and training of MLPs, which are
-wildely recognized tools for approximating any kind of function. In 1997
+widely recognized tools for approximating any kind of function. In 1997,
 Lagaris \etal~\cite{Lagaris1998} provided a method that utilizes gradient
 descent to solve ODEs and PDEs. Building on this approach, Raissi
 \etal~\cite{Raissi2019} introduced the methodology with the name
@@ -577,14 +577,14 @@ fitted to the data through the mean square error data loss $\mathcal{L}_{\text{d
 Moreover, the data loss function may include additional terms for initial and boundary
 conditions. Furthermore, the physics is incorporated through an additional loss
 term, the physics loss $\mathcal{L}_{\text{physics}}$, which includes the
-differential equation through its residual $r=\boldsymbol{y} - \mathcal{D}(\boldsymbol{x})$.
+differential equation through its residual $r=\nicefrac{d\boldsymbol{y}}{d\boldsymbol{x}} - \mathcal{D}(\boldsymbol{x})$.
 This leads to the PINN loss function,
 \begin{align}\label{eq:PINN_loss}
   \mathcal{L}_{\text{PINN}}(\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) & = &  & \mathcal{L}_{\text{data}}         (\boldsymbol{y},\hat{\boldsymbol{y}})               & + & \quad \mathcal{L}_{\text{physics}}     (\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) &   \\
                                                                                  & = &  & \frac{1}{N_t}\sum_{i=1}^{N_t} ||  \hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2 & + & \quad\frac{1}{N_d}\sum_{i=1}^{N_d} ||  r_i(\boldsymbol{x},\hat{\boldsymbol{y}})||^2          & ,
 \end{align}
 with $N_d$ the number of differential equations in a system and $N_t$ the
-number of training samples used for training. Utilizing~\Cref{eq:PINN_loss}, the
+number of training samples used for training. Utilizing $\mathcal{L}_{\text{PINN}}$, the
 PINN simultaneously optimizes its parameters $\theta$ to minimize both the data
 loss and the physics loss. This makes it a multi-objective optimization problem.\\
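A minimal sketch of this combined loss for an assumed toy ODE $\nicefrac{dy}{dx}=\cos(x)$, whose exact solution is $y=\sin(x)$; for self-containedness the derivative in the residual is approximated by a central finite difference rather than the automatic differentiation used in practice:

```python
import math

# Hypothetical one-parameter model y_hat(x) = a * sin(x); the true
# solution of dy/dx = cos(x) corresponds to a = 1.

def y_hat(a, x):
    return a * math.sin(x)

def pinn_loss(a, xs, ys, h=1e-4):
    # L_data: mean squared error against the observations.
    data = sum((y_hat(a, x) - y) ** 2 for x, y in zip(xs, ys)) / len(xs)
    # L_physics: squared ODE residual r = d y_hat/dx - cos(x), with the
    # derivative taken by a central finite difference (stand-in for autodiff).
    phys = sum(((y_hat(a, x + h) - y_hat(a, x - h)) / (2 * h)
                - math.cos(x)) ** 2 for x in xs) / len(xs)
    return data + phys

xs = [0.1 * i for i in range(10)]
ys = [math.sin(x) for x in xs]          # noise-free synthetic training data
```

For the correct parameter $a=1$ both terms vanish; any other value is penalized by data and physics simultaneously, which is exactly the multi-objective character described above.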
@@ -592,7 +592,7 @@ Given the nature of differential equations, calculating the loss term of
 $\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$ requires the
 calculation of the derivative of the output with respect to the input of
 the neural network. As we outline in~\Cref{sec:mlp}, during the process of
-back-propagation we calculate the gradients of the loss term in respect to a
+backpropagation we calculate the gradients of the loss term in respect to a
 layer-specific set of parameters denoted by $\theta_l$, where $l$ represents
 the index of the respective layer. By employing
 the chain rule of calculus, the algorithm progresses from the output layer
@@ -602,7 +602,7 @@ compute the respective gradients. The term,
   \nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} = \frac{d\hat{\boldsymbol{y}}}{df^{(2)}}\frac{df^{(2)}}{df^{(1)}}\nabla_{\boldsymbol{x}}f^{(1)},
 \end{equation}
 illustrates that, in contrast to the procedure described in~\Cref{eq:backprop},
-this procedure the \emph{automatic differentiation} goes one step further and
+this procedure, the \emph{automatic differentiation}, goes one step further and
 calculates the gradient of the output with respect to the input
 $\boldsymbol{x}$. In order to calculate the second derivative
 $\frac{d^2\hat{\boldsymbol{y}}}{d\boldsymbol{x}^2}=\nabla_{\boldsymbol{x}} (\nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} ),$
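The principle of differentiating the output with respect to the input can be sketched with dual numbers, a forward-mode variant of automatic differentiation chosen here only for brevity (the text describes the reverse-mode view); the one-unit "network" is an illustrative assumption:

```python
import math

# Minimal forward-mode automatic differentiation: each Dual value carries
# its derivative with respect to the network input x alongside its value.

class Dual:
    def __init__(self, val, dot):
        self.val, self.dot = val, dot
    def __add__(self, other):
        return Dual(self.val + other.val, self.dot + other.dot)
    def __mul__(self, other):
        # Product rule: (uv)' = u'v + uv'.
        return Dual(self.val * other.val,
                    self.val * other.dot + self.dot * other.val)

def tanh(d):
    t = math.tanh(d.val)
    return Dual(t, (1 - t * t) * d.dot)   # chain rule for tanh

def net(x):
    # y_hat = 0.8 * tanh(1.5 * x): a hypothetical one-unit "network".
    return Dual(0.8, 0.0) * tanh(Dual(1.5, 0.0) * x)

x = Dual(0.3, 1.0)   # seed dx/dx = 1
y = net(x)           # y.val = y_hat(x), y.dot = d y_hat / d x
```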
@@ -621,16 +621,9 @@ parameters within the neural network. This enables the network to utilize a
 specific value that actively influences the physics loss
 $\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$. During the
 training phase, the optimizer aims to minimize the physics loss, which should
-ultimately yield an approximation of the true parameter value fitting the
+ultimately yield an approximation of the true parameter value fitting the
 observations.\\
 
 
-\begin{figure}[t]
-  \centering
-  \includegraphics[width=\textwidth]{oscilator.pdf}
-  \caption{Illustration of of the movement of an oscillating body in the
-    underdamped case. With $m=1kg$, $\mu=4\frac{Ns}{m}$ and $k=200\frac{N}{m}$.}
-  \label{fig:spring}
-\end{figure}
 In order to illustrate the working of a PINN, we use the example of a
 \emph{damped harmonic oscillator} taken from~\cite{Moseley}. In this problem, we
 displace a body, which is attached to a spring, from its resting position. The
@@ -646,7 +639,16 @@ stiffness of the spring. The residual of the differential equation,
 \begin{equation}
   m\frac{d^2u}{dx^2}+\mu\frac{du}{dx}+ku=0,
 \end{equation}
-shows relation of these parameters in reference to the problem at hand. As
+
+\begin{figure}[t]
+  \centering
+  \includegraphics[width=\textwidth]{oscilator.pdf}
+  \caption{Illustration of the movement of an oscillating body in the
+    underdamped case. With $m=1kg$, $\mu=4\frac{Ns}{m}$ and $k=200\frac{N}{m}$.}
+  \label{fig:spring}
+\end{figure}
+
+shows the relation of these parameters in reference to the problem at hand. As
 Tenenbaum and Morris~\cite{Tenenbaum1985} show, there are three potential solutions to this
 issue. However, only the \emph{underdamped case} results in an oscillating
 movement of the body, as illustrated in~\Cref{fig:spring}. In order to apply a
@@ -664,7 +666,7 @@ not know the value of the friction $\mu$. In this case the loss function,
 \end{equation}
 includes the boundary conditions, the residual, in which $\hat{\mu}$ is a learnable
 parameter, and the data loss. By minimizing $\mathcal{L}_{\text{osc}}$ and
-solving the inverse problem the PINN is able to find the missing parameter
+solving the inverse problem, the PINN is able to find the missing parameter
 $\mu$. This shows the methodology by which PINNs are capable of learning the
 parameters of physical systems, such as the damped harmonic oscillator. In the
 following section, we present the approach of Shaier \etal~\cite{Shaier2021} to
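The underdamped case can also be checked numerically: assuming the standard analytic solution with the initial conditions $u(0)=1$, $u'(0)=0$ (an editorial assumption) and the parameter values from the figure caption, the residual of the oscillator ODE should vanish:

```python
import math

# Underdamped oscillator from the example: m*u'' + mu*u' + k*u = 0,
# with m = 1 kg, mu = 4 Ns/m and k = 200 N/m as in the figure caption.
m, mu, k = 1.0, 4.0, 200.0
delta = mu / (2 * m)                    # damping coefficient: 2
omega = math.sqrt(k / m - delta ** 2)   # damped angular frequency: 14

def u(t):
    # Analytic underdamped solution for u(0) = 1 and u'(0) = 0.
    return math.exp(-delta * t) * (math.cos(omega * t)
                                   + (delta / omega) * math.sin(omega * t))

def residual(t, h=1e-4):
    # m*u'' + mu*u' + k*u, derivatives via central finite differences.
    d1 = (u(t + h) - u(t - h)) / (2 * h)
    d2 = (u(t + h) - 2 * u(t) + u(t - h)) / h ** 2
    return m * d2 + mu * d1 + k * u(t)
```

The residual staying near zero for arbitrary times is exactly the quantity a PINN would drive to zero through its physics loss.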
@@ -674,8 +676,8 @@ find the transmission rate and recovery rate of the SIR model using PINNs.

 \subsection{Disease-Informed Neural Networks}
 \label{sec:pinn:dinn}
-In the preceding section, we present a data-driven methodology, as described by Lagaris
-\etal~\cite{Lagaris1998}, for solving systems of differential equations by employing
+In the preceding section, we present a data-driven methodology, as described by Raissi
+\etal~\cite{Raissi2019}, for solving systems of differential equations by employing
 PINNs. In~\Cref{sec:pandemicModel:sir}, we describe the SIR model, which models
 the relations of susceptible, infectious and removed individuals and simulates
 the progress of a disease in a population with a constant size. A system of
@@ -695,24 +697,24 @@ would calculate the initial transmission rate using the initial size of the
 susceptible group $S_0$ and the infectious group $I_0$. The recovery rate, then,
 could be defined using the number of days $d$ a person spends between the point of
 infection and the start of isolation, $\alpha = \frac{1}{d}$. The analytical
-solutions to the SIR models often use heuristic methods and require knowledge
+solutions to the SIR models often use heuristic methods and require prior knowledge
 like the sizes $S_0$ and $I_0$. A data-driven approach such as the one that
 Shaier \etal~\cite{Shaier2021} propose does not suffer from these problems, since the
 model learns the parameters $\alpha$ and $\beta$ while learning the training
 data consisting of the time points $\boldsymbol{t}$ and the corresponding
-measured sizes of the groups $\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}$.
-Let $\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}$ be the
+measured sizes of the groups $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$.
+Let $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$ be the
 model predictions of the groups and
-$r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\beta \hat{\boldsymbol{S}}\hat{\boldsymbol{I}},
-  r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\beta \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\alpha \hat{\boldsymbol{I}}$
-and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \alpha \hat{\boldsymbol{I}}$ the
+$r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}},
+  r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\hat{\alpha} \hat{\boldsymbol{I}}$
+and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \hat{\alpha} \hat{\boldsymbol{I}}$ the
 residuals of each differential equation using the model predictions. Then,
 \begin{equation}
   \begin{split}
-    \mathcal{L}_{SIR}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
+    \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
     + &\frac{1}{N_t}\sum_{i=1}^{N_t} ||\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}||^2 + ||\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}||^2 + ||\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}||^2,
   \end{split}
 \end{equation}
-is the loss function of a DINN, with $\alpha$ and $\beta$ being learnable
+is the loss function of a DINN, with $\hat{\alpha}$ and $\hat{\beta}$ being learnable
 parameters. These are represented in the residuals of the ODEs.
 % -------------------------------------------------------------------
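The residuals and the DINN loss above can be sketched as follows; the finite-difference time derivatives and the toy values are editorial stand-ins, not the implementation of Shaier \etal:

```python
# Sketch of the DINN loss for the SIR model. alpha_hat and beta_hat play
# the role of the learnable parameters; derivatives of the predicted group
# sizes are approximated by central finite differences over time.

def sir_residuals(t, S, I, R, alpha_hat, beta_hat):
    def ddt(y):
        # Central finite-difference derivative at the interior time points.
        return [(y[i + 1] - y[i - 1]) / (t[i + 1] - t[i - 1])
                for i in range(1, len(y) - 1)]
    dS, dI, dR = ddt(S), ddt(I), ddt(R)
    # r_S = dS/dt + beta*S*I, r_I = dI/dt - beta*S*I + alpha*I,
    # r_R = dR/dt - alpha*I, evaluated on the model predictions.
    rS = [dS[i] + beta_hat * S[i + 1] * I[i + 1] for i in range(len(dS))]
    rI = [dI[i] - beta_hat * S[i + 1] * I[i + 1] + alpha_hat * I[i + 1]
          for i in range(len(dI))]
    rR = [dR[i] - alpha_hat * I[i + 1] for i in range(len(dR))]
    return rS, rI, rR

def dinn_loss(t, obs, pred, alpha_hat, beta_hat):
    S, I, R = obs
    Sh, Ih, Rh = pred
    rS, rI, rR = sir_residuals(t, Sh, Ih, Rh, alpha_hat, beta_hat)
    physics = sum(r ** 2 for r in rS + rI + rR)
    data = sum((sh - s) ** 2 + (ih - i) ** 2 + (rh - r) ** 2
               for sh, s, ih, i, rh, r in zip(Sh, S, Ih, I, Rh, R)) / len(t)
    return physics + data
```

Minimizing this sum over alpha_hat, beta_hat and the network parameters fits the predictions to the observed group sizes while forcing them to obey the SIR dynamics.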

thesis.pdf (BIN)