|
@@ -448,7 +448,7 @@ the label, class, or result. Then, $\boldsymbol{y} = f^{*}(\boldsymbol{x})$,
|
|
|
is the function to approximate. In the year 1958,
|
|
|
Rosenblatt~\cite{Rosenblatt1958} proposed the perceptron modeling the concept of
|
|
|
a neuron in a neuroscientific sense. The perceptron takes in the input vector
|
|
|
-$\boldsymbol{x}$ performs an operation and produces a scalar result. This model
|
|
|
+$\boldsymbol{x}$, performs an operation and produces a scalar result. This model
|
|
|
optimizes its parameters $\theta$ to be able to calculate $\boldsymbol{y} =
|
|
|
f(\boldsymbol{x}; \theta)$ as accurately as possible. As Minsky and
|
|
|
Papert~\cite{Minsky1972} demonstrate, the perceptron is only capable of
|
|
@@ -460,19 +460,19 @@ a chain structure of the form,
|
|
|
\begin{equation} \label{eq:mlp_char}
|
|
|
f(\boldsymbol{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x}))).
|
|
|
\end{equation}
|
|
|
-This nested version of a perceptron is a multilayer perceptron. Each
|
|
|
+This nested version of a perceptron is called a multilayer perceptron. Each
|
|
|
sub-function, designated as $f^{(i)}$, is represented in the structure of an
|
|
|
MLP as a \emph{layer}, which contains a linear mapping and a nonlinear mapping
|
|
|
in the form of an \emph{activation function}. A multitude of
|
|
|
-\emph{Units} (also \emph{neurons}) compose each layer. A neuron performs the
|
|
|
+\emph{units} (also \emph{neurons}) compose each layer. A neuron performs the
|
|
|
same vector-to-scalar calculation as the perceptron does. Subsequently, a
|
|
|
nonlinear activation function transforms the scalar output into the activation
|
|
|
of the unit. The layers are staggered in the neural network, with each layer
|
|
|
-being connected to its neighbors, as illustrated in~\Cref{fig:mlp_example}. The
|
|
|
-input vector $\boldsymbol{x}$ is provided to each unit of the first layer
|
|
|
+being connected to its neighboring layers, as illustrated in~\Cref{fig:mlp_example}. The
|
|
|
+input vector $\boldsymbol{x}$ is provided to each unit of the first layer (input layer)
|
|
|
$f^{(1)}$, which then gives the results to the units of the second layer
|
|
|
$f^{(2)}$, and so forth. The final layer is the \emph{output layer}. The
|
|
|
-intervening layers, situated between the first and the output layers are the
|
|
|
+intervening layers, situated between the input and the output layers, are the
|
|
|
\emph{hidden layers}. The term \emph{forward propagation} describes the
|
|
|
process of information flowing through the network from the input layer to the
|
|
|
output layer, resulting in a scalar loss. The alternating structure of linear
|
|
@@ -503,7 +503,7 @@ data. Then,
|
|
|
\Loss{MSE} = \frac{1}{N}\sum_{i=1}^{N} ||\hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2,
|
|
|
\end{equation}
|
|
|
calculates the squared difference between each model prediction and the true value
|
|
|
-of a training and takes the mean across the whole training data. \\
|
|
|
+of a training data point and takes the mean across the whole training data. \\
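+
+To make forward propagation and the data loss concrete, the following minimal
+sketch implements a small three-layer MLP in the sense of~\Cref{eq:mlp_char}
+together with the mean squared error in NumPy. The layer sizes, the tanh
+activation, and the random weights are illustrative assumptions and not
+prescribed by the text.
+\begin{verbatim}
+# Minimal sketch (illustrative sizes and weights): a three-layer MLP
+# f(x) = f3(f2(f1(x))) and the MSE loss over N training pairs.
+import numpy as np
+
+rng = np.random.default_rng(0)
+
+def hidden_layer(x, W, b):
+    # one layer: linear mapping followed by a nonlinear activation
+    return np.tanh(W @ x + b)
+
+W1, b1 = rng.normal(size=(8, 2)), np.zeros(8)   # layer f1
+W2, b2 = rng.normal(size=(8, 8)), np.zeros(8)   # layer f2
+W3, b3 = rng.normal(size=(1, 8)), np.zeros(1)   # output layer f3 (linear)
+
+def f(x):
+    # forward propagation through the chain f3(f2(f1(x)))
+    return W3 @ hidden_layer(hidden_layer(x, W1, b1), W2, b2) + b3
+
+def mse(y_hat, y):
+    # squared difference per training pair, averaged over the data set
+    return np.mean(np.sum((y_hat - y) ** 2, axis=-1))
+
+X = rng.normal(size=(16, 2))             # 16 input vectors
+Y = rng.normal(size=(16, 1))             # their true values
+Y_hat = np.stack([f(x) for x in X])      # model predictions
+print(mse(Y_hat, Y))
+\end{verbatim}
+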
|
|
|
|
|
|
Ultimately, the objective is to utilize this information to optimize the parameters, in order to minimize the
|
|
|
loss. One of the most fundamental and seminal optimization strategies is \emph{gradient
|
|
@@ -513,8 +513,8 @@ zero. Given that a positive gradient
|
|
|
signifies ascent and a negative gradient indicates descent, we must move the
|
|
|
variable by a \emph{learning rate} (step size) in the opposite
|
|
|
direction to that of the gradient. The calculation of the derivatives with respect
|
|
|
-to the parameters is a complex task, since our functions is a composition of
|
|
|
-many functions (one for each layer). We can address this issue taking advantage
|
|
|
+to the parameters is a complex task, since our function is a composition of
|
|
|
+many functions (one for each layer). We can address this issue by taking advantage
|
|
|
of~\Cref{eq:mlp_char} and employing the chain rule of calculus. Let
|
|
|
$\hat{\boldsymbol{y}} = f(\boldsymbol{x}; \theta)$ be the model prediction with the
|
|
|
decomposed version $f(\boldsymbol{x}; \theta) = f^{(3)}(w; \theta_3)$ with
|
|
@@ -527,14 +527,14 @@ Then,
|
|
|
is the gradient of $\Loss{ }$ with respect to the parameters $\theta_3$. To obtain
|
|
|
$\nabla_{\theta_2} \Loss{ }$, we apply the chain rule again and propagate the gradient
|
|
|
one layer further back through $f^{(3)}$ to $\theta_2$. The name of this method in the context of neural
|
|
|
-networks is \emph{back propagation}~\cite{Rumelhart1986}, as it propagates the
|
|
|
+networks is \emph{backpropagation}~\cite{Rumelhart1986}, as it propagates the
|
|
|
error backwards through the neural network.\\
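+
+As a small illustration of the chain rule behind backpropagation, the following
+sketch computes the gradients of a squared-error loss for a two-layer scalar
+composition by hand. The chosen functions, the tanh activation, and all numeric
+values are arbitrary and only serve this example.
+\begin{verbatim}
+# Minimal scalar sketch of backpropagation for y_hat = f2(f1(x)) with
+# f1(x) = tanh(w1 * x), f2(h) = w2 * h and a squared-error loss.
+import numpy as np
+
+x, y = 0.5, 1.0           # one training pair
+w1, w2 = 0.3, -0.8        # parameters theta_1 and theta_2
+
+# forward propagation
+h = np.tanh(w1 * x)       # activation of the hidden layer f1
+y_hat = w2 * h            # output layer f2
+loss = (y_hat - y) ** 2
+
+# backward pass: apply the chain rule from the output layer inwards
+dL_dyhat = 2.0 * (y_hat - y)
+dL_dw2 = dL_dyhat * h                   # gradient w.r.t. theta_2
+dL_dh = dL_dyhat * w2                   # error propagated back to f1
+dL_dw1 = dL_dh * (1.0 - h ** 2) * x     # tanh'(z) = 1 - tanh(z)^2
+\end{verbatim}
+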
|
|
|
|
|
|
In practical applications, an optimizer often accomplishes the optimization task
|
|
|
-by executing back propagation in the background. Furthermore, modifying the
|
|
|
+by executing backpropagation in the background. Furthermore, modifying the
|
|
|
learning rate during training can be advantageous, for instance by taking larger
|
|
|
steps at the beginning and only minor ones at the end. Schedulers
|
|
|
-are implementations algorithms that employ diverse learning rate alteration
|
|
|
+are algorithms that implement such learning rate adjustment
|
|
|
strategies.\\
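+
+The following minimal sketch shows gradient descent combined with a simple
+exponential learning-rate schedule. The toy objective, the initial learning
+rate, and the decay factor are arbitrary illustrative choices and do not
+correspond to any particular optimizer or scheduler implementation.
+\begin{verbatim}
+# Gradient descent on a toy objective with a decaying learning rate.
+def loss(theta):
+    return (theta - 3.0) ** 2        # toy objective, minimum at theta = 3
+
+def grad(theta):
+    return 2.0 * (theta - 3.0)       # analytic gradient of the objective
+
+theta = 0.0                          # initial parameter value
+lr = 0.5                             # initial learning rate (step size)
+decay = 0.95                         # scheduler: shrink the step each epoch
+
+for epoch in range(100):
+    theta -= lr * grad(theta)        # step against the gradient
+    lr *= decay                      # scheduler reduces the learning rate
+
+print(theta)                         # approaches the minimum at 3.0
+\end{verbatim}
+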
|
|
|
|
|
|
For a more in-depth discussion of practical considerations and additional
|
|
@@ -549,7 +549,7 @@ solutions to differential systems.
|
|
|
\label{sec:pinn}
|
|
|
|
|
|
In~\Cref{sec:mlp}, we describe the structure and training of MLPs, which are
|
|
|
-wildely recognized tools for approximating any kind of function. In 1997
|
|
|
+widely recognized tools for approximating any kind of function. In 1997,
|
|
|
Lagaris \etal~\cite{Lagaris1998} provided a method that utilizes gradient
|
|
|
descent to solve ODEs and PDEs. Building on this approach, Raissi
|
|
|
\etal~\cite{Raissi2019} introduced the methodology with the name
|
|
@@ -577,14 +577,14 @@ fitted to the data through the mean square error data loss $\mathcal{L}_{\text{d
|
|
|
Moreover, the data loss function may include additional terms for initial and boundary
|
|
|
conditions. Furthermore, the physics is incorporated through an additional loss
|
|
|
term, the physics loss $\mathcal{L}_{\text{physics}}$, which includes the
|
|
|
-differential equation through its residual $r=\boldsymbol{y} - \mathcal{D}(\boldsymbol{x})$.
|
|
|
+differential equation through its residual $r=\nicefrac{d\boldsymbol{y}}{d\boldsymbol{x}} - \mathcal{D}(\boldsymbol{x})$.
|
|
|
This leads to the PINN loss function,
|
|
|
\begin{align}\label{eq:PINN_loss}
|
|
|
\mathcal{L}_{\text{PINN}}(\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) & = & & \mathcal{L}_{\text{data}} (\boldsymbol{y},\hat{\boldsymbol{y}}) & + & \quad \mathcal{L}_{\text{physics}} (\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) & \\
|
|
|
& = & & \frac{1}{N_t}\sum_{i=1}^{N_t} || \hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2 & + & \quad\frac{1}{N_d}\sum_{i=1}^{N_d} || r_i(\boldsymbol{x},\hat{\boldsymbol{y}})||^2 & ,
|
|
|
\end{align}
|
|
|
with $N_d$ the number of differential equations in a system and $N_t$ the
|
|
|
-number of training samples used for training. Utilizing~\Cref{eq:PINN_loss}, the
|
|
|
+number of training samples. Utilizing $\mathcal{L}_{\text{PINN}}$, the
|
|
|
PINN simultaneously optimizes its parameters $\theta$ to minimize both the data
|
|
|
loss and the physics loss. This makes it a multi-objective optimization problem.\\
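+
+As an illustration of~\Cref{eq:PINN_loss}, the following hedged sketch
+assembles the data loss and the physics loss for the toy ODE
+$\frac{dy}{dx} = -ky$ with the residual $r = \frac{d\hat{y}}{dx} + k\hat{y}$.
+It assumes that PyTorch is available for the network and for automatic
+differentiation; the network size, the constant $k$, and the training and
+collocation points are illustrative choices only.
+\begin{verbatim}
+# Hedged sketch of a PINN loss for the toy ODE dy/dx = -k*y.
+import torch
+
+net = torch.nn.Sequential(               # small MLP, y_hat = f(x; theta)
+    torch.nn.Linear(1, 32), torch.nn.Tanh(),
+    torch.nn.Linear(32, 32), torch.nn.Tanh(),
+    torch.nn.Linear(32, 1))
+
+k = 2.0
+x_data = torch.linspace(0.0, 1.0, 20).reshape(-1, 1)   # measurement points
+y_data = torch.exp(-k * x_data)                        # observed values
+x_col = torch.linspace(0.0, 1.0, 100).reshape(-1, 1).requires_grad_(True)
+
+def pinn_loss():
+    # data loss: mean squared error on the observations
+    loss_data = torch.mean((net(x_data) - y_data) ** 2)
+    # physics loss: squared ODE residual at the collocation points,
+    # with dy/dx obtained by automatic differentiation
+    y_col = net(x_col)
+    dy_dx = torch.autograd.grad(y_col, x_col,
+                                grad_outputs=torch.ones_like(y_col),
+                                create_graph=True)[0]
+    loss_physics = torch.mean((dy_dx + k * y_col) ** 2)
+    return loss_data + loss_physics
+
+optimizer = torch.optim.Adam(net.parameters(), lr=1e-3)
+for _ in range(5000):                     # minimize both terms jointly
+    optimizer.zero_grad()
+    pinn_loss().backward()
+    optimizer.step()
+\end{verbatim}
+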
|
|
|
|
|
@@ -592,7 +592,7 @@ Given the nature of differential equations, calculating the loss term of
|
|
|
$\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$ requires the
|
|
|
calculation of the derivative of the output with respect to the input of
|
|
|
the neural network. As we outline in~\Cref{sec:mlp}, during the process of
|
|
|
-back-propagation we calculate the gradients of the loss term in respect to a
|
|
|
+backpropagation, we calculate the gradients of the loss term with respect to a
|
|
|
layer-specific set of parameters denoted by $\theta_l$, where $l$ represents
|
|
|
the index of the respective layer. By employing
|
|
|
the chain rule of calculus, the algorithm progresses from the output layer
|
|
@@ -602,7 +602,7 @@ compute the respective gradients. The term,
|
|
|
\nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} = \frac{d\hat{\boldsymbol{y}}}{df^{(2)}}\frac{df^{(2)}}{df^{(1)}}\nabla_{\boldsymbol{x}}f^{(1)},
|
|
|
\end{equation}
|
|
|
illustrates that, in contrast to the procedure described in~\Cref{eq:backprop},
|
|
|
-this procedure the \emph{automatic differentiation} goes one step further and
|
|
|
+this procedure, called \emph{automatic differentiation}, goes one step further and
|
|
|
calculates the gradient of the output with respect to the input
|
|
|
$\boldsymbol{x}$. In order to calculate the second derivative
|
|
|
$\frac{d^2\hat{\boldsymbol{y}}}{d\boldsymbol{x}^2}=\nabla_{\boldsymbol{x}} (\nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} ),$
|
|
@@ -621,16 +621,9 @@ parameters within the neural network. This enables the network to utilize a
|
|
|
specific value that actively influences the physics loss
|
|
|
$\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$. During the
|
|
|
training phase, the optimizer aims to minimize the physics loss, which should
|
|
|
-ultimately yield an approximation of the true parameter value fitting the
|
|
|
+ultimately yield an approximation of the true parameter value that fits the
|
|
|
observations.\\
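+
+A hedged sketch of this inverse setting is given below: the unknown coefficient
+of a toy ODE $\frac{dy}{dx} = -ky$ is wrapped in a learnable parameter
+$\hat{k}$ and handed to the optimizer together with the network weights. The
+initial guess and the optimizer settings are arbitrary, and PyTorch is assumed
+to be available.
+\begin{verbatim}
+# Treating an unknown physical coefficient as a learnable parameter.
+import torch
+
+net = torch.nn.Sequential(torch.nn.Linear(1, 32), torch.nn.Tanh(),
+                          torch.nn.Linear(32, 1))
+k_hat = torch.nn.Parameter(torch.tensor(0.5))  # unknown coefficient, initial guess
+
+x_col = torch.linspace(0.0, 1.0, 100).reshape(-1, 1).requires_grad_(True)
+
+def physics_loss():
+    y_col = net(x_col)
+    dy_dx = torch.autograd.grad(y_col, x_col,
+                                grad_outputs=torch.ones_like(y_col),
+                                create_graph=True)[0]
+    # the learnable k_hat enters the residual and receives gradients as well
+    return torch.mean((dy_dx + k_hat * y_col) ** 2)
+
+# network weights and the physical parameter are optimized jointly;
+# a data loss on observations would be added in the same way as before
+optimizer = torch.optim.Adam(list(net.parameters()) + [k_hat], lr=1e-3)
+\end{verbatim}
+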
|
|
|
|
|
|
-\begin{figure}[t]
|
|
|
- \centering
|
|
|
- \includegraphics[width=\textwidth]{oscilator.pdf}
|
|
|
- \caption{Illustration of of the movement of an oscillating body in the
|
|
|
- underdamped case. With $m=1kg$, $\mu=4\frac{Ns}{m}$ and $k=200\frac{N}{m}$.}
|
|
|
- \label{fig:spring}
|
|
|
-\end{figure}
|
|
|
In order to illustrate how a PINN works, we use the example of a
|
|
|
\emph{damped harmonic oscillator} taken from~\cite{Moseley}. In this problem, we
|
|
|
displace a body, which is attached to a spring, from its resting position. The
|
|
@@ -646,7 +639,16 @@ stiffness of the spring. The residual of the differential equation,
|
|
|
\begin{equation}
|
|
|
m\frac{d^2u}{dx^2}+\mu\frac{du}{dx}+ku=0,
|
|
|
\end{equation}
|
|
|
-shows relation of these parameters in reference to the problem at hand. As
|
|
|
+
|
|
|
+\begin{figure}[t]
|
|
|
+ \centering
|
|
|
+ \includegraphics[width=\textwidth]{oscilator.pdf}
|
|
|
+ \caption{Illustration of the movement of an oscillating body in the
|
|
|
+ underdamped case, with $m=1kg$, $\mu=4\frac{Ns}{m}$ and $k=200\frac{N}{m}$.}
|
|
|
+ \label{fig:spring}
|
|
|
+\end{figure}
|
|
|
+
|
|
|
+shows the relation between these parameters for the problem at hand. As
|
|
|
Tenenbaum and Morris~\cite{Tenenbaum1985} show, there are three possible solutions to this
|
|
|
equation. However, only the \emph{underdamped case} results in an oscillating
|
|
|
movement of the body, as illustrated in~\Cref{fig:spring}. In order to apply a
|
|
@@ -664,7 +666,7 @@ not know the value of the friction $\mu$. In this case the loss function,
|
|
|
\end{equation}
|
|
|
includes the boundary conditions, the residual, in which $\hat{\mu}$ is a learnable
|
|
|
parameter and the data loss. By minimizing $\mathcal{L}_{\text{osc}}$ and
|
|
|
-solving the inverse problem the PINN is able to find the missing parameter
|
|
|
+solving the inverse problem, the PINN is able to find the missing parameter
|
|
|
$\mu$. This shows the methodology by which PINNs are capable of learning the
|
|
|
parameters of physical systems, such as the damped harmonic oscillator. In the
|
|
|
following section, we present the approach of Shaier \etal~\cite{Shaier2021} to
|
|
@@ -674,8 +676,8 @@ find the transmission rate and recovery rate of the SIR model using PINNs.
|
|
|
|
|
|
\subsection{Disease-Informed Neural Networks}
|
|
|
\label{sec:pinn:dinn}
|
|
|
-In the preceding section, we present a data-driven methodology, as described by Lagaris
|
|
|
-\etal~\cite{Lagaris1998}, for solving systems of differential equations by employing
|
|
|
+In the preceding section, we present a data-driven methodology, as described by Raissi
|
|
|
+\etal~\cite{Raissi2019}, for solving systems of differential equations by employing
|
|
|
PINNs. In~\Cref{sec:pandemicModel:sir}, we describe the SIR model, which models
|
|
|
the relations of susceptible, infectious and removed individuals and simulates
|
|
|
the progress of a disease in a population with a constant size. A system of
|
|
@@ -695,24 +697,24 @@ would calculate the initial transmission rate using the initial size of the
|
|
|
susceptible group $S_0$ and the infectious group $I_0$. The recovery rate could
|
|
|
then be defined using the number of days $d$ a person spends between the point of
|
|
|
infection and the start of isolation, $\alpha = \frac{1}{d}$. The analytical
|
|
|
-solutions to the SIR models often use heuristic methods and require knowledge
|
|
|
+solutions to the SIR model often use heuristic methods and require prior knowledge
|
|
|
like the sizes $S_0$ and $I_0$. A data-driven approach such as the one that
|
|
|
Shaier \etal~\cite{Shaier2021} propose does not suffer from these problems, since the
|
|
|
model learns the parameters $\alpha$ and $\beta$ while fitting the training
|
|
|
data, which consists of the time points $\boldsymbol{t}$ and the corresponding
|
|
|
-measured sizes of the groups $\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}$.
|
|
|
-Let $\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}$ be the
|
|
|
+measured sizes of the groups $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$.
|
|
|
+Let $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$ be the
|
|
|
model predictions of the groups and
|
|
|
-$r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\beta \hat{\boldsymbol{S}}\hat{\boldsymbol{I}},
|
|
|
- r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\beta \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\alpha \hat{\boldsymbol{I}}$
|
|
|
-and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \alpha \hat{\boldsymbol{I}}$ the
|
|
|
+$r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}},
|
|
|
+ r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\hat{\alpha} \hat{\boldsymbol{I}}$
|
|
|
+and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \hat{\alpha} \hat{\boldsymbol{I}}$ the
|
|
|
residuals of each differential equation using the model predictions. Then,
|
|
|
\begin{equation}
|
|
|
\begin{split}
|
|
|
- \mathcal{L}_{SIR}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
|
|
|
+ \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
|
|
|
+    &+\frac{1}{N_t}\sum_{i=1}^{N_t} \left(||\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}||^2 + ||\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}||^2 + ||\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}||^2\right),
|
|
|
\end{split}
|
|
|
\end{equation}
|
|
|
-is the loss function of a DINN, with $\alpha$ and $\beta$ being learnable
|
|
|
+is the loss function of a DINN, with $\hat{\alpha}$ and $\hat{\beta}$ being learnable
|
|
|
parameters, which appear in the residuals of the ODEs.
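+
+The following sketch assembles this loss with learnable rates $\hat{\alpha}$
+and $\hat{\beta}$. It again assumes PyTorch; the network architecture, the
+initial guesses, and the tensor shapes are illustrative and not taken from
+Shaier \etal~\cite{Shaier2021}.
+\begin{verbatim}
+# Hedged sketch of a DINN loss with learnable alpha_hat and beta_hat.
+import torch
+
+net = torch.nn.Sequential(               # maps a time point t to (S, I, R)
+    torch.nn.Linear(1, 64), torch.nn.Tanh(),
+    torch.nn.Linear(64, 3))
+
+alpha_hat = torch.nn.Parameter(torch.tensor(0.1))  # learnable recovery rate
+beta_hat = torch.nn.Parameter(torch.tensor(0.5))   # learnable transmission rate
+
+def dinn_loss(t, S, I, R):
+    # t, S, I, R: column tensors of shape (N_t, 1) with the observations
+    t = t.detach().clone().requires_grad_(True)
+    S_hat, I_hat, R_hat = net(t).split(1, dim=1)
+
+    def ddt(y):
+        # time derivative of a prediction via automatic differentiation
+        return torch.autograd.grad(y, t, grad_outputs=torch.ones_like(y),
+                                   create_graph=True)[0]
+
+    # residuals of the three ODEs of the SIR model
+    r_S = ddt(S_hat) + beta_hat * S_hat * I_hat
+    r_I = ddt(I_hat) - beta_hat * S_hat * I_hat + alpha_hat * I_hat
+    r_R = ddt(R_hat) - alpha_hat * I_hat
+    loss_physics = (r_S**2).mean() + (r_I**2).mean() + (r_R**2).mean()
+
+    # data loss on the observed group sizes
+    loss_data = ((S_hat - S)**2).mean() + ((I_hat - I)**2).mean() \
+                + ((R_hat - R)**2).mean()
+    return loss_physics + loss_data
+
+# alpha_hat and beta_hat are optimized jointly with the network weights
+optimizer = torch.optim.Adam(list(net.parameters()) + [alpha_hat, beta_hat],
+                             lr=1e-3)
+\end{verbatim}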
|
|
|
% -------------------------------------------------------------------
|