% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author:   Phillip Rothenbeck
% Title:    Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
% File:     chap03/chap03.tex
% Part:     Methods
% Description:
%         summary of the content in this chapter
% Version:  26.08.2024
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Methods}
\label{chap:methods}
This chapter provides the methods, that we employ to address the problem that we
present in~\Cref{chap:introduction}.~\Cref{sec:preprocessing} outlines
our approaches for preprocessing of the available data and has two
sections. The first section describes the publicly available data provided by
the \emph{Robert Koch Institute} (RKI). The second section outlines the
techniques we use to process this data to fit our project's requirements.
Subsequently, we give a theoretical overview of the PINN's that we employ.
These latter sections, establish the foundation for the implementations
used in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}.

% -------------------------------------------------------------------

\section{Epidemiological Data}
\label{sec:preprocessing}
In this thesis, we want to analyze the COVID-19 pandemic in Germany utilizing
the SIR model and PINNs. For a PINN to learn the parameters of the SIR model,
we need pandemic data in the correct format for the approach. Let $N_t$ be the
number of training points, then let $i\in\{1, ..., N_t\}$
be the index of the training points. The data required by the PINN for solving
the SIR model (see~\Cref{sec:pinn:dinn}), consists of pairs
$(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$,
with $\boldsymbol{t}^{(i)}$ representing the time in days since the first
measurement and $\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}$, and $\boldsymbol{R}^{(i)}$
the corresponding size of the compartments. Given that the system of
differential equations representing the reduced SIR model
(see~\Cref{sec:pandemicModel:rsir}) consists of a single differential equation
for $I$, it is necessary to obtain pairs of the form $(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$.
This section, focuses on the structure of the available data and the methods we
employ to transform it into the correct structure.

% -------------------------------------------------------------------

\subsection{RKI Data}
\label{sec:preprocessing:rki}
The RKI is a biomedical institute in Germany responsible for
the monitoring and prevention of diseases. As the central institution of the
German government in the field of biomedicine, one of its tasks during the
COVID-19 pandemic was to track the number of infections and death cases in
Germany. The data was collected by university hospitals, research facilities
and laboratories through the conduction of tests. Each new case had to be
reported within a period of 24 hours at the latest to the respective state
authority. Each state authority collects the cases for a day and had to report
them to the RKI by the following working day~\cite{GHDead}. The RKI then refines
the data and releases statistics and updates its repositories holding the
information for the public to access. For the purposes of this thesis we
concentrate on two of these repositories.\\

The first repository is called \emph{COVID-19-Todesfälle in Deutschland}~\cite{GHDead}.
The dataset comprises discrete data points, each with a date indicating the
point in time at which the respective data was collected. The dates span from
2020-03-09, to the present day. For each date, the dataset provides the total
number of infection and death cases, the number  of new deaths, and the
case-fatality ratio. The total number of infection and death cases represents
the sum of all cases reported up to that date, including the newly reported
data. The dataset includes two additional subsets, that contain the death case
information organized by age group or by the individual states within Germany on
a weekly basis.\\

\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{dataset_visualization.pdf}
    \caption{A visualization of the total death case and infection case data for
        each day from the data set \emph{COVID-19-Todesfälle in Deutschland}. Status
        of 2024-08-20.}
    \label{fig:rki_data}
\end{figure}

The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}~\cite{GHInf}.
This dataset contains comprehensive data regarding the infections of each county
on a daily basis. The counties are encoded using the \emph{Community Identification Number}~\cite{GFSO},
wherein the first two digits denote the state, the third digit represents the
government district, and the last two digits indicate the county. Each data
point displays the gender, the age group, number death, infection and recovery
cases and the reference and report date. The reference date marks the onset of
illness in the individual. In the absence of this information, the reference
date is equivalent to the report date.\\

The RKI assumes that the duration of the illness under normal conditions is 14 days,
while the duration of severe cases is assumed to be 28 days. The recovery cases
in the dataset are calculated using these assumptions, by adding the duration on
the reference date if it is given. As the RKI states, the recovery data should be used
with caution. Since we require the recovery data for further calculations, the
following section presents the solutions we employed to address this issue.

% -------------------------------------------------------------------

\subsection{Data Preprocessing}
\label{sec:preprocessing:rq}

At the outset of this section, we establish the format of the data, that is
necessary for training the PINNs. In this section, we present the method, that we
employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki})
into the training data. \\

In order to obtain the SIR data we require the size of each SIR compartment for
each time point.  The infection case data for the German states is available on
a daily basis. To obtain the daily cases for the entire country we need to
differentiate the total number of cases. The size of the population is defined
as the respective size at the beginning of 2020.  Using the starting conditions
of~\Cref{eq:startCond}, we iterate through each day, modifying the sizes of the
groups in a consecutive manner. For each iteration we subtract the new infection
cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$, for
$\boldsymbol{I}^{(i)}$, we add the new cases and subtract deaths and recoveries,
and the size of $\boldsymbol{R}^{(i)}$ is obtained by adding the new deaths and
recoveries as they occur.\\

As previously stated in~\Cref{sec:preprocessing:rki} the data on recoveries may
either be unreliable or is entirely absent. To address this, we propose a method
for computing the number of recovered individuals per day. Under the assumption
that recovery takes $D$ days, we present the recovery queue, a data structure
that holds the number of infections for a given day, retains them for $D$ days,
and releases them into the removed group $D$ days later.\\

\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{recovery_queue.pdf}
    \caption{The recovery queue takes in the infected individuals for the $k$'th
        day and releases them $D$ days later into the removed group.}
    \label{fig:recovery_queue}
\end{figure}

In order to solve the reduced SIR model, we employ a similar algorithm to that
used for the SIR model. However, in contrast to the recovery queue, we utilize
a constant recovery rate $\alpha$ to transfer a portion $\alpha\boldsymbol{I}^{(i)}$
of infections, which have recovered or died on the $i$'th day and put them into
the $\boldsymbol{R}^{(i+1)}$ compartment of the next day, which is irrelevant to
our purposes. The transformed data for both the SIR model and the reduced SIR
model are then employed by the PINN models, which we describe in the subsequent
section.

% -------------------------------------------------------------------

\section{Estimating Epidemiological Parameters using PINNs}
\label{sec:pinn:sir}

In the preceding section, we present the methods we employ to preprocess and
format the data from the RKI in accordance with the specifications required for
the application in this thesis. Here, we will present the method we employ
to identify the SIR parameters $\alpha$ and $\beta$ for our data. As a
foundation for our work, we draw upon the work of Shaier \etal~\cite{Shaier2021},
to solve the SIR system of ODEs using PINNs.\\

In order to conduct an analysis of a pandemic, it is necessary to have a
quantifiable measure that indicates whether the disease in question has the
capacity to spread rapidly through a population or is it not successful in
infecting a significant number of individuals. In~\Cref{sec:pandemicModel:sir},
we provide an in-depth discussion of the SIR model, and show, that the
transmission rate $\beta$ and the recovery rate $\alpha$ work as the
aforementioned quantifiers in this model. The specific values of these
epidemiological parameters belonging to the training data define a function that
solves the system of differential equations of the SIR model. This function is
able to return the size of each compartment at a specific point in time. Thus,
from a mathematical and semantic perspective, it is essential to determine the
corresponding values governing the development of the pandemic.\\

In order to ascertain the transmission rate $\beta$ and the recovery rate
$\alpha$ from the preprocessed RKI data of $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$
for a given set of time points, it is necessary to employ a data-driven approach
that outputs a model prediction of $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$
for a set of time points, with the aim of minimizing the term,
\begin{equation}\label{eq:SIR_obs_term}
    \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2  + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
\end{equation}
for each data point in the set of training data points of a cardinality $N_t$ and with
$i\in\{1, ..., N_t\}$. Moreover, the aforementioned parameters must satisfy the system
of differential equations that govern the SIR model. For this reason, Shaier
\etal~\cite{Shaier2021} utilize a PINN framework to satisfy both requirements.
Their approach, which they refer to as the \emph{Disease-Informed Neural Network}
(see~\Cref{sec:pinn:dinn}), takes epidemiological data as the input and returns
the two epidemiological parameters $\alpha$ and $\beta$. Their method
achieves this by finding an approximate solution of to the inverse problem of
physics-informed neural networks (see~\Cref{sec:pinn}). In terms of the SIR
model, a PINN addresses the inverse problem in two ways. First, it minimizes the mean of~\Cref{eq:SIR_obs_term}
by bringing the model predictions $\hat{\Psi}$
closer to the actual values $\Psi$
for each time point. Second, it reduces the residuals of the ODEs that
constitute the SIR model. While the forward problem concludes at this point, the
inverse problem presets that a parameter is unknown. Thus, we designate the parameters
$\alpha$ and $\beta$ as free, learnable parameters, $\hat{\beta}$ and
$\hat{\alpha}$. These separate trainable parameters are values that are
optimized during the training process and must fit the equations of the set of
ODEs. \\

Assuming that the values of the epidemiological parameters stay below
1~\cite{Shaier2021}, we force the value of both rates to be in a
range of $[-1, 1]$. Therefore, we regularize the parameters using the
\emph{tangens hyperbolicus}. This results in the terms,
\begin{equation}
    \tilde{\beta} = \tanh(\hat{\beta}),\quad \tilde{\alpha} = \tanh(\hat{\alpha}),
\end{equation}
where $\tilde{\alpha}$ and $\hat{\beta}$ are regularized model predictions.\\

The input data must include the time point $\boldsymbol{t}^{(i)}$ and its
corresponding measured true values of $\Psi^{(i)}$.
In its forward path, the PINN receives the time point $\boldsymbol{t}^{(i)}$ as its input, from which it
calculates its model prediction $\hat{\Psi}^{(i)}$
based on its model parameters $\theta$. Subsequently, the model computes the loss function. It calculates the data loss by taking the
mean squared error of~\Cref{eq:SIR_obs_term} over all $N_t$ training samples.
Therefore,
\begin{equation}
    \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2  + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
\end{equation}
is the term for the data loss. We find the ODEs of~\Cref{eq:modSIR} perform best
in our setup. Hence, we utilize them in our physics loss. In order for the model
to learn the system of differential equations, it is necessary to obtain the
residual of each ODE. The mean square error of the residuals constitutes the
physics loss
$\mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi})$.
The residuals are calculated using the model predictions $\hat{\Psi}$
and the regularized model predictions of the parameters, $\tilde{\alpha}$ and $\tilde{\beta}$.
The residuals are given by,
\begin{equation}
    0=\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}}+ \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}, \quad 0=\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \tilde{\alpha}\hat{\boldsymbol{I}}, \quad 0=\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} + \hat{\alpha}\hat{\boldsymbol{I}}.
\end{equation}
Thus,
\begin{equation}
    \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = \mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi}) + \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi})
\end{equation}
is the multi-objective loss equation encapsuling both the physics loss and the
data loss for our approach. By minimizing these loss terms our model learns the
given training data but also the physics of the system. This enables our model
to simultaneously learn the values of the parameters $\alpha$ and $\beta$
during training.\\

As this section concentrates on the finding of the time constant parameters
$\alpha$ and $\beta$, the next section will show our approach of finding the
reproduction number $\Rt$ on the German data of the RKI.

% -------------------------------------------------------------------

\section{Estimating the Reproduction Number using PINNs}
\label{sec:pinn:rsir}

The previous section illustrates the methodology we employ to determine the
constant transmission and recovery rates from a data set obtained during
the COVID-19 pandemic in Germany. In this section, we utilize PINNs to identify
the time-dependent reproduction number, $\Rt$, while reducing the number of
state variables and the reliance on assumptions, by decreasing the number of ODEs
in the system of differential equations of the SIR model. The methodology
presented in this section is based on the approach developed by Millevoi
\etal~\cite{Millevoi2023}.\\

In real-world pandemics, the rate of infection is influenced by a multitude of
factors. Events such as the growing awareness for the disease among the general
population, the introduction of non-pharmaceutical mitigations such as social
distancing policies, and the emergence of a new variant have an impact on the
transmission rate $\beta$. Accordingly, a transmission rate that is not
time-dependent and constant across the entire duration of the pandemic may not
accurately reflect the dynamics of the spread of a real-world disease. In~\Cref{sec:pandemicModel:rsir},
we provide, following Millevoi \etal~\cite{Millevoi2023}, the definition of the
time-dependent $\beta(t)$ and subsequently that of the reproduction number,
$\Rt$ which represents the number of new infections that occur as a result of
one infectious individual. It indicates whether a pandemic is emerging or if it
is spreading rapidly through the susceptible population. By inserting the
definition of~\Cref{eq:repr_num}, into the system of ODEs of the SIR model, we
can derive one~\Cref{eq:reduced_sir_ODE}. In order to solve this, we must
identify a function that maps a time point to the size of the infectious
compartment and the specific reproduction number.\\

As with the constant epidemiological parameters, we employ a data-driven approach for
identifying the time-dependent reproduction number $\Rt$. The PINN approximates
the size $\boldsymbol{I}$ with its model prediction $\hat{\boldsymbol{I}}$ by
minimizing the term,
\begin{equation}\label{eq:rSir_squared_err}
    \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2,
\end{equation}
for each $i\in\{1,...,N_t\}$. In order to identify the reproduction number, the
PINN minimizes the residuals of the ODE during the training process. The
training process is analogous to the PINN, which identifies $\alpha$
and $\beta$ (see~\Cref{sec:pinn:sir}). However, there are two key differences. Firstly, the absence of
free, trainable parameters. Secondly, the inclusion of an additional state variable that
fluctuates in response to the input. While the state variable $\boldsymbol{I}$
is approximated using the error between the training data and the predicted
values, the state variable $\Rt$ is approximated exclusively based on the
residual of the ODE.\\

The PINN receives the input of $\boldsymbol{t}^{(i)}$ and generates a prediction of
($\hat{\boldsymbol{I}}^{(i)}$, $\Rt^{(i)}$). As previously stated, the PINN minimizes
the distance between the true values of $\boldsymbol{I}$ and the model predictions
$\hat{\boldsymbol{I}}$ by minimizing the mean squared error. Consequently, the
data loss function is defined by,
\begin{equation}
    \mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2.
\end{equation}
The physics loss function is defined as the squared error of the residual of the
ODE. The residual of the reduced SIR model is given by,
\begin{equation}
    0 = \frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s).
\end{equation}
During training we first fit the data agnostic to physics utilizing only the
data loss $\mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}})$.
Then we train on composite loss function given by,
\begin{equation}
    \mathcal{L}_{\text{rSIR}}(\boldsymbol{t}, \boldsymbol{I}, \hat{\boldsymbol{I}}) = \bigg\|\frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s)\bigg\|^2+ \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2,
\end{equation}
to achieve a better solution.\\

Although we set the transmission rate to be time-dependent, we define the
recovery time to be constant over time to reduce the complexity of the problem. The
RKI~\cite{GHInf} posits that the typical recovery period for the illness under
normal conditions is 14 days, while those individuals with severe cases require
approximately 28 days to recover. As we assume the case with a normal condition,
we can set the recovery time to $D=14$, which yields $\alpha = \nicefrac{1}{14}$.\\

We perform extensive empirical evaluations of the methodology employed to
determine the reproduction number, along with the other techniques, that this
chapter presents in the next chapter.

% -------------------------------------------------------------------