							% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author:   Phillip Rothenbeck
% Title:    Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
% File:     chap03/chap03.tex
% Part:     Methods
% Description:
%         summary of the content in this chapter
% Version:  20.08.2024
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
\chapter{Methods}
\label{chap:methods}
This chapter presents the methods that we employ to address the problem
introduced in~\Cref{chap:introduction}. \Cref{sec:preprocessing} outlines
our approach to preprocessing the available data and has two
subsections. The first subsection describes the publicly available data provided by
the \emph{Robert Koch Institute} (RKI)\footnote[1]{\url{https://www.rki.de/EN/Home/homepage_node.html}}.
The second subsection outlines the techniques we use to process this data to fit
our project's requirements. Subsequently, we give a theoretical overview of the
PINNs that we employ. These latter sections establish the foundation for the
implementations described in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}.

% -------------------------------------------------------------------

\section{Data Preprocessing}
\label{sec:preprocessing}
For the PINNs to be effective with the data available to us, the data must be
in the format required by the epidemiological
models that the PINNs solve. Let $N_t$ be the number of training points,
and let $i\in\{1, \dots, N_t\}$ be the index of a training point. The data
required by the PINN for solving the SIR model (see~\Cref{sec:pinn:dinn})
consists of pairs $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$.
Since the system of differential equations representing the reduced SIR
model (see~\Cref{sec:pandemicModel:rsir}) consists of a single differential
equation for $I$, this model only requires pairs of the form
$(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$. This section focuses on the
structure of the available data and the methods we employ to transform it into
the required structure.

% -------------------------------------------------------------------

\subsection{RKI Data}
\label{sec:preprocessing:rki}
The Robert Koch Institute is responsible for the monitoring and prevention of
diseases. As the central institution of the German government in the field of
biomedicine, one of its tasks during the COVID-19 pandemic was to track the
number of infections and deaths in Germany. The data was collected by
university hospitals, research facilities, and laboratories through
testing. Each new case must be reported to the respective state authority
within 24 hours. Each state authority collects
the cases for a day and must report them to the RKI by the following working
day. The RKI then refines the data, releases statistics, and updates its
publicly accessible repositories. For the purposes
of this thesis, we concentrate on two of these repositories.\\

The first repository is called \emph{COVID-19-Todesfälle in Deutschland}\footnote{\url{https://github.com/robert-koch-institut/COVID-19-Todesfaelle_in_Deutschland.git}}.
The dataset comprises discrete data points, each with a date indicating
when the respective data was collected. The dates span from
March 9, 2020, to the present day. For each date, the dataset provides the total
number of infection and death cases, the number of new deaths, and the
case-fatality ratio. The total number of infection and death cases represents
the sum of all cases reported up to that date, including the newly reported
data. The repository includes two additional datasets that contain the death case
information on a weekly basis, organized by age group or by the individual
German states.\\

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{dataset_visualization.pdf}
    \caption{A visualization of the total death case and infection case data for
        each day from the dataset \emph{COVID-19-Todesfälle in Deutschland}, as
        of August 20, 2024.}
    \label{fig:rki_data}
\end{figure}

The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}.
This dataset contains comprehensive daily data on the infections in each county.
The counties are encoded using the \emph{Community Identification Number}\footnote{\url{https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html}},
wherein the first two digits denote the state, the third digit represents the
government district, and the last two digits indicate the county. Each data
point lists the gender, the age group, the number of death, infection, and
recovery cases, and the reference and report dates. The reference date marks the onset of
illness in the individual. In the absence of this information, the reference
date is equivalent to the report date.\\

The RKI assumes that the illness lasts 14 days under normal conditions
and 28 days in severe cases. The recovery cases
in the dataset are calculated from these assumptions by adding the duration to
the reference date, if it is given. As stated in the ReadMe, the recovery data
should be used with caution. Since we require the recovery data for further
calculations, the following section presents the solutions we employ to address
this issue.

% -------------------------------------------------------------------

\subsection{Recovery Queue and Recovery Rate}
\label{sec:preprocessing:rq}

At the outset of this section, we established the format of the data that is
necessary for training the PINNs. In this subsection, we present the method that we
employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki})
into the training data. \\

To obtain the SIR data, we require the size of each SIR compartment at
each time point. The infection case data for the German states is available on
a daily basis. To obtain the daily cases for the entire country, we take the
day-to-day differences of the cumulative number of cases. The size of the
population is defined as its size at the beginning of 2020. Using the starting conditions
of~\Cref{eq:startCond}, we iterate through each day, updating the sizes of the
groups consecutively. In each iteration, we subtract the new infection
cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$; for
$\boldsymbol{I}^{(i)}$, we add the new cases and subtract deaths and recoveries;
and $\boldsymbol{R}^{(i)}$ is obtained by adding the new deaths and
recoveries as they occur.\\
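
The iteration described above can be summarized as a recurrence. As a sketch,
assume the auxiliary symbols $\Delta\boldsymbol{C}^{(i)}$,
$\Delta\boldsymbol{D}^{(i)}$, and $\Delta\boldsymbol{Q}^{(i)}$ for the new
infections, new deaths, and new recoveries on day $i$. These symbols are
illustrative notation introduced only for this sketch. Then
\begin{align*}
    \boldsymbol{S}^{(i)} &= \boldsymbol{S}^{(i-1)} - \Delta\boldsymbol{C}^{(i)},\\
    \boldsymbol{I}^{(i)} &= \boldsymbol{I}^{(i-1)} + \Delta\boldsymbol{C}^{(i)} - \Delta\boldsymbol{D}^{(i)} - \Delta\boldsymbol{Q}^{(i)},\\
    \boldsymbol{R}^{(i)} &= \boldsymbol{R}^{(i-1)} + \Delta\boldsymbol{D}^{(i)} + \Delta\boldsymbol{Q}^{(i)}.
\end{align*}
Adding the three lines shows that the increments cancel, so
$\boldsymbol{S}^{(i)} + \boldsymbol{I}^{(i)} + \boldsymbol{R}^{(i)}$ keeps its
initial value in every iteration.\\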

As previously stated in~\Cref{sec:preprocessing:rki}, the data on recoveries may
be either unreliable or entirely absent. To address this, we propose a method
for computing the number of recovered individuals per day. Under the assumption
that recovery takes $D$ days, we introduce the recovery queue, a data structure
that takes in the number of infections for a given day, retains them for $D$ days,
and then releases them into the removed group.\\

\begin{figure}[h]
    \centering
    \includegraphics[width=\textwidth]{recovery_queue.pdf}
    \caption{The recovery queue takes in the infected individuals of the $k$-th
        day and releases them $D$ days later into the removed group.}
    \label{fig:recovery_queue}
\end{figure}
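
The behavior of the recovery queue can be sketched formally. Assume, as
illustrative notation only, that $\Delta\boldsymbol{C}^{(k)}$ denotes the new
infections entering the queue on day $k$ and $\Delta\boldsymbol{Q}^{(k)}$ the
individuals released from it on day $k$. Then
\begin{equation*}
    \Delta\boldsymbol{Q}^{(k)} =
    \begin{cases}
        \Delta\boldsymbol{C}^{(k-D)}, & k > D,\\
        0, & \text{otherwise.}
    \end{cases}
\end{equation*}
For example, with a hypothetical $D = 3$ and new infections of $5$, $8$, $2$,
and $7$ on days $1$ to $4$, the queue releases nothing on days $1$ to $3$ and
releases the $5$ cases of day $1$ on day $4$. This sketch does not cover
individuals who die while they are in the queue.\\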

To obtain the data for the reduced SIR model, we employ an algorithm similar to that
used for the SIR model. However, instead of the recovery queue, we use
a fixed recovery rate $\alpha$: on day $i$, the portion $\alpha\boldsymbol{I}^{(i)}$
of the infected individuals is regarded as recovered and moved into the
$\boldsymbol{R}^{(i)}$ compartment, which is irrelevant for our purposes. \\
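
One possible reading of this update can be written as a recurrence. The
indexing below is an assumption, since the text does not fix whether the
recovered portion is taken from the current or the previous day, and
$\Delta\boldsymbol{C}^{(i)}$ is illustrative notation for the new infections on
day $i$:
\begin{equation*}
    \boldsymbol{I}^{(i)} = \boldsymbol{I}^{(i-1)} + \Delta\boldsymbol{C}^{(i)} - \alpha\boldsymbol{I}^{(i-1)}
    = (1-\alpha)\,\boldsymbol{I}^{(i-1)} + \Delta\boldsymbol{C}^{(i)}.
\end{equation*}
For instance, with the hypothetical values $\alpha = 0.1$,
$\boldsymbol{I}^{(i-1)} = 1000$, and $\Delta\boldsymbol{C}^{(i)} = 50$, this
gives $\boldsymbol{I}^{(i)} = 0.9 \cdot 1000 + 50 = 950$.\\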

The transformed data for both the SIR model and the reduced SIR model are then
employed by the PINN models, which we describe in the subsequent section.

% -------------------------------------------------------------------

\section{PINN for the SIR Model}
\label{sec:pinn:sir}

In the previous section, we presented the methods that we use to transform the RKI data
(see~\Cref{sec:preprocessing}) into the format that the PINNs use to seek
a solution for the SIR models. In this section, we lay out the methodology we
employ in this thesis concerning PINNs for SIR models.\\

The data yielded by the preprocessing is structured as pairs
$(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)},\boldsymbol{I}^{(i)},\boldsymbol{R}^{(i)}))$,
which contain the sizes of the susceptible, infectious, and removed compartments
together with their respective time point with the index $i$. This means that
the training data contains measured values of the functions $S(t)$,
$I(t)$, and $R(t)$, which a neural network may use to approximate these
functions. A PINN can carry out this task with higher precision, particularly
for more complex problems where the unknown function is complicated and only a
system of differential equations is given.\\

In this thesis, we want to find the solutions of the SIR models that correspond to the
case data of the datasets. The SIR model is given by the system of differential
equations (see~\Cref{eq:sir}), which describes the relations and fluctuations of
the three compartments through the transition rates $\beta$ and $\alpha$. As we
explain in~\Cref{sec:pandemicModel:sir}, these parameters influence the course of
the pandemic described by the respective model. Mathematically, when
we find a pair of parameters for a dataset, these parameters determine a
function that solves the system of differential equations for our dataset. A
PINN finds parameters for a given set of differential equations by solving the
inverse problem. As Shaier \etal~\cite{Shaier2021} propose, a DINN solves inverse
problems by replacing the parameters $\beta$ and $\alpha$ with trainable parameters
$\widehat{\beta}$ and $\widehat{\alpha}$. As described in~\Cref{sec:pinn}, the
DINN learns these parameters while optimizing its model predictions $\hat{\boldsymbol{S}}$,
$\hat{\boldsymbol{I}}$, and $\hat{\boldsymbol{R}}$ to fit both the differential
equations, through their residuals, and the given data.\\

The PINN uses the loss function to measure how far it is from the true
solution. For the DINN~\cite{Shaier2021}, this loss function includes the mean
squared error of each residual in addition to the mean squared error of the
model predictions with respect to their true solutions. In contrast to
Shaier \etal, who use the set of differential equations of~\Cref{eq:sir} for
their loss function, we use~\Cref{eq:modSIR}. The reason for this choice is that
it performed better in practice during our work than
the equations used by Shaier \etal. Let $N$ be the size of the population and
$N_t$ the number of training points of the dataset used. Then,

\begin{equation}
    \begin{split}
        \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) =
        &\bigg\|\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}} + \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}\bigg\|^2\\
        + &\bigg\|\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \widehat{\alpha}\hat{\boldsymbol{I}}\bigg\|^2\\
        + &\bigg\|\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} - \widehat{\alpha}\hat{\boldsymbol{I}}\bigg\|^2\\
        + &\frac{1}{N_t}\sum_{i=1}^{N_t} \bigg(\Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2\bigg),
    \end{split}
\end{equation}

is the loss function that we employ to find the transition parameters $\beta$ and
$\alpha$ for the given dataset.
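
Read as an optimization problem, training can be sketched as a joint
minimization over the network and the two rates. Here, $\theta$ is an assumed
symbol for the network weights, which the text above does not name, and the
predictions $\hat{\boldsymbol{S}}$, $\hat{\boldsymbol{I}}$, and
$\hat{\boldsymbol{R}}$ depend on $\theta$:
\begin{equation*}
    (\theta^{*}, \widehat{\beta}^{*}, \widehat{\alpha}^{*}) =
    \operatorname*{arg\,min}_{\theta,\, \widehat{\beta},\, \widehat{\alpha}}\;
    \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}).
\end{equation*}
One reading of the residual norms that agrees with the description as a mean
squared error is $\|\boldsymbol{v}\|^2 = \frac{1}{N_t}\sum_{i=1}^{N_t}
\big(\boldsymbol{v}^{(i)}\big)^2$ for a vector $\boldsymbol{v}$ of residuals
evaluated at the training points. This normalization is an assumption of the
sketch and is not stated in the loss function itself.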

% -------------------------------------------------------------------

\section{PINN for the reduced SIR Model}
\label{sec:pinn:rsir}