- % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- % Author: Phillip Rothenbeck
- % Title: Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
- % File: chap03/chap03.tex
- % Part: Methods
- % Description:
- % summary of the content in this chapter
- % Version: 20.08.2024
- % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- \chapter{Methods}
- \label{chap:methods}
- This chapter presents the methods that we employ to address the problem
- introduced in~\Cref{chap:introduction}. \Cref{sec:preprocessing} outlines
- our approach to preprocessing the available data and has two
- subsections. The first subsection describes the publicly available data provided by
- the \emph{Robert Koch Institute} (RKI)\footnote[1]{\url{https://www.rki.de/EN/Home/homepage_node.html}}.
- The second subsection outlines the techniques we use to process this data to fit
- our project's requirements. Subsequently, we give a theoretical overview of the
- PINNs that we employ. These latter sections establish the foundation for the
- implementations described in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}.
- % -------------------------------------------------------------------
- \section{Data Preprocessing}
- \label{sec:preprocessing}
- For the PINNs to be effective with the data available to us, the data must
- be in the format required by the epidemiological models that the PINNs
- solve. Let $N_t$ be the number of training points and let
- $i\in\{1, \dots, N_t\}$ be the index of a training point. The data
- required by the PINN for solving the SIR model (see~\Cref{sec:pinn:dinn})
- consists of pairs $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$.
- Since the system of differential equations representing the reduced SIR
- model (see~\Cref{sec:pandemicModel:rsir}) consists of a single differential
- equation for $I$, this model only requires pairs of the form
- $(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$. This section focuses on the
- structure of the available data and the methods we employ to transform it into
- the required structure.
- % -------------------------------------------------------------------
- \subsection{RKI Data}
- \label{sec:preprocessing:rki}
- The Robert Koch Institute is responsible for the monitoring and prevention of
- diseases. As the central institution of the German government in the field of
- biomedicine, one of its tasks during the COVID-19 pandemic was to track the
- number of infections and death cases in Germany. The data was collected by
- university hospitals, research facilities, and laboratories by conducting
- tests. Each new case must be reported to the respective state authority within
- 24 hours. Each state authority collects
- the cases for a day and must report them to the RKI by the following working
- day. The RKI then refines the data, releases statistics, and updates its
- publicly accessible repositories. For the purposes
- of this thesis, we concentrate on two of these repositories.\\
- The first repository is called \emph{COVID-19-Todesfälle in Deutschland}\footnote{\url{https://github.com/robert-koch-institut/COVID-19-Todesfaelle_in_Deutschland.git}}.
- The dataset comprises discrete data points, each with a date indicating the
- point in time at which the respective data was collected. The dates span from
- March 9, 2020, to the present day. For each date, the dataset provides the total
- number of infection and death cases, the number of new deaths, and the
- case-fatality ratio. The total number of infection and death cases represents
- the sum of all cases reported up to that date, including the newly reported
- data. The repository includes two additional datasets that contain the death case
- information on a weekly basis, organized either by age group or by the individual
- states of Germany.\\
- \begin{figure}[h]
- \centering
- \includegraphics[width=\textwidth]{dataset_visualization.pdf}
- \caption{A visualization of the total death case and infection case data for
- each day from the dataset \emph{COVID-19-Todesfälle in Deutschland}, as of
- August 20, 2024.}
- \label{fig:rki_data}
- \end{figure}
- The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}.
- This dataset contains detailed daily infection data for each county.
- The counties are encoded using the \emph{Community Identification Number}\footnote{\url{https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html}},
- wherein the first two digits denote the state, the third digit represents the
- government district, and the last two digits indicate the county. Each data
- point contains the gender, the age group, the number of death, infection, and
- recovery cases, and the reference and report dates. The reference date marks the onset of
- illness in the individual. In the absence of this information, the reference
- date is equivalent to the report date.\\
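- The encoding described above can be illustrated with a minimal sketch. It
- assumes a five-digit identifier as described in the text; the names
- \texttt{CountyKey} and \texttt{parse\_county\_id} are hypothetical, and the
- zero-padding covers the assumed case that an identifier was read as an integer
- and lost its leading zero.

```python
from typing import NamedTuple


class CountyKey(NamedTuple):
    state: str      # first two digits
    district: str   # third digit
    county: str     # last two digits


def parse_county_id(county_id) -> CountyKey:
    """Split a five-digit county identifier into its three parts."""
    # Assumption: identifiers stored as integers have lost leading zeros.
    code = str(county_id).zfill(5)
    if len(code) != 5 or not code.isdigit():
        raise ValueError(f"expected a five-digit identifier, got {county_id!r}")
    return CountyKey(state=code[:2], district=code[2], county=code[3:])
```

- Keeping the parts as strings preserves leading zeros, which would be lost in
- an integer representation.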
- The RKI assumes that the illness lasts 14 days under normal conditions,
- while severe cases are assumed to last 28 days. The recovery cases
- in the dataset are calculated from these assumptions by adding the duration to
- the reference date, if it is given. As stated in the ReadMe, the recovery data
- should be used with caution. Since we require the recovery data for further
- calculations, the following section presents the approach we employ to address
- this issue.
- % -------------------------------------------------------------------
- \subsection{Recovery Queue and Recovery Rate}
- \label{sec:preprocessing:rq}
- At the beginning of~\Cref{sec:preprocessing}, we establish the format of the data that is
- necessary for training the PINNs. In this subsection, we present the method that we
- employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki})
- into the training data.\\
- To obtain the SIR data, we require the size of each SIR compartment at
- each time point. The infection case data for the German states is available on
- a daily basis. To obtain the daily cases for the entire country, we take the
- difference between the total numbers of cases of consecutive days. The size of
- the population is defined
- as the respective size at the beginning of 2020. Using the starting conditions
- of~\Cref{eq:startCond}, we iterate through each day and update the sizes of the
- groups consecutively. In each iteration, we subtract the new infection
- cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$. For
- $\boldsymbol{I}^{(i)}$, we add the new cases and subtract the deaths and recoveries,
- and we obtain $\boldsymbol{R}^{(i)}$ by adding the new deaths and
- recoveries as they occur.\\
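- The iteration described above can be sketched as follows. This is a minimal
- illustration, not the implementation used in this thesis: the function names
- are hypothetical, the starting condition (the whole population is susceptible
- before the first day) is an assumption standing in for~\Cref{eq:startCond},
- and \texttt{new\_removed} is assumed to hold the deaths plus recoveries of
- each day.

```python
def daily_increments(cumulative):
    """Turn cumulative totals into per-day counts by differencing."""
    previous = 0  # assumption: no cases before the first entry
    daily = []
    for total in cumulative:
        daily.append(total - previous)
        previous = total
    return daily


def build_sir_series(population, new_infections, new_removed):
    """Iterate day by day and update the three compartments.

    new_removed[i] holds the deaths plus recoveries of day i.
    """
    # Assumed starting condition: everyone is susceptible before day one.
    s, i, r = population, 0, 0
    series = []
    for infected, removed in zip(new_infections, new_removed):
        s -= infected            # new cases leave S
        i += infected - removed  # ... enter I, removed cases leave I
        r += removed             # ... and enter R
        series.append((s, i, r))
    return series
```

- Every update moves individuals between compartments only, so the three
- sizes always sum to the population size.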
- As stated in~\Cref{sec:preprocessing:rki}, the data on recoveries may
- be unreliable or entirely absent. To address this, we propose a method
- for computing the number of recovered individuals per day. Under the assumption
- that recovery takes $D$ days, we introduce the recovery queue, a data structure
- that takes in the number of infections of a given day, retains them for $D$ days,
- and then releases them into the removed group.\\
- \begin{figure}[h]
- \centering
- \includegraphics[width=\textwidth]{recovery_queue.pdf}
- \caption{The recovery queue takes in the infected individuals of the $k$-th
- day and releases them $D$ days later into the removed group.}
- \label{fig:recovery_queue}
- \end{figure}
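- A minimal sketch of such a recovery queue is given below. The class name
- \texttt{RecoveryQueue} and the method \texttt{step} are hypothetical, and the
- sketch assumes that nobody is infected before the first day, so the queue
- starts out filled with zeros. It covers the recoveries only and does not
- model how deaths are taken into account.

```python
from collections import deque


class RecoveryQueue:
    """Holds daily infection counts for `duration` days, then releases them."""

    def __init__(self, duration):
        if duration < 1:
            raise ValueError("duration must be at least one day")
        # Pre-filled with zeros: nobody recovers during the first `duration` days.
        self._queue = deque([0] * duration)

    def step(self, new_infections):
        """Enqueue today's infections and return those from `duration` days ago."""
        self._queue.append(new_infections)
        return self._queue.popleft()
```

- With $D = 2$, calling \texttt{step} with the daily infections $5, 3, 1$
- returns $0, 0, 5$: the infections of the first day are released on the third
- day.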
- To obtain the data for the reduced SIR model, we employ an algorithm similar to
- the one used for the SIR model. However, instead of the recovery queue, we use
- a fixed recovery rate $\alpha$: on day $i$, a portion $\alpha\boldsymbol{I}^{(i)}$
- of the infectious individuals is considered recovered and is moved into the
- $\boldsymbol{R}^{(i)}$ compartment, which the reduced model does not require.\\
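- The following sketch illustrates this variant under stated assumptions: the
- function name is hypothetical, and the order of operations (the recovering
- portion is computed from the infectious group of the previous day before the
- new cases are added) is an assumption, not a detail taken from the text.

```python
def build_infectious_series(new_infections, alpha, initial_infectious=0.0):
    """Track only the I compartment, removing a share alpha of it each day."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    infectious = initial_infectious
    series = []
    for infected in new_infections:
        recovered = alpha * infectious  # portion alpha * I leaving the group
        infectious = infectious + infected - recovered
        series.append(infectious)
    return series
```

- Since only the infectious group is tracked, the recovered portion is
- discarded rather than accumulated in a removed compartment.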
- The transformed data for both the SIR model and the reduced SIR model are then
- employed by the PINN models, which we describe in the subsequent section.
- % -------------------------------------------------------------------
- \section{PINN for the SIR Model}
- \label{sec:pinn:sir}
- In the previous section, we present the methods we use to transform the RKI data
- (see~\Cref{sec:preprocessing}) into the format used by the PINNs to find
- a solution for the SIR models. In this section, we lay out the PINN methodology
- that we employ in this thesis for the SIR model.\\
- The preprocessing yields data in the form of pairs
- $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)},\boldsymbol{I}^{(i)},\boldsymbol{R}^{(i)}))$,
- which contain the sizes of the susceptible, infectious, and removed compartments
- together with their respective time point with the index $i$. This means that
- the training data contains measured values of the functions $S(t)$,
- $I(t)$, and $R(t)$, which a neural network may use to approximate these
- functions. Furthermore, a PINN can carry out this task with higher precision
- for problems where the unknown function is complex and only a
- system of differential equations is given.\\
- In this thesis, we want to find the solutions of the SIR models that fit the
- cases in the datasets. The SIR model is given by the system of differential
- equations (see~\Cref{eq:sir}), which describes the relations and fluctuations of
- the three compartments through the transition rates $\beta$ and $\alpha$. As we
- explain in~\Cref{sec:pandemicModel:sir}, these parameters influence the course of
- the pandemic described by the model. Mathematically, when
- we find a pair of parameters for a dataset, these parameters describe a
- function that solves the system of differential equations for that dataset. A
- PINN finds parameters for a given set of differential equations by solving the
- inverse problem. As Shaier \etal~\cite{Shaier2021} propose, a DINN solves inverse
- problems by replacing the parameters $\beta$ and $\alpha$ with trainable parameters
- $\widehat{\beta}$ and $\widehat{\alpha}$. As described in~\Cref{sec:pinn}, the
- DINN learns the parameters by optimizing its model predictions $\hat{\boldsymbol{S}}$,
- $\hat{\boldsymbol{I}}$, and $\hat{\boldsymbol{R}}$ to fit both the differential
- equations, through their residuals, and the given data.\\
- The PINN uses the loss function to determine how far its predictions are from the true
- solution. For the DINN~\cite{Shaier2021}, this loss function includes the mean
- squared error of each residual in addition to the mean squared error of the
- model predictions with respect to their true values. In contrast to
- Shaier \etal, who use the set of differential equations of~\Cref{eq:sir} for
- their loss function, we use~\Cref{eq:modSIR}. The reason for this choice is that
- we observed better practical performance with it during our work than with
- the equations used by Shaier \etal. Let $N$ be the size of the population and
- $N_t$ the number of training points of the dataset used, then
- \begin{equation}
- \begin{split}
- \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) = &\bigg\|\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}} + \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}\bigg\|^2\\
- + &\bigg\|\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \widehat{\alpha}\hat{\boldsymbol{I}}\bigg\|^2\\
- + &\bigg\|\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} - \widehat{\alpha}\hat{\boldsymbol{I}}\bigg\|^2\\
- + &\frac{1}{N_t}\sum_{i=1}^{N_t} \bigg(\Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2\bigg),
- \end{split}
- \end{equation}
- is the loss function that we employ to find the transition parameters $\beta$ and
- $\alpha$ for the given dataset.
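
- To make the structure of this loss concrete, the following sketch evaluates it
- for given sequences. It rests on several assumptions: the function names are
- hypothetical, the derivatives of the predictions are passed in directly
- (in a PINN they would be obtained by automatic differentiation), the squared
- residuals are averaged in line with the mean squared error described in the
- text, and the residual of the removed compartment is written as
- $\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} - \widehat{\alpha}\hat{\boldsymbol{I}}$,
- which follows from the individuals leaving $I$ at rate $\alpha$ entering $R$.

```python
def mean_square(values):
    """Mean of the squared entries of a sequence."""
    return sum(v * v for v in values) / len(values)


def sir_loss(pred, data, derivs, beta, alpha, population):
    """Physics residuals plus data misfit for an SIR-type PINN.

    pred, data and derivs are (S, I, R) triples of equally long sequences;
    derivs holds dS/dt, dI/dt and dR/dt of the predictions.
    """
    s_hat, i_hat, r_hat = pred
    ds, di, dr = derivs
    # Residuals of the three differential equations at every time point.
    res_s = [d + beta * s * i / population for d, s, i in zip(ds, s_hat, i_hat)]
    res_i = [d - beta * s * i / population + alpha * i
             for d, s, i in zip(di, s_hat, i_hat)]
    res_r = [d - alpha * i for d, i in zip(dr, i_hat)]
    physics = mean_square(res_s) + mean_square(res_i) + mean_square(res_r)
    # Data term: squared distance between predictions and observations.
    misfit = sum(
        mean_square([p - o for p, o in zip(p_seq, o_seq)])
        for p_seq, o_seq in zip(pred, data)
    )
    return physics + misfit
```

- The loss vanishes only if the predictions satisfy all three differential
- equations and match the observed data at the same time, which is what drives
- the trainable parameters towards values that explain the dataset.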
- % -------------------------------------------------------------------
- \section{PINN for the reduced SIR Model}
- \label{sec:pinn:rsir}