123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320 |
- % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- % Author: Phillip Rothenbeck
- % Title: Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
- % File: chap03/chap03.tex
- % Part: Methods
- % Description:
- % summary of the content in this chapter
- % Version: 26.08.2024
- % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- \chapter{Methods}
- \label{chap:methods}
- This chapter provides the methods, that we employ to address the problem that we
- present in~\Cref{chap:introduction}.~\Cref{sec:preprocessing} outlines
- our approaches for preprocessing of the available data and has two
- sections. The first section describes the publicly available data provided by
- the \emph{Robert Koch Institute} (RKI)\footnote{\url{https://www.rki.de/EN/Home/homepage_node.html}}.
- The second section outlines the techniques we use to process this data to fit
- our project's requirements. Subsequently, we give a theoretical overview of the
- PINN's that we employ. These latter sections, establish the foundation for the
- implementations described in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}.
- % -------------------------------------------------------------------
- \section{Epidemiological Data}
- \label{sec:preprocessing}
- In this thesis we want to analyze the COVID-19 pandemic in Germany utilizing
- the SIR model and PINNs. For a PINN to learn the parameters of the SIR model,
- we need pandemic data in the correct format for the approach. Let $N_t$ be the
- number of training points, then let $i\in\{1, ..., N_t\}$
- be the index of the training points. The data required by the PINN for solving
- the SIR model (see~\Cref{sec:pinn:dinn}), consists of pairs
- $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$,
- with $\boldsymbol{t}^{(i)}$ representing the time in days since the first
- measurement and $\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}$, and $\boldsymbol{R}^{(i)}$
- the corresponding size of the compartments. Given that the system of
- differential equations representing the reduced SIR model
- (see~\Cref{sec:pandemicModel:rsir}) consists of a single differential equation
- for $I$, it is necessary to obtain pairs of the form $(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$.
- This section, focuses on the structure of the available data and the methods we
- employ to transform it into the correct structure.
- % -------------------------------------------------------------------
- \subsection{RKI Data}
- \label{sec:preprocessing:rki}
- The RKI is a biomedical institute in Germany responsible for
- the on monitoring and prevention of diseases. As the central institution of the
- German government in the field of biomedicine, one of its tasks during the
- COVID-19 pandemic was to track the number of infections and death cases in
- Germany. The data was collected by university hospitals, research facilities
- and laboratories through the conduction of tests. Each new case had to be
- reported within a period of 24 hours at the latest to the respective state
- authority. Each state authority collects the cases for a day and had to report
- them to the RKI by the following working day~\cite{GHDead}. The RKI then refines
- the data and releases statistics and updates its repositories holding the
- information for the public to access. For the purposes of this thesis we
- concentrate on two of these repositories.\\
- The first repository is called \emph{COVID-19-Todesfälle in Deutschland}~\cite{GHDead}.
- The dataset comprises discrete data points, each with a date indicating the
- point in time at which the respective data was collected. The dates span from
- 2020-03-09, to the present day. For each date, the dataset provides the total
- number of infection and death cases, the number of new deaths, and the
- case-fatality ratio. The total number of infection and death cases represents
- the sum of all cases reported up to that date, including the newly reported
- data. The dataset includes two additional subsets, that contain the death case
- information organized by age group or by the individual states within Germany on
- a weekly basis.\\
- \begin{figure}[t]
- \centering
- \includegraphics[width=\textwidth]{dataset_visualization.pdf}
- \caption{A visualization of the total death case and infection case data for
- each day from the data set \emph{COVID-19-Todesfälle in Deutschland}. Status
- of 2024-08-20.}
- \label{fig:rki_data}
- \end{figure}
- The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}~\cite{GHInf}.
- This dataset contains comprehensive data regarding the infections of each county
- on a daily basis. The counties are encoded using the \emph{Community Identification Number}\footnote{\url{https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html}},
- wherein the first two digits denote the state, the third digit represents the
- government district, and the last two digits indicate the county. Each data
- point displays the gender, the age group, number death, infection and recovery
- cases and the reference and report date. The reference date marks the onset of
- illness in the individual. In the absence of this information, the reference
- date is equivalent to the report date.\\
- The RKI assumes that the duration of the illness under normal conditions is 14 days,
- while the duration of severe cases is assumed to be 28 days. The recovery cases
- in the dataset are calculated using these assumptions, by adding the duration on
- the reference date if it is given. As stated, the recovery data should be used
- with caution. Since we require the recovery data for further calculations, the
- following section presents the solutions we employed to address this issue.
- % -------------------------------------------------------------------
- \subsection{Data Preprocessing}
- \label{sec:preprocessing:rq}
- At the outset of this section, we establish the format of the data, that is
- necessary for training the PINNs. In this subsection, we present the method, that we
- employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki})
- into the training data. \\
- In order to obtain the SIR data we require the size of each SIR compartment for
- each time point. The infection case data for the German states is available on
- a daily basis. To obtain the daily cases for the entire country we need to
- differentiate the total number of cases. The size of the population is defined
- as the respective size at the beginning of 2020. Using the starting conditions
- of~\Cref{eq:startCond}, we iterate through each day, modifying the sizes of the
- groups in a consecutive manner. For each iteration we subtract the new infection
- cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$, for
- $\boldsymbol{I}^{(i)}$, we add the new cases and subtract deaths and recoveries,
- and the size of $\boldsymbol{R}^{(i)}$ is obtained by adding the new deaths and
- recoveries as they occur.\\
- As previously stated in~\Cref{sec:preprocessing:rki} the data on recoveries may
- either be unreliable or is entirely absent. To address this, we propose a method
- for computing the number of recovered individuals per day. Under the assumption
- that recovery takes $D$ days, we present the recovery queue, a data structure
- that holds the number of infections for a given day, retains them for $D$ days,
- and releases them into the removed group $D$ days later.\\
- \begin{figure}[t]
- \centering
- \includegraphics[width=\textwidth]{recovery_queue.pdf}
- \caption{The recovery queue takes in the infected individuals for the $k$'th
- day and releases them $D$ days later into the removed group.}
- \label{fig:recovery_queue}
- \end{figure}
- In order to solve the reduced SIR model, we employ a similar algorithm to that
- used for the SIR model. However, in contrast to the recovery queue, we utilize
- a set recovery rate $\alpha$ to transfer a portion $\alpha\boldsymbol{I}^{(i)}$
- of infections, which have recovered or died on the $i$'th day and put them into
- the $\boldsymbol{R}^{(i+1)}$ compartment of the next day, which is irrelevant to
- our purposes. The transformed data for both the SIR model and the reduced SIR
- model are then employed by the PINN models, which we describe in the subsequent
- section.
- % -------------------------------------------------------------------
- \section{Estimating Epidemiological Parameters using PINNs}
- \label{sec:pinn:sir}
- In the preceding section, we present the methods we employ to preprocess and
- format the data from the RKI in accordance with the specifications required for
- the application in this thesis. Here, we will present the method we employ
- to identify the SIR parameters $\beta$ and $\alpha$ for our data. As a
- foundation for our work, we draw upon the work of Shaier \etal~\cite{Shaier2021},
- to solve the SIR system of ODEs using PINNs.\\
- In order to conduct an analysis of a pandemic, it is necessary to have a
- quantifiable measure that indicates whether the disease in question has the
- capacity to spread rapidly through a population or is it not successful in
- infecting a significant number of individuals. In~\Cref{sec:pandemicModel:sir},
- we provide an in-depth discussion of the SIR model, and show, that the
- transmission rate $\beta$ and the recovery rate $\alpha$ work as the
- aforementioned quantifiers in this model. The specific values of these
- epidemiological parameters belonging to the training data define a function that
- solves the system of differential equations of the SIR model. This function is
- able to return the size of each compartment at a specific point in time. Thus,
- from a mathematical and semantic perspective, it is essential to determine the
- corresponding values governing the development of the pandemic.\\
- In order to ascertain the transmission rate $\beta$ and the recovery rate
- $\alpha$ from the preprocessed RKI data of $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$
- for a given set of time points, it is necessary to employ a data-driven approach
- that outputs a model prediction of $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$
- for a set of time points, with the aim of minimizing the term,
- \begin{equation}\label{eq:SIR_obs_term}
- \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
- \end{equation}
- for each data point in the set of training dataset of a cardinality $N_t$ and with
- $i\in\{1, ..., N_t\}$. Moreover, the aforementioned parameters must satisfy the system
- of differential equations that govern the SIR model. For this reason, Shaier
- \etal~\cite{Shaier2021} utilize a PINN framework to satisfy both requirements.
- Their approach, which they refer to as the \emph{Disease-Informed Neural Network}
- (see~\Cref{sec:pinn:dinn}), takes epidemiological data as the input and returns
- the two epidemiological parameters $\alpha$ and $\beta$. Their method
- achieves this by finding an approximate solution of to the inverse problem of
- physics-informed neural networks (see~\Cref{sec:pinn}). In terms of the SIR
- model, a PINN addresses the inverse problem in two ways. First, it minimizes the mean of~\Cref{eq:SIR_obs_term}
- by bringing the model predictions $\hat{\Psi}$
- closer to the actual values $\Psi$
- for each time point. Second, it reduces the residuals of the ODEs that
- constitute the SIR model. While the forward problem concludes at this point, the
- inverse problem presets that a parameter is unknown. Thus, we designate the parameters
- $\beta$ and $\alpha$ as free, learnable parameters, $\hat{\beta}$ and
- $\hat{\alpha}$. These separate trainable parameters are values that are
- optimized during the training process and must fit the equations of the set of
- ODEs. \\
- Assuming that the values of the epidemiological parameters stay below
- 1~\cite{Shaier2021}, we force the value of both rates to be in a
- range of $[-1, 1]$. Therefore, we regularize the parameters using the
- \emph{tangens hyperbolicus}. This results in the terms,
- \begin{equation}
- \tilde{\beta} = \tanh(\hat{\beta}),\quad \tilde{\alpha} = \tanh(\hat{\alpha}),
- \end{equation}
- where $\tilde{\alpha}$ are regularized model predictions.\\
- The input data must include the time point $\boldsymbol{t}^{(i)}$ and its
- corresponding measured true values of $\Psi^{(i)}$.
- In its forward path, the PINN receives the time point $\boldsymbol{t}^{(i)}$ as its input, from which it
- calculates its model prediction $\hat{\Psi}^{(i)}$
- based on its model parameters $\theta$. Subsequently, the model computes the loss function. It calculates the data loss by taking the
- mean squared error of~\Cref{eq:SIR_obs_term} over all $N_t$ training samples.
- Therefore,
- \begin{equation}
- \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
- \end{equation}
- is the term for the data loss. We find the ODEs of~\Cref{eq:modSIR} perform best
- in our setup. Hence, we utilize them in our physics loss. In order for the model
- to learn the system of differential equations, it is necessary to obtain the
- residual of each ODE. The mean square error of the residuals constitutes the
- physics loss
- $\mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi})$.
- The residuals are calculated using the model predictions $\hat{\Psi}$
- and the regularized model predictions of the parameters, $\tilde{\beta}$ and $\tilde{\alpha}$.
- The residuals are given by,
- \begin{equation}
- 0=\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}}+ \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}, \quad 0=\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \tilde{\alpha}\hat{\boldsymbol{I}}, \quad 0=\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} + \hat{\alpha}\hat{\boldsymbol{I}}.
- \end{equation}
- Thus,
- \begin{equation}
- \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = \mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi}) + \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi})
- \end{equation}
- is the multi-objective loss equation encapsuling both the physics loss and the
- data loss for our approach. By minimizing these loss terms our model learns the
- given training data but also the physics of the system. This enables our model
- to simultaneously learn the values of the parameters $\beta$ and $\alpha$
- during training.\\
- As this section concentrates on the finding of the time constant parameters
- $\beta$ and $\alpha$, the next section will show our approach of finding the
- reproduction number $\Rt$ on the German data of the RKI.
- % -------------------------------------------------------------------
- \section{Estimating the Reproduction Number using PINNs}
- \label{sec:pinn:rsir}
- The previous section illustrates the methodology we employ to determine the
- constant transmission and recovery rates from a data set obtained from
- the COVID-19 pandemic in Germany. In this section, we utilize PINNs to identify
- the time-dependent reproduction number, $\Rt$, while reducing the number of
- state variables and the reliance on assumptions, by decreasing the number of ODEs
- in the system of differential equations of the SIR model. The methodology
- presented in this section is based on the approach developed by Millevoi
- \etal~\cite{Millevoi2023}.\\
- In real-world pandemics, the rate of infection is influenced by a multitude of
- factors. Events such as the growing awareness for the disease among the general
- population, the introduction of non-pharmaceutical mitigations such as social
- distancing policies, and the emergence of a new variant have an impact on the
- transmission rate $\beta$. Accordingly, a transmission rate that is not
- time-dependent and constant across the entire duration of the pandemic may not
- accurately reflect the dynamics of the spread of a real-world disease. In~\Cref{sec:pandemicModel:rsir},
- we provide, following Millevoi \etal~\cite{Millevoi2023}, the definition of the
- time-dependent $\beta(t)$ and subsequently that of the reproduction number,
- $\Rt$ which represents the number of new infections that occur as a result of
- one infectious individual. It indicates whether a pandemic is emerging or if it
- is spreading rapidly through the susceptible population. By inserting the
- definition of~\Cref{eq:repr_num}, into the system of ODEs of the SIR model, we
- can derive one~\Cref{eq:reduced_sir_ODE}. In order to solve this, we must
- identify a function that maps a time point to the size of the infectious
- compartment and the specific reproduction number.\\
- As with the constant epidemiological parameters, we employ a data-driven approach for
- identifying the time-dependent reproduction number $\Rt$. The PINN approximates
- the size $\boldsymbol{I}$ with its model prediction $\hat{\boldsymbol{I}}$ by
- minimizing the term,
- \begin{equation}\label{eq:rSir_squared_err}
- \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2,
- \end{equation}
- for each $i\in\{1,...,N_t\}$. In order to identify the reproduction number, the
- PINN minimizes the residuals of the ODE during the training process. The
- training process is analogous to the PINN, which identifies $\beta$
- and $\alpha$ (see~\Cref{sec:pinn:sir}). However, there are two key differences. Firstly, the absence of
- free, trainable parameters. Secondly, the inclusion of an additional state variable that
- fluctuates in response to the input. While the state variable $\boldsymbol{I}$
- is approximated using the error between the training data and the predicted
- values, the state variable $\Rt$ is approximated exclusively based on the
- residual of the ODE.\\
- The PINN receives the input of $\boldsymbol{t}^{(i)}$ and generates a prediction of
- ($\hat{\boldsymbol{I}}^{(i)}$, $\Rt^{(i)}$). As previously stated, the PINN minimizes
- the distance between the true values of $\boldsymbol{I}$ and the model predictions
- $\hat{\boldsymbol{I}}$ by minimizing the mean squared error. Consequently, the
- data loss function is defined by,
- \begin{equation}
- \mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2.
- \end{equation}
- The physics loss function is defined as the squared error of the residual of the
- ODE. The residual of the reduced SIR model is given by,
- \begin{equation}
- 0 = \frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s).
- \end{equation}
- During training we first fit the data agnostic to physics utilizing only the
- data loss $\mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}})$.
- Then we train on composite loss function given by,
- \begin{equation}
- \mathcal{L}_{\text{rSIR}}(\boldsymbol{t}, \boldsymbol{I}, \hat{\boldsymbol{I}}) = \bigg\|\frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s)\bigg\|^2+ \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2,
- \end{equation}
- to achieve a better solution.\\
- Although we set the transmission rate to be time-dependent, we define the
- recovery time constant over time to reduce the complexity of the problem. The
- RKI~\cite{GHInf} posits that the typical recovery period for the illness under
- normal conditions is 14 days, while those individuals with severe cases require
- approximately 28 days to recover. As we assume the case with normal condition,
- we can set the recovery time to $D=14$, which yields $\alpha = \nicefrac{1}{14}$.\\
- We perform extensive empirical evaluations of the methodology employed to
- determine the reproduction number, along with the other techniques, that this
- chapter presents in the next chapter.
- % -------------------------------------------------------------------
|