- % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- % Author: Phillip Rothenbeck
- % Title: Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
- % File: chap03/chap03.tex
- % Part: Methods
- % Description:
- % summary of the content in this chapter
- % Version: 20.08.2024
- % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
- \chapter{Methods 8}
- \label{chap:methods}
- This chapter presents the methods that we employ to address the problem
- presented in~\Cref{chap:introduction}.~\Cref{sec:preprocessing} outlines
- our approach to preprocessing the available data and has two
- subsections. The first subsection describes the publicly available data provided by
- the \emph{Robert Koch Institute} (RKI)\footnote[1]{\url{https://www.rki.de/EN/Home/homepage_node.html}}.
- The second subsection outlines the techniques we use to process this data to fit
- our project's requirements. Subsequently, we give a theoretical overview of the
- PINNs that we employ. These latter sections establish the foundation for the
- implementations described in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}.
- % -------------------------------------------------------------------
- \section{Epidemiological Data 3}
- \label{sec:preprocessing}
- For the PINNs to be effective with the data available to us, the data must be
- in the format required by the epidemiological
- models that the PINNs solve. Let $N_t$ be the number of training points,
- and let $i\in\{1, \ldots, N_t\}$ be the index of the training points. The data
- required by the PINN for solving the SIR model (see~\Cref{sec:pinn:dinn})
- consists of pairs $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$.
- Given that the system of differential equations representing the reduced SIR
- model (see~\Cref{sec:pandemicModel:rsir}) consists of a single differential
- equation for $I$, this model only requires pairs of the form
- $(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$. This section focuses on the
- structure of the available data and the methods we employ to transform it into
- the correct structure.
- % -------------------------------------------------------------------
- \subsection{RKI Data 2}
- \label{sec:preprocessing:rki}
- The Robert Koch Institute is responsible for the monitoring and prevention of
- diseases. As the central institution of the German government in the field of
- biomedicine, one of its tasks during the COVID-19 pandemic was to track the
- number of infections and deaths in Germany. The data was collected by
- university hospitals, research facilities, and laboratories by
- conducting tests. Each new case must be reported to the respective state
- authority within 24 hours. Each state authority collects
- the cases for a day and must report them to the RKI by the following working
- day. The RKI then refines the data, releases statistics, and updates its
- repositories holding the information for the public to access. For the purposes
- of this thesis, we concentrate on two of these repositories.\\
- The first repository is called \emph{COVID-19-Todesfälle in Deutschland}\footnote{\url{https://github.com/robert-koch-institut/COVID-19-Todesfaelle_in_Deutschland.git}}.
- The dataset comprises discrete data points, each with a date indicating the
- point in time at which the respective data was collected. The dates span from
- March 9, 2020, to the present day. For each date, the dataset provides the total
- number of infection and death cases, the number of new deaths, and the
- case-fatality ratio. The total number of infection and death cases represents
- the sum of all cases reported up to that date, including the newly reported
- data. The repository includes two additional datasets that contain the death case
- information organized by age group or by the individual states within Germany on
- a weekly basis.\\
- \begin{figure}[h]
- \centering
- \includegraphics[width=\textwidth]{dataset_visualization.pdf}
- \caption{A visualization of the total death case and infection case data for
- each day from the dataset \emph{COVID-19-Todesfälle in Deutschland}. Status
- as of 20 August 2024.}
- \label{fig:rki_data}
- \end{figure}
- The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}.
- This dataset contains comprehensive data regarding the infections of each county
- on a daily basis. The counties are encoded using the \emph{Community Identification Number}\footnote{\url{https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html}},
- wherein the first two digits denote the state, the third digit represents the
- government district, and the last two digits indicate the county. Each data
- point lists the gender, the age group, the number of death, infection, and recovery
- cases, and the reference and report dates. The reference date marks the onset of
- illness in the individual. In the absence of this information, the reference
- date is equivalent to the report date.\\
- The RKI assumes that the duration of the illness under normal conditions is 14 days,
- while the duration of severe cases is assumed to be 28 days. The recovery cases
- in the dataset are calculated using these assumptions, by adding the duration to
- the reference date, if it is given. As stated in the ReadMe, the recovery data
- should be used with caution. Since we require the recovery data for further
- calculations, the following section presents the solutions we employed to address
- this issue.
- % -------------------------------------------------------------------
- \subsection{Data Preprocessing 1}
- \label{sec:preprocessing:rq}
- At the outset of this section, we established the format of the data that is
- necessary for training the PINNs. In this subsection, we present the method that we
- employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki})
- into the training data. \\
- In order to obtain the SIR data, we require the size of each SIR compartment at
- each time point. The infection case data for the German states is available on
- a daily basis. To obtain the daily cases for the entire country, we take the
- difference between the total number of cases on consecutive days. The size of the population is defined
- as the respective size at the beginning of 2020. Using the starting conditions
- of~\Cref{eq:startCond}, we iterate through each day and update the sizes of the
- groups consecutively. In each iteration, we subtract the new infection
- cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$. To obtain
- $\boldsymbol{I}^{(i)}$, we add the new cases to $\boldsymbol{I}^{(i-1)}$ and subtract deaths and recoveries.
- We obtain $\boldsymbol{R}^{(i)}$ by adding the new deaths and
- recoveries to $\boldsymbol{R}^{(i-1)}$.\\
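- The update rules described above can be summarized as a recurrence. The
- following is a minimal sketch in notation that we introduce here only for
- illustration: $C^{(i)}$ denotes the total number of reported cases up to day $i$,
- $c^{(i)}$ the new cases on day $i$, and $r^{(i)}$ the new deaths and recoveries
- on day $i$.
- \begin{equation*}
- \begin{split}
- c^{(i)} &= C^{(i)} - C^{(i-1)},\\
- \boldsymbol{S}^{(i)} &= \boldsymbol{S}^{(i-1)} - c^{(i)},\\
- \boldsymbol{I}^{(i)} &= \boldsymbol{I}^{(i-1)} + c^{(i)} - r^{(i)},\\
- \boldsymbol{R}^{(i)} &= \boldsymbol{R}^{(i-1)} + r^{(i)}.
- \end{split}
- \end{equation*}
- Under these rules, the sum of the three compartments stays equal to the
- population size in every iteration, since every quantity that leaves one
- compartment enters another.\\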
- As previously stated in~\Cref{sec:preprocessing:rki}, the data on recoveries may
- be unreliable or entirely absent. To address this, we propose a method
- for computing the number of recovered individuals per day. Under the assumption
- that recovery takes $D$ days, we present the recovery queue, a data structure
- that takes in the number of infections for a given day, retains them for $D$ days,
- and then releases them into the removed group.\\
- \begin{figure}[h]
- \centering
- \includegraphics[width=\textwidth]{recovery_queue.pdf}
- \caption{The recovery queue takes in the infected individuals for the $k$-th
- day and releases them $D$ days later into the removed group.}
- \label{fig:recovery_queue}
- \end{figure}
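- As a small worked example of the recovery queue, assume a hypothetical
- recovery time of $D=3$ days and hypothetical new cases
- $c^{(1)},\ldots,c^{(5)} = 5, 3, 4, 2, 1$. The symbols $c^{(i)}$ and $r^{(i)}$
- are introduced here only for illustration and denote the individuals entering
- and leaving the queue on day $i$. If we assume that the queue releases exactly
- the cases it received $D$ days earlier, then
- \begin{equation*}
- r^{(i)} = \begin{cases} c^{(i-D)}, & i > D,\\ 0, & \text{otherwise,}\end{cases}
- \end{equation*}
- which gives $r^{(1)},\ldots,r^{(5)} = 0, 0, 0, 5, 3$. Starting from an empty
- infectious and an empty removed group, the infectious group then has the sizes
- $5, 8, 12, 9, 7$ and the removed group the sizes $0, 0, 0, 5, 8$ on days $1$
- to $5$.\\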
- In order to prepare the data for the reduced SIR model, we employ a similar algorithm to that
- used for the SIR model. However, instead of the recovery queue, we utilize
- the set recovery rate $\alpha$: on each day $i$, a portion $\alpha\boldsymbol{I}^{(i)}$
- of the infections is considered recovered and moved into the
- $\boldsymbol{R}^{(i)}$ compartment, which is irrelevant for our purposes. \\
- The transformed data for both the SIR model and the reduced SIR model are then
- employed by the PINN models, which we describe in the subsequent section.
- % -------------------------------------------------------------------
- \section{Estimating Epidemiological Parameters using PINNs 3}
- \label{sec:pinn:sir}
- In the preceding section, we presented the methods we employ to preprocess and
- format the data from the RKI in accordance with the specifications required for
- the work of this thesis. In this section, we present the method we employ
- to identify the non-time-dependent SIR parameters $\beta$ and $\alpha$ for the
- data. As a foundation for our work, we draw upon the work of Shaier et
- al.~\cite{Shaier2021} to solve the SIR system of ODEs using PINNs.\\
- In order to analyze a pandemic, it is necessary to have a quantifiable measure
- that indicates whether the disease in question has the capacity to spread rapidly through a
- population or whether it fails to infect a significant number of
- individuals. We employ the SIR model to construct an abstraction of the complex
- relations inherent to real-world pandemics. The SIR model divides the population into three
- compartments. It is accompanied by a system of ODEs that encapsulates the
- fluctuations and relationships between these compartments (see~\Cref{eq:sir}).
- The transmission rate $\beta$ and the recovery rate $\alpha$ serve as the
- aforementioned measures. We obtain data from the preprocessing stage, which
- provides insight into the progression of the COVID-19 pandemic in Germany.
- The objective is to identify a function that solves the system of differential
- equations of the SIR model by returning the size of each compartment at a
- specific point in time. This function is supposed to be able to reconstruct the
- training data and is defined by the values of the transition rates $\beta$ and
- $\alpha$. From a mathematical and semantic perspective, it is essential to
- determine the values of these parameters.\\
- In order to ascertain the transmission rate $\beta$ and the recovery rate $\alpha$
- from the preprocessed RKI data of $(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$
- for a given set of time points, it is necessary to employ a data-driven approach that outputs
- a model prediction of $(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$
- for a set of time points, with the aim of minimizing the term,
- \begin{equation}\label{eq:SIR_obs_term}
- \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
- \end{equation}
- for each data point in the training dataset of cardinality $N_t$, with
- $i\in\{1, \ldots, N_t\}$. Moreover, the aforementioned parameters must satisfy the system
- of differential equations that govern the SIR model. For this reason, Shaier
- \etal~\cite{Shaier2021} utilize a PINN framework to satisfy both requirements.
- Their approach, which they refer to as the \emph{disease-informed neural network}
- (see~\Cref{sec:pinn:dinn}), takes epidemiological data as the input and returns
- the two transition rates $\alpha$ and $\beta$. The method
- achieves this by finding an approximate solution to the inverse problem of
- physics-informed neural networks (see~\Cref{sec:pinn}). In terms of
- the SIR model, a PINN addresses the inverse problem in two ways. First, it minimizes~\Cref{eq:SIR_obs_term}
- by bringing the model predictions $(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$
- closer to the actual values $(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$
- for each time point. Second, it reduces the residuals of the ODEs that
- constitute the SIR model. While the forward problem concludes at this point, the
- inverse problem presupposes that parameters are unknown. Thus, we designate the parameters
- $\beta$ and $\alpha$ as free, learnable parameters, $\widehat{\beta}$ and
- $\widehat{\alpha}$. These separate trainable parameters are
- optimized during the training process and must fit the equations of the set of
- ODEs. Furthermore, we know that the transition rates
- do not surpass the value of $1$. Consequently, we force the value of both rates to be in the
- range $(-1, 1)$. To this end, we regularize the parameters using the
- hyperbolic tangent. This results in the terms,
- \begin{equation}
- \widehat{\beta} = \tanh(\tilde{\beta}),\quad \widehat{\alpha} = \tanh(\tilde{\alpha}),
- \end{equation}
- where $\tilde{\beta}$ and $\tilde{\alpha}$ are the predicted values of the model
- and $\widehat{\beta}$ and $\widehat{\alpha}$ are regularized model predictions.\\
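- To illustrate how this regularization interacts with gradient-based training,
- consider the chain rule for a loss $\mathcal{L}$ that depends on
- $\widehat{\beta}$. This is a sketch of the general mechanism and not a statement
- about a specific implementation:
- \begin{equation*}
- \frac{\partial\mathcal{L}}{\partial\tilde{\beta}} = \frac{\partial\mathcal{L}}{\partial\widehat{\beta}}\,\frac{d\tanh(\tilde{\beta})}{d\tilde{\beta}} = \frac{\partial\mathcal{L}}{\partial\widehat{\beta}}\,\big(1-\widehat{\beta}^{2}\big).
- \end{equation*}
- For example, $\tilde{\beta}=0.5$ yields $\widehat{\beta}=\tanh(0.5)\approx 0.462$
- and a scaling factor of $1-\widehat{\beta}^{2}\approx 0.786$. The same relation
- holds for $\tilde{\alpha}$. As $|\tilde{\beta}|$ grows, the factor tends to $0$,
- so the updates shrink as $\widehat{\beta}$ approaches the bounds $\pm 1$.\\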
- The input data must include the time point $\boldsymbol{t}^{(i)}$ and its
- corresponding measured true values of $(\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)})$.
- In its forward path, the PINN receives the time point $\boldsymbol{t}^{(i)}$ as its input, from which it
- calculates its model prediction $(\hat{\boldsymbol{S}}^{(i)}, \hat{\boldsymbol{I}}^{(i)}, \hat{\boldsymbol{R}}^{(i)})$
- based on its model parameters $\theta$. Subsequently, the model computes the loss function. It calculates the observation loss by taking the
- mean squared error of~\Cref{eq:SIR_obs_term} over all $N_t$ training samples.
- Therefore,
- \begin{equation}
- \mathcal{L}_{\text{obs}}(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big(\Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2\Big),
- \end{equation}
- is the term for the observation loss. Since they perform better in practical applications
- than the ODEs of~\Cref{eq:sir}, we utilize the ODEs of~\Cref{eq:modSIR}
- in our physics loss. In order for the model to learn the system of differential equations,
- it is necessary to obtain the residual of each ODE. The mean squared error of the residuals constitutes
- the physics loss $\mathcal{L}_{\text{physics}}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$.
- The residuals are calculated using the model predictions $(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$ and the regularized model predictions of the parameters $\widehat{\beta}$ and $\widehat{\alpha}$. The residuals are given by,
- \begin{equation}
- 0=\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}}+ \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}, \quad 0=\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \widehat{\alpha}\hat{\boldsymbol{I}}, \quad 0=\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} - \widehat{\alpha}\hat{\boldsymbol{I}}.
- \end{equation}
- Thus,
- \begin{equation}
- \begin{split}
- \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R}, \hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}}) = &\bigg\|\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}}+ \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}\bigg\|^2\\
- + &\bigg\|\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \widehat{\alpha}\hat{\boldsymbol{I}}\bigg\|^2\\
- + &\bigg\|\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} - \widehat{\alpha}\hat{\boldsymbol{I}}\bigg\|^2\\
- + &\frac{1}{N_t}\sum_{i=1}^{N_t} \Big(\Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2\Big),
- \end{split}
- \end{equation}
- is the equation of the total loss for our approach. This loss value is then
- back-propagated through our network, while the model predictions of the
- parameters $\beta$ and $\alpha$ are optimized using the loss as well.\\
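- As a plausibility check on the residuals of the physics loss, assume that the
- ODEs take the standard SIR form $\frac{dS}{dt}=-\beta\frac{SI}{N}$,
- $\frac{dI}{dt}=\beta\frac{SI}{N}-\alpha I$, and $\frac{dR}{dt}=\alpha I$. Moving
- every term to one side gives one residual per compartment, and their sum is
- \begin{equation*}
- \bigg(\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}} + \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}\bigg) + \bigg(\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \widehat{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \widehat{\alpha}\hat{\boldsymbol{I}}\bigg) + \bigg(\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} - \widehat{\alpha}\hat{\boldsymbol{I}}\bigg) = \frac{d}{d\boldsymbol{t}}\big(\hat{\boldsymbol{S}} + \hat{\boldsymbol{I}} + \hat{\boldsymbol{R}}\big).
- \end{equation*}
- The terms in $\widehat{\beta}$ and $\widehat{\alpha}$ cancel, so if all three
- residuals vanish, the predicted total population stays constant over time.\\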
- While this section concentrates on finding the time-constant parameters
- $\beta$ and $\alpha$, the next section presents our approach to finding the
- reproduction number $\Rt$ from the German data of the RKI.
- % -------------------------------------------------------------------
- \section{PINN for the reduced SIR Model 2}
- \label{sec:pinn:rsir}
- % -------------------------------------------------------------------