% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% % Author: Phillip Rothenbeck % Title: Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks % File: chap03/chap03.tex % Part: Methods % Description: % summary of the content in this chapter % Version: 26.08.2024 % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% \chapter{Methods} \label{chap:methods} This chapter provides the methods, that we employ to address the problem that we present in~\Cref{chap:introduction}.~\Cref{sec:preprocessing} outlines our approaches for preprocessing of the available data and has two sections. The first section describes the publicly available data provided by the \emph{Robert Koch Institute} (RKI). The second section outlines the techniques we use to process this data to fit our project's requirements. Subsequently, we give a theoretical overview of the PINN's that we employ. These latter sections, establish the foundation for the implementations used in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}. % ------------------------------------------------------------------- \section{Epidemiological Data} \label{sec:preprocessing} In this thesis, we want to analyze the COVID-19 pandemic in Germany utilizing the SIR model and PINNs. For a PINN to learn the parameters of the SIR model, we need pandemic data in the correct format for the approach. Let $N_t$ be the number of training points, then let $i\in\{1, ..., N_t\}$ be the index of the training points. The data required by the PINN for solving the SIR model (see~\Cref{sec:pinn:dinn}), consists of pairs $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$, with $\boldsymbol{t}^{(i)}$ representing the time in days since the first measurement and $\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}$, and $\boldsymbol{R}^{(i)}$ the corresponding size of the compartments. Given that the system of differential equations representing the reduced SIR model (see~\Cref{sec:pandemicModel:rsir}) consists of a single differential equation for $I$, it is necessary to obtain pairs of the form $(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$. This section, focuses on the structure of the available data and the methods we employ to transform it into the correct structure. % ------------------------------------------------------------------- \subsection{RKI Data} \label{sec:preprocessing:rki} The RKI is a biomedical institute in Germany responsible for the monitoring and prevention of diseases. As the central institution of the German government in the field of biomedicine, one of its tasks during the COVID-19 pandemic was to track the number of infections and death cases in Germany. The data was collected by university hospitals, research facilities and laboratories through the conduction of tests. Each new case had to be reported within a period of 24 hours at the latest to the respective state authority. Each state authority collects the cases for a day and had to report them to the RKI by the following working day~\cite{GHDead}. The RKI then refines the data and releases statistics and updates its repositories holding the information for the public to access. For the purposes of this thesis we concentrate on two of these repositories.\\ The first repository is called \emph{COVID-19-Todesfälle in Deutschland}~\cite{GHDead}. The dataset comprises discrete data points, each with a date indicating the point in time at which the respective data was collected. The dates span from 2020-03-09, to the present day. For each date, the dataset provides the total number of infection and death cases, the number of new deaths, and the case-fatality ratio. The total number of infection and death cases represents the sum of all cases reported up to that date, including the newly reported data. The dataset includes two additional subsets, that contain the death case information organized by age group or by the individual states within Germany on a weekly basis.\\ \begin{figure}[t] \centering \includegraphics[width=\textwidth]{dataset_visualization.pdf} \caption{A visualization of the total death case and infection case data for each day from the data set \emph{COVID-19-Todesfälle in Deutschland}. Status of 2024-08-20.} \label{fig:rki_data} \end{figure} The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}~\cite{GHInf}. This dataset contains comprehensive data regarding the infections of each county on a daily basis. The counties are encoded using the \emph{Community Identification Number}~\cite{GFSO}, wherein the first two digits denote the state, the third digit represents the government district, and the last two digits indicate the county. Each data point displays the gender, the age group, number death, infection and recovery cases and the reference and report date. The reference date marks the onset of illness in the individual. In the absence of this information, the reference date is equivalent to the report date.\\ The RKI assumes that the duration of the illness under normal conditions is 14 days, while the duration of severe cases is assumed to be 28 days. The recovery cases in the dataset are calculated using these assumptions, by adding the duration on the reference date if it is given. As the RKI states, the recovery data should be used with caution. Since we require the recovery data for further calculations, the following section presents the solutions we employed to address this issue. % ------------------------------------------------------------------- \subsection{Data Preprocessing} \label{sec:preprocessing:rq} At the outset of this section, we establish the format of the data, that is necessary for training the PINNs. In this section, we present the method, that we employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki}) into the training data. \\ In order to obtain the SIR data we require the size of each SIR compartment for each time point. The infection case data for the German states is available on a daily basis. To obtain the daily cases for the entire country we need to differentiate the total number of cases. The size of the population is defined as the respective size at the beginning of 2020. Using the starting conditions of~\Cref{eq:startCond}, we iterate through each day, modifying the sizes of the groups in a consecutive manner. For each iteration we subtract the new infection cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$, for $\boldsymbol{I}^{(i)}$, we add the new cases and subtract deaths and recoveries, and the size of $\boldsymbol{R}^{(i)}$ is obtained by adding the new deaths and recoveries as they occur.\\ As previously stated in~\Cref{sec:preprocessing:rki} the data on recoveries may either be unreliable or is entirely absent. To address this, we propose a method for computing the number of recovered individuals per day. Under the assumption that recovery takes $D$ days, we present the recovery queue, a data structure that holds the number of infections for a given day, retains them for $D$ days, and releases them into the removed group $D$ days later.\\ \begin{figure}[t] \centering \includegraphics[width=\textwidth]{recovery_queue.pdf} \caption{The recovery queue takes in the infected individuals for the $k$'th day and releases them $D$ days later into the removed group.} \label{fig:recovery_queue} \end{figure} In order to solve the reduced SIR model, we employ a similar algorithm to that used for the SIR model. However, in contrast to the recovery queue, we utilize a constant recovery rate $\alpha$ to transfer a portion $\alpha\boldsymbol{I}^{(i)}$ of infections, which have recovered or died on the $i$'th day and put them into the $\boldsymbol{R}^{(i+1)}$ compartment of the next day, which is irrelevant to our purposes. The transformed data for both the SIR model and the reduced SIR model are then employed by the PINN models, which we describe in the subsequent section. % ------------------------------------------------------------------- \section{Estimating Epidemiological Parameters using PINNs} \label{sec:pinn:sir} In the preceding section, we present the methods we employ to preprocess and format the data from the RKI in accordance with the specifications required for the application in this thesis. Here, we will present the method we employ to identify the SIR parameters $\alpha$ and $\beta$ for our data. As a foundation for our work, we draw upon the work of Shaier \etal~\cite{Shaier2021}, to solve the SIR system of ODEs using PINNs.\\ In order to conduct an analysis of a pandemic, it is necessary to have a quantifiable measure that indicates whether the disease in question has the capacity to spread rapidly through a population or is it not successful in infecting a significant number of individuals. In~\Cref{sec:pandemicModel:sir}, we provide an in-depth discussion of the SIR model, and show, that the transmission rate $\beta$ and the recovery rate $\alpha$ work as the aforementioned quantifiers in this model. The specific values of these epidemiological parameters belonging to the training data define a function that solves the system of differential equations of the SIR model. This function is able to return the size of each compartment at a specific point in time. Thus, from a mathematical and semantic perspective, it is essential to determine the corresponding values governing the development of the pandemic.\\ In order to ascertain the transmission rate $\beta$ and the recovery rate $\alpha$ from the preprocessed RKI data of $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$ for a given set of time points, it is necessary to employ a data-driven approach that outputs a model prediction of $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$ for a set of time points, with the aim of minimizing the term, \begin{equation}\label{eq:SIR_obs_term} \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2, \end{equation} for each data point in the set of training data points of a cardinality $N_t$ and with $i\in\{1, ..., N_t\}$. Moreover, the aforementioned parameters must satisfy the system of differential equations that govern the SIR model. For this reason, Shaier \etal~\cite{Shaier2021} utilize a PINN framework to satisfy both requirements. Their approach, which they refer to as the \emph{Disease-Informed Neural Network} (see~\Cref{sec:pinn:dinn}), takes epidemiological data as the input and returns the two epidemiological parameters $\alpha$ and $\beta$. Their method achieves this by finding an approximate solution of to the inverse problem of physics-informed neural networks (see~\Cref{sec:pinn}). In terms of the SIR model, a PINN addresses the inverse problem in two ways. First, it minimizes the mean of~\Cref{eq:SIR_obs_term} by bringing the model predictions $\hat{\Psi}$ closer to the actual values $\Psi$ for each time point. Second, it reduces the residuals of the ODEs that constitute the SIR model. While the forward problem concludes at this point, the inverse problem presets that a parameter is unknown. Thus, we designate the parameters $\alpha$ and $\beta$ as free, learnable parameters, $\hat{\beta}$ and $\hat{\alpha}$. These separate trainable parameters are values that are optimized during the training process and must fit the equations of the set of ODEs. \\ Assuming that the values of the epidemiological parameters stay below 1~\cite{Shaier2021}, we force the value of both rates to be in a range of $[-1, 1]$. Therefore, we regularize the parameters using the \emph{tangens hyperbolicus}. This results in the terms, \begin{equation} \tilde{\beta} = \tanh(\hat{\beta}),\quad \tilde{\alpha} = \tanh(\hat{\alpha}), \end{equation} where $\tilde{\alpha}$ and $\hat{\beta}$ are regularized model predictions.\\ The input data must include the time point $\boldsymbol{t}^{(i)}$ and its corresponding measured true values of $\Psi^{(i)}$. In its forward path, the PINN receives the time point $\boldsymbol{t}^{(i)}$ as its input, from which it calculates its model prediction $\hat{\Psi}^{(i)}$ based on its model parameters $\theta$. Subsequently, the model computes the loss function. It calculates the data loss by taking the mean squared error of~\Cref{eq:SIR_obs_term} over all $N_t$ training samples. Therefore, \begin{equation} \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2, \end{equation} is the term for the data loss. We find the ODEs of~\Cref{eq:modSIR} perform best in our setup. Hence, we utilize them in our physics loss. In order for the model to learn the system of differential equations, it is necessary to obtain the residual of each ODE. The mean square error of the residuals constitutes the physics loss $\mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi})$. The residuals are calculated using the model predictions $\hat{\Psi}$ and the regularized model predictions of the parameters, $\tilde{\alpha}$ and $\tilde{\beta}$. The residuals are given by, \begin{equation} 0=\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}}+ \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}, \quad 0=\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \tilde{\alpha}\hat{\boldsymbol{I}}, \quad 0=\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} + \hat{\alpha}\hat{\boldsymbol{I}}. \end{equation} Thus, \begin{equation} \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = \mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi}) + \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi}) \end{equation} is the multi-objective loss equation encapsuling both the physics loss and the data loss for our approach. By minimizing these loss terms our model learns the given training data but also the physics of the system. This enables our model to simultaneously learn the values of the parameters $\alpha$ and $\beta$ during training.\\ As this section concentrates on the finding of the time constant parameters $\alpha$ and $\beta$, the next section will show our approach of finding the reproduction number $\Rt$ on the German data of the RKI. % ------------------------------------------------------------------- \section{Estimating the Reproduction Number using PINNs} \label{sec:pinn:rsir} The previous section illustrates the methodology we employ to determine the constant transmission and recovery rates from a data set obtained during the COVID-19 pandemic in Germany. In this section, we utilize PINNs to identify the time-dependent reproduction number, $\Rt$, while reducing the number of state variables and the reliance on assumptions, by decreasing the number of ODEs in the system of differential equations of the SIR model. The methodology presented in this section is based on the approach developed by Millevoi \etal~\cite{Millevoi2023}.\\ In real-world pandemics, the rate of infection is influenced by a multitude of factors. Events such as the growing awareness for the disease among the general population, the introduction of non-pharmaceutical mitigations such as social distancing policies, and the emergence of a new variant have an impact on the transmission rate $\beta$. Accordingly, a transmission rate that is not time-dependent and constant across the entire duration of the pandemic may not accurately reflect the dynamics of the spread of a real-world disease. In~\Cref{sec:pandemicModel:rsir}, we provide, following Millevoi \etal~\cite{Millevoi2023}, the definition of the time-dependent $\beta(t)$ and subsequently that of the reproduction number, $\Rt$ which represents the number of new infections that occur as a result of one infectious individual. It indicates whether a pandemic is emerging or if it is spreading rapidly through the susceptible population. By inserting the definition of~\Cref{eq:repr_num}, into the system of ODEs of the SIR model, we can derive one~\Cref{eq:reduced_sir_ODE}. In order to solve this, we must identify a function that maps a time point to the size of the infectious compartment and the specific reproduction number.\\ As with the constant epidemiological parameters, we employ a data-driven approach for identifying the time-dependent reproduction number $\Rt$. The PINN approximates the size $\boldsymbol{I}$ with its model prediction $\hat{\boldsymbol{I}}$ by minimizing the term, \begin{equation}\label{eq:rSir_squared_err} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2, \end{equation} for each $i\in\{1,...,N_t\}$. In order to identify the reproduction number, the PINN minimizes the residuals of the ODE during the training process. The training process is analogous to the PINN, which identifies $\alpha$ and $\beta$ (see~\Cref{sec:pinn:sir}). However, there are two key differences. Firstly, the absence of free, trainable parameters. Secondly, the inclusion of an additional state variable that fluctuates in response to the input. While the state variable $\boldsymbol{I}$ is approximated using the error between the training data and the predicted values, the state variable $\Rt$ is approximated exclusively based on the residual of the ODE.\\ The PINN receives the input of $\boldsymbol{t}^{(i)}$ and generates a prediction of ($\hat{\boldsymbol{I}}^{(i)}$, $\Rt^{(i)}$). As previously stated, the PINN minimizes the distance between the true values of $\boldsymbol{I}$ and the model predictions $\hat{\boldsymbol{I}}$ by minimizing the mean squared error. Consequently, the data loss function is defined by, \begin{equation} \mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2. \end{equation} The physics loss function is defined as the squared error of the residual of the ODE. The residual of the reduced SIR model is given by, \begin{equation} 0 = \frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s). \end{equation} During training we first fit the data agnostic to physics utilizing only the data loss $\mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}})$. Then we train on composite loss function given by, \begin{equation} \mathcal{L}_{\text{rSIR}}(\boldsymbol{t}, \boldsymbol{I}, \hat{\boldsymbol{I}}) = \bigg\|\frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s)\bigg\|^2+ \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2, \end{equation} to achieve a better solution.\\ Although we set the transmission rate to be time-dependent, we define the recovery time to be constant over time to reduce the complexity of the problem. The RKI~\cite{GHInf} posits that the typical recovery period for the illness under normal conditions is 14 days, while those individuals with severe cases require approximately 28 days to recover. As we assume the case with a normal condition, we can set the recovery time to $D=14$, which yields $\alpha = \nicefrac{1}{14}$.\\ We perform extensive empirical evaluations of the methodology employed to determine the reproduction number, along with the other techniques, that this chapter presents in the next chapter. % -------------------------------------------------------------------