chap03.tex 19 KB

123456789101112131415161718192021222324252627282930313233343536373839404142434445464748495051525354555657585960616263646566676869707172737475767778798081828384858687888990919293949596979899100101102103104105106107108109110111112113114115116117118119120121122123124125126127128129130131132133134135136137138139140141142143144145146147148149150151152153154155156157158159160161162163164165166167168169170171172173174175176177178179180181182183184185186187188189190191192193194195196197198199200201202203204205206207208209210211212213214215216217218219220221222223224225226227228229230231232233234235236237238239240241242243244245246247248249250251252253254255256257258259260261262263264265266267268269270271272273274275276277278279280281282283284285286287288289290291292293294295296297298299300301302303304305306307308309310311312313314315316317318319320
  1. % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  2. % Author: Phillip Rothenbeck
  3. % Title: Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
  4. % File: chap03/chap03.tex
  5. % Part: Methods
  6. % Description:
  7. % summary of the content in this chapter
  8. % Version: 26.08.2024
  9. % %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
  10. \chapter{Methods}
  11. \label{chap:methods}
  12. This chapter provides the methods, that we employ to address the problem that we
  13. present in~\Cref{chap:introduction}.~\Cref{sec:preprocessing} outlines
  14. our approaches for preprocessing of the available data and has two
  15. sections. The first section describes the publicly available data provided by
  16. the \emph{Robert Koch Institute} (RKI)\footnote{\url{https://www.rki.de/EN/Home/homepage_node.html}}.
  17. The second section outlines the techniques we use to process this data to fit
  18. our project's requirements. Subsequently, we give a theoretical overview of the
  19. PINN's that we employ. These latter sections, establish the foundation for the
  20. implementations described in~\Cref{sec:sir:setup} and~\Cref{sec:rsir:setup}.
  21. % -------------------------------------------------------------------
  22. \section{Epidemiological Data}
  23. \label{sec:preprocessing}
  24. In this thesis we want to analyze the COVID-19 pandemic in Germany utilizing
  25. the SIR model and PINNs. For a PINN to learn the parameters of the SIR model,
  26. we need pandemic data in the correct format for the approach. Let $N_t$ be the
  27. number of training points, then let $i\in\{1, ..., N_t\}$
  28. be the index of the training points. The data required by the PINN for solving
  29. the SIR model (see~\Cref{sec:pinn:dinn}), consists of pairs
  30. $(\boldsymbol{t}^{(i)}, (\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}, \boldsymbol{R}^{(i)}))$,
  31. with $\boldsymbol{t}^{(i)}$ representing the time in days since the first
  32. measurement and $\boldsymbol{S}^{(i)}, \boldsymbol{I}^{(i)}$, and $\boldsymbol{R}^{(i)}$
  33. the corresponding size of the compartments. Given that the system of
  34. differential equations representing the reduced SIR model
  35. (see~\Cref{sec:pandemicModel:rsir}) consists of a single differential equation
  36. for $I$, it is necessary to obtain pairs of the form $(\boldsymbol{t}^{(i)}, \boldsymbol{I}^{(i)})$.
  37. This section, focuses on the structure of the available data and the methods we
  38. employ to transform it into the correct structure.
  39. % -------------------------------------------------------------------
  40. \subsection{RKI Data}
  41. \label{sec:preprocessing:rki}
  42. The RKI is a biomedical institute in Germany responsible for
  43. the on monitoring and prevention of diseases. As the central institution of the
  44. German government in the field of biomedicine, one of its tasks during the
  45. COVID-19 pandemic was to track the number of infections and death cases in
  46. Germany. The data was collected by university hospitals, research facilities
  47. and laboratories through the conduction of tests. Each new case had to be
  48. reported within a period of 24 hours at the latest to the respective state
  49. authority. Each state authority collects the cases for a day and had to report
  50. them to the RKI by the following working day~\cite{GHDead}. The RKI then refines
  51. the data and releases statistics and updates its repositories holding the
  52. information for the public to access. For the purposes of this thesis we
  53. concentrate on two of these repositories.\\
  54. The first repository is called \emph{COVID-19-Todesfälle in Deutschland}~\cite{GHDead}.
  55. The dataset comprises discrete data points, each with a date indicating the
  56. point in time at which the respective data was collected. The dates span from
  57. 2020-03-09, to the present day. For each date, the dataset provides the total
  58. number of infection and death cases, the number of new deaths, and the
  59. case-fatality ratio. The total number of infection and death cases represents
  60. the sum of all cases reported up to that date, including the newly reported
  61. data. The dataset includes two additional subsets, that contain the death case
  62. information organized by age group or by the individual states within Germany on
  63. a weekly basis.\\
  64. \begin{figure}[t]
  65. \centering
  66. \includegraphics[width=\textwidth]{dataset_visualization.pdf}
  67. \caption{A visualization of the total death case and infection case data for
  68. each day from the data set \emph{COVID-19-Todesfälle in Deutschland}. Status
  69. of 2024-08-20.}
  70. \label{fig:rki_data}
  71. \end{figure}
  72. The second repository is entitled \emph{SARS-CoV-2 Infektionen in Deutschland}~\cite{GHInf}.
  73. This dataset contains comprehensive data regarding the infections of each county
  74. on a daily basis. The counties are encoded using the \emph{Community Identification Number}\footnote{\url{https://www.destatis.de/DE/Themen/Laender-Regionen/Regionales/Gemeindeverzeichnis/_inhalt.html}},
  75. wherein the first two digits denote the state, the third digit represents the
  76. government district, and the last two digits indicate the county. Each data
  77. point displays the gender, the age group, number death, infection and recovery
  78. cases and the reference and report date. The reference date marks the onset of
  79. illness in the individual. In the absence of this information, the reference
  80. date is equivalent to the report date.\\
  81. The RKI assumes that the duration of the illness under normal conditions is 14 days,
  82. while the duration of severe cases is assumed to be 28 days. The recovery cases
  83. in the dataset are calculated using these assumptions, by adding the duration on
  84. the reference date if it is given. As stated, the recovery data should be used
  85. with caution. Since we require the recovery data for further calculations, the
  86. following section presents the solutions we employed to address this issue.
  87. % -------------------------------------------------------------------
  88. \subsection{Data Preprocessing}
  89. \label{sec:preprocessing:rq}
  90. At the outset of this section, we establish the format of the data, that is
  91. necessary for training the PINNs. In this subsection, we present the method, that we
  92. employ to preprocess and transform the RKI data (see~\Cref{sec:preprocessing:rki})
  93. into the training data. \\
  94. In order to obtain the SIR data we require the size of each SIR compartment for
  95. each time point. The infection case data for the German states is available on
  96. a daily basis. To obtain the daily cases for the entire country we need to
  97. differentiate the total number of cases. The size of the population is defined
  98. as the respective size at the beginning of 2020. Using the starting conditions
  99. of~\Cref{eq:startCond}, we iterate through each day, modifying the sizes of the
  100. groups in a consecutive manner. For each iteration we subtract the new infection
  101. cases from $\boldsymbol{S}^{(i-1)}$ to obtain $\boldsymbol{S}^{(i)}$, for
  102. $\boldsymbol{I}^{(i)}$, we add the new cases and subtract deaths and recoveries,
  103. and the size of $\boldsymbol{R}^{(i)}$ is obtained by adding the new deaths and
  104. recoveries as they occur.\\
  105. As previously stated in~\Cref{sec:preprocessing:rki} the data on recoveries may
  106. either be unreliable or is entirely absent. To address this, we propose a method
  107. for computing the number of recovered individuals per day. Under the assumption
  108. that recovery takes $D$ days, we present the recovery queue, a data structure
  109. that holds the number of infections for a given day, retains them for $D$ days,
  110. and releases them into the removed group $D$ days later.\\
  111. \begin{figure}[t]
  112. \centering
  113. \includegraphics[width=\textwidth]{recovery_queue.pdf}
  114. \caption{The recovery queue takes in the infected individuals for the $k$'th
  115. day and releases them $D$ days later into the removed group.}
  116. \label{fig:recovery_queue}
  117. \end{figure}
  118. In order to solve the reduced SIR model, we employ a similar algorithm to that
  119. used for the SIR model. However, in contrast to the recovery queue, we utilize
  120. a set recovery rate $\alpha$ to transfer a portion $\alpha\boldsymbol{I}^{(i)}$
  121. of infections, which have recovered or died on the $i$'th day and put them into
  122. the $\boldsymbol{R}^{(i+1)}$ compartment of the next day, which is irrelevant to
  123. our purposes. The transformed data for both the SIR model and the reduced SIR
  124. model are then employed by the PINN models, which we describe in the subsequent
  125. section.
  126. % -------------------------------------------------------------------
  127. \section{Estimating Epidemiological Parameters using PINNs}
  128. \label{sec:pinn:sir}
  129. In the preceding section, we present the methods we employ to preprocess and
  130. format the data from the RKI in accordance with the specifications required for
  131. the application in this thesis. Here, we will present the method we employ
  132. to identify the SIR parameters $\beta$ and $\alpha$ for our data. As a
  133. foundation for our work, we draw upon the work of Shaier \etal~\cite{Shaier2021},
  134. to solve the SIR system of ODEs using PINNs.\\
  135. In order to conduct an analysis of a pandemic, it is necessary to have a
  136. quantifiable measure that indicates whether the disease in question has the
  137. capacity to spread rapidly through a population or is it not successful in
  138. infecting a significant number of individuals. In~\Cref{sec:pandemicModel:sir},
  139. we provide an in-depth discussion of the SIR model, and show, that the
  140. transmission rate $\beta$ and the recovery rate $\alpha$ work as the
  141. aforementioned quantifiers in this model. The specific values of these
  142. epidemiological parameters belonging to the training data define a function that
  143. solves the system of differential equations of the SIR model. This function is
  144. able to return the size of each compartment at a specific point in time. Thus,
  145. from a mathematical and semantic perspective, it is essential to determine the
  146. corresponding values governing the development of the pandemic.\\
  147. In order to ascertain the transmission rate $\beta$ and the recovery rate
  148. $\alpha$ from the preprocessed RKI data of $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$
  149. for a given set of time points, it is necessary to employ a data-driven approach
  150. that outputs a model prediction of $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$
  151. for a set of time points, with the aim of minimizing the term,
  152. \begin{equation}\label{eq:SIR_obs_term}
  153. \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
  154. \end{equation}
  155. for each data point in the set of training dataset of a cardinality $N_t$ and with
  156. $i\in\{1, ..., N_t\}$. Moreover, the aforementioned parameters must satisfy the system
  157. of differential equations that govern the SIR model. For this reason, Shaier
  158. \etal~\cite{Shaier2021} utilize a PINN framework to satisfy both requirements.
  159. Their approach, which they refer to as the \emph{Disease-Informed Neural Network}
  160. (see~\Cref{sec:pinn:dinn}), takes epidemiological data as the input and returns
  161. the two epidemiological parameters $\alpha$ and $\beta$. Their method
  162. achieves this by finding an approximate solution of to the inverse problem of
  163. physics-informed neural networks (see~\Cref{sec:pinn}). In terms of the SIR
  164. model, a PINN addresses the inverse problem in two ways. First, it minimizes the mean of~\Cref{eq:SIR_obs_term}
  165. by bringing the model predictions $\hat{\Psi}$
  166. closer to the actual values $\Psi$
  167. for each time point. Second, it reduces the residuals of the ODEs that
  168. constitute the SIR model. While the forward problem concludes at this point, the
  169. inverse problem presets that a parameter is unknown. Thus, we designate the parameters
  170. $\beta$ and $\alpha$ as free, learnable parameters, $\hat{\beta}$ and
  171. $\hat{\alpha}$. These separate trainable parameters are values that are
  172. optimized during the training process and must fit the equations of the set of
  173. ODEs. \\
  174. Assuming that the values of the epidemiological parameters stay below
  175. 1~\cite{Shaier2021}, we force the value of both rates to be in a
  176. range of $[-1, 1]$. Therefore, we regularize the parameters using the
  177. \emph{tangens hyperbolicus}. This results in the terms,
  178. \begin{equation}
  179. \tilde{\beta} = \tanh(\hat{\beta}),\quad \tilde{\alpha} = \tanh(\hat{\alpha}),
  180. \end{equation}
  181. where $\tilde{\alpha}$ are regularized model predictions.\\
  182. The input data must include the time point $\boldsymbol{t}^{(i)}$ and its
  183. corresponding measured true values of $\Psi^{(i)}$.
  184. In its forward path, the PINN receives the time point $\boldsymbol{t}^{(i)}$ as its input, from which it
  185. calculates its model prediction $\hat{\Psi}^{(i)}$
  186. based on its model parameters $\theta$. Subsequently, the model computes the loss function. It calculates the data loss by taking the
  187. mean squared error of~\Cref{eq:SIR_obs_term} over all $N_t$ training samples.
  188. Therefore,
  189. \begin{equation}
  190. \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2 + \Big\|\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}\Big\|^2,
  191. \end{equation}
  192. is the term for the data loss. We find the ODEs of~\Cref{eq:modSIR} perform best
  193. in our setup. Hence, we utilize them in our physics loss. In order for the model
  194. to learn the system of differential equations, it is necessary to obtain the
  195. residual of each ODE. The mean square error of the residuals constitutes the
  196. physics loss
  197. $\mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi})$.
  198. The residuals are calculated using the model predictions $\hat{\Psi}$
  199. and the regularized model predictions of the parameters, $\tilde{\beta}$ and $\tilde{\alpha}$.
  200. The residuals are given by,
  201. \begin{equation}
  202. 0=\frac{d\hat{\boldsymbol{S}}}{d\boldsymbol{t}}+ \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N}, \quad 0=\frac{d\hat{\boldsymbol{I}}}{d\boldsymbol{t}} - \tilde{\beta}\frac{\hat{\boldsymbol{S}}\hat{\boldsymbol{I}}}{N} + \tilde{\alpha}\hat{\boldsymbol{I}}, \quad 0=\frac{d\hat{\boldsymbol{R}}}{d\boldsymbol{t}} + \hat{\alpha}\hat{\boldsymbol{I}}.
  203. \end{equation}
  204. Thus,
  205. \begin{equation}
  206. \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = \mathcal{L}_{\text{physics}}(\boldsymbol{t}, \Psi, \hat{\Psi}) + \mathcal{L}_{\text{data}}(\Psi, \hat{\Psi})
  207. \end{equation}
  208. is the multi-objective loss equation encapsuling both the physics loss and the
  209. data loss for our approach. By minimizing these loss terms our model learns the
  210. given training data but also the physics of the system. This enables our model
  211. to simultaneously learn the values of the parameters $\beta$ and $\alpha$
  212. during training.\\
  213. As this section concentrates on the finding of the time constant parameters
  214. $\beta$ and $\alpha$, the next section will show our approach of finding the
  215. reproduction number $\Rt$ on the German data of the RKI.
  216. % -------------------------------------------------------------------
  217. \section{Estimating the Reproduction Number using PINNs}
  218. \label{sec:pinn:rsir}
  219. The previous section illustrates the methodology we employ to determine the
  220. constant transmission and recovery rates from a data set obtained from
  221. the COVID-19 pandemic in Germany. In this section, we utilize PINNs to identify
  222. the time-dependent reproduction number, $\Rt$, while reducing the number of
  223. state variables and the reliance on assumptions, by decreasing the number of ODEs
  224. in the system of differential equations of the SIR model. The methodology
  225. presented in this section is based on the approach developed by Millevoi
  226. \etal~\cite{Millevoi2023}.\\
  227. In real-world pandemics, the rate of infection is influenced by a multitude of
  228. factors. Events such as the growing awareness for the disease among the general
  229. population, the introduction of non-pharmaceutical mitigations such as social
  230. distancing policies, and the emergence of a new variant have an impact on the
  231. transmission rate $\beta$. Accordingly, a transmission rate that is not
  232. time-dependent and constant across the entire duration of the pandemic may not
  233. accurately reflect the dynamics of the spread of a real-world disease. In~\Cref{sec:pandemicModel:rsir},
  234. we provide, following Millevoi \etal~\cite{Millevoi2023}, the definition of the
  235. time-dependent $\beta(t)$ and subsequently that of the reproduction number,
  236. $\Rt$ which represents the number of new infections that occur as a result of
  237. one infectious individual. It indicates whether a pandemic is emerging or if it
  238. is spreading rapidly through the susceptible population. By inserting the
  239. definition of~\Cref{eq:repr_num}, into the system of ODEs of the SIR model, we
  240. can derive one~\Cref{eq:reduced_sir_ODE}. In order to solve this, we must
  241. identify a function that maps a time point to the size of the infectious
  242. compartment and the specific reproduction number.\\
  243. As with the constant epidemiological parameters, we employ a data-driven approach for
  244. identifying the time-dependent reproduction number $\Rt$. The PINN approximates
  245. the size $\boldsymbol{I}$ with its model prediction $\hat{\boldsymbol{I}}$ by
  246. minimizing the term,
  247. \begin{equation}\label{eq:rSir_squared_err}
  248. \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2,
  249. \end{equation}
  250. for each $i\in\{1,...,N_t\}$. In order to identify the reproduction number, the
  251. PINN minimizes the residuals of the ODE during the training process. The
  252. training process is analogous to the PINN, which identifies $\beta$
  253. and $\alpha$ (see~\Cref{sec:pinn:sir}). However, there are two key differences. Firstly, the absence of
  254. free, trainable parameters. Secondly, the inclusion of an additional state variable that
  255. fluctuates in response to the input. While the state variable $\boldsymbol{I}$
  256. is approximated using the error between the training data and the predicted
  257. values, the state variable $\Rt$ is approximated exclusively based on the
  258. residual of the ODE.\\
  259. The PINN receives the input of $\boldsymbol{t}^{(i)}$ and generates a prediction of
  260. ($\hat{\boldsymbol{I}}^{(i)}$, $\Rt^{(i)}$). As previously stated, the PINN minimizes
  261. the distance between the true values of $\boldsymbol{I}$ and the model predictions
  262. $\hat{\boldsymbol{I}}$ by minimizing the mean squared error. Consequently, the
  263. data loss function is defined by,
  264. \begin{equation}
  265. \mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}}) = \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2.
  266. \end{equation}
  267. The physics loss function is defined as the squared error of the residual of the
  268. ODE. The residual of the reduced SIR model is given by,
  269. \begin{equation}
  270. 0 = \frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s).
  271. \end{equation}
  272. During training we first fit the data agnostic to physics utilizing only the
  273. data loss $\mathcal{L}_{\text{data}}(\boldsymbol{I}, \hat{\boldsymbol{I}})$.
  274. Then we train on composite loss function given by,
  275. \begin{equation}
  276. \mathcal{L}_{\text{rSIR}}(\boldsymbol{t}, \boldsymbol{I}, \hat{\boldsymbol{I}}) = \bigg\|\frac{dI_s}{dt_s} - \alpha(t_f - t_0)(\Rt - 1)I_s(t_s)\bigg\|^2+ \frac{1}{N_t}\sum_{i=1}^{N_t} \Big\|\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}\Big\|^2,
  277. \end{equation}
  278. to achieve a better solution.\\
  279. Although we set the transmission rate to be time-dependent, we define the
  280. recovery time constant over time to reduce the complexity of the problem. The
  281. RKI~\cite{GHInf} posits that the typical recovery period for the illness under
  282. normal conditions is 14 days, while those individuals with severe cases require
  283. approximately 28 days to recover. As we assume the case with normal condition,
  284. we can set the recovery time to $D=14$, which yields $\alpha = \nicefrac{1}{14}$.\\
  285. We perform extensive empirical evaluations of the methodology employed to
  286. determine the reproduction number, along with the other techniques, that this
  287. chapter presents in the next chapter.
  288. % -------------------------------------------------------------------