% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
% Author: Phillip Rothenbeck
% Title: Investigating the Evolution of the COVID-19 Pandemic in Germany Using Physics-Informed Neural Networks
% File: chap02/chap02.tex
% Part: theoretical background
% Description:
%    summary of the content in this chapter
% Version: 20.08.2024
% %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%

\chapter{Theoretical Background}
\label{chap:background}
This chapter introduces the theoretical foundations for the work presented in this thesis. In~\Cref{sec:domain} and~\Cref{sec:differentialEq}, we describe differential equations and the underlying theory. In these sections, both the explanations and the approach are based on a book on analysis by Rudin~\cite{Rudin2007} and a book on ordinary differential equations by Tenenbaum and Pollard~\cite{Tenenbaum1985}. Subsequently, we employ this knowledge to examine various pandemic models in~\Cref{sec:epidemModel}. Finally, we address the topic of neural networks, with a focus on the multilayer perceptron in~\Cref{sec:mlp} and physics-informed neural networks in~\Cref{sec:pinn}.

% -------------------------------------------------------------------
\section{Mathematical Modelling using Functions}
\label{sec:domain}
To model a physical problem mathematically, it is necessary to define a fundamental set of numbers or quantities upon which the subsequent calculations will be based. These sets may represent, for instance, a specific time interval or a distance. The term \emph{domain} describes these fundamental sets of numbers or quantities~\cite{Rudin2007}. A \emph{variable} is a changing entity that takes its values from a certain domain. In this thesis, we will focus on domains of real numbers in $\mathbb{R}$.\\
The mapping between variables enables the modeling of a physical process and may capture its semantics. We use functions to facilitate this mapping. Let $A, B\subset\mathbb{R}$ be two subsets of the real numbers, then we define a function as the mapping
\begin{equation}
    f: A\rightarrow B.
\end{equation}
In other words, the function $f$ maps elements $x\in A$ to values $f(x)\in B$. $A$ is the \emph{domain} of $f$, while $B$ is the \emph{codomain} of $f$. Functions are capable of representing the state of a system as a value based on an input value from their domain. One illustrative example is a function that maps a time step to the distance covered since a starting point. In this case, time serves as the domain, while the distance is the codomain.

% -------------------------------------------------------------------
\section{Mathematical Modelling using Differential Equations}
\label{sec:differentialEq}
Often, the behavior of a variable or a quantity across a domain is more interesting than its current state. Functions are able to give us the latter, but do not contain any information about the change of a system. The objective is to determine an effective method for calculating the change of a function across its domain. Let $f$ be a function and $[a, b]\subset \mathbb{R}$ an interval of real numbers. The expression
\begin{equation}
    m = \frac{f(b) - f(a)}{b-a}
\end{equation}
gives the average rate of change. While the average rate of change is useful in many cases, the instantaneous rate of change is often more informative. To calculate the instantaneous rate of change at $x$, we let the value $t$ approach $x$, thereby narrowing the interval down to an infinitesimal. For each $x\in[a, b]$ we calculate
\begin{equation} \label{eqn:differential}
    \frac{df}{dx} = \lim_{t\to x} \frac{f(t) - f(x)}{t-x},
\end{equation}
if it exists.
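To illustrate this definition, consider $f(x) = x^2$. For any $x$ we obtain
\begin{equation}
    \lim_{t\to x} \frac{t^2 - x^2}{t-x} = \lim_{t\to x}\,(t + x) = 2x,
\end{equation}
so the rate of change of $f$ at $x$ grows linearly with $x$.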
As Tenenbaum and Pollard~\cite{Tenenbaum1985} define, $\nicefrac{df}{dx}$ is the \emph{derivative}, which is ``the rate of change of a variable with respect to another''~\cite{Tenenbaum1985}. The relation between a variable and its derivative is modeled in a \emph{differential equation}. Differentiating $\nicefrac{df}{dx}$ again yields $\nicefrac{d^2f}{dx^2}$, the function that calculates the rate of change of the rate of change, called the \emph{second-order derivative}. Iterating this $n$ times results in $\nicefrac{d^nf}{dx^n}$, the derivative of $n$-th order. A method for obtaining a differential equation is to derive it from the semantics of a problem. For example, in physics a differential equation can be derived from the law of the conservation of energy~\cite{Demtroeder2021}. Differential equations find application in various areas, such as engineering, with, \eg, Kirchhoff's circuit laws~\cite{Kirchhoff1845} describing the relation between voltage and current in systems with resistors, inductors, and capacitors; physics, with, \eg, the Schrödinger equation predicting the probability of finding particles like electrons in specific places or states in a quantum system; economics, with, \eg, the Black-Scholes equation~\cite{Oksendal2000} forecasting financial derivative prices; epidemiology, with the SIR model~\cite{1927}; and beyond.\\
In the context of functions, it is possible to have multiple domains, meaning that a function has more than one parameter. To illustrate, consider a function operating in two-dimensional space, wherein each parameter represents one axis. Another example would be a function that maps a location variable and a time variable to a height. The term \emph{partial differential equations} (PDE) describes differential equations of such functions, which contain partial derivatives with respect to each individual domain. In contrast, \emph{ordinary differential equations} (ODE) contain derivatives of a function having only one domain~\cite{Tenenbaum1985}. In this thesis, we restrict ourselves to ODEs. Furthermore, a \emph{system of differential equations} is the name for a set of differential equations. The equations in a system of differential equations each have their own codomain, which is part of the problem, while they all share the same domain.\\
Tenenbaum and Pollard~\cite{Tenenbaum1985} provide many examples of ODEs, including the \emph{Motion of a Particle Along a Straight Line}. This example is based on Newton's second law, which states that ``the rate of change of the momentum of a body ($momentum = mass \cdot velocity$) is proportional to the resultant external force $F$ acted upon it''~\cite{Tenenbaum1985}. Let $m$ be the mass of the body in kilograms, $v$ its velocity in meters per second and $t$ the time in seconds. Then, Newton's second law translates mathematically to
\begin{equation} \label{eq:newtonSecLaw}
    F = m\frac{dv}{dt}.
\end{equation}
It is evident that the acceleration, $a=\frac{dv}{dt}$, as the rate of change of the velocity, is part of the equation. Additionally, the velocity of a body is the derivative of the distance traveled by that body.
Based on these findings, we can rewrite~\Cref{eq:newtonSecLaw} as
\begin{equation}
    F=ma=m\frac{d^2s}{dt^2},
\end{equation}
showing that the force $F$ influences the change of the body's position over time.\\
To conclude, note that this explanation of differential equations focuses on the aspects deemed crucial for this thesis and is not intended to be a complete treatment of the subject. For a deeper understanding, we recommend the books mentioned above~\cite{Rudin2007,Tenenbaum1985}. In the following section, we describe the application of these principles in epidemiological models.

% -------------------------------------------------------------------
\section{Epidemiological Models}
\label{sec:epidemModel}
Pandemics, like \emph{COVID-19}, have resulted in a significant number of fatalities. Hence, the question arises: how should we analyze a pandemic effectively? It is essential to study whether the employed countermeasures are efficacious in combating the pandemic. Given the unfavorable public response~\cite{Jaschke2023} to measures such as lockdowns, it is imperative to investigate whether their efficacy remains commensurate with the costs incurred by those affected. Where novel technologies were in use, such as the mRNA vaccines in the context of COVID-19, it is necessary to test their effect and find the optimal variant. In order to shed light on the aforementioned events, we need a method to quantify the pandemic along with its course of progression.\\
The real world is a highly complex system, which makes it challenging to describe fully in a mathematical model. Therefore, the model must reduce the complexity while retaining the essential information. Furthermore, it must address the issue of limited data availability. For instance, during COVID-19, institutions such as the Robert Koch Institute (RKI) were only able to collect data on infections and mortality cases. Consequently, we require a model that employs an abstraction of the real world to illustrate the events and relations that are pivotal to understanding the problem.

% -------------------------------------------------------------------
\subsection{SIR Model}
\label{sec:pandemicModel:sir}
In 1927, Kermack and McKendrick~\cite{1927} introduced the \emph{SIR model}, which subsequently became one of the most influential epidemiological models. This model enables the modeling of infections during epidemiological events such as pandemics. The book \emph{Mathematical Models in Biology}~\cite{EdelsteinKeshet2005} reiterates the model and serves as the foundation for the following explanation of SIR models.\\
The SIR model is capable of describing diseases that are transmitted through contact or proximity between an individual carrying the illness and a healthy individual. This is possible due to the distinction between infected individuals, who are carriers of the disease, and the part of the population that is susceptible to infection. In the model, individuals may move between these groups, \eg, healthy individuals becoming infected. The model assumes the size $N$ of the population to remain constant throughout the duration of the pandemic. The population $N$ comprises three distinct compartments: the \emph{susceptible} group $S$, the \emph{infectious} group $I$ and the \emph{removed} group $R$ (hence SIR model).
Let $\mathcal{T} = [t_0, t_f]\subseteq \mathbb{R}_{\geq0}$ be the time span of the pandemic, then,
\begin{equation}
    S: \mathcal{T}\rightarrow\mathbb{N}, \quad I: \mathcal{T}\rightarrow\mathbb{N}, \quad R: \mathcal{T}\rightarrow\mathbb{N},
\end{equation}
give the values of $S$, $I$ and $R$ at a certain point in time $t\in\mathcal{T}$. The sizes of the groups always add up to the population size,
\begin{equation} \label{eq:N_char}
    N = S + I + R.
\end{equation}
The model makes another assumption by stating that recovered individuals are immune to the illness and infectious individuals cannot infect them. The individuals in the $R$ group are either recovered or deceased, and thus unable to transmit or carry the disease.
\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{sir_graph.pdf}
    \caption{A visualization of the SIR model~\cite{1927}, illustrating $N$ being split in the three groups $S$, $I$ and $R$.}
    \label{fig:sir_model}
\end{figure}
As visualized in~\Cref{fig:sir_model}, individuals may transition between groups based on epidemiological parameters. The transmission rate $\beta$ is responsible for individuals becoming infected, while the rate of removal or recovery rate $\alpha$ (also referred to as $\delta$ or $\nu$, \eg,~\cite{Millevoi2023,EdelsteinKeshet2005}) moves individuals from $I$ to $R$.\\
We can describe this problem mathematically using a system of differential equations (see~\Cref{sec:differentialEq}). Thus, Kermack and McKendrick~\cite{1927} propose the following set of differential equations:
\begin{equation}\label{eq:sir}
    \begin{split}
        \frac{dS}{dt} &= -\beta S I,\\
        \frac{dI}{dt} &= \beta S I - \alpha I,\\
        \frac{dR}{dt} &= \alpha I.
    \end{split}
\end{equation}
This set of differential equations is based on the following assumption: ``The rate of transmission of a microparasitic disease is proportional to the rate of encounter of susceptible and infective individuals modelled by the product ($\beta S I$)'', according to Edelstein-Keshet~\cite{EdelsteinKeshet2005}. The system shows the change in size of the groups per time unit due to infections, recoveries, and deaths.\\
The term $\beta SI$ describes the rate of encounters of susceptible and infected individuals. As this term depends on the absolute sizes of $S$ and $I$, Anderson and May~\cite{Anderson1991} propose a modified model:
\begin{equation}\label{eq:modSIR}
    \begin{split}
        \frac{dS}{dt} &= -\beta \frac{SI}{N},\\
        \frac{dI}{dt} &= \beta \frac{SI}{N} - \alpha I,\\
        \frac{dR}{dt} &= \alpha I.
    \end{split}
\end{equation}
By normalizing the term $\beta SI$ by $N$,~\Cref{eq:modSIR} reflects real-world transmission more accurately~\cite{Anderson1991}.\\
The initial phase of a pandemic is characterized by the infection of a small number of individuals, while the majority of the population remains susceptible. At this point, no individuals have yet recovered or died. Let $I_0\in\mathbb{N}$ be the number of infected individuals at the beginning of the disease. Then,
\begin{equation}\label{eq:startCond}
    \begin{split}
        S(0) &= N - I_{0},\\
        I(0) &= I_{0},\\
        R(0) &= 0,
    \end{split}
\end{equation}
describes the initial configuration of a system in which a disease has just emerged.\\
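To make these dynamics concrete, the following minimal Python sketch integrates~\Cref{eq:modSIR} with the initial configuration of~\Cref{eq:startCond}, using the reference parameters of~\Cref{fig:synth_sir} ($N=7.9\cdot 10^6$, $I_0=10$, $\alpha=0.35$, $\beta=0.5$). The use of SciPy and the time horizon of 150 days are illustrative choices, not prescribed by the model.
\begin{verbatim}
import numpy as np
from scipy.integrate import solve_ivp

N = 7.9e6                  # population size
I0 = 10.0                  # initially infected individuals
alpha, beta = 0.35, 0.5    # recovery and transmission rate

def sir(t, y):
    # normalized SIR model, cf. eq:modSIR
    S, I, R = y
    dS = -beta * S * I / N
    dI = beta * S * I / N - alpha * I
    dR = alpha * I
    return [dS, dI, dR]

t_eval = np.linspace(0.0, 150.0, 151)   # daily grid over 150 days
sol = solve_ivp(sir, (0.0, 150.0), [N - I0, I0, 0.0], t_eval=t_eval)
S, I, R = sol.y
print(f"peak of I: {I.max():.0f} at day {t_eval[I.argmax()]:.0f}")
\end{verbatim}
Varying $\alpha$ and $\beta$ in this sketch reproduces the qualitative behavior shown in~\Cref{fig:synth_sir}.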
\begin{figure}[t]
    %\centering
    \setlength{\unitlength}{1cm} % Set the unit length for coordinates
    \begin{picture}(12, 9.5) % Specify the size of the picture environment (width, height)
        % reference
        \put(0, 1.75){
            \begin{subfigure}{0.35\textwidth}
                \centering
                \includegraphics[width=\textwidth]{reference_params_synth.pdf}
                \caption{$\alpha=0.35$, $\beta=0.5$}
                \label{fig:synth_norm}
            \end{subfigure}
        }
        % 1. row, 1. image (low beta)
        \put(5.25, 5){
            \begin{subfigure}{0.3\textwidth}
                \centering
                \includegraphics[width=\textwidth]{low_beta_synth.pdf}
                \caption{$\alpha=0.35$, $\beta=0.4$}
                \label{fig:synth_low_beta}
            \end{subfigure}
        }
        % 1. row, 2. image (high beta)
        \put(9.5, 5){
            \begin{subfigure}{0.3\textwidth}
                \centering
                \includegraphics[width=\textwidth]{high_beta_synth.pdf}
                \caption{$\alpha=0.35$, $\beta=0.6$}
                \label{fig:synth_high_beta}
            \end{subfigure}
        }
        % 2. row, 1. image (low alpha)
        \put(5.25, 0){
            \begin{subfigure}{0.3\textwidth}
                \centering
                \includegraphics[width=\textwidth]{low_alpha_synth.pdf}
                \caption{$\alpha=0.25$, $\beta=0.5$}
                \label{fig:synth_low_alpha}
            \end{subfigure}
        }
        % 2. row, 2. image (high alpha)
        \put(9.5, 0){
            \begin{subfigure}{0.3\textwidth}
                \centering
                \includegraphics[width=\textwidth]{high_alpha_synth.pdf}
                \caption{$\alpha=0.45$, $\beta=0.5$}
                \label{fig:synth_high_alpha}
            \end{subfigure}
        }
    \end{picture}
    \caption{Synthetic data, using~\Cref{eq:modSIR} and $N=7.9\cdot 10^6$, $I_0=10$ with different sets of parameters. We visualize the case with the reference parameters in (a). In (b) and (c) we keep $\alpha$ constant, while varying the value of $\beta$. In contrast, (d) and (e) have varying values of $\alpha$.}
    \label{fig:synth_sir}
\end{figure}
In the SIR model, the timing and the height of the peak (or peaks) of the infectious group are of great importance for understanding the dynamics of a pandemic. A low peak occurring at a late point in time indicates that the disease is unable to keep pace with the rate of recovery, resulting in its demise before it can exert a significant influence on the population. In contrast, an early and high peak means that the disease is rapidly transmitted through the population, with a significant proportion of individuals having been infected. \Cref{fig:synth_sir} illustrates this effect by varying the values of $\alpha$ and $\beta$ while simulating a pandemic using a model such as~\Cref{eq:modSIR}. It is evident that both the transmission rate $\beta$ and the recovery rate $\alpha$ influence the height and time of the peak of $I$. When the number of infections exceeds the number of recoveries, the peak of $I$ will occur early and will be high. On the other hand, if recoveries occur at a faster rate than new infections, the peak will occur later and will be low. Thus, it is crucial to know both $\alpha$ and $\beta$, as these parameters characterize how the pandemic evolves.\\
The SIR model is based on a number of assumptions that are intended to reduce the overall complexity of the model while still representing the processes observed in the real world. For example, the size of a population in the real world is subject to a number of factors that can contribute to change. The population is increased by the occurrence of births and decreased by the occurrence of deaths.
One assumption stated in the SIR model is that the size of the population, $N$, remains constant, as the daily change is negligible compared to the total population. Other examples are the assumptions that individuals cannot become susceptible again after having recovered, and that the epidemiological parameters do not change, \eg, due to new variants or the implementation of new countermeasures. We address this latter limitation in~\Cref{sec:pandemicModel:rsir}.

% -------------------------------------------------------------------
\subsection{Reduced SIR Model and the Reproduction Number}
\label{sec:pandemicModel:rsir}
\Cref{sec:pandemicModel:sir} presents the classical SIR model by Kermack and McKendrick~\cite{1927}. This model contains two scalar parameters $\alpha$ and $\beta$, which describe the course of a pandemic over its duration. This is beneficial when examining the overall pandemic; however, in the real world, disease behavior is dynamic, and the values of the parameters $\alpha$ and $\beta$ change throughout the course of the disease. This is due to events such as the implementation of countermeasures that reduce the contact between the infectious and susceptible individuals, the emergence of a new variant of the disease that increases its infectiousness or deadliness, or the administration of a vaccination that provides previously susceptible individuals with immunity without their ever having been infected. As constant rates miss these finer details of disease progression, Liu and Stechlinski~\cite{Liu2012} and Setianto and Hidayat~\cite{Setianto2023} introduce time-dependent epidemiological parameters and a time-dependent reproduction number to address this issue. Millevoi \etal~\cite{Millevoi2023} present a reduced version of the SIR model.\\
For the time interval $\mathcal{T} = [t_0, t_f]\subseteq \mathbb{R}_{\geq0}$, they alter the definition of $\alpha$ and $\beta$ to be time-dependent,
\begin{equation}
    \beta: \mathcal{T}\rightarrow\mathbb{R}_{\geq0}, \quad\alpha: \mathcal{T}\rightarrow\mathbb{R}_{\geq0}.
\end{equation}
Another crucial element is $D(t) = \frac{1}{\alpha(t)}$, which represents the time span an infected individual requires to recuperate. Subsequently, at the initial time point $t_0$, the \emph{reproduction number},
\begin{equation}
    \RO = \beta(t_0)D(t_0) = \frac{\beta(t_0)}{\alpha(t_0)},
\end{equation}
represents the number of susceptible individuals that one infectious individual infects at the onset of the pandemic. In light of the effects of $\alpha$ and $\beta$ (see~\Cref{sec:pandemicModel:sir}), $\RO < 1$ indicates that the disease recedes, since each infectious individual infects fewer than one other person on average. Conversely, $\RO > 1$ characterizes an emerging pandemic, in which the disease spreads rapidly among the population, since at the onset $I(t_0) \ll S(t_0)$ leaves many susceptible individuals to infect. Nevertheless, $\RO$ does not cover the entire time span. For this reason, Millevoi \etal~\cite{Millevoi2023} introduce $\Rt$, which has the same interpretation as $\RO$, with the exception that $\Rt$ is dependent on time. The time-dependent reproduction number is defined as,
\begin{equation}\label{eq:repr_num}
    \Rt=\frac{\beta(t)}{\alpha(t)}\cdot\frac{S(t)}{N},
\end{equation}
on the time interval $\mathcal{T}$ and the population size $N$. This definition combines the epidemiological parameters, which carry the information about the spread of the disease, with the decreasing share of susceptible individuals in the population.
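Continuing the simulation sketch from~\Cref{sec:pandemicModel:sir}, the reproduction number of~\Cref{eq:repr_num} can be evaluated directly along a simulated trajectory; here, for simplicity, $\alpha$ and $\beta$ are the constant reference parameters, so that $\Rt$ varies only through $S(t)/N$:
\begin{verbatim}
# time-dependent reproduction number, cf. eq:repr_num;
# S is the simulated susceptible trajectory from the SIR sketch above
R_t = (beta / alpha) * S / N
print(f"R_t at t_0: {R_t[0]:.2f}, R_t at t_f: {R_t[-1]:.2f}")
\end{verbatim}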
In contrast to $\alpha$ and $\beta$, $\Rt$ is not a parameter but a state variable in the model, which gives information about the reproduction of the disease for each day. As Millevoi \etal~\cite{Millevoi2023} show, $\Rt$ enables the following reduction of the SIR model.\\
\Cref{eq:N_char} allows for the calculation of the value of the group $R$ using $S$ and $I$, with the term $R(t)=N-S(t)-I(t)$. Thus,
\begin{equation}
    \begin{split}
        \frac{dS}{dt} &= -\alpha\Rt I(t),\\
        \frac{dI}{dt} &= \alpha(\Rt-1)I(t),
    \end{split}
\end{equation}
is the reduction of~\Cref{eq:modSIR} on the time interval $\mathcal{T}$ using this characteristic and the reproduction number $\Rt$ (see~\Cref{eq:repr_num}). Another issue that Millevoi \etal~\cite{Millevoi2023} seek to address is the extensive range of values that the SIR groups can assume. Accordingly, they initially scale the time interval $\mathcal{T}$ using its endpoints to calculate the scaled time $t_s = \frac{t - t_0}{t_f - t_0}\in[0, 1]$. Subsequently, they calculate the scaled groups,
\begin{equation}
    S_s(t_s) = \frac{S(t)}{C},\quad I_s(t_s) = \frac{I(t)}{C},\quad R_s(t_s) = \frac{R(t)}{C},
\end{equation}
using a large constant scaling factor $C\in\mathbb{N}$. Applying this to the variable $I$ results in,
\begin{equation}\label{eq:reduced_sir_ODE}
    \frac{dI_s}{dt_s} = \alpha(t_f - t_0)(\Rt - 1)I_s(t_s),
\end{equation}
which is a further reduced version of~\Cref{eq:modSIR}. This less complex differential equation results in a less complex solution, as it entails the elimination of a parameter ($\beta$) and two state variables ($S$ and $R$). Requiring fewer observed quantities makes the reduced SIR model advantageous in situations with limited data, such as when recovery data is missing.

% -------------------------------------------------------------------
\section{Multilayer Perceptron}
\label{sec:mlp}
In~\Cref{sec:differentialEq}, we discuss the modeling of systems using differential equations, illustrating how they can be utilized to elucidate the impact of a specific parameter on the system's behavior. In~\Cref{sec:epidemModel}, we show specific applications of differential equations in an epidemiological context. Solving such systems is crucial and involves finding a function that best fits the data. Fitting measured data points to approximate such a function is one of multiple methods to achieve this goal. The \emph{Multilayer Perceptron} (MLP)~\cite{Rumelhart1986} is a data-driven function approximator. In the following section, we provide a brief overview of the structure and training of these \emph{neural networks}. For reference, we use the book \emph{Deep Learning} by Goodfellow \etal~\cite{Goodfellow-et-al-2016} as a foundation for our explanations.\\
The objective is to develop an approximation method for any function $f^{*}$, which could be a mathematical function or a mapping of an input vector to the desired output. Let $\boldsymbol{x}$ be the input vector and $\boldsymbol{y}$ the label, class, or result. Then, $\boldsymbol{y} = f^{*}(\boldsymbol{x})$ is the function to approximate. In 1958, Rosenblatt~\cite{Rosenblatt1958} proposed the perceptron, which models the concept of a neuron in a neuroscientific sense. The perceptron takes in the input vector $\boldsymbol{x}$, performs an operation and produces a scalar result. This model optimizes its parameters $\theta$ to be able to calculate $\boldsymbol{y} = f(\boldsymbol{x}; \theta)$ as accurately as possible.
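As an illustration, the following minimal Python sketch shows the vector-to-scalar computation of a single perceptron with a threshold activation; the weights are set by hand here rather than learned:
\begin{verbatim}
import numpy as np

def perceptron(x, w, b):
    # weighted sum of the inputs followed by a threshold activation
    return 1.0 if np.dot(w, x) + b > 0 else 0.0

# parameters theta = (w, b), chosen to realize the logical AND function
w, b = np.array([1.0, 1.0]), -1.5
print([perceptron(np.array(p), w, b)
       for p in [(0, 0), (0, 1), (1, 0), (1, 1)]])   # [0.0, 0.0, 0.0, 1.0]
\end{verbatim}
While suitable weights exist for the logical AND function, no choice of $w$ and $b$ reproduces the XOR function, which is not linearly separable.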
As Minsky and Papert~\cite{Minsky1972} demonstrate, the perceptron is only capable of approximating a specific class of functions. Consequently, an extension of the perceptron is necessary.\\
As Goodfellow \etal~\cite{Goodfellow-et-al-2016} proceed, the solution to this issue is to decompose $f$ into a chain structure of the form,
\begin{equation} \label{eq:mlp_char}
    f(\boldsymbol{x}) = f^{(3)}(f^{(2)}(f^{(1)}(\boldsymbol{x}))).
\end{equation}
This nested version of a perceptron is called a multilayer perceptron. Each sub-function, designated as $f^{(i)}$, is represented in the structure of an MLP as a \emph{layer}, which contains a linear mapping and a nonlinear mapping in the form of an \emph{activation function}. Each layer is composed of a multitude of \emph{units} (also \emph{neurons}). A neuron performs the same vector-to-scalar calculation as the perceptron does. Subsequently, a nonlinear activation function transforms the scalar output into the activation of the unit. The layers are staggered in the neural network, with each layer being connected to its neighboring layers, as illustrated in~\Cref{fig:mlp_example}. The input vector $\boldsymbol{x}$ is provided to each unit of the first layer (input layer) $f^{(1)}$, which then gives the results to the units of the second layer $f^{(2)}$, and so forth. The final layer is the \emph{output layer}. The intervening layers, situated between the input and the output layers, are the \emph{hidden layers}. The term \emph{forward propagation} describes the process of information flowing through the network from the input layer to the output layer which, during training, results in a scalar loss. The alternating structure of linear and nonlinear calculations enables MLPs to approximate a wide class of functions. As Hornik \etal~\cite{Hornik1989} prove, MLPs are universal approximators.\\
\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{MLP.pdf}
    \caption{An illustration of an MLP~\cite{Rumelhart1986} with two hidden layers. Each neuron of a layer is connected to every neuron of the neighboring layers. The arrow indicates the direction of the forward propagation.}
    \label{fig:mlp_example}
\end{figure}
The term \emph{training} describes the process of optimizing the parameters $\theta$. In order to undertake training, it is necessary to have a set of \emph{training data}, which is a set of pairs (also called training points) of the input data $\boldsymbol{x}$ and its corresponding true solution $\boldsymbol{y}$ of the function $f^{*}$. For the training process, we must define a \emph{loss function} $\Loss{ }$, using the model prediction $\hat{\boldsymbol{y}}$ and the true value $\boldsymbol{y}$, which will act as a metric for evaluating the extent to which the model deviates from the correct answer. One common loss function is the \emph{mean square error} (MSE) loss function. Let $N$ be the number of points in the set of training data. Then,
\begin{equation} \label{eq:mse}
    \Loss{MSE} = \frac{1}{N}\sum_{i=1}^{N} ||\hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2,
\end{equation}
calculates the squared difference between each model prediction and the true value of a training data point and takes the mean across the whole training data.\\
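The following minimal sketch shows such a model and loss in PyTorch; the architecture with two hidden layers mirrors~\Cref{fig:mlp_example}, while the layer width and the toy target function $f^{*}(x)=\sin(2\pi x)$ are illustrative assumptions:
\begin{verbatim}
import torch
import torch.nn as nn

# an MLP with two hidden layers, cf. fig:mlp_example
model = nn.Sequential(
    nn.Linear(1, 32), nn.Tanh(),    # input layer -> hidden layer 1
    nn.Linear(32, 32), nn.Tanh(),   # hidden layer 1 -> hidden layer 2
    nn.Linear(32, 1),               # hidden layer 2 -> output layer
)
x = torch.linspace(0, 1, 100).unsqueeze(1)   # training inputs
y = torch.sin(2 * torch.pi * x)              # true solutions of f*
loss = nn.MSELoss()(model(x), y)             # MSE loss, cf. eq:mse
\end{verbatim}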
Ultimately, the objective is to utilize this information to optimize the parameters in order to minimize the loss. One of the most fundamental and seminal optimization strategies is \emph{gradient descent}. In this process, the derivatives are employed to identify the location of local or global minima within a function, which lie where the gradient is zero. Given that a positive gradient signifies ascent and a negative gradient indicates descent, we must move the variable by a \emph{learning rate} (step size) in the opposite direction to that of the gradient. The calculation of the derivatives with respect to the parameters is a complex task, since our function is a composition of many functions (one for each layer). We can address this issue by taking advantage of~\Cref{eq:mlp_char} and employing the chain rule of calculus. Let $\hat{\boldsymbol{y}} = f(\boldsymbol{x}; \theta)$ be the model prediction with the decomposed version $f(\boldsymbol{x}; \theta) = f^{(3)}(w; \theta_3)$ with $w = f^{(2)}(z; \theta_2)$ and $z = f^{(1)}(\boldsymbol{x}; \theta_1)$. $\boldsymbol{x}$ is the input vector and $\theta_3, \theta_2, \theta_1\subset\theta$. Then,
\begin{equation}\label{eq:backprop}
    \nabla_{\theta_3} \Loss{ } = \frac{d\mathcal{L}}{d\hat{\boldsymbol{y}}}\frac{d\hat{\boldsymbol{y}}}{df^{(3)}}\nabla_{\theta_3}f^{(3)},
\end{equation}
is the gradient of $\Loss{ }$ with respect to the parameters $\theta_3$. To obtain $\nabla_{\theta_2} \Loss{ }$, we extend this chain of derivatives by one step through $f^{(2)}$. The name of this method in the context of neural networks is \emph{backpropagation}~\cite{Rumelhart1986}, as it propagates the error backwards through the neural network.\\
In practical applications, an optimizer often accomplishes the optimization task by executing backpropagation in the background. Furthermore, modifying the learning rate during training can be advantageous, for instance, making larger steps at the beginning and minor adjustments at the end. Therefore, \emph{schedulers} are implementations of algorithms that employ diverse learning rate alteration strategies.\\
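Continuing the PyTorch sketch from above, a typical training loop combines these pieces; the Adam optimizer, the number of epochs, and the step-wise learning rate decay are illustrative choices:
\begin{verbatim}
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
scheduler = torch.optim.lr_scheduler.StepLR(optimizer,
                                            step_size=500, gamma=0.5)
for epoch in range(2000):
    optimizer.zero_grad()               # reset gradients of theta
    loss = nn.MSELoss()(model(x), y)    # forward propagation
    loss.backward()                     # backpropagation, cf. eq:backprop
    optimizer.step()                    # gradient descent step on theta
    scheduler.step()                    # halve learning rate every 500 epochs
\end{verbatim}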
For a more in-depth discussion of practical considerations and additional details like regularization, we direct the reader to the book \emph{Deep Learning} by Goodfellow \etal~\cite{Goodfellow-et-al-2016}. The next section will demonstrate the application of neural networks in approximating solutions to differential systems.

% -------------------------------------------------------------------
\section{Physics-Informed Neural Networks}
\label{sec:pinn}
In~\Cref{sec:mlp}, we describe the structure and training of MLPs, which are widely recognized tools for approximating any kind of function. In 1998, Lagaris \etal~\cite{Lagaris1998} provided a method that utilizes neural networks trained by gradient descent to solve ODEs and PDEs. Building on this approach, Raissi \etal~\cite{Raissi2019} introduced the methodology under the name \emph{Physics-Informed Neural Network} (PINN) in 2017. In this approach, the model learns to approximate a function using provided data points while leveraging the available knowledge about the problem in the form of a system of differential equations.\\
In contrast to standard MLP models, PINNs are not solely data-driven. The differential equation,
\begin{equation}
    \boldsymbol{y}=\mathcal{D}(\boldsymbol{x}),
\end{equation}
includes both the solution $\boldsymbol{y}$ and the operator $\mathcal{D}$, which incorporates all derivatives with respect to the input $\boldsymbol{x}$. This equation contains the information about the physical properties and dynamics of $\boldsymbol{y}$. In order to find the solution $\boldsymbol{y}$, we must solve the differential equation with respect to data related to the problem at hand. As Raissi \etal~\cite{Raissi2019} propose, we employ a neural network with the parameters $\theta$. The MLP is then supposed to optimize its parameters such that its output $\hat{\boldsymbol{y}}$ approximates the solution $\boldsymbol{y}$. In order to achieve this, we train the model on data containing input-output pairs with measurements of $\boldsymbol{y}$. The output $\hat{\boldsymbol{y}}$ is fitted to the data through the mean square error data loss $\mathcal{L}_{\text{data}}$, which may include additional terms for initial and boundary conditions. Furthermore, the physics is incorporated through an additional loss term, the physics loss $\mathcal{L}_{\text{physics}}$, which includes the differential equation through its residual $r=\nicefrac{d\hat{\boldsymbol{y}}}{d\boldsymbol{x}} - \mathcal{D}(\boldsymbol{x})$. This leads to the PINN loss function,
\begin{align}\label{eq:PINN_loss}
    \mathcal{L}_{\text{PINN}}(\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}}) &= \mathcal{L}_{\text{data}} (\boldsymbol{y},\hat{\boldsymbol{y}}) + \mathcal{L}_{\text{physics}} (\boldsymbol{x}, \boldsymbol{y},\hat{\boldsymbol{y}})\\
    &= \frac{1}{N_t}\sum_{i=1}^{N_t} || \hat{\boldsymbol{y}}^{(i)}-\boldsymbol{y}^{(i)}||^2 + \frac{1}{N_d}\sum_{i=1}^{N_d} || r_i(\boldsymbol{x},\hat{\boldsymbol{y}})||^2,
\end{align}
with $N_d$ being the number of differential equations in the system and $N_t$ the number of training samples. Utilizing $\mathcal{L}_{\text{PINN}}$, the PINN simultaneously optimizes its parameters $\theta$ to minimize both the data loss and the physics loss, making it a multi-objective optimization problem.\\
Given the nature of differential equations, calculating the loss term $\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$ requires the calculation of the derivative of the output with respect to the input of the neural network. As we outline in~\Cref{sec:mlp}, during the process of backpropagation we calculate the gradients of the loss term with respect to a layer-specific set of parameters denoted by $\theta_l$, where $l$ represents the index of the respective layer. By employing the chain rule of calculus, the algorithm progresses from the output layer through each hidden layer, ultimately reaching the first layer in order to compute the respective gradients. The term,
\begin{equation}
    \nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} = \frac{d\hat{\boldsymbol{y}}}{df^{(2)}}\frac{df^{(2)}}{df^{(1)}}\nabla_{\boldsymbol{x}}f^{(1)},
\end{equation}
illustrates that, in contrast to the procedure described in~\Cref{eq:backprop}, this procedure, called \emph{automatic differentiation}, goes one step further and calculates the gradient of the output with respect to the input $\boldsymbol{x}$. In order to calculate the second derivative, $\frac{d^2\hat{\boldsymbol{y}}}{d\boldsymbol{x}^2}=\nabla_{\boldsymbol{x}} (\nabla_{\boldsymbol{x}} \hat{\boldsymbol{y}} )$, this procedure must be repeated.\\
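In PyTorch, continuing the sketches above, such derivatives of the network output with respect to its input can be obtained with \texttt{torch.autograd.grad}; the input grid is an illustrative choice:
\begin{verbatim}
t = torch.linspace(0, 1, 100, requires_grad=True).unsqueeze(1)
u = model(t)                         # network output u(t)
du_dt, = torch.autograd.grad(u, t, torch.ones_like(u),
                             create_graph=True)      # first derivative
d2u_dt2, = torch.autograd.grad(du_dt, t, torch.ones_like(du_dt),
                               create_graph=True)    # second derivative
\end{verbatim}
Setting \texttt{create\_graph=True} keeps the computation graph of each derivative, so that it can be differentiated again and used inside a loss term.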
Above, we present the method by Raissi \etal~\cite{Raissi2019} for approximating functions through the use of systems of differential equations. As previously stated, we want to find a solution for systems of differential equations. In problems where we must solve an ODE or PDE, we have to find a set of parameters that satisfies the system for any input $\boldsymbol{x}$. In the context of PINNs, this is an inverse problem. We have training data from measurements and the corresponding differential equations, but the parameters of these equations are unknown. To address this challenge, we implement these parameters as distinct learnable parameters within the neural network. This enables the network to utilize a specific value that actively influences the physics loss $\mathcal{L}_{\text{physics}}(\boldsymbol{x},\hat{\boldsymbol{y}})$. During the training phase, the optimizer aims to minimize the physics loss, which should ultimately yield an approximation of the true parameter value fitting the observations.\\
In order to illustrate the workings of a PINN, we use the example of a \emph{damped harmonic oscillator} taken from~\cite{Moseley}. In this problem, we displace a body, which is attached to a spring, from its resting position. The body is subject to three forces: firstly, the inertial force due to the displacement $u$, which points in the direction of the displacement; secondly, the restoring force of the spring, which attempts to return the body to its original position; and thirdly, the friction force, which acts in the direction opposite to the movement. In accordance with Newton's second law and the combined influence of these forces, the body exhibits oscillatory motion around its position of rest. The system is characterized by the mass of the body $m$, the coefficient of friction $\mu$ and the spring constant $k$, indicating the stiffness of the spring. The differential equation,
\begin{equation}
    m\frac{d^2u}{dt^2}+\mu\frac{du}{dt}+ku=0,
\end{equation}
\begin{figure}[t]
    \centering
    \includegraphics[width=\textwidth]{oscilator.pdf}
    \caption{Illustration of the movement of an oscillating body in the underdamped case, with $m=1\,kg$, $\mu=4\,\frac{Ns}{m}$ and $k=200\,\frac{N}{m}$.}
    \label{fig:spring}
\end{figure}
shows the relation of these parameters in reference to the problem at hand. As Tenenbaum and Pollard~\cite{Tenenbaum1985} show, there are three possible cases of solutions. However, only the \emph{underdamped case} results in an oscillating movement of the body, as illustrated in~\Cref{fig:spring}. In order to apply a PINN to this problem, we require a set of training data. This consists of pairs of time points and corresponding displacement measurements $(t^{(i)}, u^{(i)})$, where $i\in\{1, ..., N_t\}$. In this hypothetical case, we know the mass $m=1\,kg$, the spring constant $k=200\,\frac{N}{m}$, and the initial conditions $u^{(1)} = 1$ and $\frac{du(0)}{dt} = 0$; however, we do not know the value of the friction coefficient $\mu$. In this case, the loss function,
\begin{equation}
    \begin{split}
        \mathcal{L}_{\text{osc}}(\boldsymbol{t}, \boldsymbol{u}, \hat{\boldsymbol{u}}) = & \left|\left|\hat{u}^{(1)}-1\right|\right|^2+\left|\left|\frac{d\hat{u}(0)}{dt}\right|\right|^2+ \frac{1}{N_t}\sum_{i=1}^{N_t} ||\hat{\boldsymbol{u}}^{(i)}-\boldsymbol{u}^{(i)}||^2 \\
        + & \left|\left|m\frac{d^2\hat{u}}{dt^2}+\hat{\mu}\frac{d\hat{u}}{dt}+k\hat{u}\right|\right|^2,
    \end{split}
\end{equation}
includes the initial conditions, the data loss, and the residual, in which $\hat{\mu}$ is a learnable parameter. By minimizing $\mathcal{L}_{\text{osc}}$ and solving the inverse problem, the PINN is able to find the missing parameter $\mu$.
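A minimal PyTorch sketch of this loss, reusing the automatic differentiation from above, might look as follows; the unknown friction coefficient is implemented as a learnable parameter, the initial-condition terms are omitted for brevity, and the initial value and variable names are illustrative assumptions:
\begin{verbatim}
mu_hat = nn.Parameter(torch.tensor(1.0))   # unknown friction, learnable
m, k = 1.0, 200.0                          # known mass and spring constant
optimizer = torch.optim.Adam(list(model.parameters()) + [mu_hat], lr=1e-3)

def loss_osc(t_data, u_data, t_phys):
    # data loss on the measured displacements
    loss_data = nn.MSELoss()(model(t_data), u_data)
    # physics loss: residual of m u'' + mu u' + k u = 0
    # (t_phys must be created with requires_grad=True)
    u = model(t_phys)
    du, = torch.autograd.grad(u, t_phys, torch.ones_like(u),
                              create_graph=True)
    d2u, = torch.autograd.grad(du, t_phys, torch.ones_like(du),
                               create_graph=True)
    residual = m * d2u + mu_hat * du + k * u
    return loss_data + residual.pow(2).mean()
\end{verbatim}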
This shows the methodology by which PINNs are capable of learning the parameters of physical systems, such as the damped harmonic oscillator. In the following section, we present the approach of Shaier \etal~\cite{Shaier2021} to find the transmission rate and recovery rate of the SIR model using PINNs.

% -------------------------------------------------------------------
\subsection{Disease-Informed Neural Networks}
\label{sec:pinn:dinn}
In the preceding section, we present a data-driven methodology, as described by Raissi \etal~\cite{Raissi2019}, for solving systems of differential equations by employing PINNs. In~\Cref{sec:pandemicModel:sir}, we describe the SIR model, which models the relations of susceptible, infectious and removed individuals and simulates the progress of a disease in a population of constant size. A system of differential equations models these relations. Shaier \etal~\cite{Shaier2021} propose a method to solve the equations of the SIR model using a PINN, which they call a \emph{Disease-Informed Neural Network} (DINN).\\
To solve~\Cref{eq:sir}, we need to find the recovery rate $\alpha$ and the transmission rate $\beta$. As Shaier \etal~\cite{Shaier2021} point out, there are different approaches to solve this set of equations. One, for instance, builds on the assumption that at the beginning one infected individual infects $n$ other people, resulting in $\frac{dS(0)}{dt} = -n$. Then,
\begin{equation}
    \beta=-\frac{\frac{dS}{dt}}{S_0I_0}
\end{equation}
would calculate the initial transmission rate using the initial size of the susceptible group $S_0$ and the infectious group $I_0$. The recovery rate could then be defined using the number of days $d$ between a person's point of infection and the start of their isolation, $\alpha = \frac{1}{d}$. Such analytical solutions of the SIR model often use heuristic methods and require prior knowledge like the sizes $S_0$ and $I_0$. A data-driven approach such as the one that Shaier \etal~\cite{Shaier2021} propose does not suffer from these problems, since the model learns the parameters $\alpha$ and $\beta$ while fitting the training data, which consists of the time points $\boldsymbol{t}$ and the corresponding measured sizes of the groups $\Psi=(\boldsymbol{S}, \boldsymbol{I}, \boldsymbol{R})$. Let $\hat{\Psi}=(\hat{\boldsymbol{S}}, \hat{\boldsymbol{I}}, \hat{\boldsymbol{R}})$ be the model predictions of the groups and $r_S=\frac{d\hat{\boldsymbol{S}}}{dt}+\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}, r_I=\frac{d\hat{\boldsymbol{I}}}{dt}-\hat{\beta} \hat{\boldsymbol{S}}\hat{\boldsymbol{I}}+\hat{\alpha} \hat{\boldsymbol{I}}$ and $r_R=\frac{d \hat{\boldsymbol{R}}}{dt} - \hat{\alpha} \hat{\boldsymbol{I}}$ the residuals of each differential equation using the model predictions. Then,
\begin{equation}
    \begin{split}
        \mathcal{L}_{\text{SIR}}(\boldsymbol{t}, \Psi, \hat{\Psi}) = &||r_S||^2 + ||r_I||^2 + ||r_R||^2\\
        + &\frac{1}{N_t}\sum_{i=1}^{N_t} ||\hat{\boldsymbol{S}}^{(i)}-\boldsymbol{S}^{(i)}||^2 + ||\hat{\boldsymbol{I}}^{(i)}-\boldsymbol{I}^{(i)}||^2 + ||\hat{\boldsymbol{R}}^{(i)}-\boldsymbol{R}^{(i)}||^2,
    \end{split}
\end{equation}
is the loss function of a DINN, with the learnable parameters $\hat{\alpha}$ and $\hat{\beta}$ appearing in the residuals of the ODEs.
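A minimal PyTorch sketch of this loss, analogous to the oscillator example above, could look as follows; the network architecture and the initial parameter values are illustrative assumptions:
\begin{verbatim}
# network mapping a time point to the predictions (S_hat, I_hat, R_hat)
net = nn.Sequential(nn.Linear(1, 32), nn.Tanh(), nn.Linear(32, 3))
alpha_hat = nn.Parameter(torch.tensor(0.1))   # learnable recovery rate
beta_hat = nn.Parameter(torch.tensor(0.1))    # learnable transmission rate

def loss_sir(t, S, I, R):
    # t must be created with requires_grad=True
    S_hat, I_hat, R_hat = net(t).split(1, dim=1)
    ones = torch.ones_like(S_hat)
    dS, = torch.autograd.grad(S_hat, t, ones, create_graph=True)
    dI, = torch.autograd.grad(I_hat, t, ones, create_graph=True)
    dR, = torch.autograd.grad(R_hat, t, ones, create_graph=True)
    # residuals of eq:sir with the learnable parameters
    r_S = dS + beta_hat * S_hat * I_hat
    r_I = dI - beta_hat * S_hat * I_hat + alpha_hat * I_hat
    r_R = dR - alpha_hat * I_hat
    mse = nn.MSELoss()
    return (r_S.pow(2).mean() + r_I.pow(2).mean() + r_R.pow(2).mean()
            + mse(S_hat, S) + mse(I_hat, I) + mse(R_hat, R))
\end{verbatim}

% -------------------------------------------------------------------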