Taewan Goo and Catherine Apio contributed equally to this work.

For the novel coronavirus disease 2019 (COVID-19), predictive modeling in the literature broadly uses susceptible exposed infected recovered (SEIR)/susceptible infected recovered (SIR), agent-based, and curve-fitting models. Governments and legislative bodies rely on insights from prediction models to suggest new policies and to assess the effectiveness of enforced policies. Therefore, access to accurate outbreak prediction models is essential to obtain insights into the likely spread and consequences of infectious diseases. The objective of this study is to predict the future COVID-19 situation of Korea. Here, we employed six models for this analysis: SEIR, local linear regression (LLR), negative binomial (NB) regression, segmented Poisson, the deep-learning based long short-term memory (LSTM) model, and the tree-based gradient boosting machine (GBM). After prediction, model performance was compared using relative mean squared errors (RMSE) for two sets of training data (January 20, 2020‒December 31, 2020 and January 20, 2020‒January 31, 2021) and testing data (January 1, 2021‒February 28, 2021 and February 1, 2021‒February 28, 2021). Except for the segmented Poisson model, the models predicted a decline in the daily confirmed cases in the country in the near future. Comparison of the RMSE values showed that LLR, GBM, SEIR, NB, and LSTM, in that order, performed well in forecasting the pandemic situation of the country. A good understanding of the epidemic dynamics would greatly enhance the control and prevention of COVID-19 and other infectious diseases. Therefore, with daily confirmed cases increasing since the beginning of this year, these results could help the pandemic response by informing decisions about planning, resource allocation, and social distancing policies.

The novel coronavirus disease 2019 (COVID-19) presents an important and urgent threat to global health. Since the outbreak in early December 2019 in the Hubei province of the People’s Republic of China, the number of patients confirmed to have the disease has exceeded 118 million as the disease spread globally, and the number of people infected is probably much higher [

To mitigate and suppress the burden of COVID-19 on the healthcare system, while also protecting the general public, especially the highly susceptible group of people, robust models that predict the prognosis of COVID-19 were urgently needed to support decisions about shielding, hospital admission, treatment, and population level interventions [

Other features, such as social distancing, stay-at-home orders, use of facemasks or self-quarantine, travel restriction, and contact tracing could help predict what comes next. Prediction models are important for estimating the course of the disease and its possible threats; for example, the predicted number of cases by level of severity can help determine how many ventilators and other sophisticated medical equipment are needed. Furthermore, countries need to shape their health system responses in accordance with the need [

For COVID-19, predictive modeling in the literature broadly uses susceptible exposed infected recovered (SEIR)/susceptible infected recovered (SIR), agent-based, and curve-fitting models. In addition, machine-learning models built on statistical tools have also been widely used [

Therefore, with daily confirmed cases increasing since the beginning of 2021, in Korea and elsewhere, such models could help the pandemic response by informing decisions about planning, resource allocation, and social distancing.

The daily series of confirmed cases of COVID-19 for South Korea from January 20, 2020 to February 28, 2021 was obtained from Kaggle (from January 20 to June 30, 2020) [

As one model may not give the best prediction of the COVID-19 situation of Korea, we present prediction results estimated by different models that can be applied to the above data. Many models are available and have been implemented for forecasting the pandemic situation of many countries; we look at a few of them. In this section, we introduce the segmented Poisson model, LLR model, deep-learning based LSTM model, NB model, SEIR model, and GBM model used for predicting the COVID-19 situation of Korea.

Here, we regarded the confirmed cases as a function of time _{t}

where _{t}_{t}

Breakpoints were considered in the daily confirmed cases during the analysis by splitting the daily confirmed cases into segments (

where _{i}

NB model is defined as [

where _{t}_{t}_{t-1} as the history of the joint process {_{t}_{t}_{t}

where ^{2} is defined as ^{2}=1/

where _{t}

Our LLR model is based on the Poisson model mentioned previously. For this model, a local quadratic approximation is fitted within a smoothing window of bandwidth $h$, with tricube weights $W(u) = (1 - |u|^{3})^{3}$ for each point. The local quadratic log-likelihood is defined as:

where _{0}, _{1}, _{2})^{t}.

We use rolling-origin cross-validation to select the optimal bandwidth of the smoothing window [
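The rolling-origin scheme can be illustrated with a minimal sketch. The forecaster below is a plain moving average over the last `bandwidth` observations, used only as a stand-in for the local quadratic likelihood fit; the function names, the `min_train` parameter, and the one-step horizon are assumptions made for illustration and are not taken from this study.

```python
from statistics import mean


def rolling_origin_splits(n_obs, min_train, horizon=1):
    """Yield (train_end, test_indices) pairs with a growing training window."""
    for train_end in range(min_train, n_obs - horizon + 1):
        yield train_end, range(train_end, train_end + horizon)


def window_mean_forecast(train, bandwidth, horizon):
    """Stand-in forecaster (assumption): repeat the mean of the last `bandwidth` points."""
    level = mean(train[-bandwidth:])
    return [level] * horizon


def select_bandwidth(series, bandwidths, min_train, horizon=1):
    """Return the bandwidth with the lowest mean squared forecast error, plus all scores."""
    scores = {}
    for bandwidth in bandwidths:
        squared_errors = []
        for train_end, test_idx in rolling_origin_splits(len(series), min_train, horizon):
            preds = window_mean_forecast(series[:train_end], bandwidth, horizon)
            squared_errors.extend((series[i] - p) ** 2 for i, p in zip(test_idx, preds))
        scores[bandwidth] = mean(squared_errors)
    best = min(scores, key=scores.get)
    return best, scores


# Toy series with a steady upward trend (placeholder values, not real case counts).
toy_series = list(range(20))
best_bandwidth, cv_scores = select_bandwidth(toy_series, [1, 3, 5], min_train=5)
```

Each origin uses only observations before the forecast date, so the selected bandwidth is judged on out-of-sample error rather than on the fit to the training window. `min_train` should be at least as large as the largest candidate bandwidth.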

Here, LSTM network is considered [_{t}_{t-h}_{t-1}

The LSTM architecture is described in _{t}_{t-1} and _{t-1} to be trained. The _{t-1} and _{t-1} are the state and output of the last block, respectively. We assume four blocks with 64 units, each with a dropout layer with a rate of 0.2. The optimization is performed using the Adam optimizer to minimize the mean squared error (MSE) during the training process.

Once the optimal model is built, it is applied to the test data for prediction. The prediction is performed sequentially, using the current output as part of the input for the next prediction. The analysis was performed using Python version 3.7.6 and the ‘keras’ library.
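The sequential prediction step can be sketched independently of the network itself. In the minimal example below, `predict_next` stands for any trained one-step model; `damped_trend` is a hypothetical stand-in rather than the LSTM used in this study, and the window length and horizon are illustrative assumptions.

```python
def recursive_forecast(history, predict_next, window, horizon):
    """Forecast `horizon` steps, feeding each prediction back in as input."""
    if len(history) < window:
        raise ValueError("history must contain at least `window` observations")
    buffer = list(history[-window:])
    forecasts = []
    for _ in range(horizon):
        y_hat = predict_next(buffer)
        forecasts.append(y_hat)
        # Drop the oldest value and append the newest prediction.
        buffer = buffer[1:] + [y_hat]
    return forecasts


def damped_trend(window_values):
    """Hypothetical one-step model: last value plus half of the last change."""
    return window_values[-1] + 0.5 * (window_values[-1] - window_values[-2])


# Placeholder history, not real case counts.
forecasts = recursive_forecast([10, 12, 14], damped_trend, window=2, horizon=3)
```

Because every forecast after the first depends on earlier forecasts, errors can accumulate over the horizon, which is worth keeping in mind when comparing test errors over long test periods.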

The dynamics of an infectious disease can be formulated with a mathematical model. We consider the SEIR model to fit the dataset of COVID-19 daily confirmed cases and predict the incidence of the COVID-19 epidemic in Korea. In the SEIR model, the population is divided into four groups: susceptible (S), exposed (E), symptomatic and infectious (I), and recovered (R) individuals. This model includes the spread of infection during the latent period. The latency of COVID-19 infection is biologically realistic. The SEIR model is defined by the following system of ordinary differential equations [

where
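As an illustration of how such a compartmental system can be simulated, the sketch below integrates the standard textbook SEIR equations with a fourth-order Runge-Kutta scheme. The textbook form is an assumption and may differ from the system fitted in this study; the parameter names `beta`, `sigma`, and `gamma` and all numerical values are illustrative and are not estimates from the Korean data.

```python
def seir_derivatives(state, beta, sigma, gamma):
    """Textbook SEIR right-hand side (assumed form): returns (dS, dE, dI, dR)."""
    s, e, i, r = state
    n = s + e + i + r
    new_exposed = beta * s * i / n
    return (-new_exposed, new_exposed - sigma * e, sigma * e - gamma * i, gamma * i)


def _shift(state, slope, step):
    """Return state + step * slope, component by component."""
    return tuple(x + step * dx for x, dx in zip(state, slope))


def simulate_seir(initial_state, beta, sigma, gamma, days, steps_per_day=10):
    """Integrate with classical RK4; return one state per day, including day 0."""
    dt = 1.0 / steps_per_day
    state = tuple(float(x) for x in initial_state)
    trajectory = [state]
    for _ in range(days):
        for _ in range(steps_per_day):
            k1 = seir_derivatives(state, beta, sigma, gamma)
            k2 = seir_derivatives(_shift(state, k1, dt / 2), beta, sigma, gamma)
            k3 = seir_derivatives(_shift(state, k2, dt / 2), beta, sigma, gamma)
            k4 = seir_derivatives(_shift(state, k3, dt), beta, sigma, gamma)
            state = tuple(
                x + dt / 6 * (a + 2 * b + 2 * c + d)
                for x, a, b, c, d in zip(state, k1, k2, k3, k4)
            )
        trajectory.append(state)
    return trajectory


# Illustrative values only: population of 1,000,000 with 10 initial infectious cases.
trajectory = simulate_seir((999_990, 0, 10, 0), beta=0.5, sigma=0.2, gamma=1 / 7, days=60)
```

In practice the parameters would be estimated by fitting the simulated incidence to the observed daily confirmed cases; that fitting step is not shown here.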

GBM is a tree-based machine learning algorithm that can be used for regression and classification problems. GBM consists of weak regression learners and decision trees. The decision tree uses the input value to determine which regression learner is best to make predictions.

Based on the adaptive boosting algorithm, GBM can build a strong regression learner by iteratively combining a set of weak regression learners. GBM uses gradient descent to minimize the loss function of the strong regression learner. Like other boosting algorithms, GBM adds models in a greedy manner [

where _{m}_{m-1} is previous model and _{m}_{m}_{m}

To build GBM, ‘LightGBM’ library was used [
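The boosting idea can be illustrated with a minimal from-scratch sketch. It uses single-split regression stumps on one feature and a squared-error loss, for which the negative gradient is simply the residual; it is not the LightGBM implementation, and the learning rate, number of rounds, and toy data are assumptions chosen for illustration.

```python
def fit_stump(xs, residuals):
    """Least-squares regression stump on one feature (needs two or more distinct x values)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = sum((r - left_mean) ** 2 for r in left)
        sse += sum((r - right_mean) ** 2 for r in right)
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def fit_gbm(xs, ys, n_rounds=50, learning_rate=0.1):
    """Greedy stage-wise boosting: each stump is fitted to the current residuals."""
    base = sum(ys) / len(ys)
    learners = []
    preds = [base] * len(ys)
    for _ in range(n_rounds):
        # Residuals are the negative gradient of the squared-error loss.
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        learners.append(stump)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]

    def predict(x):
        return base + learning_rate * sum(stump(x) for stump in learners)

    return predict


# Toy step-shaped data (placeholder values, not real case counts).
toy_x = list(range(10))
toy_y = [0.0] * 5 + [10.0] * 5
model = fit_gbm(toy_x, toy_y)
```

Each round adds one weak learner and leaves the earlier ones unchanged, which is the greedy construction described above; the learning rate controls how much of each new learner is added to the ensemble.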

To evaluate the above models, RMSEs for the train and test datasets for each of the fitted models were calculated as follows:

where _{t}

The COVID-19 daily confirmed cases of the country were divided into two regions (non-capital and capital) with the total being domestic and analysed using the above models. The data was split into two subsets and used in the training and prediction analysis of the models.
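The two data subsets can be sketched as date-based splits. The date ranges below are those stated for the training and testing periods in this article; the record layout, the function names, and the daily counts are placeholders invented for illustration and are not the actual data.

```python
from datetime import date, timedelta

# Training and testing periods as stated in this article.
SPLITS = {
    "split_1": {
        "train": (date(2020, 1, 20), date(2020, 12, 31)),
        "test": (date(2021, 1, 1), date(2021, 2, 28)),
    },
    "split_2": {
        "train": (date(2020, 1, 20), date(2021, 1, 31)),
        "test": (date(2021, 2, 1), date(2021, 2, 28)),
    },
}


def build_placeholder_records(start, end):
    """Create one record per day with made-up counts (NOT real data)."""
    records = []
    for k in range((end - start).days + 1):
        capital = 10 + k % 7       # placeholder value
        non_capital = 5 + k % 5    # placeholder value
        records.append({
            "date": start + timedelta(days=k),
            "capital": capital,
            "non_capital": non_capital,
            # Domestic is the total of the two regions.
            "domestic": capital + non_capital,
        })
    return records


def subset(records, start, end):
    """Keep the records whose date falls within [start, end]."""
    return [row for row in records if start <= row["date"] <= end]


def make_splits(records):
    """Return the train and test parts for every configured split."""
    return {
        name: {part: subset(records, *bounds) for part, bounds in parts.items()}
        for name, parts in SPLITS.items()
    }


records = build_placeholder_records(date(2020, 1, 20), date(2021, 2, 28))
parts = make_splits(records)
```

With these ranges the first subset has 347 training days and 59 testing days, and the second has 378 training days and 28 testing days.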

As for model evaluation, in

With the second data subset, in the country as a whole, LLR and SEIR had the lowest train RMSE values while NB and GBM had the lowest test RMSE values. In the capital region, LLR and NB had the lowest train RMSE values while NB and GBM had the lowest test RMSE values. In the non-capital region, SEIR and LLR had the lowest train RMSE values while NB and GBM had the lowest test RMSE values.

Therefore, taking into account the lower train and test RMSE values for all regions and both data subsets, we can conclude that the LLR model, GBM, the SEIR model, and then the NB model were the best prediction models for forecasting the COVID-19 situation of Korea. The segmented Poisson model tended to have the highest test RMSE values in all scenarios.

A look at the prediction plots of these models shows that, with the first data subset, the daily COVID-19 confirmed cases will decline in the country (domestic), Seoul metropolitan (capital), and non-Seoul metropolitan (non-capital) areas according to the LLR and NB models, while they will increase and stay constant according to the segmented Poisson and LSTM models, respectively (

The objective of our analysis was to predict the future COVID-19 situation of South Korea using daily confirmed cases. We employed six different models in this analysis, and the models gave somewhat different prediction results for different data subsets and regions. The evaluation of the models using RMSE showed that the LLR, GBM, SEIR, and NB models had the lowest RMSE values, making them the best models, though LSTM gave better RMSE values than the segmented Poisson model. The LLR, GBM, SEIR, NB, and LSTM models mainly predicted a decline in COVID-19 daily confirmed cases in the country and the two regions of Korea. We can reasonably take these results to portray the future situation of the country.

With the first dataset, NB, SEIR, and LLR showed the best test performance in the domestic, capital, and non-capital areas, respectively, while with the second dataset, GBM showed the best test performance for all regions. In the case of the NB model, we found that the coefficient of the confirmed cases of the day before had the largest value. This means that the confirmed cases of the day before have the greatest influence on the prediction of future confirmed cases. In the case of GBM, we could obtain feature importance plots of the model (

Note that the number of daily confirmed cases varied across regions of Korea, so we first fitted these models for each of the regions and compared them with the number of observed cases. However, the comparison of prediction models based on regional data was not convincing due to the small sample sizes. Instead of fitting the models for each region, we considered combining the regions into the capital and non-capital areas, which provided large enough sample sizes. We also tried predicting the number of deaths due to COVID-19. However, the data were not large enough (with only 1,669 deaths as of March 14, 2021) [

One limitation of our study is that the predictions reflect the interventions in place at the time the models were developed. One could therefore argue that government intervention policies influenced the observed results. However, our comparison remains valid because all models reflected the same intervention effects. Indeed, the Korean government has maintained a high level of social distancing with a ban on gatherings of more than five persons [

A good understanding of the epidemic dynamics would greatly enhance the control and prevention of COVID-19 as well as other infectious diseases. Caution is also strongly encouraged when using predictions to support decisions such as a return to work or a lowering of the social distancing level.

Conceptualization: TP. Data curation: KH, GH. Formal analysis: TG, GH, DL, JHL, JL. Funding acquisition: TP. Methodology: TP. Writing - original draft: TG, CA. Writing - review & editing: TP, TG, CA.

Taesung Park serves as an editor of Genomics and Informatics, but has no role in the decision to publish this article. All remaining authors have declared no conflicts of interest.

Supplementary data can be found with this article online at

Breakpoints used for segmented Poisson model

Prediction of the coronavirus disease 2019 (COVID-19) situation for the domestic region with the first data subset. LSTM, long short-term memory.

Prediction of the coronavirus disease 2019 (COVID-19) situation for the capital region with the first data subset. LSTM, long short-term memory.

Prediction of the coronavirus disease 2019 (COVID-19) situation for the non-capital region with the first data subset. LSTM, long short-term memory.

Prediction of the coronavirus disease 2019 (COVID-19) situation for the domestic region with the second data subset. LSTM, long short-term memory.

Prediction of the coronavirus disease 2019 (COVID-19) situation for the capital region with the second data subset. LSTM, long short-term memory.

Prediction of the coronavirus disease 2019 situation for the non-capital region with the second data subset. LSTM, long short-term memory.

Prediction of the coronavirus disease 2019 (COVID-19) situation for the three regions with the first and second data subsets using susceptible exposed infected recovered (SEIR).

Prediction of the coronavirus disease 2019 (COVID-19) situation for the three regions with the first and second data subsets using gradient boosting machine (GBM).

Feature importance of the coronavirus disease 2019 (COVID-19) situation prediction for all regions with the second data subset.

Daily confirmed cases of South Korea. Daily confirmed cases of the capital, non-capital, and domestic regions are represented in red, blue, and green, respectively.

Long short-term memory (LSTM) model architecture. (A) Overall architecture of LSTM. (B) The LSTM block architecture.

RMSE for the regions and models for the two data subsets

| Region | Model | Split 1 train RMSE (Jan 20, 2020‒Dec 31, 2020) | Split 1 test RMSE (Jan 1, 2021‒Feb 28, 2021) | Split 2 train RMSE (Jan 20, 2020‒Jan 31, 2021) | Split 2 test RMSE (Feb 1, 2021‒Feb 28, 2021) |
|---|---|---|---|---|---|
| Domestic | Segmented Poisson | 0.088 | 1194.103 | 0.251 | 16.415 |
| Domestic | Negative binomial | 0.057 | 0.409 | 0.063 | 2.088 |
| Domestic | Local regression | 0.037 | 0.793 | 0.039 | 14.856 |
| Domestic | LSTM | 0.051 | 23.117 | 0.083 | 6.327 |
| Domestic | SEIR | 0.033 | 0.956 | 0.035 | 2.658 |
| Domestic | GBM | 0.022 | 1.507 | 0.082 | 0.591 |
| Capital | Segmented Poisson | 0.075 | 668.199 | 0.235 | 5.312 |
| Capital | Negative binomial | 0.061 | 1.311 | 0.064 | 3.078 |
| Capital | Local regression | 0.042 | 1.135 | 0.046 | 3.846 |
| Capital | LSTM | 0.054 | 14.8 | 0.074 | 3.934 |
| Capital | SEIR | 0.073 | 0.410 | 0.072 | 3.109 |
| Capital | GBM | 0.021 | 1.960 | 0.095 | 0.892 |
| Non-capital | Segmented Poisson | 0.118 | 1131.838 | 0.195 | 34.935 |
| Non-capital | Negative binomial | 0.097 | 0.912 | 0.103 | 1.157 |
| Non-capital | Local regression | 0.074 | 0.522 | 0.076 | 33.232 |
| Non-capital | LSTM | 0.087 | 15.207 | 0.119 | 4.774 |
| Non-capital | SEIR | 0.036 | 0.610 | 0.036 | 1.964 |
| Non-capital | GBM | 0.015 | 0.607 | 0.083 | 0.855 |

RMSE, relative mean squared error; LSTM, long short-term memory; SEIR, susceptible exposed infected recovered; GBM, gradient boosting machine.