Comparison of Nitrogen Dioxide Predictions During a Pandemic and Non-pandemic Scenario in the City of Madrid using a Convolutional LSTM Network

Traditionally, machine learning technologies, with the methods and capabilities available, combined with a geospatial dimension, can perform predictive analyses of air quality with greater accuracy. However, air pollution is influenced by many external factors, one of which has recently arisen from the restrictions applied to curb the relentless advance of COVID-19. These sudden changes in air quality levels can negatively influence current forecasting models. This work compares air pollution forecasts during a pandemic and a non-pandemic period under the same conditions. The ConvLSTM algorithm was applied to predict the concentration of nitrogen dioxide using data from the air quality and meteorological stations in Madrid. The proposed model was applied to two scenarios: pandemic (January-June 2020) and non-pandemic (January-June 2019), each with sub-scenarios based on time granularity (1-h, 12-h, 24-h and 48-h) and combinations of features. The Root Mean Square Error was taken as the estimation metric, and the results showed that the proposed method outperformed a reference model and that the feature selection technique significantly improved the overall accuracy.


Introduction
Many studies have confirmed the effectiveness of machine learning technologies, for instance, for analyzing time series with recurrent neural networks.1,2 Because forecasting air quality can be viewed as a time series analysis, all applied time series analysis methods and algorithms can also be used to forecast air quality. In addition to forecasting along the time axis, it is also important to know the air quality value in places with no stations. Several authors have focused on the spatial factor in their studies.3,4 In order to capture spatiotemporal patterns more efficiently and make very accurate predictions, various studies have suggested using Convolutional LSTM (ConvLSTM) to predict rainfall,5 traffic accidents6 and air quality,7 amongst other things. The importance of forecasting air quality and handling its consequences is growing day by day and remains the center of governmental and scientific attention. Studies show that short-term and long-term exposure to air pollutants causes about seven million deaths annually.8 Various approaches and control measures have been implemented to reduce the concentration of these pollutants. Having information about future concentrations in advance can prompt decision-makers to implement strategies that decrease them. To improve the accuracy of forecasts and the choice of model, it is also very important to consider the factors that may directly or indirectly affect air quality. One of these factors is the lockdowns imposed due to the coronavirus disease 2019 (COVID-19) pandemic. To control the COVID-19 outbreak, countries adopted severe traffic restrictions and self-quarantine measures,9 which resulted in a decrease in air pollution.10 An example of this was especially apparent in Madrid, where, due to COVID-19 restrictions, the concentration of nitrogen dioxide (NO2) dropped by 62%.11

Taking into account the above information, the main objective of this work is to predict NO2 concentration using ConvLSTM and compare it with LSTM, which, looking at Table 1, can be considered a benchmark model (Table 1 shows publications focused on NO2 prediction and the methods implemented; these results are extracted from the following paper12). The analysis is done for two scenarios: pandemic (January-June 2020) and non-pandemic (January-June 2019), in each of which the following sub-scenarios were defined, based on time intervals (1-h, 12-h, 24-h and 48-h) and feature combinations. The Root Mean Square Error (RMSE) metric was applied to evaluate the results provided by each of the models.

Materials and Methods
The study area of this work was the city of Madrid (Fig. 1). According to the study by Khomenko et al.13 on premature mortality due to air pollution in European cities, which considered PM2.5 and NO2 as the main pollutants, Madrid was found to be at the top of the ranking of European cities with the highest NO2 mortality burden. Therefore, bearing in mind the importance of NO2 for Madrid, it was selected as the air pollutant for predictive analysis.
The datasets used in this work are the NO2 data and meteorological data from January to June 2019 (non-pandemic scenario) and from January to June 2020 (pandemic scenario), together with the locations of the control stations, which were used to generate the grid cells. The data were obtained from the Open Data portal of the Madrid City Council.14 There are 24 air quality control stations and 26 meteorological control stations. The meteorological data include ultraviolet radiation, wind speed, wind direction, temperature, humidity, and precipitation, among other variables.

ConvLSTM is a type of recurrent neural network, similar to LSTM with only one difference: convolution operations are used instead of internal matrix multiplications.30 Having convolutional structures in both the input-to-state and state-to-state transitions makes it possible to consider a spatial factor in addition to a temporal factor. In this analysis, the model architecture was constructed by stacking several ConvLSTM layers combined with dropout and batch normalization layers, and the entire network was finished with a Conv2D layer.
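The convolutional gating that distinguishes ConvLSTM from LSTM can be illustrated with a minimal single-step cell in NumPy. This is a sketch of the standard ConvLSTM formulation (with peephole terms omitted), not the exact architecture trained in this work; the kernel size and channel counts are arbitrary choices for the demonstration.

```python
import numpy as np

def conv2d_same(x, k):
    """'Same'-padded 2D cross-correlation: x (H, W, C_in), k (kh, kw, C_in, C_out)."""
    kh, kw, _, cout = k.shape
    ph, pw = kh // 2, kw // 2
    xp = np.pad(x, ((ph, ph), (pw, pw), (0, 0)))
    H, W = x.shape[:2]
    out = np.empty((H, W, cout))
    for i in range(H):
        for j in range(W):
            out[i, j] = np.tensordot(xp[i:i + kh, j:j + kw], k,
                                     axes=([0, 1, 2], [0, 1, 2]))
    return out

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def convlstm_step(x, h_prev, c_prev, Wx, Wh, b):
    """One ConvLSTM time step: each gate uses convolutions instead of matmuls."""
    z = {g: conv2d_same(x, Wx[g]) + conv2d_same(h_prev, Wh[g]) + b[g] for g in "ifco"}
    i, f, o = sigmoid(z["i"]), sigmoid(z["f"]), sigmoid(z["o"])
    c = f * c_prev + i * np.tanh(z["c"])   # cell state keeps the spatial layout
    h = o * np.tanh(c)                     # hidden state is itself a 2D feature map
    return h, c

# Toy demo on a 20 x 17 grid (the paper's cell layout) with 3 input channels.
rng = np.random.default_rng(0)
cin, chid = 3, 4
Wx = {g: 0.1 * rng.standard_normal((3, 3, cin, chid)) for g in "ifco"}
Wh = {g: 0.1 * rng.standard_normal((3, 3, chid, chid)) for g in "ifco"}
b = {g: np.zeros(chid) for g in "ifco"}
x = rng.standard_normal((20, 17, cin))
h, c = convlstm_step(x, np.zeros((20, 17, chid)), np.zeros((20, 17, chid)), Wx, Wh, b)
```

Because every gate is a feature map rather than a vector, the hidden state preserves the station grid's spatial arrangement, which is what lets the model exploit neighborhood structure.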

Experiments and Results
This section presents a detailed description of the experiments implemented and the results obtained. The overall workflow of the analysis is presented in Fig. 3. It can be seen that the workflow consists of two segments formed on the basis of the software applied: ArcGIS Pro and Google Colab. The first step was to create grid cells in a given area. In this study, the part of Madrid within the following extent was selected: top 4,486,449.725263 m; bottom 4,466,449.725263 m; left 434,215.234430 m; right 451,215.234430 m. The grid cells were created using the Fishnet tool (https://bit.ly/3vUpBxj). Both the cell width and height were 1000 m. There are a total of 340 cells (20 by 17), which cover 340 km2, or 56.27% of the total area of the city of Madrid. The logic behind selecting this area was to choose the minimum extent that included all air quality control stations, to achieve higher accuracy. The value of each cell includes the values of NO2 and the meteorological attributes obtained from the assigned stations at a certain time. A zero value was assigned to cells that did not include any stations. After generating the grid cells, the next step was to export the output as Comma Separated Values (CSV) files, which were used as input in further stages of the analysis. Overall, 4344 and 4368 CSV files were generated, corresponding to every hour during January-June 2019 and January-June 2020, respectively. The above process constitutes the first segment.

The second segment comprises the machine learning techniques applied to the data obtained. In the figure, it can be seen that the machine learning process begins with feature engineering. The following feature engineering techniques were applied in this work: (a) handling outliers: before outliers can be processed they must be detected, and the summary statistics in Table 2 can help detect them. The minimum values of humidity and temperature show that they are outliers. Temperatures below −3 (https://bit.ly/3gOxLD0) and humidity with negative values were considered outliers and replaced with the average of the previous and next values. (b) imputation: as already mentioned, there are 24 air pollution control stations and 26 meteorological stations, which means that only around 8% of the 340 cells have data. In order to solve the problem of missing data, inverse distance weighting was applied. (c) feature selection: a significant step in machine learning analysis is the feature selection process. As already mentioned, only nine features were included. From Table 2, it can be seen that no ultraviolet radiation was recorded for the pandemic period. Therefore, this feature was removed. Also, regarding precipitation, it was found that around 99% of the data were 0, so this feature was also eliminated. For the remaining variables, the mutual information (MI) technique was implemented,31 which measures the mutual dependence between each additional dataset and the target dataset (NO2):

MI(x; y) = Σ_{x_i, y} P(x_i, y) log [ P(x_i, y) / (P(x_i) P(y)) ] = H(x) − H(x|y),

where P(x_i, y) is the joint probability distribution of the two variables, P(x_i) and P(y) are the marginal distributions, H(x) is the entropy of x, and H(x|y) is the conditional entropy.
Figure 4 shows the feature importance scores of the five additional datasets based on MI. It should be noted that wind direction is not included in Fig. 4: wind direction is circular data and needs to be converted for later use (the details are mentioned below). Looking at Fig. 4, it can be seen that wind speed has a higher score than the other variables, so it was chosen for further analysis together with wind direction, given their interconnection. (d) transformation: wind direction, which is circular data, was converted into categorical data with the following categories: north, east, south, west, southwest, northeast, southeast, northwest; by then applying a One Hot Encoder,32 it was used for predictive analysis. (e) data splitting: in this step, independent and dependent datasets were generated based on the time granularity (to predict NO2 at t0 hours ahead based on the data for the previous 24 h, where t0 ∈ {1, 12, 24, 48}). (f) scaling: the ranges of the features can differ considerably from each other, and scaling can be a very useful technique for achieving better results. The input data were normalized with min-max (0-1) normalization.
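The inverse distance weighting used in step (b) can be sketched as follows. This is a minimal NumPy version; the station coordinates and readings are hypothetical, and the power parameter is an assumption, since the paper does not report the exact IDW settings.

```python
import numpy as np

def idw(stations_xy, station_vals, query_xy, power=2.0):
    """Estimate values at query points as distance-weighted averages of station values."""
    # Pairwise distances: (n_queries, n_stations)
    d = np.linalg.norm(query_xy[:, None, :] - stations_xy[None, :, :], axis=2)
    w = 1.0 / np.maximum(d, 1e-12) ** power        # closer stations weigh more
    est = (w * station_vals).sum(axis=1) / w.sum(axis=1)
    hit = d.min(axis=1) < 1e-9                     # query coincides with a station
    est[hit] = station_vals[d.argmin(axis=1)[hit]]
    return est

# Three hypothetical stations (grid coordinates in km) and their NO2 readings.
stations = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
readings = np.array([10.0, 20.0, 30.0])
queries = np.array([[0.0, 0.0], [2.0, 2.0]])
filled = idw(stations, readings, queries)
```

A query that lands exactly on a station returns that station's reading, while empty cells receive a weighted blend of nearby stations, which is how the roughly 92% of cells without a station can be filled.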
After preprocessing the data with the aforementioned techniques, the dataset was split into training and testing sets with a fraction of 0.2: an 80% training set and a 20% testing set. Apart from this split, GridSearchCV with a blocking time series split was applied on the training set for parameter optimization. The blocking time series split was chosen instead of standard cross-validation because it respects the time series structure and prevents leakage from one set to another. In order to reduce the computation time for parameter optimization, GridSearchCV was applied on a sampled dataset, which was generated by sampling the data every 6 h. Table 3 shows the optimized parameters with the options that were tried; the option that was finally selected is indicated in bold.
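The blocking split can be sketched as follows. This is an illustrative pure-NumPy version; the number of blocks and the 80/20 within-block ratio are assumptions, as the paper does not report them.

```python
import numpy as np

def blocking_time_series_split(n_samples, n_blocks, train_frac=0.8):
    """Split a time-ordered index into disjoint contiguous blocks; inside each
    block the earlier part trains and the later part validates, so no fold
    ever sees data from its own future or from another fold."""
    block = n_samples // n_blocks
    folds = []
    for k in range(n_blocks):
        start, stop = k * block, (k + 1) * block
        cut = start + int(train_frac * block)
        folds.append((np.arange(start, cut), np.arange(cut, stop)))
    return folds

# 120 hourly samples split into 5 independent blocks.
folds = blocking_time_series_split(n_samples=120, n_blocks=5)
```

Unlike ordinary k-fold cross-validation, the validation indices of every fold lie strictly after that fold's training indices, and the blocks never overlap, which is what eliminates temporal leakage.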
After parameter optimization, the finalized model was implemented in the following scenarios: (a) including all features; and (b) including only the selected features (NO2, wind speed, and wind direction). RMSE was taken as the evaluation metric, and the EarlyStopping callback was used to prevent overfitting. Table 4 presents the results. It can be seen that feature selection significantly improved the results, given that the presence of many features often prevents the model from generalizing effectively due to the curse of dimensionality, which also applies in this work. Regarding the machine learning algorithms, ConvLSTM outperformed LSTM; in particular, the differences between the two models are more significant in the first scenario than in the second. Regarding the two periods, the pandemic period exceeds the non-pandemic period in most sub-scenarios; however, the difference is not large. Although the variance of the pandemic year is smaller than that of the non-pandemic year, the algorithms are trained and tested on each period separately, which means that during training the models most likely learn and generalize all existing patterns for both periods. In terms of time granularity, the 1-h granularity outperformed the other granularities in all sub-scenarios, but this trend does not hold for the other time granularities, which may be related to the selection of the historical time lags.33 Based on the above findings, it can be concluded that the analysis involving feature selection yields greater accuracy, and that ConvLSTM, being able to convey spatial information in addition to temporal information, has a clear advantage over LSTM, as the final results also show.
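For reference, the evaluation metric itself is straightforward; a minimal sketch (the example values are made up for illustration):

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root Mean Square Error between observed and predicted concentrations."""
    y_true, y_pred = np.asarray(y_true, float), np.asarray(y_pred, float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

# Hypothetical observed vs. predicted NO2 values (ug/m3) for three cells.
error = rmse([40.0, 55.0, 62.0], [42.0, 50.0, 60.0])
```

Because errors are squared before averaging, RMSE penalizes large misses more heavily than mean absolute error, which suits pollution forecasting where occasional large errors matter most.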

Conclusions and Future Work
Taking into account the fact that the concentration of NO2, in addition to seasonal changes, has undergone the impact of COVID-19, this work uses ConvLSTM to compare the results for the pandemic and non-pandemic periods. The analysis was carried out for different time resolutions (1-h, 12-h, 24-h and 48-h) with different feature combinations. The strength of the chosen algorithm is that, in addition to temporal prediction, it can also make predictions in the spatial plane. RMSE was chosen as the assessment metric. The final results showed that the proposed model outperformed LSTM, which can be explained by the ability of ConvLSTM to generalize and transfer spatiotemporal information. In terms of datasets, the analyses performed with the selected features surpassed those performed with all features, due to the problem of high dimensionality. Regarding future work, there are several aspects to be considered. It may be useful to include traffic data, since it plays a decisive role in raising the level of NO2. Another extension of this work could be to apply the analysis to another city and compare the results for different case studies.

Fig. 1. Air quality stations, meteorological stations and grid cell segments on the defined area of the city of Madrid.

Fig. 3. The detailed workflow of the analysis.

Table 1. Publications focused on the prediction of nitrogen dioxide and implemented algorithms.

Table 4. RMSE of ConvLSTM and LSTM for the periods January-June 2019 (non-pandemic) and 2020 (pandemic) in terms of feature combinations and time granularities.