Data set
The study focuses on water quality monitoring data obtained from the national monitoring station in Sanhong Village, Liaohe City. The Liaohe River is the largest tributary of the Xiuhe River and flows through Jing’an County, Yichun City. It is the main river of the prefecture and eventually joins Poyang Lake via the Xiuhe River.
The monitoring dataset spans the period from November 2020 to December 2022, with measurements taken every four hours, totaling 4,700 data points. It includes nine indicators: water temperature (TEMP), pH, dissolved oxygen (DO), potassium permanganate (PP), ammonia nitrogen (TAN), total phosphorus (TP), total nitrogen (TN), electrical conductivity (EC), and turbidity (TUB). This dataset was obtained from the Jiangxi Province Environmental Quality Information Release Platform.
In addition, meteorological data was also collected for Yichun during the same period, including six indicators: temperature, pressure, humidity, wind speed, dew point temperature, and precipitation. This data was obtained from the website “Reliable Prognosis”.
Among various water quality indicators, dissolved oxygen concentration is an important indicator for evaluating water quality^{21}. Therefore, this paper uses dissolved oxygen as the target indicator for model prediction.
Through a series of experiments and evaluations, the optimal number of modes was found to be 4, as it gave the best performance and accuracy during model training. Based on this experimental comparison, this paper uses the EEMD method with four modes to decompose the dissolved oxygen series. Figure 3 shows the waveform of each mode after decomposition for the validation set and test set.
Through autocorrelation experiments, we observed that the three modes IMF1, IMF2, and IMF3 exhibit obvious periodic characteristics, whereas IMF4 retains the trend characteristics inherent in the data.
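The periodicity check described above can be illustrated with a sample autocorrelation function. The following is a minimal sketch under stated assumptions, not the paper's code: the function name `autocorrelation` and both synthetic series are hypothetical stand-ins for real IMFs. With 4-hour sampling, a daily cycle corresponds to a lag of 6 samples.

```python
import math


def autocorrelation(series, lag):
    """Sample autocorrelation of `series` at a non-negative integer `lag`."""
    n = len(series)
    mean = sum(series) / n
    var = sum((v - mean) ** 2 for v in series)
    if var == 0:
        raise ValueError("series is constant; autocorrelation is undefined")
    cov = sum((series[t] - mean) * (series[t + lag] - mean) for t in range(n - lag))
    return cov / var


# Synthetic stand-in for a periodic IMF: one cycle every 6 samples
# (one day at 4-hour sampling). This is illustrative data, not the paper's.
periodic = [math.sin(2 * math.pi * t / 6) for t in range(600)]

# Synthetic stand-in for a trend-like IMF: a slow linear drift.
trend = [0.01 * t for t in range(600)]
```

For the periodic series the autocorrelation oscillates with the lag (strongly negative at half a period, strongly positive at a full period), whereas for the trend series it stays close to 1 and decays slowly. This is the kind of contrast that would separate periodic modes from a trend mode.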
(i) Handling missing values and outliers
During analysis, the data were found to contain missing values and outliers caused by equipment maintenance and failures during data collection.
For indicators with a significant number of consecutive missing values, linear interpolation is used to fill the gaps according to the following formula:
$$\begin{aligned} \varphi \left( x \right) = \frac{x - x_{1} }{x_{0} - x_{1}} y_{0} +\frac{x - x_{0} }{x_{1} - x_{0}} y_{1} \end{aligned}$$
(6)
where \(x\) represents time and \(\varphi \left( x \right)\) represents the estimate at time \(x\). The coordinates \(x_{0}\) and \(y_{0}\) represent the first known data point, and \(x_{1}\) and \(y_{1}\) represent the second known data point.
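Equation (6) can be sketched in a few lines. This is an illustrative example rather than the authors' implementation: the helper names `linear_interpolate` and `fill_gaps` are hypothetical, missing values are assumed to be marked with `None`, and gaps at the very start or end of a series are left unfilled because they have no bracketing known points.

```python
def linear_interpolate(x, x0, y0, x1, y1):
    """Evaluate Eq. (6): the line through (x0, y0) and (x1, y1) at x."""
    if x0 == x1:
        raise ValueError("x0 and x1 must differ")
    return (x - x1) / (x0 - x1) * y0 + (x - x0) / (x1 - x0) * y1


def fill_gaps(times, values):
    """Fill None entries that lie between two known values.

    Leading or trailing gaps have no bracketing points and stay None.
    """
    filled = list(values)
    known = [i for i, v in enumerate(values) if v is not None]
    for left, right in zip(known, known[1:]):
        for i in range(left + 1, right):
            filled[i] = linear_interpolate(
                times[i], times[left], values[left], times[right], values[right]
            )
    return filled


# Hypothetical readings at 4-hour intervals (hours since start).
times = [0, 4, 8, 12, 16]
values = [8.0, None, None, 6.5, None]
filled = fill_gaps(times, values)
```

In this example the two interior gaps are filled with values on the straight line between 8.0 and 6.5, and the trailing gap is left as `None`.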
(ii) Normalization
Because water quality metrics have distinct scales, each metric is normalized using the following formula for optimal model training.
$$\begin{aligned} x^{\prime} =\frac{x - \min (x)}{\max (x) - \min (x)} \end{aligned}$$
(7)
Here \(x\) is the original data to be normalized, and \(x^{\prime}\) is the normalized data, whose value range is [0, 1]. \(\max (x)\) and \(\min (x)\) are the maximum and minimum values in the dataset, respectively.
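Equation (7) can be illustrated as follows. This is a minimal sketch; the function names are hypothetical, and the inverse transform is included only as an assumption about how predictions might be mapped back to original units, which the text does not describe.

```python
def min_max_normalize(values):
    """Scale values to [0, 1] as in Eq. (7). Returns (scaled, lo, hi)."""
    lo, hi = min(values), max(values)
    if hi == lo:
        raise ValueError("all values are equal; min-max scaling is undefined")
    scaled = [(v - lo) / (hi - lo) for v in values]
    return scaled, lo, hi


def min_max_denormalize(scaled, lo, hi):
    """Invert Eq. (7) to map scaled values back to the original units."""
    return [s * (hi - lo) + lo for s in scaled]


# Hypothetical indicator values.
scaled, lo, hi = min_max_normalize([2.0, 4.0, 6.0])
restored = min_max_denormalize(scaled, lo, hi)
```

When a dataset is split into training, validation, and test sets, a common choice is to compute the minimum and maximum on the training set only and reuse them for the other sets. Whether the paper does this is not stated.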
(iii) Correlation analysis
To investigate the importance of each indicator in the prediction process, a correlation analysis was performed on the data and the correlation heat map is shown in Figure 4.
After EEMD decomposition, it is evident that the correlation between dissolved oxygen and various indicators such as temperature, electrical conductivity, ammonia nitrogen, and total nitrogen increases.
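A correlation analysis of this kind can be sketched as below. The text does not name the correlation measure, so the Pearson coefficient is an assumption here, and the function names and feature labels are hypothetical. The second function ranks candidate indicators by absolute correlation with the target, which is one plausible way to screen inputs.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    if len(xs) != len(ys):
        raise ValueError("sequences must have the same length")
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    if sx == 0 or sy == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / (sx * sy)


def top_k_correlated(target, candidates, k):
    """Return the k (name, r) pairs with the largest |r| against `target`."""
    scored = [(name, pearson(values, target)) for name, values in candidates.items()]
    scored.sort(key=lambda item: abs(item[1]), reverse=True)
    return scored[:k]


# Synthetic illustration only; these are not values from the dataset.
target = [1.0, 2.0, 3.0, 4.0, 5.0]
candidates = {
    "feature_a": [2.0, 4.0, 6.0, 8.0, 10.0],  # perfectly positively correlated
    "feature_b": [5.0, 4.0, 3.0, 2.0, 1.0],   # perfectly negatively correlated
    "feature_c": [1.0, -1.0, 1.0, -1.0, 1.0],  # uncorrelated with the target
}
selected = top_k_correlated(target, candidates, k=2)
```

Ranking by the absolute value keeps strongly negative relationships, which can be as informative for prediction as strongly positive ones.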
Determination of model parameters
In this paper, grid search is used to optimize model parameters. Parameters are adjusted one at a time while the others are held fixed, and grid search is used to fine-tune each of them. After iterating these steps, the optimized model parameters are obtained, as shown in Table 1.
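The one-parameter-at-a-time procedure can be sketched as a coordinate-wise grid search. This is an illustrative interpretation of the description, not the authors' code: the function name, the parameter names `hidden_size` and `learning_rate`, the candidate values, and the toy scoring function are all hypothetical. In practice the score would be a validation error from training the model.

```python
def one_at_a_time_search(score_fn, grid, initial):
    """Tune one parameter at a time while holding the others fixed.

    `score_fn` maps a parameter dict to a score where lower is better.
    Sweeps over all parameters repeat until a full sweep brings no improvement.
    """
    best = dict(initial)
    best_score = score_fn(best)
    improved = True
    while improved:
        improved = False
        for name, candidates in grid.items():
            for value in candidates:
                trial = dict(best)
                trial[name] = value
                score = score_fn(trial)
                if score < best_score:
                    best, best_score, improved = trial, score, True
    return best, best_score


def toy_score(params):
    """Hypothetical stand-in for a validation error; minimum at (64, 0.01)."""
    return (params["hidden_size"] - 64) ** 2 + 1e4 * (params["learning_rate"] - 0.01) ** 2


grid = {"hidden_size": [16, 32, 64, 128], "learning_rate": [0.1, 0.01, 0.001]}
initial = {"hidden_size": 16, "learning_rate": 0.1}
best_params, best_score = one_at_a_time_search(toy_score, grid, initial)
```

This approach evaluates far fewer configurations than an exhaustive grid over all parameter combinations, at the cost of possibly missing interactions between parameters.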
Experimental evaluation indicators
Mean absolute error (MAE), mean squared error (MSE), mean absolute percentage error (MAPE), and the coefficient of determination \((R^2)\) are used as quantitative metrics to evaluate the predictive performance of the model.
$$\begin{aligned} MAE= & {} \frac{\sum \left| y - {\hat{y}} \right| }{n} \end{aligned}$$
(8)
$$\begin{aligned} MSE= & {} \frac{\sum (y - {\hat{y}} )^{2}}{n} \end{aligned}$$
(9)
$$\begin{aligned} MAPE= & {} \frac{100\%}{n} \sum \left| \frac{{\hat{y}} - y }{y} \right| \end{aligned}$$
(10)
$$\begin{aligned} R^{2}= & {} \frac{\sum ({\hat{y}} - {\bar{y}})^{2} }{\sum (y - { \bar{y}})^{2}} \end{aligned}$$
(11)
where \(y\) is the true value, \({\hat{y}}\) is the predicted value, \({\bar{y}}\) is the mean of the true values, and \(n\) is the number of samples. When comparing models, lower values of MAE, MSE, and MAPE indicate better model performance, and an \(R^2\) value closer to 1 indicates a better model.
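The four metrics in Eqs. (8) to (11) can be written directly from their definitions. The sketch below follows the formulas as given in this section; the function names and sample values are hypothetical.

```python
def mae(y_true, y_pred):
    """Mean absolute error, Eq. (8)."""
    return sum(abs(y - p) for y, p in zip(y_true, y_pred)) / len(y_true)


def mse(y_true, y_pred):
    """Mean squared error, Eq. (9)."""
    return sum((y - p) ** 2 for y, p in zip(y_true, y_pred)) / len(y_true)


def mape(y_true, y_pred):
    """Mean absolute percentage error in percent, Eq. (10).

    Undefined when any true value is zero.
    """
    return 100.0 / len(y_true) * sum(abs((p - y) / y) for y, p in zip(y_true, y_pred))


def r2(y_true, y_pred):
    """R^2 in the ratio form of Eq. (11): explained over total sum of squares."""
    mean = sum(y_true) / len(y_true)
    explained = sum((p - mean) ** 2 for p in y_pred)
    total = sum((y - mean) ** 2 for y in y_true)
    return explained / total


# Hypothetical true and predicted values.
y_true = [2.0, 4.0, 6.0]
y_pred = [2.5, 4.0, 5.5]
```

The ratio form in Eq. (11) coincides with the more widely used \(1 - \sum (y - {\hat{y}})^{2} / \sum (y - {\bar{y}})^{2}\) only under specific conditions, such as an ordinary least-squares fit with an intercept, so values computed with the two forms can differ for other models.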
Experimental design
Dissolved oxygen is selected as the target variable for prediction, and both single-step and multi-step predictions are performed. Based on data correlation analysis, four data combinations are designed, as shown in Table 2:
Based on the above four data combinations, the experiment is designed as follows.
(i) Window size experiment: Examine the effect of window size on the results.
(ii) Model comparison: Compare the proposed model with the mainstream time series prediction models XGBoost, LSTM, GRU, and Informer.
(iii) Correlation experiment: Conduct a multi-step comparative prediction experiment on the four data combinations.
(iv) Ablation experiment: Confirm the role of each module through an ablation experiment.
Experimental results and analysis
In this paper, we conduct related experiments based on the above plan.
(i) Sliding window size experiment: To determine the optimal window size, comparative experiments are performed using window sizes 8 and 48 for XGBoost, LSTM, GRU, and the proposed model.
Based on the experimental results, each model shows low sensitivity to window size. For example, the \(R^2\) of the XGBoost model improved by only 2% when the window size was increased to 48, while the other models gave better prediction results with a window size of 8. Therefore, this paper uses a window size of 8 for subsequent experiments.
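A sliding window turns a time series into supervised samples. The sketch below is an assumption about how such samples might be built and is not taken from the paper: the function name `make_windows` and the parameter `horizon` (the number of steps ahead to predict) are hypothetical, and each target is taken as the single value `horizon` steps after the end of the window.

```python
def make_windows(series, window, horizon):
    """Build (inputs, targets) pairs from a univariate series.

    Each input holds `window` consecutive values. Its target is the value
    `horizon` steps after the last element of the window.
    """
    if window < 1 or horizon < 1:
        raise ValueError("window and horizon must be positive")
    inputs, targets = [], []
    for start in range(len(series) - window - horizon + 1):
        end = start + window
        inputs.append(series[start:end])
        targets.append(series[end + horizon - 1])
    return inputs, targets


# Toy series 0, 1, ..., 19 so that indices and values coincide.
series = list(range(20))
one_step_inputs, one_step_targets = make_windows(series, window=8, horizon=1)
six_step_inputs, six_step_targets = make_windows(series, window=8, horizon=6)
```

With 4-hour sampling, a window of 8 covers 32 hours of history and a window of 48 covers 8 days. A larger window or horizon reduces the number of samples that can be built from a series of fixed length.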
(ii) XGBoost, LSTM, and GRU, which are popular models in the field of time series forecasting, are selected for comparison. Transformer-based models are also widely adopted for time series forecasting. Bryan et al.^{22} introduced the Temporal Fusion Transformer (TFT), which can learn complex relationships between different time scales in time series data. Based on this, Jitha et al.^{23} used the TFT architecture to model and predict river water quality indicators.
Additionally, Zhou et al.^{24} proposed the Informer model for long-term time series forecasting. Therefore, the Informer model is also included in the comparative analysis.
Comparison experiments are performed with step sizes of 1 (4 hours), 6 (1 day), 12 (2 days), and 18 (3 days). The results are shown in Table 3. Optimal results are shown in bold.
The results show that the proposed model consistently achieves the best prediction performance at steps 1, 6, and 12 for combination 1, with \(R^2\) increasing by 5%, 7%, and 5%, respectively, compared with the second-best model. At step 18, the model achieves the second-best result, with a difference of only 0.01 from the optimal value. Introducing weather data (combination 2) slightly improves the predictive performance of all models, and the \(R^2\) values remain relatively consistent across different step sizes. In particular, the proposed model continues to provide the best results at step sizes 1, 6, and 12. At step 18, Informer performs slightly better than the proposed model, which indicates the advantage of Informer in long-term prediction.
As the prediction step size increases, the prediction performance of various models tends to decrease. However, the proposed model consistently achieves the best results across almost all step sizes, demonstrating its effectiveness in predicting dissolved oxygen.
Examining the one-step prediction curve, it is clear that the proposed model fits the actual values better, with a curve that almost overlaps the true values. The curves are shown in Figure 5.
(iii) After correlation analysis, the four most correlated indicators are selected and used with the proposed model for multi-step prediction. The results are shown in Table 4. Optimal values are shown in bold for reference.
It can be seen that the prediction accuracy remains relatively stable even after screening the indicators based on correlation analysis. Specifically, combination 3 achieves the second-best \(R^2\) value for one-step prediction, while combination 4 achieves the best \(R^2\) value for 6-step prediction.
In summary, choosing indicators that are highly correlated with the target reduces the dimensionality of the data without significantly compromising model performance. The proposed model continues to provide robust multi-step dissolved oxygen predictions when using these correlated indicators. This approach allows for more efficient water quality modeling by using fewer but more useful variables, thereby streamlining the modeling process.
(iv) Ablation experiment: To further demonstrate the contribution of individual modules in the proposed model, a corresponding ablation experiment was devised. The results are shown in Table 5. Optimal values are highlighted in bold for clarity.
It is clear that including the CNN module improves the prediction performance at step 1. However, as the step size increases, its influence decreases. In contrast, the introduction of the EEMD decomposition module significantly improves the prediction performance, with both combinations 1 and 2 consistently providing the second-best results across all step sizes. This highlights that EEMD contributes more significantly to improving prediction than the CNN module.