Multivariate Forecasting of Daily Streamflow Using Upstream Hydrological Stations
In the previous sections, we progressively built our hydrological forecasting pipeline.
We first downloaded and cleaned a daily time series corresponding to the discharge of the Mississippi River at St. Louis. We then compared several univariate one-day-ahead (D+1) forecasting approaches. Finally, we constructed a multivariate dataset comprising thirteen stations distributed along the Mississippi and Missouri Rivers.
We can now move on to the next step: exploiting this collection of time series to improve the forecasting of daily discharge at St. Louis.
The underlying idea is straightforward.
When an increase or decrease in streamflow occurs upstream in the basin—for example, in Minnesota along the Mississippi or farther west along the Missouri, it does not affect St. Louis instantaneously. Instead, the hydrological signal propagates progressively downstream.
Upstream stations can therefore provide useful predictive information several days before the corresponding effect is observed at our target station.
The objective of this section is to quantify this contribution precisely.
We will seek to answer three questions:
- Do upstream stations actually improve forecasting performance?
- Which type of model makes the best use of this information?
- How much historical information should be retained?
Scripts used in this section
forecast_discharge_multi.py
General Principle of the Script
The script implements a complete multivariate one-day-ahead (D+1) forecasting pipeline.
The target variable remains:
07010000_daily_data.csv # St. Louis
using the column:
discharge
which we will rename saint_louis
The other twelve CSV files are used as input features.
Each file has exactly the same structure:
date
discharge
and covers the following period:
2010-01-01 → 2025-12-31
All time series are daily, aligned and complete (i.e. with no missing values).
Moreover, they can be combined directly: each date corresponds to exactly one available observation for each of the thirteen stations, making it possible to construct the multivariate dataset immediately.
The script then loads each file individually and merges them into a single DataFrame.
The result is:
Combined dataset:
(5844, 13)
We therefore obtain a dataset containing 5,844 days and 13 hydrological variables.
A Fundamental Constraint: Avoiding Temporal Leakage
In forecasting, the chronological order of the data must be strictly respected.
This constraint is particularly important in machine learning.
Unlike a standard classification problem—where the rows of a dataset can often be shuffled without consequence—a time series has a natural ordering: the past precedes the future, and this ordering itself carries information.
The model must therefore be trained exclusively on data that would actually have been available at the time the forecast was made.
If the goal is to forecast the discharge at St. Louis for day d, the model cannot use a measurement that would not yet have been available on that date.
For example, it would be incorrect to include, among the input features, an observation recorded on day d at another station if that measurement is only published at the end of the day.
In that case, the model would be benefiting from information originating in the future.
This issue is commonly referred to in the literature as data leakage or temporal leakage: the model appears to perform extremely well during testing, but in reality it is exploiting information that would never be available in an operational forecasting setting.
The resulting performance estimates are therefore artificially optimistic and no longer reflect the true forecasting capability of the model.
For this reason, the script enforces a strict rule.
To forecast:
Saint-Louis[d]
the most recent value that may be used from any other station is:
Station[d−1]
All time series are therefore used in lagged form.
This methodological point is fundamental.
It ensures that the forecasting pipeline remains realistic in an operational setting.
Sliding Time Windows
Using a single lagged value is generally not sufficient for forecasting.
The discharge of a river rarely depends only on the level observed the previous day. It also reflects a more gradual dynamic: an increasing or decreasing trend, the propagation of a flood wave, the recession following a peak, or the cumulative influence of several consecutive days.
To capture this behavior, the script uses the principle of sliding time windows.
The idea is straightforward: for each date to be predicted, the model receives not only the most recent available observation but also several days of historical data.
Example with a 2-day window:
St. Louis J−1
St. Louis J−2
Quincy J−1
Quincy J−2
Dubuque J−1
Dubuque J−2
...
With thirteen stations, a 2-day window produces
13 × 2 = 26 features
A larger window allows the model to retain a longer temporal memory.
For example, with a 15-day window, the model receives:
- the last 15 observed values at St. Louis;
- the last 15 observed values at Quincy;
- the last 15 observed values at Dubuque;
- and so on.
This results in:
13 × 15 = 195 features
Each date in the dataset is therefore transformed into a numerical feature vector representing a portion of the basin's recent hydrological history.
The script forecast_discharge_multi.py uses the function build_window_dataset() to
automate this transformation.
It processes the time series chronologically and constructs, for each date:
The time series is processed chronologically so that, for each date, an input matrix X is constructed from the sliding time windows, while the corresponding target value is stored in a target vector y.
In this way, a collection of time series is transformed into a conventional tabular dataset that can be used directly by machine learning algorithms.
The choice of the window size is ultimately an empirical one, since a window that is too short may fail to capture part of the useful signal, whereas a window that is too long introduces additional noise while unnecessarily increasing the dimensionality of the problem.
The script therefore evaluates several window lengths in order to identify the one that provides the best trade-off.
Models Compared
The script compares four different approaches.
The objective is not merely to achieve the best possible evaluation metric, but also to compare several families of models in order to better understand the structure of the problem.
Some approaches rely on very simple assumptions, while others attempt to capture more complex relationships.
Comparing these methods therefore helps answer two questions:
- Which approach provides the most accurate one-day-ahead (D+1) streamflow forecasts?
- And what level of model complexity is actually useful in this hydrological context?
Persistence Baseline
The first benchmark is the simplest possible approach: persistence.
Its principle is:
Q(t) ≈ Q(t−1)
In other words, the forecasted discharge for tomorrow is assumed to be identical to the discharge observed today.
This assumption may seem naïve.
However, in hydrology—as in many time series forecasting problems—it often provides a surprisingly strong baseline: when variations are gradual, the previous day's value is already a good estimator of the next day's discharge.
This baseline therefore plays an important role, as it defines the minimum level of performance that a forecasting model should surpass.
A more sophisticated model that fails to outperform persistence would, in practice, provide no added value.
Linear Regression
The second approach is linear regression.
Its principle is to construct the forecast from all available variables by assigning each of them a weight that is learned automatically during training.
In other words, the model analyzes the historical data and automatically estimates how much importance should be assigned to each available source of information, including recent discharge values observed at St. Louis, measurements recorded at upstream stations such as Quincy and Dubuque, and observations collected along the Missouri River.
Some variables receive higher weights because they are more strongly correlated with the value to be predicted, while others contribute less.
The final forecast is then obtained as the weighted sum of all these contributions.
In simplified form:
Predicted discharge =
weight₁ × St. Louis(D−1)
+ weight₂ × St. Louis(D−2)
+ weight₃ × Quincy(D−1)
+ weight₄ × Dubuque(J−2)
+ . . .
The coefficients (weight₁, weight₂, etc.) are not chose manually.
They are learned automatically during training so as to minimize the error between the model's predictions and the actual observed discharges.
Linear regression offers several advantages in this context: it is fast to train, robust, and easy to interpret.
Moreover, it often performs remarkably well when multiple time series remain strongly correlated with one another, as is frequently the case for hydrological stations located within the same river basin.
XGBoost
The third model is XGBoost.
It belongs to the family of gradient boosting models, which progressively build multiple decision trees to improve forecasting accuracy.
Its operating principle differs from that of linear regression.
Rather than applying a single mathematical combination of the input variables, it incrementally constructs an ensemble of decision trees, with each tree correcting part of the errors made by the previous ones.
The resulting model can therefore learn nonlinear relationships, interactions between stations, local effects, and asymmetric behaviors that would be difficult to represent with a simple linear formula.
In hydrology, this can be particularly useful when a given station strongly influences the discharge at St. Louis only under specific conditions—for example, during a flood event or when several upstream contributions combine.
XGBoost often achieves excellent performance on tabular datasets.
LightGBM
The fourth model is LightGBM.
Like XGBoost, it belongs to the gradient boosting family of algorithms.
The underlying idea is similar: to build a sequence of decision trees that progressively correct the errors made by the preceding ones.
The main differences lie in the learning strategy and the internal optimization techniques.
LightGBM is often faster and more memory-efficient than XGBoost. Moreover, it is highly competitive on large datasets.
Including it here allows us to compare two modern implementations of gradient boosting and determine whether one of them is better able to exploit the temporal structure of the river basin.
In the end, the script compares four different models: the Persistence baseline, Linear Regression, XGBoost, and LightGBM. These span a range of increasing complexity, from a very simple benchmark to two state-of-the-art boosting methods.
This comparison provides a more comprehensive view of the problem and makes it possible to assess whether increasing model complexity leads to a genuine improvement in daily streamflow forecasting.
Temporal Data Split
As in the previous section, the dataset is divided into three distinct chronological periods.
The first corresponds to the training set:
2010-01-01 → 2020-12-31
Using these historical data, the algorithms progressively identify the relationships between the discharge observed at St. Louis, the upstream stations along the Mississippi River, and those located on the Missouri River.
In other words, this period constitutes the primary learning phase for the models.
The second period corresponds to the validation set:
2021-01-01 → 2023-12-31
This phase serves an intermediate purpose.
The model does not learn any new relationships from these data. Instead, the validation set is used to evaluate its behavior on more recent observations that were not seen during training.
In practice, the validation set is used to compare different versions of the forecasting pipeline before the final configuration is fixed.
For example, several different models can be trained on exactly the same historical data and then evaluated to determine which one performs best over the 2021–2023 period. Likewise, different temporal window lengths can be tested: using only the previous two days, then five days, then ten days, and observing which window yields the best results. Similarly, some models have their own internal hyperparameters — for example, the number of trees or their maximum depth in XGBoost — and the validation set allows us to determine which settings perform best.
In other words, this period serves as a testing ground: several plausible options are compared under realistic conditions, and the one that performs best is selected before proceeding to the final evaluation on data that have never been seen before.
Finally comes the test set:
2024-01-01 → 2025-12-31
This final period represents the future from the model's perspective.
It remains completely isolated throughout the entire model development process.
The corresponding data are used neither during model training nor during the intermediate comparisons carried out during the experimentation phase.
In practice, the models are trained on the first period, and the different configurations are then evaluated on the validation set to identify the most effective one.
The test set is used only at the very end, once all methodological choices have been finalized.
It therefore provides an independent evaluation of the forecasting pipeline on data that have never been used before.
The evaluation metrics are computed on this period only after the entire modeling process has been completed.
This principle is fundamental in time series forecasting.
It allows the pipeline to be evaluated under conditions that closely resemble real-world operation: the model is trained on past data, tuned using a more recent period, and finally tested on dates it has never previously encountered.
This approach prevents artificially optimistic performance estimates and provides a more faithful assessment of the model's ability to generalize to future data.
Results
We can now compare the performance of the different models on the validation set and then on the test set.
The complete table produced by the script first confirms that incorporating upstream hydrological stations substantially improves forecasting performance compared with the persistence baseline. It also reveals clear differences depending on both the type of model used and the length of the temporal window.
Persistence Baseline
The persistence baseline assumes that tomorrow's discharge will be identical to the discharge observed today.
On the test set, it achieves:
MAE : 8308
MAPE : 4.43 %
However, as emphasized earlier, this baseline primarily serves as a reference point. Any model that is genuinely useful should be able to outperform it.
Linear Regression
Linear regression achieves the best performance in the benchmark.
Test set:
w2 → MAE 4489
w5 → MAE 4450
w10 → MAE 4424
w15 → MAE 4488
The best result is obtained with a 10-day window:
MAE : 4424
MAPE : 2.73 %
8308 → 4424
representing a reduction of nearly 47% in the mean absolute error.
Another noteworthy point is that the results remain remarkably stable across different window lengths.
Between 2 and 15 days, the variation in MAE is relatively small.
This suggests that most of the useful information is already contained in the first few days of history, while a slightly longer window allows the model to better capture the propagation of the hydrological signal without introducing excessive additional noise.
The consistency between the validation and test results is also reassuring: the model exhibits very similar behavior on both periods.
XGBoost
XGBoost also provides a clear improvement over the persistence baseline.
Test set:
w2 → MAE 6059
w5 → MAE 5939
w10 → MAE 6169
w15 → MAE 5878
Its best result is obtained with:
15 day window → MAE 5878
The model therefore represents a clear improvement over the persistence baseline.
However, it consistently performs worse than linear regression.
Another noteworthy observation is that its performance varies more substantially with the choice of window length.
The model appears to be more sensitive to the temporal parameters, which is a common characteristic of boosting-based approaches.
LightGBM
LightGBM achieves performance comparable to that of XGBoost.
Test set:
w2 → MAE 5903
w5 → MAE 5855
w10 → MAE 5811
w15 → MAE 6119
10-day window → MAE 5811
As with XGBoost, the persistence baseline is clearly outperformed, and the overall results remain strong. However, LightGBM also falls short of the performance achieved by linear regression.
The validation scores are broadly consistent with the test scores, confirming the model's overall stability.
Analysis
The first noteworthy result emerges when comparing the best performance achieved by linear regression in the univariate and multivariate settings.
The univariate reference comes from the script forecast_discharge_mono.py
The best multivariate configuration corresponds to linear regression with a 10-day window, evaluated in forecast_discharge_multi.py.
Comparison
Univariate: Validation MAE ≈ 6899 | Test MAE ≈ 6458
Multivariate : Validation MAE ≈ 4436 | Test MAE ≈ 4424
The improvement is substantial.
The inclusion of upstream hydrological stations therefore leads to a significant enhancement in forecasting performance.
These additional time series clearly contain predictive information that can be exploited to anticipate the discharge at St. Louis.
From a hydrological perspective, this result is entirely consistent: variations observed several hundred kilometers upstream along the Mississippi or the Missouri take time to propagate and gradually influence the discharge measured at St. Louis.
The second important finding concerns the behavior of the models.
Linear regression consistently achieves better performance than both XGBoost and LightGBM.
This suggests that, for this dataset, the dynamics of the river basin remain relatively well structured and that the dominant relationships among stations can be captured effectively by a simple linear model.
In other words, the additional complexity introduced by the boosting models does not translate into any further improvement in forecasting performance.
Finally, the optimal temporal window appears to be around 10 days.
A window that is too short may fail to capture the full propagation dynamics of the hydrological signal.
Conversely, a longer window introduces more redundant information and does not lead to any additional performance gains.
Note
We also explored the use of several algorithms to optimize a station-specific temporal window for each monitoring station. Two approaches are worth mentioning:
1) Independent optimization of each station's window. We first tested a strategy in which the optimal temporal window was determined separately for each station before combining all stations into a single linear regression model. This approach led to a slight improvement in performance (validation MAE: 4252 vs. 4336 with a uniform 10-day window; test MAE: 4381 vs. 4424), but at the cost of a considerably more complex procedure.
2) An evolutionary algorithm. We also developed an evolutionary algorithm to optimize the temporal windows. Although this approach consistently improved performance on the validation set, the resulting window configurations varied substantially across runs, and the gains were not reproduced robustly on the test set.
Given that this additional complexity yields only a modest improvement, we retain a single 10-day window throughout this tutorial. This solution is simpler to implement, easier to reproduce, and provides excellent overall performance.
Conclusion
This stage marks an important step forward in the construction of our forecasting pipeline.
The results show that incorporating upstream hydrological stations significantly improves the quality of the forecasts at St. Louis. The time series observed along the Mississippi and Missouri Rivers clearly provide useful predictive information, enabling more effective anticipation of variations in daily streamflow.
Our experiments also indicate that a temporal window of approximately 10 days offers the best compromise: it is long enough to capture the propagation of the hydrological signal across the river basin while avoiding the introduction of excessive redundant information.
Finally, among the models evaluated, linear regression achieves the best performance. In this particular setting, a simple model is sufficient to exploit the dependencies between stations effectively and outperforms the more complex boosting-based approaches.
In the next section, we will further enrich this forecasting pipeline by incorporating meteorological variables, in order to assess the extent to which atmospheric conditions can provide additional predictive power beyond hydrological observations alone.