Using Meteorological Features

In the previous section, we constructed a meteorological dataset from ERA5 data.

We will now use it to generate streamflow forecasts that exploit both hydrological observations from the upstream monitoring stations and meteorological information derived from the ERA5 reanalysis.

The objective is not to build a more sophisticated model, but rather to determine whether the addition of meteorological information provides a measurable improvement over a model based solely on hydrological data.

In other words, we will keep exactly the same forecasting pipeline as in the previous section and modify only one component: the model's input features.

Scripts Used in This Section

forecast_discharge_geozones.py

Principle of the Experiment

The forecast_discharge_geozones.py script uses two families of input features.

The hydrological features correspond to the streamflow time series from the various monitoring stations located upstream of St. Louis.

The meteorological features are derived from the CSV files constructed from the ERA5 data.

Because hydrological and meteorological features do not necessarily evolve on the same time scales, the script allows two separate temporal windows to be defined: a hydrological window and a meteorological window.

In this experiment, the hydrological window is fixed at 10 days. This value corresponds to the best configuration obtained in the "Multivariate Models" section.

By contrast, we vary the meteorological window in order to determine which historical depth provides the best forecasting performance.

The windows tested correspond to historical horizons of 1, 2, 3, 4, and 5 days.
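The idea of two separate windows can be sketched as follows. This is a minimal illustration, not the code of forecast_discharge_geozones.py: the function name `build_samples`, the `horizon` parameter (set here to one day ahead), and the toy series are all assumptions made for the example. It uses a single discharge series and a single precipitation series. With several stations or several zones, each one would contribute its own window of values.

```python
def build_samples(discharge, precip, hydro_window, meteo_window, horizon=1):
    """Build (features, target) pairs from two aligned daily series.

    Each sample at day t uses the last `hydro_window` discharge values and
    the last `meteo_window` precipitation values, both ending at day t,
    and targets the discharge at day t + horizon (hypothetical horizon).
    """
    if len(discharge) != len(precip):
        raise ValueError("series must be aligned day by day")
    # First day for which both windows have a full history.
    start = max(hydro_window, meteo_window) - 1
    X, y = [], []
    for t in range(start, len(discharge) - horizon):
        hydro = discharge[t - hydro_window + 1 : t + 1]
        meteo = precip[t - meteo_window + 1 : t + 1]
        X.append(hydro + meteo)  # concatenate the two windows
        y.append(discharge[t + horizon])
    return X, y


# Toy data: 15 days of made-up values, only to show the resulting shapes.
discharge = [float(100 + d) for d in range(15)]
precip = [float(d % 3) for d in range(15)]

X, y = build_samples(discharge, precip, hydro_window=10, meteo_window=3)
print(len(X), len(X[0]))  # 5 13
```

With a 10-day hydrological window and a 3-day meteorological window, each sample holds 13 values. Changing only `meteo_window` leaves the hydrological part of the feature vector untouched, which is what makes the comparison between configurations meaningful.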

For each of these configurations, the model proceeds through the following steps:

constructing the supervised learning dataset;
training a linear regression model on the training set;
evaluating its performance on the validation set;
retraining the model on the combined training and validation sets;
performing a final evaluation on the test set.
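The evaluation protocol listed above can be sketched as follows. This is an illustration under stated assumptions, not the script's implementation: the linear regression is fitted here by ordinary least squares with `numpy.linalg.lstsq`, the split is assumed to be chronological, and the helper names (`fit_linear`, `predict`, `mae`, `run_protocol`) and the synthetic data are hypothetical.

```python
import numpy as np


def fit_linear(X, y):
    """Ordinary least squares with an intercept; returns the coefficients."""
    A = np.column_stack([X, np.ones(len(X))])
    coef, *_ = np.linalg.lstsq(A, y, rcond=None)
    return coef


def predict(coef, X):
    return np.column_stack([X, np.ones(len(X))]) @ coef


def mae(y_true, y_pred):
    return float(np.mean(np.abs(y_true - y_pred)))


def run_protocol(X, y, n_train, n_val):
    """Assumed chronological split: training, then validation, then test."""
    n_fit = n_train + n_val
    X_val, y_val = X[n_train:n_fit], y[n_train:n_fit]
    X_test, y_test = X[n_fit:], y[n_fit:]

    # Fit on the training set, evaluate on the validation set.
    coef = fit_linear(X[:n_train], y[:n_train])
    val_mae = mae(y_val, predict(coef, X_val))

    # Refit on training + validation, evaluate once on the test set.
    coef = fit_linear(X[:n_fit], y[:n_fit])
    test_mae = mae(y_test, predict(coef, X_test))
    return val_mae, test_mae


# Synthetic data with an exactly linear relation, only to check the code runs.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
y = X @ np.array([1.0, -2.0, 0.5, 3.0]) + 10.0

val_mae, test_mae = run_protocol(X, y, n_train=120, n_val=40)
print(val_mae, test_mae)  # both close to zero on this noise-free toy problem
```

Refitting on the combined training and validation sets before the final test gives the model more data without letting the test set influence any choice made earlier.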

Two Levels of Spatial Partitioning

In the previous section, we explained that the ERA5 data can be aggregated using different levels of spatial zoning.

Here, we restrict our analysis to two configurations.

`geoslice = 1`

The entire study area is summarized by a single daily spatial average.

Each day is therefore described by a single meteorological feature: tp (the ERA5 total precipitation variable).

This representation is extremely compact but retains no information about the spatial distribution of precipitation.

`geoslice = 3`

The study area is divided into a 3 × 3 grid, resulting in 9 regions.

Each day is then represented by nine meteorological features:

tp_0_0
tp_0_1
...
tp_2_2

The model therefore has access to a more detailed description of the geographical distribution of precipitation.
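The two levels of partitioning can be illustrated with a short sketch. It is an assumption-laden example rather than the code used to build the CSV files: the function name `zone_means` is hypothetical, the average is an unweighted mean over grid cells, and the convention that the first index is the row (latitude) and the second the column (longitude) is a guess at the naming scheme.

```python
def zone_means(field, geoslice):
    """Average a 2-D precipitation field over a geoslice x geoslice grid.

    `field` is a list of rows (one row per latitude, one value per longitude)
    and must have at least `geoslice` rows and columns. Returns a dict mapping
    feature names to zone averages: "tp" when geoslice == 1, otherwise
    "tp_<row>_<col>".
    """
    n_rows, n_cols = len(field), len(field[0])
    out = {}
    for i in range(geoslice):
        r0, r1 = i * n_rows // geoslice, (i + 1) * n_rows // geoslice
        for j in range(geoslice):
            c0, c1 = j * n_cols // geoslice, (j + 1) * n_cols // geoslice
            cells = [field[r][c] for r in range(r0, r1) for c in range(c0, c1)]
            name = "tp" if geoslice == 1 else f"tp_{i}_{j}"
            out[name] = sum(cells) / len(cells)
    return out


# Toy 6 x 6 field for a single day (made-up values).
field = [[float(6 * r + c) for c in range(6)] for r in range(6)]

print(zone_means(field, 1))        # {'tp': 17.5}
print(len(zone_means(field, 3)))   # 9
print(zone_means(field, 3)["tp_0_0"], zone_means(field, 3)["tp_2_2"])  # 3.5 31.5
```

In this toy field the values increase from one corner to the other. The single average of 17.5 hides that gradient, whereas the nine zone averages (from 3.5 to 31.5) preserve it.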

Running the Script

Reminder: the Baseline Hydrological Model

Before adding the ERA5 features, the best-performing model relied exclusively on the hydrological stations with a 10-day temporal window.

The performance achieved by this model was:

Validation MAE      Test MAE
4436                4424

These values will serve as the reference against which the contribution of the meteorological features will be evaluated.

Results with `geoslice = 1`

We run the following command:

forecast_discharge_geozones.py --geoslice 1

The results are summarized below.

Meteorological window   Validation MAE  Test MAE
1                       4355            4415
2                       4346            4389
3                       4343            4382
4                       4341            4390
5                       4341            4391

A slight improvement over the purely hydrological model can be observed.

On the validation set, the MAE decreases from 4436 to 4341, an improvement of just over 2%.

On the test set, the best configuration (a 3-day meteorological window) achieves an MAE of 4382, compared with 4424 for the model without meteorological features, a reduction of about 1%.

The improvement therefore remains modest.

This result is expected: by summarizing the entire basin with a single daily average, a large portion of the spatial information contained in the ERA5 data is lost.

Results with `geoslice = 3`

We then repeat exactly the same experiment, this time using a grid composed of nine regions.

We run the following command:

forecast_discharge_geozones.py --geoslice 3

The results are as follows:

Meteorological window   Validation MAE  Test MAE
1                       4236            4174
2                       4243            4167
3                       4286            4176
4                       4324            4200
5                       4335            4206

The improvement is clearly more pronounced.

The best result on the validation set is obtained with a 1-day meteorological window:

Model                           Validation MAE
Hydrological only                   4436
Hydrology + ERA5 (geoslice=3)       4236

On the test set, the best configuration is obtained with a 2-day meteorological window:

Model                           Test MAE
Hydrological only                   4424
Hydrology + ERA5 (geoslice=3)       4167
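The two selections above can be reproduced directly from the `geoslice = 3` table. The snippet below is only a convenience sketch: the dictionary `results` is a hand-copied version of the table, not an output format of the script.

```python
# Meteorological window (days) -> (validation MAE, test MAE),
# copied by hand from the geoslice = 3 table.
results = {
    1: (4236, 4174),
    2: (4243, 4167),
    3: (4286, 4176),
    4: (4324, 4200),
    5: (4335, 4206),
}

best_on_val = min(results, key=lambda w: results[w][0])
best_on_test = min(results, key=lambda w: results[w][1])

print(best_on_val, best_on_test)   # 1 2
print(results[best_on_val][1])     # 4174
```

If the window were chosen on the validation set alone, as is common practice, the 1-day window would be retained and the corresponding test MAE would be 4174, only 7 units above the best test value of 4167.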

Compared with the hydrological-only model, the MAE falls by about 4.5% on the validation set and by just under 6% on the test set, a clearly larger gain than the roughly 1 to 2% obtained with geoslice = 1.
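The percentages quoted in this section follow from the MAE values in the tables. As a quick check, the relative reduction can be computed as below. The helper name `relative_reduction` is introduced here for illustration, and the numbers are copied by hand from the tables.

```python
def relative_reduction(baseline, value):
    """Relative MAE reduction in percent (positive means an improvement)."""
    return 100.0 * (baseline - value) / baseline


baseline = {"validation": 4436, "test": 4424}    # hydrological-only model
geoslice_1 = {"validation": 4341, "test": 4382}  # best values, geoslice = 1
geoslice_3 = {"validation": 4236, "test": 4167}  # best values, geoslice = 3

for name, scores in [("geoslice=1", geoslice_1), ("geoslice=3", geoslice_3)]:
    for split in ("validation", "test"):
        pct = relative_reduction(baseline[split], scores[split])
        print(f"{name:11s} {split:10s} {pct:.1f}%")

# geoslice=1  validation 2.1%
# geoslice=1  test       0.9%
# geoslice=3  validation 4.5%
# geoslice=3  test       5.8%
```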

These results highlight two interesting phenomena.

First, the addition of meteorological information clearly provides complementary predictive information beyond that contained in the streamflow time series alone.

The two families of features do not play the same role. The hydrological stations describe the current state of the river network and already embed the cumulative effects of precipitation that occurred over the previous days. In this sense, they provide the model with a long-term memory of the basin's hydrological response.

By contrast, the ERA5 precipitation features mainly supply information about the most recent meteorological conditions that have not yet been fully reflected in the observed streamflows. Rather than replacing the hydrological observations, they complement them by capturing short-term forcing capable of influencing the discharge over the following days.

Second, the way this information is represented is important.

When the entire study area is summarized by a single average (geoslice = 1), the improvement remains modest.

By contrast, when part of the spatial information is preserved by dividing the basin into nine regions (geoslice = 3), the forecasting performance improves substantially.

In other words, it is not only the total amount of precipitation that is informative, but also its geographical distribution.

Conclusion

The integration of ERA5 features confirms that meteorological data can usefully complement hydrological observations in a streamflow forecasting task.

The most interesting result is not merely the improvement in MAE, but the fact that a relatively simple spatial representation—a partition into only nine regions—is already sufficient to capture additional predictive information.

This approach provides an effective compromise between two extremes: using a single average over the entire basin, which oversimplifies the problem, or using the thousands of ERA5 grid points directly, which would result in far too many features for the scope of this tutorial.

Thus, without modifying the learning model itself, a simple enrichment of the input features leads to a noticeable improvement in forecasting performance.