
Integrating Upstream Hydrological Stations

In the previous section, we forecast the daily discharge of the Mississippi River at St. Louis using only the local time series observed at that station.

This univariate approach already produced strong results. However, it ignores a key piece of hydrological information: what is happening upstream of the observation point.

For a large river such as the Mississippi, the discharge measured at St. Louis is partly influenced by water volumes that have already been observed several days earlier at stations located farther north. A flood wave or a gradual increase in discharge does not propagate instantaneously; it takes time to travel hundreds of kilometers downstream.

In other words, some upstream stations may contain leading information that can be exploited to forecast the future discharge at St. Louis.

The objective of this section is therefore to build a multivariate hydrological dataset by combining observations from several monitoring stations located along the Mississippi River and the Missouri River, its main tributary. To ensure consistency, only stations providing sufficiently long daily time series and jointly covering the following period are considered:

2010-01-01 → 2025-12-31

The goal at this stage is not to perform forecasting.

Instead, this section focuses on hydrological data mining: identifying the most relevant monitoring stations, retrieving the available time series, and assembling a consistent collection of time series that can be used in the remainder of the tutorial.


Note

Reading the remainder of this section is optional.

It describes the most time-consuming part of the workflow: identifying relevant monitoring stations and preparing the raw data.

If your primary interest is modeling, you may skip directly to the next section.

However, this step illustrates a practical methodology for building a real-world multivariate dataset from open data sources, a process that is often longer and more complex than training the models themselves.


Scripts used in this section

  • get_discharge_stations.py
  • station_download_daily.py
  • generate_series.py

Step 1: Identifying Relevant Monitoring Stations

The USGS provides access to several hundred thousand environmental monitoring sites.

The first challenge is therefore to narrow this search space by retaining only stations that are relevant to our study. In practice, this means selecting stations located along the river or one of its major tributaries, measuring streamflow, and providing a sufficiently complete historical record.

To automate this step, we developed the following script:

get_discharge_stations.py

Role of the get_discharge_stations.py script

The script searches for hydrological monitoring stations that are located along a specified river and within a user-defined geographic bounding box.

It relies on the Python library dataretrieval to query USGS web services.

Once the stations have been retrieved, the script generates two separate CSV files, grouped according to the operating agency.

Why two files?

Because the stations are not all managed by the same organization.

For our purposes, two categories are particularly important.

USGS Stations

USGS stations generally have an identifier beginning with:

USGS-

For each of these stations, the script automatically queries the seriesCatalogOutput service in order to determine whether the streamflow variable (00060) is available, the period over which it is available, and the continuity of the corresponding record.

Only stations confirmed to provide streamflow (discharge) measurements are retained.

These stations constitute our primary data source.
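
The availability check can be sketched as follows. This is a minimal illustration, not the code of get_discharge_stations.py: the function name covers_study_period, the dictionary keys parm_cd, begin_date and end_date, and the example entries are all assumptions made for the example. It only tests whether a streamflow record spans the study period; it does not assess continuity.

```python
from datetime import date

STREAMFLOW_CODE = "00060"          # USGS parameter code for streamflow (discharge)
PERIOD_START = date(2010, 1, 1)    # study period used in this tutorial
PERIOD_END = date(2025, 12, 31)


def covers_study_period(catalog_rows, start=PERIOD_START, end=PERIOD_END):
    """Return True if any streamflow entry spans the whole study period.

    `catalog_rows` is a list of dicts with the assumed keys 'parm_cd',
    'begin_date' and 'end_date' (dates given as ISO strings).
    """
    for row in catalog_rows:
        if row["parm_cd"] != STREAMFLOW_CODE:
            continue  # not a streamflow series
        begin = date.fromisoformat(row["begin_date"])
        finish = date.fromisoformat(row["end_date"])
        if begin <= start and finish >= end:
            return True
    return False


# Made-up catalog entries, for illustration only
example_catalog = [
    {"parm_cd": "00065", "begin_date": "1990-10-01", "end_date": "2025-12-31"},
    {"parm_cd": "00060", "begin_date": "2005-03-15", "end_date": "2025-12-31"},
]
print(covers_study_period(example_catalog))
```

A station whose streamflow record starts after 2010-01-01 or ends before 2025-12-31 would be rejected by this test.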

USACE Stations

Some stations are operated by the U.S. Army Corps of Engineers (USACE).

Their identifiers generally begin with:

USCE-

These stations are not covered by the seriesCatalogOutput API.

Therefore, the script exports them without automatic validation so that they can be reviewed manually.


Output Files

For a given river, the script generates:

<river>_USGS_stations_discharge.csv

A list of USGS stations confirmed to provide streamflow measurements.

and

<river>_USCE_stations.csv

A raw list of USACE stations found within the study area.
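
The split by operating agency can be sketched as follows. The helper names (split_by_agency, output_names, write_station_list), the single station_id column, and the example identifiers (built by prefixing codes quoted in this section) are assumptions for illustration; the real script also validates USGS stations before writing them, which this sketch skips.

```python
import csv
import io


def split_by_agency(station_ids):
    """Group identifiers by agency prefix (the text before the first '-')."""
    groups = {"USGS": [], "USCE": []}
    for station_id in station_ids:
        prefix, _, _ = station_id.partition("-")
        if prefix in groups:
            groups[prefix].append(station_id)
    return groups


def output_names(river):
    """Build the two output file names described above for a given river."""
    return f"{river}_USGS_stations_discharge.csv", f"{river}_USCE_stations.csv"


def write_station_list(handle, station_ids):
    """Write one identifier per row under a 'station_id' header (assumed layout)."""
    writer = csv.writer(handle)
    writer.writerow(["station_id"])
    writer.writerows([station_id] for station_id in station_ids)


# Illustrative identifiers only
example_ids = ["USGS-05331000", "USCE-395556091245801", "USGS-05420500"]
groups = split_by_agency(example_ids)
usgs_name, usce_name = output_names("mississippi")

buffer = io.StringIO()  # in-memory stand-in for the USGS output file
write_station_list(buffer, groups["USGS"])
print(usgs_name, len(groups["USGS"]))
print(usce_name, len(groups["USCE"]))
```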


Selected Stations on the Mississippi River

After manually reviewing the files generated for the Mississippi River, we selected the following five USGS stations:

Station                 Code
St. Paul (Minnesota)    05331000
Hastings (Minnesota)    05331580
Clinton (Iowa)          05420500
Keokuk (Iowa)           05474500
Grafton (Illinois)      05587450

We then add two USACE stations:

Station                 Code
Quincy (Illinois)       395556091245801
Dubuque (Iowa)          422958090391101

The selection of these latter stations requires manual verification through the portal:

https://rivergages.mvr.usace.army.mil/WaterControl/new/layout.cfm

In practice, some pages have similar names but do not all provide streamflow measurements.

For example:

  • for Quincy, we select Mississippi River at Lock and Dam 21;
  • for Dubuque, we select Mississippi River at Lock and Dam 11.

Why These Stations?

The selected stations span several hundred kilometers upstream of St. Louis.

This approach captures the gradual propagation of streamflow variations along the Mississippi River while simultaneously providing intermediate observation points that track the river's evolution before it reaches St. Louis.

The resulting hydrological sequence is therefore:

St. Paul
→ Hastings
→ Dubuque
→ Clinton
→ Keokuk
→ Quincy
→ Grafton
→ St. Louis

Each station provides information that is shifted in time relative to St. Louis.

Depending on the hydrological conditions, this information may become predictive of streamflow over the following days.
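
One simple way to look for such a time shift is to correlate the upstream series with the downstream series delayed by 0, 1, 2, ... days and keep the delay that gives the highest correlation. The sketch below uses synthetic data: the 3-day delay is built into the example and is not a measured travel time, and the function names pearson and best_lag are ours.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def best_lag(upstream, downstream, max_lag):
    """Return (lag, scores): the delay in days that maximizes the correlation.

    For each lag, upstream[t] is paired with downstream[t + lag].
    """
    scores = {}
    for lag in range(max_lag + 1):
        x = upstream[: len(upstream) - lag]
        y = downstream[lag:]
        scores[lag] = pearson(x, y)
    return max(scores, key=scores.get), scores


# Synthetic example: the downstream series is the upstream one delayed by 3 days
upstream = [100 + 50 * math.sin(2 * math.pi * t / 30) for t in range(120)]
downstream = [upstream[0]] * 3 + upstream[:-3]

lag, scores = best_lag(upstream, downstream, max_lag=10)
print(lag)
```

On real discharge series the peak is usually less sharp, and the best delay can vary with hydrological conditions, which is why the models in the following sections are left to learn the relationship themselves.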


Selected Stations on the Missouri River

It is also necessary to incorporate the Mississippi's main tributary: the Missouri River.

Its confluence with the Mississippi is located just upstream of St. Louis.

Ignoring its variations would mean overlooking a significant portion of the incoming streamflow.

We therefore select the following stations:

Station Code
Omaha 06610000
Kansas City 06893000
Boonville 06909000
Hermann 06934500
St. Charles 06935965

These stations make it possible to track the dynamics of the Missouri River from upstream locations to the immediate vicinity of its confluence with the Mississippi.

They naturally complement the stations located along the Mississippi River.


Step 2: Generating the Time Series

Now that the stations have been selected, the next step is to generate the corresponding daily time series.

This process consists of three stages:

  1. downloading the USGS station data;
  2. manually retrieving the USACE station data;
  3. harmonizing all the data into a single, consistent format.

Downloading USGS Station Data

The automatic download is performed using:

station_download_daily.py

This script downloads the daily data for a specified USGS station and exports it as a JSON file.

In the code, you simply need to modify the station identifier:

station_id = "05331000"

The script queries the USGS API, downloads the daily observations, detects missing values, identifies any gaps in the record, and finally exports the resulting dataset to:

data/rivers/<station_id>_daily_data.json
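
For readers who want to see what such a request can look like, here is a hedged sketch that talks to the USGS daily-values web service directly. It is not the code of station_download_daily.py, which may rely on dataretrieval instead. The endpoint, the query parameters, and the JSON layout reflect our understanding of that service and should be checked against the current USGS documentation; the function names and the sample payload are made up for the example.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

DV_ENDPOINT = "https://waterservices.usgs.gov/nwis/dv/"  # assumed endpoint


def build_dv_url(station_id, start, end):
    """Build a daily-values request for mean streamflow at one station."""
    query = {
        "format": "json",
        "sites": station_id,
        "parameterCd": "00060",  # streamflow (discharge)
        "statCd": "00003",       # daily mean
        "startDT": start,
        "endDT": end,
    }
    return DV_ENDPOINT + "?" + urlencode(query)


def fetch_payload(url):
    """Download and decode the JSON response (network access required)."""
    with urlopen(url) as response:
        return json.load(response)


def parse_daily_values(payload):
    """Extract {ISO date: discharge or None} from a decoded payload."""
    series = payload["value"]["timeSeries"]
    if not series:
        return {}
    no_data = series[0]["variable"]["noDataValue"]
    records = {}
    for point in series[0]["values"][0]["value"]:
        value = float(point["value"])
        # The service flags missing days with a sentinel value
        records[point["dateTime"][:10]] = None if value == no_data else value
    return records


# Tiny made-up payload that mimics the assumed layout
sample_payload = {
    "value": {"timeSeries": [{
        "variable": {"noDataValue": -999999.0},
        "values": [{"value": [
            {"value": "10000", "dateTime": "2010-01-01T00:00:00.000"},
            {"value": "-999999", "dateTime": "2010-01-02T00:00:00.000"},
        ]}],
    }]}
}
print(build_dv_url("05331000", "2010-01-01", "2025-12-31"))
print(parse_daily_values(sample_payload))
```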

The resulting file is still a raw time series.

It may contain a small number of missing observations, which will be handled in the next step.
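
Gap detection on a daily record can be sketched as follows. The in-memory representation (a dict mapping each date to a discharge value, with None or an absent key meaning "missing") and the function name find_gaps are assumptions; the layout of the JSON file written by the script is not described here.

```python
from datetime import date, timedelta

ONE_DAY = timedelta(days=1)


def find_gaps(observations, start, end):
    """Return a list of (first_missing_day, last_missing_day) tuples.

    A day counts as missing when it is absent from `observations`
    or when its value is None.
    """
    gaps = []
    gap_start = None
    day = start
    while day <= end:
        missing = observations.get(day) is None
        if missing and gap_start is None:
            gap_start = day                      # a new gap opens
        elif not missing and gap_start is not None:
            gaps.append((gap_start, day - ONE_DAY))  # the gap closed yesterday
            gap_start = None
        day += ONE_DAY
    if gap_start is not None:
        gaps.append((gap_start, end))            # gap runs to the end of the period
    return gaps


# Made-up record: 4-5 January are absent, 9 January is present but empty
observations = {date(2010, 1, d): 1000.0 + d for d in (1, 2, 3, 6, 7, 8, 10)}
observations[date(2010, 1, 9)] = None

for first, last in find_gaps(observations, date(2010, 1, 1), date(2010, 1, 10)):
    print(first, "->", last)
```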


Downloading USACE Station Data

For USACE stations, the data retrieval process remains manual.

The portal:

https://rivergages.mvr.usace.army.mil/WaterControl/new/layout.cfm

provides the observations through HTML pages rather than through a simple API comparable to that of the USGS.

The corresponding pages are therefore saved locally in:

data/rivers/
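
Reading values back from a saved page can be sketched with the standard library alone. The actual markup of the USACE pages is not described here, so everything about the layout is an assumption: a table whose rows hold a date in MM/DD/YYYY format followed by a flow value. The class and function names, and the sample page with its values, are made up.

```python
from datetime import datetime
from html.parser import HTMLParser


class TableRowParser(HTMLParser):
    """Collect the text of every <td>/<th> cell, grouped by <tr> row."""

    def __init__(self):
        super().__init__()
        self.rows = []
        self._row = None
        self._cell = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def extract_daily_discharge(html_text, date_format="%m/%d/%Y"):
    """Return {date: discharge} for rows that look like (date, number)."""
    parser = TableRowParser()
    parser.feed(html_text)
    series = {}
    for row in parser.rows:
        if len(row) < 2:
            continue
        try:
            day = datetime.strptime(row[0], date_format).date()
            series[day] = float(row[1].replace(",", ""))
        except ValueError:
            continue  # header rows and non-numeric cells are skipped
    return series


# Made-up page fragment, for illustration only
sample_html = """
<table>
  <tr><th>Date</th><th>Flow (cfs)</th></tr>
  <tr><td>01/01/2010</td><td>85,300</td></tr>
  <tr><td>01/02/2010</td><td>M</td></tr>
  <tr><td>01/03/2010</td><td>84,100</td></tr>
</table>
"""
print(extract_daily_discharge(sample_html))
```

Rows that cannot be parsed are simply dropped, so the corresponding days appear as gaps and are handled during harmonization.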

Final Harmonization

The final step uses:

generate_series.py

This script scans all files located in data/rivers/ and processes both JSON and HTML inputs to reconstruct a complete daily time grid. It then detects missing observations, fills any gaps by linear interpolation, and exports the final result as a CSV file.
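
The grid reconstruction and gap filling can be sketched as follows. This is an illustration under stated assumptions, not the code of generate_series.py: observations are held in a dict keyed by date, interior gaps are filled linearly between the nearest known neighbours, and gaps at either end of the period are filled by repeating the nearest known value (the script's edge behaviour is not documented here).

```python
from datetime import date, timedelta


def daily_grid(start, end):
    """Every calendar day from start to end, both included."""
    return [start + timedelta(days=i) for i in range((end - start).days + 1)]


def interpolate_linear(observations, start, end):
    """Return [(day, value)] for every day of the grid, with gaps filled."""
    grid = daily_grid(start, end)
    values = [observations.get(day) for day in grid]
    known = [i for i, v in enumerate(values) if v is not None]
    if not known:
        raise ValueError("no observations in the requested period")

    # Edges: repeat the nearest known value (assumption)
    for i in range(known[0]):
        values[i] = values[known[0]]
    for i in range(known[-1] + 1, len(values)):
        values[i] = values[known[-1]]

    # Interior gaps: straight line between the two surrounding observations
    for left, right in zip(known, known[1:]):
        step = (values[right] - values[left]) / (right - left)
        for i in range(left + 1, right):
            values[i] = values[left] + step * (i - left)

    return list(zip(grid, values))


# Made-up record with 2-3 January and 6 January missing
observations = {
    date(2010, 1, 1): 100.0,
    date(2010, 1, 4): 130.0,
    date(2010, 1, 5): 140.0,
}
for day, value in interpolate_linear(observations, date(2010, 1, 1), date(2010, 1, 6)):
    print(day.isoformat(), value)
```

Linear interpolation is reasonable for the short gaps mentioned above; long gaps during a flood would deserve a closer look before being filled this way.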

Each generated file contains the two following columns:

date
discharge

over the entire period:

2010-01-01 → 2025-12-31

for a total of:

5,844 rows

per station, one per day: 16 years × 365 days plus 4 leap days (2012, 2016, 2020, 2024). Counting the CSV header line gives 5,845 lines.


Final Result

The data/rivers directory now contains thirteen CSV files:

############ Stations on the Mississippi River ############

07010000_daily_data.csv # St. Louis (reference series)
Quincy.csv
Dubuque.csv
05331000_daily_data.csv # St. Paul, Minnesota
05331580_daily_data.csv # Hastings, Minnesota
05420500_daily_data.csv # Clinton, Iowa
05474500_daily_data.csv # Keokuk, Iowa
05587450_daily_data.csv # Grafton, Illinois

############ Stations on the Missouri River ############

06610000_daily_data.csv # Omaha, Nebraska
06893000_daily_data.csv # Kansas City, Missouri
06909000_daily_data.csv # Boonville, Missouri
06934500_daily_data.csv # Hermann, Missouri
06935965_daily_data.csv # St. Charles, Missouri

We now have a consistent multivariate dataset with a common daily frequency and a common time span. It comprises thirteen hydrological stations located along the Mississippi and Missouri Rivers and contains no missing values.
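
Because every file shares the same two columns and the same dates, the series can be joined on the date column into one wide table. The sketch below uses two tiny in-memory CSVs with made-up values; the function names load_series and merge_series and the column labels st_louis and grafton are ours.

```python
import csv
import io


def load_series(handle):
    """Read a two-column (date, discharge) CSV into {date string: float}."""
    return {row["date"]: float(row["discharge"]) for row in csv.DictReader(handle)}


def merge_series(named_series):
    """Combine {name: {date: value}} into a header and rows on the shared dates."""
    names = list(named_series)
    common = set.intersection(*(set(series) for series in named_series.values()))
    header = ["date"] + names
    # ISO date strings sort chronologically
    rows = [[day] + [named_series[name][day] for name in names] for day in sorted(common)]
    return header, rows


# Made-up values, for illustration only
st_louis_csv = "date,discharge\n2010-01-01,200000\n2010-01-02,198000\n"
grafton_csv = "date,discharge\n2010-01-01,90000\n2010-01-02,91000\n"

header, rows = merge_series({
    "st_louis": load_series(io.StringIO(st_louis_csv)),
    "grafton": load_series(io.StringIO(grafton_csv)),
})
print(header)
for row in rows:
    print(row)
```

With the real files, each of the thirteen stations would contribute one column, and every row would describe the state of both rivers on a given day.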

The next section will use these time series to train the first multivariate forecasting models and quantify the improvement provided by upstream hydrological information.