
Building Thermal Aggregates

Why Aggregate Weather Data?

ERA5 data provides extremely detailed meteorological information. For each day, it includes temperature, HDD, CDD, HDD², and CDD² values for several thousand geographical locations.

This wealth of information is both a strength and a challenge. In theory, more data should lead to better forecasts. In practice, however, using all these series directly would introduce major complications. The variables are highly redundant, and many consist primarily of noise, which makes the forecasting models unstable and significantly increases the risk of overfitting. To circumvent these issues, we will adopt a much more robust approach by summarizing the overall thermal state of the country using just a handful of synthetic indicators.

This strategy is highly intuitive given our specific objective. Because we are forecasting national electricity consumption, the model needs to capture the overall thermal baseline, widespread cold snaps, and major heatwaves, rather than getting bogged down in local micro-variations.

Building Spatial Aggregates

We will now transform the thousands of ERA5 series into a few synthetic variables.

Spatial Mean Temperature

For each day, we calculate the mean temperature across all sites using the following equation: \(T_{\text{mean}}(t) = \frac{1}{N} \sum_{i=1}^{N} T_i(t)\)

where:

  • \(N\) is the number of ERA5 points.
  • \(T_i(t)\) is the temperature at site \(i\) on day \(t\).

This variable summarizes the average thermal baseline of the studied territory.
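As a minimal sketch of this formula, assume the daily temperatures are stored in a NumPy array with one row per day and one column per ERA5 point. The array `temps` and its values are made up for illustration and are not taken from the tutorial's dataset. Each site receives the same weight \(1/N\), exactly as in the equation.

```python
import numpy as np

# Hypothetical toy data: 3 days (rows) x 4 ERA5 grid points (columns), in °C.
temps = np.array([
    [2.0, 4.0, 6.0, 8.0],
    [-1.0, 0.0, 1.0, 4.0],
    [10.0, 12.0, 14.0, 16.0],
])

# Average over the site axis (axis=1) to obtain one value per day:
# T_mean(t) = (1/N) * sum_i T_i(t)
t_mean = temps.mean(axis=1)
print(t_mean)
```

The result has one entry per day, so thousands of site-level series collapse into a single daily series.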

Extreme Temperatures

We also calculate the spatial minimum and maximum temperatures:

  • \(T_{\text{min}}(t) = \min_{i} T_i(t)\)
  • \(T_{\text{max}}(t) = \max_{i} T_i(t)\)

These two variables help detect extreme thermal events: \(T_{\text{min}}\) captures the coldest snaps, and \(T_{\text{max}}\) captures the peak heatwaves.

Although national consumption primarily depends on the mean temperature, extreme conditions can trigger major demand spikes.
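The spatial extremes can be sketched the same way, again assuming a hypothetical days-by-sites array named `temps` with invented values. The only change from the mean is the reduction applied along the site axis.

```python
import numpy as np

# Hypothetical toy data: 3 days (rows) x 4 ERA5 grid points (columns), in °C.
temps = np.array([
    [2.0, 4.0, 6.0, 8.0],
    [-1.0, 0.0, 1.0, 4.0],
    [10.0, 12.0, 14.0, 16.0],
])

# Coldest and warmest site on each day (reduction over the site axis).
t_min = temps.min(axis=1)
t_max = temps.max(axis=1)

for day, (lo, hi) in enumerate(zip(t_min, t_max)):
    print(f"day {day}: min={lo:.1f} °C, max={hi:.1f} °C")
```

Note that each value comes from a single site, which may differ from one day to the next.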

Aggregating Derived Thermal Variables

The same averaging procedure is then applied to the derived thermal variables, namely HDD, CDD, HDD², and CDD².

For each of these, we calculate a daily spatial mean:

\(\text{HDD}_{\text{mean}}(t) = \frac{1}{N} \sum_{i=1}^{N} \text{HDD}_i(t)\)

and analogously for the other variables.
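The sketch below illustrates this averaging on toy data. The ERA5 dataset described above already ships per-site HDD and CDD values, so the derivation from temperature shown here is only an assumption made to keep the example self-contained. The base temperature `T_BASE` and the array `temps` are hypothetical. The point to notice is that the squared variables are squared per site and then averaged, which is not the same as squaring the spatial mean.

```python
import numpy as np

# Assumption: one base temperature of 18 °C for both heating and cooling.
# The thresholds used to derive the actual HDD/CDD series may differ.
T_BASE = 18.0

# Hypothetical toy data: 2 days (rows) x 4 grid points (columns), in °C.
temps = np.array([
    [10.0, 14.0, 20.0, 22.0],
    [16.0, 16.0, 16.0, 16.0],
])

# Per-site degree days: only the shortfall (HDD) or the excess (CDD) counts.
hdd = np.maximum(T_BASE - temps, 0.0)
cdd = np.maximum(temps - T_BASE, 0.0)

# Daily spatial means of each derived variable.
hdd_mean = hdd.mean(axis=1)
cdd_mean = cdd.mean(axis=1)

# Squared variants: square per site first, then average over sites.
hdd2_mean = (hdd ** 2).mean(axis=1)
cdd2_mean = (cdd ** 2).mean(axis=1)

print(hdd_mean, cdd_mean, hdd2_mean, cdd2_mean)
```

On the first toy day, the mean of the squared HDD values is 20 while the square of the mean HDD is 9, so the squared aggregates give more weight to the coldest sites.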

Massive Reduction in Complexity

Following this aggregation step, we no longer have to handle tens of thousands of meteorological time series. Instead, all the meteorological information is summarized by just seven synthetic variables: \(T_{\text{mean}}\), \(T_{\text{min}}\), \(T_{\text{max}}\), \(\text{HDD}_{\text{mean}}\), \(\text{CDD}_{\text{mean}}\), \(\text{HDD}^2_{\text{mean}}\), and \(\text{CDD}^2_{\text{mean}}\).

This represents a massive reduction in dimensionality. We shift from a gigantic meteorological feature space to a small set of robust, easily interpretable variables. This simplification generally improves model stability, generalization capability, and overall readability.

Adding Meteorological Lag Memory

Much like electricity consumption itself, the effect of weather on demand exhibits a certain degree of inertia. The effects of a prolonged cold snap typically persist over several consecutive days: buildings cool down progressively, people adjust their behavior only gradually, and heating systems themselves respond with inherent inertia.

Therefore, we will apply the same principle to our weather variables as we did for consumption autoregression: using a rolling time window that looks back at previous days' observations. As with the AR model, we choose a 14-day window. To predict the residual for day \(D\), each observation includes:

  • The 7 thermal variables observed at \(D-14\)
  • Those from \(D-13\)
  • Those from \(D-12\)
  • ...all the way up to \(D-1\).

In other words, the model leverages meteorological data from the 14 days leading up to the day being forecast. This yields \(14 \times 7 = 98\) meteorological features used to predict the residuals of our base model.
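The window construction described above can be sketched as follows. The helper `build_lagged_features` and the random array `daily` are hypothetical names introduced for illustration; they are not taken from scripts/with_meteo.py. The sketch assumes the seven aggregates are stacked in an array with one row per day.

```python
import numpy as np

def build_lagged_features(daily, window=14):
    """Stack the `window` previous days of every variable into one row.

    daily: array of shape (n_days, n_vars), one row per day.
    Returns an array of shape (n_days - window, window * n_vars).
    Row k describes target day D = window + k and holds the values
    observed from D-window to D-1, oldest day first.
    """
    n_days, _ = daily.shape
    rows = [daily[d - window:d].ravel() for d in range(window, n_days)]
    return np.asarray(rows)

# Hypothetical input: 30 days of the 7 thermal aggregates (random values).
rng = np.random.default_rng(0)
daily = rng.normal(size=(30, 7))

X = build_lagged_features(daily, window=14)
print(X.shape)  # (16, 98)
```

The slice `daily[d - window:d]` stops at day \(D-1\), so the weather of the day being forecast never enters its own feature row. The first 14 days have no complete history and therefore produce no row.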

An Intentionally Simple Approach

The strategy chosen for this tutorial is intentionally straightforward. We are not trying to build an overly sophisticated weather model, optimize every single geographical location, or automatically select from hundreds of candidate variables.

Instead, our goal is different: to demonstrate that a simple, physically coherent, and robust thermal representation can already significantly improve an energy forecast. This philosophy is crucial in applied machine learning: in many real-world scenarios, a simpler, more robust, and more interpretable model is highly preferable to an extremely complex solution that struggles to generalize.

Script integrating the meteorological features: scripts/with_meteo.py