Meteorological Data: What Information Is Useful?

Introducing ERA5 variables

At this stage, our model rests on two pillars:

Consumption history,
Calendar structure.

It achieves an MAE of approximately 26,200 MWh, which is already a solid result. However, a decisive factor is still missing: the weather.

Intuitively, the link is obvious. When temperatures drop in the winter, consumption increases due to heating. Conversely, in the summer, consumption peaks are often linked to the use of air conditioning. Other variables, such as cloud cover or wind, can also play a more indirect role.

For instance, we could integrate meteorological features from the ERA5 reanalysis, a product of the Copernicus program that provides consistent estimates of global atmospheric conditions. These data offer the advantage of being comprehensive, homogeneous over time, and easy to use.

To keep this within the scope of a tutorial and avoid over-complicating the analysis, we will limit ourselves to a single ERA5 variable: the daily mean temperature (T_mean). We will then calculate four thermal variables derived from T_mean.

Downloading ERA5 Data

To access ERA5 data, simply create a free account on Copernicus (https://accounts.ecmwf.int/auth/realms/ecmwf/login-actions/registration).

The data can be downloaded from: https://cds.climate.copernicus.eu/datasets/derived-era5-single-levels-daily-statistics?tab=overview

Next, click on the "Download" tab and follow these steps:

Variable: Select the "2m_temperature" checkbox.

Year: Choose the desired year (note: only one year can be downloaded at a time).

Month & Day: In the "Month" section, click "Select All." Do the same in the "Day" section.

Daily Statistic: Leave it as "Daily mean" (as we are specifically interested in the daily average).

Frequency: Select "1 Hourly."

Geographical Area: Select "Sub-region extraction" and enter the desired coordinates (in this case: North 63.0°, South 36.50°, West -15°, and East 18.25°).

Finally, click "Submit form." Repeat this process for each required year (13 times in our case).

In the end, we obtain 13 files in NetCDF format. These files contain the history of daily mean temperatures for each grid point over one year. They will be used to build our reference database (see the following sections).
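As a quick sanity check on the requested area, the number of grid points can be computed from the bounding box. The sketch below assumes the regular 0.25° ERA5 grid and a box spanning latitudes 36.50 to 63.0 and longitudes -15 to 18.25; the function name count_grid_points is illustrative.

```python
def count_grid_points(south, north, west, east, step=0.25):
    """Count regular-grid points in a bounding box, both edges included."""
    n_lat = round((north - south) / step) + 1
    n_lon = round((east - west) / step) + 1
    return n_lat, n_lon, n_lat * n_lon


n_lat, n_lon, n_sites = count_grid_points(south=36.5, north=63.0, west=-15.0, east=18.25)
print(n_lat, n_lon, n_sites)  # 107 134 14338
```

Under these assumptions the box contains 14,338 grid points, which matches the number of ERA5 sites used later in this tutorial.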

Derived Thermal Variables

To understand how weather influences electricity consumption, we cannot simply use raw temperature in a linear fashion. Indeed, the impact of temperature on consumption differs significantly depending on whether it is 50°F, 68°F, or 86°F (10°C, 20°C, or 30°C). To address this, we use the concepts of Heating Degree Days (HDD) and Cooling Degree Days (CDD), calculated from the daily mean temperature (T_mean).

1. Conversion to Celsius
Since ERA5 provides the 2 m temperature in Kelvin (the scientific unit), we first convert it to Celsius: T_mean = T_mean_K - 273.15

2. Heating Requirements (HDD: Heating Degree Days)
It is generally estimated that below 64°F (18°C), households begin to turn on the heating. If T_mean = 10°C, the gap is 18 - 10 = 8 degrees of heating requirement. If T_mean = 20°C, the requirement is zero (we do not use negative values). Calculation formula for HDD:
HDD = np.maximum(18 - Tmean, 0)

3. Cooling Requirements (CDD: Cooling Degree Days)
Conversely, above 72°F (22°C), cooling systems are considered to be in demand. If T_mean = 28°C, the gap is 28 - 22 = 6 degrees of cooling requirement. Calculation formula for CDD:
CDD = np.maximum(Tmean - 22, 0)

4. Capturing Extreme Effects (HDD² and CDD²)
The relationship between temperature and consumption is not a simple straight line: when it is extremely cold, consumption increases much faster than when it is just "cool."
By adding the square of these values (HDD² and CDD²), we allow the model to capture this acceleration. This enables better prediction of consumption peaks during cold spells or heatwaves.

In summary, the code used to calculate our four thermal variables is as follows:

import numpy as np

# Convert to Celsius (Tmean_K holds the daily mean temperatures in Kelvin)
Tmean = Tmean_K - 273.15

# Calculate linear requirements (18°C and 22°C thresholds)
HDD = np.maximum(18 - Tmean, 0)
CDD = np.maximum(Tmean - 22, 0)

# Calculate quadratic terms for non-linear effects
HDD2 = HDD**2
CDD2 = CDD**2
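
To check the formulas against the worked examples above, the same steps can be wrapped in a small function. This is a minimal sketch: the function name thermal_features and the three sample temperatures are illustrative and are not taken from the original script.

```python
import numpy as np


def thermal_features(tmean_k, heat_base=18.0, cool_base=22.0):
    """Return (Tmean in °C, HDD, CDD, HDD², CDD²) from temperatures in Kelvin."""
    tmean = np.asarray(tmean_k, dtype=float) - 273.15
    hdd = np.maximum(heat_base - tmean, 0.0)
    cdd = np.maximum(tmean - cool_base, 0.0)
    return tmean, hdd, cdd, hdd**2, cdd**2


# Three sample days: 10°C, 20°C and 28°C, expressed in Kelvin
tmean, hdd, cdd, hdd2, cdd2 = thermal_features([283.15, 293.15, 301.15])
print(np.round(hdd, 2))  # heating requirement: 8, 0 and 0 degrees
print(np.round(cdd, 2))  # cooling requirement: 0, 0 and 6 degrees
```

The 20°C day falls between the two thresholds, so all four derived variables are zero for it.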

With these 4 variables, the model identifies three distinct zones:

The Comfort Zone (between 64°F and 72°F / 18°C and 22°C): Temperature has almost no impact on consumption.
The Cold Zone: Consumption increases faster than linearly as the temperature drops (a "curved", quadratic pattern captured by the HDD² term).
The Heat Zone: Consumption increases with the use of air conditioners and fans.
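
The three zones can be made concrete by evaluating the thermal part of a linear model at a few temperatures. In the sketch below, the coefficients are invented purely for illustration (they are not fitted values from this tutorial), and thermal_effect is a hypothetical helper.

```python
import numpy as np

# Hypothetical coefficients (MWh per degree and per squared degree), illustration only
B_HDD, B_HDD2, B_CDD, B_CDD2 = 20_000.0, 500.0, 8_000.0, 300.0


def thermal_effect(tmean_c):
    """Thermal contribution of a linear model built on HDD, HDD², CDD and CDD²."""
    t = np.asarray(tmean_c, dtype=float)
    hdd = np.maximum(18.0 - t, 0.0)
    cdd = np.maximum(t - 22.0, 0.0)
    return B_HDD * hdd + B_HDD2 * hdd**2 + B_CDD * cdd + B_CDD2 * cdd**2


temps = np.array([-2.0, 8.0, 20.0, 30.0])
effects = thermal_effect(temps)
for t, e in zip(temps, effects):
    print(f"{t:5.1f} °C -> {e:10.0f} MWh")
```

With these made-up coefficients, the contribution is exactly zero at 20°C (comfort zone), and doubling the heating requirement from 10 to 20 degrees (8°C versus -2°C) more than doubles the contribution because of the squared term.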

The Search Space

After processing, we are left with five variables: T_mean, HDD, HDD², CDD, and CDD². For each point on the selected grid, we therefore have five time series, each corresponding to one variable.

We store all of these time series in a SQLite database (the DDL for which is provided in the appendix). We perform this transformation for practical reasons: storing data in a SQL database offers several advantages, particularly the ability to use a powerful and well-established query language (SQL).

The script used to generate this database is: data\other_vars\create_db.py.

What this script does:

Creates the thermal.db SQLite database.
Reads and parses the NetCDF files downloaded from the Copernicus website (T_mean temperature series).
Calculates the time series corresponding to the derived thermal variables.
Creates the records for storing the data (in the sites and feature_series tables).
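
The actual DDL is given in the appendix; the sketch below only illustrates the idea with an in-memory database. The table names sites and feature_series come from the text, but every column name and every sample value here is an assumption made for the example and may differ from the real schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # the real script writes thermal.db instead
conn.executescript("""
CREATE TABLE sites (
    site_id   INTEGER PRIMARY KEY,
    latitude  REAL NOT NULL,
    longitude REAL NOT NULL
);
CREATE TABLE feature_series (
    site_id  INTEGER NOT NULL REFERENCES sites(site_id),
    variable TEXT NOT NULL,   -- e.g. 'T_mean', 'HDD', 'HDD2', 'CDD', 'CDD2'
    date     TEXT NOT NULL,   -- ISO date, one row per day
    value    REAL NOT NULL,
    PRIMARY KEY (site_id, variable, date)
);
""")

# Made-up sample rows for a single site and a single day
conn.execute("INSERT INTO sites VALUES (1, 48.75, 2.25)")
rows = [
    (1, "T_mean", "2020-01-01", 4.0),
    (1, "HDD", "2020-01-01", 14.0),
    (1, "HDD2", "2020-01-01", 196.0),
    (1, "CDD", "2020-01-01", 0.0),
    (1, "CDD2", "2020-01-01", 0.0),
]
conn.executemany("INSERT INTO feature_series VALUES (?, ?, ?, ?)", rows)

# One feature = one (variable, site) pair: fetch the HDD series of site 1
hdd_series = conn.execute(
    "SELECT date, value FROM feature_series "
    "WHERE site_id = ? AND variable = ? ORDER BY date",
    (1, "HDD"),
).fetchall()
print(hdd_series)  # [('2020-01-01', 14.0)]
```

Storing the series in this long format keeps the selection of a feature down to a simple filter on the variable and the site.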

Note: Since downloading ERA5 data is a relatively long and tedious process, we recommend using the pre-existing database, which can be downloaded from the Zenodo website (see Appendices, Section 1).

Each meteorological feature therefore corresponds to a combination of a variable (for example, T_mean) and a location (latitude, longitude).

From Rich Weather Data to the Representation Problem

At this stage, we have a massive amount of meteorological information at our disposal:

14,338 ERA5 sites
5 thermal variables per site
This yields a total of 71,690 meteorological time series.

An initial idea might be to automatically select the "best" variables from these tens of thousands of candidates to inject them directly into our forecasting model.

However, this approach immediately raises several challenges:

High Collinearity: These series are extremely correlated with one another. Two geographically close locations often exhibit nearly identical thermal behavior. Feeding all these variables directly into a linear model would lead to an unstable and hard-to-interpret system.
Risk of Overfitting: With such a massive number of explanatory variables, a model can easily learn relationships specific to the training set that fail to generalize to future years.
Scope Realignment: Ultimately, our goal is not to build an overly sophisticated meteorological system, but rather to understand step-by-step how thermal information can improve an energy forecast.
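
The collinearity problem can be illustrated with purely synthetic data. The sketch below builds two hypothetical neighboring sites that share the same seasonal cycle and differ only by small local noise; none of the numbers come from ERA5.

```python
import numpy as np

rng = np.random.default_rng(0)
days = np.arange(730)

# Synthetic data: a shared seasonal cycle plus small local noise at each site
regional = 12.0 + 9.0 * np.sin(2 * np.pi * (days - 110) / 365.25)
site_a = regional + rng.normal(0.0, 0.5, days.size)
site_b = regional + rng.normal(0.0, 0.5, days.size)

corr = np.corrcoef(site_a, site_b)[0, 1]

# Condition number of the standardized two-column design matrix:
# it grows as the two columns become more alike
X = np.column_stack([site_a, site_b])
X = (X - X.mean(axis=0)) / X.std(axis=0)
cond = np.linalg.cond(X)

print(f"correlation = {corr:.3f}, condition number = {cond:.1f}")
```

Even with only two such columns the correlation is close to 1 and the design matrix is noticeably ill-conditioned; the effect compounds as more nearby sites are added, which is why coefficients fitted on many of them become unstable.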

We need to find a much more compact, robust, and easier-to-use representation.

There is also a subtler difficulty. In a global linear model that mixes history, calendar, and weather, the autoregressive variables already "absorb" part of the meteorological information indirectly. During a multi-day cold spell, the high consumption from the previous day already reflects current thermal conditions. Consequently, the weather for the target day provides little additional information to the model, and its signal gets lost in the existing correlations.
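
This absorption effect can be reproduced on a toy example. The sketch below is built entirely on synthetic data and assumed parameters (a persistent AR(1) temperature and a hypothetical load that depends linearly on HDD); it is not the tutorial's model.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 2000

# Synthetic persistent temperature: AR(1) anomalies around a cold-season mean of 8°C
anom = np.zeros(n)
for t in range(1, n):
    anom[t] = 0.9 * anom[t - 1] + rng.normal(0.0, 2.0)
tmean = 8.0 + anom

hdd = np.maximum(18.0 - tmean, 0.0)

# Hypothetical load: base level + heating effect + noise (arbitrary units)
load = 1000.0 + 50.0 * hdd + rng.normal(0.0, 50.0, n)

# How much of today's HDD is already visible in yesterday's load?
corr_lag = np.corrcoef(load[:-1], hdd[1:])[0, 1]
print(f"corr(load[t-1], HDD[t]) = {corr_lag:.2f}")
```

When temperatures are this persistent, yesterday's load is strongly correlated with today's heating requirement, so a model that already includes lagged consumption has little left to learn from the same-day weather variable.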

We must therefore shift our perspective, which is the focus of the next section.