Getting started¶

The xdatasets library gives users access to a large collection of Earth observation datasets in xarray-compatible formats.

The library adopts an opinionated approach to data querying and caters to the specific needs of certain user groups, such as hydrologists, climate scientists, and engineers. One of the functionalities of xdatasets is the ability to extract data at a specific location or within a designated region, such as a watershed or municipality, while also enabling spatial and temporal operations.

To use xdatasets, users must employ a query. For instance, a straightforward query to extract the variables t2m (2m temperature) and tp (Total precipitation) from the era5_reanalysis_single_levels dataset at two geographical positions (Montreal and Toronto) could be as follows:

query = {
    "datasets": {"era5_reanalysis_single_levels": {'variables': ["t2m", "tp"]}},
    "space": {
        "clip": "point", # bbox, point or polygon
        "geometry": {'Montreal' : (45.508888, -73.561668),
                     'Toronto' : (43.651070, -79.347015)
                    }
    }
}
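The query above names two sites by latitude and longitude, while the underlying data live on a grid. How xdatasets matches a point to a grid cell is not described here; a common approach is nearest-neighbour selection. The sketch below illustrates that idea in plain Python. The helper `nearest_grid_node` is hypothetical (it is not part of xdatasets), and the 0.25° regular grid with nodes at multiples of the resolution is an assumption made for illustration.

```python
# Hypothetical helper, not part of xdatasets: snap a (latitude, longitude)
# pair to the nearest node of a regular grid.
def nearest_grid_node(lat, lon, resolution=0.25):
    """Return the grid node closest to (lat, lon).

    Assumes grid nodes sit at integer multiples of `resolution` degrees.
    """
    snapped_lat = round(lat / resolution) * resolution
    snapped_lon = round(lon / resolution) * resolution
    return snapped_lat, snapped_lon


# Same sites as in the query above, given as (latitude, longitude).
sites = {
    "Montreal": (45.508888, -73.561668),
    "Toronto": (43.651070, -79.347015),
}

nearest = {name: nearest_grid_node(lat, lon) for name, (lat, lon) in sites.items()}
for name, node in nearest.items():
    print(name, node)
```

Under these assumptions, each site is represented by a single grid cell, so nearby sites may end up sharing the same cell on a coarse grid.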

An example of a more complex query would look like the one below.

Note: Don’t worry! Below, you’ll find additional examples that will assist in understanding each parameter in the query, as well as the possible combinations.

This query requests the same variables as above. However, instead of geographical positions, a geopandas.GeoDataFrame provides the features (loaded, for example, from a shapefile or a GeoJSON file) within which data are extracted. Each polygon is identified by the unique identifier Station, and a spatial average is computed within each one (averaging: True). The dataset, initially at an hourly time step, is converted to a daily time step by applying one or more temporal aggregations to each variable, as prescribed in the query. xdatasets ultimately returns the dataset for the specified date range and time zone.

query = {
    "datasets": {"era5_reanalysis_single_levels": {'variables': ["t2m", "tp"]}},
    "space": {
        "clip": "polygon", # bbox, point or polygon
        "averaging": True, # spatial average of the variables within each polygon
        "geometry": gdf,
        "unique_id": "Station" # unique column name in geodataframe
    },
    "time": {
        "timestep": "D",
        "aggregation": {"tp": np.nansum,
                        "t2m": [np.nanmax, np.nanmin]},

        "start": '2000-01-01',
        "end": '2020-05-31',
        "timezone": 'America/Montreal',
    },
}
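To make the time block of the query above more concrete, here is a minimal sketch of what converting an hourly series to a daily one with NaN-aware aggregations involves. It is written in plain Python and is not how xdatasets implements it. The function `to_daily` and the small `nansum`, `nanmax` and `nanmin` helpers are hypothetical stand-ins for the NumPy functions, and a fixed UTC offset is used as a simplification of a real time zone such as America/Montreal (which also observes daylight saving time).

```python
import math
from collections import defaultdict
from datetime import datetime, timedelta, timezone


def nansum(values):
    """Sum the values, skipping NaN (same intent as np.nansum)."""
    return sum(v for v in values if not math.isnan(v))


def nanmax(values):
    """Largest value, skipping NaN (same intent as np.nanmax)."""
    return max(v for v in values if not math.isnan(v))


def nanmin(values):
    """Smallest value, skipping NaN (same intent as np.nanmin)."""
    return min(v for v in values if not math.isnan(v))


def to_daily(hourly, aggregations, utc_offset_hours=0):
    """Group (UTC timestamp, value) pairs by local calendar day and aggregate.

    `utc_offset_hours` is a fixed offset used here instead of a full
    IANA time zone, which is an assumption made to keep the sketch short.
    """
    local_tz = timezone(timedelta(hours=utc_offset_hours))
    by_day = defaultdict(list)
    for stamp, value in hourly:
        by_day[stamp.astimezone(local_tz).date().isoformat()].append(value)
    return {
        day: {func.__name__: func(values) for func in aggregations}
        for day, values in sorted(by_day.items())
    }


# Illustrative data: 24 hourly values on 2000-01-01 (UTC), one of them missing.
start = datetime(2000, 1, 1, tzinfo=timezone.utc)
hourly = [(start + timedelta(hours=h), float(h)) for h in range(24)]
hourly[3] = (hourly[3][0], math.nan)

daily_utc = to_daily(hourly, [nansum, nanmax, nanmin])
daily_local = to_daily(hourly, [nansum, nanmax, nanmin], utc_offset_hours=-5)
print(daily_utc)
print(daily_local)
```

With a UTC offset of -5 hours, the first five hourly values fall on the previous local day, which is why the time zone matters when aggregating to a daily time step.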

Query climate datasets¶

In order to use xdatasets, you must import at least xdatasets, pandas, geopandas, and numpy. Additionally, we import pathlib to interact with files.

[1]:

import os
import warnings
from pathlib import Path

warnings.simplefilter("ignore")

os.environ["USE_PYGEOS"] = "0"
import geopandas as gpd

# Visualization
import hvplot.pandas  # noqa
import hvplot.xarray  # noqa
import numpy as np
import pandas as pd
import panel as pn  # noqa

import xdatasets as xd


Clip on polygons with no averaging in space¶

First, let’s explore specific polygon features. With xdatasets, you can access geographical datasets, such as watershed boundaries linked to streamflow stations. These datasets follow a nomenclature where they are named after the hydrological dataset, with "_polygons" appended. For example, if the hydrological dataset is named deh, its corresponding watershed boundaries dataset will be labeled deh_polygons. The query below retrieves all polygons for the deh_polygons dataset.

gdf = xd.Query(
    **{
        "datasets": "deh_polygons"
}).data

gdf

As the data is loaded into memory, the process of loading all polygons may take some time. To expedite this, we recommend employing filters, as illustrated below. It’s important to note that the filters are consistent for both hydrological and corresponding geographical datasets. Consequently, only watershed boundaries associated with existing hydrological data will be returned.

[7]:

import xdatasets as xd

gdf = xd.Query(
    **{
        "datasets": {
            "deh_polygons": {
                "id": ["0421*"],
            }
        }
    }
).data.reset_index()

gdf

[7]:

	Station	Superficie	geometry
0	042102	623.479187	POLYGON ((-78.57120 46.70742, -78.57112 46.707...
1	042103	579.479614	POLYGON ((-78.49014 46.64514, -78.49010 46.645...
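The "0421*" filter above behaves like a wildcard: it kept the two stations whose identifier starts with 0421. As an illustration only, the sketch below reproduces that behaviour with shell-style pattern matching from the Python standard library. Whether xdatasets uses the same matching rules internally is an assumption, `filter_ids` is a hypothetical helper, and the list of station identifiers is made up apart from the two returned above.

```python
from fnmatch import fnmatchcase


def filter_ids(station_ids, patterns):
    """Keep the identifiers that match at least one shell-style pattern."""
    return [
        station_id
        for station_id in station_ids
        if any(fnmatchcase(station_id, pattern) for pattern in patterns)
    ]


# Illustrative identifiers; only 042102 and 042103 appear in the output above.
station_ids = ["042102", "042103", "042201", "020602", "104804"]

selected = filter_ids(station_ids, ["0421*"])
print(selected)
```

Several patterns can be combined in the list, in which case an identifier is kept as soon as it matches one of them.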

Let’s examine the geographic locations of the polygon features.

[8]:

gdf.hvplot(
    geo=True,
    tiles="ESRI",
    color="Station",
    alpha=0.8,
    width=750,
    height=450,
    legend="top",
    hover_cols=["Station", "Superficie"],
)

[8]:

The following query requests the variables t2m and tp from the era5_reanalysis_single_levels dataset, covering the period from January 1, 1959, to August 31, 1961, for the two polygons shown earlier. Note that because averaging is set to False, no spatial averaging is performed, and a mask (raster) is returned for each polygon.

[9]:

query = {
    "datasets": {"era5_reanalysis_single_levels": {"variables": ["t2m", "tp"]}},
    "space": {
        "clip": "polygon",  # bbox, point or polygon
        "averaging": False,  # spatial average of the variables within each polygon
        "geometry": gdf,
        "unique_id": "Station",  # unique column name in geodataframe
    },
    "time": {
        "start": "1959-01-01",
        "end": "1961-08-31",
    },
}

xds = xd.Query(**query)

Spatial operations: processing polygon 042103 with era5_reanalysis_single_levels: : 2it [00:00,  4.09it/s]

By accessing the data attribute, you can view the data obtained from the query. For each variable, the dimensions of time, latitude, longitude, and Station (the unique ID) are included. In addition, there is another variable called weights that is returned. This variable specifies the weight that should be assigned to each pixel if spatial averaging is conducted over a mask (polygon).

[10]:

xds.data

Weights are much easier to comprehend visually, so let’s examine the weights returned for the station 042102. Notice that when selecting a single feature (Station 042102 in this case), the shape of our spatial dimensions is reduced to a 3x2 pixel area (longitude x latitude) that encompasses the entire feature.

[11]:

station = "042102"

ds_station = xds.data.sel(Station=station)
ds_clipped = xds.bbox_clip(ds_station).squeeze()
ds_clipped
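The clipping step above reduces the grid to the few pixels that cover the selected feature. The internals of bbox_clip are not shown in this guide, so the sketch below only illustrates the general idea under an assumption: cells outside the polygon carry no weight (None here), and the grid is trimmed to the smallest rectangle containing every cell that has one. The helpers `bbox_of_valid_cells` and `clip_to_bbox` and the weight values are hypothetical.

```python
def bbox_of_valid_cells(grid):
    """Return (row_start, row_stop, col_start, col_stop) enclosing all non-None cells."""
    rows = [r for r, row in enumerate(grid) if any(cell is not None for cell in row)]
    cols = [
        c
        for c in range(len(grid[0]))
        if any(row[c] is not None for row in grid)
    ]
    if not rows:
        raise ValueError("the grid contains no valid cell")
    return rows[0], rows[-1] + 1, cols[0], cols[-1] + 1


def clip_to_bbox(grid):
    """Trim the grid to the bounding box of its non-None cells."""
    row_start, row_stop, col_start, col_stop = bbox_of_valid_cells(grid)
    return [row[col_start:col_stop] for row in grid[row_start:row_stop]]


# Illustrative weights on a 4 x 5 grid (rows = latitude, columns = longitude).
weights = [
    [None, None, None, None, None],
    [None, 0.05, 0.40, 0.10, None],
    [None, 0.05, 0.35, 0.05, None],
    [None, None, None, None, None],
]

clipped = clip_to_bbox(weights)
for row in clipped:
    print(row)
```

In this toy case the result has 2 rows and 3 columns, that is, a 3x2 area when read as longitude x latitude.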

[12]:

(
    (
        ds_clipped.t2m.isel(time=0).hvplot(
            title="The 2m temperature for pixels that intersect with the polygon on January 1, 1959",
            tiles="ESRI",
            geo=True,
            alpha=0.6,
            colormap="isolum",
            width=750,
            height=450,
        )
        * gdf[gdf.Station == station].hvplot(
            geo=True,
            width=750,
            height=450,
            legend="top",
            hover_cols=["Station", "Superficie"],
        )
    )
    + ds_clipped.weights.hvplot(
        title="The weights that should be assigned to each pixel when performing spatial averaging",
        tiles="ESRI",
        alpha=0.6,
        colormap="isolum",
        geo=True,
        width=750,
        height=450,
    )
    * gdf[gdf.Station == station].hvplot(
        geo=True,
        width=750,
        height=450,
        legend="top",
        hover_cols=["Station", "Superficie"],
    )
).cols(1)

[12]:

The two plots depicted above show the 2m temperature for each pixel that intersects with the polygon from Station 042102 and the corresponding weights to be applied to each pixel. In the lower plot, it is apparent that the majority of the polygon is situated in the central pixels, which results in those pixels having a weight of approximately 80%. It is evident that the two lower and the upper pixels have much less intersection with the polygon, which results in their respective weights being smaller (hover on the plot to verify the weights).

In various libraries, either all pixels that intersect with the geometries are kept, or only pixels with centers within the polygon are retained. However, as shown in the previous example, utilizing such methods can introduce significant biases in the final calculations.
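To see how large the difference between these approaches can be, here is a small sketch with made-up numbers. It compares three ways of averaging the pixels that touch a polygon: keeping every intersecting pixel with equal weight, keeping only pixels whose center lies inside the polygon, and weighting each pixel. The pixel values, weights and center flags are hypothetical, and `weighted_mean` is an illustrative helper rather than an xdatasets function.

```python
def weighted_mean(values, weights):
    """Average of `values` weighted by `weights` (normalised by their sum)."""
    total = sum(weights)
    if total == 0:
        raise ValueError("the weights sum to zero")
    return sum(v * w for v, w in zip(values, weights)) / total


# Hypothetical pixels intersecting one polygon.
pixels = [
    {"value": 10.0, "weight": 0.80, "center_inside": True},
    {"value": 20.0, "weight": 0.15, "center_inside": False},
    {"value": 30.0, "weight": 0.05, "center_inside": False},
]
values = [p["value"] for p in pixels]

# 1. Every intersecting pixel counts equally.
all_touched = weighted_mean(values, [1.0] * len(pixels))

# 2. Only pixels whose center falls inside the polygon count.
centers_only = weighted_mean(
    values, [1.0 if p["center_inside"] else 0.0 for p in pixels]
)

# 3. Each pixel counts in proportion to its weight.
area_weighted = weighted_mean(values, [p["weight"] for p in pixels])

print(all_touched, centers_only, area_weighted)
```

With these numbers, the equal-weight average is pulled toward pixels that barely touch the polygon, while the center-based approach ignores them entirely; the weighted average sits between the two.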

Query hydrological datasets¶

Hydrological queries are still being tested, and the output format is likely to change. Stay tuned!

[21]:

query = {"datasets": "deh"}
xds = xd.Query(**query)
xds.data

[21]:

<xarray.Dataset> Size: 1GB
Dimensions:        (id: 745, variable: 2, spatial_agg: 2, timestep: 1,
                    time_agg: 1, source: 1, time: 60631)
Coordinates: (12/15)
    drainage_area  (id) float32 3kB dask.array<chunksize=(745,), meta=np.ndarray>
    end_date       (variable, id, spatial_agg, timestep, time_agg, source) datetime64[ns] 24kB dask.array<chunksize=(2, 745, 2, 1, 1, 1), meta=np.ndarray>
  * id             (id) object 6kB '010101' '010801' ... '104804' '120201'
    latitude       (id) float32 3kB dask.array<chunksize=(745,), meta=np.ndarray>
    longitude      (id) float32 3kB dask.array<chunksize=(745,), meta=np.ndarray>
    name           (id) object 6kB dask.array<chunksize=(745,), meta=np.ndarray>
    ...             ...
  * spatial_agg    (spatial_agg) object 16B 'point' 'watershed'
    start_date     (variable, id, spatial_agg, timestep, time_agg, source) datetime64[ns] 24kB dask.array<chunksize=(2, 745, 2, 1, 1, 1), meta=np.ndarray>
  * time           (time) datetime64[ns] 485kB 1860-01-01 ... 2025-12-31
  * time_agg       (time_agg) object 8B 'mean'
  * timestep       (timestep) object 8B 'D'
  * variable       (variable) object 16B 'level' 'streamflow'
Data variables:
    level          (id, time, variable, spatial_agg, timestep, time_agg, source) float32 723MB dask.array<chunksize=(1, 60631, 1, 1, 1, 1, 1), meta=np.ndarray>
    streamflow     (id, time, variable, spatial_agg, timestep, time_agg, source) float32 723MB dask.array<chunksize=(1, 60631, 1, 1, 1, 1, 1), meta=np.ndarray>

[22]:

ds = (
    xd.Query(
        **{
            "datasets": {
                "deh": {
                    "id": ["020*"],
                    "regulated": ["Natural"],
                    "variables": ["streamflow"],
                }
            },
            "time": {"start": "1970-01-01", "minimum_duration": (10 * 365, "d")},
        }
    )
    .data.squeeze()
    .load()
)

ds
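The query above combines a start date with a minimum_duration of (10 * 365, "d"). The exact semantics are not spelled out in this guide; one plausible reading, assumed here, is that a station is kept only if its record from the start date onward spans at least that many days. The sketch below illustrates this reading with hypothetical station names and record periods; `has_minimum_duration` is not an xdatasets function.

```python
from datetime import date, timedelta


def has_minimum_duration(record_start, record_end, query_start, minimum_days):
    """True if the record, counted from `query_start` onward, is long enough."""
    effective_start = max(record_start, query_start)
    return (record_end - effective_start) >= timedelta(days=minimum_days)


# Hypothetical stations with the first and last dates of their records.
records = {
    "station_a": (date(1960, 1, 1), date(1975, 12, 31)),  # about 6 years after 1970
    "station_b": (date(1965, 1, 1), date(2000, 12, 31)),  # about 31 years after 1970
    "station_c": (date(1995, 1, 1), date(2010, 12, 31)),  # about 16 years
}

query_start = date(1970, 1, 1)
minimum_days = 10 * 365

kept = [
    name
    for name, (first, last) in records.items()
    if has_minimum_duration(first, last, query_start, minimum_days)
]
print(kept)
```

Under this reading, a station with a long record that ends shortly after the start date is excluded, even though its full record exceeds ten years.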

[23]:

query = {"datasets": "hydat"}
xds = xd.Query(**query)
xds.data

[23]:

<xarray.Dataset> Size: 841GB
Dimensions:        (data_type: 2, id: 7881, spatial_agg: 2, timestep: 1,
                    time_agg: 1, latitude: 2800, longitude: 4680, time: 59413)
Coordinates: (12/15)
  * data_type      (data_type) <U5 40B 'flow' 'level'
    drainage_area  (id) float64 63kB dask.array<chunksize=(10,), meta=np.ndarray>
    end_date       (id, data_type, spatial_agg, timestep, time_agg) object 252kB dask.array<chunksize=(7881, 2, 2, 1, 1), meta=np.ndarray>
  * id             (id) <U7 221kB '01AA002' '01AD001' ... '11AF004' '11AF005'
  * latitude       (latitude) float64 22kB 85.0 84.97 84.95 ... 15.05 15.02
  * longitude      (longitude) float64 37kB -167.0 -167.0 ... -50.05 -50.02
    ...             ...
    source         (id) object 63kB dask.array<chunksize=(7881,), meta=np.ndarray>
  * spatial_agg    (spatial_agg) object 16B 'point' 'watershed'
    start_date     (id, data_type, spatial_agg, timestep, time_agg) object 252kB dask.array<chunksize=(7881, 2, 2, 1, 1), meta=np.ndarray>
  * time           (time) datetime64[ns] 475kB 1860-01-01 ... 2022-08-31
  * time_agg       (time_agg) <U4 16B 'mean'
  * timestep       (timestep) <U3 12B 'day'
Data variables:
    mask           (id, latitude, longitude) float64 826GB dask.array<chunksize=(1, 500, 500), meta=np.ndarray>
    value          (id, time, data_type, spatial_agg, timestep, time_agg) float64 15GB dask.array<chunksize=(10, 59413, 1, 1, 1, 1), meta=np.ndarray>
