# Getting started

The `xdatasets` library enables users to effortlessly access a vast collection of earth observation datasets that are compatible with `xarray` formats.

The library adopts an opinionated approach to data querying and caters to the specific needs of certain user groups, such as hydrologists, climate scientists, and engineers. One of the functionalities of `xdatasets` is the ability to extract data at a specific location or within a designated region, such as a watershed or municipality, while also enabling spatial and temporal operations.

To use `xdatasets`, users must employ a query. For instance, a straightforward query to extract the variables `t2m` (*2m temperature*) and `tp` (*Total precipitation*) from the `era5_reanalysis_single_levels` dataset at two geographical positions (Montreal and Toronto) could be as follows:

```python
query = {
 "datasets": {"era5_reanalysis_single_levels": {'variables': ["t2m", "tp"]}},
 "space": {
 "clip": "point", # bbox, point or polygon
 "geometry": {'Montreal' : (45.508888, -73.561668),
 'Toronto' : (43.651070, -79.347015)
 }
 }
}
```
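
To make the expected shape of a point query more concrete, here is a minimal sketch that checks a query dictionary before it is sent. The helper `check_point_query` is hypothetical and is not part of `xdatasets`; it also assumes that coordinates are given in (latitude, longitude) order, as the values for Montreal and Toronto above suggest.

```python
def check_point_query(query):
    """Return a list of problems found in a point-clip query (hypothetical helper)."""
    problems = []
    space = query.get("space", {})
    if space.get("clip") != "point":
        problems.append("space['clip'] should be 'point'")
    for name, coords in space.get("geometry", {}).items():
        if len(coords) != 2:
            problems.append(f"{name}: expected a (latitude, longitude) pair")
            continue
        lat, lon = coords
        # Assumed order: latitude first, longitude second.
        if not -90 <= lat <= 90:
            problems.append(f"{name}: latitude {lat} is outside [-90, 90]")
        if not -180 <= lon <= 180:
            problems.append(f"{name}: longitude {lon} is outside [-180, 180]")
    return problems


query = {
    "datasets": {"era5_reanalysis_single_levels": {"variables": ["t2m", "tp"]}},
    "space": {
        "clip": "point",
        "geometry": {
            "Montreal": (45.508888, -73.561668),
            "Toronto": (43.651070, -79.347015),
        },
    },
}

# An empty list means that no problem was found.
print(check_point_query(query))

# A typo in the latitude (145.5 instead of 45.5) is reported.
typo = {"space": {"clip": "point", "geometry": {"Montreal": (145.508888, -73.561668)}}}
print(check_point_query(typo))
```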

An example of a more complex query would look like the one below. 

> **Note**
> Don't worry! Below, you'll find additional examples that will assist in understanding each parameter in the query, as well as the possible combinations.

This query calls the same variables as above. However, instead of specifying geographical positions, a `geopandas.GeoDataFrame` is used to provide features (for example, loaded from shapefiles or GeoJSON files) for extracting data within each of them. Each polygon is identified using the unique identifier `Station`, and a spatial average is computed within each one (`averaging: True`). The dataset, initially at an hourly time step, is converted into a daily time step while applying one or more temporal aggregations for each variable as prescribed in the query. `xdatasets` ultimately returns the dataset for the specified date range and time zone.

```python
query = {
 "datasets": {"era5_reanalysis_single_levels": {'variables': ["t2m", "tp"]}},
 "space": {
 "clip": "polygon", # bbox, point or polygon
 "averaging": True, # spatial average of the variables within each polygon
 "geometry": gdf,
 "unique_id": "Station" # unique column name in geodataframe
 },
 "time": {
 "timestep": "D",
 "aggregation": {"tp": np.nansum, 
 "t2m": [np.nanmax, np.nanmin]},
 
 "start": '2000-01-01',
 "end": '2020-05-31',
 "timezone": 'America/Montreal',
 },
}
```




## Query climate datasets

To use `xdatasets`, you must import at least `xdatasets`, `pandas`, `geopandas`, and `numpy`. We also import `pathlib` to interact with files, and `hvplot` and `panel` for visualization.

In [None]:
import os
import warnings
from pathlib import Path

warnings.simplefilter("ignore")

os.environ["USE_PYGEOS"] = "0"
import geopandas as gpd

# Visualization
import hvplot.pandas # noqa
import hvplot.xarray # noqa
import numpy as np
import pandas as pd
import panel as pn # noqa

import xdatasets as xd

### Clip by points (sites)


To begin with, we need to create a dictionary of sites and their corresponding geographical coordinates.

In [None]:
sites = {
 "Montreal": (45.508888, -73.561668),
 "New York": (40.730610, -73.935242),
 "Miami": (25.761681, -80.191788),
}

We will then extract the `tp` (*total precipitation*) and `t2m` (*2m temperature*) from the `era5_reanalysis_single_levels` dataset for the designated sites. Afterward, we will convert the time step to daily and adjust the timezone to Eastern Time. Finally, we will limit the temporal interval.

Before proceeding with this first query, let's quickly outline the role of each parameter:

- **datasets**: A dictionary where datasets serve as keys and desired variables as values.
- **space**: A dictionary that defines the necessary spatial operations to apply on user-supplied geographic features.
- **time**: A dictionary that defines the necessary temporal operations to apply on the datasets.

For more information on each parameter, consult the API documentation.

This is what the requested query looks like:

In [None]:
query = {
 "datasets": "era5_reanalysis_single_levels",
 "space": {"clip": "point", "geometry": sites}, # bbox, point or polygon
 "time": {
 "timestep": "D", # http://pandas.pydata.org/pandas-docs/stable/user_guide/timeseries.html#offset-aliases
 "aggregation": {"tp": np.nansum, "t2m": np.nanmean},
 "start": "1995-01-01",
 "end": "2000-12-31",
 "timezone": "America/Montreal",
 },
}
xds = xd.Query(**query)

By accessing the `data` attribute, you can view the data obtained from the query. Note that the variable name `tp` has been updated to `tp_nansum` to reflect the reduction operation (`np.nansum`) that was used to convert the time step from hourly to daily. Likewise, `t2m` was updated to `t2m_nanmean`.

In [None]:
xds.data
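
The renaming described above can be illustrated with a small, self-contained sketch. It does not call `xdatasets`: the hourly values are made up, and the `"<variable>_<function name>"` naming scheme is an assumption inferred from the `tp_nansum` and `t2m_nanmean` names seen in the output.

```python
import numpy as np

# Two days of made-up hourly values (48 time steps) for a single site.
hours = np.arange(48)
hourly = {
    "tp": np.full(48, 0.001),  # constant precipitation, for simplicity
    "t2m": 270.0 + 5.0 * np.sin(2 * np.pi * hours / 24),  # temperature with a daily cycle
}

aggregation = {"tp": np.nansum, "t2m": np.nanmean}

daily = {}
for name, func in aggregation.items():
    # Group the 48 hourly values into 2 rows of 24 hours and reduce each row.
    by_day = hourly[name].reshape(-1, 24)
    # Assumed naming scheme: "<variable>_<function name>".
    daily[f"{name}_{func.__name__}"] = func(by_day, axis=1)

print(list(daily))  # ['tp_nansum', 't2m_nanmean']
```

Each entry of `daily` now holds one value per day, in the same way that each aggregated variable in `xds.data` holds one value per daily time step.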

In [None]:
title = f"Comparison of total precipitation across three cities in North America from \
{xds.data.time.dt.year.min().values} to {xds.data.time.dt.year.max().values}"

xds.data.sel(
 timestep="D",
 source="era5_reanalysis_single_levels",
).hvplot(
 title=title,
 x="time",
 y="tp_nansum",
 grid=True,
 width=750,
 height=450,
 by="site",
 legend="top",
 widget_location="bottom",
)

In [None]:
title = f"Comparison of 2m temperature across three cities in North America from \
{xds.data.time.dt.year.min().values} to {xds.data.time.dt.year.max().values}"

xds.data.sel(
 timestep="D",
 source="era5_reanalysis_single_levels",
).hvplot(
 title=title,
 x="time",
 y="t2m_nanmean",
 grid=True,
 width=750,
 height=450,
 by="site",
 legend="top",
 widget_location="bottom",
)

### Clip on polygons with no averaging in space

First, let's explore specific polygon features. With `xdatasets`, you can access geographical datasets, such as watershed boundaries linked to streamflow stations. These datasets follow a nomenclature where they are named after the hydrological dataset, with `"_polygons"` appended. For example, if the hydrological dataset is named `deh`, its corresponding watershed boundaries dataset will be labeled `deh_polygons`. The query below retrieves all polygons for the `deh_polygons` dataset.

```python
gdf = xd.Query(
 **{
 "datasets": "deh_polygons"
}).data

gdf
```

As the data is loaded into memory, the process of loading all polygons may take some time. To expedite this, we recommend employing filters, as illustrated below. It's important to note that the filters are consistent for both hydrological and corresponding geographical datasets. Consequently, only watershed boundaries associated with existing hydrological data will be returned.

In [None]:
import xdatasets as xd

gdf = xd.Query(
 **{
 "datasets": {
 "deh_polygons": {
 "id": ["0421*"],
 }
 }
 }
).data.reset_index()

gdf
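
To give an intuition for what a wildcard filter such as `"0421*"` selects, here is a minimal sketch based on Python's standard `fnmatch` module. It assumes that the `id` filter behaves like a shell-style pattern, which may not match how `xdatasets` implements it internally. Apart from `042102`, the station identifiers are made up for illustration.

```python
from fnmatch import fnmatchcase

# Station identifiers for illustration only (all but "042102" are made up).
station_ids = ["042102", "042103", "042146", "020404", "030101"]


def filter_ids(ids, patterns):
    """Keep the identifiers that match at least one shell-style pattern."""
    return [i for i in ids if any(fnmatchcase(i, p) for p in patterns)]


print(filter_ids(station_ids, ["0421*"]))  # ['042102', '042103', '042146']
```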

Let's examine the geographic locations of the polygon features.

In [None]:
gdf.hvplot(
 geo=True,
 tiles="ESRI",
 color="Station",
 alpha=0.8,
 width=750,
 height=450,
 legend="top",
 hover_cols=["Station", "Superficie"],
)

The following query seeks the variables `t2m` and `tp` from the `era5_reanalysis_single_levels` dataset, covering the period between January 1, 1959, and August 31, 1961, for the three polygons mentioned earlier. Note that because `averaging` is set to `False`, no spatial averaging will be conducted, and a mask (raster) will be returned for each polygon.

In [None]:
query = {
 "datasets": {"era5_reanalysis_single_levels": {"variables": ["t2m", "tp"]}},
 "space": {
 "clip": "polygon", # bbox, point or polygon
 "averaging": False, # spatial average of the variables within each polygon
 "geometry": gdf,
 "unique_id": "Station", # unique column name in geodataframe
 },
 "time": {
 "start": "1959-01-01",
 "end": "1961-08-31",
 },
}

xds = xd.Query(**query)

By accessing the `data` attribute, you can view the data obtained from the query. For each variable, the dimensions of `time`, `latitude`, `longitude`, and `Station` (the unique ID) are included. In addition, there is another variable called `weights` that is returned. This variable specifies the weight that should be assigned to each pixel if spatial averaging is conducted over a mask (polygon).

In [None]:
xds.data

Weights are much easier to comprehend visually, so let's examine the weights returned for the station *042102*. Notice that when selecting a single feature (Station *042102* in this case), the shape of our spatial dimensions is reduced to a 3x2 pixel area (longitude x latitude) that encompasses the entire feature.

In [None]:
station = "042102"

ds_station = xds.data.sel(Station=station)
ds_clipped = xds.bbox_clip(ds_station).squeeze()
ds_clipped

In [None]:
(
 (
 ds_clipped.t2m.isel(time=0).hvplot(
 title="The 2m temperature for pixels that intersect with the polygon on January 1, 1959",
 tiles="ESRI",
 geo=True,
 alpha=0.6,
 colormap="isolum",
 width=750,
 height=450,
 )
 * gdf[gdf.Station == station].hvplot(
 geo=True,
 width=750,
 height=450,
 legend="top",
 hover_cols=["Station", "Superficie"],
 )
 )
 + ds_clipped.weights.hvplot(
 title="The weights that should be assigned to each pixel when performing spatial averaging",
 tiles="ESRI",
 alpha=0.6,
 colormap="isolum",
 geo=True,
 width=750,
 height=450,
 )
 * gdf[gdf.Station == station].hvplot(
 geo=True,
 width=750,
 height=450,
 legend="top",
 hover_cols=["Station", "Superficie"],
 )
).cols(1)

The two plots depicted above show the 2m temperature for each pixel that intersects with the polygon from Station `042102` and the corresponding weights to be applied to each pixel. In the lower plot, it is apparent that the majority of the polygon is situated in the central pixels, which results in those pixels having a weight of approximately 80%. It is evident that the two lower and the upper pixels have much less intersection with the polygon, which results in their respective weights being smaller (hover on the plot to verify the weights).

In various libraries, either all pixels that intersect with the geometries are kept, or only pixels with centers within the polygon are retained. However, as shown in the previous example, utilizing such methods can introduce significant biases in the final calculations.
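
The size of this bias can be illustrated with a small numerical sketch. The values and weights below are made up for a 3x2 pixel area; the weights are assumed to represent the share of the polygon's area that falls in each pixel (with about 80% in the two central pixels), and the "centers within the polygon" mask is hypothetical.

```python
import numpy as np

# Made-up precipitation values (mm) on a 3x2 grid, for illustration only.
tp = np.array([[0.0, 1.0],
               [4.0, 6.0],
               [12.0, 20.0]])

# Hypothetical weights: share of the polygon's area within each pixel (sums to 1).
weights = np.array([[0.02, 0.03],
                    [0.45, 0.35],
                    [0.10, 0.05]])

# Hypothetical mask of the pixels whose center lies inside the polygon.
center_inside = np.array([[False, False],
                          [True, True],
                          [False, False]])

weighted_mean = np.average(tp, weights=weights)  # area-weighted average
all_touched_mean = tp[weights > 0].mean()        # every intersecting pixel counts equally
center_only_mean = tp[center_inside].mean()      # only pixels with their center inside

print(round(weighted_mean, 2), round(all_touched_mean, 2), round(center_only_mean, 2))
```

With these numbers, the three approaches give clearly different averages for the same polygon, which is why the weights are returned alongside the data.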

### Clip on polygons with averaging in space

The following query seeks the variables `t2m` and `tp` from the `era5_reanalysis_single_levels` and `era5_land_reanalysis` datasets, covering the period between January 1, 2014, and December 31, 2023, for the three polygons mentioned earlier. Note that when the `averaging` parameter is set to `True`, spatial averaging takes place. In addition, the weighted mask (raster) described earlier will be applied to generate a time series for each polygon.

Additional steps are carried out in the process, including converting the original hourly time step to a daily time step. During this conversion, various temporal aggregations will be applied to each variable and a conversion to the local time zone will take place.
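
The time zone matters because it shifts the boundaries of each daily bin. The sketch below uses only the standard library and made-up hourly values. It assumes that the source timestamps are in UTC and uses a fixed UTC-5 offset as a stand-in for `America/Montreal` in winter (no daylight saving time handling), so it is an illustration rather than what `xdatasets` does internally.

```python
from collections import defaultdict
from datetime import datetime, timedelta, timezone

UTC = timezone.utc
# Assumption: a fixed UTC-5 offset stands in for America/Montreal in winter.
LOCAL = timezone(timedelta(hours=-5))

# 48 made-up hourly precipitation values starting at 2014-01-01 00:00 UTC.
start = datetime(2014, 1, 1, tzinfo=UTC)
hourly = [(start + timedelta(hours=h), 1.0) for h in range(48)]


def daily_totals(records, tz):
    """Sum hourly values per calendar day, as seen from time zone `tz`."""
    totals = defaultdict(float)
    for stamp, value in records:
        totals[stamp.astimezone(tz).date().isoformat()] += value
    return dict(totals)


print(daily_totals(hourly, UTC))    # {'2014-01-01': 24.0, '2014-01-02': 24.0}
print(daily_totals(hourly, LOCAL))  # {'2013-12-31': 5.0, '2014-01-01': 24.0, '2014-01-02': 19.0}
```

The same 48 hourly values fall into different daily totals depending on the time zone used to define a day.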

> **Note**
> If users prefer to pass multiple dictionaries instead of a single large one, the following format is also considered acceptable.

In [None]:
datasets = {
 "era5_reanalysis_single_levels": {"variables": ["t2m", "tp"]},
 "era5_land_reanalysis": {"variables": ["t2m", "tp"]},
}
space = {
 "clip": "polygon", # bbox, point or polygon
 "averaging": True,
 "geometry": gdf, # 3 polygons
 "unique_id": "Station",
}
time = {
 "timestep": "D",
 "aggregation": {"tp": [np.nansum], "t2m": [np.nanmax, np.nanmin]},
 "start": "2014-01-01",
 "end": "2023-12-31",
 "timezone": "America/Montreal",
}

xds = xd.Query(datasets=datasets, space=space, time=time)

In [None]:
xds.data

In [None]:
(
 xds.data[["t2m_nanmax", "t2m_nanmin"]]
 .squeeze()
 .hvplot(
 x="time",
 groupby=["Station", "source"],
 width=750,
 height=400,
 grid=True,
 widget_location="bottom",
 )
)

The total precipitation (`tp`) variable of the resulting dataset can be explored in the same way:

In [None]:
(
 xds.data[["tp_nansum"]]
 .squeeze()
 .hvplot(
 x="time",
 groupby=["Station", "source"],
 width=750,
 height=400,
 grid=True,
 widget_location="bottom",
 color="blue",
 )
)

### Bounding box (bbox) around polygons

The following query seeks the variable `tp` from the `era5_land_reanalysis` dataset, covering the period between January 1, 1969, and December 31, 1980, for the bounding box that delimits the three polygons mentioned earlier.

Additional steps are carried out in the process, including converting to the local time zone.

In [None]:
query = {
 "datasets": {"era5_land_reanalysis": {"variables": ["tp"]}},
 "space": {
 "clip": "bbox", # bbox, point or polygon
 "geometry": gdf,
 },
 "time": {
 "start": "1969-01-01",
 "end": "1980-12-31",
 "timezone": "America/Montreal",
 },
}


xds = xd.Query(**query)

In [None]:
xds.data
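
As a rough illustration of what "the bounding box that delimits the polygons" means, the sketch below computes the smallest rectangle that contains a few made-up polygons. The polygon names and coordinates are hypothetical. With a real `GeoDataFrame`, the `total_bounds` attribute of `geopandas` returns the same kind of `(minx, miny, maxx, maxy)` tuple, although whether `xdatasets` relies on it is an assumption.

```python
# Made-up polygon vertices as (longitude, latitude) pairs, for illustration only.
polygons = {
    "A": [(-75.9, 45.6), (-75.2, 45.6), (-75.2, 46.1), (-75.9, 46.1)],
    "B": [(-75.4, 45.9), (-74.8, 45.9), (-74.8, 46.4), (-75.4, 46.4)],
    "C": [(-76.1, 46.0), (-75.6, 46.0), (-75.6, 46.3), (-76.1, 46.3)],
}


def total_bounds(polys):
    """Return (min_lon, min_lat, max_lon, max_lat) over all polygon vertices."""
    lons = [lon for ring in polys.values() for lon, _ in ring]
    lats = [lat for ring in polys.values() for _, lat in ring]
    return (min(lons), min(lats), max(lons), max(lats))


print(total_bounds(polygons))  # (-76.1, 45.6, -74.8, 46.4)
```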

Let's find out which day (24-hour period) was the rainiest in the entire region for the data retrieved in the previous cell.

In [None]:
indexer = (
 xds.data.sel(source="era5_land_reanalysis")
 .tp.sum(["latitude", "longitude"])
 .rolling(time=24)
 .sum()
 .argmax("time")
 .values
)

xds.data.isel(time=indexer).time.dt.date.values.tolist()
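
The logic of this cell can be reproduced on a tiny array. The sketch below uses made-up hourly values and a window of 3 instead of 24 to keep it readable. It assumes a trailing (non-centered) window, which is the default for `xarray` rolling operations, so the index returned by `argmax` marks the last hour of the wettest window.

```python
import numpy as np

# Made-up hourly precipitation totals over the region, for illustration only.
hourly = np.array([0.0, 1.0, 5.0, 2.0, 0.0, 0.0])
window = 3  # 24 in the cell above; 3 keeps the example readable

# Entry j of `sums` covers hourly[j : j + window], so that window ends at index j + window - 1.
sums = np.convolve(hourly, np.ones(window), mode="valid")
end = int(np.argmax(sums)) + window - 1

# The wettest window runs from `end - window + 1` to `end`, both included.
wettest = hourly[end - window + 1 : end + 1]
print(end, wettest)  # 3 [1. 5. 2.]
```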

Let's visualise the evolution of the hourly precipitation during that day. Each image (raster) covers exactly the bounding box required to contain all polygons in the query. For full interactivity, the code must be run in a Jupyter Notebook.



In [None]:
# A trailing 24-hour window that ends at `indexer` covers positions indexer - 23 to indexer.
da = xds.data.tp.isel(time=slice(indexer - 23, indexer + 1))
da = da.where(da > 0.0001, drop=True)

(da * 1000).squeeze().hvplot.quadmesh(
 width=750,
 height=450,
 geo=True,
 tiles="ESRI",
 groupby=["time"],
 legend="top",
 cmap="gist_ncar",
 widget_location="bottom",
 widget_type="scrubber",
 dynamic=False,
 clim=(0.01, 10),
)

## Query hydrological datasets
Hydrological queries are still being tested and the output format is likely to change. Stay tuned!

In [None]:
query = {"datasets": "deh"}
xds = xd.Query(**query)
xds.data

In [None]:
ds = (
 xd.Query(
 **{
 "datasets": {
 "deh": {
 "id": ["020*"],
 "regulated": ["Natural"],
 "variables": ["streamflow"],
 }
 },
 "time": {"start": "1970-01-01", "minimum_duration": (10 * 365, "d")},
 }
 )
 .data.squeeze()
 .load()
)

ds

In [None]:
query = {"datasets": "hydat"}
xds = xd.Query(**query)
xds.data