11: Using Xarray to look at Daymet precipitation around Mt. Rainier¶
In this example, we’ll use xarray to load and process a NetCDF file of gridded precipitation data from the Daymet model. Xarray aims to make working with multi-dimensional data easier by labeling indices, allowing arrays to be indexed or sliced based on x and y coordinates or time (similar to what pandas does for tabular data).
Next, we’ll explore some of the extended geospatial functionality of the rioxarray extension.
Note that there is also a UXarray package in early development that provides xarray-like functionality for unstructured grids.
Datasets:

* Gridded precipitation output from the Daymet model, for 1980-2018, for the area around Mt. Rainier.
* An elevation raster for the area around Mt. Rainier, created in the gis_raster_mt_rainier_glaciers.ipynb exercise.

Xarray operations:

* make a DataArray from scratch (using the DataArray() constructor)
* load a NetCDF dataset into an xarray.Dataset instance
* plot the NetCDF data in projected or geographic coordinates
* get values at point locations
* coordinate transformations
* slicing in time and space
* boolean slicing
* computing monthly averages using groupby
* outputting extracted timeseries to pandas DataFrames
* subsetting the xarray dataset and saving to a NetCDF file

rioxarray operations:

* add coordinate reference information to a dataset
* reproject a dataset
* write one or more (2D) timeslices of a dataset to a raster
* clip a dataset to a polygon feature

References:

* The Xarray manual
* Xarray in 45 minutes
* The GeoHackWeek tutorials
* this tutorial from Columbia University
[1]:
from pathlib import Path
import numpy as np
import pandas as pd
import rasterio
from rasterio.plot import show
import xarray as xr
from pyproj import Transformer
import matplotlib.pyplot as plt
from matplotlib.colors import LightSource
# for elevation hillshade if we want to use it
ls = LightSource(azdeg=315, altdeg=45)
Xarray¶
DataArrays¶
The basic data structure in xarray is the DataArray, analogous to a Series in pandas. DataArrays store data for a single variable and can have any number of dimensions. Below is a simple example of a 1D DataArray.
[2]:
da = xr.DataArray([9, 0, 2, 1, 0])
da.plot()
[2]:
[<matplotlib.lines.Line2D at 0x7fa04aab7710>]
Assigning names to the dimensions¶
(and naming the variable)
[3]:
da = xr.DataArray([9, 0, 2, 1, 0], dims=['x'], name='Temperature')
da.plot()
[3]:
[<matplotlib.lines.Line2D at 0x7fa04282a150>]
Coordinates¶
The real power of Xarray is being able to specify coordinates for each data point (e.g. in x, y, z, time), and then index or slice the data using those coordinates, like in pandas
[4]:
da = xr.DataArray([9, 0, 2, 1, 0],
                  dims=['x'], coords={'x': [10, 20, 30, 40, 50]},
                  name='Temperature')
da
[4]:
<xarray.DataArray 'Temperature' (x: 5)> Size: 40B array([9, 0, 2, 1, 0]) Coordinates: * x (x) int64 40B 10 20 30 40 50
[5]:
da.plot(marker='o')
[5]:
[<matplotlib.lines.Line2D at 0x7fa04ab10a10>]
[6]:
da.loc[:20]
[6]:
<xarray.DataArray 'Temperature' (x: 2)> Size: 16B array([9, 0]) Coordinates: * x (x) int64 16B 10 20
As in pandas, we can slice the data using ranges of coordinate values, but the syntax requires passing a slice object for the dimension name in .sel(). As in pandas (and unlike base Python or NumPy), the end value of the slice is included.
[7]:
da.sel(x=slice(30,50))
[7]:
<xarray.DataArray 'Temperature' (x: 3)> Size: 24B array([2, 1, 0]) Coordinates: * x (x) int64 24B 30 40 50
Datasets¶
A Dataset holds multiple DataArrays, analogous to a DataFrame in pandas housing multiple Series. Like in Pandas, the different DataArrays can share coordinates, and they can be assigned variable names (analogous to column names). The graphic at the beginning of this lesson shows a Dataset with precipitation, temperature, latitude and longitude DataArrays.
A Dataset can be created from a DataArray with the .to_dataset() method:
[8]:
ds = da.to_dataset()
Additional DataArrays can then be added:
[9]:
ds['precip'] = xr.DataArray(3 * np.array([.5, 0., 0., .1, 0.]), dims=['x'])
A Dataset can also be built from scratch with the Dataset() constructor, supplying data variables and coordinates:
[10]:
temp = 15 + 8 * np.random.randn(2, 2, 3)
precip = 10 * np.random.rand(2, 2, 3)
lon = [[-99.83, -99.32], [-99.79, -99.23]]
lat = [[42.25, 42.21], [42.63, 42.59]]
ds = xr.Dataset(
    {
        "temperature": (["x", "y", "time"], temp),
        "precipitation": (["x", "y", "time"], precip),
    },
    coords={
        "lon": (["x", "y"], lon),
        "lat": (["x", "y"], lat),
        "time": pd.date_range("2014-09-06", periods=3),
        "reference_time": pd.Timestamp("2014-09-05"),
    },
)
inputs¶
* Daymet gridded precipitation estimates for the area around Mt. Rainier, 1980-2018, obtained via the Daymet Tile Selection Tool
* a PROJ string for Daymet, which could be created from the dataset description; however, it was easier to just copy it from here
* a 30 m DEM of elevations around Mt. Rainier, in geographic coordinates, created in the gis_raster_mt_rainier_glaciers.ipynb exercise
[11]:
daymet_prcp_data = 'data/xarray/daymet_prcp_rainier_1980-2018.nc'
daymet_proj_string = ('+proj=lcc +lon_0=-100 +lat_0=42.5 +x_0=0 +y_0=0 '
                      '+lat_1=25 +lat_2=60 +ellps=WGS84')
mt_rainier_elevations = 'data/xarray/aligned-19700901_ned1_2003_adj_4269.tif'
Load the Mt. Rainier precipitation dataset¶
Note: Datasets spanning multiple files can also be loaded using the xr.open_mfdataset() constructor (http://xarray.pydata.org/en/stable/generated/xarray.open_mfdataset.html; requires the Dask package). The beauty of xarray’s open methods (e.g. xr.open_dataset()) is that they are lazy: the dataset is scanned to gain an understanding of its contents, but the actual data aren’t loaded into memory until they are needed. This allows one to work with very large (10s or 100s of GB) datasets that wouldn’t fit into memory. (In contrast, xr.load_dataset(), used below, reads the data into memory immediately.)
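A minimal sketch of lazy, multi-file loading (the file pattern here is hypothetical, and combine='by_coords' assumes the files share compatible coordinates):

import xarray as xr

# scanning only; no precipitation values are read at this point
ds_lazy = xr.open_mfdataset('data/xarray/daymet_prcp_*.nc', combine='by_coords')
mean_prcp = ds_lazy['prcp'].mean(dim='time')  # builds a lazy Dask computation
result = mean_prcp.compute()  # only now are the values actually read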
[12]:
ds = xr.load_dataset(daymet_prcp_data)
ds
[12]:
<xarray.Dataset> Size: 53MB Dimensions: (y: 31, x: 30, time: 14235, nv: 2) Coordinates: lat (y, x) float64 7kB 46.96 46.97 ... 46.76 46.76 lon (y, x) float64 7kB -122.0 -122.0 ... -121.5 -121.5 * x (x) float64 240B -1.581e+06 ... -1.552e+06 * y (y) float64 248B 6.83e+05 6.82e+05 ... 6.53e+05 * time (time) datetime64[ns] 114kB 1980-01-01T12:00:00 ... Dimensions without coordinates: nv Data variables: lambert_conformal_conic (time) int16 28kB -32767 -32767 ... -32767 -32767 yearday (time) int16 28kB 0 1 2 3 4 ... 360 361 362 363 364 prcp (time, y, x) float32 53MB nan 15.0 15.0 ... 0.0 0.0 time_bnds (time, nv) datetime64[ns] 228kB 1980-01-01 ... 2... Attributes: tileid: 12270 start_year: 1980 source: Daymet Software Version 3.0 Version_software: Daymet Software Version 3.0 Version_data: Daymet Data Version 3.0 Conventions: CF-1.6 citation: Please see http://daymet.ornl.gov/ for current Daymet ... references: Please see http://daymet.ornl.gov/ for current informa...
Access the precipitation variable¶
[13]:
ds['prcp']
[13]:
<xarray.DataArray 'prcp' (time: 14235, y: 31, x: 30)> Size: 53MB array([[[nan, 15., 15., ..., 15., 12., 13.], [15., 16., 16., ..., 15., 13., 14.], [15., 16., 17., ..., 16., 14., 14.], ..., [20., 18., 17., ..., 14., 18., 18.], [20., 20., 18., ..., 15., 17., 19.], [22., 20., 21., ..., 15., 16., 17.]], [[nan, 6., 6., ..., 8., 7., 8.], [ 5., 6., 6., ..., 8., 7., 8.], [ 5., 5., 6., ..., 9., 7., 8.], ..., [ 8., 6., 6., ..., 4., 7., 7.], [ 8., 8., 7., ..., 5., 7., 8.], [ 9., 8., 8., ..., 5., 6., 7.]], [[nan, 11., 11., ..., 13., 12., 13.], [11., 11., 11., ..., 13., 12., 13.], [11., 11., 12., ..., 14., 12., 13.], ..., ... ..., [30., 28., 26., ..., 20., 24., 24.], [30., 30., 28., ..., 21., 24., 25.], [32., 30., 31., ..., 20., 22., 23.]], [[nan, 14., 13., ..., 18., 13., 15.], [14., 15., 14., ..., 18., 15., 17.], [15., 15., 16., ..., 20., 16., 17.], ..., [24., 22., 21., ..., 16., 20., 20.], [24., 24., 22., ..., 17., 19., 20.], [25., 24., 24., ..., 16., 18., 18.]], [[nan, 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], [ 0., 0., 0., ..., 0., 0., 0.], ..., [ 3., 2., 1., ..., 0., 0., 0.], [ 3., 4., 2., ..., 0., 0., 0.], [ 5., 4., 4., ..., 0., 0., 0.]]], dtype=float32) Coordinates: lat (y, x) float64 7kB 46.96 46.97 46.97 46.97 ... 46.76 46.76 46.76 lon (y, x) float64 7kB -122.0 -122.0 -122.0 ... -121.5 -121.5 -121.5 * x (x) float64 240B -1.581e+06 -1.58e+06 ... -1.553e+06 -1.552e+06 * y (y) float64 248B 6.83e+05 6.82e+05 6.81e+05 ... 6.54e+05 6.53e+05 * time (time) datetime64[ns] 114kB 1980-01-01T12:00:00 ... 2018-12-31T1... Attributes: long_name: daily total precipitation units: mm/day grid_mapping: lambert_conformal_conic cell_methods: area: mean time: sum
make a quick plot of the mean precipitation through time (1980-2018) at each pixel (data point)¶
[14]:
ds.prcp.mean(dim='time').plot(cbar_kwargs={'label': ds['prcp'].units})
[14]:
<matplotlib.collections.QuadMesh at 0x7fa04284b310>
what if we want to plot the data in geographic coordinates?¶
We can use the attached DataArray.plot.pcolormesh method, which wraps pyplot.pcolormesh. pcolormesh takes grids of x and y values corresponding to each position in the array to be plotted; the attached xarray method just requires us to name the coordinates holding the x and y values.
[15]:
fig, ax = plt.subplots(figsize=(11, 8.5))
qm = ds.prcp.mean(dim='time').plot.pcolormesh(x='lon', y='lat', rasterized=True)
Let’s add some topography for reference¶
Load the elevation data with rasterio, get the bounding extent, and use rasterio.plot.show to quickly plot the data with its coordinates (instead of row, column locations).
[16]:
with rasterio.open(mt_rainier_elevations) as src:
    elev_bounds = src.bounds
    # reorder the rasterio bounds for pyplot
    elev_extent = (elev_bounds[0], elev_bounds[2],
                   elev_bounds[1], elev_bounds[3])
    elevations = src.read(1)
    show(src)
Overlay the mean precip. values on the topography¶
* use the matplotlib LightSource.hillshade method to convert the elevations to a shaded relief raster
* assign zorder=-1 to ensure that the hillshade plots on the bottom
* plot mean precip, but specify alpha < 1 so that it has some transparency; the colorbar can be controlled with cbar_kwargs (arguments that are passed to the matplotlib colorbar constructor)
* set the plot limits to the extent of the elevation raster (* unpacks the extent values into individual arguments)
[17]:
fig, ax = plt.subplots(figsize=(11, 8.5))
ax.imshow(ls.hillshade(elevations, vert_exag=0.1), cmap='gray', extent=elev_extent,
          zorder=-1)
cbar_kwargs = {'shrink': 0.7,
               'label': f"Mean precipitation, 1980-2018, {ds['prcp'].units}"}
qm = ds.prcp.mean(dim='time').plot.pcolormesh(x='lon', y='lat',
                                              rasterized=True,
                                              #linewidth=0,
                                              alpha=0.4,
                                              cbar_kwargs=cbar_kwargs)
ax.set_ylim(*elev_extent[2:])
ax.set_xlim(*elev_extent[:2])
[17]:
(-121.90692427970113, -121.57658593832609)
Getting data at point locations¶
Say we want to get data for the mountain summit, and at the Paradise Visitor Center in Mt. Rainier National Park. We can easily get the lat, lon coordinates for these locations, but to get data from this dataset, we need to work in the custom coordinate system for Daymet. Luckily, we found a PROJ string to define that, which we assigned to the variable daymet_proj_string. With the PROJ string, we can use pyproj to transform the coordinates to the Daymet CRS.
[18]:
coords = {'summit': (46.852886, -121.760374),
          'paradise': (46.7868, -121.7338)}
transformer = Transformer.from_crs(4269, daymet_proj_string)
coords_lcc = {k: transformer.transform(*v) for k, v in coords.items()}
coords_lcc
[18]:
{'summit': (-1567320.5383531058, 667030.7252830392),
'paradise': (-1567257.837050381, 659751.1451685613)}
plot the two locations¶
[19]:
fig, ax = plt.subplots(figsize=(11, 8.5))
ax.imshow(ls.hillshade(elevations, vert_exag=0.1), cmap='gray', extent=elev_extent,
          zorder=-1)
qm = ds.prcp.mean(dim='time').plot.pcolormesh(x='lon', y='lat',
                                              rasterized=True,
                                              #linewidth=0,
                                              alpha=0.4)
ax.set_ylim(*elev_extent[2:])
ax.set_xlim(*elev_extent[:2])
for label, (y, x) in coords.items():
    ax.scatter(x, y, c='k')
    ax.text(x, y, label, transform=ax.transData)
Get a time series of precip values at the nearest pixel to the summit¶
[20]:
x, y = coords_lcc['summit']
ds.prcp.sel(x=[x], y=[y], method='nearest').plot()
[20]:
[<matplotlib.lines.Line2D at 0x7fa03ea30450>]
we can also slice along the time axis¶
But to do that, we need to chain the operation (instead of including a time= argument in the first call to DataArray.sel()), because the method='nearest' argument can’t be combined with slice objects.
[21]:
ds.prcp.sel(x=[x], y=[y], method='nearest').sel(time=slice('2015', '2018')).plot()
[21]:
[<matplotlib.lines.Line2D at 0x7fa03ea74350>]
slicing the whole dataset in time¶
Getting the mean value for 2012 (note: we could also simply write time='2012', which selects every timestamp in that year).
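For example, this one-liner is a sketch of the year-string shorthand, equivalent to the explicit slice in the next cell:

ds.prcp.sel(time='2012').mean(dim='time').plot()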
[22]:
ds.prcp.sel(time=slice('2012-01-01', '2012-12-31')).mean(axis=0).plot()
[22]:
<matplotlib.collections.QuadMesh at 0x7fa041c52390>
making a 2D slice spatially¶
[23]:
ds.prcp.sel(x=slice(-1570000, -1565000), y=slice(670000, 665000)).mean(axis=0).plot()
[23]:
<matplotlib.collections.QuadMesh at 0x7fa041dedfd0>
what if we want to slice by latitude and longitude¶
The projected coordinate system that this dataset is aligned with is not aligned with latitude and longitude, so the lat and lon coordinates associated with each pixel are 2D.
[24]:
ds.lat.shape
[24]:
(31, 30)
boolean slicing¶
In other words, a box aligned with latitude and longitude will be rotated in the native CRS. Therefore, we either have to specify explicitly which lat, lon values we want, or we can use a boolean mask. Let’s say we want all of the pixels between 46.8 and 46.9 N latitude, and -121.8 and -121.7 longitude.
[25]:
valid_lat = (ds['lat'].values > 46.8) & (ds['lat'].values < 46.9)
valid_lon = (ds['lon'].values > -121.8) & (ds['lon'].values < -121.7)
[26]:
plt.imshow(valid_lat&valid_lon)
[26]:
<matplotlib.image.AxesImage at 0x7fa03e963c90>
[27]:
ds.prcp.loc[:].where(valid_lat&valid_lon).mean(axis=0).plot()
[27]:
<matplotlib.collections.QuadMesh at 0x7fa03e9d2390>
Writing out subsetted data to new netcdf file¶
The above boolean indexing only masks the pixels that evaluated to False. Oftentimes, we want to subset a large dataset to only include the area we are working in.
drop the rows and columns that have all nans¶
[28]:
prcp_subset = ds.prcp.loc[:].where(valid_lat&valid_lon)
prcp_subset = prcp_subset.dropna(dim='x', how='all').dropna(dim='y', how='all')
prcp_subset.mean(axis=0).plot()
[28]:
<matplotlib.collections.QuadMesh at 0x7fa041c8eb10>
write it out to a NetCDF file¶
* use encoding to reduce the resulting file size
* further reading on compression: https://unidata.github.io/netcdf4-python/#efficient-compression-of-netcdf-variables
[29]:
# make an output folder first
output_folder = Path('11-output')
output_folder.mkdir(exist_ok=True)
encoding = {'prcp': {'zlib': True,  # compression algorithm
                     'complevel': 4,  # compression level
                     'dtype': 'float32',
                     '_FillValue': -9999,
                     }}
prcp_subset.to_netcdf(output_folder / 'rainier_prcp_subset.nc',
                      encoding=encoding)
Groupby¶
* more in the xarray manual
* note that groupby() operations can be slow; the flox project aims to speed them up
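A quick sketch (assuming flox is installed and an xarray version that supports the use_flox option): recent xarray versions dispatch groupby reductions to flox automatically when it is available, and the option just makes that explicit.

import xarray as xr

# with flox installed, this grouped reduction uses flox's faster code path
with xr.set_options(use_flox=True):
    monthly_means = ds.prcp.groupby('time.month').mean(dim='time')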
getting monthly values¶
[30]:
ds.prcp.groupby('time.month').mean(dim='time').shape
[30]:
(12, 31, 30)
monthly mean precipitation for the whole dataset, in mm/day¶
[31]:
ds.prcp.groupby('time.month').mean(dim='time').mean(axis=(1, 2)).plot()
[31]:
[<matplotlib.lines.Line2D at 0x7fa03e7135d0>]
annual precip for whole dataset, in feet¶
[32]:
annual_precip = ds.prcp.groupby('time.year').sum(dim='time')/304.8
annual_precip.mean(axis=(1, 2)).plot()
[32]:
[<matplotlib.lines.Line2D at 0x7fa03e789110>]
comparing precip at the summit vs. Paradise¶
[33]:
coords_lcc
[33]:
{'summit': (-1567320.5383531058, 667030.7252830392),
'paradise': (-1567257.837050381, 659751.1451685613)}
convert to feet per month¶
by multiplying by 30.4, the average number of days in a month. This is just for the sake of simplicity in this exercise; a better way is to account for the actual number of days in each month (see the sketch after the next cell).
[34]:
prcp_monthly_mean_ft = ds.prcp.groupby('time.month').mean(dim='time') * (30.4/304.8)
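Here is a sketch of that more accurate approach (one option among several): compute true monthly totals with resample(), then average them across years. The 'ME' (month-end) frequency alias assumes a recent pandas; older versions use 'M'.

# monthly totals in mm: sum the daily mm/day values within each calendar month
monthly_totals_mm = ds.prcp.resample(time='ME').sum()
# mean total for each calendar month across all years, converted from mm to feet
prcp_monthly_mean_ft_v2 = (monthly_totals_mm.groupby('time.month')
                           .mean(dim='time') / 304.8)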
Note that in the winter, these values represent snow-water equivalent, not snowfall¶
During the winter of 1971-1972, Paradise set the world record for snowfall, with 1,122 inches (93.5 ft, 28.5 m). It was later broken at Mt. Baker, where 1,140 inches (95 feet) was recorded at the Mt. Baker Ski Area during the 1998-1999 season. Annual average snowfall at Paradise is 54.6 feet for the period 1916-2016.
[35]:
fig, ax = plt.subplots()
for site in ['summit', 'paradise']:
    prcp_site = prcp_monthly_mean_ft.sel(x=coords_lcc[site][0],
                                         y=coords_lcc[site][1],
                                         method='nearest')
    prcp_site.plot(label=site)
ax.legend()
Compare annual precip¶
[36]:
fig, ax = plt.subplots()
for site in ['summit', 'paradise']:
    prcp_site = annual_precip.sel(x=coords_lcc[site][0],
                                  y=coords_lcc[site][1],
                                  method='nearest')
    prcp_site.plot(label=site)
ax.legend()
plot maps of average monthly precip with topography¶
[37]:
prcp_monthly_mean_ft.loc[1].plot()
[37]:
<matplotlib.collections.QuadMesh at 0x7fa03e79a390>
The prcp_monthly_mean_ft DataArray we created above is grouped by month. Months are on the 0 axis, which ranges from 1 to 12.
[38]:
prcp_monthly_mean_ft['month']
[38]:
<xarray.DataArray 'month' (month: 12)> Size: 96B array([ 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12]) Coordinates: * month (month) int64 96B 1 2 3 4 5 6 7 8 9 10 11 12
compare January and August¶
[39]:
fig, axes = plt.subplots(1, 2, figsize=(15, 8.5))
months = [1, 8]
for i, ax in enumerate(axes.flat):
    month = months[i]
    ax.imshow(ls.hillshade(elevations, vert_exag=0.1), cmap='gray', extent=elev_extent,
              zorder=-1)
    qm = prcp_monthly_mean_ft.loc[month].plot.pcolormesh(
        x='lon', y='lat', rasterized=True, #linewidth=0,
        alpha=0.4, ax=ax,
        cbar_kwargs={"shrink": 0.5,
                     "label": "Total precip., in feet"}
    )
    ax.set_ylim(*elev_extent[2:])
    ax.set_xlim(*elev_extent[:2])
    for label, (y, x) in coords.items():
        ax.scatter(x, y, c='k')
        ax.text(x, y, label, transform=ax.transData)
what if we just want to get a timeseries of values at each point of interest, in a pandas DataFrame?¶
first make lists of the x and y values, and a corresponding list of site names¶
[40]:
sites = ['summit', 'paradise']
x = []
y = []
for key in sites:
    x.append(coords_lcc[key][0])
    y.append(coords_lcc[key][1])
x, y
[40]:
([-1567320.5383531058, -1567257.837050381],
[667030.7252830392, 659751.1451685613])
For each site, make the selection, drop the indices we don’t need, and convert the resulting DataArray to a pandas DataFrame with the to_dataframe() method. Append each DataFrame to a list and then concatenate them into a single DataFrame (axis=1 specifies that the DataFrames should be concatenated along the column axis).
[41]:
dfs = []
for i, site in enumerate(sites):
    # note the loop index: each site gets its own x, y location
    df = ds.prcp.sel(x=x[i],
                     y=y[i],
                     method='nearest').drop_vars(['lat', 'lon', 'x', 'y']).to_dataframe()
    df.columns = [site]
    dfs.append(df)
df = pd.concat(dfs, axis=1)
[42]:
df.head()
[42]:
| time | summit | paradise |
|---|---|---|
| 1980-01-01 12:00:00 | 48.0 | 48.0 |
| 1980-01-02 12:00:00 | 24.0 | 24.0 |
| 1980-01-03 12:00:00 | 31.0 | 31.0 |
| 1980-01-04 12:00:00 | 23.0 | 23.0 |
| 1980-01-05 12:00:00 | 48.0 | 48.0 |
rioxarray¶
Xarray focuses on operations within a dataset, or between datasets in the same coordinate reference system (CRS). It does not keep track of coordinate reference information or do any transformations.
rioxarray is an extension for xarray that provides additional geospatial functionality, including coordinate transformations, intersections with other geospatial features including polygons, and writing arrays to rasters. It uses rasterio and pyproj to do this.
[43]:
import rioxarray
Setting coordinate reference information¶
Most operations with rioxarray require a coordinate reference. In this case, we have a PROJ string that we got from the Daymet website, and coordinate reference information embedded in the dataset.
We can set the coordinate reference by calling .rio.write_crs() on the Xarray object, which writes coordinate reference information to the Xarray dataset in a CF-compliant manner. Similar to pandas, a copy is returned, or we can set the coordinate reference in-place.
[44]:
ds.rio.write_crs(daymet_proj_string, inplace=True)
[44]:
<xarray.Dataset> Size: 53MB Dimensions: (y: 31, x: 30, time: 14235, nv: 2) Coordinates: lat (y, x) float64 7kB 46.96 46.97 ... 46.76 46.76 lon (y, x) float64 7kB -122.0 -122.0 ... -121.5 -121.5 * x (x) float64 240B -1.581e+06 ... -1.552e+06 * y (y) float64 248B 6.83e+05 6.82e+05 ... 6.53e+05 lambert_conformal_conic int64 8B 0 * time (time) datetime64[ns] 114kB 1980-01-01T12:00:00 ... Dimensions without coordinates: nv Data variables: yearday (time) int16 28kB 0 1 2 3 4 ... 360 361 362 363 364 prcp (time, y, x) float32 53MB nan 15.0 15.0 ... 0.0 0.0 time_bnds (time, nv) datetime64[ns] 228kB 1980-01-01 ... 2... Attributes: tileid: 12270 start_year: 1980 source: Daymet Software Version 3.0 Version_software: Daymet Software Version 3.0 Version_data: Daymet Data Version 3.0 Conventions: CF-1.6 citation: Please see http://daymet.ornl.gov/ for current Daymet ... references: Please see http://daymet.ornl.gov/ for current informa...
Setting the coordinate reference from information in the NetCDF dataset¶
Note: this is often specific to the individual dataset! In this case, we have a number of parameters that may be sufficient to define a CRS. The pyproj package provides a CRS object that allows coordinate reference information to be robustly defined so that it can be used for transformations.
[45]:
ds['lambert_conformal_conic']
[45]:
<xarray.DataArray 'lambert_conformal_conic' ()> Size: 8B array(0) Coordinates: lambert_conformal_conic int64 8B 0 Attributes: (12/17) crs_wkt: PROJCS["unknown",GEOGCS["unknown",DATUM["... semi_major_axis: 6378137.0 semi_minor_axis: 6356752.314245179 inverse_flattening: 298.257223563 reference_ellipsoid_name: WGS 84 longitude_of_prime_meridian: 0.0 ... ... standard_parallel: (25.0, 60.0) latitude_of_projection_origin: 42.5 longitude_of_central_meridian: -100.0 false_easting: 0.0 false_northing: 0.0 spatial_ref: PROJCS["unknown",GEOGCS["unknown",DATUM["...
[46]:
from pyproj import CRS
daymet_crs = CRS.from_cf(ds['lambert_conformal_conic'].attrs)
daymet_crs
[46]:
<Projected CRS: PROJCS["unknown",GEOGCS["unknown",DATUM["Unknown b ...>
Name: unknown
Axis Info [cartesian]:
- [east]: Easting (metre)
- [north]: Northing (metre)
Area of Use:
- undefined
Coordinate Operation:
- name: unnamed
- method: Lambert Conic Conformal (2SP)
Datum: Unknown based on WGS 84 ellipsoid
- Ellipsoid: WGS 84
- Prime Meridian: Greenwich
[47]:
ds.rio.write_crs(daymet_crs, inplace=True)
[47]:
<xarray.Dataset> Size: 53MB Dimensions: (y: 31, x: 30, time: 14235, nv: 2) Coordinates: lat (y, x) float64 7kB 46.96 46.97 ... 46.76 46.76 lon (y, x) float64 7kB -122.0 -122.0 ... -121.5 -121.5 * x (x) float64 240B -1.581e+06 ... -1.552e+06 * y (y) float64 248B 6.83e+05 6.82e+05 ... 6.53e+05 * time (time) datetime64[ns] 114kB 1980-01-01T12:00:00 ... lambert_conformal_conic int64 8B 0 Dimensions without coordinates: nv Data variables: yearday (time) int16 28kB 0 1 2 3 4 ... 360 361 362 363 364 prcp (time, y, x) float32 53MB nan 15.0 15.0 ... 0.0 0.0 time_bnds (time, nv) datetime64[ns] 228kB 1980-01-01 ... 2... Attributes: tileid: 12270 start_year: 1980 source: Daymet Software Version 3.0 Version_software: Daymet Software Version 3.0 Version_data: Daymet Data Version 3.0 Conventions: CF-1.6 citation: Please see http://daymet.ornl.gov/ for current Daymet ... references: Please see http://daymet.ornl.gov/ for current informa...
Reprojecting the dataset¶
Here we reproject the dataset into geographic coordinates (Lat/Lon). We can use the spatial reference ID 4269 (an EPSG code) to specify the destination coordinate reference system (geographic coordinates with the NAD83 datum).
Note that this dataset already includes lat/lon coordinates. Because the x and y axes are in a projected coordinate reference system (Lambert Conformal Conic), the lat/lon coordinates are two dimensional (each point in the dataset has its own lat/lon coordinate). rioxarray doesn’t like this; as of version 0.15, it produces a ValueError: IndexVariable objects must be 1-dimensional when we try to reproject. To get around this, we can drop the lat/lon coordinates before calling .rio.reproject().
[48]:
ds_4269 = ds['prcp'].drop_vars(['lat', 'lon']).rio.reproject(4269)
[49]:
ax = ds_4269.sum(axis=0).plot()
Write an xarray dataset to a raster¶
Write part of the dataset out to a GeoTIFF.
Earlier versions of rioxarray (at least through version 0.15) require us to remove the 'grid_mapping' attribute first; you will get an error otherwise. grid_mapping is an attribute from the original NetCDF dataset that specifies that coordinate reference information is contained in a variable called 'lambert_conformal_conic' (we used it above to specify the coordinate reference information for rioxarray).
Write a single 2D timeslice to a raster¶
[50]:
precip_slice = ds.sel(time='2018-12-30')['prcp']
# remove grid_mapping if needed
#del precip_slice.attrs['grid_mapping']
[51]:
precip_slice.rio.to_raster(output_folder / 'pcrp_20181230.tif')
writing a 3D array (of multiple times) to a multi-band raster¶
In this case, each day from 12/25 to 12/30 is written to a separate band.
[52]:
precip_slice = ds.sel(time=slice('2018-12-25', '2018-12-30'))['prcp']
precip_slice.rio.to_raster(output_folder / 'pcrp_20181225-30.tif')
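As a quick sanity check (a sketch), we can open the result with rasterio and confirm that each daily timestep landed in its own band:

# the slice 2018-12-25 to 2018-12-30 (inclusive) contains 6 daily timesteps
with rasterio.open(output_folder / 'pcrp_20181225-30.tif') as src:
    print(src.count)  # expect 6 bands, one per day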
Clip to a shapefile area¶
Sometimes we may want to clip a dataset to a geographic area defined by a polygon shapefile. We can read a shapefile into a GeoDataFrame using geopandas. In this case, we’ll just use the glacier polygons from the Rasterio: Mt. Rainier glaciers example.
[53]:
import geopandas as gpd
gdf = gpd.read_file("data/rasterio/rgi60_glacierpoly_rainier.shp")
gdf.head()
[53]:
| | RGIId | GLIMSId | BgnDate | EndDate | CenLon | CenLat | O1Region | O2Region | Area | Zmin | ... | Aspect | Lmax | Status | Connect | Form | TermType | Surging | Linkages | Name | geometry |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | RGI60-02.14393 | G238292E46809N | 19709999 | 19949999 | -121.70769 | 46.80930 | 2 | 4 | 0.126 | 2001 | ... | 148 | 397 | 0 | 0 | 0 | 0 | 0 | 9 | Williwakas Glacier WA | POLYGON ((-121.70642 46.80792, -121.7065 46.80... |
| 1 | RGI60-02.14394 | G238275E46811N | 19709999 | 19949999 | -121.72504 | 46.81055 | 2 | 4 | 0.010 | 2207 | ... | 222 | -9 | 0 | 0 | 0 | 0 | 0 | 9 | WA | POLYGON ((-121.72552 46.80985, -121.72561 46.8... |
| 2 | RGI60-02.14395 | G238235E46811N | 19709999 | 19949999 | -121.76535 | 46.81088 | 2 | 4 | 0.010 | 2103 | ... | 165 | -9 | 0 | 0 | 0 | 0 | 0 | 9 | WA | POLYGON ((-121.76498 46.81006, -121.76511 46.8... |
| 3 | RGI60-02.14396 | G238243E46810N | 19709999 | 19949999 | -121.75702 | 46.80963 | 2 | 4 | 0.015 | 2186 | ... | 153 | 182 | 0 | 0 | 0 | 0 | 0 | 9 | WA | POLYGON ((-121.75659 46.80885, -121.75663 46.8... |
| 4 | RGI60-02.14397 | G238302E46809N | 19709999 | 19949999 | -121.69756 | 46.80850 | 2 | 4 | 0.012 | 1990 | ... | 125 | 160 | 0 | 0 | 0 | 0 | 0 | 9 | WA | POLYGON ((-121.69777 46.80903, -121.6976 46.80... |

5 rows × 23 columns
Clipping can be time-intensive, so in this case, we’ll just clip all of the arrays from 2018 (n=365).
[54]:
data = ds_4269.sel(time='2018')#.mean(axis=0)
The invert argument specifies whether we want to only include data pixels within the clip feature(s), or outside of the clip feature(s).
[55]:
clipped = data.rio.clip(gdf['geometry'].values, gdf.crs, drop=False, invert=False)
clipped
[55]:
<xarray.DataArray 'prcp' (time: 365, y: 28, x: 40)> Size: 2MB array([[[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], [[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], [[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., ... ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], [[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]], [[nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], ..., [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan], [nan, nan, nan, ..., nan, nan, nan]]], dtype=float32) Coordinates: * x (x) float64 320B -122.0 -122.0 ... -121.5 -121.5 * y (y) float64 224B 47.03 47.02 47.01 ... 46.7 46.69 * time (time) datetime64[ns] 3kB 2018-01-01T12:00:00 ..... lambert_conformal_conic int64 8B 0 Attributes: long_name: daily total precipitation units: mm/day cell_methods: area: mean time: sum
Plot the clipped data on top of the original clip features¶
[56]:
fig, ax = plt.subplots()
gdf.plot(ax=ax)
clipped.mean(axis=0).plot(ax=ax)
[56]:
<matplotlib.collections.QuadMesh at 0x7fa03ebf05d0>