National trends in peak annual streamflow
Introduction
This notebook demonstrates a slightly more advanced application of the dataretrieval.waterdata module: assembling a dataset of historical annual peak streamflow and regressing peak discharge against time to look for trends — not at a single station, but across many.
Setup
Before we begin any analysis, we’ll need to set up our environment by importing a few modules.
[1]:
import numpy as np
import pandas as pd
from scipy import stats
from dataretrieval import waterdata
---------------------------------------------------------------------------
ModuleNotFoundError Traceback (most recent call last)
Cell In[1], line 3
1 import numpy as np
2 import pandas as pd
----> 3 from scipy import stats
4
5 from dataretrieval import waterdata
ModuleNotFoundError: No module named 'scipy'
Basic usage
The waterdata module is the recommended interface to USGS water data and replaces the deprecated nwis module. Annual peak streamflow is retrieved with waterdata.get_peaks(), which returns a pandas data frame and a metadata object. To get started we need a monitoring location ID, a parameter code, and (optionally) a time window.
[2]:
# download annual peaks (discharge, parameter 00060) from a single site
df, md = waterdata.get_peaks(
monitoring_location_id="USGS-03339000",
parameter_code="00060",
time="1970-01-01/..",
)
df.head()
---------------------------------------------------------------------------
NameError Traceback (most recent call last)
Cell In[2], line 2
1 # download annual peaks (discharge, parameter 00060) from a single site
----> 2 df, md = waterdata.get_peaks(
3 monitoring_location_id="USGS-03339000",
4 parameter_code="00060",
5 time="1970-01-01/..",
NameError: name 'waterdata' is not defined
All we require for the trend analysis are the peak date (time), the monitoring location ID (monitoring_location_id), and the peak streamflow (value). The Water Data API returns a flat (single-index) data frame with one row per annual peak.
Preparing the regression
Next we’ll define a function that applies ordinary least squares to peak discharge versus time. After grouping the dataset by monitoring_location_id, we apply the regression per monitoring location. Each location’s result is returned as a row with the slope, y-intercept, p value, and standard error of the regression.
[3]:
def peak_trend_regression(df):
# convert peak dates to days since the first peak for the regression
peak_date = pd.to_datetime(df["time"])
peak_d = (peak_date - peak_date.min()) / np.timedelta64(1, "D")
# normalize the peak discharge values
value = (df["value"] - df["value"].mean()) / df["value"].std()
slope, intercept, _r_value, p_value, std_error = stats.linregress(peak_d, value)
return pd.Series(
{
"slope": slope,
"intercept": intercept,
"p_value": p_value,
"std_error": std_error,
}
)
Preparing the analysis
[4]:
def peak_trend_analysis(state_names, start_date):
"""
state_names : list
state names to include in the analysis, e.g. ["Illinois", "Indiana"].
start_date : string
the date to use as the beginning of the analysis (YYYY-MM-DD).
"""
final_df = pd.DataFrame()
for state in state_names:
# find stream gages in the state
sites, _ = waterdata.get_monitoring_locations(
state_name=state, site_type_code="ST", skip_geometry=True
)
# download annual peak discharge for those sites
df, _ = waterdata.get_peaks(
monitoring_location_id=sites["monitoring_location_id"].tolist(),
parameter_code="00060",
time=f"{start_date}/..",
)
# group the data by site and apply our regression
temp = (
df.groupby("monitoring_location_id")
.apply(peak_trend_regression)
.dropna()
)
# drop any insignificant results
temp = temp[temp["p_value"] < 0.05]
# join site metadata (for mapping) with the trend results
merged = pd.merge(
sites, temp, right_index=True, left_on="monitoring_location_id"
)
final_df = pd.concat([final_df, merged], ignore_index=True)
return final_df
To run the analysis for all states since 1970, one would only need to uncomment and run the following lines. However, pulling all that data from the Water Data API takes time and could put a burden on resources.
[5]:
# Warning: these lines download a large dataset from the web and
# will take a few minutes to run.
# start = "1970-01-01"
# states = ["Illinois", "Indiana", "Ohio"]
# final_df = peak_trend_analysis(state_names=states, start_date=start)
# final_df.to_csv("datasets/peak_discharge_trends.csv")
Instead, let’s quickly load some pre-generated results bundled with this notebook. (This example dataset was produced by an earlier run of the analysis and retains the column names from that run.)
[6]:
final_df = pd.read_csv("datasets/peak_discharge_trends.csv")
final_df.head()
[6]:
| Unnamed: 0 | agency_cd | site_no | station_nm | site_tp_cd | dec_lat_va | dec_long_va | coord_acy_cd | dec_coord_datum_cd | alt_va | alt_acy_va | alt_datum_cd | huc_cd | intercept | p_value | slope | std_error | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | USGS | 2343275 | ABBIE CREEK NEAR ABBEVILLE AL | ST | 31.561836 | -85.204932 | U | NAD83 | NaN | NaN | NaN | 3130004.0 | -0.641911 | 0.015899 | 0.000231 | 0.000065 |
| 1 | 1 | USGS | 2360275 | JUDY CREEK NEAR OZARK AL | ST | 31.463224 | -85.572159 | U | NAD83 | NaN | NaN | NaN | 3140201.0 | -0.702569 | 0.004652 | 0.000269 | 0.000069 |
| 2 | 2 | USGS | 2360500 | EAST FORK CHOCTAWHATCHEE R NEAR MIDLAND CITY AL | ST | 31.373227 | -85.477158 | U | NAD83 | 179.1 | 0.01 | NGVD29 | 3140201.0 | -1.138552 | 0.007698 | 0.000211 | 0.000003 |
| 3 | 3 | USGS | 2367400 | YELLOW RIVER NR SANFORD, ALA | ST | 31.317391 | -86.355788 | U | NAD83 | NaN | NaN | NaN | 3140103.0 | -0.611310 | 0.015836 | 0.000511 | 0.000013 |
| 4 | 4 | USGS | 2367500 | LIGHTWOOD KNOT CREEK AT BABBIE AL | ST | 31.270725 | -86.313564 | U | NAD83 | 185.0 | 0.01 | NGVD29 | 3140103.0 | -0.705676 | 0.015231 | 0.000210 | 0.000052 |
Notice how the data has been transformed. In addition to statistics about the peak streamflow trends, the analysis joined monitoring-location metadata to add latitude and longitude for each station.
Plotting the results
Finally we’ll use basemap and matplotlib, along with the location information from the Water Data API, to plot the results on a map (shown below). Monitoring locations with increasing peak annual discharge are shown in red, and those with decreasing peaks in blue.
[7]:
# Currently commented out as there isn't an easy way to install mpl_toolkits
# on a remote machine without spinning up a full geospatial stack.
# from mpl_toolkits.basemap import Basemap, cm
# import matplotlib.pyplot as plt
# fig = plt.figure(num=None, figsize=(10, 6) )
# # setup a basemap covering the contiguous United States
# m = Basemap(width=5500000, height=4000000, resolution='l',
# projection='aea', lat_1=36., lat_2=44, lon_0=-100, lat_0=40)
# # add coastlines
# m.drawcoastlines(linewidth=0.5)
# # add parallels and meridians.
# m.drawparallels(np.arange(-90.,91.,15.),labels=[True,True,False,False],dashes=[2,2])
# m.drawmeridians(np.arange(-180.,181.,15.),labels=[False,False,False,True],dashes=[2,2])
# # add boundaries and rivers
# m.drawcountries(linewidth=1, linestyle='solid', color='k' )
# m.drawstates(linewidth=0.5, linestyle='solid', color='k')
# m.drawrivers(linewidth=0.5, linestyle='solid', color='cornflowerblue')
# increasing = final_df[final_df['slope'] > 0]
# decreasing = final_df[final_df['slope'] < 0]
# #x,y = m(lons, lats)
# # categorical plots get a little ugly in basemap
# m.scatter(increasing['dec_long_va'].tolist(),
# increasing['dec_lat_va'].tolist(),
# label='increasing', s=2, color='red',
# latlon=True)
# m.scatter(decreasing['dec_long_va'].tolist(),
# decreasing['dec_lat_va'].tolist(),
# label='increasing', s=2, color='blue',
# latlon=True)