National trends in peak annual streamflow
Introduction
This notebook demonstrates a slightly more advanced application of dataretrieval.nwis: collecting a national dataset of historical peak annual streamflow measurements. The objective is to regress peak annual streamflow against time to identify trends, not for a single station but for many stations at once.
Setup
Before we begin any analysis, we'll need to set up our environment by importing the required modules.
[1]:
from scipy import stats
import pandas as pd
import numpy as np
from dataretrieval import nwis, utils, codes
Basic usage
Recall that the basic way to download data from NWIS is through the nwis.get_record() function, which returns a user-specified record as a pandas dataframe. The nwis.get_record() function is really a facade of sorts that allows the user to download data from various NWIS services through a consistent interface. To get started, we require a few simple parameters: a list of site numbers or state codes, a service, and a start date.
[2]:
# download annual peaks from a single site
df = nwis.get_record(sites='03339000', service='peaks', start='1970-01-01')
df.head()
# alternatively, data for the entire state of Illinois can be downloaded using
#df = nwis.get_record(state_cd='il', service='peaks', start='1970-01-01')
Most of the fields are empty, but no matter. All we require are the date (datetime), site number (site_no), and peak streamflow (peak_va).
Note that when multiple sites are specified, nwis.get_record() will combine the datetime and site_no fields to create a multi-index dataframe.
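The multi-index structure can be illustrated with a small sketch. The site numbers and values below are hypothetical stand-ins for what the peaks service would return:

```python
import pandas as pd

# Hypothetical records mimicking a multi-site query: the index has one
# level for site_no and one for datetime.
df = pd.DataFrame(
    {
        "site_no": ["03339000", "03339000", "05587450"],
        "datetime": pd.to_datetime(["1971-02-20", "1972-04-12", "1971-03-03"]),
        "peak_va": [12300.0, 9800.0, 45100.0],
    }
).set_index(["site_no", "datetime"])

# select all records for one site by slicing the site_no level
print(df.xs("03339000", level="site_no"))
```

Cross-sectioning with .xs() (or .loc) on the site_no level is how a per-site record can be pulled back out of the combined dataframe.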
Preparing the regression
Next we'll define a function that applies ordinary least squares regression to peak discharge and time. After grouping the dataset by site_no, we will apply the regression on a per-site basis. The results for each site will be returned as a row that includes the slope, y-intercept, r², p value, and standard error of the regression.
[3]:
def peak_trend_regression(df):
    """Regress a single site's normalized annual peak discharge against time.

    Returns a Series containing the slope, intercept, p value, and
    standard error of the regression.
    """
    # convert datetimes to days for the regression
    peak_date = pd.to_datetime(df.index.get_level_values(1))
    df['peak_d'] = (peak_date - peak_date.min()) / np.timedelta64(1, 'D')
    # normalize the peak discharge values
    df['peak_va'] = (df['peak_va'] - df['peak_va'].mean()) / df['peak_va'].std()
    slope, intercept, r_value, p_value, std_error = stats.linregress(df['peak_d'], df['peak_va'])
    return pd.Series({'slope': slope, 'intercept': intercept, 'p_value': p_value, 'std_error': std_error})
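To see what this regression does, here is a self-contained sketch of the same steps on synthetic data with a known upward trend (the site data and magnitudes are made up for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

# 20 years of synthetic annual peaks with a built-in upward trend
rng = np.random.default_rng(42)
dates = pd.date_range("1970-01-01", periods=20, freq="YS")
peaks = 1000 + 15 * np.arange(20) + rng.normal(0, 50, 20)
df = pd.DataFrame({"peak_va": peaks}, index=dates)

# same transformations as peak_trend_regression: days since the first
# record, and normalized discharge
days = (df.index - df.index.min()) / np.timedelta64(1, "D")
norm = (df["peak_va"] - df["peak_va"].mean()) / df["peak_va"].std()
result = stats.linregress(days, norm)
print(f"slope={result.slope:.2e}, p={result.pvalue:.4f}")
```

Because the trend is baked in, the slope comes back positive with a small p value; on real stations the p value is what the analysis below uses to screen out insignificant trends.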
Preparing the analysis
[4]:
def peak_trend_analysis(states, start_date):
    """Download annual peaks for each state and compute per-site trends.

    states : list
        a list containing the two-letter codes for each state to include
        in the analysis.
    start_date : string
        the date to use as the beginning of the analysis.
    """
    final_df = pd.DataFrame()
    for state in states:
        # download annual peak discharge records
        df = nwis.get_record(state_cd=state, start=start_date, service='peaks')
        # group the data by site and apply our regression
        temp = df.groupby('site_no').apply(peak_trend_regression).dropna()
        # drop any insignificant results
        temp = temp[temp['p_value'] < 0.05]
        # now download metadata for each site, which we'll use later to
        # plot the sites on a map
        site_df = nwis.get_record(sites=temp.index, service='site')
        merged = pd.merge(site_df, temp, right_index=True, left_on='site_no')
        # DataFrame.append was removed in pandas 2.0; concatenate instead
        final_df = pd.concat([final_df, merged])
    return final_df
To run the analysis for all states since 1970, one would only need to uncomment and run the following lines. However, pulling all that data from NWIS takes several minutes and could put a burden on its resources.
[5]:
# Warning these lines will download a large dataset from the web and
# will take few minutes to run.
#start = '1970-01-01'
#states = codes.state_codes
#final_df = peak_trend_analysis(states=states, start_date=start)
#final_df.to_csv('datasets/peak_discharge_trends.csv')
Instead, let's quickly load some predownloaded data, which I generated using the code above.
[6]:
final_df = pd.read_csv('datasets/peak_discharge_trends.csv')
final_df.head()
Notice how the data has been transformed. In addition to statistics about the peak streamflow trends, we've also used the NWIS site service to add latitude and longitude information for each station.
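The merge that attaches coordinates to the regression results can be sketched in isolation. The trends are indexed by site_no, while the site metadata carries it as a column with dec_lat_va / dec_long_va coordinates (the values below are hypothetical):

```python
import pandas as pd

# regression results, indexed by site number
trends = pd.DataFrame(
    {"slope": [1.2e-4, -3.4e-5], "p_value": [0.01, 0.03]},
    index=pd.Index(["03339000", "05587450"], name="site_no"),
)
# site metadata with decimal latitude/longitude columns
sites = pd.DataFrame(
    {
        "site_no": ["03339000", "05587450"],
        "dec_lat_va": [40.11, 38.96],
        "dec_long_va": [-87.44, -90.43],
    }
)
# join metadata column against the trends index, as in peak_trend_analysis
merged = pd.merge(sites, trends, left_on="site_no", right_index=True)
print(merged[["site_no", "dec_lat_va", "dec_long_va", "slope"]])
```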
Plotting the results
Finally we'll use basemap and matplotlib, along with the location information from NWIS, to plot the results on a map (shown below). Stations with increasing peak annual discharge are shown in red; stations with decreasing peaks are shown in blue.
[7]:
# Currently commented out as there isn't an easy way to install mpl_toolkits
# on a remote machine without spinning up a full geospatial stack.
# from mpl_toolkits.basemap import Basemap, cm
# import matplotlib.pyplot as plt
# fig = plt.figure(num=None, figsize=(10, 6) )
# # setup a basemap covering the contiguous United States
# m = Basemap(width=5500000, height=4000000, resolution='l',
#             projection='aea', lat_1=36., lat_2=44, lon_0=-100, lat_0=40)
# # add coastlines
# m.drawcoastlines(linewidth=0.5)
# # add parallels and meridians.
# m.drawparallels(np.arange(-90.,91.,15.),labels=[True,True,False,False],dashes=[2,2])
# m.drawmeridians(np.arange(-180.,181.,15.),labels=[False,False,False,True],dashes=[2,2])
# # add boundaries and rivers
# m.drawcountries(linewidth=1, linestyle='solid', color='k' )
# m.drawstates(linewidth=0.5, linestyle='solid', color='k')
# m.drawrivers(linewidth=0.5, linestyle='solid', color='cornflowerblue')
# increasing = final_df[final_df['slope'] > 0]
# decreasing = final_df[final_df['slope'] < 0]
# #x,y = m(lons, lats)
# # categorical plots get a little ugly in basemap
# m.scatter(increasing['dec_long_va'].tolist(),
#           increasing['dec_lat_va'].tolist(),
#           label='increasing', s=2, color='red',
#           latlon=True)
# m.scatter(decreasing['dec_long_va'].tolist(),
#           decreasing['dec_lat_va'].tolist(),
#           label='decreasing', s=2, color='blue',
#           latlon=True)