dataretrieval
  • Installation Guide
  • User Guide
  • Examples
    • Introduction to the waterdata module of dataretrieval
    • Simple uses of the dataretrieval package
    • Example Notebooks from Hydroshare
    • Using dataretrieval to obtain nation trends in peak annual streamflow
      • National trends in peak annual streamflow
        • Introduction
        • Setup
        • Basic usage
        • Preparing the regression
        • Preparing the analysis
        • Plotting the results
    • Duplicating the R dataRetrieval vignettes functionality
  • Contributing
  • License and Disclaimer
  • API reference
dataretrieval
  • Examples
  • National trends in peak annual streamflow
  • View page source

National trends in peak annual streamflow

Introduction

This notebook demonstrates a slightly more advanced application of the dataretrieval.waterdata module: assembling a dataset of historical annual peak streamflow and regressing peak discharge against time to look for trends — not at a single station, but across many.

Setup

Before we begin any analysis, we’ll need to set up our environment by importing a few modules.

[1]:
import numpy as np
import pandas as pd
from scipy import stats

from dataretrieval import waterdata
---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
Cell In[1], line 3
      1 import numpy as np
      2 import pandas as pd
----> 3 from scipy import stats
      4
      5 from dataretrieval import waterdata

ModuleNotFoundError: No module named 'scipy'

Basic usage

The waterdata module is the recommended interface to USGS water data and replaces the deprecated nwis module. Annual peak streamflow is retrieved with waterdata.get_peaks(), which returns a pandas data frame and a metadata object. To get started we need a monitoring location ID, a parameter code, and (optionally) a time window.

[2]:
# download annual peaks (discharge, parameter 00060) from a single site
df, md = waterdata.get_peaks(
    monitoring_location_id="USGS-03339000",
    parameter_code="00060",
    time="1970-01-01/..",
)
df.head()
---------------------------------------------------------------------------
NameError                                 Traceback (most recent call last)
Cell In[2], line 2
      1 # download annual peaks (discharge, parameter 00060) from a single site
----> 2 df, md = waterdata.get_peaks(
      3     monitoring_location_id="USGS-03339000",
      4     parameter_code="00060",
      5     time="1970-01-01/..",

NameError: name 'waterdata' is not defined

All we require for the trend analysis are the peak date (time), the monitoring location ID (monitoring_location_id), and the peak streamflow (value). The Water Data API returns a flat (single-index) data frame with one row per annual peak.

Preparing the regression

Next we’ll define a function that applies ordinary least squares to peak discharge versus time. After grouping the dataset by monitoring_location_id, we apply the regression per monitoring location. Each location’s result is returned as a row with the slope, y-intercept, p value, and standard error of the regression.

[3]:
def peak_trend_regression(df):
    # convert peak dates to days since the first peak for the regression
    peak_date = pd.to_datetime(df["time"])
    peak_d = (peak_date - peak_date.min()) / np.timedelta64(1, "D")

    # normalize the peak discharge values
    value = (df["value"] - df["value"].mean()) / df["value"].std()

    slope, intercept, _r_value, p_value, std_error = stats.linregress(peak_d, value)

    return pd.Series(
        {
            "slope": slope,
            "intercept": intercept,
            "p_value": p_value,
            "std_error": std_error,
        }
    )

Preparing the analysis

[4]:
def peak_trend_analysis(state_names, start_date):
    """
    state_names : list
        state names to include in the analysis, e.g. ["Illinois", "Indiana"].
    start_date : string
        the date to use as the beginning of the analysis (YYYY-MM-DD).
    """
    final_df = pd.DataFrame()

    for state in state_names:
        # find stream gages in the state
        sites, _ = waterdata.get_monitoring_locations(
            state_name=state, site_type_code="ST", skip_geometry=True
        )
        # download annual peak discharge for those sites
        df, _ = waterdata.get_peaks(
            monitoring_location_id=sites["monitoring_location_id"].tolist(),
            parameter_code="00060",
            time=f"{start_date}/..",
        )
        # group the data by site and apply our regression
        temp = (
            df.groupby("monitoring_location_id")
            .apply(peak_trend_regression)
            .dropna()
        )
        # drop any insignificant results
        temp = temp[temp["p_value"] < 0.05]

        # join site metadata (for mapping) with the trend results
        merged = pd.merge(
            sites, temp, right_index=True, left_on="monitoring_location_id"
        )
        final_df = pd.concat([final_df, merged], ignore_index=True)

    return final_df

To run the analysis for all states since 1970, one would only need to uncomment and run the following lines. However, pulling all that data from the Water Data API takes time and could put a burden on resources.

[5]:
# Warning: these lines download a large dataset from the web and
# will take a few minutes to run.

# start = "1970-01-01"
# states = ["Illinois", "Indiana", "Ohio"]
# final_df = peak_trend_analysis(state_names=states, start_date=start)
# final_df.to_csv("datasets/peak_discharge_trends.csv")

Instead, let’s quickly load some pre-generated results bundled with this notebook. (This example dataset was produced by an earlier run of the analysis and retains the column names from that run.)

[6]:
final_df = pd.read_csv("datasets/peak_discharge_trends.csv")
final_df.head()
[6]:
Unnamed: 0 agency_cd site_no station_nm site_tp_cd dec_lat_va dec_long_va coord_acy_cd dec_coord_datum_cd alt_va alt_acy_va alt_datum_cd huc_cd intercept p_value slope std_error
0 0 USGS 2343275 ABBIE CREEK NEAR ABBEVILLE AL ST 31.561836 -85.204932 U NAD83 NaN NaN NaN 3130004.0 -0.641911 0.015899 0.000231 0.000065
1 1 USGS 2360275 JUDY CREEK NEAR OZARK AL ST 31.463224 -85.572159 U NAD83 NaN NaN NaN 3140201.0 -0.702569 0.004652 0.000269 0.000069
2 2 USGS 2360500 EAST FORK CHOCTAWHATCHEE R NEAR MIDLAND CITY AL ST 31.373227 -85.477158 U NAD83 179.1 0.01 NGVD29 3140201.0 -1.138552 0.007698 0.000211 0.000003
3 3 USGS 2367400 YELLOW RIVER NR SANFORD, ALA ST 31.317391 -86.355788 U NAD83 NaN NaN NaN 3140103.0 -0.611310 0.015836 0.000511 0.000013
4 4 USGS 2367500 LIGHTWOOD KNOT CREEK AT BABBIE AL ST 31.270725 -86.313564 U NAD83 185.0 0.01 NGVD29 3140103.0 -0.705676 0.015231 0.000210 0.000052

Notice how the data has been transformed. In addition to statistics about the peak streamflow trends, the analysis joined monitoring-location metadata to add latitude and longitude for each station.

Plotting the results

Finally we’ll use basemap and matplotlib, along with the location information from the Water Data API, to plot the results on a map (shown below). Monitoring locations with increasing peak annual discharge are shown in red, and those with decreasing peaks in blue.

[7]:
# Currently commented out as there isn't an easy way to install mpl_toolkits
# on a remote machine without spinning up a full geospatial stack.

# from mpl_toolkits.basemap import Basemap, cm
# import matplotlib.pyplot as plt

# fig = plt.figure(num=None, figsize=(10, 6) )

# # setup a basemap covering the contiguous United States
# m = Basemap(width=5500000, height=4000000, resolution='l',
#             projection='aea', lat_1=36., lat_2=44, lon_0=-100, lat_0=40)


# # add coastlines
# m.drawcoastlines(linewidth=0.5)

# # add parallels and meridians.
# m.drawparallels(np.arange(-90.,91.,15.),labels=[True,True,False,False],dashes=[2,2])
# m.drawmeridians(np.arange(-180.,181.,15.),labels=[False,False,False,True],dashes=[2,2])

# # add boundaries and rivers
# m.drawcountries(linewidth=1, linestyle='solid', color='k' )
# m.drawstates(linewidth=0.5, linestyle='solid', color='k')
# m.drawrivers(linewidth=0.5, linestyle='solid', color='cornflowerblue')


# increasing = final_df[final_df['slope'] > 0]
# decreasing = final_df[final_df['slope'] < 0]

# #x,y = m(lons, lats)

# # categorical plots get a little  ugly in basemap
# m.scatter(increasing['dec_long_va'].tolist(),
#           increasing['dec_lat_va'].tolist(),
#           label='increasing', s=2, color='red',
#           latlon=True)

# m.scatter(decreasing['dec_long_va'].tolist(),
#           decreasing['dec_lat_va'].tolist(),
#           label='increasing', s=2, color='blue',
#           latlon=True)
Previous Next

© Copyright .

Built with Sphinx using a theme provided by Read the Docs.