The Preprocessing Module

sfrmaker.preprocessing.clip_flowlines_to_polygon(flowlines, polygon, crs=None, simplify_tol=100, logger=None)[source]

Clip line features in a flowlines DataFrame to polygon features in polygon.

Parameters:
flowlinesDataFrame

Output from preprocess_nhdplus()

crsobj

Coordinate reference system of flowlines. Only needed if the data do not have a valid ESRI projection (.prj) file. A Python int, dict, str, or pyproj.crs.CRS instance passed to pyproj.crs.CRS.from_user_input()

Can be any of:
  • PROJ string

  • Dictionary of PROJ parameters

  • PROJ keyword arguments for parameters

  • JSON string with PROJ parameters

  • CRS WKT string

  • An authority string [i.e. ‘epsg:4326’]

  • An EPSG integer code [i.e. 4326]

  • A tuple of (“auth_name”: “auth_code”) [i.e (‘epsg’, ‘4326’)]

  • An object with a to_wkt method.

  • A pyproj.crs.CRS class

By default, None

polygonstr

Polygon shapefile, shapely polygon or list of shapely polygons of model active area. Shapely polygons must be in the same CRS as flowlines; shapefiles will be automatically reprojected.

simplify_tolfloat

Simplification tolerance for polygon to speed clipping. See The Shapely User Manual for more details.

loggerLogger instance
Returns:
flcclipped flowlines dataframe
sfrmaker.preprocessing.cull_flowlines(NHDPlus_paths, active_area=None, asum_thresh=None, intermittent_streams_asum_thresh=None, cull_invalid=True, cull_isolated=True, keep_comids=None, outfolder='clipped_flowlines', logger=None)[source]

Cull NHDPlus version2 data to an area defined by an active_area polygon and to flowlines with Arbolate sums greater than specified thresholds. Also remove lines that are isolated from the stream network or are missing attribute information.

Parameters:
NHDPlus_pathslist of strings

Paths to the root folder level NHDPlus Drainage Basins, as they were downloaded from the NHDPlus website (e.g. NHDPlus04, NHDPlus07, etc.).

active_areastr, shapely polygon or tuple, optional

A polygon shapefile, or shapely polygon or bounding box tuple (left, bottom, top, right) in the NAD83 GCS (EPSG:4269). The active area is converted to a bounding box, which is then used to filter the flowlines that are read in. If none, no filtering is performed, and the whole area encompased by the input NHDPlus data will be retained. By default None.

asum_threshnumeric, optional

Minimum arbolate sum value (total length of upstream drainage) to retain. Any flowlines with arbolate sums less than this value will be dropped. By default None.

intermittent_streams_asum_threshnumeric, optional

Minimum arbolate sum value (total length of upstream drainage) to retain for flowlines coded as intermittent (FCODE == 46003). Any intermittent flowlines with arbolate sums less than this value will be dropped. By default None.

cull_invalidbool, optional

Option to cull flowlines that have incomplete attribute information (lacking entries in the PlusFlowVAA, PlusFlow or Elevslope tables), by default False.

cull_isolatedbool, optional

Culling intermittent streams with intermittent_streams_asum_thresh may result in some isolated flowlines that no longer have downstream connections, or such isolated flowlines may be present in the raw NHDPlus data. SFRmaker identifies isolated flowslines by looking at up to 10 downstream connections to lines that are not stream network. Option to drop these lines. By default, False.

keep_comidssequence

List-like of COMIDs to retain, regardless of culling criteria.

outfolderstr, optional

Location for writing output, by default ‘clipped_flowlines’

loggersfrmaker.logger instance, optional

Pass an existing sfrmaker.logger instance to logger the preprocessing operations, by default None

sfrmaker.preprocessing.edit_flowlines(flowlines, config_file, id_column='COMID', toid_column='tocomid', logger=None)[source]

Make edits to the flowlines in flowlines_file, as described in config_file.

Parameters:
flowlinesshapefile or DataFrame

Flowlines to edit. If a shapefile is specified, a backup with “.original” before the extension is made, and the input shapefile is overwritten by the results.

config_fileyaml file

e.g. ‘flowline_edits.yml’

id_columnstr

Column in flowlines with unique identifiers (e.g. COMIDs)

toid_columnstr

Column in flowslines with downstream routing connections (identifiers)

loggersfrmaker logger instance, optional

Pass a logger file instance to continue writing to an open logger file (e.g. after or before other operations)

Returns:
flowlinesDataFrame

Edited flowlines.

sfrmaker.preprocessing.fix_invalid_asums(asums, fl_lengths, graph, graph_r)[source]

Recompute arbolate sum at any places in the network where it decreases going downstream, and then for all lines downstream of those locations. Decreases may be caused by routing connections that weren’t in the original NHDPlus data.

Parameters:
asumsdict

Dictionary of {indentifier (e.g. comid): asum values}, in same units as fl_lengths. Includes all lines in the stream network.

fl_lengthsdict

Dictionary of flowline lengths {indentifier (e.g. comid): length values}, in same units as asums.

graphdict

Dictionary of downstream routing connections {fromcomid: tocomid}

graph_rdict

Dictionary of upstream routing connections {tocomid: {fromcomid1, fromcomid2,…}}

Returns:
new_asumsdict

Dictionary of recomputed arbolate sums {comid: asum value}

Notes

This function is only designed to fix instances of breaks in arbolate sum caused by new routing connections. It can’t fix all instances of invalid asums, because it is not known whether the arbolate sums from upstream tributaries reflect unique upstream drainages (if the tribs are distributaries coming from the same divergence, there asums will reflect the same upstream drainage, and therefore would be duplicative if summed).

sfrmaker.preprocessing.get_flowline_routing(NHDPlus_paths=None, PlusFlow=None, mask=None, mask_crs=None, nhdplus_crs=4269)[source]

Read a collection of NHDPlus version 2 PlusFlow (routing) tables from one or more drainage basins and consolidate into a single pandas DataFrame, returning the FROMCOMID and TOCOMID columns.

Parameters:
NHDPlus_pathssequence

Sequence of paths to the top level folder for each drainage basin. For example:

['NHDPlus/NHDPlusGL/NHDPlus04', 
 'NHDPlus/NHDPlusMS/NHDPlus07']

by default None

PlusFlowstring or sequence

Single path to a PlusFlow table or sequence of PlusFlow table filepaths, by default None

Returns:
flowline_routingDataFrame

[description]

Raises:
ValueError

[description]

sfrmaker.preprocessing.get_netcdf_crs_from_grid_mapping(grid_mapping)[source]
sfrmaker.preprocessing.preprocess_nhdplus(flowlines_file, pfvaa_file, pf_file, elevslope_file, demfile=None, run_zonal_statistics=True, dem_length_units='meters', flowline_elevations_file=None, active_area=None, narwidth_shapefile=None, waterbody_shapefiles=None, buffersize_meters=50, asum_thresh=0.0, known_connections=None, update_up_elevations=None, update_dn_elevations=None, width_from_asum_a_param=0.1193, width_from_asum_b_param=0.5032, minimum_width=1.0, output_length_units='meters', logger=None, outfolder='output/', flowline_crs=None, dest_crs=None)[source]

Preprocess NHDPlus data to a single DataFrame of flowlines that each route to no more than one flowline, with width, elevation and recomputed arbolate sum attributes. A key part of the preprocessing is handling divergences in the stream network, as described in more detail in the Notes section. In picking routing at divergences, values are sampled from the demfile and included in the output DataFrame. Optionally (via the narwidth_shapefile arguement), remote sensing-based width estimates from the NARWidth Database (Allen and Pavelsky, 2015) can be included.

Parameters:
flowlines_filestr

Path to NHDPlus NHDFlowline shapefile. May or maybe not have been preprocessed by cull_flowlines(). The flowlines must be in a valid projected coorinate reference system (CRS; i.e., with units of meters), or a valid projected CRS must be specified with flowline_crs.

pfvaa_filestr

Path to NHDPlus PlusFlowlineVAA database (.dbf file). May or maybe not have been preprocessed by cull_flowlines(). ArbolateSu values within this file are assumed to be in km.

pf_filestr

Path to NHDPlus PlusFlow database (.dbf file). May or maybe not have been preprocessed by cull_flowlines()

elevslope_filestr

Path to NHDPlus elevslope database (.dbf file). May or maybe not have been preprocessed by cull_flowlines()

demfilestr

Path to DEM raster for project area.

dem_length_unitsstr, any length unit; e.g. {‘m’, ‘meters’, ‘ft’, etc.}

Length units of values in demfile. By default, ‘meters’.

active_areastr, optional

Polygon shapefile of active area that preprocessed lines will be clipped to. By default, None, in which case the flowlines won’t be clipped.

narwidth_shapefilestr, optional

Path to shapefile from the NARWidth database (Allen and Pavelsky, 2015).

waterbody_shapefilesstr or list of strings, optional

Path(s) to NHDPlus NHDWaterbody shapefile(s). Only required if a narwidth_shapefile is specified.

buffersize_metersfloat

Buffer distance in meters around flowlines to include when sampling DEM. By default, 50.

asum_threshfloat

Arbolate sum threshold for culling minor distributaries (that are not the main channel) below divergences. In NHDPlus, minor distributaries have the same arbolate sum as the main channel. After selecting the main channel, SFRmaker recomputes arbolate sum values for the minor distributaries, starting with 0 at the divergence. Lines with ending asums less than asum_thresh will then be culled.

known_connectionsdict, optional

Dictionary of specified flowline connections {COMID: tocomid}, which will override the routing selection at distributaries. By default None.

update_up_elevationsdict, optional

Dictionary of specified stage or streambed top elevations {COMID: elevation} at the upstream end of flowlines, for example, based on field measurements of streambed elevation or stage. The distinction between streambed elevation and stage depends on the context of the other elevations. For example, DEM elevations for larger streams typically reflect stage at the time data for the DEM (e.g. lidar) were collected. If all other COMID elevations represent stage, then the elevations supplied to update_up_elevations and update_dn_elevations should be stages. Streambed top is often poorly characterized and difficult to measure in the field. But if the SFR package is simulating stage (specified streambed bottom plus a simulated depth based on flow), appropriate streambed bottom input is needed to produce reasonable stages. The user may start with DEM elevations and then later subtract off simulated stream depths(from the SFR package output) to arrive at simulated stages that are close to those observed in the field.

update_up_elevations and update_dn_elevations input will override any other values (from NHDPlus or the DEM), and will be incorporated into the elevation smoothing. By default None.

update_dn_elevationsdict, optional

Dictionary of specified elevations {COMID: elevation} at the downstream end of flowlines. See the update_up_elevations description for more details.

update_up_elevations and update_dn_elevations input will override any other values (from NHDPlus or the DEM), and will be incorporated into the elevation smoothing. By default None.

width_from_asum_a_paramfloat, optional

\(a\) parameter used for estimating channel width from arbolate sum. Only needed if input flowlines are lacking width information. See width_from_arbolate(). By default, 0.1193.

width_from_asum_b_paramfloat, optional

\(b\) parameter used for estimating channel width from arbolate sum. Only needed if input flowlines are lacking width information. See width_from_arbolate(). By default, 0.5032.

minimum_widthfloat, optional

Minimum reach width to specify (in model units), if computing widths from arbolate sum values. (default = 1)

output_length_unitsstr, any length unit; e.g. {‘m’, ‘meters’, ‘ft’, etc.}

Units for width and elevation attribute values included with the output flowlines. Output arbolate sum values are specified in kilometers.

outfolderstr, optional

Location for writing output, by default ‘clipped_flowlines’

loggersfrmaker.logger instance, optional

Pass an existing sfrmaker.logger instance to logger the preprocessing operations, by default None

flowline_crsobj

Coordinate reference system of the NHDPlus data. Only needed if the data do not have a valid ESRI projection (.prj) file. A Python int, dict, str, or pyproj.crs.CRS instance passed to pyproj.crs.CRS.from_user_input()

Can be any of:
  • PROJ string

  • Dictionary of PROJ parameters

  • PROJ keyword arguments for parameters

  • JSON string with PROJ parameters

  • CRS WKT string

  • An authority string [i.e. ‘epsg:4326’]

  • An EPSG integer code [i.e. 4326]

  • A tuple of (“auth_name”: “auth_code”) [i.e (‘epsg’, ‘4326’)]

  • An object with a to_wkt method.

  • A pyproj.crs.CRS class

dest_crsobj

Output Coordinate reference system. Same input types as flowline_crs.

Returns:
flowlinesDataFrame

DataFrame with preprocessed flowlines. Width and elevation values are specified in the output_length_units. Output arbolate sum values are specified in kilometers. See NHDPlus documentation for description of fields not included here. Columns:

COMID

int64

NHDPlus Common Identifiers

tocomid

int64

Downstream routing connections (COMIDs)

nhd_asum

float

Arbolate sum from NHDPlus, in km

min

float

minimum elevation sampled within each buffer

mean

float

mean elevation sampled within each buffer

pct10,20,80

float

elevation percentiles sampled within each buffer

Divergence

int

NHDPlus Divergence classification

main_chan

bool

Flag indicating whether SFRmaker identified line as main channel

elevup

float

Smoothed elevation at line start (sampled from DEM)

elevdn

float

Smoothed elevation at line end (sampled from DEM)

elevupsmo

float

Smoothed elevation at line start (sampled from DEM)

elevdnsmo

float

Smoothed elevation at line end (sampled from DEM)

asum_calc

float

Recomputed arbolate sum for line (after removing distributaries)

asum_diff

float

Difference between recomputed and NHDPlus asums

width1asum

float

asum-based estimate for width at line start

width2asum

float

asum-based estimate for width at line end

narwd_n

int

number of values for line sampled from NARWidth

narwd_mean

float

mean elevation sampled from NARWidth

narwd_std

float

standard deviation in values sampled from NARWidth

narwd_min

float

minimum elevation sampled from NARWidth

narwd_max

float

maximum elevation sampled from NARWidth

is_wb

bool

Flag for whether the line coincides with a Waterbody

width1

float

estimated width at line start (from NARWidth or asum)

width2

float

estimated width at line end (from NARWidth or asum)

geometry

obj

Shapely LineString for each line

buffpoly

obj

Shapely Polygon (buffer) around each LineString

Raises:
ValueError

[description]

IOError

[description]

Notes

A key part of the preprocessing is handling divergences in the stream network, where flow is routed to two or more distributaries. Distributaries are common in the Mississippi Alluvial Plain (MAP) region, for example, but in reality, most of these features are either non-existent or only carry flow intermittently during high-water events. While distributaries in NHDPlus are classified as “main” and “minor” paths at divergences, inspection against the satellite imagery and the most recent, lidar-based DEMs for the MAP area suggested that the NHDPlus classifications are often inaccurate.

The following steps are taken to identify the main channel at each divergence:

  • A 50-meter buffer polygon is drawn around each flowline feature. A flat end-cap is used, so that only areas perpendicular to the flowlines are included in each buffer.

  • Zonal statistics for the lidar-based DEM values within each buffer polygon are computed using the rasterstats python package. The tenth percentile elevation is selected as a metric for discriminating between the main channel and minor distributaries. Lower elevation percentiles would be more likely to represent areas of overlap between the buffers for the main channel and minor distributaries (resulting in minor distributary values that are similar to the main channel), while higher elevation percentiles might miss the lowest parts of the main channel or even represent parts of the channel banks instead.

  • At each divergence, the distributary with the lowest tenth percentile elevation is assumed to be the main channel.

In the MAP region, comparison of the sampled DEM values with the NHDPlus elevation attribute data revealed a high bias in many of the attribute values, especially in the vicinity of diversions. This may be a result of the upstream smoothing process described by McKay and others (2012, p 123) when it encounters distributaries of unequal values such as the example shown in Figure 5. To remedy this issue, the 10th percentile values obtained from the buffer zonal statistics were assigned to each flowline, and then smoothed in the downstream direction to ensure that no flowlines had negative (uphill) slopes.

Finally, routing connections to minor distributaries are removed, and arbolate sums recomputed for the entire stream network, with arbolate sums at minor distributaries starting at zero. In this way, the minor distributaries are treated like headwater streams in that they will only receive flow if the water table is greater than their assigned elevation, otherwise they are simulated as dry and are not part of the groundwater model solution. Similar to cull_flowlines(), the first asum_thresh km of minor distributaries are trimmed from the stream network.

If a shapefile is specified for the narwidth_shapefile argument, the sample_narwidth() function is called.

sfrmaker.preprocessing.recompute_asums_for_minor_distribs(minor_distrib_comids, fl_lengths, graph, graph_r)[source]

Reset arbolate sums for minor distributaries and downstream segments in their path, to the next confluence.

Parameters:
minor_distrib_comidssequence

Sequence of line identifiers for minor distributaries (that were part of a divergence in NHDPlus).

fl_lengthsdict

Dictionary of flowline lengths {indentifier (e.g. comid): length values}, in same units as asums.

graphdict

Dictionary of downstream routing connections {fromcomid: tocomid}

graph_rdict

Dictionary of upstream routing connections {tocomid: {fromcomid1, fromcomid2,…}}

Returns:
new_asumsdict

Dictionary of recomputed arbolate sums {comid: asum value}

sfrmaker.preprocessing.sample_NARWidth(flowlines, narwidth_shapefile, waterbodies, bbox_filter=None, crs=None, output_width_units='meters', outpath='shps/')[source]

Sample the North American River Width Database by doing a spatial join (transfer width information from NARWidth shapefile to flowlines shapefile based on proximity).

Parameters:
flowlinesDataFrame

flowlines dataframe from preprocess_nhdplus(). Flowlines must be in a projected Coordinate reference system (CRS).

narwidth_shapefilestr

Path to shapefile from the NARWidth database (Allen and Pavelsky, 2015).

waterbody_shapefilesstr or list of strings, optional

Path(s) to NHDPlus NHDWaterbody shapefile(s). Only required if a narwidth_shapefile is specified.

crsobj

Coordinate reference system of flowlines. A Python int, dict, str, or pyproj.crs.CRS instance passed to pyproj.crs.CRS.from_user_input()

Can be any of:
  • PROJ string

  • Dictionary of PROJ parameters

  • PROJ keyword arguments for parameters

  • JSON string with PROJ parameters

  • CRS WKT string

  • An authority string [i.e. ‘epsg:4326’]

  • An EPSG integer code [i.e. 4326]

  • A tuple of (“auth_name”: “auth_code”) [i.e (‘epsg’, ‘4326’)]

  • An object with a to_wkt method.

  • A pyproj.crs.CRS class

By default, None

filtertuple

Bounds (most likely in lat/lon) for filtering NARWidth lines that are read in (left, bottom, right, top)

output_width_unitsstr, any length unit; e.g. {‘m’, ‘meters’, ‘ft’, etc.}

Units for width and elevation attribute values included with the output flowlines. NARWidth widths are assumed to be in meters.

Returns:
This function operates on the fl DataFrame in place.

Notes

To avoid erroneous overlap between main-stem NARWidth estimates and minor tributaries, flowlines with arbolate sums less than 500 km only receive widths from NARWidth lines that have at least 50% of their length inside of the 1-km buffer. NARWidth values are generally higher than arbolate sum-based estimates, because the NARWidth estimates represent mean flows and include all reaches of the stream, whereas the arbolate sum estimates are based on field measurements taken at narrower than average, well-behaved channel sections near stream gages, under base flow conditions. Therefore, measured channel widths may be biased low compared to actual widths throughout the stream network (Allen and Pavelsky, 2015; Park, 1977).

sfrmaker.preprocessing.swb_runoff_to_csv(swb_runoff_netcdf_output, nhdplus_catchments_file, runoff_output_variable='runoff', swb_rejected_net_inf_output=None, rejected_net_inf_variable='rejected_net_infiltration', catchment_id_col='FEATUREID', limit_runoff_to_area=None, start_datetime=None, end_datetime=None, output_length_units='meters', include_xy_in_output=False, xy_crs=None, outfile='swb_runoff_by_nhdplus_comid.csv')[source]

Convert gridded runoff estimates from the USGS Soil Water Balance Code (SWB) to a CSV format, associating each runoff value with a catchment ID. For the runoff to be successfully mapped to the SFR network, the catchment IDs must correspond to line IDs in the stream network, and each catchment must either be associated with a line, or be referenced in the flowline_routing dataset. See the documentation for sfrmaker.flows.add_to_perioddata() for more details.

Parameters:
swb_runoff_netcdf_outputstr or path-like

NetCDF file with gridded SWB runoff estimates. Runoff values are assumed to be in units of length/time (normalized to area). The CRS for the runoff dataset is read from the ‘proj4_string’ attribute, and is assumed to have length units of meters.

nhdplus_catchments_filestr or path-like

Shapefile of catchments with IDs (catchment_id_col). The catchments are rasterized to the grid in swb_netcdf_output, so that each runoff value can be associated with a catchment ID.

catchment_id_colstr

Field in nhdplus_catchments_file with ID, by default ‘FEATUREID’.

runoff_output_variablestr, optional

Variable in swb_runoff_netcdf_output with runoff data, by default ‘runoff’

swb_rejected_net_inf_outputstr or path-like

NetCDF file with gridded SWB estimates of “rejected net infiltration”. This is infiltration in the SWB simulation in excess of the specified max infiltration rate. In reality this may represent a component of runoff or ‘quick’ overland flow, depending on the timescale of the simulation. Values are assumed to be in units of length/time (normalized to area). The CRS for the runoff dataset is read from the ‘proj4_string’ attribute, and is assumed to have length units of meters.

rejected_net_inf_variablestr, optional

Variable in swb_runoff_netcdf_output with runoff data, by default ‘rejected_net_infiltration’

limit_runoff_to_areastr or path-like

Optionally limit runoff to an area defined by a polygon shapefile. By default None, in which case runoff is aggregated for each NHDPlus Catchment that intersections the grid in swb_runoff_netcdf_output.

start_datetimestr

Include runoff on or after this date in the output. Can be in any string format used for time slicing in pandas or xarray. By default, None (include all times in swb_runoff_netcdf_output).

end_datetimestr

Include runoff on or before this date in the output. Can be in any string format used for time slicing in pandas or xarray. By default, None (include all times in swb_runoff_netcdf_output).

output_length_unitsstr, optional

Input units are read from the ‘units’ attribute of the netcdf_output_variable in swb_netcdf_output; specify output length units so that the output CSV file is in the same length units as the model. No time unit conversion is perfo by default ‘meters’

include_xy_in_outputbool

Option to include approximate x, y coordinates of catchments in the output. The x, y coordinates represent the average x and y values of the SWB grid cells intersected by each catchment.

xy_crs

Output coordinate reference system (CRS) for coordinates in outfile. Only used if include_xy_in_output=True. By default False, in which case the CRS for swb_runoff_netcdf_output is used.

outfilestr, optional

Output CSV file with transient runoff volumes for each catchment intersecting the SWB grid, by default ‘swb_runoff_by_nhdplus_comid.csv’

Returns:
aggregatedDataFrame (also written to outfile)

DataFrame with columns:

time

time associated with each runoff volume

comid

catchment and flowline identifier

x

(optional) approximate x-coordinate of catchment, in xy_crs

y

(optional) approximate y-coordinate of catchment, in xy_crs

runoff_{L}3d

runoff volume, in model length units cubed per day

area_{L}2

total area of SWB cells intersected by catchment, in model length units cubed

Notes

Some notes on SWB output variables, as of 7/5/2023:

  • 'runoff_outside' is the sum of 'runoff' plus 'rejected_net_infiltration'.

  • When routing is inactive in the SWB simulation, 'runoff_outside' is computed for every SWB grid cell. In this case, it can be supplied here to swb_runoff_netcdf_output (and runoff_output_variable), in lieu of supplying both swb_runoff_netcdf_output and swb_rejected_net_inf_output.

  • When routing is active, 'runoff_outside' represents only water that cannot be moved downslope; in most cases it should be zero in the model interior. In this case, one would most likely supply both swb_runoff_netcdf_output and swb_rejected_net_inf_output.