This article describes the R package “dataRetrieval”, which simplifies the process of finding and retrieving water data from the U.S. Geological Survey and other agencies.
Package Overview
dataRetrieval is available on the Comprehensive R Archive Network (CRAN):
install.packages("dataRetrieval")
Once the dataRetrieval package has been installed, it needs to be loaded before any of its functions can be used:
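Assuming the installation above finished without errors, loading the package is a single call:

```r
# Attach dataRetrieval so its functions are available in this session
library(dataRetrieval)
```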
There are several vignettes included within the dataRetrieval package. The following command opens the main package introduction:
vignette("dataRetrieval", package = "dataRetrieval")
Additionally, each function has a help file. These can be accessed by typing a question mark, followed by the function name in the R console:
?readNWISuv
Each function’s help file has working examples to demonstrate the usage. The examples may have comments “## Not run”. These examples CAN be run; they are only skipped by the CRAN maintainers because they make external service calls.
Finally, if there are still questions that the vignette and help files don’t answer, please post an issue on the dataRetrieval GitHub page.
Orientation
dataRetrieval provides US water data via two sources: the National Water Information System (NWIS) and the Water Quality Portal (WQP).
WQP is the service for all discrete water quality data (USGS and other).
NWIS is the service for all other hydrologic data in dataRetrieval.
Functions in dataRetrieval look like readNWISdv, readNWISuv, readWQPqw, whatNWISdata, etc. What does that mean? The functions are generally structured with a prefix, middle, and suffix:
- Prefix: “read” or “what”
  - Functions that start with “read” will get full data sets
  - Functions that start with “what” will get data availability
- Middle: “NWIS” or “WQP”
  - NWIS functions get data from NWIS web services
  - WQP functions are for all discrete water-quality data
- Suffix: “data” or other
  - Functions that end in “data” are flexible, powerful functions that allow complex user queries
  - Functions that don’t end with “data” are user-friendly functions that assume site, code, and start/end dates are known
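As a small illustration of that naming pattern, the sketch below splits a few function names into their three parts. It uses only base R; the regular expression is an assumption made for this example and is not part of dataRetrieval.

```r
# Split dataRetrieval-style function names into prefix, middle, and suffix.
# The pattern below is illustrative only; it is not provided by the package.
fn_names <- c("readNWISdv", "readNWISuv", "readWQPqw", "whatNWISdata")

# Each match holds the full name followed by the three captured groups
parts <- regmatches(fn_names, regexec("^(read|what)(NWIS|WQP)(.+)$", fn_names))

naming <- data.frame(
  fn     = fn_names,
  prefix = vapply(parts, `[`, character(1), 2),
  middle = vapply(parts, `[`, character(1), 3),
  suffix = vapply(parts, `[`, character(1), 4),
  stringsAsFactors = FALSE
)
naming
```

Reading the result row by row: readNWISdv is a “read” function for NWIS daily values, while whatNWISdata is a “what” function whose “data” suffix marks it as one of the flexible query functions.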
USGS water data comes from the National Water Information System (NWIS). As of March 2024, all discrete water quality USGS data should be obtained from the Water Quality Portal (WQP). WQP retrievals are covered below. USGS water quality data and Water Quality Portal data formatting are both changing substantially; for current information, see the dataRetrieval status page.
National Water Information System (NWIS)
There are many types of data served from NWIS. To understand how the services are separated, it’s helpful to understand the terms here:
Type | Description | Service |
---|---|---|
Unit | Regular frequency data reported from a sensor (e.g. 15 minute interval). This data can include ‘real-time’ data | uv (or iv) |
Daily | Data aggregated to a daily statistic such as mean, min, or max. | dv |
Discrete | Data collected at non-regular times. | groundwater (gwlevel), rating curves (rating), peak flow (peak), surface water (meas) |
USGS Basic Retrievals
The USGS uses various codes for basic retrievals. These codes can have leading zeros, so they must be passed as character strings (“01234567”), not numbers.
- Site ID (often 8 or 15 digits)
- Parameter Code (5 digits)
  - Full list: readNWISpCode("all")
- Statistic Code (for daily values)
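To see why the character type matters, here is a minimal base R sketch. The site number is the placeholder used above, and the padding widths (8 and 5) simply follow the digit counts in the list; sites with 15-digit IDs would need a different width.

```r
# Typed as a number, the leading zero is silently dropped
site_as_number <- 01234567
as.character(site_as_number)     # "1234567"

# Typed as a character string, the code is preserved
site_id <- "01234567"
nchar(site_id)                   # 8

# If codes were already read in as integers, pad them back to a fixed width
sprintf("%08d", 1234567L)        # "01234567"
pcode <- sprintf("%05d", 60L)    # "00060"
```

When reading site lists from a spreadsheet or CSV, setting the column type to character at import time avoids the need for padding afterwards.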
Here are some examples of a few common parameter codes:
Parameter Codes | Short Name |
---|---|
00060 | Discharge |
00065 | Gage Height |
00010 | Temperature |
00400 | pH |
Statistic Codes | Short Name |
---|---|
00001 | Maximum |
00002 | Minimum |
00003 | Mean |
00008 | Median |
Use the readNWISpCode function to get information on USGS parameter codes. You can use “all” to get a full list. Then use your favorite data analysis methods to pull out what you need. Here is one example to find all the phosphorus parameter codes:
pcode <- readNWISpCode("all")
phosCds <- pcode[grep("phosphorus",
  pcode$parameter_nm,
  ignore.case = TRUE
), ]
The resulting phosCds data frame shows the wide variety of parameters that contain “phosphorus” in parameter_nm.
User-friendly retrievals: NWIS
Sometimes, you know exactly what you want. If you know:
- The type of data (groundwater, unit values, daily values, etc.)
- USGS site number(s)
- USGS parameter code(s)
- Time frame (start and end date)
You can use the “user-friendly” functions. These functions take the same 4 inputs (sites, parameter codes, start date, end date), and deliver data from the different NWIS services:
Function Name | Data |
---|---|
readNWISuv | Unit |
readNWISdv | Daily |
readNWISgwl | Groundwater Level |
readNWISmeas | Surface-water |
readNWISpeak | Peak Flow |
readNWISrating | Rating Curves |
readNWISuse | Water Use |
readNWISstat | Statistics |
Let’s start by asking for discharge (parameter code = 00060) at a site right next to the old USGS office in Wisconsin (Pheasant Branch Creek).
siteNo <- "05427948"
pCode <- "00060"
start.date <- "2017-10-01"
end.date <- "2018-09-30"
pheasant <- readNWISuv(
  siteNumbers = siteNo,
  parameterCd = pCode,
  startDate = start.date,
  endDate = end.date
)
From the Pheasant Branch Creek example, let’s look at the data. The column names are:
names(pheasant)
## [1] "agency_cd" "site_no" "dateTime" "X_00060_00000"
## [5] "X_00060_00000_cd" "tz_cd"
The names of the columns are based on the parameter and statistic codes. In many cases, you can clean up the names with the convenience function renameNWISColumns:
pheasant <- renameNWISColumns(pheasant)
names(pheasant)
## [1] "agency_cd" "site_no" "dateTime" "Flow_Inst" "Flow_Inst_cd"
## [6] "tz_cd"
The returned data also has several attributes attached to the data frame. To see what the attributes are:
names(attributes(pheasant))
## [1] "names" "row.names" "class" "url"
## [5] "siteInfo" "variableInfo" "disclaimer" "statisticInfo"
## [9] "queryTime"
Each dataRetrieval return should have the attributes url, siteInfo, and variableInfo. Additional attributes are available depending on the data service.
To access the attributes:
url <- attr(pheasant, "url")
url
## [1] "https://nwis.waterservices.usgs.gov/nwis/iv/?site=05427948&format=waterml,1.1&ParameterCd=00060&startDT=2017-10-01&endDT=2018-09-30"
A simple time-series plot is a good first look at the data. The attributes attached to the data frame (for example variableInfo and siteInfo) can then supply better axis labels and a title.
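As a sketch of that plotting step, the block below draws a line plot and takes its labels from the attributes. So that it runs without a web request, it uses a small synthetic stand-in for the data frame; the values, the object name `flow`, and the station name are made up for illustration. With real data, `pheasant` (after renameNWISColumns) would take the place of `flow`.

```r
# Synthetic stand-in for a data frame returned by readNWISuv() and cleaned
# with renameNWISColumns(). Values and names are invented for illustration.
flow <- data.frame(
  dateTime  = as.POSIXct("2017-10-01 00:00", tz = "UTC") +
    seq(0, by = 900, length.out = 96),            # 15-minute steps
  Flow_Inst = 2 + sin(seq(0, 2 * pi, length.out = 96))
)
attr(flow, "variableInfo") <- data.frame(
  variableDescription = "Discharge, cubic feet per second",
  stringsAsFactors = FALSE
)
attr(flow, "siteInfo") <- data.frame(
  station_nm = "Example Creek (synthetic)",
  stringsAsFactors = FALSE
)

# Pull the labels from the attributes rather than typing them by hand
parameterInfo <- attr(flow, "variableInfo")
siteInfo <- attr(flow, "siteInfo")

plot(flow$dateTime, flow$Flow_Inst,
  type = "l",
  xlab = "",
  ylab = parameterInfo$variableDescription,
  main = siteInfo$station_nm
)
```

Base graphics are used here only to avoid an extra dependency; the same attribute values can be passed to the labelling functions of any plotting package.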
Known USGS site, unknown service/pcode
The most common question the dataRetrieval team gets is:
“I KNOW this site has data but it’s not coming out of dataRetrieval! Where’s my data?”
The best way to verify that you are calling your data correctly is to use the whatNWISdata function to find out the data_type_cd (which will tell you the service you need to call), the parameter/stat codes available at that site, and the period of record. All rows that have “qw” in the column data_type_cd will come from the Water Quality Portal.
library(dplyr)
site <- "05407000"
data_available <- whatNWISdata(siteNumber = site)
data_available_NWIS <- data_available |>
  select(data_type_cd, parm_cd, stat_cd,
         begin_date, end_date, count_nu) |>
  filter(!data_type_cd %in% c("qw", "ad")) |>
  arrange(data_type_cd)
This is the only available data from NWIS for site 05407000.
The data_type_cd can be used to figure out where to request data:
data_type_cd | readNWIS | readNWISdata |
---|---|---|
dv | readNWISdv | readNWISdata(…, service = “dv”) |
uv | readNWISuv | readNWISdata(…, service = “iv”) |
pk | readNWISpeak | readNWISdata(…, service = “peak”) |
sv | readNWISmeas | Not available |
gwl | readNWISgwl | readNWISdata(…, service = “gwlevels”) |
So to get all the NWIS data from the above site:
dv_pcodes <- data_available_NWIS$parm_cd[data_available_NWIS$data_type_cd == "dv"]
stat_cds <- data_available_NWIS$stat_cd[data_available_NWIS$data_type_cd == "dv"]
dv_data <- readNWISdv(siteNumbers = site,
                      parameterCd = unique(dv_pcodes),
                      statCd = unique(stat_cds))
uv_pcodes <- data_available_NWIS$parm_cd[data_available_NWIS$data_type_cd == "uv"]
uv_data <- readNWISuv(siteNumbers = site,
                      parameterCd = unique(uv_pcodes))
peak_data <- readNWISpeak(site)
Water Quality Portal (WQP)
dataRetrieval also allows users to access data from the Water Quality Portal. The WQP houses data from multiple agencies; while USGS data comes from the NWIS database, EPA data comes from the STORET database (this includes many state, tribal, NGO, and academic groups). The WQP brings data from all these organizations together and provides it in a single format that has a more verbose output than NWIS.
The single user-friendly function is readWQPqw. This function takes a site or vector of sites in the first argument, “siteNumbers”. USGS sites need “USGS-” added before the site number.
The second argument is “parameterCd”. Although it is called “parameterCd”, it can take EITHER a USGS 5-digit parameter code OR a characteristic name (this is what non-USGS databases use). Leaving “parameterCd” as empty quotes will return all data for a site.
So we could get all the water quality data for site 05407000, or restrict the request to one parameter code or one characteristic name.
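A sketch of those three requests follows, using the readWQPqw arguments described above. The parameter code 00095 (specific conductance) and its matching characteristic name are chosen here only as an example. The calls need a network connection, and what comes back depends on the current state of the WQP services and the installed package version.

```r
library(dataRetrieval)

site <- "05407000"
wqp_site <- paste0("USGS-", site)   # WQP expects the "USGS-" prefix

# All water quality data for the site
qw_data_all <- readWQPqw(siteNumbers = wqp_site, parameterCd = "")

# One USGS parameter code (00095, chosen as an example)
qw_data_00095 <- readWQPqw(siteNumbers = wqp_site, parameterCd = "00095")

# One characteristic name (again an example choice)
qw_data_sp <- readWQPqw(
  siteNumbers = wqp_site,
  parameterCd = "Specific conductance"
)
```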
Discover Data
This is all great when you know your site numbers. What do you do when you don’t?
There are two dataRetrieval functions that help with discovery in NWIS:
- whatNWISsites finds sites within a specified filter (quicker)
- whatNWISdata summarizes the data within the specified filter (more information)
And two functions that help with discovery in WQP:
- readWQPsummary summarizes the data available within the WQP by year.
- whatWQPdata summarizes the data available within the WQP.
There are several ways to specify the requests. The best way to discover how flexible the USGS web services are is to click on the links and see all of the filtering options: http://waterservices.usgs.gov/
Available geographic filters are individual site(s), a single state, a bounding box, or a HUC (hydrologic unit code). See examples for those services by looking at the help pages for the readNWISdata and readWQPdata functions (?readNWISdata and ?readWQPdata).
Here are a few examples:
Arizona Example
For example, let’s see which sites ever measured phosphorus at least 100 times over at least 20 years in Arizona. Water quality data is exclusively found in WQP functions.
AZ_sites <- readWQPsummary(
  statecode = "AZ",
  siteType = "Stream"
)
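az_phos_summary <- AZ_sites |>
  # Keep only phosphorus results. This assumes the summary includes a
  # CharacteristicName column; without it, counts cover every characteristic.
  filter(CharacteristicName == "Phosphorus") |>
  mutate(ResultCount = as.numeric(ResultCount),
         Lat = as.numeric(MonitoringLocationLatitude),
         Lon = as.numeric(MonitoringLocationLongitude)) |>
  rename(Site = MonitoringLocationIdentifier) |>
  group_by(Site, Lat, Lon) |>
  summarise(min_year = min(YearSummarized),
            max_year = max(YearSummarized),
            count = sum(ResultCount)) |>
  mutate(POR = max_year - min_year) |>
  # "at least 100 times over at least 20 years"
  filter(count >= 100,
         POR >= 20) |>
  arrange(desc(count)) |>
  ungroup()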
library(leaflet)
leaflet(data = az_phos_summary) %>%
  addProviderTiles("CartoDB.Positron") %>%
  addCircleMarkers(~Lon, ~Lat,
    color = "red", radius = 3, stroke = FALSE,
    fillOpacity = 0.8, opacity = 0.8,
    popup = ~Site
  )
Time/Time zone discussion
The arguments for all dataRetrieval functions concerning dates (startDate, endDate) can be R Date objects or character strings, as long as the string is in the form “YYYY-MM-DD”.
For functions that include a date and time, dataRetrieval will take that information and create a column that is a POSIXct type. By default, this date/time POSIXct column is converted to “UTC”. In R, one vector (or column in a data frame) can only have ONE timezone attribute. That matters because:
- Sometimes in a single state, some sites will observe daylight saving time and some don’t
- dataRetrieval queries could easily span multiple timezones (or switch between daylight saving and standard time)
The user can specify a single timezone to override UTC. The allowable tz arguments are the names returned by OlsonNames (see also the help file for readNWISuv).
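The single-timezone rule can be seen with base R alone. The sketch below uses an arbitrary example instant and “America/Chicago” as an example zone; the commented-out request at the end shows where the tz argument would go and is not run here.

```r
# A POSIXct vector carries one time zone attribute for all of its values
obs_utc <- as.POSIXct("2018-01-15 12:00:00", tz = "UTC")

# Valid tz strings are the ones listed by OlsonNames()
"America/Chicago" %in% OlsonNames()

# Re-express the same instant in Central time by changing the attribute
obs_central <- obs_utc
attr(obs_central, "tzone") <- "America/Chicago"
format(obs_central, "%Y-%m-%d %H:%M %Z")

# The instant itself is unchanged; only the displayed clock time differs
obs_utc == obs_central

# With dataRetrieval, the zone would be requested directly, for example:
# readNWISuv(siteNumbers = "05427948", parameterCd = "00060",
#            startDate = "2018-01-15", endDate = "2018-01-16",
#            tz = "America/Chicago")
```

Keeping everything in UTC until the final reporting step is often the simplest way to combine sites from different zones.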
Large Data Requests
It is increasingly common for R users to be interested in large-scale dataRetrieval analysis. You can use a loop of either state codes (stateCd$STATE) or HUCs to make large requests. BUT without careful planning, those requests could be too large to complete. Here are a few tips to make those queries manageable:
- Please do NOT use multi-threaded processes to send hundreds or thousands of queries simultaneously.
- Take advantage of the whatWQPdata and whatNWISdata functions to filter out sites you don’t need before requesting the data. Use what you can from these faster requests to filter the full data request as much as possible.
- Think about using tryCatch, saving the data after each iteration of the loop, and/or using a make-like data pipeline (for example, see the drake package). This way, if a single query fails, you do not need to start over.
- The WQP doesn’t always perform well when there are a lot of filtering arguments in the request. Even though those filters would reduce the amount of data to transfer, they sometimes make the pre-processing of the request take so long that it times out before returning any data. It’s a bit counterintuitive, but if you are having trouble getting your large requests to complete, remove arguments such as Sample Media and Site Type; these can be filtered in a post-processing script. Another example: sometimes it is slower and more error-prone to request data year-by-year instead of requesting the entire period of record.
- Pick a single state/HUC/bbox to practice your data retrievals before looping through larger sets, and optimize ahead of time as much as possible.
Two examples, scripting and pipeline, go into more detail.
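The tryCatch tip can be sketched as a loop that saves each result as soon as it arrives and records failures instead of stopping. The function `fetch_state` below is a hypothetical stand-in for a real request (such as a readWQPdata or readNWISdata call); it fails on purpose for one made-up code, “XX”, so the error path is visible. The output folder and the pause length are also arbitrary choices.

```r
# Hypothetical stand-in for a real web-service request. It fails for the
# made-up code "XX" to demonstrate the error handling.
fetch_state <- function(state) {
  if (state == "XX") stop("service timed out")
  data.frame(state = state, value = NA_real_, stringsAsFactors = FALSE)
}

states <- c("WI", "XX", "AZ")
out_dir <- file.path(tempdir(), "retrievals")
dir.create(out_dir, showWarnings = FALSE)
failed <- character(0)

for (st in states) {
  out_file <- file.path(out_dir, paste0(st, ".rds"))
  if (file.exists(out_file)) next        # finished on an earlier run, skip

  result <- tryCatch(
    fetch_state(st),
    error = function(e) {
      message("Request failed for ", st, ": ", conditionMessage(e))
      NULL
    }
  )

  if (is.null(result)) {
    failed <- c(failed, st)              # remember it for a later retry
  } else {
    saveRDS(result, out_file)            # save after each iteration
  }
  Sys.sleep(0.1)                         # short pause between requests
}

failed
list.files(out_dir)
```

Because finished files are skipped, rerunning the same loop only retries the requests that failed, which is the property a make-like pipeline provides in a more formal way.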
But wait, there’s more!
There are two more services that also have functions in dataRetrieval: the National Groundwater Monitoring Network (NGWMN) and the Network Linked Data Index (NLDI). These functions are not as mature as the WQP and NWIS functions. A future blog post will bring together these functions.
National Groundwater Monitoring Network (NGWMN)
Similar to WQP, the NGWMN brings groundwater data from multiple sources into a single location. There are currently a few dataRetrieval functions included for this service.
Network Linked Data Index (NLDI)
The NLDI provides an information backbone to navigate the NHDPlusV2 network and discover features indexed to the network. For an overview of the NLDI, see: https://rconnect.usgs.gov/dataRetrieval/articles/nldi.html