In this ~90 minute introduction, the goal is:
Introduce the modern dataRetrieval workflows.
The intended audience is someone:
New to dataRetrieval
Has some R experience
By default, RStudio will look like this:
Go to Tools -> Global Options -> Appearance to change the style.
Create scripts.
See code run.
See what variables are loaded.
USGS Water Data APIs *
Surface water levels
Groundwater levels
Site metadata
Peak flows
Rating curves
Discrete water-quality data
Water Quality Portal (WQP) Data
Discrete water-quality data
USGS and non-USGS data
dataRetrieval is available on the Comprehensive R Archive Network (CRAN) repository. To install dataRetrieval on your computer, open RStudio and run this line of code in the Console:
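install.packages("dataRetrieval")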
Then each time you open R, you’ll need to load the library:
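library(dataRetrieval)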
Within R, you can call help files for any dataRetrieval function:
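# For example:
?read_waterdata_daily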
Click here to open a new window:
Open RStudio
Install dataRetrieval, dplyr, ggplot2, and data.table (if they are not already installed).
Load dataRetrieval
Open the help file for the function read_waterdata_daily
Navigate to https://doi-usgs.github.io/dataRetrieval/ and find the list of function help files and explore some articles in “Additional Articles”
Are you a seasoned dataRetrieval user?
Here are resources for recent major changes:
There have been a lot of changes to dataRetrieval over the past year. If you’d like an overview of those changes, visit: Changes to dataRetrieval
Biggest changes:
NWIS servers will be shut down, so all readNWIS functions will eventually stop working
read_waterdata functions are modern and should be used when possible
The “USGS Water Data APIs” are the new home for USGS data
The Water Data APIs limit how many queries a single IP address can make per hour
You can run new dataRetrieval functions without a token
You might run into errors quickly. If you (or your IP!) have exceeded the quota, you will see:
! HTTP 429 Too Many Requests.
• You have exceeded your rate limit. Make sure you provided your API key from https://api.waterdata.usgs.gov/signup/, then either try again later or contact us at https://waterdata.usgs.gov/questions-comments/?referrerUrl=https://api.waterdata.usgs.gov for assistance.
Request a USGS Water Data API Token: https://api.waterdata.usgs.gov/signup/
Save it in a safe place (KeePass or other password management tool)
Add it to your .Renviron file as API_USGS_PAT.
Restart R
Check that it worked by running (you should see your token printed in the Console):
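Sys.getenv("API_USGS_PAT")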
See next slide for a demonstration.
My favorite method to add your token to .Renviron is to use the usethis package. Let’s pretend the token sent to you was “abc123”:
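usethis::edit_r_environ()
# In the .Renviron file that opens, add this line:
# API_USGS_PAT="abc123"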
Save that file using the save button
Restart R/RStudio.
After restarting, check that it worked by running:
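Sys.getenv("API_USGS_PAT")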
The USGS uses various codes for basic retrievals. These codes can have leading zeros, so they need to be passed as character strings surrounded by quotes (“00060”).
read_waterdata_parameter_codes()
read_metadata("statistic-codes")
Here are some examples of a few common codes:
| Parameter Code | Description |
|----------------|-------------|
| "00060" | Discharge (cubic feet per second) |
| "00065" | Gage height (feet) |
| "00010" | Water temperature (degrees Celsius) |

| Statistic Code | Description |
|----------------|-------------|
| "00001" | Maximum |
| "00002" | Minimum |
| "00003" | Mean |
We’re going to walk through 5 workflows (3 data retrievals and 2 joins):
Workflow 1: Daily Data
Uses the new USGS Water Data API
Modern data access point going forward
Workflow 2: Discrete Data
Uses new USGS Samples Data
Modern data access point going forward
Workflow 3: Join Daily and Discrete
Workflow 4: Continuous Data
Uses the NWIS web services
Will be deprecated; this fall we’ll have read_waterdata_continuous
Workflow 5: Join Continuous and Discrete
Let’s pull daily mean discharge data for site “USGS-09405500”, getting all the data from October 1, 2024 onward.
library(dataRetrieval)
site <- "USGS-09405500"
pcode <- "00060" # Discharge
stat_cd <- "00003" # Mean
range <- c("2024-10-01", NA)
df <- read_waterdata_daily(monitoring_location_id = site,
parameter_code = pcode,
statistic_id = stat_cd,
time = range)
Requesting:
https://api.waterdata.usgs.gov/ogcapi/v0/collections/daily/items?f=json&lang=en-US&limit=10000&monitoring_location_id=USGS-09405500&parameter_code=00060&statistic_id=00003&time=2024-10-01%2F..
Remaining requests this hour: 137
In RStudio, click on the data frame in the upper right Environment tab to open a Viewer.
Let’s use ggplot2 to visualize the data.
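Here is a minimal plotting sketch; it assumes the daily data frame returned above has “time” and “value” columns (check names(df) to confirm):

library(ggplot2)

# Plot the daily mean discharge through time:
ggplot(data = df) +
  geom_line(aes(x = time, y = value)) +
  labs(x = "Date",
       y = "Daily mean discharge (cfs)",
       title = site)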
Use your “tab” key!
When you look at the help file for the new functions, you’ll notice there are lots of possible inputs (arguments).
You DO NOT need to (and should not!) specify all of these parameters.
However, also consider what happens if you leave too many things blank. What do you suppose will be returned here?
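For example, a hypothetical call like this leaves the site selection completely blank (don’t actually run it):

# Hypothetical example - do NOT run this!
# No sites or bounding box are specified, only a parameter and statistic code.
df_huge <- read_waterdata_daily(parameter_code = "00060",
                                statistic_id = "00003")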
Since no list of sites or bounding box was defined, ALL the daily data in ALL the country with parameter code “00060” and statistic code “00003” will be returned.
The “time” argument has a few options:
A single date (or date-time): “2024-10-01” or “2024-10-01T23:20:50Z”
A bounded interval: c(“2024-10-01”, “2025-07-02”)
Half-bounded intervals: c(“2024-10-01”, NA)
Duration objects: “P1M” for data from the past month or “PT36H” for the last 36 hours
Here are a bunch of valid inputs:
# Ask for exact times:
time = "2025-01-01"
time = as.Date("2025-01-01")
time = "2025-01-01T23:20:50Z"
time = as.POSIXct("2025-01-01T23:20:50Z",
format = "%Y-%m-%dT%H:%M:%S",
tz = "UTC")
# Ask for specific range
time = c("2024-01-01", "2025-01-01") # or Dates or POSIXs
# Asking beginning of record to specific end:
time = c(NA, "2024-01-01") # or Date or POSIX
# Asking specific beginning to end of record:
time = c("2024-01-01", NA) # or Date or POSIX
# Ask for period
time = "P1M" # past month
time = "P7D" # past 7 days
time = "PT12H" # past 12 hours
Use your “tab” key!
Let’s get orthophosphate (“00660”) data from the Shenandoah River at Front Royal, VA (“USGS-01631000”).
site <- "USGS-01631000"
pcode <- "00660"
qw_data <- read_waterdata_samples(monitoringLocationIdentifier = site,
usgsPCode = pcode,
dataType = "results",
dataProfile = "basicphyschem")
GET: https://api.waterdata.usgs.gov/samples-data/results/basicphyschem?mimeType=text%2Fcsv&monitoringLocationIdentifier=USGS-01631000&usgsPCode=00660
[1] 100
That’s a LOT of columns that come back. We won’t look at them here, but you can use View in RStudio to explore on your own.
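# Open the full table in RStudio's viewer:
View(qw_data)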
Let’s pull a few columns out and look at those:
library(dplyr)
qw_data_slim <- qw_data |>
select(Date = Activity_StartDate,
Result_Measure,
DL_cond = Result_ResultDetectionCondition,
DL_val = DetectionLimit_MeasureA,
DL_type = DetectionLimit_TypeA) |>
mutate(Result = if_else(!is.na(DL_cond), DL_val, Result_Measure),
Detected = if_else(!is.na(DL_cond), "Not Detected", "Detected")) |>
arrange(Detected)
What is |>? It’s a pipe! It says take ‘this thing’ and put it in ‘that thing’. You’ll also see %>% in code; it is also a pipe, and they are basically the same.

One common workflow is to join discrete data with daily data.
In this example, we will look at a site that has both discrete water-quality measurements and daily mean discharge.
We will use dplyr::left_join to join the 2 data frames by date.
site <- "USGS-04183500"
p_code_dv <- "00060"
stat_cd <- "00003"
p_code_qw <- "00665"
start_date <- "2015-07-03"
end_date <- "2025-07-03"
qw_data <- read_waterdata_samples(monitoringLocationIdentifier = site,
usgsPCode = p_code_qw,
activityStartDateLower = start_date,
activityStartDateUpper = end_date,
dataProfile = "basicphyschem")
dv_data <- read_waterdata_daily(monitoring_location_id = site,
parameter_code = p_code_dv,
statistic_id = stat_cd,
time = c(start_date, end_date))
See the dplyr documentation for lots of joining options, but I find left_join my “go-to” for straightforward joins.
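Here is a sketch of that join. The discrete column names come from the code above; the daily data is assumed to have “time” and “value” columns (check names(dv_data) to confirm):

library(dplyr)

# Slim down the discrete data (column names from the basicphyschem profile used above):
qw_slim <- qw_data |>
  mutate(Date = as.Date(Activity_StartDate)) |>
  select(Date, Phosphorus = Result_Measure)

# Slim down the daily data; "time" and "value" are assumed column names:
dv_slim <- dv_data |>
  mutate(Date = as.Date(time)) |>
  select(Date, Flow = value)

# Keep every discrete sample, attaching the daily flow from the same date:
joined_data <- left_join(qw_slim, dv_slim, by = "Date")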
Let’s take a quick peek:
dplyr
comes with some data sets. To look at them run:
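library(dplyr)

# Two small example data frames that ship with dplyr:
band_members
band_instruments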
Run that code and view the 2 data frames to see what they look like.
Join the instruments to the “band_members” by name.
Join the members to the “band_instruments” by name.
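One way to write those joins (the outputs below show the results):

# Join the instruments onto band_members by name:
left_join(band_members, band_instruments, by = "name")

# Join the members onto band_instruments by name:
left_join(band_instruments, band_members, by = "name")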
# A tibble: 3 × 3
name band plays
<chr> <chr> <chr>
1 Mick Stones <NA>
2 John Beatles guitar
3 Paul Beatles bass
# A tibble: 3 × 3
name plays band
<chr> <chr> <chr>
1 John guitar Beatles
2 Paul bass Beatles
3 Keith guitar <NA>
Continuous data is the high-frequency sensor data.
The function to get that data today is readNWISuv
As NWIS gets deprecated, we expect to have read_waterdata_continuous soon
We’ll look at Suisun Bay at Van Sickle Island near Pittsburg, CA (“USGS-11455508”), with parameter code “99133”, which is Nitrate plus Nitrite.
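Here is a sketch of that retrieval with readNWISuv; the site, parameter code, and dates match the request URL shown below:

library(dataRetrieval)

start_date <- "2024-01-01"
end_date <- "2024-06-01"

# NWIS functions use the 8-digit site number without the "USGS-" prefix:
uv_data <- readNWISuv(siteNumbers = "11455508",
                      parameterCd = "99133",
                      startDate = start_date,
                      endDate = end_date)

names(uv_data)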
[1] "agency_cd"
[2] "site_no"
[3] "dateTime"
[4] "X_99133_00000"
[5] "X_99133_00000_cd"
[6] "tz_cd"
GET: https://nwis.waterservices.usgs.gov/nwis/iv/?site=11455508&format=waterml%2C1.1&ParameterCd=99133&startDT=2024-01-01&endDT=2024-06-01
That same site also measures discrete Nitrate plus Nitrite, which is parameter code “00631”. Let’s first grab that data:
discrete_data <- read_waterdata_samples(monitoringLocationIdentifier = "USGS-11455508",
usgsPCode = "00631",
activityStartDateLower = start_date,
activityStartDateUpper = end_date,
dataProfile = "basicphyschem")
GET: https://api.waterdata.usgs.gov/samples-data/results/basicphyschem?mimeType=text%2Fcsv&monitoringLocationIdentifier=USGS-11455508&usgsPCode=00631&activityStartDateLower=2024-01-01&activityStartDateUpper=2024-06-01
We now want to join the closest continuous sensor time with the discrete sample time.
This is trickier than joining by exact matches.
dplyr has a way, but it’s complicated if you want the absolute closest in either direction
Another package, data.table, has a slick way to get the closest matches (see the sketch below)
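Here is a sketch using a data.table rolling join. The continuous column names come from the output above; the discrete timestamp column (Activity_StartDateTime here) is an assumption, so check names(discrete_data) for the actual name:

library(data.table)

uv_dt <- as.data.table(uv_data)
qw_dt <- as.data.table(discrete_data)

# Assumed column name for the discrete sample time - confirm with names(discrete_data):
qw_dt[, dateTime := as.POSIXct(Activity_StartDateTime, tz = "UTC")]

# For each discrete sample, grab the continuous sensor row closest in time:
nearest <- uv_dt[qw_dt, on = "dateTime", roll = "nearest"]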
The process for discovering data is a bit in flux with NWIS retiring. I expect a new process will be introduced soon. For now here are some options.
read_waterdata_ts_meta discovers daily and continuous time series
summarize_waterdata_samples discovers discrete data at specific monitoring locations
The next slides will demo how to use those.
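As a rough sketch of what that discovery could look like at a single site (the argument names follow the patterns used earlier in this workshop and are assumptions; check the help files):

# Hypothetical sketch - confirm argument names in the help files:
ts_available <- read_waterdata_ts_meta(monitoring_location_id = "USGS-01631000")

samples_available <- summarize_waterdata_samples(monitoringLocationIdentifier = "USGS-01631000")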
read_waterdata_sample
Any use of trade, firm, or product name is for descriptive purposes only and does not imply endorsement by the U.S. Government.