Large Data Pulls from Water Quality Portal - A Script-Based Approach
Aliesha Krall
Laura DeCicco
2024-03-07
Source:vignettes/wqp_large_pull_script.Rmd
wqp_large_pull_script.Rmd
dataRetrieval
is an R package that provides
user-friendly access to water data from either the Water Quality Portal
(WQP) or the National Water Information Service (NWIS).
dataRetrival
makes it easier for a user to download data
from the WQP or NWIS and convert it into usable formats.
This article will walk through an example that uses the WQP summary service to limit the amount downloaded to only relevant data using a scripting approach. For large datasets, that can save a lot of time and ultimately reduce the complexity of subsequent data processing.
This article highlights a “scripting” approach, meaning it will only use basic R techniques. There is less of a learning curve in this approach, but also if something goes wrong, it is more challenging to re-run. The alternative is the pipeline approach, which is highlighted in the article Large Data Pulls from Water Quality Portal - A Pipeline-Based Approach.
See https://waterdata.usgs.gov/blog/large_sample_pull/ for more detail, including analysis of the downloaded data.
This article was updated in March 2024 to use the
parse_WQP
function to convert columns after the full pull.
This makes binding the data together more simple.
Large data example
This blog sets up a scenario to look at all the total nitrogen data measured in streams within the contiguous United States where sites have at least 40 measurements between 1995 and 2020.
Data download
First, set up a loop to find out which sites have the required data, then get only the data that is relevant.
There are several ways to break up the geographic part of the query,
such as bounding box, state, county, or hydrologic unit code (HUC).
Depending on the density of the data, it may be ideal to adjust how to
loop through the geographic query. Sometimes running a single
dataRetrieval
call comes back with a “timeout” error. Other
times, requests for data spans more than traditional geographic filters
of HUC, state, or county. In these cases, it may be necessary to break
up the dataRetrieval
call into smaller subsets, and then
bind those subsets together separately. This blog will discuss one
strategy for breaking up a WQP request.
dataRetrieval
includes a data frame “stateCd” that has
all of the states and territories that could be queried by either NWIS
or WQP. In this example, only the lower 48 states along with Washington,
D.C.are considered.
Use the readWQPsummary
function, which is a very useful
function that returns information about all the data that is available
for a particular query. The initial query is asking what
nitrogen data for streams is available in a particular
state. The returned data shows how many nitrogen samples are available
at each site for each year. Then, using filtering and summaries will
figure out exactly which sites meet the set up scenairo’s needs.
The readWQPdata
function is used to download the actual
relevant data. This example saves the data using the
saveRDS
function for each individual state. This ensures
higher likelihood of successful completion of the query. For example, if
a failure occurs during the download and the loops don’t finish. In that
case, the states that successfully downloaded the data are skipped, and
only re-run the states that didn’t work. Saving as an “RDS” file also
has the benefits of retaining all the attributes of the data. Notice
another feature of this loop is using tryCatch
for each of
the dataRetrieval
calls. This allows the loop to continue
even if one of the states failed for some reason.
# Load packages
library(dataRetrieval)
library(dplyr)
# state code information for the 48 conterminous United States plus DC:
state_cd_cont <- stateCd[c(2,4:12,14:52),]
rownames(state_cd_cont) <- seq(length=nrow(state_cd_cont)) # reset row sequence
for(i in seq_len(nrow(state_cd_cont))){
state_cd <- state_cd_cont$STATE[i]
state_nm <- state_cd_cont$STUSAB[i]
message("Getting: ", state_nm)
df_summary <- tryCatch({
readWQPsummary(statecode = state_cd,
CharacteristicName = "Nitrogen",
siteType = "Stream")
},
error=function(cond) {
message(paste("No data in:", state_nm))
break()
})
sites <- df_summary |>
filter(YearSummarized >= 1995,
YearSummarized <= 2020) |>
group_by(MonitoringLocationIdentifier, MonitoringLocationName, Provider) |>
summarise(start_year = min(YearSummarized, na.rm = TRUE),
end_year = max(YearSummarized, na.rm = TRUE),
count_activity = sum(ActivityCount, na.rm = TRUE),
count_result = sum(ResultCount, na.rm = TRUE)) |>
ungroup() |>
filter(count_activity >= 40)
if(nrow(sites) > 0){
df_state <- tryCatch({
readWQPdata(siteid = sites$MonitoringLocationIdentifier,
CharacteristicName = "Nitrogen",
startDateLo = "1995-01-01",
startDateHi = "2023-12-31",
sampleMedia = "Water",
convertType = FALSE,
ignore_attributes = TRUE,
legacy = TRUE)
},
error=function(cond) {
message(paste("No data in:", state_nm))
})
if(nrow(df_state) > 0){
# I would write the data here, just in case:
saveRDS(df_state, file = paste(state_nm, "data.rds",
sep = "_"))
} else {
message("No data in:", state_nm)
}
}
}
The following can be included in the loop above, but saving it for later allows for more flexibility with the raw data (e.g., leaving data in or filtering data out).
Although creating an empty data frame and filling the data in later would be the most efficient way to go, binding the rows is flexible and easy to conceptualize. For this data download scenario, there wasn’t a huge bottleneck by using the “dyplr::bind_rows”, but that could be a place to reconsider if the next section seems to be taking too long, in which creating the empty data frame may be the considered solution.
The next loop shown below opens each file, pulls out the data we need
for analysis, and binds up each state into one large data frame. Notice
“ResultMeasureValue” changes into a character vector. By default,
dataRetrieval
will try to convert that column into numeric.
Sometimes however, that can’t be done because there are actual character
values in the results. Therefore, to retain all of that information, be
sure each state’s “ResultMeasureValue” column is a character (more
information on this is below).
all_nitrogen <- data.frame()
# for(state in stateCd$STUSAB){
for(state in state_cd_cont$STUSAB){
state_df <- tryCatch({
readRDS(paste0(state, "_data.rds"))
}, error = function(e) e)
if(!inherits(state_df, "error") && nrow(state_df) > 0){
df_slim <- state_df |>
filter(ActivityMediaSubdivisionName %in% c("Surface Water") |
is.na(ActivityMediaSubdivisionName),
ResultSampleFractionText %in% c("Total"))
all_nitrogen <- bind_rows(all_nitrogen, df_slim)
}
}
all_nitrogen <- parse_WQP(all_nitrogen)
all_nitrogen <- dataRetrieval:::create_WQP_attributes(all_nitrogen,
siteid = unique(all_nitrogen$MonitoringLocationIdentifier))
For an extended discussion, see: