Large Data Pulls from Water Quality Portal - A Script-Based Approach • dataRetrieval

dataRetrieval is an R package that provides user-friendly access to water data from either the Water Quality Portal (WQP) or the National Water Information Service (NWIS). dataRetrival makes it easier for a user to download data from the WQP or NWIS and convert it into usable formats.

This article will walk through an example that uses the WQP summary service to limit the amount downloaded to only relevant data using a scripting approach. For large datasets, that can save a lot of time and ultimately reduce the complexity of subsequent data processing.

This article highlights a “scripting” approach, meaning it will only use basic R techniques. There is less of a learning curve in this approach, but also if something goes wrong, it is more challenging to re-run. The alternative is the pipeline approach, which is highlighted in the article Large Data Pulls from Water Quality Portal - A Pipeline-Based Approach.

See https://waterdata.usgs.gov/blog/large_sample_pull/ for more detail, including analysis of the downloaded data.

This article was updated in March 2024 to use the parse_WQP function to convert columns after the full pull. This makes binding the data together more simple.

Large data example

This blog sets up a scenario to look at all the total nitrogen data measured in streams within the contiguous United States where sites have at least 40 measurements between 1995 and 2020.

Data download

First, set up a loop to find out which sites have the required data, then get only the data that is relevant.

There are several ways to break up the geographic part of the query, such as bounding box, state, county, or hydrologic unit code (HUC). Depending on the density of the data, it may be ideal to adjust how to loop through the geographic query. Sometimes running a single dataRetrieval call comes back with a “timeout” error. Other times, requests for data spans more than traditional geographic filters of HUC, state, or county. In these cases, it may be necessary to break up the dataRetrieval call into smaller subsets, and then bind those subsets together separately. This blog will discuss one strategy for breaking up a WQP request.

dataRetrieval includes a data frame “stateCd” that has all of the states and territories that could be queried by either NWIS or WQP. In this example, only the lower 48 states along with Washington, D.C.are considered.

Use the readWQPsummary function, which is a very useful function that returns information about all the data that is available for a particular query. The initial query is asking what nitrogen data for streams is available in a particular state. The returned data shows how many nitrogen samples are available at each site for each year. Then, using filtering and summaries will figure out exactly which sites meet the set up scenairo’s needs.

The readWQPdata function is used to download the actual relevant data. This example saves the data using the saveRDS function for each individual state. This ensures higher likelihood of successful completion of the query. For example, if a failure occurs during the download and the loops don’t finish. In that case, the states that successfully downloaded the data are skipped, and only re-run the states that didn’t work. Saving as an “RDS” file also has the benefits of retaining all the attributes of the data. Notice another feature of this loop is using tryCatch for each of the dataRetrieval calls. This allows the loop to continue even if one of the states failed for some reason.

# Load packages
library(dataRetrieval)
library(dplyr)

# state code information for the 48 conterminous United States plus DC:
state_cd_cont <- stateCd[c(2,4:12,14:52),] 
rownames(state_cd_cont) <- seq(length=nrow(state_cd_cont)) # reset row sequence

for(i in seq_len(nrow(state_cd_cont))){
  
  state_cd <- state_cd_cont$STATE[i]
  state_nm <- state_cd_cont$STUSAB[i]
  message("Getting: ", state_nm)
  
  df_summary <- tryCatch({
    readWQPsummary(statecode = state_cd,
                   CharacteristicName = "Nitrogen",
                   siteType = "Stream")
  }, 
  error=function(cond) {
    message(paste("No data in:", state_nm))
    break()
  })
  
  sites <- df_summary |> 
    filter(YearSummarized >= 1995,
           YearSummarized <= 2020) |> 
    group_by(MonitoringLocationIdentifier, MonitoringLocationName, Provider) |> 
    summarise(start_year = min(YearSummarized, na.rm = TRUE),
              end_year = max(YearSummarized, na.rm = TRUE),
              count_activity = sum(ActivityCount, na.rm = TRUE),
              count_result = sum(ResultCount, na.rm = TRUE)) |> 
    ungroup() |> 
    filter(count_activity >= 40)
  
  if(nrow(sites) > 0){
    df_state <- tryCatch({
      readWQPdata(siteid = sites$MonitoringLocationIdentifier,
                  CharacteristicName = "Nitrogen",
                  startDateLo = "1995-01-01",
                  startDateHi = "2023-12-31",
                  sampleMedia = "Water", 
                  convertType = FALSE,
                  ignore_attributes = TRUE,
                  legacy = TRUE)
    }, 
    error=function(cond) {
      message(paste("No data in:", state_nm))
    })
    
    if(nrow(df_state) > 0){
      # I would write the data here, just in case:
      saveRDS(df_state, file = paste(state_nm, "data.rds", 
                                     sep = "_"))
      
    } else {
      message("No data in:", state_nm)
    }
  }
}

The following can be included in the loop above, but saving it for later allows for more flexibility with the raw data (e.g., leaving data in or filtering data out).

Although creating an empty data frame and filling the data in later would be the most efficient way to go, binding the rows is flexible and easy to conceptualize. For this data download scenario, there wasn’t a huge bottleneck by using the “dyplr::bind_rows”, but that could be a place to reconsider if the next section seems to be taking too long, in which creating the empty data frame may be the considered solution.

The next loop shown below opens each file, pulls out the data we need for analysis, and binds up each state into one large data frame. Notice “ResultMeasureValue” changes into a character vector. By default, dataRetrieval will try to convert that column into numeric. Sometimes however, that can’t be done because there are actual character values in the results. Therefore, to retain all of that information, be sure each state’s “ResultMeasureValue” column is a character (more information on this is below).

all_nitrogen <- data.frame()
# for(state in stateCd$STUSAB){
for(state in state_cd_cont$STUSAB){
  
  state_df <- tryCatch({
    readRDS(paste0(state, "_data.rds"))
  }, error = function(e) e)
    
    
  if(!inherits(state_df, "error") && nrow(state_df) > 0){
    df_slim <- state_df |>
      filter(ActivityMediaSubdivisionName %in% c("Surface Water") |
               is.na(ActivityMediaSubdivisionName),
             ResultSampleFractionText %in% c("Total")) 
    
    all_nitrogen <- bind_rows(all_nitrogen, df_slim)
  }

}

all_nitrogen <- parse_WQP(all_nitrogen)
all_nitrogen <- dataRetrieval:::create_WQP_attributes(all_nitrogen,
                        siteid = unique(all_nitrogen$MonitoringLocationIdentifier))

For an extended discussion, see:

https://waterdata.usgs.gov/blog/large_sample_pull/

Disclaimer

Any use of trade, firm, or product names is for descriptive purposes only and does not imply endorsement by the U.S. Government.

Large Data Pulls from Water Quality Portal - A Script-Based Approach

Aliesha Krall

Laura DeCicco

2024-03-07

Large data example

Data download

Disclaimer