Changes to dataRetrieval for delivering discrete sample data
Lee Stanish
25 October, 2024
Source:vignettes/wqx3_development_plan.Rmd
wqx3_development_plan.Rmd
Summary
This page describes the changes happening right now to R dataRetrieval for accessing water quality and discrete sample data. The content was originally presented during the WaterSciCon24 meeting in St. Paul, Minnesota.
For questions or comments, please email CompTools@usgs.gov.
Background on dataRetrieval
If you are viewing this page then you are already familiar with
dataRetrieval
, but just to summarize, the package started
around 2013 with a version in R. It is a software package created and
maintained by the USGS and enables searching, filtering and downloading
water data from the USGS and the multi-agency water quality portal. It
has become very popular, with more than 200,000 downloads since its
release.
There is a stable release available on CRAN as well as a developer version available on GitHub. At its core the software relies on USGS and WQP web services. The R version has become popular for users who want easy access to water data in the U.S.
Development process
dataRetrieval
is a community resource. It is open source
and we actively respond to user input. We generally follow Agile
principles for software development by making incremental updates and
releasing updates frequently to ensure that the software remains
relevant and is responsive to user feedback. The package went through
the USGS software review process and all code updates are reviewed by a
qualified developer prior to being accepted.
And just a quick reminder that we now offer a suite of data retrieval software that includes a Python and (experimental) Julia version!
Why are updates happening? Part I
The changes result from a couple of relatively major updates to how
both the USGS and the WQP deliver discrete water quality and sample
data. Discrete data from USGS are now being delivered in the Water
Quality Exchange (or WQX) format. Previously, these data were available
using a set of NWIS functions, e.g. readNWISqw()
(now
retired). The NWIS data format was specific to USGS data and relied on
codes for describing the metadata. Now, USGS data will be formatted in
the same way as the data published by other providers on the Water
Quality Portal. This change will make it easier for users to combine
USGS data with data from these other sources.
How will this impact data delivery? For dataRetrieval
,
this means that there is a new USGS web service that is available for us
to build new functionality. However, there are changes that will require
some adjustment. Some of the existing USGS NWIS services that we’ve
relied on for years are reaching their end of life and will soon become
deprecated. The services for pulling discrete WQ data still work,
however they are delivering STALE data. As of March 11, 2024, all
queries to the legacy NWIS qw data sources will pull the same data
regardless of whether there are new data, and no changes to the data
delivered from these functions is going to happen moving forward.
Why are updates happening? Part II
The second significant update is happening on the Water Quality
Portal. We recently released a test (“beta”) version of a new WQP web
service that delivers data in the new WQX v3.0 format. The services are
still under development and when completed all metadata will be
available in WQX v3.0 format. The current WQP functions in
dataRetrieval
pull data in v2.2 of the WQX format. Using
the new WQX v3.0 services, users will have access to more metadata
fields than before. The new data profiles, which are simply subsets of
the hundreds of metadata fields available in the WQX v3.0 schema, have
been redesigned to make it easier for users to get the data they want in
a single download so that users don’t need to join data tables and to
figure out how to correctly merge datasets.
The plus side of this change is that we now have access to a new WQP web service for us to build new functionality. The downside is that the current WQP services will experience the same problem as the USGS legacy web services: they will pull STALE data from the USGS. For users who are not concerned with USGS data, the data from other providers will be current and up to date. Until the new WQX v3.0 services are complete and stabilized, data will continue to be delivered in the current WQX v2.2 format.
In summary, these two separate updates are happening at the same
time, which means that there are lots of changes happening all at once
that impact dataRetrieval
functionality in different ways.
We are doing our best to make thoughtful decisions about how to update
the package to deliver as much data as possible as soon as possible.
How will updates be implemented?
Our implementation approach is to develop incrementally along with
the developing web services so that users have access to the most recent
and most complete discrete data available. When the WQX v3.0 services
are updated, we will update the R dataRetrieval
functions
and push out those changes to the developer version on GitHub. We’ll
also update Python on GitHub shortly afterwards. Once the WQX v3.0
services stabilize, we will put out new software releases for R and
Python on CRAN and PyPi, respectively.
What changes can users expect?
Here’s our current plan, starting with the R
dataRetrieval
functions, although these are analogous to
changes in Python as well.
All of the functions that pull USGS real-time, continuous data, such as flow or water quality, are not changing. These functions typically include NWIS in their names and use services such as ‘iv’ or ‘dv’. Example:
readNWISdv()
If you are pulling water quality data using
readWQPdata()
orreadWQPqw()
, expect to see some minor changes in function behavior. For instance,readWQPdata()
has new ‘service’ options, and by default it will pull result-level data from the new WQX v3.0 web services.readWQPqw()
now has an added argument, called ‘legacy’, which allows the user to select whether or not to use the current v2.2 services or the new v3.0 services.
Big changes!
There is at least one breaking change that will prevent full
backwards compatibility for R dataRetrieval.
The
readNWISqw()
function still works, however it is using the
legacy NWIS qw services and therefore the data are stale as of
March 11, 2024. Eventually this function will be deprecated. Users may
have noticed that there is a warning on that particular function. We
encourage users to switch to the WQP functions for retrieving USGS
discrete water quality data. Or, users could wait until we build new
functions using the new samples-data web services for accessing discrete
water quality and sample data.
Do I need to change my workflow?
In addition to the function-level changes, users will need to update their discrete water quality data workflows to the new WQX v3.0 format.
Nearly all of the field names have changed between the WQX v2.2 and v3.0 formats. This is a pretty big deal, and we are working to minimize the impact on users. For example,
dataRetrieval
is being updated to work with the new field names. Users can also access schema files that map the v2.2 names to v3.0.The USGS updates to the new WQX format also changed the way that censored data are delivered. Censored data in the new format will be more consistent with how other data providers are doing it. Consistency is a good thing, but this is a change compared to how those data are delivered in the current WQP functions. The
dataRetrieval
documentation provides guidance on how users can update their workflows.
Where can I go to get help?
We will provide updated documentation to describe how to use the new
and updated functionality, and will also retain documentation for the
current stable CRAN release. Our plan is to keep the function help pages on GitHub
updated with the most recent functionality. Users of the stable CRAN
release should use the documentation available on CRAN,
or pull up the function help pages in their R console
(e.g. ?readWQPdata
).
How can I use the newest changes?
If you want to follow along with us and have access to the most recent USGS discrete water quality data or WQP functions, then follow the installation instructions available on our DOI-USGS GitHub R and Python accounts. For those who prefer to wait for a stable software release or who don’t need the most updated USGS data or WQP functions, then simply wait until we roll out these changes to CRAN and PyPi some time later this year.
Where can I go to get the most updated news and information?
There are a lot of changes coming and the best way to stay up to
speed is by visiting our dataRetrieval
Status
page. We’ll continue to update this page as development continues.
We appreciate your patience!