Increasingly there are high frequency sensor data available for water quality data. There’s a need to join the sensor and discrete data by the closest time.
This article will discuss how to do that using the
data.table
package.
Let’s look at site “01646500”, and a nearby site with a real-time nitrate-plus-nitrite sensor. Our goal is to get the discharge and nitrate-plus-nitrite sensor data aligned with the discrete water quality samples.
library(dataRetrieval)
site_uv <- "01646500"
site_qw <- "USGS-01646580"
pcode_uv <- "99133"
pcode_qw <- "00631"
start_date <- as.Date("2018-01-01")
end_date <- as.Date("2020-01-01")
qw_data <- readWQPqw(site_qw, pcode_qw,
startDate = start_date,
endDate = end_date)
uv_data <- readNWISuv(siteNumbers = site_uv,
parameterCd = c(pcode_uv, "00060"),
startDate = start_date,
endDate = end_date)
The sensor data (“uv” data) at this particular site has 2 columns of data that are important. The first task is to combine those columns. This is rather unique to this particular site and probably won’t need to be done generally.
library(dplyr)
uv_trim <- uv_data |>
select(uv_date = dateTime,
val1 = X_SUNA...Discontinued._99133_00000,
val2 = X_SUNA_99133_00000,
flow = X_00060_00000) |>
mutate(val_uv = if_else(is.na(val1), val2, val1)) |>
select(-val1, -val2)
uv_date | flow | val_uv |
---|---|---|
2018-01-01 05:00:00 | 1940 | 1.43 |
2018-01-01 05:15:00 | 1940 | 1.43 |
2018-01-01 05:30:00 | 1850 | 1.43 |
2018-01-01 05:45:00 | 1850 | 1.43 |
2018-01-01 06:00:00 | 1850 | 1.43 |
2018-01-01 06:15:00 | 1810 | 1.43 |
Next we’ll clean up the discrete water quality “qw” data to make it easy to follow in this tutorial.
qw_trim <- qw_data |>
filter(ActivityTypeCode == "Sample-Routine",
!is.na(ActivityStartDateTime)) |>
select(qw_date = ActivityStartDateTime,
val_qw = ResultMeasureValue,
det_txt = ResultDetectionConditionText)
qw_date | val_qw | det_txt |
---|---|---|
2018-01-04 15:30:00 | 1.39 | NA |
2018-01-15 15:15:00 | 1.73 | NA |
2018-02-01 15:30:00 | 1.66 | NA |
2018-02-11 19:00:00 | 1.33 | NA |
2018-02-13 17:30:00 | 1.53 | NA |
2018-02-15 16:00:00 | 1.65 | NA |
Finally, we’ll use the data.table
package to do a join
to the nearest date. The code to do that is here:
library(data.table)
setDT(qw_trim)[, join_date := qw_date]
setDT(uv_trim)[, join_date := uv_date]
closest_dt <- uv_trim[qw_trim, on = .(join_date), roll = "nearest"]
closest_dt
is a data.table
object. It
similar to a data.frame, but not identical. We can convert it to a
data.frame and then use dplyr
commands. Note: the whole
analysis can be done via data.table
, but most examples in
dataRetrieval
have used dplyr
, which is why we
bring it back to data.frame. dplyr
also has a
join_by(closest())
option, but it is more complicated
because you can only specify the closeness in either the forward or
backwards direction (and we want either direction).
We can calculate “delta_time” - the difference in time between the uv
and qw data. We’ll probably want to add a threshold that we don’t join
values if they are too far apart in time. In this example, if the
difference is greater than 24 hours, we’ll substitute
NA
.
qw_closest <- data.frame(closest_dt) |>
mutate(delta_time = difftime(qw_date, uv_date,
units = "hours"),
val_uv = if_else(abs(as.numeric(delta_time)) >= 24, NA, val_uv)) |>
select(-join_date)
uv_date | flow | val_uv | qw_date | val_qw | det_txt | delta_time |
---|---|---|---|---|---|---|
2018-01-04 15:30:00 | 1850 | 1.43 | 2018-01-04 15:30:00 | 1.39 | NA | 0 hours |
2018-01-15 15:15:00 | 8960 | 1.84 | 2018-01-15 15:15:00 | 1.73 | NA | 0 hours |
2018-02-01 15:30:00 | 6000 | 1.67 | 2018-02-01 15:30:00 | 1.66 | NA | 0 hours |
2018-02-11 19:00:00 | 20800 | 1.29 | 2018-02-11 19:00:00 | 1.33 | NA | 0 hours |
2018-02-13 17:30:00 | 30700 | 1.63 | 2018-02-13 17:30:00 | 1.53 | NA | 0 hours |
2018-02-15 16:00:00 | 25300 | 1.68 | 2018-02-15 16:00:00 | 1.65 | NA | 0 hours |
Here are a few plots to show the applications of these joins:
library(ggplot2)
ggplot(data = qw_closest) +
geom_point(aes(x = val_uv, y = val_qw)) +
theme_bw() +
xlab("Sensor") +
ylab("Discrete")
ggplot(data = qw_closest) +
geom_point(aes(x = flow, y = val_qw)) +
theme_bw() +
xlab("Discharge") +
ylab("Concentration") +
scale_x_log10() +
scale_y_log10()