08: Working with tabular data in Pandas


Not that type of Panda – Python’s Pandas package

Pandas is a powerful, flexible, and easy-to-use open source data analysis and manipulation tool. It is commonly used for operations that would otherwise be done in a spreadsheet environment.

Let’s begin by importing the libraries and setting our data path.

[1]:
from pathlib import Path
import numpy as np
import matplotlib as mpl
import matplotlib.pyplot as plt
import pandas as pd
import statsmodels.api as sm
from dataretrieval import nwis
from matplotlib.backends.backend_pdf import PdfPages
from scipy.signal import detrend

data_path = Path("..", "data", "pandas")

Get site information data for part of the Russian River near Guerneville, CA from NWIS

[2]:
bbox = (-123.10, 38.45, -122.90, 38.55)
info, metadata = nwis.get_info(bBox=[str(i) for i in bbox])
info.to_csv(data_path / "site_info.csv", index=False)
info.head()
[2]:
agency_cd site_no station_nm site_tp_cd lat_va long_va dec_lat_va dec_long_va coord_meth_cd coord_acy_cd ... local_time_fg reliability_cd gw_file_cd nat_aqfr_cd aqfr_cd aqfr_type_cd well_depth_va hole_depth_va depth_src_cd project_no
0 USGS 11467000 RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA ST 383031.0 1225536.0 38.508523 -122.927774 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN NaN
1 USGS 11467002 RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA ST 382958.0 1225950.0 38.499358 -122.998333 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967760420
2 USGS 11467006 RUSSIAN R A VACATION BEACH CA ST 382903.0 1230041.0 38.484080 -123.012500 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967760420
3 USGS 11467050 BIG AUSTIN C A CAZADERO CA ST 383215.0 1230518.0 38.537413 -123.089447 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN NaN
4 USGS 11467200 AUSTIN C NR CAZADERO CA ST 383024.0 1230407.0 38.506580 -123.069724 M S ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967700102

5 rows × 42 columns

Wow, that’s a lot of gages in this location! Let’s use the information on all of them to learn about pandas.

Our NWIS query returned the site information to us as a pandas DataFrame. We’ll be working with this to start learning the basics of pandas.

Viewing data in pandas

Pandas has built-in methods and attributes to inspect DataFrame objects. We’ll look at a few handy ones:

  • .head(): inspect the first few rows of data

  • .tail(): inspect the last few rows of data

  • .index: show the row index

  • .columns: show the column names

  • .describe(): statistically describe the data

[3]:
info.head()
[3]:
agency_cd site_no station_nm site_tp_cd lat_va long_va dec_lat_va dec_long_va coord_meth_cd coord_acy_cd ... local_time_fg reliability_cd gw_file_cd nat_aqfr_cd aqfr_cd aqfr_type_cd well_depth_va hole_depth_va depth_src_cd project_no
0 USGS 11467000 RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA ST 383031.0 1225536.0 38.508523 -122.927774 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN NaN
1 USGS 11467002 RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA ST 382958.0 1225950.0 38.499358 -122.998333 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967760420
2 USGS 11467006 RUSSIAN R A VACATION BEACH CA ST 382903.0 1230041.0 38.484080 -123.012500 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967760420
3 USGS 11467050 BIG AUSTIN C A CAZADERO CA ST 383215.0 1230518.0 38.537413 -123.089447 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN NaN
4 USGS 11467200 AUSTIN C NR CAZADERO CA ST 383024.0 1230407.0 38.506580 -123.069724 M S ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967700102

5 rows × 42 columns

[4]:
info.tail()
[4]:
agency_cd site_no station_nm site_tp_cd lat_va long_va dec_lat_va dec_long_va coord_meth_cd coord_acy_cd ... local_time_fg reliability_cd gw_file_cd nat_aqfr_cd aqfr_cd aqfr_type_cd well_depth_va hole_depth_va depth_src_cd project_no
20 USGS 383009122543001 GREEN VALLEY C NR MIRABEL HEIGHTS CA ST 383009.0 1225430.0 38.502500 -122.908333 G U ... Y NaN NaN NaN NaN NaN NaN NaN NaN 967760420
21 USGS 383012122574501 RUSSIAN R A ODD FELLOWS PARK NR RIO NIDO CA ST 383012.0 1225745.0 38.503333 -122.962500 G F ... Y NaN NaN NaN NaN NaN NaN NaN NaN 967760420
22 USGS 383016123041101 008N011W27N001M GW 383016.3 1230411.0 38.504528 -123.069722 G 5 ... Y C YY Y NaN NaN NaN 30.0 30.0 D 00BHM8187
23 USGS 383028122554501 HOBSON C NR HACIENDA CA ST 383028.0 1225545.0 38.507778 -122.929167 G U ... Y NaN NaN NaN NaN NaN NaN NaN NaN 967760420
24 USGS 383034122590701 008N010W29H004M GW 383034.4 1225907.9 38.509556 -122.985528 G 5 ... Y C YY Y N100CACSTL NaN NaN 99.0 110.0 D 9677BHM32

5 rows × 42 columns

[5]:
# print the index names
print(info.index)
RangeIndex(start=0, stop=25, step=1)
[6]:
# print the column names
print(info.columns)

# print the column names as a list
print(list(info))
Index(['agency_cd', 'site_no', 'station_nm', 'site_tp_cd', 'lat_va', 'long_va',
       'dec_lat_va', 'dec_long_va', 'coord_meth_cd', 'coord_acy_cd',
       'coord_datum_cd', 'dec_coord_datum_cd', 'district_cd', 'state_cd',
       'county_cd', 'country_cd', 'land_net_ds', 'map_nm', 'map_scale_fc',
       'alt_va', 'alt_meth_cd', 'alt_acy_va', 'alt_datum_cd', 'huc_cd',
       'basin_cd', 'topo_cd', 'instruments_cd', 'construction_dt',
       'inventory_dt', 'drain_area_va', 'contrib_drain_area_va', 'tz_cd',
       'local_time_fg', 'reliability_cd', 'gw_file_cd', 'nat_aqfr_cd',
       'aqfr_cd', 'aqfr_type_cd', 'well_depth_va', 'hole_depth_va',
       'depth_src_cd', 'project_no'],
      dtype='object')
['agency_cd', 'site_no', 'station_nm', 'site_tp_cd', 'lat_va', 'long_va', 'dec_lat_va', 'dec_long_va', 'coord_meth_cd', 'coord_acy_cd', 'coord_datum_cd', 'dec_coord_datum_cd', 'district_cd', 'state_cd', 'county_cd', 'country_cd', 'land_net_ds', 'map_nm', 'map_scale_fc', 'alt_va', 'alt_meth_cd', 'alt_acy_va', 'alt_datum_cd', 'huc_cd', 'basin_cd', 'topo_cd', 'instruments_cd', 'construction_dt', 'inventory_dt', 'drain_area_va', 'contrib_drain_area_va', 'tz_cd', 'local_time_fg', 'reliability_cd', 'gw_file_cd', 'nat_aqfr_cd', 'aqfr_cd', 'aqfr_type_cd', 'well_depth_va', 'hole_depth_va', 'depth_src_cd', 'project_no']

The describe() method is only useful for numerical data. This example does not have a lot of useful data for describe(); however, we’ll use it again later.

[7]:
info.describe()
[7]:
lat_va long_va dec_lat_va dec_long_va district_cd state_cd county_cd map_scale_fc alt_va alt_acy_va huc_cd basin_cd construction_dt inventory_dt drain_area_va contrib_drain_area_va aqfr_cd aqfr_type_cd well_depth_va hole_depth_va
count 25.000000 2.500000e+01 25.000000 25.000000 25.0 25.0 25.0 22.0 23.000000 23.000000 25.0 0.0 5.000000e+00 8.000000e+00 6.000000 0.0 0.0 0.0 6.000000 5.000000
mean 382930.716000 1.227999e+06 38.490956 -122.989798 6.0 6.0 97.0 24000.0 44.040435 11.865652 18010110.0 NaN 1.988489e+07 2.006084e+07 935.400000 NaN NaN NaN 59.166667 70.400000
std 129.549182 2.349959e+03 0.021226 0.060306 0.0 0.0 0.0 0.0 41.449315 9.492186 0.0 NaN 5.202210e+04 3.689536e+04 691.350481 NaN NaN NaN 37.434832 42.435834
min 382701.000000 1.225404e+06 38.450278 -123.089447 6.0 6.0 97.0 24000.0 2.000000 0.010000 18010110.0 NaN 1.984061e+07 2.004091e+07 26.600000 NaN NaN NaN 22.000000 22.000000
25% 382815.000000 1.225545e+06 38.470833 -123.046111 6.0 6.0 97.0 24000.0 21.570000 1.600000 18010110.0 NaN 1.984062e+07 2.004091e+07 381.600000 NaN NaN NaN 26.250000 30.000000
50% 383001.900000 1.230006e+06 38.500528 -123.001667 6.0 6.0 97.0 24000.0 43.000000 20.000000 18010110.0 NaN 1.986093e+07 2.004092e+07 1345.500000 NaN NaN NaN 55.000000 80.000000
75% 383012.000000 1.230246e+06 38.503333 -122.929167 6.0 6.0 97.0 24000.0 48.000000 20.000000 18010110.0 NaN 1.994110e+07 2.006082e+07 1366.500000 NaN NaN NaN 94.250000 110.000000
max 383215.000000 1.230518e+06 38.537413 -122.901222 6.0 6.0 97.0 24000.0 211.000000 20.000000 18010110.0 NaN 1.994120e+07 2.012072e+07 1461.000000 NaN NaN NaN 99.000000 110.000000

Getting data from a pandas dataframe

There are multiple ways to get data out of a pandas dataframe as either a "series", a numpy array, or a list.

Let’s start by getting data as a series using a few methods

[8]:
# get a series of site numbers by key
info["site_no"]
[8]:
0            11467000
1            11467002
2            11467006
3            11467050
4            11467200
5            11467210
6     382701123025801
7     382752123003401
8     382754123030501
9     382757123003801
10    382808122565401
11    382815123024601
12    382819123010001
13    382944123002901
14    382955122594101
15    383001122540701
16    383003122540401
17    383003122540402
18    383003122540403
19    383006123000601
20    383009122543001
21    383012122574501
22    383016123041101
23    383028122554501
24    383034122590701
Name: site_no, dtype: object
[9]:
# get station names by attribute
info.station_nm
[9]:
0     RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA
1       RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA
2                     RUSSIAN R A VACATION BEACH CA
3                        BIG AUSTIN C A CAZADERO CA
4                           AUSTIN C NR CAZADERO CA
5                      RUSSIAN R A DUNCANS MILLS CA
6                   FREEZEOUT C NR DUNCANS MILLS CA
7                       DUTCH BILL C A MONTE RIO CA
8      RUSSIAN R A CASINI RANCH NR DUNCANS MILLS CA
9                          RUSSIAN R A MONTE RIO CA
10                                  007N010W10H001M
11                     AUSTIN C NR DUNCANS MILLS CA
12                                  007N010W07D001M
13                      HULBERT C NR GUERNEVILLE CA
14                    POCKET CYN C A GUERNEVILLE CA
15                                  008N009W31C002M
16                                  008N009W31C003M
17                                  008N009W31C004M
18                                  008N009W31C005M
19                          FIFE C A GUERNEVILLE CA
20             GREEN VALLEY C NR MIRABEL HEIGHTS CA
21      RUSSIAN R A ODD FELLOWS PARK NR RIO NIDO CA
22                                  008N011W27N001M
23                          HOBSON C NR HACIENDA CA
24                                  008N010W29H004M
Name: station_nm, dtype: object

Getting data from a dataframe as a numpy array can be accomplished by using .values

[10]:
info.station_nm.values
[10]:
array(['RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA',
       'RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA',
       'RUSSIAN R A VACATION BEACH CA', 'BIG AUSTIN C A CAZADERO CA',
       'AUSTIN C NR CAZADERO CA', 'RUSSIAN R A DUNCANS MILLS CA',
       'FREEZEOUT C NR DUNCANS MILLS CA', 'DUTCH BILL C A MONTE RIO CA',
       'RUSSIAN R A CASINI RANCH NR DUNCANS MILLS CA',
       'RUSSIAN R A MONTE RIO CA', '007N010W10H001M',
       'AUSTIN C NR DUNCANS MILLS CA', '007N010W07D001M',
       'HULBERT C NR GUERNEVILLE CA', 'POCKET CYN C A GUERNEVILLE CA',
       '008N009W31C002M', '008N009W31C003M', '008N009W31C004M',
       '008N009W31C005M', 'FIFE C A GUERNEVILLE CA',
       'GREEN VALLEY C NR MIRABEL HEIGHTS CA',
       'RUSSIAN R A ODD FELLOWS PARK NR RIO NIDO CA', '008N011W27N001M',
       'HOBSON C NR HACIENDA CA', '008N010W29H004M'], dtype=object)
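
To round out the list above, a series can also be converted to a plain Python list with .tolist(), and .to_numpy() is the modern equivalent of .values. A quick sketch:

# get the station names as a plain Python list
names = info["station_nm"].tolist()

# get them as a numpy array (same result as .values)
arr = info["station_nm"].to_numpy()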

Selection by position

We can get data by position in the dataframe using the .iloc attribute. Like numpy slicing, the end of an .iloc slice is exclusive.

[11]:
info.iloc[0:2, 1:4]
[11]:
site_no station_nm site_tp_cd
0 11467000 RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA ST
1 11467002 RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA ST
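
Single values can also be pulled by position. A short sketch using scalar indices with .iloc, and with .iat, its faster single-value counterpart:

# get the site number in the first row, second column
site = info.iloc[0, 1]

# .iat does the same thing but is optimized for scalar access
site = info.iat[0, 1]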

Selection by label

Pandas allows the user to get data from the dataframe by index and column labels using .loc. Unlike .iloc, a .loc slice includes both of its endpoints (note that 0:3 below returns four rows).

[12]:
info.loc[0:3, ["site_no", "station_nm", "site_tp_cd"]]
[12]:
site_no station_nm site_tp_cd
0 11467000 RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA ST
1 11467002 RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA ST
2 11467006 RUSSIAN R A VACATION BEACH CA ST
3 11467050 BIG AUSTIN C A CAZADERO CA ST

Boolean indexing

Pandas dataframes support boolean indexing, which allows a user to create a new dataframe with only the data that meets a boolean condition defined by the user.

Let’s get a dataframe of only groundwater sites from the info dataframe

[13]:
dfgw = info[info["site_tp_cd"] == "GW"]
dfgw
[13]:
agency_cd site_no station_nm site_tp_cd lat_va long_va dec_lat_va dec_long_va coord_meth_cd coord_acy_cd ... local_time_fg reliability_cd gw_file_cd nat_aqfr_cd aqfr_cd aqfr_type_cd well_depth_va hole_depth_va depth_src_cd project_no
10 USGS 382808122565401 007N010W10H001M GW 382808.5 1225654.2 38.469028 -122.948389 G 5 ... Y C YY Y NaN NaN NaN 22.0 22.0 D 00BHM8187
12 USGS 382819123010001 007N010W07D001M GW 382819.9 1230100.8 38.472194 -123.016889 G 5 ... Y C YY N100CACSTL NaN NaN 99.0 110.0 D 9677BHM32
15 USGS 383001122540701 008N009W31C002M GW 383001.9 1225407.7 38.500528 -122.902139 G 5 ... Y C YY N100CACSTL NaN NaN 80.0 80.0 D 9677BHM32
16 USGS 383003122540401 008N009W31C003M GW 383003.3 1225404.4 38.500917 -122.901222 G 1 ... Y C YY Y N100CACSTL NaN NaN NaN NaN NaN 967760420
17 USGS 383003122540402 008N009W31C004M GW 383003.3 1225404.4 38.500917 -122.901222 G 1 ... Y C Y N100CACSTL NaN NaN NaN NaN NaN 967760420
18 USGS 383003122540403 008N009W31C005M GW 383003.3 1225404.4 38.500917 -122.901222 G 1 ... Y C YY Y N100CACSTL NaN NaN 25.0 NaN S 967760420
22 USGS 383016123041101 008N011W27N001M GW 383016.3 1230411.0 38.504528 -123.069722 G 5 ... Y C YY Y NaN NaN NaN 30.0 30.0 D 00BHM8187
24 USGS 383034122590701 008N010W29H004M GW 383034.4 1225907.9 38.509556 -122.985528 G 5 ... Y C YY Y N100CACSTL NaN NaN 99.0 110.0 D 9677BHM32

8 rows × 42 columns
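
Multiple conditions can be combined with & (and) and | (or), with each condition wrapped in parentheses. For example, a sketch selecting groundwater sites deeper than 50 ft:

# combine boolean conditions; parentheses are required around each one
deep_gw = info[(info["site_tp_cd"] == "GW") & (info["well_depth_va"] > 50.0)]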

Reading and writing data to .csv files

Pandas has support for both reading and writing many types of files. For this example we are focusing on .csv files. For information on other supported file types, see the ten minutes to pandas tutorial documentation

For this part we’ll write a new .csv file of the groundwater sites that we found in NWIS using to_csv().

to_csv() has a bunch of handy options for writing to file. For this example, I’m going to drop the index column while writing by passing index=False

[14]:
csv_file = data_path / "RussianRiverGWsites.csv"
dfgw.to_csv(csv_file, index=False)

Now we can load the csv file back into a pandas dataframe with the read_csv() function.

[15]:
dfgw2 = pd.read_csv(csv_file)
dfgw2
[15]:
agency_cd site_no station_nm site_tp_cd lat_va long_va dec_lat_va dec_long_va coord_meth_cd coord_acy_cd ... local_time_fg reliability_cd gw_file_cd nat_aqfr_cd aqfr_cd aqfr_type_cd well_depth_va hole_depth_va depth_src_cd project_no
0 USGS 382808122565401 007N010W10H001M GW 382808.5 1225654.2 38.469028 -122.948389 G 5 ... Y C YY Y NaN NaN NaN 22.0 22.0 D 00BHM8187
1 USGS 382819123010001 007N010W07D001M GW 382819.9 1230100.8 38.472194 -123.016889 G 5 ... Y C YY N100CACSTL NaN NaN 99.0 110.0 D 9677BHM32
2 USGS 383001122540701 008N009W31C002M GW 383001.9 1225407.7 38.500528 -122.902139 G 5 ... Y C YY N100CACSTL NaN NaN 80.0 80.0 D 9677BHM32
3 USGS 383003122540401 008N009W31C003M GW 383003.3 1225404.4 38.500917 -122.901222 G 1 ... Y C YY Y N100CACSTL NaN NaN NaN NaN NaN 967760420
4 USGS 383003122540402 008N009W31C004M GW 383003.3 1225404.4 38.500917 -122.901222 G 1 ... Y C Y N100CACSTL NaN NaN NaN NaN NaN 967760420
5 USGS 383003122540403 008N009W31C005M GW 383003.3 1225404.4 38.500917 -122.901222 G 1 ... Y C YY Y N100CACSTL NaN NaN 25.0 NaN S 967760420
6 USGS 383016123041101 008N011W27N001M GW 383016.3 1230411.0 38.504528 -123.069722 G 5 ... Y C YY Y NaN NaN NaN 30.0 30.0 D 00BHM8187
7 USGS 383034122590701 008N010W29H004M GW 383034.4 1225907.9 38.509556 -122.985528 G 5 ... Y C YY Y N100CACSTL NaN NaN 99.0 110.0 D 9677BHM32

8 rows × 42 columns
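
One caveat: read_csv() infers column dtypes, so numeric-looking identifiers such as site_no can come back as integers (dropping any leading zeros). If that matters, the dtype argument can force a column to string:

# a sketch: force site_no to be read back as a string
dfgw2 = pd.read_csv(csv_file, dtype={"site_no": str})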

Class exercise 1

Using the methods presented in this notebook create a DataFrame of surface water sites from the info dataframe, write it to a csv file named "RussianRiverSWsites.csv", and read it back in as a new DataFrame

[16]:
csv_name = data_path / "RussianRiverSWsites.csv"
dfsw = info[info.site_tp_cd == "ST"]
dfsw.to_csv(csv_name, index=False)

dfsw = pd.read_csv(csv_name)
dfsw.head()
[16]:
agency_cd site_no station_nm site_tp_cd lat_va long_va dec_lat_va dec_long_va coord_meth_cd coord_acy_cd ... local_time_fg reliability_cd gw_file_cd nat_aqfr_cd aqfr_cd aqfr_type_cd well_depth_va hole_depth_va depth_src_cd project_no
0 USGS 11467000 RUSSIAN R A HACIENDA BRIDGE NR GUERNEVILLE CA ST 383031.0 1225536.0 38.508523 -122.927774 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN NaN
1 USGS 11467002 RUSSIAN R A JOHNSONS BEACH A GUERNEVILLE CA ST 382958.0 1225950.0 38.499358 -122.998333 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967760420.0
2 USGS 11467006 RUSSIAN R A VACATION BEACH CA ST 382903.0 1230041.0 38.484080 -123.012500 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967760420.0
3 USGS 11467050 BIG AUSTIN C A CAZADERO CA ST 383215.0 1230518.0 38.537413 -123.089447 M F ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN NaN
4 USGS 11467200 AUSTIN C NR CAZADERO CA ST 383024.0 1230407.0 38.506580 -123.069724 M S ... Y NaN NNNNNNNN NaN NaN NaN NaN NaN NaN 967700102.0

5 rows × 42 columns

Get timeseries data from NWIS for surface water sites

Now let’s start working with stream gage data from NWIS. We’re going to send all of our surface water sites to NWIS and get back only the sites with daily discharge measurements.

[17]:
# filter for only SW sites
dfsw = info[info.site_tp_cd == "ST"]

# create query
sites = dfsw.site_no.to_list()
pcode = "00060"
start_date = "1939-09-22"
end_date = "2023-12-31"
df, metadata = nwis.get_dv(
    sites=sites,
    start=start_date,
    end=end_date,
    parameterCd=pcode,
    multi_index=False
)
df.head()
[17]:
site_no 00060_Mean 00060_Mean_cd
datetime
1939-10-01 00:00:00+00:00 11467000 185.0 A, e
1939-10-02 00:00:00+00:00 11467000 185.0 A, e
1939-10-03 00:00:00+00:00 11467000 185.0 A, e
1939-10-04 00:00:00+00:00 11467000 185.0 A, e
1939-10-05 00:00:00+00:00 11467000 185.0 A, e

Awesome! There are two gages in this study area that have daily discharge data!

Inspect the data using .plot()

[18]:
ax = df.plot()
ax.set_ylabel(r"cubic feet per second")
ax.set_yscale("log")

Updating column names

Let’s update the column names for these gages based on their locations instead of their USGS gage IDs. First we’ll get name information from the info dataframe, and then we’ll use the station names to remap the column names.

The rename() method accepts a dictionary that is formatted {current_col_name: new_col_name}
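
For instance, a minimal illustrative sketch (the new name "discharge_cfs" is hypothetical; the cell below builds the dictionary from the station names instead of hard-coding it):

df = df.rename(columns={"00060_Mean": "discharge_cfs"})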

[19]:
df
[19]:
site_no 00060_Mean 00060_Mean_cd
datetime
1939-10-01 00:00:00+00:00 11467000 185.0 A, e
1939-10-02 00:00:00+00:00 11467000 185.0 A, e
1939-10-03 00:00:00+00:00 11467000 185.0 A, e
1939-10-04 00:00:00+00:00 11467000 185.0 A, e
1939-10-05 00:00:00+00:00 11467000 185.0 A, e
... ... ... ...
2023-12-29 00:00:00+00:00 11467000 1970.0 P
2023-12-30 00:00:00+00:00 11467000 6170.0 P
2023-12-30 00:00:00+00:00 11467200 1430.0 P
2023-12-31 00:00:00+00:00 11467000 5970.0 P
2023-12-31 00:00:00+00:00 11467200 680.0 P

40848 rows × 3 columns

[20]:
df2 = df[['site_no', '00060_Mean']].pivot(columns=['site_no'])
df2.columns = df2.columns.get_level_values(1)

site_names = {site_no: '_'.join(name.split()[:2]).lower() for site_no, name
              in info.groupby('site_no')['station_nm'].first().loc[df2.columns].items()}
df2.rename(columns=site_names, inplace=True)
df2.plot()
[20]:
<Axes: xlabel='datetime'>

Adding a new column of data to the dataframe

Adding new columns of data is relatively simple in pandas. The call signature is similar to a dictionary: df["key"] = some_array_of_values

Let’s add a year column (water year) to the dataframe by getting the year from the index and adjusting it using np.where()

[21]:
df2['year'] = np.where(df2.index.month >=10, df2.index.year, df2.index.year - 1)

Manipulating data

Columns in a pandas dataframe can be manipulated similarly to working with a dictionary of numpy arrays. The discharge data units are \(\frac{ft^3}{s}\); let’s accumulate this to daily discharge in \(\frac{ft^3}{d}\).

The conversion for this is:

\(\frac{1\, ft^3}{s} \times \frac{60\, s}{min} \times \frac{60\, min}{hour} \times \frac{24\, hours}{day} \rightarrow \frac{ft^3}{day}\)

and then convert that to acre-feet (1 acre-ft = 43,559.9 \(ft^3\))

[22]:
df2
[22]:
site_no russian_r austin_c year
datetime
1939-10-01 00:00:00+00:00 185.0 NaN 1939
1939-10-02 00:00:00+00:00 185.0 NaN 1939
1939-10-03 00:00:00+00:00 185.0 NaN 1939
1939-10-04 00:00:00+00:00 185.0 NaN 1939
1939-10-05 00:00:00+00:00 185.0 NaN 1939
... ... ... ...
2023-12-27 00:00:00+00:00 888.0 967.0 2023
2023-12-28 00:00:00+00:00 1640.0 913.0 2023
2023-12-29 00:00:00+00:00 1970.0 1490.0 2023
2023-12-30 00:00:00+00:00 6170.0 1430.0 2023
2023-12-31 00:00:00+00:00 5970.0 680.0 2023

30773 rows × 3 columns

[23]:
conv = 60 * 60 * 24
df2["russian_r_cfd"] = df2.russian_r * conv
df2["russian_r_af"] = df2.russian_r_cfd / 43559.9

# now let's plot up discharge from 2015 - 2020
dfplt = df2[(df2["year"] >= 2015) & (df2["year"] <= 2020)]
ax = dfplt["russian_r_af"].plot()
ax.set_yscale("log")


Class exercise 2:

Using the methods described in the notebook, convert the discharge for "austin_c" from \(\frac{ft^3}{s}\) to acre-ft by adding two additional fields to df2 named "austin_c_cfd" and "austin_c_af".

After these two fields have been added to df2, use boolean indexing to create a new dataframe from 2015 through 2020. Finally, plot the discharge data (in acre-ft) for "austin_c".

Bonus exercise: try to plot both "russian_r_af" and "austin_c_af" on the same plot

[24]:
conv = 60 * 60 * 24
df2["austin_c_cfd"] = df2.austin_c * conv
df2["austin_c_af"] = df2.austin_c_cfd / 43559.9

# now let's plot up discharge from 2015 - 2020
dfplt = df2[(df2["year"] >= 2015) & (df2["year"] <= 2020)]
ax = dfplt["austin_c_af"].plot()
ax.set_yscale("log")

[25]:
# now let's plot up discharge from 2015 - 2020
dfplt = df2[(df2["year"] >= 2015) & (df2["year"] <= 2020)]
ax = dfplt[["russian_r_af","austin_c_af"]].plot()
ax.set_yscale("log")

groupby: grouping data and performing mathematical operations on it

groupby() is a powerful method that allows statistical operations to be performed over groups of “common” data within a dataframe.

For this example we’ll use it to get mean daily flows for the watershed. Pandas will group all common days of the year with each other and then calculate their mean value via .mean(). groupby() also supports other operations such as .median(), .sum(), .max(), .min(), .std(), and other functions.

[26]:
df2["day_of_year"] = df2.index.day_of_year

df_mean_day = df2.groupby(by=["day_of_year"], as_index=False)[["russian_r_af", "austin_c_af"]].mean()
ax = df_mean_day[["russian_r_af", "austin_c_af"]].plot()
ax.set_yscale("log")

We can see that around March flow starts decreasing heavily and doesn’t recover until sometime in October. What’s going on here? Let’s load some climate data and see what’s happening.

[27]:
cimis_file = data_path / "santa_rosa_CIMIS_83.csv"
df_cimis = pd.read_csv(cimis_file)
drop_list = ["qc"] + [f"qc.{i}" for i in range(1, 22)]
df_cimis.drop(columns=drop_list, inplace=True)
df_cimis
[27]:
Stn Id Stn Name CIMIS Region Date Jul ETo (in) Precip (in) Sol Rad (Ly/day) Avg Vap Pres (mBars) Max Air Temp (F) ... Rose NNE Rose ENE Rose ESE Rose SSE Rose SSW Rose WSW Rose WNW Rose NNW Wind Run (miles) Avg Soil Temp (F)
0 83 Santa Rosa North Coast Valleys 1/1/1990 1 0.00 0.30 42.0 7.7 48.7 ... 0.03 0.02 0.06 0.06 0.04 0.10 0.15 0.14 32.1 46.3
1 83 Santa Rosa North Coast Valleys 1/2/1990 2 0.06 0.00 236.0 6.1 58.4 ... 0.06 0.01 0.04 0.03 0.03 0.06 0.45 1.53 121.0 44.9
2 83 Santa Rosa North Coast Valleys 1/3/1990 3 0.04 0.00 225.0 5.8 63.0 ... 0.08 0.04 0.10 0.04 0.06 0.05 0.21 0.23 44.5 43.3
3 83 Santa Rosa North Coast Valleys 1/4/1990 4 0.03 0.01 190.0 6.3 56.7 ... 0.07 0.07 0.11 0.07 0.03 0.03 0.08 0.09 30.2 42.8
4 83 Santa Rosa North Coast Valleys 1/5/1990 5 0.04 0.00 192.0 7.2 59.1 ... 0.08 0.13 0.33 0.08 0.04 0.08 0.07 0.05 45.2 43.7
... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ... ...
11948 83 Santa Rosa North Coast Valleys 9/18/2022 261 0.03 0.77 113.0 16.4 66.8 ... 0.01 0.04 0.29 0.51 0.14 0.01 0.00 0.00 197.4 65.2
11949 83 Santa Rosa North Coast Valleys 9/19/2022 262 0.14 0.02 424.0 15.6 73.9 ... 0.09 0.07 0.08 0.17 0.09 0.28 0.15 0.07 108.9 65.9
11950 83 Santa Rosa North Coast Valleys 9/20/2022 263 0.10 0.00 330.0 13.9 72.6 ... 0.08 0.04 0.10 0.11 0.18 0.25 0.15 0.10 79.6 65.8
11951 83 Santa Rosa North Coast Valleys 9/21/2022 264 0.15 0.07 458.0 14.8 75.5 ... 0.09 0.05 0.13 0.16 0.18 0.22 0.08 0.09 103.3 66.7
11952 83 Santa Rosa North Coast Valleys 9/22/2022 265 0.17 0.00 494.0 13.1 82.9 ... 0.14 0.06 0.09 0.10 0.11 0.20 0.12 0.19 83.7 66.3

11953 rows × 27 columns

[28]:
df_mean_clim = df_cimis.groupby(by=["Jul"], as_index=False)[["ETo (in)", "Precip (in)"]].mean()
df_mean_clim = df_mean_clim.rename(columns={"Jul": "day_of_year"})
df_mean_clim.head()
[28]:
day_of_year ETo (in) Precip (in)
0 1 0.031212 0.335455
1 2 0.031212 0.155152
2 3 0.030303 0.195484
3 4 0.030909 0.190303
4 5 0.034242 0.150000

Joining two DataFrames with an inner join

Inner joins can be made in pandas with the merge() function. An inner join joins two dataframes on a common key or values; when a key or value exists in one dataframe but not the other, that row of data is excluded from the joined dataframe.

There are a number of other ways to join dataframes in pandas. Detailed examples and discussion of the different merging methods can be found here
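
For reference, merge() also takes a how argument that controls the join type; a sketch using the dataframes built above:

# how="inner" is the default; "outer", "left", and "right" keep unmatched rows
df_outer = pd.merge(df_mean_day, df_mean_clim, on="day_of_year", how="outer")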

[29]:
df_merged = pd.merge(df_mean_day, df_mean_clim, on=["day_of_year",])
df_merged.head()
[29]:
day_of_year russian_r_af austin_c_af ETo (in) Precip (in)
0 1 11169.447129 982.247985 0.031212 0.335455
1 2 10601.346651 648.518661 0.031212 0.155152
2 3 8662.758966 777.478369 0.030303 0.195484
3 4 9604.154280 1831.494379 0.030909 0.190303
4 5 11811.951556 1508.267191 0.034242 0.150000
[30]:
cols = [i for i in list(df_merged) if i != "day_of_year"]
ax = df_merged[cols].plot()
ax.set_yscale("log")

This starts to make a little more sense! We can see that this area gets most of its precipitation in the winter and early spring. The evapotranspiration curve shows what we’d expect: there is more evapotranspiration in the summer than in the winter.

Is there anything else we can look at that might give us more insights into this basin? Let’s try looking at long term, yearly trends to see what’s there.

[31]:
year = []
for dstr in df_cimis.Date.values:
    mo, d, yr = [int(i) for i in dstr.split("/")]
    if mo < 10:
        year.append(yr - 1)
    else:
        year.append(yr)
df_cimis["year"] = year
df_cimis.head()
[31]:
Stn Id Stn Name CIMIS Region Date Jul ETo (in) Precip (in) Sol Rad (Ly/day) Avg Vap Pres (mBars) Max Air Temp (F) ... Rose ENE Rose ESE Rose SSE Rose SSW Rose WSW Rose WNW Rose NNW Wind Run (miles) Avg Soil Temp (F) year
0 83 Santa Rosa North Coast Valleys 1/1/1990 1 0.00 0.30 42.0 7.7 48.7 ... 0.02 0.06 0.06 0.04 0.10 0.15 0.14 32.1 46.3 1989
1 83 Santa Rosa North Coast Valleys 1/2/1990 2 0.06 0.00 236.0 6.1 58.4 ... 0.01 0.04 0.03 0.03 0.06 0.45 1.53 121.0 44.9 1989
2 83 Santa Rosa North Coast Valleys 1/3/1990 3 0.04 0.00 225.0 5.8 63.0 ... 0.04 0.10 0.04 0.06 0.05 0.21 0.23 44.5 43.3 1989
3 83 Santa Rosa North Coast Valleys 1/4/1990 4 0.03 0.01 190.0 6.3 56.7 ... 0.07 0.11 0.07 0.03 0.03 0.08 0.09 30.2 42.8 1989
4 83 Santa Rosa North Coast Valleys 1/5/1990 5 0.04 0.00 192.0 7.2 59.1 ... 0.13 0.33 0.08 0.04 0.08 0.07 0.05 45.2 43.7 1989

5 rows × 28 columns
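
As an aside, the loop above can be vectorized with the pandas datetime accessors; a sketch that should produce the same water-year column:

# parse the date strings, then compute the water year without a loop
dates = pd.to_datetime(df_cimis["Date"], format="%m/%d/%Y")
df_cimis["year"] = np.where(dates.dt.month < 10, dates.dt.year - 1, dates.dt.year)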

[32]:
df2.head()
[32]:
site_no russian_r austin_c year russian_r_cfd russian_r_af austin_c_cfd austin_c_af day_of_year
datetime
1939-10-01 00:00:00+00:00 185.0 NaN 1939 15984000.0 366.942991 NaN NaN 274
1939-10-02 00:00:00+00:00 185.0 NaN 1939 15984000.0 366.942991 NaN NaN 275
1939-10-03 00:00:00+00:00 185.0 NaN 1939 15984000.0 366.942991 NaN NaN 276
1939-10-04 00:00:00+00:00 185.0 NaN 1939 15984000.0 366.942991 NaN NaN 277
1939-10-05 00:00:00+00:00 185.0 NaN 1939 15984000.0 366.942991 NaN NaN 278

.aggregate()

The pandas .aggregate() method can be used with .groupby() to perform multiple statistical operations at once.

Example usage could be:

agdf = df2.groupby(by=["day_of_year"], as_index=False)["austin_c_af"].aggregate(["min", "max", "mean", "std"])

This returns a dataframe grouped by day of year with the min, max, mean, and standard deviation of the data.
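
Newer versions of pandas also support named aggregation, which lets you name the output columns directly; a sketch:

agdf = df2.groupby("day_of_year").agg(
    min_af=("austin_c_af", "min"),
    mean_af=("austin_c_af", "mean"),
    max_af=("austin_c_af", "max"),
)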

Class exercise 3: groupby, aggregate, and merge

For this exercise we need to produce yearly mean and standard deviation flows for the russian_r_af variable and merge the data with yearly mean precipitation and evapotranspiration

Hints:
- df2 and df_cimis are the dataframes these operations should be performed on
- use .groupby() to group by the year; do Precip and ETo in separate groupby routines
- rename the columns in the ETo aggregated dataframe and the Precip dataframe to mean_et, std_et, mean_precip, and std_precip
- "mean" and "std" (or np.mean and np.std) can be used to calculate the mean and std flows

Name your final joined dataframe df_yearly

[33]:
dfg0 = df2.groupby(by=["year"], as_index=False)["russian_r_af"].aggregate(['mean', 'std'])
dfg1 = df_cimis.groupby(by=["year"], as_index=False)["ETo (in)"].aggregate(['mean', 'std'])
dfg2 = df_cimis.groupby(by=["year"], as_index=False)["Precip (in)"].aggregate(['mean', 'std'])

dfm1 = pd.merge(dfg0, dfg1, on=["year"], suffixes=(None, "_et"))
df_yearly = pd.merge(dfm1, dfg2, on=["year"], suffixes=(None, "_precip"))

df_yearly
[33]:
year mean std mean_et std_et mean_precip std_precip
0 1989 1388.134060 2703.398315 0.160513 0.080539 0.067473 0.284019
1 1990 2105.130611 7210.539138 0.113671 0.067419 0.081642 0.353434
2 1991 2180.262423 5581.173315 0.117268 0.067521 0.086940 0.233645
3 1992 5746.248446 11726.314669 0.122712 0.077676 0.128329 0.312310
4 1993 1528.091360 3052.871464 0.121781 0.073195 0.060168 0.225893
5 1994 9581.534959 22426.346294 0.113178 0.076706 0.151479 0.433991
6 1995 5575.848053 10440.114353 0.122814 0.078691 0.112596 0.416613
7 1996 5600.840663 16501.510310 0.129699 0.077105 0.102959 0.348520
8 1997 8587.944090 16139.648496 0.111589 0.075997 0.153068 0.383740
9 1998 4271.672888 7810.405143 0.123479 0.077184 0.086986 0.287404
10 1999 3510.811474 7712.677405 0.120546 0.074941 0.073425 0.254732
11 2000 1736.410643 4474.354276 0.127562 0.077590 0.058658 0.180318
12 2001 3625.296220 7794.142971 0.124712 0.077907 0.111044 0.355845
13 2002 5257.492880 10816.766181 0.123096 0.074760 0.103205 0.394435
14 2003 4723.322590 11718.629659 0.130546 0.078522 0.081995 0.327512
15 2004 3989.109667 6020.134120 0.117068 0.073038 0.095945 0.292390
16 2005 9004.490274 17931.497470 0.116658 0.073711 0.123123 0.384322
17 2006 1775.969842 4655.125526 0.123890 0.073490 0.056947 0.204511
18 2007 2696.274988 7169.983326 0.127377 0.076249 0.066066 0.270427
19 2008 1591.079482 4437.148385 0.119425 0.071221 0.052795 0.195192
20 2009 3869.030053 8304.931111 0.108219 0.070222 0.096411 0.308841
21 2010 5179.675425 9723.981660 0.114192 0.073108 0.101616 0.276342
22 2011 1877.259966 4587.983835 0.123962 0.071181 0.061233 0.278723
23 2012 2669.014799 7632.740020 0.122164 0.071272 0.066904 0.263863
24 2013 1062.237059 2976.193603 0.128849 0.068012 0.050740 0.297342
25 2014 2086.730478 6661.336955 0.128274 0.071913 0.061507 0.270112
26 2015 3395.358018 7587.288287 0.106011 0.059232 0.065495 0.233080
27 2016 8686.857030 16763.980172 0.119699 0.077931 0.108027 0.329238
28 2017 1443.417058 2746.964668 0.128055 0.075172 0.061616 0.272836
29 2018 6129.493976 14070.439880 0.130301 0.079187 0.115014 0.385471
30 2019 928.754876 1323.893808 0.137896 0.078000 0.039016 0.148017
31 2020 410.464250 515.126426 0.140959 0.073783 0.032027 0.126727
32 2021 1435.621183 3601.895852 0.134090 0.077279 0.078627 0.411521

Let’s examine the long term flow record for the Russian River

[34]:
df_y_flow = df2.groupby(by=["year"], as_index=False)["russian_r_af"].aggregate(['mean', 'std'])

fig = plt.figure(figsize=(12,4))

lower_ci = df_y_flow["mean"] - 2 * df_y_flow['std']
lower_ci = np.where(lower_ci < 0, 0, lower_ci)
upper_ci = df_y_flow["mean"] + 2 * df_y_flow['std']
ax = df_y_flow["mean"].plot(style="b.-")
ax.fill_between(df_y_flow.index, lower_ci, upper_ci, color="b", alpha=0.5);

From this plot it looks like the long term trend in yearly discharge in the Russian River has been decreasing. We will revisit and test this later.

First let’s see if there are any relationships between yearly discharge and climate

[35]:
dfg0 = df2.groupby(by=["year"], as_index=False)["russian_r_af"].aggregate(['mean', 'std', 'sum'])
dfg1 = df_cimis.groupby(by=["year"], as_index=False)["ETo (in)"].aggregate(['mean', 'std', 'sum'])
dfg2 = df_cimis.groupby(by=["year"], as_index=False)["Precip (in)"].aggregate(['mean', 'std', 'sum'])
dfm1 = pd.merge(dfg0, dfg1, on=["year"], suffixes=(None, "_et"))
df_yearly = pd.merge(dfm1, dfg2, on=["year"], suffixes=(None, "_precip"))

fig = plt.figure(figsize=(12,4))

lower_ci = df_yearly["mean"] - 2 * df_yearly['std']
lower_ci = np.where(lower_ci < 0, 0, lower_ci)
upper_ci = df_yearly["mean"] + 2 * df_yearly['std']
ax = df_yearly["mean"].plot(style="b.-", label="flow_af")
ax.fill_between(df_yearly.index, lower_ci, upper_ci, color="b", alpha=0.5)
ax2 = ax.twinx()
ax2.plot(df_yearly.index, df_yearly.sum_precip, "k--", lw=1.5, label="Precip")
ax2.plot(df_yearly.index, df_yearly.sum_et, "r.--", lw=1.5, label="ET")
plt.legend()
plt.show();

As expected, precipitation amount is the main driver of the yearly discharge regime here.

Baseflow separation

Baseflow separation is a method to separate the quick-response hydrograph (storm runoff) from the long term flow. We’re going to go back to the complete daily dataset to perform this operation. The following cell contains a low pass filter method used for digital baseflow separation. We’ll use it in our analysis.

[36]:
def _baseflow_low_pass_filter(arr, beta, enforce):
    """
    Private method to apply digital baseflow separation filter
    (Lyne & Hollick, 1979; Nathan & Mcmahon, 1990;
    Boughton, 1993; Chapman & Maxwell, 1996).

    This method should not be called by the user!

    Method removes "spikes" which would be consistent with storm flow
    events from data.

    Parameters
    ----------
    arr : np.array
        streamflow or municipal pumping time series

    beta : float
        baseflow filtering parameter that ranges from 0 - 1
        values in 0.8 - 0.95 range used in literature for
        streamflow baseflow separation

    enforce : bool
        enforce physical constraint of baseflow less than measured flow

    Returns
    -------
        np.array of filtered data
    """
    # prepend 10 records to data for initial spin up
    # these records will be dropped before returning data to user
    qt = np.zeros((arr.size + 10,), dtype=float)
    qt[0:10] = arr[0:10]
    qt[10:] = arr[:]

    qdt = np.zeros((arr.size + 10,), dtype=float)
    qbf = np.zeros((arr.size + 10,), dtype=float)

    y = (1. + beta) / 2.

    for ix in range(qdt.size):
        if ix == 0:
            qbf[ix] = qt[ix]
            continue

        x = beta * qdt[ix - 1]
        z = qt[ix] - qt[ix - 1]
        qdt[ix] = x + (y * z)

        qb = qt[ix] - qdt[ix]
        if enforce:
            if qb > qt[ix]:
                qbf[ix] = qt[ix]
            else:
                qbf[ix] = qb

        else:
            qbf[ix] = qb

    return qbf[10:]


def baseflow_low_pass_filter(arr, beta=0.9, T=1, enforce=True):
    """
    User method to apply digital baseflow separation filter
    (Lyne & Hollick, 1979; Nathan & Mcmahon, 1990;
    Boughton, 1993; Chapman & Maxwell, 1996).

    Method removes "spikes" which would be consistent with storm flow
    events from data.

    Parameters
    ----------
    arr : np.array
        streamflow or municipal pumping time series

    beta : float
        baseflow filtering parameter that ranges from 0 - 1
        values in 0.8 - 0.95 range used in literature for
        streamflow baseflow separation

    T : int
        number of recursive filtering passes to apply to the data

    enforce : bool
        enforce physical constraint of baseflow less than measured flow

    Returns
    -------
        np.array of baseflow
    """
    for _ in range(T):
        arr = _baseflow_low_pass_filter(arr, beta, enforce)

    return arr

Class exercise 4: baseflow separation

Using the full daily discharge dataframe, df2, get the baseflows for the Russian River (in acre-ft) and add them to the dataframe as a new column named russian_bf_af.

The function baseflow_low_pass_filter() will be used to perform the baseflow separation. This function takes a numpy array of stream discharge and runs a digital low pass filter to calculate baseflow.

After performing baseflow separation, plot both the baseflow and the total discharge for the years 2015 - 2017.

Hints:
- make sure to use the .values property when you get discharge data from the pandas dataframe
- feel free to play with the input parameters beta and T
- beta is commonly between 0.8 - 0.95 for hydrologic problems
- T is the number of filter passes over the data, so increasing T will create a smoother profile. I recommend starting with a T value of around 5 and adjusting it to see how it changes the baseflow profile.

[37]:
baseflow = baseflow_low_pass_filter(df2["russian_r_af"].values, beta=0.90, T=5)
df2["russian_bf_af"] = baseflow

tdf = df2[(df2.year >= 2015) & (df2.year <= 2017)]
ax = tdf[["russian_r_af", "russian_bf_af"]].plot(figsize=(15, 10))
ax.set_yscale("log");

Back to the basics

Now that we’ve done some analysis on real data, let’s go back to the basics and briefly discuss how to adjust data values in pandas

Adjusting a whole column of data values

Because pandas series objects (columns) are built on top of numpy, we can perform mathematical operations in place on a whole column, similarly to numpy arrays

[41]:
df3 = df2.copy()
# Adding values
df3["russian_r"] += 10000

# Subtracting values
df3["russian_r"] -= 10000

# multiplication
df3["russian_r"] *= 5

# division
df3["russian_r"] /= 5

# more complex operations
df3["russian_r"] = np.log10(df3["russian_r"].values)
df3.head()
[41]:
site_no russian_r austin_c year russian_r_cfd russian_r_af austin_c_cfd austin_c_af day_of_year russian_bf_af baseflow_trend
datetime
1939-10-01 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 274 366.942991 NaN
1939-10-02 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 275 366.942991 NaN
1939-10-03 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 276 366.942991 NaN
1939-10-04 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 277 366.942991 NaN
1939-10-05 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 278 366.942991 NaN

Updating values by position

We can update single or multiple values by their position in the dataframe using .iloc

[42]:
df3.iloc[0, 1] = 999
df3.head()
[42]:
site_no russian_r austin_c year russian_r_cfd russian_r_af austin_c_cfd austin_c_af day_of_year russian_bf_af baseflow_trend
datetime
1939-10-01 00:00:00+00:00 2.267172 999.0 1939 15984000.0 366.942991 NaN NaN 274 366.942991 NaN
1939-10-02 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 275 366.942991 NaN
1939-10-03 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 276 366.942991 NaN
1939-10-04 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 277 366.942991 NaN
1939-10-05 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 278 366.942991 NaN

Updating values based on location

We can update values in the dataframe based on their index and column headers too using .loc

[43]:
df3.loc[df3.index[0], 'russian_r'] *= 100
df3.head()
[43]:
site_no russian_r austin_c year russian_r_cfd russian_r_af austin_c_cfd austin_c_af day_of_year russian_bf_af baseflow_trend
datetime
1939-10-01 00:00:00+00:00 226.717173 999.0 1939 15984000.0 366.942991 NaN NaN 274 366.942991 NaN
1939-10-02 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 275 366.942991 NaN
1939-10-03 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 276 366.942991 NaN
1939-10-04 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 277 366.942991 NaN
1939-10-05 00:00:00+00:00 2.267172 NaN 1939 15984000.0 366.942991 NaN NaN 278 366.942991 NaN

Class exercise 5: setting data

The Russian River discharge in cubic feet per day is highly variable and represented by large numbers. To make this data easier to read and plot, update it to a log10 representation.

Then use a location-based method to change the value in austin_c on 10-4-1939 to 0.

Finally display the first 5 records in the notebook.

[44]:
df3["russian_r_cfd"] = np.log10(df3["russian_r_cfd"])
df3.loc[df3.index[3], "austin_c"] = 0
df3.head()
[44]:
site_no russian_r austin_c year russian_r_cfd russian_r_af austin_c_cfd austin_c_af day_of_year russian_bf_af baseflow_trend
datetime
1939-10-01 00:00:00+00:00 226.717173 999.0 1939 7.203685 366.942991 NaN NaN 274 366.942991 NaN
1939-10-02 00:00:00+00:00 2.267172 NaN 1939 7.203685 366.942991 NaN NaN 275 366.942991 NaN
1939-10-03 00:00:00+00:00 2.267172 NaN 1939 7.203685 366.942991 NaN NaN 276 366.942991 NaN
1939-10-04 00:00:00+00:00 2.267172 0.0 1939 7.203685 366.942991 NaN NaN 277 366.942991 NaN
1939-10-05 00:00:00+00:00 2.267172 NaN 1939 7.203685 366.942991 NaN NaN 278 366.942991 NaN

Now for some powerful analysis: using Fourier analysis to extract the data’s signal.

Some background for your evening viewing: https://youtu.be/spUNpyF58BY

We have to detrend the data for fast Fourier transforms to work properly. Here’s a discussion on why:

https://groups.google.com/forum/#!topic/comp.dsp/kYDZqetr_TU

Fortunately we can easily do this in python using scipy!

https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.detrend.html

Let’s create a new dataframe with only the data we’re interested in analysing.

[45]:
df_data = df2[["russian_r_af", "year"]]
df_data = df_data.rename(columns={"russian_r_af": "Q"})
df_data.describe()
[45]:
site_no Q year
count 30773.000000 30773.000000
mean 4255.042969 1980.626036
std 11328.390556 24.322069
min 1.487607 1939.000000
25% 345.124759 1960.000000
50% 672.398238 1981.000000
75% 2657.857341 2002.000000
max 193785.568837 2023.000000

But first, let’s also set the water year by date on this dataframe and drop the single NaN value from the dataframe

[46]:
# shift the dates forward by 92 days (Oct + Nov + Dec) so that October-
# December dates land in the next calendar year, then take .year to
# label each record with its water year
df_data['water_year'] = df_data.index.shift(30 + 31 + 31, freq='d').year

# drop the single NaN record, then detrend the discharge signal
df_data.dropna(inplace=True)
df_data["detrended"] = detrend(df_data.Q.values)
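
For intuition, detrend() (with the default type="linear") simply subtracts the least-squares linear fit from the signal; a rough hand-rolled equivalent:

# assumed-equivalent manual detrend using a linear fit
t = np.arange(len(df_data))
slope, intercept = np.polyfit(t, df_data.Q.values, 1)
manual_detrended = df_data.Q.values - (slope * t + intercept)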
[47]:
fig = plt.figure(figsize=(12,6))
ax = plt.subplot(3,1,1)
plt.plot(df_data.detrended)
plt.title('detrended signal')

plt.subplot(3,1,2)
plt.plot(df_data.index, df_data.Q)
plt.title('Raw Signal')

plt.subplot(3,1,3)
plt.plot(df_data.index, df_data.Q - df_data.detrended)
plt.title('Difference')

plt.tight_layout()

Evaluate and plot the Period Spectrum to see timing of recurrence

Here we’ve created a function that performs a fast Fourier transform and then plots the spectrum for signals of various lengths.

[48]:
def fft_and_plot(df, plot_dominant_periods=4):
    N = len(df)
    yf = np.fft.fft(df.detrended)
    yf = np.abs(yf[:int(N / 2)])

    # get the right frequency
    # https://docs.scipy.org/doc/numpy/reference/generated/numpy.fft.fftfreq.html#numpy.fft.fftfreq
    d = 1. # day
    f = np.fft.fftfreq(N, d)[:int(N/2)]
    f[0] = .00001
    per = 1./f # days

    fig = plt.figure(figsize=(12,6))
    ax = plt.subplot(2,1,1)
    plt.plot(per, yf)
    plt.xscale('log')

    top=np.argsort(yf)[-plot_dominant_periods:]
    j=(10-plot_dominant_periods)/10
    for i in top:
        plt.plot([per[i], per[i]], [0,np.max(yf)], 'r:')
        plt.text(per[i], j*np.max(yf), f"{per[i] :.2f}")
        j+=0.1

    plt.title('Period Spectrum')
    plt.grid()
    ax.set_xlabel('Period (days)')
    plt.xlim([1, 1e4])

    plt.subplot(2,1,2)
    plt.plot(df.index,df.Q)
    plt.title('Raw Signal')
    plt.tight_layout()

Now let’s look at the whole signal

[49]:
fft_and_plot(df_data)

We see a strong annual frequency that corresponds to spring snowmelt from the surrounding mountain ranges. The second strongest frequency is biannual, and is likely due to the onset of fall and winter rains.

Okay, what about the years before dams were installed on the Russian River?

Let’s get in the wayback machine and look at only the years prior to 1954!

[50]:
fft_and_plot(df_data[df_data.index.year < 1954])

The biannual and annual signals have a much wider spread, which suggests that peak runoff was a little more variable in its timing compared to the entire data record.

What if we focus in on after 1954?

[51]:
fft_and_plot(df_data[(df_data.index.year > 1954)])

It looks like flows are more controlled, with tighter periods.

So post dam construction, flows are much more controlled with regular release schedules.