03: Solutions to Useful standard library modules exercises

[1]:
import os
from pathlib import Path
import shutil
import subprocess
import sys
import zipfile

Exercise: Make a script with a command line argument using sys.argv

  1. Using a text editor such as VSCode, make a new *.py file with the following contents:

import sys

if len(sys.argv) > 1:
    for argument in sys.argv[1:]:
        print(argument)
else:
    print("usage is: python <script name>.py argument")
    quit()
  2. Try running the script at the command line

[2]:
write_text = (
    'import sys\n\n'
    'if len(sys.argv) > 1:\n'
    '    for argument in sys.argv[1:]:\n'
    '        print(argument)\n'
    'else:\n'
    '    print("usage is: python <script name>.py argument")\n'
    '    quit()\n'
)

with open('myscript.py', 'w') as dest:
    dest.write(write_text)
[3]:
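# capture_output=True is needed for result.stdout to hold the script's output
# sys.executable runs the script with the same Python as this notebook
result = subprocess.run([sys.executable, 'myscript.py'],
                        capture_output=True, text=True, check=True)
print(result.stdout, end='')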
usage is: python <script name>.py argument
[4]:
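# capture_output=True is needed for result.stdout to hold the script's output
result = subprocess.run([sys.executable, 'myscript.py', 'arg1', 'arg2'],
                        capture_output=True, text=True, check=True)
print(result.stdout, end='')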
arg1
arg2
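
A script like this can be easier to test if the argument handling lives in a function that takes the argument list as a parameter. The sketch below is one possible arrangement, not part of the original exercise: the function name `main` and the choice to return an exit code (instead of calling quit()) are assumptions made for illustration.

```python
def main(argv):
    """Print each command line argument on its own line.

    argv follows the sys.argv convention: argv[0] is the script name.
    Returns 0 if arguments were supplied and 1 otherwise.
    """
    if len(argv) > 1:
        for argument in argv[1:]:
            print(argument)
        return 0
    print("usage is: python <script name>.py argument")
    return 1


# call the function directly with a hand-made argument list
status = main(["myscript.py", "arg1", "arg2"])
```

In a real script, the last line would typically be `sys.exit(main(sys.argv))` (after `import sys`), so that the return value becomes the exit code of the process.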

Testing Your Skills with a truly awful example:

the problem:

Pretend that the file ../data/netcdf_data.zip contains some climate data (in the NetCDF format, with the *.nc extension) that we downloaded. If you open the archive, you’ll see that a subfolder named zipped holds a set of additional subfolders, one for each year. Each of those subfolders holds another zip file, and each of those zip files holds yet another subfolder, inside of which is the actual data file we want (prcp.nc).

[5]:
with zipfile.ZipFile('../data/netcdf_data.zip') as src:
    for f in src.namelist()[:10]:
        print(f)
netcdf_data/
netcdf_data/zipped/
netcdf_data/zipped/zipped_1991/
netcdf_data/zipped/zipped_1991/12270_1991.zip
netcdf_data/zipped/zipped_1996/
netcdf_data/zipped/zipped_1996/12270_1996.zip
netcdf_data/zipped/zipped_1998/
netcdf_data/zipped/zipped_1998/12270_1998.zip
netcdf_data/zipped/zipped_1999/
netcdf_data/zipped/zipped_1999/12270_1999.zip
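
To get a feel for this nesting without touching the real data, it may help to build a tiny archive with the same layout and peek inside the inner zip files. The sketch below does this entirely in memory with io.BytesIO; the two years and the placeholder file contents are made up for illustration.

```python
import io
import zipfile

# build a tiny in-memory archive with the same nesting (contents are fake)
outer_bytes = io.BytesIO()
with zipfile.ZipFile(outer_bytes, "w") as outer:
    for year in (1980, 1981):
        inner_bytes = io.BytesIO()
        with zipfile.ZipFile(inner_bytes, "w") as inner:
            inner.writestr(f"12270_{year}/prcp.nc", b"placeholder")
        outer.writestr(
            f"netcdf_data/zipped/zipped_{year}/12270_{year}.zip",
            inner_bytes.getvalue(),
        )

# walk the outer archive and open each inner zip file from memory
inner_members = {}
with zipfile.ZipFile(outer_bytes) as outer:
    for name in outer.namelist():
        if name.endswith(".zip"):
            with zipfile.ZipFile(io.BytesIO(outer.read(name))) as inner:
                inner_members[name] = inner.namelist()

for name, members in inner_members.items():
    print(name, "contains", members)
```

Reading an inner zip file into memory like this is fine for small files; for large archives, extracting to disk as in the solution below is probably the safer choice.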

the goal:

To extract all of these prcp.nc files into a single folder, after renaming them with their respective years (obtained from their enclosing folders or zip files). e.g.

prcp_1980.nc
prcp_1981.nc
...

This will allow us to open them together as a dataset in xarray (more on that later). Does this sound awful? I’m not making this up: this is the kind of structure you get when downloading tiles of climate data with the Daymet Tile Selection Tool.

hint:

you might find these functions helpful:

ZipFile.extractall
ZipFile.extract
Path.glob
Path.mkdir
Path.stem
Path.parent
Path.name
shutil.move
Path.rmdir

or, using os instead of pathlib:

os.path.isdir
os.makedirs
os.path.splitext
os.path.join
os.rename
os.rmdir
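
As a quick orientation, the sketch below pulls the same pieces out of a path with both families of functions. The member name is copied from the kind of entry found inside the yearly zip files. PurePosixPath is used here only so the example behaves the same on every operating system; for this purpose, Path offers the same attributes.

```python
import os
from pathlib import PurePosixPath

# an example member name like those inside the yearly zip files
member = "12270_1991/prcp.nc"

# pathlib: attributes on a path object
p = PurePosixPath(member)
stem = p.stem                 # file name without the extension
folder = p.parent.name        # name of the enclosing folder
filename = p.name             # file name with the extension

# os.path: functions that operate on strings
head, tail = os.path.split(member)
root, ext = os.path.splitext(tail)

print(stem, folder, filename)
print(head, tail, root, ext)
```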

hint: start by using ZipFile.extractall() to extract all of the individual zip files from the main zip archive

This extracts the entire contents of the zip file to a designated folder

[6]:
output_folder = Path('../03-output')
output_folder.mkdir(exist_ok=True)

with zipfile.ZipFile('../data/netcdf_data.zip') as src:
    src.extractall(output_folder)

Make a list of the zipfiles

[7]:
zipfiles = list(output_folder.glob('netcdf_data/zipped/*/*.zip'))
zipfiles[:5]
[7]:
[PosixPath('../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip'),
 PosixPath('../03-output/netcdf_data/zipped/zipped_1996/12270_1996.zip'),
 PosixPath('../03-output/netcdf_data/zipped/zipped_1998/12270_1998.zip'),
 PosixPath('../03-output/netcdf_data/zipped/zipped_1999/12270_1999.zip'),
 PosixPath('../03-output/netcdf_data/zipped/zipped_1997/12270_1997.zip')]
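
Note that the years in this listing are not in order: Path.glob makes no promise about the order of its results. If the order matters, sorting the results is a simple remedy. The sketch below shows this with a few empty stand-in files in a temporary folder; the folder layout mimics the one above, but the files themselves are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # create a few empty stand-in files in the same folder layout
    for year in (1996, 1991, 1998):
        folder = root / "zipped" / f"zipped_{year}"
        folder.mkdir(parents=True)
        (folder / f"12270_{year}.zip").touch()

    # glob makes no promise about order, so sort explicitly
    found = sorted(root.glob("zipped/*/*.zip"))
    years = [path.stem.split("_")[1] for path in found]

print(years)
```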

Part 1: extract with a single file

[8]:
f = zipfiles[0]
f
[8]:
PosixPath('../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip')

1a) Use ZipFile.namelist() (as above) to list the contents

This will yield the name of the *.nc file that we need to extract

[9]:
with zipfile.ZipFile(f) as src:
    nc_file = src.namelist()[0]
print(nc_file)
12270_1991/prcp.nc

1b) Use ZipFile.extract() to extract the *.nc file to the destination folder

(you may need to create the destination folder first)

[10]:
with zipfile.ZipFile(f) as src:
    src.extract(nc_file, output_folder)

1c) Move the extracted file out of any enclosing subfolders, and rename to prcp_<year>.nc

(so that if we repeat this for subsequent files, the extracted *.nc files will end up in the same place)

[11]:
# make a path for the extracted file
extracted_path = output_folder / nc_file
extracted_path
[11]:
PosixPath('../03-output/12270_1991/prcp.nc')
[12]:
# make a path for the new file
nc_file = Path(nc_file)
variable = nc_file.stem
year = nc_file.parent.name.split('_')[1]
new_file = output_folder / f"{variable}_{year}.nc"
new_file
[12]:
PosixPath('../03-output/prcp_1991.nc')
[13]:
# do the move
shutil.move(extracted_path, new_file)
[13]:
PosixPath('../03-output/prcp_1991.nc')
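
The renaming logic relies on the enclosing folder being named <tile>_<year>. If that assumption might not hold for every file, one option is to wrap the logic in a small function that checks it. The sketch below is a hypothetical helper (the name `renamed_file` and the four-digit-year check are assumptions, not part of the original solution).

```python
from pathlib import PurePosixPath


def renamed_file(member):
    """Return the new file name for a zip member like '12270_1991/prcp.nc'."""
    member = PurePosixPath(member)
    variable = member.stem
    # the enclosing folder is expected to be named <tile>_<year>
    tile, _, year = member.parent.name.rpartition("_")
    if not (tile and year.isdigit() and len(year) == 4):
        raise ValueError(f"cannot find a 4-digit year in {member}")
    return f"{variable}_{year}{member.suffix}"


new_name = renamed_file("12270_1991/prcp.nc")
print(new_name)
```

A member name without a year in its folder name raises a ValueError instead of silently producing a badly named file.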

1d) Remove the extra subfolders that were extracted

[14]:
extracted_path.parent.rmdir()
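
Path.rmdir only removes empty folders, which is why the file has to be moved out first. The sketch below demonstrates this behavior in a temporary folder with an empty stand-in file; the folder and file names simply mirror the ones above.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp) / "12270_1991"
    folder.mkdir()
    leftover = folder / "prcp.nc"
    leftover.touch()

    # rmdir refuses to remove a folder that still has contents
    try:
        folder.rmdir()
        removed_while_full = True
    except OSError:
        removed_while_full = False

    # once the file is gone, the same call succeeds
    leftover.unlink()
    folder.rmdir()
    removed_when_empty = not folder.exists()

print(removed_while_full, removed_when_empty)
```

This is a useful safety feature: if a zip file unexpectedly contained more than one file, rmdir would raise an error rather than delete the extra data.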

Part 2: put the above steps together into a loop to repeat the workflow for all of the NetCDF files

[15]:
for f in zipfiles:
    with zipfile.ZipFile(f) as src:

        # get the NetCDF file
        nc_file = src.namelist()[0]

        # extract it to the output folder
        src.extract(nc_file, output_folder)

        # make a path for the extracted file
        extracted_path = output_folder / nc_file

        # make a path for the new file
        nc_file = Path(nc_file)
        variable = nc_file.stem
        year = nc_file.parent.name.split('_')[1]
        new_file = output_folder / f"{variable}_{year}.nc"

        # move the extracted NetCDF file to the dest. location
        shutil.move(extracted_path, new_file)

        # remove the subfolders that were extracted
        extracted_path.parent.rmdir()

        print(f"{f}/{nc_file} --> {new_file}")
../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip/12270_1991/prcp.nc --> ../03-output/prcp_1991.nc
../03-output/netcdf_data/zipped/zipped_1996/12270_1996.zip/12270_1996/prcp.nc --> ../03-output/prcp_1996.nc
../03-output/netcdf_data/zipped/zipped_1998/12270_1998.zip/12270_1998/prcp.nc --> ../03-output/prcp_1998.nc
../03-output/netcdf_data/zipped/zipped_1999/12270_1999.zip/12270_1999/prcp.nc --> ../03-output/prcp_1999.nc
../03-output/netcdf_data/zipped/zipped_1997/12270_1997.zip/12270_1997/prcp.nc --> ../03-output/prcp_1997.nc
../03-output/netcdf_data/zipped/zipped_1990/12270_1990.zip/12270_1990/prcp.nc --> ../03-output/prcp_1990.nc
../03-output/netcdf_data/zipped/zipped_2003/12270_2003.zip/12270_2003/prcp.nc --> ../03-output/prcp_2003.nc
../03-output/netcdf_data/zipped/zipped_2004/12270_2004.zip/12270_2004/prcp.nc --> ../03-output/prcp_2004.nc
../03-output/netcdf_data/zipped/zipped_2005/12270_2005.zip/12270_2005/prcp.nc --> ../03-output/prcp_2005.nc
../03-output/netcdf_data/zipped/zipped_2002/12270_2002.zip/12270_2002/prcp.nc --> ../03-output/prcp_2002.nc
../03-output/netcdf_data/zipped/zipped_2011/12270_2011.zip/12270_2011/prcp.nc --> ../03-output/prcp_2011.nc
../03-output/netcdf_data/zipped/zipped_2016/12270_2016.zip/12270_2016/prcp.nc --> ../03-output/prcp_2016.nc
../03-output/netcdf_data/zipped/zipped_2017/12270_2017.zip/12270_2017/prcp.nc --> ../03-output/prcp_2017.nc
../03-output/netcdf_data/zipped/zipped_2010/12270_2010.zip/12270_2010/prcp.nc --> ../03-output/prcp_2010.nc
../03-output/netcdf_data/zipped/zipped_1983/12270_1983.zip/12270_1983/prcp.nc --> ../03-output/prcp_1983.nc
../03-output/netcdf_data/zipped/zipped_1984/12270_1984.zip/12270_1984/prcp.nc --> ../03-output/prcp_1984.nc
../03-output/netcdf_data/zipped/zipped_1985/12270_1985.zip/12270_1985/prcp.nc --> ../03-output/prcp_1985.nc
../03-output/netcdf_data/zipped/zipped_1982/12270_1982.zip/12270_1982/prcp.nc --> ../03-output/prcp_1982.nc
../03-output/netcdf_data/zipped/zipped_1995/12270_1995.zip/12270_1995/prcp.nc --> ../03-output/prcp_1995.nc
../03-output/netcdf_data/zipped/zipped_1992/12270_1992.zip/12270_1992/prcp.nc --> ../03-output/prcp_1992.nc
../03-output/netcdf_data/zipped/zipped_1993/12270_1993.zip/12270_1993/prcp.nc --> ../03-output/prcp_1993.nc
../03-output/netcdf_data/zipped/zipped_1994/12270_1994.zip/12270_1994/prcp.nc --> ../03-output/prcp_1994.nc
../03-output/netcdf_data/zipped/zipped_2009/12270_2009.zip/12270_2009/prcp.nc --> ../03-output/prcp_2009.nc
../03-output/netcdf_data/zipped/zipped_2007/12270_2007.zip/12270_2007/prcp.nc --> ../03-output/prcp_2007.nc
../03-output/netcdf_data/zipped/zipped_2000/12270_2000.zip/12270_2000/prcp.nc --> ../03-output/prcp_2000.nc
../03-output/netcdf_data/zipped/zipped_2001/12270_2001.zip/12270_2001/prcp.nc --> ../03-output/prcp_2001.nc
../03-output/netcdf_data/zipped/zipped_2006/12270_2006.zip/12270_2006/prcp.nc --> ../03-output/prcp_2006.nc
../03-output/netcdf_data/zipped/zipped_2008/12270_2008.zip/12270_2008/prcp.nc --> ../03-output/prcp_2008.nc
../03-output/netcdf_data/zipped/zipped_2015/12270_2015.zip/12270_2015/prcp.nc --> ../03-output/prcp_2015.nc
../03-output/netcdf_data/zipped/zipped_2012/12270_2012.zip/12270_2012/prcp.nc --> ../03-output/prcp_2012.nc
../03-output/netcdf_data/zipped/zipped_2013/12270_2013.zip/12270_2013/prcp.nc --> ../03-output/prcp_2013.nc
../03-output/netcdf_data/zipped/zipped_2014/12270_2014.zip/12270_2014/prcp.nc --> ../03-output/prcp_2014.nc
../03-output/netcdf_data/zipped/zipped_1989/12270_1989.zip/12270_1989/prcp.nc --> ../03-output/prcp_1989.nc
../03-output/netcdf_data/zipped/zipped_1987/12270_1987.zip/12270_1987/prcp.nc --> ../03-output/prcp_1987.nc
../03-output/netcdf_data/zipped/zipped_1980/12270_1980.zip/12270_1980/prcp.nc --> ../03-output/prcp_1980.nc
../03-output/netcdf_data/zipped/zipped_1981/12270_1981.zip/12270_1981/prcp.nc --> ../03-output/prcp_1981.nc
../03-output/netcdf_data/zipped/zipped_1986/12270_1986.zip/12270_1986/prcp.nc --> ../03-output/prcp_1986.nc
../03-output/netcdf_data/zipped/zipped_1988/12270_1988.zip/12270_1988/prcp.nc --> ../03-output/prcp_1988.nc
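
The loop above extracts each file with its enclosing folder and then cleans up afterwards. A possible alternative, sketched below, is to open the member with ZipFile.open and copy its bytes straight to the renamed file, so no subfolder is created in the first place. The function name `extract_flat` is hypothetical, and the demonstration uses a tiny stand-in archive with fake contents rather than the real data.

```python
import shutil
import tempfile
import zipfile
from pathlib import Path, PurePosixPath


def extract_flat(zip_path, dest_folder):
    """Copy the first file in zip_path to dest_folder as <variable>_<year>.nc."""
    dest_folder = Path(dest_folder)
    dest_folder.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_path) as src:
        # skip directory entries, which end with "/"
        member = next(n for n in src.namelist() if not n.endswith("/"))
        name = PurePosixPath(member)
        year = name.parent.name.split("_")[1]
        new_file = dest_folder / f"{name.stem}_{year}.nc"
        # stream the member's bytes straight into the renamed file
        with src.open(member) as fin, open(new_file, "wb") as fout:
            shutil.copyfileobj(fin, fout)
    return new_file


# demonstrate with a tiny stand-in archive (the contents are fake)
with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    zip_path = tmp / "12270_1991.zip"
    with zipfile.ZipFile(zip_path, "w") as z:
        z.writestr("12270_1991/prcp.nc", b"not real NetCDF data")
    result = extract_flat(zip_path, tmp / "out")
    result_name = result.name
    result_bytes = result.read_bytes()
    leftovers = sorted(p.name for p in (tmp / "out").iterdir())

print(result_name, leftovers)
```

The trade-off is a little more code inside the loop in exchange for not needing shutil.move or Path.rmdir.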

Another way to do this using os instead of pathlib

(from the 2018 Madison Python class)

[16]:
# declare a destination path
dest_path = 'extracted_data'
variable = 'prcp'

for f in zipfiles:
    with zipfile.ZipFile(f) as src:
        # get the path to the source file and the year
        _, fname = os.path.split(f)
        name = os.path.splitext(fname)[0].replace('.tar', '')
        srcfile = '{}/{}.nc'.format(name, variable)
        year = name.split('_')[1]

        # where we want the extracted .nc file to end up
        destfile = os.path.join(dest_path, '{}_{}.nc'.format(variable, year))

        # extract the srcfile path to the dest_path folder
        # unfortunately this extracts the whole path, not just the file
        src.extract(srcfile, dest_path)
        # move the file up from subfolders to dest_path
        shutil.move(os.path.join(dest_path, srcfile), dest_path)
        # rename to include year
        os.rename(os.path.join(dest_path, '{}.nc'.format(variable)),
                  destfile)
        # trash subfolders that were extracted
        os.rmdir(os.path.join(dest_path, name))
        print('{}/{} --> {}'.format(f, srcfile, destfile))
../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip/12270_1991/prcp.nc --> extracted_data/prcp_1991.nc
../03-output/netcdf_data/zipped/zipped_1996/12270_1996.zip/12270_1996/prcp.nc --> extracted_data/prcp_1996.nc
../03-output/netcdf_data/zipped/zipped_1998/12270_1998.zip/12270_1998/prcp.nc --> extracted_data/prcp_1998.nc
../03-output/netcdf_data/zipped/zipped_1999/12270_1999.zip/12270_1999/prcp.nc --> extracted_data/prcp_1999.nc
../03-output/netcdf_data/zipped/zipped_1997/12270_1997.zip/12270_1997/prcp.nc --> extracted_data/prcp_1997.nc
../03-output/netcdf_data/zipped/zipped_1990/12270_1990.zip/12270_1990/prcp.nc --> extracted_data/prcp_1990.nc
../03-output/netcdf_data/zipped/zipped_2003/12270_2003.zip/12270_2003/prcp.nc --> extracted_data/prcp_2003.nc
../03-output/netcdf_data/zipped/zipped_2004/12270_2004.zip/12270_2004/prcp.nc --> extracted_data/prcp_2004.nc
../03-output/netcdf_data/zipped/zipped_2005/12270_2005.zip/12270_2005/prcp.nc --> extracted_data/prcp_2005.nc
../03-output/netcdf_data/zipped/zipped_2002/12270_2002.zip/12270_2002/prcp.nc --> extracted_data/prcp_2002.nc
../03-output/netcdf_data/zipped/zipped_2011/12270_2011.zip/12270_2011/prcp.nc --> extracted_data/prcp_2011.nc
../03-output/netcdf_data/zipped/zipped_2016/12270_2016.zip/12270_2016/prcp.nc --> extracted_data/prcp_2016.nc
../03-output/netcdf_data/zipped/zipped_2017/12270_2017.zip/12270_2017/prcp.nc --> extracted_data/prcp_2017.nc
../03-output/netcdf_data/zipped/zipped_2010/12270_2010.zip/12270_2010/prcp.nc --> extracted_data/prcp_2010.nc
../03-output/netcdf_data/zipped/zipped_1983/12270_1983.zip/12270_1983/prcp.nc --> extracted_data/prcp_1983.nc
../03-output/netcdf_data/zipped/zipped_1984/12270_1984.zip/12270_1984/prcp.nc --> extracted_data/prcp_1984.nc
../03-output/netcdf_data/zipped/zipped_1985/12270_1985.zip/12270_1985/prcp.nc --> extracted_data/prcp_1985.nc
../03-output/netcdf_data/zipped/zipped_1982/12270_1982.zip/12270_1982/prcp.nc --> extracted_data/prcp_1982.nc
../03-output/netcdf_data/zipped/zipped_1995/12270_1995.zip/12270_1995/prcp.nc --> extracted_data/prcp_1995.nc
../03-output/netcdf_data/zipped/zipped_1992/12270_1992.zip/12270_1992/prcp.nc --> extracted_data/prcp_1992.nc
../03-output/netcdf_data/zipped/zipped_1993/12270_1993.zip/12270_1993/prcp.nc --> extracted_data/prcp_1993.nc
../03-output/netcdf_data/zipped/zipped_1994/12270_1994.zip/12270_1994/prcp.nc --> extracted_data/prcp_1994.nc
../03-output/netcdf_data/zipped/zipped_2009/12270_2009.zip/12270_2009/prcp.nc --> extracted_data/prcp_2009.nc
../03-output/netcdf_data/zipped/zipped_2007/12270_2007.zip/12270_2007/prcp.nc --> extracted_data/prcp_2007.nc
../03-output/netcdf_data/zipped/zipped_2000/12270_2000.zip/12270_2000/prcp.nc --> extracted_data/prcp_2000.nc
../03-output/netcdf_data/zipped/zipped_2001/12270_2001.zip/12270_2001/prcp.nc --> extracted_data/prcp_2001.nc
../03-output/netcdf_data/zipped/zipped_2006/12270_2006.zip/12270_2006/prcp.nc --> extracted_data/prcp_2006.nc
../03-output/netcdf_data/zipped/zipped_2008/12270_2008.zip/12270_2008/prcp.nc --> extracted_data/prcp_2008.nc
../03-output/netcdf_data/zipped/zipped_2015/12270_2015.zip/12270_2015/prcp.nc --> extracted_data/prcp_2015.nc
../03-output/netcdf_data/zipped/zipped_2012/12270_2012.zip/12270_2012/prcp.nc --> extracted_data/prcp_2012.nc
../03-output/netcdf_data/zipped/zipped_2013/12270_2013.zip/12270_2013/prcp.nc --> extracted_data/prcp_2013.nc
../03-output/netcdf_data/zipped/zipped_2014/12270_2014.zip/12270_2014/prcp.nc --> extracted_data/prcp_2014.nc
../03-output/netcdf_data/zipped/zipped_1989/12270_1989.zip/12270_1989/prcp.nc --> extracted_data/prcp_1989.nc
../03-output/netcdf_data/zipped/zipped_1987/12270_1987.zip/12270_1987/prcp.nc --> extracted_data/prcp_1987.nc
../03-output/netcdf_data/zipped/zipped_1980/12270_1980.zip/12270_1980/prcp.nc --> extracted_data/prcp_1980.nc
../03-output/netcdf_data/zipped/zipped_1981/12270_1981.zip/12270_1981/prcp.nc --> extracted_data/prcp_1981.nc
../03-output/netcdf_data/zipped/zipped_1986/12270_1986.zip/12270_1986/prcp.nc --> extracted_data/prcp_1986.nc
../03-output/netcdf_data/zipped/zipped_1988/12270_1988.zip/12270_1988/prcp.nc --> extracted_data/prcp_1988.nc
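
With this many files, it can be worth checking that no year was skipped (in the listing above, the years run from 1980 to 2017). The sketch below is one way to do that; the helper name `missing_years` is hypothetical, and the demonstration uses empty stand-in files in a temporary folder with one year deliberately left out.

```python
import tempfile
from pathlib import Path


def missing_years(folder, variable, first_year, last_year):
    """Return the years in the range with no <variable>_<year>.nc file."""
    present = {
        path.stem.split("_")[-1]
        for path in Path(folder).glob(f"{variable}_*.nc")
    }
    return [
        year for year in range(first_year, last_year + 1)
        if str(year) not in present
    ]


with tempfile.TemporaryDirectory() as tmp:
    # empty stand-in files; 1982 is left out on purpose
    for year in (1980, 1981, 1983):
        (Path(tmp) / f"prcp_{year}.nc").touch()
    gaps = missing_years(tmp, "prcp", 1980, 1983)

print(gaps)
```

For the real output, a call such as `missing_years('extracted_data', 'prcp', 1980, 2017)` would be expected to return an empty list.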