03: Solutions to Useful standard library modules exercises
[1]:
import os
from pathlib import Path
import shutil
import subprocess
import sys
import zipfile
Exercise: Make a script with a command line argument using sys.argv
Using a text editor such as VSCode, make a new *.py file with the following contents:
import sys
if len(sys.argv) > 1:
    for argument in sys.argv[1:]:
        print(argument)
else:
    print("usage is: python <script name>.py argument")
quit()
Try running the script at the command line.
[2]:
write_text = (
    'import sys\n\n'
    'if len(sys.argv) > 1:\n'
    '    for argument in sys.argv[1:]:\n'
    '        print(argument)\n'
    'else:\n'
    '    print("usage is: python <script name>.py argument")\n'
    'quit()\n'
)
with open('myscript.py', 'w') as dest:
    dest.write(write_text)
[3]:
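# capture_output=True is needed for result.stdout to hold the script's output
result = subprocess.run([sys.executable, 'myscript.py'],
                        capture_output=True, text=True, check=True)
print(result.stdout)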
usage is: python <script name>.py argument
[4]:
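# capture_output=True is needed for result.stdout to hold the script's output
result = subprocess.run([sys.executable, 'myscript.py', 'arg1', 'arg2'],
                        capture_output=True, text=True, check=True)
print(result.stdout)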
arg1
arg2
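The same argument handling can also live in a function that takes the argument list explicitly, which makes it easy to try out without launching a subprocess. This is only a minimal sketch: the name `main`, its `argv` parameter, and the returned exit codes are illustrative choices, not part of the exercise.

```python
def main(argv):
    """Print each command line argument, or a usage message if none were given.

    `argv` is expected to look like sys.argv: script name first, then arguments.
    The return value is a hypothetical exit code (0 = arguments given, 1 = none).
    """
    if len(argv) > 1:
        for argument in argv[1:]:
            print(argument)
        return 0
    print("usage is: python <script name>.py argument")
    return 1


# call it with hand-made argument lists instead of the real sys.argv
main(["myscript.py"])
main(["myscript.py", "arg1", "arg2"])
```

In a real script, the last two lines would be replaced by a call such as `main(sys.argv)`.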
Testing Your Skills with a truly awful example:
the problem:
Pretend that the file ../data/netcdf_data.zip contains some climate data (in the NetCDF format, with the *.nc extension) that we downloaded. If you open ../data/netcdf_data.zip, you'll see that within a subfolder named zipped are a bunch of additional subfolders, each for a different year. Within each subfolder is another zip file. Within each of these zip files is yet another subfolder, inside of which is the actual data file we want (prcp.nc).
[5]:
with zipfile.ZipFile('../data/netcdf_data.zip') as src:
    for f in src.namelist()[:10]:
        print(f)
netcdf_data/
netcdf_data/zipped/
netcdf_data/zipped/zipped_1991/
netcdf_data/zipped/zipped_1991/12270_1991.zip
netcdf_data/zipped/zipped_1996/
netcdf_data/zipped/zipped_1996/12270_1996.zip
netcdf_data/zipped/zipped_1998/
netcdf_data/zipped/zipped_1998/12270_1998.zip
netcdf_data/zipped/zipped_1999/
netcdf_data/zipped/zipped_1999/12270_1999.zip
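To see the nesting without extracting anything to disk, an inner zip file can be read into memory and opened from there. The sketch below assumes nothing about the real data: it builds a tiny synthetic archive with placeholder bytes that mimics the layout listed above, and the helper name `list_nested_members` is made up for illustration.

```python
import io
import zipfile


def list_nested_members(outer_zip):
    """Return {inner zip name: [member names]} for every *.zip inside outer_zip."""
    contents = {}
    with zipfile.ZipFile(outer_zip) as outer:
        for name in outer.namelist():
            if name.endswith(".zip"):
                # read the inner archive into memory and open it as a zip file
                inner_bytes = io.BytesIO(outer.read(name))
                with zipfile.ZipFile(inner_bytes) as inner:
                    contents[name] = inner.namelist()
    return contents


# build a tiny synthetic archive (placeholder bytes, not real NetCDF data)
inner_buffer = io.BytesIO()
with zipfile.ZipFile(inner_buffer, "w") as inner:
    inner.writestr("12270_1991/prcp.nc", b"placeholder")

outer_buffer = io.BytesIO()
with zipfile.ZipFile(outer_buffer, "w") as outer:
    outer.writestr("netcdf_data/zipped/zipped_1991/12270_1991.zip",
                   inner_buffer.getvalue())

nested = list_nested_members(outer_buffer)
print(nested)
```

With the real archive, the path to the zip file could be passed in place of `outer_buffer`.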
the goal:
To extract all of these prcp.nc files into a single folder, after renaming them with their respective years (obtained from their enclosing folders or zip files), e.g.
prcp_1980.nc
prcp_1981.nc
...
This will allow us to open them together as a dataset in xarray (more on that later). Does this sound awful? I'm not making this up. This is the kind of structure you get when downloading tiles of climate data with the Daymet Tile Selection Tool.
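Since the year shows up in the folder names and the zip file names alike, one way to pull it out is a small regular expression. This is a sketch under the assumption that the year is always four digits following an underscore, as in the listing above; the function name `year_from_name` is hypothetical.

```python
import re


def year_from_name(name):
    """Return the 4-digit year that follows an underscore in a file or folder name."""
    match = re.search(r"_(\d{4})\b", name)
    if match is None:
        raise ValueError(f"no year found in {name!r}")
    return int(match.group(1))


for example in ("zipped_1991", "12270_1996.zip", "12270_1998/prcp.nc"):
    year = year_from_name(example)
    print(f"{example} --> prcp_{year}.nc")
```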
hint:
you might find these functions helpful:
ZipFile.extractall
ZipFile.extract
Path.glob
Path.mkdir
Path.stem
Path.parent
Path.name
shutil.move
Path.rmdir
os.path.isdir
os.makedirs
os.path.split
os.path.splitext
os.path.join
os.rename
os.rmdir
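Several of the functions above come in pairs that do the same job in pathlib and in os.path. A quick side-by-side on one of the paths from the listing may help; this sketch uses `PurePosixPath` and `posixpath` (the POSIX flavors of `Path` and `os.path`) so that it behaves the same on any operating system.

```python
import posixpath
from pathlib import PurePosixPath

sample = "netcdf_data/zipped/zipped_1991/12270_1991.zip"

# pathlib: attributes on a path object
p = PurePosixPath(sample)
print(p.name)         # 12270_1991.zip
print(p.stem)         # 12270_1991
print(p.parent.name)  # zipped_1991

# os.path: functions that take and return strings
folder, fname = posixpath.split(sample)
root, ext = posixpath.splitext(fname)
print(folder)     # netcdf_data/zipped/zipped_1991
print(fname)      # 12270_1991.zip
print(root, ext)  # 12270_1991 .zip
```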
hint: start by using ZipFile.extractall() to extract all of the individual zip files from the main zip archive
This extracts the entire contents of the zip file to a designated folder.
[6]:
output_folder = Path('../03-output')
output_folder.mkdir(exist_ok=True)
with zipfile.ZipFile('../data/netcdf_data.zip') as src:
    src.extractall(output_folder)
Make a list of the zipfiles
[7]:
zipfiles = list(output_folder.glob('netcdf_data/zipped/*/*.zip'))
zipfiles[:5]
[7]:
[PosixPath('../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip'),
PosixPath('../03-output/netcdf_data/zipped/zipped_1996/12270_1996.zip'),
PosixPath('../03-output/netcdf_data/zipped/zipped_1998/12270_1998.zip'),
PosixPath('../03-output/netcdf_data/zipped/zipped_1999/12270_1999.zip'),
PosixPath('../03-output/netcdf_data/zipped/zipped_1997/12270_1997.zip')]
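Note that the paths above are not in year order: `Path.glob` returns results in whatever order the file system hands them over. If a predictable order matters, wrapping the call in `sorted()` is one option. The sketch below demonstrates this on empty stand-in files in a temporary folder, not on the real data.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    # create empty stand-in files in the same layout as the extracted archive
    for year in (1996, 1991, 1998):
        folder = root / "netcdf_data" / "zipped" / f"zipped_{year}"
        folder.mkdir(parents=True)
        (folder / f"12270_{year}.zip").touch()

    # sorted() orders the paths alphabetically, which here is also by year
    found = sorted(root.glob("netcdf_data/zipped/*/*.zip"))
    names = [p.name for p in found]

print(names)
```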
Part 1: extract with a single file
[8]:
f = zipfiles[0]
f
[8]:
PosixPath('../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip')
1a) Use ZipFile.namelist() (as above) to list the contents
This will yield the name of the *.nc file that we need to extract.
[9]:
with zipfile.ZipFile(f) as src:
    nc_file = src.namelist()[0]
print(nc_file)
12270_1991/prcp.nc
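Taking the first entry of `namelist()` works here because each inner zip file holds a single member. Some archives also store an explicit entry for the folder itself, in which case index 0 would be the folder rather than the file. Selecting by extension is a slightly more defensive alternative; the sketch below shows it on a synthetic archive with placeholder bytes.

```python
import io
import zipfile

# synthetic archive with an explicit directory entry ahead of the data file
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as dest:
    dest.writestr("12270_1991/", b"")
    dest.writestr("12270_1991/prcp.nc", b"placeholder")

with zipfile.ZipFile(buffer) as src:
    all_members = src.namelist()
    # keep only the members that end in .nc
    nc_members = [name for name in all_members if name.endswith(".nc")]

print(all_members)
print(nc_members)
```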
1b) Use ZipFile.extract() to extract the *.nc file to the destination folder
(you may need to create the destination folder first)
[10]:
with zipfile.ZipFile(f) as src:
    src.extract(nc_file, output_folder)
1c) Move the extracted file out of any enclosing subfolders, and rename to prcp_<year>.nc
(so that if we repeat this for subsequent files, the extracted *.nc files will end up in the same place)
[11]:
# make a path for the extracted file
extracted_path = output_folder / nc_file
extracted_path
[11]:
PosixPath('../03-output/12270_1991/prcp.nc')
[12]:
# make a path for the new file
nc_file = Path(nc_file)
variable = nc_file.stem
year = nc_file.parent.name.split('_')[1]
new_file = output_folder / f"{variable}_{year}.nc"
new_file
[12]:
PosixPath('../03-output/prcp_1991.nc')
[13]:
# do the move
shutil.move(extracted_path, new_file)
[13]:
PosixPath('../03-output/prcp_1991.nc')
1d) Remove the extra subfolders that were extracted
[14]:
extracted_path.parent.rmdir()
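As an alternative to the extract, move, and rmdir sequence in steps 1b through 1d, the member can be copied straight to its final name with `ZipFile.open` and `shutil.copyfileobj`, so that no subfolder is created in the first place. This is a sketch of a different technique than the one used in this solution; the helper name `extract_member_as` is made up, and the demonstration runs on a synthetic archive with placeholder bytes.

```python
import io
import shutil
import tempfile
import zipfile
from pathlib import Path


def extract_member_as(zip_source, member, target):
    """Copy one archive member to `target` without recreating its folders."""
    target = Path(target)
    target.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(zip_source) as src:
        # ZipFile.open returns a readable file-like object for the member
        with src.open(member) as member_file, open(target, "wb") as out_file:
            shutil.copyfileobj(member_file, out_file)
    return target


# demonstrate with a synthetic archive (placeholder bytes, not real NetCDF data)
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as dest:
    dest.writestr("12270_1991/prcp.nc", b"placeholder")

with tempfile.TemporaryDirectory() as tmp:
    new_path = extract_member_as(buffer, "12270_1991/prcp.nc",
                                 Path(tmp) / "prcp_1991.nc")
    extracted_bytes = new_path.read_bytes()
    leftover_dirs = [p for p in Path(tmp).iterdir() if p.is_dir()]

print(extracted_bytes, leftover_dirs)
```

The trade-off is a few more lines of code in exchange for not having to clean up afterwards.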
Part 2: put the above steps together into a loop to repeat the workflow for all of the NetCDF files
[15]:
for f in zipfiles:
    with zipfile.ZipFile(f) as src:
        # get the NetCDF file
        nc_file = src.namelist()[0]
        # extract it to the output folder
        src.extract(nc_file, output_folder)
        # make a path for the extracted file
        extracted_path = output_folder / nc_file
        # make a path for the new file
        nc_file = Path(nc_file)
        variable = nc_file.stem
        year = nc_file.parent.name.split('_')[1]
        new_file = output_folder / f"{variable}_{year}.nc"
        # move the extracted NetCDF file to the dest. location
        shutil.move(extracted_path, new_file)
        # remove the subfolders that were extracted
        extracted_path.parent.rmdir()
        print(f"{f}/{nc_file} --> {new_file}")
../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip/12270_1991/prcp.nc --> ../03-output/prcp_1991.nc
../03-output/netcdf_data/zipped/zipped_1996/12270_1996.zip/12270_1996/prcp.nc --> ../03-output/prcp_1996.nc
../03-output/netcdf_data/zipped/zipped_1998/12270_1998.zip/12270_1998/prcp.nc --> ../03-output/prcp_1998.nc
../03-output/netcdf_data/zipped/zipped_1999/12270_1999.zip/12270_1999/prcp.nc --> ../03-output/prcp_1999.nc
../03-output/netcdf_data/zipped/zipped_1997/12270_1997.zip/12270_1997/prcp.nc --> ../03-output/prcp_1997.nc
../03-output/netcdf_data/zipped/zipped_1990/12270_1990.zip/12270_1990/prcp.nc --> ../03-output/prcp_1990.nc
../03-output/netcdf_data/zipped/zipped_2003/12270_2003.zip/12270_2003/prcp.nc --> ../03-output/prcp_2003.nc
../03-output/netcdf_data/zipped/zipped_2004/12270_2004.zip/12270_2004/prcp.nc --> ../03-output/prcp_2004.nc
../03-output/netcdf_data/zipped/zipped_2005/12270_2005.zip/12270_2005/prcp.nc --> ../03-output/prcp_2005.nc
../03-output/netcdf_data/zipped/zipped_2002/12270_2002.zip/12270_2002/prcp.nc --> ../03-output/prcp_2002.nc
../03-output/netcdf_data/zipped/zipped_2011/12270_2011.zip/12270_2011/prcp.nc --> ../03-output/prcp_2011.nc
../03-output/netcdf_data/zipped/zipped_2016/12270_2016.zip/12270_2016/prcp.nc --> ../03-output/prcp_2016.nc
../03-output/netcdf_data/zipped/zipped_2017/12270_2017.zip/12270_2017/prcp.nc --> ../03-output/prcp_2017.nc
../03-output/netcdf_data/zipped/zipped_2010/12270_2010.zip/12270_2010/prcp.nc --> ../03-output/prcp_2010.nc
../03-output/netcdf_data/zipped/zipped_1983/12270_1983.zip/12270_1983/prcp.nc --> ../03-output/prcp_1983.nc
../03-output/netcdf_data/zipped/zipped_1984/12270_1984.zip/12270_1984/prcp.nc --> ../03-output/prcp_1984.nc
../03-output/netcdf_data/zipped/zipped_1985/12270_1985.zip/12270_1985/prcp.nc --> ../03-output/prcp_1985.nc
../03-output/netcdf_data/zipped/zipped_1982/12270_1982.zip/12270_1982/prcp.nc --> ../03-output/prcp_1982.nc
../03-output/netcdf_data/zipped/zipped_1995/12270_1995.zip/12270_1995/prcp.nc --> ../03-output/prcp_1995.nc
../03-output/netcdf_data/zipped/zipped_1992/12270_1992.zip/12270_1992/prcp.nc --> ../03-output/prcp_1992.nc
../03-output/netcdf_data/zipped/zipped_1993/12270_1993.zip/12270_1993/prcp.nc --> ../03-output/prcp_1993.nc
../03-output/netcdf_data/zipped/zipped_1994/12270_1994.zip/12270_1994/prcp.nc --> ../03-output/prcp_1994.nc
../03-output/netcdf_data/zipped/zipped_2009/12270_2009.zip/12270_2009/prcp.nc --> ../03-output/prcp_2009.nc
../03-output/netcdf_data/zipped/zipped_2007/12270_2007.zip/12270_2007/prcp.nc --> ../03-output/prcp_2007.nc
../03-output/netcdf_data/zipped/zipped_2000/12270_2000.zip/12270_2000/prcp.nc --> ../03-output/prcp_2000.nc
../03-output/netcdf_data/zipped/zipped_2001/12270_2001.zip/12270_2001/prcp.nc --> ../03-output/prcp_2001.nc
../03-output/netcdf_data/zipped/zipped_2006/12270_2006.zip/12270_2006/prcp.nc --> ../03-output/prcp_2006.nc
../03-output/netcdf_data/zipped/zipped_2008/12270_2008.zip/12270_2008/prcp.nc --> ../03-output/prcp_2008.nc
../03-output/netcdf_data/zipped/zipped_2015/12270_2015.zip/12270_2015/prcp.nc --> ../03-output/prcp_2015.nc
../03-output/netcdf_data/zipped/zipped_2012/12270_2012.zip/12270_2012/prcp.nc --> ../03-output/prcp_2012.nc
../03-output/netcdf_data/zipped/zipped_2013/12270_2013.zip/12270_2013/prcp.nc --> ../03-output/prcp_2013.nc
../03-output/netcdf_data/zipped/zipped_2014/12270_2014.zip/12270_2014/prcp.nc --> ../03-output/prcp_2014.nc
../03-output/netcdf_data/zipped/zipped_1989/12270_1989.zip/12270_1989/prcp.nc --> ../03-output/prcp_1989.nc
../03-output/netcdf_data/zipped/zipped_1987/12270_1987.zip/12270_1987/prcp.nc --> ../03-output/prcp_1987.nc
../03-output/netcdf_data/zipped/zipped_1980/12270_1980.zip/12270_1980/prcp.nc --> ../03-output/prcp_1980.nc
../03-output/netcdf_data/zipped/zipped_1981/12270_1981.zip/12270_1981/prcp.nc --> ../03-output/prcp_1981.nc
../03-output/netcdf_data/zipped/zipped_1986/12270_1986.zip/12270_1986/prcp.nc --> ../03-output/prcp_1986.nc
../03-output/netcdf_data/zipped/zipped_1988/12270_1988.zip/12270_1988/prcp.nc --> ../03-output/prcp_1988.nc
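The log above appears to cover every year from 1980 through 2017, but with this many files it is easy to miss a gap by eye. A small check on the file names can confirm it. This is a sketch that assumes the `prcp_<year>.nc` naming used above; the function name `missing_years` is hypothetical, and the demonstration uses a hand-made list rather than the real output folder.

```python
def missing_years(file_names, first, last):
    """Return the years in first..last that have no 'prcp_<year>.nc' file name."""
    found = {int(name.split("_")[1].split(".")[0]) for name in file_names}
    return [year for year in range(first, last + 1) if year not in found]


# hand-made example with one year deliberately left out
names = ["prcp_1980.nc", "prcp_1981.nc", "prcp_1983.nc"]
gaps = missing_years(names, 1980, 1983)
print(gaps)
```

For the real data, the list of names could come from something like `[p.name for p in output_folder.glob('prcp_*.nc')]`.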
Another way to do this using os instead of pathlib
(from the 2018 Madison Python class)
[16]:
# declare a destination path
dest_path = 'extracted_data'
variable = 'prcp'
for f in zipfiles:
    with zipfile.ZipFile(f) as src:
        # get the name of the zip file (without extension) and the year
        _, fname = os.path.split(f)
        name = os.path.splitext(fname)[0].replace('.tar', '')
        srcfile = '{}/{}.nc'.format(name, variable)
        year = name.split('_')[1]
        # where we want the extracted .nc file to end up
        destfile = os.path.join(dest_path, '{}_{}.nc'.format(variable, year))
        # extract srcfile into the dest_path folder
        # unfortunately this recreates the enclosing subfolder, not just the file
        src.extract(srcfile, dest_path)
        # move the file up from the subfolder to dest_path
        shutil.move(os.path.join(dest_path, srcfile), dest_path)
        # rename to include year
        os.rename(os.path.join(dest_path, '{}.nc'.format(variable)),
                  destfile)
        # trash the subfolder that was extracted
        os.rmdir(os.path.join(dest_path, name))
        print('{}/{} --> {}'.format(f, srcfile, destfile))
../03-output/netcdf_data/zipped/zipped_1991/12270_1991.zip/12270_1991/prcp.nc --> extracted_data/prcp_1991.nc
../03-output/netcdf_data/zipped/zipped_1996/12270_1996.zip/12270_1996/prcp.nc --> extracted_data/prcp_1996.nc
../03-output/netcdf_data/zipped/zipped_1998/12270_1998.zip/12270_1998/prcp.nc --> extracted_data/prcp_1998.nc
../03-output/netcdf_data/zipped/zipped_1999/12270_1999.zip/12270_1999/prcp.nc --> extracted_data/prcp_1999.nc
../03-output/netcdf_data/zipped/zipped_1997/12270_1997.zip/12270_1997/prcp.nc --> extracted_data/prcp_1997.nc
../03-output/netcdf_data/zipped/zipped_1990/12270_1990.zip/12270_1990/prcp.nc --> extracted_data/prcp_1990.nc
../03-output/netcdf_data/zipped/zipped_2003/12270_2003.zip/12270_2003/prcp.nc --> extracted_data/prcp_2003.nc
../03-output/netcdf_data/zipped/zipped_2004/12270_2004.zip/12270_2004/prcp.nc --> extracted_data/prcp_2004.nc
../03-output/netcdf_data/zipped/zipped_2005/12270_2005.zip/12270_2005/prcp.nc --> extracted_data/prcp_2005.nc
../03-output/netcdf_data/zipped/zipped_2002/12270_2002.zip/12270_2002/prcp.nc --> extracted_data/prcp_2002.nc
../03-output/netcdf_data/zipped/zipped_2011/12270_2011.zip/12270_2011/prcp.nc --> extracted_data/prcp_2011.nc
../03-output/netcdf_data/zipped/zipped_2016/12270_2016.zip/12270_2016/prcp.nc --> extracted_data/prcp_2016.nc
../03-output/netcdf_data/zipped/zipped_2017/12270_2017.zip/12270_2017/prcp.nc --> extracted_data/prcp_2017.nc
../03-output/netcdf_data/zipped/zipped_2010/12270_2010.zip/12270_2010/prcp.nc --> extracted_data/prcp_2010.nc
../03-output/netcdf_data/zipped/zipped_1983/12270_1983.zip/12270_1983/prcp.nc --> extracted_data/prcp_1983.nc
../03-output/netcdf_data/zipped/zipped_1984/12270_1984.zip/12270_1984/prcp.nc --> extracted_data/prcp_1984.nc
../03-output/netcdf_data/zipped/zipped_1985/12270_1985.zip/12270_1985/prcp.nc --> extracted_data/prcp_1985.nc
../03-output/netcdf_data/zipped/zipped_1982/12270_1982.zip/12270_1982/prcp.nc --> extracted_data/prcp_1982.nc
../03-output/netcdf_data/zipped/zipped_1995/12270_1995.zip/12270_1995/prcp.nc --> extracted_data/prcp_1995.nc
../03-output/netcdf_data/zipped/zipped_1992/12270_1992.zip/12270_1992/prcp.nc --> extracted_data/prcp_1992.nc
../03-output/netcdf_data/zipped/zipped_1993/12270_1993.zip/12270_1993/prcp.nc --> extracted_data/prcp_1993.nc
../03-output/netcdf_data/zipped/zipped_1994/12270_1994.zip/12270_1994/prcp.nc --> extracted_data/prcp_1994.nc
../03-output/netcdf_data/zipped/zipped_2009/12270_2009.zip/12270_2009/prcp.nc --> extracted_data/prcp_2009.nc
../03-output/netcdf_data/zipped/zipped_2007/12270_2007.zip/12270_2007/prcp.nc --> extracted_data/prcp_2007.nc
../03-output/netcdf_data/zipped/zipped_2000/12270_2000.zip/12270_2000/prcp.nc --> extracted_data/prcp_2000.nc
../03-output/netcdf_data/zipped/zipped_2001/12270_2001.zip/12270_2001/prcp.nc --> extracted_data/prcp_2001.nc
../03-output/netcdf_data/zipped/zipped_2006/12270_2006.zip/12270_2006/prcp.nc --> extracted_data/prcp_2006.nc
../03-output/netcdf_data/zipped/zipped_2008/12270_2008.zip/12270_2008/prcp.nc --> extracted_data/prcp_2008.nc
../03-output/netcdf_data/zipped/zipped_2015/12270_2015.zip/12270_2015/prcp.nc --> extracted_data/prcp_2015.nc
../03-output/netcdf_data/zipped/zipped_2012/12270_2012.zip/12270_2012/prcp.nc --> extracted_data/prcp_2012.nc
../03-output/netcdf_data/zipped/zipped_2013/12270_2013.zip/12270_2013/prcp.nc --> extracted_data/prcp_2013.nc
../03-output/netcdf_data/zipped/zipped_2014/12270_2014.zip/12270_2014/prcp.nc --> extracted_data/prcp_2014.nc
../03-output/netcdf_data/zipped/zipped_1989/12270_1989.zip/12270_1989/prcp.nc --> extracted_data/prcp_1989.nc
../03-output/netcdf_data/zipped/zipped_1987/12270_1987.zip/12270_1987/prcp.nc --> extracted_data/prcp_1987.nc
../03-output/netcdf_data/zipped/zipped_1980/12270_1980.zip/12270_1980/prcp.nc --> extracted_data/prcp_1980.nc
../03-output/netcdf_data/zipped/zipped_1981/12270_1981.zip/12270_1981/prcp.nc --> extracted_data/prcp_1981.nc
../03-output/netcdf_data/zipped/zipped_1986/12270_1986.zip/12270_1986/prcp.nc --> extracted_data/prcp_1986.nc
../03-output/netcdf_data/zipped/zipped_1988/12270_1988.zip/12270_1988/prcp.nc --> extracted_data/prcp_1988.nc
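Both versions only remove the small per-year subfolder after each move. The intermediate netcdf_data folder created by `extractall` is still there, and because it is not empty, `Path.rmdir` and `os.rmdir` will refuse to delete it. `shutil.rmtree` removes a folder together with everything inside it. The sketch below shows the difference on a stand-in folder in a temporary directory, so nothing real gets deleted.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    # stand-in for the intermediate folder created by extractall
    intermediate = Path(tmp) / "netcdf_data"
    nested = intermediate / "zipped" / "zipped_1991"
    nested.mkdir(parents=True)
    (nested / "12270_1991.zip").touch()

    try:
        intermediate.rmdir()  # fails: the directory is not empty
        rmdir_failed = False
    except OSError:
        rmdir_failed = True

    shutil.rmtree(intermediate)  # removes the folder and everything in it
    still_exists = intermediate.exists()

print(rmdir_failed, still_exists)
```

Since `shutil.rmtree` deletes without asking, it is worth double-checking the path before pointing it at real data.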