# 03: Useful standard library modules
(pathlib, shutil, sys, os, subprocess, zipfile, etc.)

These packages are part of the standard python library and provide very useful functionality for working with your operating system and files.  This notebook introduces these packages and demonstrates some of their functionality.  Online documentation is at https://docs.python.org/3/library/.


## Topics covered:
* **pathlib**:
    * listing files
    * creating, moving and deleting files
    * absolute vs relative paths
    * useful path object attributes
* **shutil**: 
    * copying, moving and deleting files AND folders
* **sys**: 
    * python and platform information
    * command line arguments
    * modifying the python path to import code from other locations
* **os**:
    * changing the working directory
    * recursive iteration through folder structures
    * accessing environment variables
* **subprocess**: 
    * running system commands and checking the results
* **zipfile**:
    * creating and extracting from zip archives

In [None]:
import os
from pathlib import Path
import shutil
import subprocess
import sys
import zipfile

## ``pathlib`` — Object-oriented filesystem paths
``pathlib`` provides convenient "pathlike" objects for working with file paths across platforms: paths and operations done with ``pathlib`` work the same on Windows and on POSIX systems (Linux, macOS, etc.). The main entry point for users is the ``Path()`` class.

further reading:  
https://treyhunner.com/2018/12/why-you-should-be-using-pathlib/  
https://docs.python.org/3/library/pathlib.html

### Listing files

#### Start by making a ``Path()`` object for the current folder

In [None]:
cwd = Path('.')
cwd

In [None]:
for f in cwd.iterdir():
    print(f)

#### List just the notebooks using the ``.glob()`` method

In [None]:
for nb in cwd.glob('*.ipynb'):
    print(nb)

#### Note: ``.glob()`` works across folders too
List all notebooks for both class components

In [None]:
for nb in cwd.glob('../*/*.ipynb'):
    print(nb)

#### But ``glob`` results aren't sorted alphabetically!
(and the sorting is platform-dependent)

https://arstechnica.com/information-technology/2019/10/chemists-discover-cross-platform-python-scripts-not-so-cross-platform/?comments=1&post=38113333

we can easily sort them by passing the results to ``sorted()``, which accepts any iterable and returns a sorted list (so the ``list()`` call is optional)

In [None]:
sorted(list(cwd.glob('../*/*.ipynb')))
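
A minimal sketch of the sorting behaviour, using a throwaway temporary folder so that nothing in the course repository is touched. The file names are made up for illustration.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    # create a few empty files in a deliberately unsorted order
    for name in ['b.ipynb', 'a.ipynb', 'c.txt']:
        (tmp / name).touch()

    # .glob() returns a generator; sorted() consumes it and returns a list
    notebooks = sorted(tmp.glob('*.ipynb'))
    names = [nb.name for nb in notebooks]

print(names)
```

Sorting explicitly makes the order the same on every platform, which matters whenever later steps depend on file order.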

**Note:** There is also a glob module in the standard python library that works directly with string paths

In [None]:
import glob
sorted(list(glob.glob('../*/*.ipynb')))

#### List just the subfolders

In [None]:
[f for f in cwd.iterdir() if f.is_dir()]

### Creating files and folders

#### make a ``Path`` object for a new subdirectory

In [None]:
new_folder = cwd / 'more_files'
new_folder

#### or an individual file

In [None]:
f = cwd / '00_python_basics_review.ipynb'
f

#### check if it exists, or if it's a directory

In [None]:
f.exists(), f.is_dir()

#### make the actual folder

In [None]:
new_folder.mkdir(); new_folder.exists()

Note that if you try to run the above cell twice, you'll get an error that the folder already exists.  
``exist_ok=True`` suppresses this error.

In [None]:
new_folder.mkdir(exist_ok=True)

#### make a new subfolder within a new subfolder
The ``parents=True`` argument allows for making subfolders within new subfolders

In [None]:
(new_folder / 'subfolder').mkdir(exist_ok=True, parents=True)
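
To see what the two ``mkdir()`` arguments protect against, here is a small sketch that runs in a temporary folder. The folder names ``outer`` and ``inner`` are placeholders.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    nested = Path(tmp) / 'outer' / 'inner'

    # without parents=True, a missing parent folder is an error
    try:
        nested.mkdir()
        missing_parent_error = False
    except FileNotFoundError:
        missing_parent_error = True

    # parents=True creates 'outer' and 'inner' in one call
    nested.mkdir(parents=True)

    # without exist_ok=True, an existing folder is an error
    try:
        nested.mkdir()
        exists_error = False
    except FileExistsError:
        exists_error = True

    # with both arguments, the call is safe to repeat
    nested.mkdir(parents=True, exist_ok=True)
    created = nested.is_dir()

print(missing_parent_error, exists_error, created)
```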

### absolute vs. relative pathing

Get the absolute location of the current working directory

In [None]:
abs_cwd = Path.cwd()
abs_cwd

Go up two levels to the course repository

In [None]:
class_root = (abs_cwd / '../../')
class_root

Simplify or resolve the path

In [None]:
class_root = class_root.resolve()
class_root

Get the cwd relative to the course repository

In [None]:
abs_cwd.relative_to(class_root)

check if this is an absolute or relative path

In [None]:
abs_cwd.relative_to(class_root).is_absolute()

In [None]:
abs_cwd.is_absolute()

**gotcha:** `Path.relative_to()` only works when the first path is a subpath of the second path (with both paths relative or both absolute); by default it will not step upward with `..`

For example, try executing this line: 

```python
Path('../part1_flopy/').relative_to('data')
```

If you need a relative path that will work robustly in a script, `os.path.relpath` might be a better choice

In [None]:
os.path.relpath('../part1_flopy/', 'data')

In [None]:
os.path.relpath('data', '../part1_flopy/')
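
The difference between the two approaches can be sketched with a pair of sibling folders. The layout here is an assumption for illustration only (``part0_python`` is a made-up folder name), and neither folder needs to exist, because both functions work on the path text.

```python
import os
from pathlib import Path

# hypothetical layout: <base>/part0_python/data and <base>/part1_flopy
base = Path.cwd()
start = base / 'part0_python' / 'data'
target = base / 'part1_flopy'

# target is not inside start, so relative_to() raises ValueError
relative_to_failed = False
try:
    target.relative_to(start)
except ValueError as err:
    relative_to_failed = True
    print('relative_to() failed:', err)

# os.path.relpath() inserts the '..' steps needed to get there
rel = Path(os.path.relpath(target, start))
print(rel)
```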

### useful attributes

In [None]:
abs_cwd.parent

In [None]:
abs_cwd.parent.parent

In [None]:
f.name

In [None]:
f.suffix

In [None]:
f.with_suffix('.junk')

In [None]:
f.stem
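
For reference, a short sketch that collects these attributes for one path. ``PurePosixPath`` is used so the printed values look the same on any operating system, and the ``notebooks`` folder is a made-up example.

```python
from pathlib import PurePosixPath

p = PurePosixPath('notebooks/00_python_basics_review.ipynb')

attributes = {
    'parent': p.parent,                      # enclosing folder
    'name': p.name,                          # file name with extension
    'stem': p.stem,                          # file name without extension
    'suffix': p.suffix,                      # extension, including the dot
    'with_suffix': p.with_suffix('.junk'),   # same path, new extension
}
for label, value in attributes.items():
    print(f'{label:12} {value}')
```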

### Moving and deleting files

Make a file

In [None]:
fname = Path('new_file.txt')
with open(fname, 'w') as dest:
    dest.write("A new text file.")

In [None]:
fname.exists()

Move the file

In [None]:
fname2 = Path('new_file2.txt')
fname.rename(fname2)

In [None]:
fname.exists()

Delete the file

In [None]:
fname2.unlink()

In [None]:
fname2.exists()

#### Delete the empty folder we made above
Note: this only works for empty directories (use ``shutil.rmtree()`` very carefully for removing folders and all contents within)

In [None]:
Path('more_files/subfolder/').rmdir()

## ``shutil`` — High-level file operations
A module for copying, moving, and deleting files and directories.

https://docs.python.org/3/library/shutil.html

The functions from shutil that you may find useful are:

    shutil.copy()
    shutil.copy2()  # this preserves most metadata (e.g. timestamps), unlike copy()
    shutil.copytree()
    shutil.move()
    shutil.rmtree()  # obviously, you need to be careful with this one!
    
Give these guys a shot and see what they do.  Remember, you can always get help by typing:

    help(shutil.copy)
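
If you would like a safe place to experiment, here is one possible sketch that exercises these functions inside a temporary folder, so that nothing outside it can be deleted by accident. All file and folder names are made up.

```python
import shutil
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    src_dir = tmp / 'src'
    src_dir.mkdir()
    (src_dir / 'a.txt').write_text('some text')

    # copy a single file (copy2 also tries to preserve timestamps)
    shutil.copy2(src_dir / 'a.txt', tmp / 'a_copy.txt')

    # copy a whole folder tree; by default the destination must not exist yet
    shutil.copytree(src_dir, tmp / 'src_copy')

    # move (or rename) a file or folder
    shutil.move(str(tmp / 'a_copy.txt'), str(tmp / 'a_moved.txt'))

    # remove a folder and everything in it
    shutil.rmtree(tmp / 'src_copy')

    remaining = sorted(p.name for p in tmp.iterdir())

print(remaining)
```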


In [None]:
#try them here.  Be careful!

In [None]:
shutil.rmtree(new_folder)

## ``sys`` — System-specific parameters and functions

### Getting information about python and the os
where python is installed

In [None]:
print(sys.prefix)

In [None]:
print(sys.version_info)

In [None]:
sys.platform

### Adding command line arguments to a script
Here the command line arguments reflect that we're running a Jupyter Notebook. 

In a python script, the first item in the list (``sys.argv[0]``) is the script name, and any command line arguments are listed after it.

In [None]:
sys.argv

### Exercise: Make a script with a command line argument using sys.argv

1) Using a text editor such as VSCode, make a new ``*.py`` file with the following contents:

```python
import sys

if len(sys.argv) > 1:
    for argument in sys.argv[1:]:
        print(argument)
else:
    print("usage is: python <script name>.py argument")
    sys.exit(1)
```

2) Try running the script at the command line

### modifying the python path

If you haven't seen `sys.path` already mentioned in a python script, you will soon.  `sys.path` is a list of directories that python searches for modules and packages.  If, for some reason, you want to use a python package or module that is not installed in the main python folder, you can add the directory containing your module to `sys.path`.

Any packages installed by linking the source code in place (i.e. ``pip install -e .``) will also show up here.

In [None]:
for pth in sys.path:
    print(pth)

### Using ``sys.path`` to import code from an arbitrary location

1) Using a text editor such as VSCode (or ``pathlib`` and python) make a new ``*.py`` file in another folder (anything in the same folder as this notebook can already be imported). For example:

In [None]:
subfolder = Path('another_subfolder/scripts')
subfolder.mkdir(exist_ok=True, parents=True)

with open(subfolder / 'mycode.py', 'w') as dest:
    dest.write("stuff = {'this is': 'a dictionary'}")

Now add this folder to the python path

In [None]:
sys.path.append('another_subfolder/scripts')

Code can be imported by naming the containing module

In [None]:
from mycode import stuff

stuff

**Note**: Importing code using ``sys.path`` is often considered bad practice, because 

* it can hide dependencies.    

    * from the information above, we don't know whether ``mycode`` is a package that is installed, a module in the current folder, or anywhere else for that matter.
    * Similarly, we know that any modules from ``'another_subfolder/scripts'`` can be imported, but we don't know which modules in that folder are needed without some additional checking.

* importing code using ``sys.path`` is also sensitive to the location of the script relative to the path. If the script is moved or used on someone else's computer with a different file structure, it'll break.

* this all said, sometimes using ``sys.path`` is expedient in reproducible workflows in that it can allow code to be consolidated and re-used across multiple scripts in various locations
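
When ``sys.path`` is used anyway, some of the fragility can be reduced by adding an absolute path, avoiding duplicate entries, and removing the entry once the import is done. The sketch below is one way to do that, not the only one. It builds a throwaway module in a temporary folder, and the module name ``mycode_demo`` is made up. In a real script the folder would more likely be built from ``Path(__file__).parent``, which is not available in a notebook.

```python
import importlib
import sys
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    scripts = Path(tmp) / 'scripts'
    scripts.mkdir()
    (scripts / 'mycode_demo.py').write_text("stuff = {'this is': 'a dictionary'}\n")

    # an absolute path does not depend on the current working directory
    scripts_path = str(scripts.resolve())
    if scripts_path not in sys.path:
        sys.path.insert(0, scripts_path)

    # make sure the import system notices the newly written file
    importlib.invalidate_caches()
    mycode_demo = importlib.import_module('mycode_demo')

    # clean up so later imports are not affected by this folder
    sys.path.remove(scripts_path)

print(mycode_demo.stuff)
```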

For code that is useful across multiple projects, [installing reusable code in a package can be the best way to go](https://learn.scientific-python.org/development/tutorials/). Packages provide a framework for organizing, documenting, testing and sharing code in a way that is easily understood by others.

Whatever you do, avoid importing with an `*` (i.e. ``from mycode import *``) at all costs. This imports everything from the namespace of a module, which can lead to unintended consequences.
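
One concrete illustration of such an unintended consequence: ``from os import *`` silently replaces the built-in ``open()`` with ``os.open()``, which takes different arguments and returns a file descriptor rather than a file object. In this sketch the star import is run inside a separate dictionary with ``exec`` so that it does not pollute the notebook's own namespace.

```python
import builtins
import os

# run the star import in an isolated namespace
namespace = {}
exec('from os import *', namespace)

# the name 'open' now refers to os.open, not the built-in open
shadowed = namespace['open'] is os.open
still_builtin = namespace['open'] is builtins.open

print(shadowed, still_builtin)
```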

## ``os`` — Miscellaneous operating system interfaces
Historically, the ``os.path`` module was the de facto standard for file and path manipulation. Since python 3.4 however, ``pathlib`` is generally cleaner and easier to use for most of these operations. But there are some exceptions.

### Changing the current working directory
``pathlib`` doesn't do this.   
Note: this can obviously lead to trouble in scripts, so should usually be avoided, but sometimes it is necessary. In groundwater modeling workflows, for example, this can help keep flow and transport model files organized in separate folders.

In [None]:
# Example of changing the working directory
old_wd = os.getcwd()

# Go up one directory
os.chdir('..')
cwd = os.getcwd()
print('Now in: ', cwd)

# Change back to original
os.chdir(old_wd)
cwd = os.getcwd()
print('Switched back to: ', cwd)
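
A common way to make directory changes safer is to wrap them in a context manager, so the original working directory is restored even if an error occurs partway through. The sketch below is one possible version; ``working_directory`` is a hypothetical helper name. Python 3.11 and later also provide ``contextlib.chdir`` for the same purpose.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def working_directory(path):
    """Temporarily change the working directory, then change back."""
    old_wd = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # runs even if the code inside the 'with' block raises an error
        os.chdir(old_wd)

with tempfile.TemporaryDirectory() as tmp:
    before = os.getcwd()
    with working_directory(tmp):
        inside_ok = os.path.samefile(os.getcwd(), tmp)
    after = os.getcwd()

print(inside_ok, before == after)
```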

### os.walk

``os.walk()`` is a great way to recursively generate all the file names and folders in a directory.  For each directory it visits, it yields a ``(root, dirs, files)`` tuple; the following collects those tuples into a list.

In [None]:
pth = Path('..')
results = list(os.walk(pth))
results

#### Make a more readable list of just the jupyter notebooks
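Note: the key advantage of ``os.walk`` over a fixed-depth ``glob`` pattern such as ``'../*/*.ipynb'`` is the recursion: individual subfolder levels don't need to be known or specified a priori. (``Path.rglob()`` and ``**`` glob patterns also recurse.)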

In [None]:
for root, dirs, files in os.walk(pth):
    for f in files:
        filepath = Path(root, f)
        if filepath.suffix == '.ipynb':
            print(filepath)
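
Because each step of ``os.walk`` yields the files in one directory, it is straightforward to total file sizes per directory, for example to spot large folders. A minimal sketch follows; ``directory_sizes`` is a hypothetical helper name, the sizes cover only the files directly inside each folder (they are not cumulative), and a broken symbolic link would raise an error in ``os.path.getsize``.

```python
import os
import tempfile
from pathlib import Path

def directory_sizes(top):
    """Return {directory: total bytes of the files directly inside it}."""
    sizes = {}
    for root, dirs, files in os.walk(top):
        sizes[root] = sum(os.path.getsize(os.path.join(root, f)) for f in files)
    return sizes

# try it on a small temporary folder tree with known file sizes
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'a.txt').write_bytes(b'abc')
    sub = Path(tmp) / 'sub'
    sub.mkdir()
    (sub / 'b.txt').write_bytes(b'12345')

    sizes = directory_sizes(tmp)
    top_size = sizes[tmp]
    sub_size = sizes[str(sub)]

print(top_size, sub_size)
```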

### Accessing environment variables

In [None]:
os.environ

#### Example: get the location of the current python (Conda) environment

In [None]:
os.environ['CONDA_PREFIX']
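
Indexing ``os.environ`` with a variable that is not set raises a ``KeyError`` (for example, ``CONDA_PREFIX`` outside of a Conda environment). A sketch of a more forgiving lookup with ``.get()`` follows; ``MY_MODEL_WORKSPACE`` is a made-up variable name used only for illustration.

```python
import os

# .get() returns None instead of raising KeyError when the variable is missing
conda_prefix = os.environ.get('CONDA_PREFIX')
if conda_prefix is None:
    print('CONDA_PREFIX is not set; probably not in a Conda environment')
else:
    print(conda_prefix)

# a second argument supplies a default value
workspace = os.environ.get('MY_MODEL_WORKSPACE', 'default_workspace')
print(workspace)
```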

### Running system commands
`os.system` provides a limited way to run system commands. For more flexibility, use `subprocess` (below).

In [None]:
os.system('ls -l')

In [None]:
# on Windows
os.system('dir')

## ``subprocess`` — Subprocess management

The subprocess module offers a way to execute system commands, for example MODFLOW, or any operating system command that you can type at the command line.

The recommended approach to invoking subprocesses is to use the ``run()`` function for all use cases it can handle. For more advanced use cases, the underlying ``Popen`` interface can be used directly.

Take a look at the following help descriptions for ``run``.

Note that on Windows, you may have to specify ``shell=True`` in order to access commands that are built into the shell (such as ``dir``) rather than separate executables.

In [None]:
help(subprocess.run)

In [None]:
# if on mac/unix
print(subprocess.run(['ls', '-l']))

With the `cwd` argument, we can control the working directory for the command. Here we list the files in the parent directory.

In [None]:
print(subprocess.run(['ls', '-l'], cwd='..'))

In [None]:
# if on windows
print(subprocess.run(['dir'], shell=True))
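
To check the results of a command from python, the output and return code can be captured instead of only printed. The sketch below runs the python interpreter itself (via ``sys.executable``) so that it works on any platform. The same arguments should apply to a model executable, although that is not shown here.

```python
import subprocess
import sys

result = subprocess.run(
    [sys.executable, '-c', 'print("hello from a subprocess")'],
    capture_output=True,  # collect stdout and stderr instead of printing them
    text=True,            # decode the output to str rather than bytes
    check=True,           # raise CalledProcessError on a non-zero return code
)

print(result.returncode)
print(result.stdout)
```

A return code of 0 conventionally means that the command succeeded.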

## ``zipfile`` — Work with ZIP archives

### zip up one of the files in data/

In [None]:
with zipfile.ZipFile('junk.zip', 'w') as dest:
    dest.write('data/xarray/daymet_prcp_rainier_1980-2018.nc')

### now extract it

In [None]:
with zipfile.ZipFile('junk.zip') as src:
    src.extract('data/xarray/daymet_prcp_rainier_1980-2018.nc', path='extracted_data')
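
A self-contained round trip is sketched below, using a temporary folder and a made-up text file instead of the course data. Two optional arguments are worth knowing: ``arcname`` controls the path stored inside the archive, and ``compression=zipfile.ZIP_DEFLATED`` turns on compression (by default, files are stored uncompressed).

```python
import tempfile
import zipfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp:
    tmp = Path(tmp)
    source = tmp / 'example.txt'
    source.write_text('some data')
    archive = tmp / 'example.zip'

    # write the file into the archive under a chosen internal path
    with zipfile.ZipFile(archive, 'w', compression=zipfile.ZIP_DEFLATED) as dest:
        dest.write(source, arcname='data/example.txt')

    # list the contents, then extract everything to a new folder
    with zipfile.ZipFile(archive) as src:
        names = src.namelist()
        src.extractall(tmp / 'extracted')

    extracted_text = (tmp / 'extracted' / 'data' / 'example.txt').read_text()

print(names, extracted_text)
```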

## Testing Your Skills with a truly awful example:

### the problem:
Pretend that the file `data/fileio/netcdf_data.zip` contains some climate data (in the NetCDF format with the ``*.nc`` extension) that we downloaded. If you open `data/fileio/netcdf_data.zip`, you'll see that within a subfolder `zipped` are a bunch of additional subfolders, each for a different year. Within each subfolder is another zipfile. Within each of these zipfiles is yet another subfolder, inside of which is the actual data file we want (`prcp.nc`). 

In [None]:
with zipfile.ZipFile('data/netcdf_data.zip') as src:
    for f in src.namelist()[:10]:
        print(f)

### the goal:
To extract all of these `prcp.nc` files into a single folder, after renaming them with their respective years (obtained from their enclosing folders or zip files). e.g.  
```
prcp_1980.nc
prcp_1981.nc
...
```
This will allow us to open them together as a dataset in `xarray` (more on that later). Does this sound awful? I'm not making this up. This is the kind of structure you get when downloading tiles of climate data with the [Daymet Tile Selection Tool](https://daymet.ornl.gov/gridded/)

### hint:
you might find these functions helpful:
```
ZipFile.extractall
ZipFile.extract
Path.glob
Path.mkdir
Path.stem
Path.parent
Path.name
shutil.move
Path.rmdir
```

### hint: start by using ``ZipFile.extractall()`` to extract all of the individual zip files from the main zip archive
This extracts the entire contents of the zip file to a designated folder

In [None]:
output_folder = Path('03-output')
output_folder.mkdir(exist_ok=True)

with zipfile.ZipFile('data/netcdf_data.zip') as src:
    src.extractall(output_folder)

Make a list of the zipfiles

In [None]:
zipfiles = list(output_folder.glob('netcdf_data/zipped/*/*.zip'))
zipfiles[:5]

### Part 1: extract with a single file

In [None]:
f = zipfiles[0]
f

#### 1a) Use ``ZipFile.namelist()`` (as above) to list the contents

This will yield the name of the ``*.nc`` file that we need to extract

#### 1b) Use ``ZipFile.extract()`` to extract the ``*.nc`` file to the destination folder
(you may need to create the destination folder first)

#### 1c) Move the extracted file out of any enclosing subfolders, and rename to ``prcp_<year>.nc``
(so that if we repeat this for subsequent files, the extracted ``*.nc`` files will end up in the same place)

#### 1d) Remove the extra subfolders that were extracted

### Part 2: put the above steps together into a loop to repeat the workflow for all of the NetCDF files

## Bonus Application -- Using ``os`` to find the location of an executable

You will often run an executable that is located somewhere deep within your system path.  It can be a good idea to know exactly where that executable is located.  This might one day save you from accidentally using an older version of an executable, such as MODFLOW.

In [None]:
# Define two functions to help determine 'which' program you are using
def is_exe(fpath):
    """
    Return True if fpath is an executable, otherwise return False
    """
    return os.path.isfile(fpath) and os.access(fpath, os.X_OK)

def which(program):
    """
    Locate the program and return its full path.  Return
    None if the program cannot be located.
    """
    fpath, fname = os.path.split(program)
    if fpath:
        if is_exe(program):
            return program
    else:
        # test for exe in current working directory
        if is_exe(program):
            return program
        # test for exe in path statement
        for path in os.environ["PATH"].split(os.pathsep):
            path = path.strip('"')
            exe_file = os.path.join(path, program)
            if is_exe(exe_file):
                return exe_file
    return None

In [None]:
which('mf6')
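
The standard library also provides ``shutil.which()``, which performs a similar search and returns ``None`` when nothing is found. A brief sketch follows; whether ``mf6`` is found depends on your own installation, and the last program name is deliberately made up.

```python
import os
import shutil
import sys

# search the system PATH; the result depends on what is installed
mf6_path = shutil.which('mf6')
print(mf6_path)

# the optional 'path' argument restricts the search to specific folders;
# here we look for the running python interpreter in its own folder
python_name = os.path.basename(sys.executable)
python_dir = os.path.dirname(sys.executable)
python_path = shutil.which(python_name, path=python_dir)
print(python_path)

# a program that does not exist gives None rather than an error
not_found = shutil.which('surely-not-an-installed-program-xyz')
print(not_found)
```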