python_and_ml_tools.html

<!DOCTYPE html>
<html lang="en">
  <title>Tools and tenets for ML and Python</title>

  <meta charset="utf-8">
  <meta name="viewport" content="width=device-width, initial-scale=1.0, maximum-scale=1.0, user-scalable=no">
  <script src="custom_themes/html_elements.js"></script>
  <link rel="stylesheet" href="dist/reveal.css">
  <link rel="stylesheet" href="custom_themes/sussex2.css" id="theme">

  <div class="reveal">

    <div class="slides">

      <div id="background-template">
        <footer>
          <p>Tools and tenets for ML and Python
        </footer>
      </div>

      <section class="dark-cyan">
        <p>&nbsp;
        <h2>Tools and tenets<br>for ML and Python</h2>
        <p>&nbsp;
        <p style="color: white">Predictive Analytics Lab<br>
          University of Sussex
        <p>&nbsp;
        <p><img src="images/logos/University_of_Sussex_Logo.svg2000_white.png" style="width: 5rem;">
      </section>

      <section>
        <h2>Contents</h2>
        <p>Approach
        <p>Python tools
        <p>ML tools
        <p>Further recommendations
      </section>

      <section>
        <h3>Contents</h3>
        <multi-col>
          <one-col>
            <h4>Core tenets</h4>
            <ul>
              <li>Libraries &gt; Snippets
              <li>Automation &gt; Manual
            </ul>
            <h4>Python Tools</h4>
            <ul>
              <li>Poetry
              <li>Hydra
              <li>Ray
            </ul>
          </one-col>
          <one-col>
            <h4>ML Tools</h4>
            <ul>
              <li>PyTorch Lightning
              <li>Weights &amp; Biases
            </ul>
            <h4>Not Covered (but recommended)</h4>
            <ul>
              <li>BentoML (for model-serving)
              <li>Optuna (for hyperparameter optimization) &ndash; has integration with Hydra
            </ul>
          </one-col>
        </multi-col>
      </section>

      <section class="dark-cyan">
        <p>&nbsp;
        <p>&nbsp;
        <p>&nbsp;
        <p>&nbsp;
        <h2>Core tenets</h2>
      </section>

      <section>
        <h3>Core tenets</h3>

        <section>
          <multi-col>
            <one-col>
              <ul>
                <li>Avoid boilerplate code
                  <ul>
                    <li>The best code is code you don't have to write.
                  </ul>
                <li>Put shared code into libraries
                  <ul>
                    <li>EthicML
                    <li>PALkit
                  </ul>
                <li>Automation &gt; Manual
                  <ul>
                    <li>Try to make your model run end-to-end
                    <li>Keep hard-coding to a minimum
                  </ul>
              </ul>
            </one-col>
            <one-col>
              <ul>
                <li>Type-annotate everything
                  <ul>
                    <li>readable and less error-prone code
                    <li>use static type checkers like pyright and mypy
                  </ul>
                <li>Make it easy for others to run your code
                  <ul>
                    <li>Specify exact dependencies
                    <li>Don’t use notebooks
                    <li>Run and configure via the command line
                  </ul>
              </ul>
            </one-col>
          </multi-col>
        </section>

        <section>
          <multi-col>
            <one-col>
              <ul>
                <li>Make it easy to collaborate
                  <ul>
                    <li>Use git
                    <li>Make PRs and ask for reviews
                  </ul>
              </ul>
            </one-col>
            <one-col>
            </one-col>
          </multi-col>
        </section>

      </section>

      <section class="dark-cyan">
        <p>&nbsp;
        <p>&nbsp;
        <p>&nbsp;
        <p>&nbsp;
        <h2>Python Tools</h2>
      </section>

      <section>
        <h3>Poetry &mdash; Simple, Conflict-free<br> Dependency Management</h3>

        <section>
          <multi-col>
            <one-col>
              <img src="images/poetry.png" alt="poetry logo">
              <ul>
                <li>Poetry is a tool for managing Python dependencies
              </ul>
            </one-col>
            <one-col>
              <ul>
                <li>Alternative to <code>setup.py</code>
                <li>Automatically resolves meta-dependencies (dependencies between dependencies)
                <li>Maintains a lock file that ensures that all people working on the project are locked to the same versions of dependencies
                <li>Also provides some protection if you forget to activate a venv
              </ul>
            </one-col>
          </multi-col>
        </section>

        <section>
          <p><code>pyproject.toml</code> replaces <code>setup.py</code> and also auxiliary <code>.cfg</code> files such as <code>black.cfg</code> and <code>.isort.cfg</code>
          <p>Install poetry from the website or homebrew
          <h4>Useful commands</h4>
          <p><code>poetry install</code> &ndash; install dependencies
          <p><code>poetry update</code> &ndash; check for dependency updates that won’t break your code
          <p><code>poetry add &lt;package&gt;</code> &ndash; installs and adds a new dependency (no need to manually code dependencies as required for requirements.txt/setup.py)
          <p>Analogy (for rustaceans): cargo for python


        </section>
      </section>

      <section>
        <h3>Hydra &mdash; elegant and flexible configuration</h3>

        <section>
          <multi-col>
            <one-col>
              <p>&nbsp;&nbsp;<img src="images/Hydra-Readme-logo2.svg" alt="hydra logo">
              <p>&nbsp;
              <ul>
                <li>Hydra is a tool for configuring complex applications
                <li>&ldquo;Complex&rdquo; means something like more than 10 flags
              </ul>
            </one-col>
            <one-col>
              <ul>
                <li>Hydra enables configuration via YAML files and allows overriding any configuration value on the commandline
                <li>Hydra encourages modular configuration
                  <ul>
                    <li>E.g. data loading config and model config is separate
                    <li>Config modules can be swapped out
                    <li>Supports validation and variable interpolation
                  </ul>
              </ul>
            </one-col>
          </multi-col>
        </section>

        <section>
          <multi-col>
            <one-col>
              <ul>
                <li>Hydra supports <em>multiruns</em> where multiple parameter values are run in a combinatorial fashion
                <li>Hydra also has plugins that allow for hyperparameter sweeps to be conducted using popular HPO libraries (e.g. Optuna)
              </ul>
            </one-col>
            <one-col>
              <ul>
                <li>Hydra can also instantiate Python objects based on configuration values
              </ul>
            </one-col>
          </multi-col>
        </section>
      </section>

      <section>
        <h3>Ray &mdash; effortless parallelism</h3>
        <multi-col>
          <one-col>
            <p><img src="images/ray_header_logo.png" alt="ray logo">
            <ul>
              <li>Ray is a tool for easy parallelisation of Python functions
              <li>It can be used to  a hyperparameter search over multiple GPUs and multiple machines
            </ul>
          </one-col>
          <one-col>
            <ul>
              <li>Ray usually parallelises a single Python function, but combined with Hydra, it parallelises your whole application
              <li>Ray can act as a queue for jobs
              <li>Ray is GPU-aware and can distribute jobs over multiple machines according to how many GPUs are available on the machines
            </ul>
          </one-col>
        </multi-col>
      </section>

      <section class="dark-cyan">
        <p>&nbsp;
        <p>&nbsp;
        <p>&nbsp;
        <p>&nbsp;
        <h2>ML Tools</h2>
      </section>

      <section>
        <h3>PyTorch Lightning &mdash; avoid boilerplate code</h3>
        <multi-col>
          <one-col>
            <p style="text-align: center"><img src="images/PL_logo.svg" alt="pytorch lightning logo" style="width: 3rem">
            <ul>
              <li>Lightning provides a set of common abstractions that you see in PyTorch models
              <li>Gives many things for free
                <ul>
                  <li>Logging (e.g. to W&B)
                  <li>Distributed training
                    <!-- <ul> -->
                    <!--   <li>Model or Data parallel -->
                    <!-- </ul> -->
                  <li>Automatic LR and batch-size determination
                </ul>
            </ul>
          </one-col>
          <one-col>
            <p>Lightning is made up of 3 key components: a DataModule, LightningModule, and a Trainer
            <ul>
              <li><strong>DataModule</strong> is a container for train, val and test dataloaders
              <li><strong>LightningModule</strong> is a <code>nn.Module</code> but you also define the training, val and test steps
              <li><strong>Trainer</strong> abstracts away the boilerplate code of the training loop
            </ul>
          </one-col>
        </multi-col>
      </section>

    </div>
  </div>

  <script type="module" src="setup.js"></script>