Setting up a Computing Environment#

Opinionated, (hopefully) sane defaults to build on

The material in this course is written primarily using Python.

Python has become something of a “lingua franca” of the data science world, though other languages like R, Julia, or even MATLAB are also widely used, loved, and powerful. Which you use will likely depend on the community you interact with most.

With that aside, managing Python as an interpreter and a set of libraries on your computer has historically been, shall we say… labyrinthine?

How you approach dependency management for your projects depends on your community, and more specifically on whether you lean toward developing Python tools (e.g. with Poetry) or toward using those tools. The latter, which is where Data Science (and by extension, text analysis) falls, most commonly relies on conda. Either way, the key to both is liberal use of environment isolation.

Hint

If that spaghetti in Fig. 3 scares you, the trick is to keep everything separated by what you want to DO with it.

Remember: 1-environment-per-project!

If the words you are reading are not familiar, this is an excellent opportunity to check out the Practical Data Science guide to setting this up on your machine. I will not duplicate Nick’s guide here, but I do have a few additional tips for this course (drawn from years of trial-and-error).

(Ana?)Conda, but it’s Mamba#

First, terminology:

  • Anaconda is a data science corporation that develops a huge amount of open-source software, along with maintaining the “anaconda” repository channel of tools/packages that one could install using:

  • conda, which is a program on your computer called a package manager, used through the command line. Apart from installing libraries and versions of Python (or R or Julia or Haskell or Coconut, or many others), it will also help you encapsulate these into isolated conda environments.

  • mamba is a recent re-implementation of conda that’s faster and has some nice features.

To keep your “base” conda environment clean (and running as fast as possible), look for a small installation like miniconda. For a simple way to get conda installed with minimal fluff, along with a default channel that has community-driven package curation, see the installable downloads on the conda-forge/miniforge page. Pick your OS and get going; it is much faster than in the old days!
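Once the installer has finished, a quick sanity check is to see which of these tools your shell can actually find. The sketch below is only a convenience: `find_package_managers` is a name made up for this example, it relies on nothing but Python's standard `shutil.which`, and it assumes the executables are called `conda` and `mamba` (which may not hold if, say, you installed a differently named variant).

```python
import shutil


def find_package_managers(names=("mamba", "conda")):
    """Map each tool name to its full path, or None if it is not on PATH."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for tool, path in find_package_managers().items():
        print(f"{tool}: {path or 'not found on PATH'}")
```

If both come back as "not found", the usual culprit is a terminal that was opened before the installer updated your shell configuration; opening a fresh one is worth trying first.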

Modern Literate Programming with Jupyter#

The first time I encountered a so-called “notebook” for code was in a calculus class with a professor absolutely obsessed with Mathematica. After a semester of using these new-fangled “cells” for my code and having results, comments, figures, and functions all interwoven together in a beautiful tapestry, I became obsessed too. Alas, I was but a poor student and, if I’m honest, unduly suspicious of licensed software.

This paradigm of keeping your code, story, and results all interwoven in one file is typically called literate programming. It’s a powerful way both to explore your programming ideas and to document them as a cohesive narrative once they are better defined. Literate programming was championed by Org-Mode (an extremely powerful plain-text system for note taking, authoring, and knowledge management built on Emacs). By the time I was discovering literate programming through Mathematica, the open source alternative I came across was a fascinating web-browser stack built on the “python” language: IPython Notebooks.

Hacky? Yes.

Free and Open Source? You bet!

Today, that project has absolutely exploded into the de facto method for data science experimentation, reporting, and communication, now called Jupyter. Heck, this book is written entirely using Jupyter(-book)!

Note

Jupyter is a server-based coding interface.

  • Start a jupyter server: jupyter lab or jupyter notebook

  • Access the running server through your browser: localhost:8888 (or let it open automatically)

A common misunderstanding: Jupyter is not “python”, and using it isn’t really dependent on your Python installation above. Instead, the Jupyter server frontend communicates with a programming language/interpreter of your choosing through a Jupyter language kernel. This architecture lets Jupyter operate as a rich interface to many languages, not just Python. That link has kernels to use anything from R and Julia to Haskell, Common Lisp, Clojure, MATLAB, Ruby, or Fortran.
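To make the server/kernel split a little more concrete: each installed kernel advertises itself through a small `kernel.json` file (a “kernelspec”) that tells Jupyter which command to launch. Below is a sketch of roughly what that file looks like for `ipykernel`; the exact `argv` and `display_name` on your machine will differ, and `describe_kernel` is a helper invented purely for this illustration.

```python
import json

# Roughly what ipykernel's kernel.json contains; paths and names vary by install.
EXAMPLE_KERNELSPEC = """
{
  "argv": ["python", "-m", "ipykernel_launcher", "-f", "{connection_file}"],
  "display_name": "Python 3 (ipykernel)",
  "language": "python"
}
"""


def describe_kernel(spec_text):
    """Summarise a kernelspec: its language and the program Jupyter starts."""
    spec = json.loads(spec_text)
    return (
        f"{spec['display_name']}: language={spec['language']}, "
        f"launches {spec['argv'][0]!r}"
    )


print(describe_kernel(EXAMPLE_KERNELSPEC))
```

Nothing in that file is Python-specific apart from the command it names, which is why swapping in another language is just a matter of installing another kernelspec. On a real installation, `jupyter kernelspec list` shows where these files live.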

In addition, being a web-first interface, Jupyter has a huge number of extensions, like adding scratchpads, widgets, bibtex citation support, and even live Reveal.js presentation mode through RISE, so you can live-demo in style!

Version Veracity, Server Sanity, Kernel Correctness#

In this course, I have adopted a few popular mechanisms to combat the proliferation of “junk notebooks”. Avoid addressing these problems at your peril!

First, it is helpful to review this guide to writing “Clean Notebooks”:

Notebooks are a magnificent tool to explore data, but such a powerful tool can become hard to manage quickly. Ironically, the ability to interact with our data rapidly (modify code cells, run, and repeat) is the exact reason why a notebook may become an obscure entanglement of variables that are hard to understand, even to the notebook’s author. But it doesn’t have to be that way.

— Eduardo Blancas (2021)

I don’t follow everything in this post exactly (for instance, I typically use pyproject.toml over setup.py to make source-code files for python functions/types, based on the now-standard PEP 518). But the advice is generally good and well worth a read-through.

Additional tools used for this book to help you stay sane on the wonderful journey to productive notebook use are below.

Jupytext#

Creates and syncs .md files (or script files for each kernel’s language) that can roundtrip to/from .ipynb files. This lets you add *.ipynb to your .gitignore (see DevOps) and never deal with unmergeable diff files again. Also great for collaboration, since your colleague only needs to edit a plain text file (e.g. on GitHub) and you will receive a fresh, edited notebook, locally.
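Why does a plain-text twin diff so much better? The toy below is not how Jupytext is implemented, nor its exact output format; it is a standard-library-only illustration (with a hand-written miniature notebook and a made-up `to_plain_text` helper) of the underlying idea: an .ipynb file stores outputs and execution counters next to your code, so merely re-running a cell changes the file, while a text version that keeps only the cell sources does not change at all.

```python
import json

# A hand-written, minimal notebook in the .ipynb JSON layout (nbformat 4).
notebook = {
    "nbformat": 4,
    "nbformat_minor": 5,
    "metadata": {},
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": ["# Word counts"]},
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": 7,
            "source": ["len('hello world'.split())"],
            "outputs": [
                {
                    "output_type": "execute_result",
                    "execution_count": 7,
                    "metadata": {},
                    "data": {"text/plain": ["2"]},
                }
            ],
        },
    ],
}

FENCE = "`" * 3  # three backticks, built this way to keep the example tidy


def to_plain_text(nb):
    """Keep only the cell sources; drop outputs, counters, and metadata."""
    chunks = []
    for cell in nb["cells"]:
        source = "".join(cell["source"])
        if cell["cell_type"] == "code":
            source = f"{FENCE}python\n{source}\n{FENCE}"
        chunks.append(source)
    return "\n\n".join(chunks) + "\n"


# Simulate re-running the code cell: same code, new execution counter.
rerun = json.loads(json.dumps(notebook))  # cheap deep copy
rerun["cells"][1]["execution_count"] = 8
rerun["cells"][1]["outputs"][0]["execution_count"] = 8

print(json.dumps(notebook) == json.dumps(rerun))        # False: the JSON changed
print(to_plain_text(notebook) == to_plain_text(rerun))  # True: the text did not
```

Jupytext's real formats carry a bit more information (for example, a metadata header), so treat this purely as intuition for why the text file is the one worth committing.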

YAML Environment specifications#

Remember:

1-environment-per-project!

So, you will have a lot of environments running around, with you and your collaborators constantly needing to add or remove dependencies. If only there were a way to make some kind of recipe file that lists the things you want, and then “update” the environment from it whenever you want…

In case you didn’t see this in the mountain of conda documentation, this exists, and should probably be the only way you make environments.

These environment.yaml files are incredibly powerful, and will save you time and sanity working with other people.

Example environment.yaml

Here’s an example file that governs a conda/mamba environment named test:

name: test
channels: 
  # This is where stuff gets downloaded from
  - conda-forge
  - pyviz

dependencies:
  - python=3.9  # can lock version numbers
  - ipykernel  # needed to use "python" from a notebook
#  - r-irkernel  # could've been "R" (the IRkernel package on conda-forge)
  - pandas
  - panel  # comes from the pyviz channel
  - pip  # ALWAYS INSTALL PIP if you want to use `pip install` inside the environment!
  - pip:  # now we can tell `pip` to install its own package list automatically
    - textacy
    # - -e .  # if we wanted to install local directory (e.g. pyproject.toml/setup.py)
    # - frictionless[json]  # can use standard pip syntax for e.g. options

In the directory containing this file, say, test-env.yml, you would create/update the environment test with the command:

conda env update -f test-env.yml
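If you find yourself running that update often (or want collaborators to run exactly the same thing), the step can be wrapped in a few lines of Python. This is only a sketch: `env_update_command` and `update_env` are hypothetical helper names, and passing `tool="mamba"` assumes your mamba installation accepts the same `env update -f` subcommand as conda.

```python
import shutil
import subprocess


def env_update_command(env_file, tool="conda"):
    """Build the command that creates/updates an environment from a YAML file."""
    return [tool, "env", "update", "-f", str(env_file)]


def update_env(env_file, tool="conda"):
    """Run the update, failing early with a clear message if the tool is missing."""
    if shutil.which(tool) is None:
        raise FileNotFoundError(f"{tool!r} was not found on PATH")
    # check=True raises CalledProcessError if the update itself fails.
    subprocess.run(env_update_command(env_file, tool), check=True)


# Show the command without running it.
print(" ".join(env_update_command("test-env.yml")))
```

Calling `update_env("test-env.yml")` from the directory containing the file would then perform the actual update; building the command separately from running it keeps the sketch easy to inspect before anything touches your environments.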

Now is a good time to try making your own environment for this course… see if you can make a text-data-env.yml on your own!

nb_conda_kernels#

However, because it is largely implemented in Python, the Jupyter server gets installed into one environment (typically base) and has no idea about other environments while running. In practice, this means everyone ends up re-installing Jupyter into every project’s environment, which

  1. wastes space, memory, and terminal tabs (running many instances of Jupyter at once), and

  2. loses any extensions/custom settings you might have had in another environment.

Instead, the folks at Anaconda have created a (temporary) fix for this hell: nb_conda_kernels. This extension gets installed once, alongside a single installation of Jupyter (e.g. in base or a custom notebooks environment). Then, any time you create an environment with a valid language kernel (ipykernel, irkernel, coconut[kernel], etc.), you will see that kernel and be able to run any notebook inside its corresponding environment. Switch between kernels from inside the notebook with Kernel > Change kernel.

Feels good, man.