Tools for Data Infrastructure#
TL;DR

- Make sure your Source Control Management (SCM) is ready to use; we’re using `git`
- Sign up for a “git forge”; we’re using GitHub.com
- In relevant projects, use and build data packages wherever possible; we’re using `dvc`, since it’s language-agnostic
- (optional, but highly recommended) Document and share your project; we’re using `jupyter-book`
- (optional, but highly recommended) Schemas and testing for your data; we’re using `pandera`
In the last section, we talked about the engineering practice, a body of “art” that consists of knowledge and accepted best practices that support experts in making context-sensitive decisions.
The other key component to problem solving in-context is a suite of tools in your “toolbox”. Tools enable experts to efficiently accomplish their goals, and understanding the tools being used within your community will help you critique and contribute to other work, as well. If the practice is a technical community’s mutual direction or heading, we might think of these tools as the infrastructure of a technical community.
Data Science tools, and text-analysis tools by extension, care a great deal about how data gets used, so let’s call this idea data infrastructure.
- Data Infrastructure
digital infrastructure promoting data sharing and consumption
This, like the section before, is worth a course of its own: spanning development- and data-operations, software engineering principles, information maintainers, human factors, and plenty of cautionary tales.
What’s the Big Idea?#
Rather than delving too deep here, let’s get an idea of what the “big challenge” is, along with some basic tools to get started for this course, specifically.
Theory vs. Practice#
Analysts, data scientists, etc., would like to:

- Load their data
- Grab their trusty libraries of well-documented tools
- \(\rightarrow\) pipeline \(\rightarrow\)
- Report out results that anyone can use and trust!
Like the “magic” practices from the last section, this “alchemical” tool called a “pipeline” is so nice, assuming all the libraries, tools, and data work together like they should.
In reality, ~80% of an analyst’s time is spent wrangling and preprocessing their data… and everyone re-invents the wheel for almost every analysis!
Containerize — Data Packages!#
I won’t lie, this is a super sticky, outstanding problem. Many groups around the world are working on making data more reusable and reducing the load on analysts. See, for instance, the massive undertaking behind the FAIR Data Principles [Wilkinson et al., 2016]:

- Findable
- Accessible
- Interoperable
- Reusable
Or, for a more tooling-centric approach, the Open Knowledge Foundation has the Frictionless Data initiative, which builds open software and standards.
In yet another parallel to the previous section, many of the proposed solutions revolve around isolation: bundling context-specific tools and data together in reproducible and transparent ways. These “bundles”, or “containers”, are often called data packages, and they treat data the way the software engineering community has learned to treat code (i.e. Development Operations, or DevOps).
Lesson from History
Shipping cargo pre-1956 looked a lot like data engineering does today:

- Costs were skewed 10-to-1 toward the loading/unloading phases
- Teams of dock-workers numbered in the hundreds

Why? Loading was specialized and combinatoric, handled case-by-case:

- Per cargo type (bananas? vehicles? oil?): special concerns
- Per ship (length? width? fuel cost?): again, special concerns

Shipping containers changed this completely! [Ebeling, 2009]

- Containers provide a uniform loading/unloading interface for ships (cranes\(\rightarrow\)trains\(\rightarrow\)trucks)
- The modular storage shape unified ship design needs (lots of identical boxes, stacked high)
- The interior of each container was still a mess… but solving it became an asynchronous & distributed problem.
Getting Started#
Here’s a couple types of tools to get started down the path to this idea:
Data-as-Code#
Apply DevOps principles to data as DataOps, which treats Data-as-Code. Through clever extension of Source Control Management, we can exploit existing DevOps infrastructure for:

- Dependency management and provision
- Version Control and releases
- Reproducibility and programmatic access
So, we start with a DevOps baseline of SCM (`git`, `mercurial`, etc.) plus a social collaboration and project management system, like GitHub.com, GitLab.com, Gitea, etc.
Note
We will be using `git`:

- Not installed? `conda install git` in the base environment, assuming you followed along up to now.
- Read up on why we care about `git` so much. Blischak et al. [2016] have you covered (and use `git` themselves).

…and GitHub, which:

- Acts like a “remote” for your code/text to get backed-up/synced to
- Adds lots of nice social and project management features on top of commits/branches (e.g. Pull Requests, issues, forks, etc.)
From here, add a layer that extends SCM to work nicely with large data files (it doesn’t by default, for similar reasons that it doesn’t like `.ipynb` files). Examples of this idea include `git-LFS`, Pachyderm, Quilt, and DVC.
Note
We will be using `dvc`.

Think “makefile + git-lfs”. It version controls data, and makes connections to code/other data explicit. Get started here.

- Install as a project requirement (`conda install dvc-STORAGE`) or package dependency (`pip install dvc[STORAGE]`), where `STORAGE ∈ {ssh, http, s3, ...}`.
- Use it to get and share data cleanly: `dvc import` will grab other data out there, `dvc pull` gets the latest data when available, and the `commit`, `pull`, and `add` commands mimic `git`, but for your data tracked with DVC (see the sketch just below).
- The current GitHub equivalent for DVC is DagsHub.com, which we use for this course to store and supply datasets on-demand.
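If you’d rather stay inside Python than shell out to the CLI, `dvc` also ships a small Python API. Here is a minimal sketch, assuming a hypothetical DVC-tracked file `data/raw/corpus.csv` in a made-up DagsHub repository:

```python
import pandas as pd
import dvc.api  # Python API that ships with the dvc package

# Hypothetical example: the repo URL and file path are placeholders,
# not the actual course repository.
with dvc.api.open(
    "data/raw/corpus.csv",                        # path tracked by DVC in that repo
    repo="https://dagshub.com/<user>/<project>",  # any git repo with DVC metadata
    rev="main",                                   # branch, tag, or commit hash
) as f:
    corpus = pd.read_csv(f)

print(corpus.head())
```

Under the hood this resolves the `.dvc` metadata in the repo and streams the file from the configured remote, much like `dvc pull` does for a whole workspace.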
Documentation- and Test-driven “Development”#
There are some best-practices that will take your project to the “next level”, and provide a lot of peace-of-mind if implemented throughout the project (rather than at the end, in a rush)!
First, Documentation of data infrastructure is your container’s “cargo manifest”.
This can be as simple as a `README.md`, or as complex as a pretty webpage, a dashboard, and beyond.
Some notes:

- Separation of form and content will make refactoring much easier \(\rightarrow\) use Plain text (`markdown`)
- Writing as a team, but using distributed and asynchronous collaboration \(\rightarrow\) use Version control (`git`)
- Ok, but make sure I can deploy PDFs, websites, etc.? \(\rightarrow\) together, we’ve got a Static site generator
Note
- Static site generators, like `sphinx` or `mkdocs` for python. Build it early and often!
- This site is built on `jupyter-book` to turn markdown and jupyter notebooks into beautiful documentation.
- mkdocs-material is another popular option for modern mkdocs.
Finally, the phrases “unit test” and “schema validation” get a bad rap in our community. But it helps if you think of writing “tests” as more like installing sensors in your “shipping container”. They provide a suite of positive assertions about your (the authors’) expectations and assumptions about the code and data. They are your eyes and ears into what happens when conditions around and inside your “container” change.
Note
Make your (and everyone else’s) life easier by adding semantic information about your data. Options include:

- jsonschema? Try pydantic.
- GraphQL? strawberry is new and shiny, built to work like/with pydantic.
- Straight DataFrames? Try pandera, also working on pydantic interop (see the sketch after this list).
- A more general tabular solution: the frictionless framework is quite powerful.
- Intake can help both with making data packages and describing the datasets themselves.
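Since pandera is the one we’re using in this course, here is a minimal sketch of a schema; the table and its columns (`doc_id`, `text`, `n_tokens`) are made up purely for illustration:

```python
import pandas as pd
import pandera as pa

# Hypothetical schema for a tiny text-analysis table.
schema = pa.DataFrameSchema(
    {
        "doc_id": pa.Column(str, unique=True),                     # one row per document
        "text": pa.Column(str, pa.Check.str_length(min_value=1)),  # no empty documents
        "n_tokens": pa.Column(int, pa.Check.ge(0)),                # counts are non-negative
    },
    strict=True,  # refuse unexpected columns
)

df = pd.DataFrame(
    {
        "doc_id": ["a1", "a2"],
        "text": ["hello world", "containers!"],
        "n_tokens": [2, 1],
    }
)

validated = schema.validate(df)  # raises a SchemaError if any expectation is violated
```

The schema doubles as documentation: a newcomer can read it to learn what the data is supposed to look like, and CI can run it to learn when that stops being true.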
Unit tests will help you find mistakes and collaborate with peace-of-mind. For python, specifically (a small sketch follows this list):

- `pytest` in general, as the current de-facto standard.
- `datatest` and/or `pandera` for your data.
- `hypothesis` can automatically generate tests through parameterization.
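For a concrete taste, here is a minimal, hypothetical sketch: a toy `normalize_text` helper (invented for this example, not part of the course code) with one hand-written `pytest` assertion and one property-based `hypothesis` test:

```python
# test_normalize.py -- run with `pytest`
from hypothesis import given, strategies as st


def normalize_text(text: str) -> str:
    """Toy helper (made up for this example): lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def test_normalize_basic():
    # A plain pytest assertion: one concrete, hand-picked expectation.
    assert normalize_text("  Hello\tWORLD ") == "hello world"


@given(st.text())
def test_normalize_is_idempotent(s):
    # A hypothesis property: normalizing twice changes nothing,
    # checked against many generated strings, including nasty edge cases.
    once = normalize_text(s)
    assert normalize_text(once) == once
```

These are exactly the “sensors” described above: cheap to write, and they keep reporting on your assumptions long after you’ve moved on to the next analysis.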
Conclusion#
Phew, that was a lot of ground to cover. It probably feels overwhelming, but what I actually want is to help you feel like you are not alone. So many classes treat theory and application as separate from the tools being used within a community. There is an infrastructure out there, and though it may be ever-changing and vast, it is crucial to invest time in teaching yourself some new tricks.
Plus, nearly all of the tools mentioned here will, anecdotally, end up saving you time in the not-so-long run. After all, they are like paved roads heading in your direction. You might need to learn how to drive, but you’ll get to many more places, much faster, on the highway than on the gravel trail.
- WDA+16
Mark D Wilkinson, Michel Dumontier, IJsbrand Jan Aalbersberg, Gabrielle Appleton, Myles Axton, Arie Baak, Niklas Blomberg, Jan-Willem Boiten, Luiz Bonino da Silva Santos, Philip E Bourne, and others. The FAIR guiding principles for scientific data management and stewardship. Scientific Data, 3(1):1–9, 2016.
- Ebe09
CE Ebeling. Evolution of a box. Invention and Technology, 23(4):8–9, 2009.
- BDW16
John D Blischak, Emily R Davenport, and Greg Wilson. A quick introduction to version control with git and github. PLoS computational biology, 12(1):e1004668, 2016.