Tools for Data Science
So far in this (over-long) introduction, we've covered how technical communities find their heading (e.g. engineering practice) and why roads matter for traveling efficiently in that direction (e.g. data-driven fields need data infrastructure). The last thing this metaphor needs is vehicles to get us to our destination: domain-specific tools (methods, algorithms, frameworks, etc.).
To keep things interesting, key tools will be introduced in the context of their use from here on out. Rather than go into great detail, this page keeps a maintained table of tools for each area, with any notes on their use collected in the Notes section after the tables.
Tables, Vectors, & Graphs
Basic ways of representing and manipulating data.
Tool Name | Description | Docs/Tutorials | Notes |
---|---|---|---|
NumPy/SciPy | Matrix (dense or sparse) manipulation and routines. | | |
Pandas | All-purpose tabular data loader, manipulator, and writer. | docs | |
pyJanitor | Convenient methods to (sanely) clean up your data-frame, in-line. | | |
XArray | N-dimensional extension of Pandas' "named arrays", based on NetCDF. | | |
NetworkX | Graphs (vertices+edges) as general-purpose dictionaries with methods. | | |
graph-tool | C++-based network analysis, focused on stochastic block models. | | |
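To make the NetworkX row's "dictionaries with methods" description concrete, here is a minimal sketch. The node names, the `weight` values, and the `label` attribute are made up for illustration; only the dictionary-style access pattern is the point.

```python
import networkx as nx

# Build a small undirected graph; edge attributes are plain keyword arguments.
G = nx.Graph()
G.add_edge("a", "b", weight=2.0)
G.add_edge("b", "c", weight=0.5)

# Node attributes live in a per-node dictionary (the "label" key is hypothetical).
G.nodes["a"]["label"] = "start"

# The graph itself behaves like a dict-of-dicts: G[u][v] is the edge's attribute dict.
edge_weight = G["a"]["b"]["weight"]

# Iterating over G[u] yields the neighbors of u.
neighbors = sorted(G["b"])

# ...and the "methods" part: algorithms operate on that same structure.
path = nx.shortest_path(G, "a", "c", weight="weight")
```

Because the underlying storage is ordinary dictionaries, any hashable Python object can serve as a node and any key/value pair can be attached to nodes or edges.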
Machine Learning
Tool Name | Description | Docs/Tutorials | Notes |
---|---|---|---|
Scikit-Learn | Standard for ML in Python. | | |
Natural Language Processing
Tool Name | Description | Docs/Tutorials | Notes |
---|---|---|---|
NLTK | | | |
spaCy/textacy | | | |
Gensim | | | |
flair | | | |
huggingface | | | |
cleantext | | | |
Visualization
Tool Name | Description | Docs/Tutorials | Notes |
---|---|---|---|
matplotlib.pyplot | | | |
seaborn | | | |
pyviz | see below | | |
Notes
Pandas
For an excellent guide to using pandas more effectively (which we all need to do), see Tom Augspurger's series, Modern Pandas. For this course, make sure you read the method chaining post! It will dramatically alter (read: enhance) your code quality and maintainability when using pandas for data munging.
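As a rough illustration of what method chaining looks like in practice, here is a small sketch. It is not taken from the Modern Pandas series, and the column names and values are invented. Each step returns a new DataFrame, so the whole cleaning pipeline reads top to bottom without intermediate variables.

```python
import pandas as pd

# Hypothetical raw data: messy column names and one missing reading.
raw = pd.DataFrame({
    "Sensor ID": ["a", "a", "b", "b", "b"],
    "Temp F": [50.0, None, 212.0, 32.0, 122.0],
})

summary = (
    raw
    # 1. Normalize column names.
    .rename(columns={"Sensor ID": "sensor_id", "Temp F": "temp_f"})
    # 2. Drop rows with no reading.
    .dropna(subset=["temp_f"])
    # 3. Derive a new column; the lambda receives the frame as it exists at this step.
    .assign(temp_c=lambda df: (df["temp_f"] - 32) * 5 / 9)
    # 4. Summarize per sensor using named aggregation.
    .groupby("sensor_id", as_index=False)
    .agg(mean_c=("temp_c", "mean"), n=("temp_c", "size"))
)
```

The `lambda` in `.assign` is what makes the chain work: it refers to the intermediate frame rather than to `raw`, so steps can be reordered, added, or commented out without renaming temporary variables.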