Introduction

The Road to Technical Language Processing

Text analysis can be hard to define.

It’s often thought of as a “popular data mining technique”: something that, tautologically, text “goes into” and analyses “come out of.”

Less trivially, text analysis might be some set of “mutually agreed-upon” assumptions and patterns, all meant to help us use text in computational or statistical ways.

This is nothing to sneeze at. After all, using text (especially natural language text, which can be full of symbolic and semantic meaning, written and intended for humans to read and understand) as a source of data?

That is, arguably, very cool.

Turning text, which is almost certainly subjective at some fundamental level, into something that looks like objective data, whether for decision support, investigations, understanding demographics, or question-answering, can absolutely feel magical.

Do you believe in Magic?

If we look at this definition again, though, “magical” can start to feel a bit on the nose:

```mermaid
graph LR
    subgraph magic
        assumptions --> ta(["text analysis?"])
        style ta stroke-dasharray: 5 5
        patterns --> ta
    end
    text --> results
    ta --> results
```

How do we know our assumptions (or anyone else’s) are good? How do we know those assumptions exist, for that matter? Are the patterns “good”? Are the patterns relevant to the desired analysis?

The pedantry seems a little pointless, surely…

Obviously someone has answered these questions, or why would text analysis be so popular? Can’t I just use my text as data, already?

—frustrated data scientists, probably

Unfortunately, things are not so straightforward.

Incentives for practitioners in data science right now mean that there is often very little reason to question the “magic”, at least at first.

Problems start arising when the outputs of these “analyses” have impacts. They might impact decisions, impact policymaking, impact people’s lives. This tends to happen when technically minded people, usually in need of more evidence to make better decisions (bless their hearts!), start to turn to new “types” of data for answers. They might, for instance, turn to text as data.

Our magical patterns and assumptions for dealing with text-as-data start to be used outside of their original context (e.g. domains like linguistics or Natural Language Processing). They get applied to technical domains like engineering, medicine, policy, threat assessment, risk modeling, and actuarial work, all of which are built on the ethical application of evidence to serve others for the betterment of society.
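To make that mismatch concrete, here is a minimal sketch in plain Python (the maintenance-log entry is hypothetical, and whitespace tokenization stands in for any pattern borrowed without question): an assumption that holds up on everyday prose quietly breaks on technical text.

```python
# A minimal sketch of one borrowed "pattern": whitespace tokenization.
# Hidden assumption: words are space-separated and look like dictionary words.
def naive_tokens(text: str) -> list[str]:
    return text.lower().split()

# Reasonable on the everyday prose most NLP corpora are built from:
print(naive_tokens("The pump was replaced after it failed."))
# ['the', 'pump', 'was', 'replaced', 'after', 'it', 'failed.']

# Less reasonable on a (hypothetical) maintenance log from engineering:
print(naive_tokens("Replaced brg#3 on pump-2A; vibr. alarm @2000RPM cleared."))
# ['replaced', 'brg#3', 'on', 'pump-2a;', 'vibr.', 'alarm', '@2000rpm', 'cleared.']
# Is "brg#3" one thing or two? Is "pump-2a;" the same asset as "pump-2A"?
```

Nothing errors out; the borrowed pattern simply hands back “data” whose fitness for the domain was never examined: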

Despite recent dramatic successes, natural language processing (NLP) is not ready to address a variety of real-world problems. Its reliance on large standard corpora, a training and evaluation paradigm that favors the learning of shallow heuristics, and large computational resource requirements, makes domain-specific application of even the most successful NLP techniques difficult.

Dima et al. [2021]

Without asking those pedantic questions about assumptions and patterns, treating our text as “data” that awoke in the darkness of ~~Khazad-dûm~~ someone’s NLP model will fundamentally contradict the purpose of those evidence-based fields.

A View from Engineering

All of this is not to say we shouldn’t use NLP or other text-analysis tools at all. They can be quite beneficial to us! The key is applying them critically and transparently.

All models are wrong; some are useful.

George Box (and many since)

The key, then, is measuring and adjusting the model based on its usefulness. I like to call engineering “the art of applying science to solving people’s problems”.1 I think this turn of phrase captures something important about using science to solve problems, whether that is building cities, designing cars, or doing data analysis for a policy firm: it’s somewhat of an artform.

There are many, many subjective and context-sensitive decisions that will get made throughout the entire process. But engineers, typically in positions of public trust, have developed systems, checks, and communities that all overlap to try to mitigate the inherent risk in subjective decision-making.2

So we might say engineering is applying science, but

supplemented by necessary art […] the know-how built up and handed down from past experience.

[The know-how] is also called engineering practice. In civil engineering, engineering practice refers to a body of knowledge, methods, and rules of thumb that consist of accepted techniques for solving problems and conducting business.

The reason for the existence of this body of knowledge, engineering practice, is that engineers are accountable. If a structure fails, the engineer is the one who is probably going to be held responsible. Nobody knows everything, and mistakes will happen despite the best preparation possible. Engineers must show that they performed their duties according to the best of their abilities in accordance with accepted standards. This is called performance. The engineer’s defense will be based on demonstrating that he or she followed acceptable engineering practice.

[Hutcheson, 2003]

Unfortunately, such a “body of knowledge” and practice is sorely lacking within data science and text analysis, compared to what exists for, say, building a dam, an aircraft, or a nuclear power plant. How does an analyst show they performed their duty according to their best ability? What is “accepted text-as-data practice”?

This is why I have organized this textbook with chapters named after steps of the “Engineering Approach”. It is not an engineering course, and you most definitely do not need to care about becoming an engineer to finish it. For the record, computer science has been applying engineering principles for a long time (hence the software-testing book quoted above).

Instead, these sections are meant as an admittedly unorthodox way to introduce text analysis fundamentals in an application-centered manner. Always remember who might be trusting you with their data, their language, and their ear, when you analyse text-as-data.

Steps in Practicing the Engineering Approach (a rough sketch in code follows the list)

  • Goals & Approaches

    “State the methods followed & why.”

  • Assumptions

“State your assumptions.”

  • Measure & Evaluate

    “Apply adequate factors of safety.”

  • Validate

“Always get a second opinion.”
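As a rough, hypothetical illustration (every name and number below is invented for the example, and keyword matching merely stands in for whatever method you actually choose), here is what those four steps might look like baked into even a tiny text-analysis script:

```python
"""A toy text-as-data analysis, structured around the four steps above.

GOALS & APPROACHES: estimate how often maintenance logs mention bearings,
using transparent keyword matching (chosen for auditability, not accuracy).
"""

# ASSUMPTIONS -- stated up front, so a reviewer can challenge them:
#   1. Each log is one short English string describing one event.
#   2. Bearing-related events contain one of these surface forms.
BEARING_FORMS = {"bearing", "bearings", "brg"}

def mentions_bearing(log: str) -> bool:
    # Whitespace split plus punctuation/digit stripping: assumption 1 again.
    tokens = {tok.strip(".,;#0123456789").lower() for tok in log.split()}
    return not tokens.isdisjoint(BEARING_FORMS)

logs = ["Replaced brg#3 on pump-2A", "Lubed motor bearing", "Reset breaker"]

# MEASURE & EVALUATE: report the estimate *and* what it rests on.
hits = [log for log in logs if mentions_bearing(log)]
print(f"{len(hits)}/{len(logs)} logs mention bearings: {hits}")

# VALIDATE: get a second opinion -- e.g., hand-label a random sample and
# compare, before anyone makes a decision based on the count above.
```

None of this makes the keyword matcher good; it makes the choices inspectable, which is the point.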

References

DLH+21

Alden Dima, Sarah Lukens, Melinda Hodkiewicz, Thurston Sexton, and Michael P. Brundage. Adapting natural language processing for technical text. Applied AI Letters, 2(3):e33, 2021. URL: https://onlinelibrary.wiley.com/doi/abs/10.1002/ail2.33, doi:10.1002/ail2.33.

Hut03

Marnie L. Hutcheson. Software testing fundamentals: methods and metrics. John Wiley & Sons, 2003.


1

I recently discovered a much more elegant version of this. From Jayshree Seth’s award letter to the Society of Women Engineers:

the art of applying science to life.

2

This video by Grady Hillhouse (Practical Engineering) is a succinct insight into this complex web of decisions and checks and balances that were intended to prevent disaster, as well as the processes that kick into gear as soon as disaster occurs to understand why. If you watch the video, pay attention to how many of the decisions were necessarily subjective, yet made transparently enough that this failure event could serve as a solid case study for the future. As Grady says in the video description:

Whether they realized it or not, the people living and working downstream of Oroville Dam put their trust in the engineers, operators, and regulators to keep them safe and sound against disaster. In this case, that trust was broken.