Rules for Entities, Rules for Relations#

We’ve used words like content and keywords to refer to “stuff we’re interested in”, so let’s come up with a more unified term.

Entity

Something that exists independently, and is uniquely identifiable: a thing we’re interested in.

In the case of our tokens, many of the tokens-of-interest were words instantiating some entity. Like tokens, entities can have types: e.g. London and New York are both entities that are locations, so location is an entity type.

In addition, we often know something useful about how entities relate to one another.

Relation

captures how entities are related to one another

This should be familiar from e.g. data modeling. If not, one way to think of this is

  • entities ≈ nouns

  • relations ≈ verbs

In this section, we’ll take a look at how people systematically express their intent about entities and relations: ways they write rules for them.

Entity Legos: Regular Expressions (RegEx)#

Recall the example above:

we missed keyword occurrences in the text when the case was not taken into account

Hard-coding “lowercase” is a viable option, but case is only the tip of the text-variation iceberg. The answer to

“How do I find strings that fit some pattern I’m looking for?”

is almost always

Regular Expressions

What is RegEx#

RegEx is a mini-language for writing patterns that you want to find in a bunch of text. The most basic use is to match exact strings:

| pattern | meaning | example |
|---------|---------|---------|
| `a` | match character `a` | **a**pple |
| `the` | match string `the` | **the**re went **the** last one! |

This is the same as what we did above for keywords: `my_pattern in my_text`. By default, RegEx patterns are case-sensitive! (`A` will not match `a`.)
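For instance, in Python (a quick sketch; the strings here are illustrative):

import re

text = 'There went the last one!'

# case-sensitive by default: the "The" hiding in "There" is missed
print(re.findall('the', text))                 # ['the']

# relax that with the IGNORECASE flag (more on flags below)
print(re.findall('the', text, re.IGNORECASE))  # ['The', 'the']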

RegEx also has several special characters. E.g., characters inside square brackets `[ ]`, or patterns separated with `|`, are options:

| pattern | meaning | example |
|---------|---------|---------|
| `[Tt]h` | `th` with upper- or lowercase `t` | **Th**e end of **th**e road |
| `[tT]he\|end` | either `[Tt]he` or `end` | **The** **end** of **the** road |
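A quick check of that second pattern in Python:

import re

print(re.findall(r'[Tt]he|end', 'The end of the road'))
# ['The', 'end', 'the']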

Character classes

| meaning | syntax |
|---------|--------|
| character set | `[ABC]` |
| negated set | `[^ABC]` |
| range | `[A-Z]` |
| any character except line break (dot) | `.` |
| unicode grapheme | `\X` |
| word character (letters, digits, `_`) | `\w` |
| not a word character | `\W` |
| digit / not a digit | `\d`, `\D` |
| whitespace / not whitespace | `\s`, `\S` |
| horizontal whitespace / not | `\h`, `\H` |
| vertical whitespace / not | `\v`, `\V` |
| line break / not a line break | `\R`, `\N` |
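A few of these in action (note: Python’s built-in `re` supports `\w`, `\d`, and `\s`, but not e.g. `\X`, `\h`, or `\R`; those need the third-party `regex` package):

import re

msg = 'Order #42 shipped on 2024-01-05'

print(re.findall(r'\d+', msg))     # ['42', '2024', '01', '05']
print(re.findall(r'\w+', msg))     # ['Order', '42', 'shipped', 'on', '2024', '01', '05']
print(re.findall(r'[^\s]+', msg))  # ['Order', '#42', 'shipped', 'on', '2024-01-05']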

Anchors

These match positions, not characters!

| meaning | syntax |
|---------|--------|
| beginning | `^` |
| end | `$` |
| string begin | `\A` |
| string end | `\Z` |
| string end (no trailing `\n`) | `\z` |
| word boundary / not | `\b`, `\B` |
| end of last match | `\G` |
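A quick sketch of position-matching in Python (note that `^` only matches at every line start with the multiline flag):

import re

lines = 'the cat\nthe hat'

print(re.findall(r'^the', lines))        # ['the'] -- string start only
print(re.findall(r'^the', lines, re.M))  # ['the', 'the'] -- every line start
print(re.findall(r'\bhat\b', lines))     # ['hat'] -- whole word only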

Escaped Characters

These have special meanings, so the character needs escaping to be matched literally.

| meaning | syntax |
|---------|--------|
| reserved characters | `+ * ? ^ $ \ . [ ] { } ( ) \| /` |
| escape a reserved character | `\` + character, e.g. `\+` |
| escape everything between | `\Q…\E` |
| tab | `\t` |
| line feed (newline) | `\n` |
| carriage return | `\r` |
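For instance, matching a literal dot (a quick sketch):

import re

# "." is reserved (it matches any character), so escape it to mean a literal dot
print(re.findall(r'3\.14', 'pi is 3.14, not 3514'))  # ['3.14']
print(re.findall(r'3.14', 'pi is 3.14, not 3514'))   # ['3.14', '3514'] -- oops!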

Capture Groups

“Hold on” to patterns for later, or group tokens into a single, bigger pattern.

| meaning | syntax |
|---------|--------|
| capture group | `(ABC)` |
| reference group | `\1` (`$1`), `\2` (`$2`), etc. |
| named group | `(?'name'ABC)` |
| reference by name | `\k'name'` |
| non-capturing group | `(?:ABC)` |

For other things like atomic groups, branch resets, or subroutine definitions, see a regex reference manual or webpage! :)
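Note that the syntax varies by language: Python spells named groups `(?P<name>ABC)` and references them with `(?P=name)`. A quick sketch:

import re

# a named group and a positional group, pulled back out of the match
m = re.search(r'(?P<area>\d{3})-(?P<line>\d{4})', 'call 555-0123 today')
print(m.group(0))       # '555-0123' (the whole match)
print(m.group('area'))  # '555'
print(m.group(2))       # '0123' (groups are also numbered)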

Lookaround

Sometimes you want to make sure stuff does (or does not) “exist”, but don’t need that stuff itself.

| meaning | syntax |
|---------|--------|
| positive lookahead | `(?=ABC)` |
| negative lookahead | `(?!ABC)` |
| positive lookbehind | `(?<=ABC)` |
| negative lookbehind | `(?<!ABC)` |
| discard everything matched so far | `\K` |

E.g. `(?<=New )York` will return the match `York`, but only when it sees `New ` in front of it.

Be careful, not all languages’ regex implementations support all of these!
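Python does support lookaround (with fixed-width lookbehinds); the example above looks like this:

import re

text = 'New York and old York'

# lookbehind: match "York" only when preceded by "New "
print(re.findall(r'(?<=New )York', text))  # ['York'] (only the first one)

# negative lookbehind: the other York
print(re.findall(r'(?<!New )York', text))  # ['York'] (only the "old" one)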

Quantifiers & Alternation

How much of something? Also, boolean operations (kind of…).

| meaning | syntax |
|---------|--------|
| one or more (plus) | `ABC+` |
| zero or more (star) | `ABC*` |
| repetition count | `{n}` or `{start,stop}` |
| zero or one (optional) | `?` |
| make lazy | quantifier + `?` (e.g. `+?`) |
| alternation (OR) | `\|` |

These are incredibly useful, since you can build sophisticated patterns using smaller, specific building blocks:

Entity Legos!
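For example, a (deliberately simplistic, illustrative) phone-number pattern snaps together from the blocks above: an optional area code, three digits, a separator, four digits:

import re

# optional "(ddd) " block, then ddd, then "-" or ".", then dddd
phone = re.compile(r'(?:\(\d{3}\)\s?)?\d{3}[-.]\d{4}')
print(phone.findall('call (555) 867-5309 or 867.5309'))
# ['(555) 867-5309', '867.5309']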

Flags

Configuration for the regex parser.

| meaning | syntax |
|---------|--------|
| ignore case | `i` |
| global search | `g` |
| multiline | `m` |
| dotall (`.` matches `\n`) | `s` |
| ignore literal whitespace | `x` |
| ungreedy | `U` |

How you set them depends on the implementation:

  • after the pattern (e.g. in vim): /pattern/gms

  • as flag arguments (in Python): re.S, re.M, etc.
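In Python, flags can also be written inline at the start of the pattern; a quick sketch:

import re

text = 'The Cat\nthe hat'

# flags as keyword arguments...
print(re.findall(r'^the', text, flags=re.I | re.M))  # ['The', 'the']

# ...or inline, at the start of the pattern
print(re.findall(r'(?im)^the', text))                # ['The', 'the']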

References

  • Need practice? Regex Golf is pretty fun!

    • get the matches, ignore the false matches

    • use as few characters in the pattern as possible

  • Need help? regexr.com

    • docstrings and examples for everything you just saw

    • add unit tests (like “golf”: pass/fail)

    • get automatic explanations for why something matches

Discussion:

What is this for?

`([ab]*)c\1`

# Try below: 
import re
patt = re.compile(r'([ab]*)c\1')

patt.match('aabcaab')
<re.Match object; span=(0, 7), match='aabcaab'>

Note that this matches, but the next one does not:

patt.match('aabcbaa')

This means that the “group” being referenced is evaluated before it gets referenced later: `\1` must be `aab` in this case, not `b` (even though `b` would also be a valid match for `[ab]*`). Our pattern captured `aab`, so `\1` must match exactly that!

Now, the `findall` function returns match groups; in fact, it finds all non-overlapping matches. Since `b` is a valid match group (and is duplicated, so `\1` succeeds), `findall` will return it. This is different from the `match` function!

patt.findall('aabcaab  aabcbaa')
['aab', 'b']

A (pedantic) note#

That example isn’t actually “regular”, in the computer-science sense. It isn’t even “context-free”.

The term regular expression comes from the Chomsky hierarchy, in that it defines a regular language (the strictest subset, below context-free). A regular language is equivalent to a language that can be recognized by a finite automaton.

What we just witnessed (exact-string memory of a captured group) is neither context-free nor regular.

Modern RegEx implementations are majorly “souped-up” with convenience features!

Take a Break (Exercises)#

  • Write an expression to match markdown sections, to return (hashes, header, content) pairs.

    tip: use the “multiline” re.M and “dotall” re.S flags

  • Write expressions to match Wordle words, given feedback (one possible hint is sketched below).

    tip: how can we make a boolean “AND” in regex?
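For that second exercise, here is one possible hint (a sketch, assuming made-up feedback: the word contains an e, contains no a, and has i as its third letter). Stacked lookaheads each assert a condition without consuming characters, which behaves like a boolean “AND”:

import re

# each lookahead must hold at the start of the string: an "AND" of conditions
wordle = re.compile(r'^(?=\w*e)(?!\w*a)\w{2}i\w{2}$')

print(bool(wordle.match('slime')))  # True: has an "e", no "a", third letter "i"
print(bool(wordle.match('slate')))  # False: contains an "a"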

Markdown Parser (worked example)#

One approach to turning markdown into a table of sections

  • group the header “hash” symbols

  • group the text that follows the hash symbols

  • find all text on the next line(s), EXCEPT:

  • stop when the next line (look-ahead) is another header!

patt = re.compile(
    r"(^#+)"  # the header hashes
    r"\s([\w -:]*)"  # the title (could contain non-word chars)
    r"\n+(.*?)"  # the body content
    r"(?=^#|\Z)",  # do not include the next section header!
    flags=re.S | re.M
)
# FIND `patt` SUCH THAT:  
matches = patt.findall(
    """
# This is a Markdown Title
this is _italicized content_.

## This is a Level 2 Subtitle
What more is there to say?
"""
)
import pandas as pd
pd.DataFrame.from_records(matches, columns=['level', 'title', 'content'])
|   | level | title | content |
|---|-------|-------|---------|
| 0 | `#` | This is a Markdown Title | `this is _italicized content_.\n\n` |
| 1 | `##` | This is a Level 2 Subtitle | `What more is there to say?\n` |

Tokenizers#

To split text into tokens, we could match patterns for:

  • what is a word? → re.findall

  • what is not a word? → re.split

It’s worth mentioning that the token we match is not “the same” as the entity. Rather, we are using RegEx to encode an assumption: that pieces of matching text are valid stand-ins for the abstract entity.

This distinction becomes especially important when our human languages have e.g. polysemy: the same token should point to completely separate entities, depending on context. Still, if we use RegEx as a mechanism to find tokens meant to stand in for entities of interest, that rule is standing in for our intent.

Whitespace Tokenizer

my_str = "Isn't this fun? It's time to tokenize!"
re.compile(r'\s').split(my_str)
["Isn't", 'this', 'fun?', "It's", 'time', 'to', 'tokenize!']

This looks pretty good! Notice, though, that the punctuation is sometimes helpful (“it’s”, “isn’t”), but often adds unnecessary extras (“fun?” and “tokenize!” are ostensibly the same entities as “fun” and “tokenize”).

Scikit-Learn-Style Tokenizer

A very popular way to tokenize text, especially given the intended use-case (statistical NLP, with matrices):

`\b\w\w+\b`

re.compile(r"\b\w\w+\b").findall(my_str)
['Isn', 'this', 'fun', 'It', 'time', 'to', 'tokenize']

This time, we avoid punctuation, but mangle contractions. It’s not uncommon to remove punctuation entirely as a preprocessing step.

Does that always make sense?

“Technical” Tokens for Technical Entities#

A lot of times the above assumptions won’t cut it, especially if there are specific technical entities that a token needs to reference. Here are a few patterns I have seen in e.g. Maintenance Work Order text:

| Pattern | Example | Description |
|---------|---------|-------------|
| `\#[\w\d]+\b` | `#T43H5sw` | IDs for machines, positions |
| `\b\w[\/\&]\w\b` | `P&G`, `A/C` | split bigrams, common shorthands |
| `\b\w[\w\'\d]+\b` | `won't` | contractions (`won` vs. `won't`) |

re.compile(
    r'(?:\#[\w\d]+\b)'
    r'|(?:\b\w[\/\&]\w)\b'
    r'|(?:\b\w[\w\'\d]+)\b'
).findall(
    "Stop H43; repos. to #5 grade."
    "Carburetor won't start."
)
['Stop', 'H43', 'repos', 'to', '#5', 'grade', 'Carburetor', "won't", 'start']

Relational Pedantry: Graphs & Ontologies#

How we define and standardize known relations between entities.

Now that we have ways to explicitly define what we intend an entity occurrence to look like, we can start to explicitly define the ways that entities relate to one another.

This is a (mathematical) graph, if we think of entities as nodes and relationships as edges:

import graphviz
graphviz.Source(
    'digraph "entities and relations" '
    '{ rankdir=LR; '
    'A -> B [label="relation A:B"] '
    'C -> B [label="relation C:B"]'
    '}')
(rendered graph image)

There are many ways to use graphs to express how entities relate to one another, but two of the most common are

  • Labeled Property Graphs, e.g. NoSQL graph databases like Neo4j, JanusGraph, etc.
    Nodes and edges both have unique IDs, and can have internal properties (key-value pairs).

  • Triple Stores, i.e. the Resource Description Framework (RDF).
    All information is stored as triples: (subject, predicate, object). Every vertex has a unique identifier (no internal information).

For more information comparing the two paradigms, see this breakdown of a talk by Jesús Barrasa. We will return to these, and how knowledge engineering works (and interfaces with text data) more generally in later sections.

LPGs come from a data-storage and querying community (think databases), while RDF comes from a web-technology culture (think W3C and the Semantic Web). We will return to these later, but for now we want to focus on the forms that assumptions about entity relationships might take.
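To make the contrast concrete, here is a minimal sketch in plain Python (no particular database; all names and values are made up for illustration) of the same fact in both styles:

# triple-store style: everything is a (subject, predicate, object) triple
triples = [
    ('wheel', 'part_of', 'bicycle'),
    ('bicycle', 'is_a', 'vehicle'),
]

# labeled-property-graph style: nodes and edges carry their own key-values
nodes = {
    'n1': {'label': 'wheel'},
    'n2': {'label': 'bicycle', 'wheel_count': 2},
}
edges = [
    {'start': 'n1', 'end': 'n2', 'type': 'PART_OF', 'confidence': 0.9},
]

# querying triples is just pattern-matching over tuples
print([s for s, p, o in triples if p == 'part_of' and o == 'bicycle'])
# ['wheel']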

What kinds of relations are out there already?

WordNet#

Lexical database for nouns, verbs, adverbs, and adjectives.

  • Groups terms into sets of synonyms (synsets) that have meaning

  • Several types of relationships are available

    • Hyper/hyponyms (is-a), e.g. a bed is a (piece of) furniture

    • Mero/holonyms (part-of), e.g. bread is part-of a sandwich

    • Entailment (implies), e.g. snore implies sleep

  • Distinguishes between types (e.g. President) and instances (Abraham Lincoln)

import nltk
# make sure the needed corpora are available, downloading on first run
try:
    nltk.data.find('tokenizers/punkt')
    nltk.data.find('corpora/wordnet')
except LookupError:
    nltk.download('punkt')
    nltk.download('wordnet')
    nltk.download('omw-1.4')

from nltk.corpus import wordnet as wn
wn.synsets('dog')
[Synset('dog.n.01'),
 Synset('frump.n.01'),
 Synset('dog.n.03'),
 Synset('cad.n.01'),
 Synset('frank.n.02'),
 Synset('pawl.n.01'),
 Synset('andiron.n.01'),
 Synset('chase.v.01')]
print(wn.synset('dog.n.01').definition())
print(wn.synset('dog.n.01').lemma_names())
a member of the genus Canis (probably descended from the common wolf) that has been domesticated by man since prehistoric times; occurs in many breeds
['dog', 'domestic_dog', 'Canis_familiaris']
# Hyper/Hyponyms
print('pasta could be: ', [i.lemma_names() for i in wn.synset('pasta.n.01').hyponyms()])
print('pasta is a: ', wn.synset('pasta.n.01').hypernyms())
pasta could be:  [['cannelloni'], ['lasagna', 'lasagne'], ['macaroni_and_cheese'], ['spaghetti']]
pasta is a:  [Synset('dish.n.02')]
# Holo/Meronyms
print('a living room is part of a: ', wn.synset('living_room.n.01').part_holonyms()[0].lemma_names())
a living room is part of a:  ['dwelling', 'home', 'domicile', 'abode', 'habitation', 'dwelling_house']
graphviz.Source('digraph {rankdir=LR '
                'living_room -> dwelling [label="part of"];'
                'dwelling -> home [label="same as"]'
                'dwelling -> abode [label="same as"]}')
(rendered graph image)
[i.lemma_names() for i in wn.synset('car.n.01').part_meronyms()]
[['accelerator', 'accelerator_pedal', 'gas_pedal', 'gas', 'throttle', 'gun'],
 ['air_bag'],
 ['auto_accessory'],
 ['automobile_engine'],
 ['automobile_horn', 'car_horn', 'motor_horn', 'horn', 'hooter'],
 ['buffer', 'fender'],
 ['bumper'],
 ['car_door'],
 ['car_mirror'],
 ['car_seat'],
 ['car_window'],
 ['fender', 'wing'],
 ['first_gear', 'first', 'low_gear', 'low'],
 ['floorboard'],
 ['gasoline_engine', 'petrol_engine'],
 ['glove_compartment'],
 ['grille', 'radiator_grille'],
 ['high_gear', 'high'],
 ['hood', 'bonnet', 'cowl', 'cowling'],
 ['luggage_compartment', 'automobile_trunk', 'trunk'],
 ['rear_window'],
 ['reverse', 'reverse_gear'],
 ['roof'],
 ['running_board'],
 ['stabilizer_bar', 'anti-sway_bar'],
 ['sunroof', 'sunshine-roof'],
 ['tail_fin', 'tailfin', 'fin'],
 ['third_gear', 'third'],
 ['window']]

Problem
WordNet is for general English, and is not exhaustive. If using it for technical text, expect decent precision, but poor recall.

[i.lemma_names() for i in wn.synset('bicycle.n.01').part_meronyms()]
[['bicycle_seat', 'saddle'],
 ['bicycle_wheel'],
 ['chain'],
 ['coaster_brake'],
 ['handlebar'],
 ['kickstand'],
 ['mudguard', 'splash_guard', 'splash-guard'],
 ['pedal', 'treadle', 'foot_pedal', 'foot_lever'],
 ['sprocket', 'sprocket_wheel']]

Meanwhile, from an engineering paper discussing bicycle drivetrains:

graphviz.Source("""graph g { 
  graph[rankdir=LR, center=true, margin=0.2, nodesep=0.05, ranksep=0.1, bgcolor="transparent";];
  node[shape=plaintext, color=none, width=0.1, height=0.2, fontsize=11]
  pedal -- crank_arm;
  crank_arm -- chain_rings;
  chain_rings -- rear_derailleur;
  chain_rings -- chain;
  chain -- cogset;
  cogset -- rear_hub;
  rear_derailleur -- rear_hub;
  rear_hub -- rear_spokes;
  rear_spokes -- rear_rim;
  rear_rim -- rear_tire;
  
}
""")
(rendered graph image)
# Entailments
print('to buy means to: ', [i.lemma_names() for i in wn.synset('buy.v.01').entailments()])
print('to snore means to: ', [i.lemma_names() for i in wn.synset('snore.v.01').entailments()])
to buy means to:  [['choose', 'take', 'select', 'pick_out'], ['pay']]
to snore means to:  [['sleep', 'kip', 'slumber', "log_Z's", "catch_some_Z's"]]


ConceptNet#

  • Crowdsourced knowledge: Open Mind Common Sense, Wiktionary, DBPedia, Yahoo Japan / Kyoto University project

  • Games with a purpose: Verbosity, nadya.jp

  • Expert resources: Open Multilingual WordNet, JMDict, CEDict, OpenCyc, CLDR emoji definitions

Also uses graph embeddings (we’ll come back to this later) to “fuzzify” knowledge relationships.

see: ConceptNet in Context, https://rcqa-ws.github.io/slides/robyn.pdf

How it works:
ConceptNet builds on WordNet and many others, using nodes and more generic “relations”

Interestingly, these are not the “edges”… edges are assertions that have start and end-nodes, and have a relation property.

  • Edges can also have sources, weights (for uncertainty), licenses, datasets, “surfaceText” that generated the assertion, etc.

import requests

# small helper: query the ConceptNet REST API for an English-language concept
def conceptnet_query(q):
    url = 'http://api.conceptnet.io/c/en/'
    return dict(requests.get(url + q).json())

for i in conceptnet_query('bicycle?rel=/r/PartOf')['edges']:
    print('{}:\t{}'.format(i['rel']['label'],i['surfaceText']))
AtLocation:	You are likely to find [[a bicycle]] in [[the garage]]
HasA:	[[A bicycle]] has [[two wheels]]
UsedFor:	[[a bicycle]] is for [[transportation]]
AtLocation:	*Something you find on [[the street]] is [[a bicycle]]
IsA:	[[A bicycle]] is [[a two wheel vehicle]]
UsedFor:	[[a bicycle]] is used for [[riding]]
MadeOf:	[[a bicycle]] can be made of [[metal]]
PartOf:	[[a wheel]] is part of [[a bicycle]]
HasA:	[[a bicycle]] has [[a chain]]
HasA:	[[A bicycle]] has [[two tires]]
AtLocation:	*Something you find at [[a toy store]] is [[a bicycle]]
UsedFor:	[[a bicycle]] is for [[Racing]]
Synonym:	[[چرخیدن]] is a translation of [[bicycle]]
IsA:	[[ordinary]] is a type of [[bicycle]]
IsA:	[[a bicycle]] is a type of [[transportation]]
Synonym:	[[bersepeda]] is a translation of [[bicycle]]
PartOf:	[[pedal]] is a part of [[bicycle]]
Synonym:	[[bicicleta]] is a translation of [[bicycle]]
Synonym:	[[berbasikal]] is a translation of [[bicycle]]
Synonym:	[[pedalear]] is a translation of [[bicycle]]

Play around! see: https://conceptnet.io/c/en/bicycle?rel=/r/PartOf&limit=1000

Takeaway#

The “top-down” approach to transforming text into “something computable” is to express your intent as rules.

Entities

  • “things that exist”, and can have types or instances.

  • we must map text occurrences (tokens) to entities using rules, e.g. RegEx

Relationships

  • how entities relate to each other, often as “verbs”, but more generally as “predicates”.

  • Will often see holonyms/meronyms, hypernyms/hyponyms, entailment, and synonyms

  • How they are represented greatly depends on domain and historical use-case (database vs. web, etc.)

We’ve only scratched the surface of ontologies and entity relationships, but the key point for this chapter is that people had to write all of this down! For rules-based systems, the name of the game is human input, which can be incredibly useful and powerful, while also being very fragile in context-specific applications. It also takes a lot of work to write down all of these rules, let alone validate them.

Using one of these pre-existing rules-based systems means making, whether admitted or not, an important assumption about the applicability of those rules to your problem!