|Rule-Based Matching In Natural Language Processing|
|Written by Jannes Klaas|
|Monday, 20 May 2019|
spaCy is an open-source software library for advanced natural language processing, written in Python and Cython. Here it is used to build a rule-based matcher that always classifies the word "iPhone" as a product entity.
This is an excerpt from the book Machine Learning for Finance, written by Jannes Klaas. The book introduces machine learning and deep learning algorithms to financial practitioners.
Before deep learning and statistical modeling took over, natural language processing was all about rules. That's not to say that rule-based systems are dead! They are often easy to set up and perform very well at doing simple tasks.
Imagine you wanted to find all mentions of Google in a text. Would you really train a neural-network-based named entity recognizer, run all the text through it, and then look for Google in the entity texts? Or would you rather just search for text that exactly matches Google with a classic search algorithm? spaCy comes with an easy-to-use rule-based matcher that allows us to do just that.
Before we start this section, we must first make sure that we reload the English language model and import the matcher. This is a very simple task that can be done by running the following code:

import spacy
from spacy.matcher import Matcher

nlp = spacy.load('en')
The matcher searches for patterns that we encode as a list of dictionaries. It operates token by token; a token is usually a word, but punctuation marks and numbers are tokens of their own.
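To see this tokenization in action, we can inspect a tokenized document. The sketch below uses a blank English pipeline (spacy.blank('en'), which needs no trained model) rather than the loaded model, so it is self-contained:

```python
import spacy

# A blank pipeline provides the English tokenizer without a trained model
nlp = spacy.blank('en')

doc = nlp('Hello, world!')
tokens = [token.text for token in doc]
print(tokens)  # → ['Hello', ',', 'world', '!']
```

Note how the comma and the exclamation mark each become a token of their own, while each word is one token.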
As a starting example, let’s search for the phrase "Hello, world." We will define a pattern as follows:
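Following the description below, such a pattern is a list with one dictionary per token to be matched:

```python
# One dictionary per token: lowercase "hello", then any
# punctuation mark, then lowercase "world"
pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]
```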
This pattern is fulfilled if the lowercase form of the first token is "hello." That means an actual token text of "Hello" or "HELLO" would also fulfill the requirement. The second token has to be punctuation, so the phrases "hello. world" or "hello! world" would both work, but not "hello world."
The lowercase form of the third token has to be "world," so "WoRlD" would also be fine.
A token dictionary can match on a number of attributes, such as ORTH (the exact text), LOWER, LEMMA, POS, TAG, and SHAPE, as well as boolean flags such as IS_PUNCT, IS_DIGIT, and LIKE_NUM.
spaCy’s lemmatization is extremely useful. A lemma is the base form of a word. For example, "was" is a form of "be," so "be" is the lemma for "was" as well as for "is." spaCy can lemmatize words in context, meaning it uses the surrounding words to determine the actual base form of a word.
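As a sketch, a pattern built on the LEMMA attribute matches every inflection of a word at once. Note that lemma matching requires a pipeline with a trained model (for example en_core_web_sm), not a blank tokenizer-only pipeline:

```python
# Matches "be", "is", "was", "were", "being", ... via the shared lemma
pattern = [{'LEMMA': 'be'}]
```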
To create a matcher, we have to pass in the vocabulary the matcher works on. In this case, we can just pass the vocabulary of our English language model by running:
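The call in question is the standard Matcher constructor, which takes the pipeline's vocabulary. A blank English pipeline is used here so the sketch is self-contained; in the book's flow, nlp is the loaded English model:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')  # stands in for the loaded English model

# The matcher shares the vocabulary of the pipeline it will run on
matcher = Matcher(nlp.vocab)
```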
In order to add the required attributes to our matcher, we must call:
The add function expects three arguments: a name identifying the match rule, an optional callback function that fires whenever a match is found (None if unused), and the pattern itself.
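A minimal self-contained sketch of that call, using a blank English pipeline and the "Hello, world" pattern described above. In spaCy v2, which the book's spacy.load('en') call implies, the call is matcher.add('HelloWorld', None, pattern); spaCy v2.2.2+ and v3 also accept the patterns wrapped in a list as the second argument, which is used here:

```python
import spacy
from spacy.matcher import Matcher

nlp = spacy.blank('en')  # blank English pipeline; stands in for the loaded model
matcher = Matcher(nlp.vocab)

pattern = [{'LOWER': 'hello'}, {'IS_PUNCT': True}, {'LOWER': 'world'}]

# Newer spaCy versions take the patterns as a list; in older v2
# the equivalent call was matcher.add('HelloWorld', None, pattern)
matcher.add('HelloWorld', [pattern])
```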
To use our matcher, we can simply call matcher(doc). This will give us back all the matches the matcher found. We can call this by running:

doc = nlp(u'Hello, world! Hello world!')
matches = matcher(doc)
If we print out the matches, we can see the structure:

matches

[(15578876784678163569, 0, 3)]
The first element of a match is the hash of the string found. This is just to identify what was found internally; we won't use it here. The next two numbers indicate the range in which the matcher found something; here the match spans tokens 0 up to, but not including, token 3.
We can get the text back by indexing the original document:

doc[0:3]

Hello, world