Tokenization: Segmenting text into single tokens
The first activity in most NLP pipelines is to split the text into words or sub-words, called “tokens”. There are three main tokenization approaches: splitting at the word level, at the sub-word level, or at the character level.
Some NLP pipelines perform extra pre-processing activities. They may remove non-alphabetic data (such as punctuation), reduce tokens to their canonical form (for example by stemming or lemmatization), or attempt to automatically fix errors in the text.
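As a minimal sketch, the snippet below illustrates word-level and character-level tokenization using only Python's standard library; the lower-casing and punctuation stripping stand in for the pre-processing described above. Sub-word tokenization is learned from data in practice (e.g. Byte-Pair Encoding) and is not shown here; the example sentence is made up for illustration.

```python
import re

text = "NLP pipelines split text into tokens!"

# Word-level tokenization: lower-case, strip punctuation, keep alphanumeric runs.
words = re.findall(r"[a-z0-9]+", text.lower())
print(words)       # ['nlp', 'pipelines', 'split', 'text', 'into', 'tokens']

# Character-level tokenization: every character becomes a token.
chars = list(text)
print(chars[:5])   # ['N', 'L', 'P', ' ', 'p']
```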
Embedding: Converting words into numbers
Once the sentence is broken into a collection of tokens, we have to convert these tokens into numerical vectors. These vectors are the representation of the words that is fed as input to the model that will extract the needed information from them.
This conversion of words or sub-words to numerical values can be learned independently, typically from large corpora of text that contain many instances of each word in all sorts of contexts. With transformer models, this conversion step is typically learned together with the rest of the model, which is trained on sequences from the same large corpora.
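A minimal sketch of what this conversion looks like mechanically: each token ID indexes a row of an embedding matrix. The vocabulary, dimensions, and random weights below are assumptions for illustration; in a real system the weights are learned during training rather than drawn at random.

```python
import numpy as np

# Illustrative vocabulary mapping tokens to integer IDs (assumed for this sketch).
vocab = {"nlp": 0, "pipelines": 1, "split": 2, "text": 3, "into": 4, "tokens": 5}
embedding_dim = 8

# Embedding matrix: one row (vector) per token in the vocabulary.
rng = np.random.default_rng(0)
embedding_matrix = rng.normal(size=(len(vocab), embedding_dim))

tokens = ["nlp", "pipelines", "split", "text"]
token_ids = [vocab[t] for t in tokens]

# Looking up an embedding is just indexing into the matrix.
vectors = embedding_matrix[token_ids]   # shape: (4, 8)
print(vectors.shape)
```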
Information Extraction
Once the sequences of words are converted to sequences of numerical vectors, models can be trained to recognize patterns in the sequences that correspond to information we are interested in. For example, sentences and documents contain special words known as “entities”, such as people’s names, addresses, dates, currencies, or product names. Named Entity Recognition (NER) identifies these entities so that they can be processed appropriately. For example, we may want to find instances of personally identifiable information (PII) in documents, or we may be looking for company-confidential information.
NER is only one of the ways in which models can extract information from text. NLP also includes tasks such as sentiment analysis, phrase extraction, named-entity disambiguation and linking, relation extraction, and event extraction.
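To make the PII use case above concrete, here is a deliberately simple rule-based sketch. Real NER systems are trained models rather than regular expressions; the patterns, labels, and example text below are assumptions made purely for illustration of what “extracting entities” means.

```python
import re

# Illustrative patterns; production NER relies on trained models, not just regexes.
patterns = {
    "EMAIL": r"[\w.+-]+@[\w-]+\.[\w.]+",
    "DATE":  r"\b\d{4}-\d{2}-\d{2}\b",
}

text = "Contact jane.doe@example.com before 2024-05-01 about the contract renewal."

for label, pattern in patterns.items():
    for match in re.finditer(pattern, text):
        print(label, match.group())
# EMAIL jane.doe@example.com
# DATE 2024-05-01
```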
Model Development
In 2018 came a paradigm-shifting NLP technique called Bidirectional Encoder Representations from Transformers, or BERT as it is commonly known. BERT uses attention mechanisms to encode representations of the words in a sentence and form correlations between them. It is unique in that it factors the context both before and after each word into its representation, and it builds its own statistical understanding of a language by pre-training on unlabeled corpora of text (e.g. Wikipedia). BERT is a state-of-the-art language model: it can be fine-tuned to perform a wide variety of business- or industry-specific NLP tasks, ranging from intent and sentiment analysis to question answering.
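As a sketch of what fine-tuning looks like in practice, the snippet below loads a pre-trained BERT checkpoint with the Hugging Face transformers library and attaches a two-class classification head (e.g. for sentiment). The checkpoint name, label count, and example sentence are assumptions for illustration; an actual fine-tuning run would also need a labeled dataset and a training loop.

```python
# Requires: pip install transformers torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification
import torch

# Pre-trained BERT checkpoint (an assumed choice for this example).
checkpoint = "bert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# Adds a randomly initialized classification head with 2 labels (e.g. positive / negative);
# this head is what gets trained during fine-tuning on task-specific labeled data.
model = AutoModelForSequenceClassification.from_pretrained(checkpoint, num_labels=2)

inputs = tokenizer("The delivery was fast and the product works great.", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits   # shape (1, 2); scores are meaningless until fine-tuned
print(logits)
```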
Model Evaluation
A model is evaluated based on its internal performance and, in addition, on the business outcomes it provides. The internal performance of a model is measured on a dedicated “testing set” of data with indicators such as precision, recall, and the F1 score. Once the internal performance of the model is sufficient, we need to ensure that the model fulfills the business objectives it was developed for. For instance, for a spam classification model, we would evaluate the internal performance of the model (using the F1 score), and we would also verify with the end users that the model works well in practice, for example by checking that no spam reaches their mailbox and that no legitimate email has been wrongly classified as spam.
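A minimal sketch of the internal evaluation step, assuming a held-out testing set and scikit-learn; the labels below are made up for illustration (1 = spam, 0 = not spam).

```python
from sklearn.metrics import precision_score, recall_score, f1_score

# Ground-truth labels from the testing set and the model's predictions (illustrative values).
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 1, 0, 0, 1, 1, 0]

print("precision:", precision_score(y_true, y_pred))  # of predicted spam, how much really is spam
print("recall:   ", recall_score(y_true, y_pred))     # of real spam, how much was caught
print("f1:       ", f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```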
Model Monitoring and Retraining
Once the model is rolled out to the production environment, it needs to be monitored to ensure its performance remains appropriate. This is also referred to as “human in the loop”: humans are asked to flag cases where the model makes a mistake, and those new labels are saved and added to the training set. From time to time, the model is re-trained, once enough new data is available and we want the model to take it into account.
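The sketch below illustrates this feedback loop under simple assumptions: corrections reported by users are appended to the training data, and a retraining job is triggered once enough new labels have accumulated. The function names and the threshold are placeholders invented for this example, not part of any specific library.

```python
# Human-in-the-loop feedback loop (illustrative sketch; names and threshold are assumptions).
corrections = []          # examples reported by users when the model was wrong
RETRAIN_THRESHOLD = 500   # retrain once this many new labels have accumulated

def record_correction(text: str, correct_label: int) -> None:
    """Store a user-reported correction for later retraining."""
    corrections.append((text, correct_label))
    if len(corrections) >= RETRAIN_THRESHOLD:
        retrain_model(corrections)
        corrections.clear()

def retrain_model(new_examples) -> None:
    """Placeholder: merge new examples into the training set and re-run training."""
    print(f"Retraining on {len(new_examples)} newly labeled examples...")
```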