Segmenting the text into single tokens (Tokenization)
The first step in most NLP pipelines is to split the text into words or sub-words, called “tokens”. Tokenization approaches fall into three types: word-level, sub-word-level, and character-level.
Some NLP pipelines perform extra pre-processing steps. They may remove non-alphabetic data (such as punctuation), reduce tokens to their canonical form (for example by stemming), or attempt to automatically fix errors in the text.
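As a minimal sketch, word-level tokenization with punctuation removal can be done in a few lines of pure Python (real sub-word tokenizers, such as BPE-based ones, are considerably more involved):

```python
import re

def tokenize(text: str) -> list[str]:
    # Lowercase the text, then keep only alphanumeric runs:
    # punctuation and other symbols are dropped.
    return re.findall(r"[a-z0-9]+", text.lower())

print(tokenize("Tokenization splits text, doesn't it?"))
# ['tokenization', 'splits', 'text', 'doesn', 't', 'it']
```

Note how crude word-level splitting mangles contractions like "doesn't"; this is one reason production pipelines prefer learned sub-word tokenizers.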
Converting the words into numbers (Embedding)
Once the sentence is broken into a collection of tokens, we have to convert these tokens into numerical vectors. These vectors are what represent the words as input to the model that extracts the needed information from them.
This conversion of words or sub-words to numerical values can be learned independently, typically from large corpora of text that contain many instances of each word in all sorts of contexts. With transformer models, this conversion step is typically learned together with the rest of the model, which is trained on sequences from the same large corpora.
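Mechanically, the conversion is a table lookup: each token maps to a row of an embedding matrix. A toy sketch with NumPy (the vocabulary and the random vectors below are hypothetical; real models learn these vectors rather than sampling them):

```python
import numpy as np

# Toy vocabulary and embedding table (hypothetical); real models
# learn these vectors from large corpora during training.
vocab = {"the": 0, "cat": 1, "sat": 2}
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(len(vocab), 4))  # one 4-dim vector per token

def embed(tokens):
    # Map each token to its row in the embedding table.
    return np.stack([embeddings[vocab[t]] for t in tokens])

print(embed(["the", "cat", "sat"]).shape)  # (3, 4)
```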
Once the sequences of words are converted to sequences of numerical vectors, models can be trained to recognize patterns in the sequences that correspond to information we are interested in. For example, sentences and documents contain a set of special words known as “entities”: people’s names, addresses, dates, currencies or product names. Named Entity Recognition (NER) identifies these entities so that they can be processed appropriately. For example, we may want to find instances of personally identifiable information (PII) in documents, or we may be looking for company-confidential information.
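A full NER model is statistical, but the idea of tagging entity spans in text can be illustrated with a toy rule-based PII finder (the patterns below are deliberately simplistic and purely illustrative):

```python
import re

# Toy patterns for two PII-like entity types; real NER uses trained
# models rather than hand-written regular expressions.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.]+@[\w.]+\.\w+\b"),
    "DATE": re.compile(r"\b\d{4}-\d{2}-\d{2}\b"),
}

def find_entities(text: str):
    # Return (label, matched text) pairs for every pattern hit.
    return [(label, m.group()) for label, pattern in PATTERNS.items()
            for m in pattern.finditer(text)]

print(find_entities("Contact jane@example.com before 2024-01-31."))
# [('EMAIL', 'jane@example.com'), ('DATE', '2024-01-31')]
```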
NER is only one of the ways in which models extract information from text. NLP also includes tasks like sentiment analysis, phrase extraction, named entity disambiguation and linking, relation extraction, and event extraction.
Before 2018, NLP models read sentences as an ordered sequence of words. In 2018, transformer models based on the “attention mechanism” changed this paradigm. A transformer model reads the whole sentence at once and uses attention to estimate how strongly each word is related to every other word in the sentence. This ensures that the context of each word is fully taken into account, and none of it is lost or “forgotten” as it could be with the earlier sequential algorithms.
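The core computation behind attention can be sketched in NumPy as scaled dot-product self-attention: each token's output is a weighted average of all the token vectors, with weights derived from pairwise similarity. This is a simplified single-head sketch without the learned projection matrices a real transformer uses:

```python
import numpy as np

def attention(Q, K, V):
    # Scaled dot-product attention: similarity scores between queries
    # and keys become softmax weights over the value vectors.
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over keys
    return weights @ V, weights

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 8))     # 5 token vectors of dimension 8
out, w = attention(X, X, X)     # self-attention: the sentence attends to itself
print(out.shape, w.shape)       # (5, 8) (5, 5)
```

Each row of `w` sums to 1: it is the distribution of attention that one token spreads over the whole sentence, which is exactly the "how strongly are these words related" estimate described above.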
The most famous transformer model is BERT, which stands for Bidirectional Encoder Representations from Transformers. BERT encodes each word (or sub-word) in a sentence in such a way that both the context before and after the word is taken into account in its representation: it is “bidirectional”. It is what’s known as a “language model”, meaning a model that transforms text into its own encoding using its statistical “knowledge” of the language. The procedure used to train BERT relies on unlabeled corpora of text, including the whole of English Wikipedia (2,500 million words).
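Why bidirectionality helps can be illustrated with a toy count-based model (not BERT itself, just an analogy): predicting a masked word from the words on both sides of it, where a left-to-right model would only see the left neighbor.

```python
from collections import Counter, defaultdict

corpus = "the cat sat on the mat . the dog sat on the rug .".split()

# Count which word appears between each (left, right) context pair.
context_counts = defaultdict(Counter)
for left, word, right in zip(corpus, corpus[1:], corpus[2:]):
    context_counts[(left, right)][word] += 1

def predict_masked(left: str, right: str) -> str:
    # Use the context on BOTH sides of the mask, as a bidirectional
    # model does (a left-to-right model would only see `left`).
    return context_counts[(left, right)].most_common(1)[0][0]

print(predict_masked("cat", "on"))  # 'sat'
```

BERT's actual pre-training objective (masked language modeling) is the learned, contextual generalization of this idea.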
Once BERT is “pre-trained” to understand a language, it can be fine-tuned for a task relevant to a specific industry or business, achieving state-of-the-art results on a wide range of NLP tasks such as sentiment analysis, question answering, and intent recognition. These tasks typically require manually labeled data, which is time-consuming and expensive to produce, so the fine-tuning datasets are usually much smaller.
A model is evaluated on its internal performance and on the business outcomes it delivers. The internal performance is measured over a dedicated “testing set” of data with indicators like precision, recall, and the F1 score. Once the internal performance is sufficient, we need to ensure that the model fulfills the business objectives it was developed for.
For instance, for a spam classification model, we would evaluate the internal performance (using the F1 score), and we would also verify with the end users that the model works well, for example by checking that no spam has reached their mailbox and that no legitimate email has been wrongly classified as spam.
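For a binary spam classifier, these indicators follow directly from counts of true positives, false positives, and false negatives. A self-contained sketch with made-up labels:

```python
def precision_recall_f1(y_true, y_pred, positive="spam"):
    # Precision: of the emails flagged as spam, how many really were spam?
    # Recall: of the real spam, how much did we catch?
    tp = sum(t == positive and p == positive for t, p in zip(y_true, y_pred))
    fp = sum(t != positive and p == positive for t, p in zip(y_true, y_pred))
    fn = sum(t == positive and p != positive for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

y_true = ["spam", "spam", "ham", "ham", "spam"]   # ground-truth labels
y_pred = ["spam", "ham", "ham", "spam", "spam"]   # model predictions
print(precision_recall_f1(y_true, y_pred))
```

The F1 score is the harmonic mean of precision and recall, so it penalizes a model that is strong on one indicator but weak on the other.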
Model Monitoring and Retraining
Once the model is rolled out to production, it needs to be monitored to ensure its performance remains adequate. This is often done with a “human in the loop”: humans flag the model’s mistakes, and those corrected labels are saved and added to the training set.
From time to time, the model is retrained, once enough new data is available for the model to take into account.
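A minimal sketch of this loop, assuming a hypothetical retrain step, is a buffer of human corrections plus a threshold check (all names below are illustrative):

```python
# Hypothetical human-in-the-loop buffer: end users flag mistakes,
# and a retrain is triggered once enough corrections accumulate.
corrections: list[tuple[str, str]] = []

def record_correction(text: str, true_label: str) -> None:
    corrections.append((text, true_label))

def maybe_retrain(threshold: int = 3) -> bool:
    if len(corrections) >= threshold:
        # Placeholder: here we would retrain on old data + corrections.
        corrections.clear()
        return True
    return False

record_correction("cheap pills!!", "spam")
record_correction("meeting at 10", "ham")
print(maybe_retrain())  # False: not enough new labels yet
record_correction("win a prize", "spam")
print(maybe_retrain())  # True: threshold reached, buffer is cleared
```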