Transformers for Natural Language Processing (NLP)

Nov 3, 2021

Whether you’re an experienced Artificial Intelligence (AI) developer or a newcomer to the field, this post will give you the knowledge you need to build your own Transformer-based solutions to Natural Language Processing (NLP) problems.

Natural Language Processing

Natural Language Processing (NLP) is one of the major fields in AI. It is a subfield not only of AI but also of linguistics and computer science. NLP is concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

NLP provides mechanisms to solve a wide range of tasks, for instance Named Entity Recognition, text translation, sentiment analysis, text generation, question answering, and document classification.

In this article, we will address a multi-label text classification problem.

What do the Transformers do?

Figure 2: Attention mechanism (Source: https://medium.com/deep-learning-with-keras/sequence-to-sequence-learning-c8be6cd34848 )

Transformers are Deep Learning (DL) models built around the attention mechanism. The architecture was introduced in the paper “Attention is all you need” (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin, 2017) [1]. This paper was a major step forward for NLP. Since then, dozens of Transformer models have emerged; the best known are GPT-2, GPT-3, and BERT. There are also variants of these models: for instance, BETO is a BERT model trained on Spanish text.

We will use BERT to implement our solution.

Before the Transformer emerged, most NLP models were based on Recurrent Neural Networks (RNNs). RNNs handle long text sequences inefficiently: they process tokens one at a time and tend to forget the earliest words of the text by the time they make a decision.

The Transformer avoids recurrence by processing each sentence as a whole, using attention mechanisms and positional embeddings.
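
To make the two ingredients named above more concrete, here is a minimal NumPy sketch of sinusoidal positional encodings and scaled dot-product self-attention, following the formulas in the “Attention is all you need” paper. It is an illustration only: the random token vectors and the sizes are assumptions, and a real Transformer layer also applies learned query, key, and value projections and uses several attention heads.

```python
import numpy as np

def positional_encoding(seq_len, d_model):
    """Sinusoidal position encodings (d_model is assumed to be even)."""
    positions = np.arange(seq_len)[:, None]      # shape (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]     # even dimension indices 2i
    angles = positions / np.power(10000.0, dims / d_model)
    encoding = np.zeros((seq_len, d_model))
    encoding[:, 0::2] = np.sin(angles)           # even dimensions
    encoding[:, 1::2] = np.cos(angles)           # odd dimensions
    return encoding

def scaled_dot_product_attention(q, k, v):
    """Return softmax(q k^T / sqrt(d_k)) v and the attention weights."""
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)
    scores = scores - scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights = weights / weights.sum(axis=-1, keepdims=True)
    return weights @ v, weights

# Toy input: 4 tokens with 8-dimensional embeddings (random, for illustration)
rng = np.random.default_rng(0)
seq_len, d_model = 4, 8
tokens = rng.normal(size=(seq_len, d_model))

# Position information is added to the embeddings, then every token
# attends to every other token in a single matrix operation.
x = tokens + positional_encoding(seq_len, d_model)
output, weights = scaled_dot_product_attention(x, x, x)
print(weights.round(2))  # each row sums to 1
```

Because the whole sequence is handled with matrix products, no token has to wait for the previous one to be processed, which is what the RNN-based models could not avoid.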

The architecture of Transformers

Figure 3: The Transformer — Model architecture (Source: https://arxiv.org/abs/1706.03762)

There are many articles on the Internet in which the architecture of Transformers is explained, thus I’m not going to spend time doing the same. See below a list of my favorite ones:

Showcase: Quotes classification

As mentioned above, we will use Transformers to address a multi-label text classification problem.

In this post, we will use the “Quotes Dataset”, provided by Amit Mittal. This dataset is hosted on https://www.kaggle.com/akmittal/quotes-dataset. It is a JSON file of more than 22 MB, and each record contains the quote, its author, tags, popularity, and category.

{
  "Quote": "Don't cry because it's over, smile because it happened.",
  "Author":"Dr. Seuss",
  "Tags":[ "attributed-no-source",  "cry",  "crying", ...],
  "Popularity": 0.15566615566615566,
  "Category":"life"
}

Our goal will be to predict the tags for a given quote, so we can omit the other attributes.
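
As a quick illustration of that reduction, the sketch below parses a record shaped like the sample above with the standard `json` module and keeps only the two attributes we need. The inline JSON string is an abridged copy of the sample (only the tags shown above are included); in the Notebook the data comes from the downloaded file instead.

```python
import json

# Abridged copy of the sample record, wrapped in a list like a dataset file
raw = """[
  {
    "Quote": "Don't cry because it's over, smile because it happened.",
    "Author": "Dr. Seuss",
    "Tags": ["attributed-no-source", "cry", "crying"],
    "Popularity": 0.15566615566615566,
    "Category": "life"
  }
]"""

records = json.loads(raw)

# Keep only the input text (Quote) and the labels to predict (Tags)
samples = [{"Quote": r["Quote"], "Tags": r["Tags"]} for r in records]
print(samples[0])
```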

In our implementation, we will use PyTorch. PyTorch is an open-source machine learning library based on the Torch library.

Next, we will go through the different parts of our Notebook. The full Notebook can be downloaded here

Install and import libraries

Install the dependencies

!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install transformers
!pip install pytorch-lightning

Import required dependencies

# Import all libraries
import pandas as pd
import numpy as np
import re
import os

# Huggingface transformers
import transformers
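from transformers import BertModel, BertTokenizer, get_linear_schedule_with_warmup
# transformers.AdamW is deprecated; use the PyTorch implementation instead
from torch.optim import AdamW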

import torch
from torch import nn ,cuda
from torch.utils.data import DataLoader,Dataset,RandomSampler, SequentialSampler

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

is_gpu_available = torch.cuda.is_available()
device = torch.device("cuda:0" if is_gpu_available else "cpu")
if is_gpu_available:
  !nvidia-smi

Download the dataset

We will make use of a dataset that is hosted on Kaggle. We need to follow the steps below to download and use Kaggle data within Google Colab:

Sign in to https://kaggle.com/, then click on your profile picture on the top right and select “My Account” from the menu.
Scroll down to the “API” section and click “Create New API Token”. This will download a file kaggle.json.
Upload the downloaded kaggle.json file in the next cell.

"""The Kaggle dataset path"""
KAGGLE_DATASET = 'akmittal/quotes-dataset'

!pip install -q kaggle
from google.colab import files
# Opens a file picker: select the kaggle.json file downloaded above
files.upload()

# Move the API token to where the Kaggle CLI expects it and restrict its permissions
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 ~/.kaggle/kaggle.json
# Download and extract the dataset
!kaggle datasets download "{KAGGLE_DATASET}"
!mkdir -p /content/dataset
!unzip -q /content/quotes-dataset.zip -d /content/dataset
!ls /content/dataset

Load dataset

Load the JSON file into a Pandas DataFrame

import pandas as pd

df = pd.read_json('/content/dataset/quotes.json')
# Take only the required attributes and discard the others
df = df[['Quote','Tags']]
# Drop duplicate quotes from the dataset
df = df.drop_duplicates(['Quote'])
print(f'The data frame contains {len(df)} records.')
df.head(6)

Normalize dataset

In this step, we will normalize the tags (lowercase them and strip surrounding whitespace). Additionally, we will work only with the 15 most used tags.

# Normalize every tag: lowercase it and strip surrounding whitespace
df.Tags = df.Tags.transform(lambda tags: [tag.lower().strip() for tag in tags])

# Flatten the per-quote tag lists into a single list of tag occurrences
tags = [tag for tag_list in df.Tags for tag in tag_list]

print(f'There are {len(tags)} tags.')  # There are 215664 tags.

As the cell above shows, the dataset contains more than 200k tag occurrences. Since this Notebook is mainly educational, we will keep only the 15 most frequent tags, discard the rest, and drop the quotes that are left without any tag.

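# The 15 most frequent tags, sorted alphabetically, become our classes
classes = sorted(pd.Series(tags).value_counts()[:15].index)
# Keep only the selected tags on every quote
df['Tags'] = df.Tags.transform(lambda tags: list(set(tags).intersection(classes)))
# Drop the quotes that no longer have any tag
df = df[df.Tags.transform(lambda tags: len(tags)>0)]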

print(f'We will only consider the following tags: {classes}.')
print(f'The data frame contains {len(df)} records with one or more tags.')
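
The same selection logic can be followed step by step on a tiny example. The sketch below is plain Python with `collections.Counter`; the four tag lists and the value `TOP_K = 2` are made up for illustration and do not come from the dataset.

```python
from collections import Counter

# Hypothetical tag lists for four quotes (toy data, not from the dataset)
quote_tags = [
    ["Love ", "life"],
    ["humor", "LIFE", "books"],
    ["love", "inspirational"],
    ["poetry"],
]
TOP_K = 2  # the Notebook uses 15

# 1. Normalize: lowercase and strip surrounding whitespace
normalized = [[tag.lower().strip() for tag in tags] for tags in quote_tags]

# 2. Count every tag occurrence and keep the TOP_K most frequent tags
counts = Counter(tag for tags in normalized for tag in tags)
classes = sorted(tag for tag, _ in counts.most_common(TOP_K))

# 3. Remove the other tags, then drop quotes left without any tag
filtered = [[tag for tag in tags if tag in classes] for tags in normalized]
filtered = [tags for tags in filtered if tags]

print(classes)
print(filtered)
```

The last quote only carries a rare tag, so it disappears from the filtered data, which is exactly why the data frame shrinks after this step.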

To work on our multi-label tag classification we will convert the field ‘Tags’ (with arrays of tags) into 15 columns (one per tag) with binary values.

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Tags')),
                          columns=mlb.classes_,
                          index=df.index))
df.head(n=6)
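
To see what this binarization produces, here is a small plain-Python sketch of the same idea: each list of tags becomes a vector with one 0/1 entry per class. The three classes and the tag lists are toy values, and the helper names `to_multi_hot` and `from_multi_hot` are hypothetical (they are not part of scikit-learn).

```python
# Toy classes and tag lists (illustrative values only)
classes = ["humor", "life", "love"]
quote_tags = [["love", "life"], ["humor"], ["life"]]

def to_multi_hot(tags, classes):
    """One entry per class: 1 if the quote has that tag, else 0."""
    return [1 if label in tags else 0 for label in classes]

def from_multi_hot(row, classes):
    """Inverse mapping: recover the tag names from a 0/1 vector."""
    return tuple(label for label, flag in zip(classes, row) if flag)

encoded = [to_multi_hot(tags, classes) for tags in quote_tags]
decoded = [from_multi_hot(row, classes) for row in encoded]

print(encoded)
print(decoded)
```

Unlike single-label classification, several entries of a row can be 1 at the same time, which is what makes the problem multi-label.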

Splitting dataset into train, test, and validation data

We have more than 20k records in our data frame. Our intention is purely educational, so let’s take a subset with “only” 2k records to work with.

df = df[:2000]

70% of our records will be used for training the model; the remaining 30% will be split evenly between test and validation.

RANDOM_STATE = 167

train_data,temp_data = train_test_split(
    df,
    test_size=.3, 
    random_state=RANDOM_STATE,
    shuffle=True,
)


test_data, val_data = train_test_split(
    temp_data, 
    test_size=0.5, 
    random_state=RANDOM_STATE, 
    shuffle=True,
)
print(f' - records for training the model: {train_data.shape[0]}.')
print(f' - records for validating the trained model: {val_data.shape[0]}.')
print(f' - records for testing the model: {test_data.shape[0]}.')
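
The arithmetic behind the two-stage split can be checked with a small standalone sketch. It is not scikit-learn's implementation: `three_way_split` is a hypothetical helper that shuffles with Python's `random` module, so the selected records differ from those picked by `train_test_split`, but the resulting sizes follow the same 70/15/15 proportions.

```python
import random

def three_way_split(records, train_frac=0.7, seed=167):
    """Shuffle, then cut into train / test / validation parts."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_train = round(len(shuffled) * train_frac)
    n_test = (len(shuffled) - n_train) // 2   # half of the remainder
    train = shuffled[:n_train]
    test = shuffled[n_train:n_train + n_test]
    val = shuffled[n_train + n_test:]
    return train, test, val

# 2000 records, as in the subset used above
train, test, val = three_way_split(range(2000))
print(len(train), len(test), len(val))
```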

Preparing the Dataset and DataModule

class QuoteTagDataset (Dataset):
  def __init__(self, data, tokenizer, max_len):
      self.tokenizer = tokenizer
      self.data      = data
      self.max_len   = max_len
      
  def __len__(self):
      return len(self.data)
  
  def __getitem__(self, item_idx):
      item   = self.data.iloc[item_idx]
      quote  = item['Quote']
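      # Multi-hot label vector as a float array (the row itself has mixed types)
      labels = item[classes].to_numpy(dtype='float32')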

      inputs = self.tokenizer.encode_plus(
          quote,
          None,
          max_length= self.max_len,
          padding = 'max_length',
          add_special_tokens=True,
          return_token_type_ids= False,
          return_attention_mask= True,
          truncation=True,
          return_tensors = 'pt'
        )
      
      input_ids = inputs['input_ids'].flatten()
      attn_mask = inputs['attention_mask'].flatten()
      
      return {
          'input_ids': input_ids ,
          'attention_mask': attn_mask,
          'label': torch.tensor(labels, dtype=torch.float)    
      }

class QuoteTagDataModule (pl.LightningDataModule):
    
    def __init__(self, train_data, val_data, test_data,tokenizer,train_batch_size=8, val_batch_size=8, test_batch_size=8, max_token_len=150):
        super().__init__()
        self.train_data = train_data
        self.test_data  = test_data
        self.val_data   = val_data
        self.tokenizer = tokenizer
        self.train_batch_size = train_batch_size
        self.test_batch_size = test_batch_size
        self.val_batch_size = val_batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        self.train_dataset = QuoteTagDataset(data=self.train_data, tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.val_dataset  = QuoteTagDataset(data=self.val_data,tokenizer=self.tokenizer,max_len = self.max_token_len)
        self.test_dataset  = QuoteTagDataset(data=self.test_data,tokenizer=self.tokenizer,max_len = self.max_token_len)
        
        
    def train_dataloader(self):
        return DataLoader (self.train_dataset, batch_size = self.train_batch_size, shuffle = True , num_workers=0)

    def val_dataloader(self):
        return DataLoader (self.val_dataset,batch_size=self.val_batch_size)

    def test_dataloader(self):
        return DataLoader (self.test_dataset,batch_size=self.test_batch_size)

# Initialize the Bert tokenizer
BERT_MODEL_NAME = 'bert-base-cased'
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

# Initialize the parameters that will be used for training
TRAIN_BATCH_SIZE = 8
TEST_BATCH_SIZE  = 8
VAL_BATCH_SIZE   = 8
MAX_LEN          = 128

# Instantiate and set up the data_module
data_module = QuoteTagDataModule(train_data,val_data,test_data,Bert_tokenizer,TRAIN_BATCH_SIZE,VAL_BATCH_SIZE,TEST_BATCH_SIZE,MAX_LEN)
data_module.setup()
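
The tokenizer options used in the Dataset above (`max_length`, `padding='max_length'`, `truncation=True`, `return_attention_mask=True`) are easier to picture on a toy example. The sketch below is not BERT's WordPiece tokenizer: it splits on whitespace, and the vocabulary and the special token ids are made-up assumptions. It only shows how truncation, padding, and the attention mask relate to each other.

```python
def encode(text, vocab, max_len, pad_id=0, cls_id=1, sep_id=2, unk_id=3):
    """Toy encoder: whitespace tokens, special tokens, truncation, padding."""
    token_ids = [vocab.get(word, unk_id) for word in text.lower().split()]
    # Truncate so that the two special tokens still fit in max_len
    token_ids = [cls_id] + token_ids[: max_len - 2] + [sep_id]
    # Real tokens get mask 1, padding positions get mask 0
    attention_mask = [1] * len(token_ids)
    n_pad = max_len - len(token_ids)
    return {
        "input_ids": token_ids + [pad_id] * n_pad,
        "attention_mask": attention_mask + [0] * n_pad,
    }

# Hypothetical vocabulary for the example
vocab = {"smile": 4, "because": 5, "it": 6, "happened": 7}

padded = encode("smile because it happened", vocab, max_len=8)
truncated = encode("smile because it happened", vocab, max_len=4)
print(padded)
print(truncated)
```

Every encoded quote ends up with the same length, which is what lets the DataLoader stack them into fixed-size batches; the attention mask tells the model which positions are padding and should be ignored.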

Train the Model

class QuoteTagClassifier(pl.LightningModule):
    
    def __init__(self, n_classes=15, steps_per_epoch=None, n_epochs=3, lr=2e-5 ):
        super().__init__()

        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size,n_classes) # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        self.criterion = nn.BCEWithLogitsLoss()
        
    def forward(self,input_ids, attn_mask):
        output = self.bert(input_ids = input_ids ,attention_mask = attn_mask)
        output = self.classifier(output.pooler_output)
        return output
    
    
    def training_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('train_loss',loss , prog_bar=True,logger=True)
        
        return {"loss" :loss, "predictions":outputs, "labels": labels }


    def validation_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('val_loss',loss , prog_bar=True,logger=True)
        
        return loss

    def test_step(self,batch,batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        outputs = self(input_ids,attention_mask)
        loss = self.criterion(outputs,labels)
        self.log('test_loss',loss , prog_bar=True,logger=True)
        return loss
    
    
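    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=self.lr)
        # Warm up during the first third of the first epoch
        warmup_steps = self.steps_per_epoch // 3
        # The scheduler expects the total number of training steps, warmup included
        total_steps = self.steps_per_epoch * self.n_epochs
        scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
        # Step the scheduler after every batch (Lightning's default is once per epoch)
        return {
            'optimizer': optimizer,
            'lr_scheduler': {'scheduler': scheduler, 'interval': 'step'},
        }

N_EPOCHS   = 20
LR         = 2e-05
steps_per_epoch = len(train_data)//TRAIN_BATCH_SIZE

model = QuoteTagClassifier(n_classes=len(classes), steps_per_epoch=steps_per_epoch,n_epochs=N_EPOCHS,lr=LR)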
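
The classifier outputs one raw score (logit) per tag, and `nn.BCEWithLogitsLoss` turns each score into an independent yes/no decision. The following plain-Python sketch shows the computation on made-up logits and targets; it uses the usual numerically stable formulation and is meant as an illustration, not as a replacement for the PyTorch loss.

```python
import math

def sigmoid(z):
    """Map a logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def bce_with_logits(logits, targets):
    """Mean binary cross-entropy over all labels, computed from raw logits."""
    total = 0.0
    for z, y in zip(logits, targets):
        # Stable form of -[y*log(sigmoid(z)) + (1-y)*log(1-sigmoid(z))]
        total += max(z, 0.0) - z * y + math.log1p(math.exp(-abs(z)))
    return total / len(logits)

# Made-up logits for three tags of one quote, and the true 0/1 labels
logits = [2.0, -1.0, 0.0]
targets = [1.0, 0.0, 1.0]

loss = bce_with_logits(logits, targets)
probabilities = [sigmoid(z) for z in logits]
print(loss, probabilities)
```

Because every tag gets its own sigmoid rather than sharing a softmax, the probabilities do not have to sum to 1, so a quote can be assigned several tags at once.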

The next step could take a long time, as it is the part in which we train our model.

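# Instantiate the Model Trainer
# Recent PyTorch Lightning releases replace the `gpus` and `progress_bar_refresh_rate`
# arguments with `accelerator`/`devices` and a progress bar callback
from pytorch_lightning.callbacks import TQDMProgressBar
trainer = pl.Trainer(
    max_epochs=N_EPOCHS,
    accelerator='gpu' if is_gpu_available else 'cpu',
    devices=1,
    callbacks=[TQDMProgressBar(refresh_rate=20)],
)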
# Train the Classifier Model
trainer.fit(model, data_module)

# Evaluate the model performance on the test dataset
trainer.test(model,datamodule=data_module)
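
During training, the scheduler created in `configure_optimizers` scales the learning rate: it grows linearly from zero during the warmup steps and then decays linearly back to zero. The sketch below reproduces that shape as a plain function; the step counts (10 warmup steps out of 100) are illustrative values, not the ones used in the Notebook, and `linear_warmup_decay` is a hypothetical helper name.

```python
def linear_warmup_decay(step, warmup_steps, total_steps):
    """Multiplier applied to the base learning rate at a given step."""
    if step < warmup_steps:
        # Warmup: grow linearly from 0 to 1
        return step / max(1, warmup_steps)
    # Decay: shrink linearly from 1 to 0 at total_steps
    return max(0.0, (total_steps - step) / max(1, total_steps - warmup_steps))

BASE_LR = 2e-5
for step in (0, 5, 10, 55, 100):
    factor = linear_warmup_decay(step, warmup_steps=10, total_steps=100)
    print(step, factor, BASE_LR * factor)
```

The warmup avoids large, noisy updates to the pretrained BERT weights in the first batches, when the freshly initialized classifier layer still produces essentially random outputs.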

Evaluate Model Performance on Test Set

from torch.utils.data import TensorDataset

# Tokenize all quotes in test_data
input_ids = []
attention_masks = []


for quote in test_data.Quote:
    encoded_quote =  Bert_tokenizer.encode_plus(
      quote,
      None,
      add_special_tokens=True,
      max_length= MAX_LEN,
      padding = 'max_length',
      return_token_type_ids= False,
      return_attention_mask= True,
      truncation=True,
      return_tensors = 'pt'      
    )
    
    input_ids.append(encoded_quote['input_ids'])
    attention_masks.append(encoded_quote['attention_mask'])
    
# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(test_data[classes].values)

# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)

# Put model in evaluation mode
model = model.to(device) # move the model to the selected device (GPU if available)
model.eval()

# Tracking variables
pred_outs, true_labels = [], []
# Predict
for batch in pred_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
  
    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch
 
    with torch.no_grad():
        pred_out = model(b_input_ids,b_attn_mask)
        pred_out = torch.sigmoid(pred_out)
        pred_out = pred_out.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

    pred_outs.append(pred_out)
    true_labels.append(label_ids)

# Stack the per-batch arrays into single (n_samples, n_classes) arrays
flat_pred_outs = np.concatenate(pred_outs, axis=0)
flat_true_labels = np.concatenate(true_labels, axis=0)

Predictions of Tags in the Test set

First of all, we need to identify the threshold that performs best on the test dataset.

threshold  = np.arange(0.4,0.51,0.01)
threshold

Let’s define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.

# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob,thresh):
    y_pred = []

    for tag_label_row in pred_prob:
        temp=[]
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1) 
            else:
                temp.append(0)
        y_pred.append(temp)

    return y_pred

from sklearn import metrics
scores=[] # Store the list of f1 scores for prediction on each threshold

#convert labels to 1D array
y_true = flat_true_labels.ravel() 

for thresh in threshold:
    
    #classes for each threshold
    pred_bin_label = classify(flat_pred_outs,thresh) 

    #convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()

    scores.append(metrics.f1_score(y_true,y_pred))

# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')
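
The threshold search can be traced by hand on a tiny example. The sketch below is plain Python on made-up probabilities for two quotes and three tags; `micro_f1` is a hypothetical helper that pools every (quote, tag) cell before computing F1, which is what flattening the arrays before calling `f1_score` amounts to.

```python
def micro_f1(y_true, y_pred):
    """F1 computed over all (sample, label) cells pooled together."""
    pairs = [(t, p) for rt, rp in zip(y_true, y_pred) for t, p in zip(rt, rp)]
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0

# Made-up predicted probabilities and true labels (2 quotes, 3 tags)
probs = [[0.9, 0.4, 0.6], [0.2, 0.8, 0.35]]
y_true = [[1, 0, 1], [0, 1, 0]]
thresholds = [0.3, 0.5, 0.7]

scores = []
for thresh in thresholds:
    y_pred = [[1 if p >= thresh else 0 for p in row] for row in probs]
    scores.append(micro_f1(y_true, y_pred))

best_threshold = thresholds[scores.index(max(scores))]
print(scores, best_threshold)
```

A low threshold adds false positives and a high one adds false negatives, so the F1 score peaks somewhere in between.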

Performance Score Evaluation

#predictions for optimal threshold
y_pred_labels = classify(flat_pred_outs,opt_thresh)
y_pred = np.array(y_pred_labels).ravel() # Flatten
print(metrics.classification_report(y_true,y_pred))

from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
yt = mlb.fit_transform([classes])
yt.shape

y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)

df = pd.DataFrame({'Body':test_data['Quote'],'Actual Tags':y_act,'Predicted Tags':y_pred})
df.sample(40)

Conclusions

The performance score is not the best, but we didn’t work on fine-tuning (this will be covered in another post). Moreover, the example was run on the free instances provided by Google, which impose several limitations, such as on the batch size.

Understanding how Transformers work can be tricky, but implementing a showcase is not rocket science at all.

References

Wikipedia contributors. (2021, October 30). Natural language processing. In Wikipedia, The Free Encyclopedia. Retrieved 07:27, November 2, 2021, from https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1052735380
Maher. (2021, October 19). Five contemporary natural language processing problems pushed forward by DL. Medium. Retrieved November 2, 2021, from https://heartbeat.comet.ml/five-contemporary-natural-language-processing-problems-pushed-forward-by-dl-61989682cad5.
Muñoz, E. (2021, February 11). Attention is all you need: Discovering the transformer paper. Medium. Retrieved November 2, 2021, from https://towardsdatascience.com/attention-is-all-you-need-discovering-the-transformer-paper-73e5ff5e0634.
Goled, S. (2021, March 16). Why transformers are increasingly becoming as important as RNN and CNN? Analytics India Magazine. Retrieved November 2, 2021, from https://analyticsindiamag.com/why-transformers-are-increasingly-becoming-as-important-as-rnn-and-cnn/.
Transformers in NLP: State-of-the-art-models. Analytics Vidhya. (2021, July 23). Retrieved November 2, 2021, from https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/.
Montantes, J. (2020, July 21). A deep dive into the transformer architecture — the development of Transformer models. Medium. Retrieved November 2, 2021, from https://towardsdatascience.com/a-deep-dive-into-the-transformer-architecture-the-development-of-transformer-models-acbdf7ca34e0.
Alammar, J. (n.d.). The illustrated transformer. The Illustrated Transformer — Jay Alammar — Visualizing machine learning one concept at a time. Retrieved November 2, 2021, from https://jalammar.github.io/illustrated-transformer/.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (pp. 5998–6008), https://arxiv.org/pdf/1706.03762.pdf.
Mittal, A. (2018, March 18). Quotes dataset. Kaggle. Retrieved November 2, 2021, from https://www.kaggle.com/akmittal/quotes-dataset.