Transformers for Natural Language Processing (NLP)
Whether you’re an experienced Artificial Intelligence (AI) developer or a newcomer to the field, this post will give you the knowledge you need to build your own Transformer-based solutions to Natural Language Processing (NLP) challenges.
Natural Language Processing
Natural Language Processing (NLP) is one of the major fields in AI. NLP is not only a subfield of AI but also of linguistics and computer science. It is concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.
NLP provides mechanisms to solve a wide range of challenges: Named Entity Recognition, text translation, sentiment analysis, text generation, question answering, and document classification, among others.
In this article, we will address a multi-label text classification problem.
What do the Transformers do?
The Transformer is a Deep Learning (DL) architecture built around the attention mechanism, which was introduced in the paper “Attention Is All You Need” (Vaswani et al., 2017) [1]. Since then, dozens of Transformer models have emerged, the best known being GPT-2, GPT-3, and BERT. There are also variants of these models; for instance, BETO is a BERT model trained for the Spanish language.
We will use BERT to implement our solution.
Before the emergence of Transformers, most NLP models were based on Recurrent Neural Networks (RNNs). RNNs are not very efficient at handling long sequences of text: by the time they make a decision, they tend to have forgotten the content of the first words of the text.
The Transformer avoids recurrence by processing a sentence as a whole, using attention mechanisms and positional embeddings.
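To make the idea of attention more concrete, below is a minimal, illustrative sketch of scaled dot-product attention in PyTorch (a single head, with no masking or learned projections); it is a toy example, not part of the showcase code that follows.
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(query, key, value):
    # Similarity between every query and every key, scaled by sqrt(d_k)
    d_k = query.size(-1)
    scores = query @ key.transpose(-2, -1) / d_k ** 0.5
    # Normalize the scores into attention weights
    weights = F.softmax(scores, dim=-1)
    # Each output position is a weighted mix of all value vectors
    return weights @ value

# Toy "sentence" of 4 tokens with 8-dimensional embeddings (self-attention)
x = torch.randn(1, 4, 8)
print(scaled_dot_product_attention(x, x, x).shape)  # torch.Size([1, 4, 8])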
The architecture of Transformers
There are many articles on the Internet in which the architecture of Transformers is explained, so I am not going to spend time doing the same. Below is a list of my favorite ones:
- A Deep Dive Into the Transformer Architecture — The Development of Transformer Models
- The Illustrated Transformer
- How do Transformers Work in NLP? A Guide to the Latest State-of-the-Art Models
Showcase: Quotes classification
As mentioned above, we will take advantage of Transformers to address a multi-label text classification problem.
In this post, we will use the “Quotes Dataset” provided by Amit Mittal and hosted at https://www.kaggle.com/akmittal/quotes-dataset. It is a JSON dataset of more than 22 MB, in which each quote comes with its author, tags, popularity, and category:
{
"Quote": "Don't cry because it's over, smile because it happened.",
"Author":"Dr. Seuss",
"Tags":[ "attributed-no-source", "cry", "crying", ...],
"Popularity": 0.15566615566615566,
"Category":"life"
}
Our goal will be to predict the tags for a given quote, so we can omit the other attributes.
In our implementation, we will use PyTorch. PyTorch is an open-source machine learning library based on the Torch library.
Next, we will go through the different parts of our Notebook. The full Notebook can be downloaded here.
Install and import libraries
Install the dependencies
!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install transformers
!pip install pytorch-lightning
Import the required dependencies
# Import all libraries
import pandas as pd
import numpy as np
import re
import os
# Huggingface transformers
import transformers
from transformers import BertModel,BertTokenizer,AdamW, get_linear_schedule_with_warmup
import torch
from torch import nn ,cuda
from torch.utils.data import DataLoader,Dataset,RandomSampler, SequentialSampler
import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support
import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline
is_gpu_available = torch.cuda.is_available()
device = torch.device("cuda:0" if is_gpu_available else "cpu")
if is_gpu_available:
    !nvidia-smi
Download the dataset
We will make use of a dataset that is hosted on Kaggle. We need to follow the steps below to download and use Kaggle data within Google Colab:
- Sign in to https://kaggle.com/, then click on your profile picture on the top right and select “My Account” from the menu.
- Scroll down to the “API” section and click “Create New API Token”. This will download a file kaggle.json.
- Upload the downloaded kaggle.json file in the next cell.
"""The Kaggle dataset path"""
KAGGLE_DATASET ='akmittal/quotes-dataset'
!pip install -q kaggle
from google.colab import files
files.upload()
!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download "{KAGGLE_DATASET}"
!mkdir /content/dataset
!unzip -q /content/quotes-dataset.zip -d /content/dataset
!ls /content/dataset
Load dataset
Load the JSON file into a Pandas DataFrame
import pandas as pd
df = pd.read_json('/content/dataset/quotes.json')
# Take only the required attributes and discard the others
df = df[['Quote','Tags']]
# Drop duplicate records from the dataset
df = df.drop_duplicates(['Quote'])
print(f'The data frame contains {len(df)} records.')
df.head(6)
Normalize dataset
In this step, we will normalize the tags (convert them to lowercase and strip surrounding whitespace). Additionally, we will work only with the 15 most used tags.
df.Tags = df.Tags.transform(lambda tags: [tag.lower().strip() for tag in tags])
tags = [element for list_ in df.Tags for element in list_]
tags = [tag.lower().strip() for tag in tags]
print(f'There are {len(tags)} tags.')  # There are 215664 tags.
As we can observe in the cell above, there are more than 200k tag occurrences. Since this Notebook has mainly educational purposes, we will keep only the 15 most used tags and clean the dataset accordingly.
classes = pd.Series(tags).value_counts()[:15].index
classes = list(set(classes))
classes.sort()
df['Tags'] = df.Tags.transform(lambda tags: list(set(tags).intersection(classes)))
df = df[df.Tags.transform(lambda tags: len(tags)>0)]
print(f'We will only consider the following tags: {classes}.')
print(f'The data frame contains {len(df)} records with one or more tags.')
To perform our multi-label tag classification, we will convert the ‘Tags’ field (an array of tags) into 15 columns (one per tag) with binary values.
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Tags')),
                          columns=mlb.classes_,
                          index=df.index))
df.head(n=6)
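To illustrate what this binarization produces, here is a small, self-contained example with made-up tags (it is not part of the showcase pipeline):
from sklearn.preprocessing import MultiLabelBinarizer
import pandas as pd

# Three toy quotes with their (hypothetical) tag lists
toy = pd.DataFrame({'Tags': [['life', 'love'], ['humor'], ['life']]})
mlb_demo = MultiLabelBinarizer()
binarized = pd.DataFrame(mlb_demo.fit_transform(toy['Tags']),
                         columns=mlb_demo.classes_)
print(binarized)
#    humor  life  love
# 0      0     1     1
# 1      1     0     0
# 2      0     1     0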
Splitting dataset into train, test, and validation data
We have more than 20k records in our data frame. Our intention is purely educational, so let’s take a subset with “only” 2k records to work with.
df = df[:2000]
70% of our records will be used for training the model, and the remaining 30% will be split evenly between test and validation data.
RANDOM_STATE = 167
train_data, temp_data = train_test_split(
    df,
    test_size=0.3,
    random_state=RANDOM_STATE,
    shuffle=True,
)
test_data, val_data = train_test_split(
    temp_data,
    test_size=0.5,
    random_state=RANDOM_STATE,
    shuffle=True,
)
print(f' - dataset for training the model: {train_data.shape[0]}.')
print(f' - dataset for validating the trained model: {val_data.shape[0]}.')
print(f' - dataset for testing the model: {test_data.shape[0]}.')
Preparing the Dataset and DataModule
class QuoteTagDataset(Dataset):
    def __init__(self, data, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = data
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item_idx):
        item = self.data.iloc[item_idx]
        quote = item['Quote']
        labels = item[classes]
        inputs = self.tokenizer.encode_plus(
            quote,
            None,
            max_length=self.max_len,
            padding='max_length',
            add_special_tokens=True,
            return_token_type_ids=False,
            return_attention_mask=True,
            truncation=True,
            return_tensors='pt'
        )
        input_ids = inputs['input_ids'].flatten()
        attn_mask = inputs['attention_mask'].flatten()
        return {
            'input_ids': input_ids,
            'attention_mask': attn_mask,
            # labels come as a pandas Series; convert to a plain list before building the tensor
            'label': torch.tensor(labels.tolist(), dtype=torch.float)
        }

class QuoteTagDataModule(pl.LightningDataModule):
    def __init__(self, train_data, val_data, test_data, tokenizer, train_batch_size=8, val_batch_size=8, test_batch_size=8, max_token_len=150):
        super().__init__()
        self.train_data = train_data
        self.test_data = test_data
        self.val_data = val_data
        self.tokenizer = tokenizer
        self.train_batch_size = train_batch_size
        self.test_batch_size = test_batch_size
        self.val_batch_size = val_batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        self.train_dataset = QuoteTagDataset(data=self.train_data, tokenizer=self.tokenizer, max_len=self.max_token_len)
        self.val_dataset = QuoteTagDataset(data=self.val_data, tokenizer=self.tokenizer, max_len=self.max_token_len)
        self.test_dataset = QuoteTagDataset(data=self.test_data, tokenizer=self.tokenizer, max_len=self.max_token_len)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.train_batch_size, shuffle=True, num_workers=0)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.val_batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.test_batch_size)

# Initialize the Bert tokenizer
BERT_MODEL_NAME = 'bert-base-cased'
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)
# Initialize the parameters that will be used for training
TRAIN_BATCH_SIZE = 8
TEST_BATCH_SIZE = 8
VAL_BATCH_SIZE = 8
MAX_LEN = 128
# Instantiate and set up the data_module
data_module = QuoteTagDataModule(train_data,val_data,test_data,Bert_tokenizer,TRAIN_BATCH_SIZE,VAL_BATCH_SIZE,TEST_BATCH_SIZE,MAX_LEN)
data_module.setup()
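As a quick sanity check (optional, and assuming the data_module defined above), we can inspect one tokenized item to verify the shapes the model will receive:
# Peek at a single tokenized training example
sample = data_module.train_dataset[0]
print(sample['input_ids'].shape)       # torch.Size([128]) -> MAX_LEN token ids
print(sample['attention_mask'].shape)  # torch.Size([128])
print(sample['label'])                 # 15 binary values, one per tag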
Train the Model
class QuoteTagClassifier(pl.LightningModule):
    def __init__(self, n_classes=15, steps_per_epoch=None, n_epochs=3, lr=2e-5):
        super().__init__()
        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)  # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, input_ids, attn_mask):
        output = self.bert(input_ids=input_ids, attention_mask=attn_mask)
        output = self.classifier(output.pooler_output)
        return output

    def training_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        outputs = self(input_ids, attention_mask)
        loss = self.criterion(outputs, labels)
        self.log('train_loss', loss, prog_bar=True, logger=True)
        return {"loss": loss, "predictions": outputs, "labels": labels}

    def validation_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        outputs = self(input_ids, attention_mask)
        loss = self.criterion(outputs, labels)
        self.log('val_loss', loss, prog_bar=True, logger=True)
        return loss

    def test_step(self, batch, batch_idx):
        input_ids = batch['input_ids']
        attention_mask = batch['attention_mask']
        labels = batch['label']
        outputs = self(input_ids, attention_mask)
        loss = self.criterion(outputs, labels)
        self.log('test_loss', loss, prog_bar=True, logger=True)
        return loss

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=self.lr)
        warmup_steps = self.steps_per_epoch // 3
        total_steps = self.steps_per_epoch * self.n_epochs - warmup_steps
        scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
        return [optimizer], [scheduler]

N_EPOCHS = 20
LR = 2e-05
steps_per_epoch = len(train_data)//TRAIN_BATCH_SIZE
model = QuoteTagClassifier(n_classes=len(classes), steps_per_epoch=steps_per_epoch,n_epochs=N_EPOCHS,lr=LR)
The step below could take a long time, as it is the part in which we train our model.
# Instantiate the Model Trainer
trainer = pl.Trainer(max_epochs = N_EPOCHS , gpus = 1, progress_bar_refresh_rate = 20)
# Train the Classifier Model
trainer.fit(model, data_module)
# Evaluate the model performance on the test dataset
trainer.test(model,datamodule=data_module)
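As an optional refinement (not used in the run above), the ModelCheckpoint callback imported earlier could be wired into the Trainer to keep the weights that achieve the lowest validation loss; the file name pattern below is illustrative:
# Optional: keep the best model according to validation loss
checkpoint_callback = ModelCheckpoint(
    monitor='val_loss',
    mode='min',
    filename='quote-tagger-{epoch:02d}-{val_loss:.3f}'
)
trainer = pl.Trainer(max_epochs=N_EPOCHS, gpus=1, progress_bar_refresh_rate=20,
                     callbacks=[checkpoint_callback])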
Evaluate Model Performance on Test Set
from torch.utils.data import TensorDataset
# Tokenize all quotes in test_data
input_ids = []
attention_masks = []
for quote in test_data.Quote:
    encoded_quote = Bert_tokenizer.encode_plus(
        quote,
        None,
        add_special_tokens=True,
        max_length=MAX_LEN,
        padding='max_length',
        return_token_type_ids=False,
        return_attention_mask=True,
        truncation=True,
        return_tensors='pt'
    )
    input_ids.append(encoded_quote['input_ids'])
    attention_masks.append(encoded_quote['attention_mask'])
# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(test_data[classes].values)
# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)

flat_pred_outs = 0
flat_true_labels = 0

# Put model in evaluation mode
model = model.to(device) # moving model to cuda
model.eval()
# Tracking variables
pred_outs, true_labels = [], []
# Predict
for batch in pred_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)
    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch
    with torch.no_grad():
        pred_out = model(b_input_ids, b_attn_mask)
    pred_out = torch.sigmoid(pred_out)
    pred_out = pred_out.detach().cpu().numpy()
    label_ids = b_labels.to('cpu').numpy()
    pred_outs.append(pred_out)
    true_labels.append(label_ids)

flat_pred_outs = np.concatenate(pred_outs, axis=0)
flat_true_labels = np.concatenate(true_labels, axis=0)
Predictions of Tags in the Test set
First of all, we need to identify the threshold that performs best on the test dataset.
threshold = np.arange(0.4,0.51,0.01)
threshold
Let’s define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.
# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob, thresh):
    y_pred = []
    for tag_label_row in pred_prob:
        temp = []
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1)
            else:
                temp.append(0)
        y_pred.append(temp)
    return y_pred

from sklearn import metrics
scores=[] # Store the list of f1 scores for prediction on each threshold
#convert labels to 1D array
y_true = flat_true_labels.ravel()
for thresh in threshold:
    # classes for each threshold
    pred_bin_label = classify(flat_pred_outs, thresh)
    # convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()
    scores.append(metrics.f1_score(y_true, y_pred))

# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')
Performance Score Evaluation
#predictions for optimal threshold
y_pred_labels = classify(flat_pred_outs,opt_thresh)
y_pred = np.array(y_pred_labels).ravel() # Flatten
print(metrics.classification_report(y_true,y_pred))
from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
yt = mlb.fit_transform([classes])
yt.shape
y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)
df = pd.DataFrame({'Body':test_data['Quote'],'Actual Tags':y_act,'Predicted Tags':y_pred})
df.sample(40)
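Finally, as an illustration (this helper is not part of the original Notebook), here is a minimal sketch of how the trained model could be used to tag a new quote; it assumes the model, Bert_tokenizer, classes, opt_thresh, MAX_LEN, and device objects defined above:
def predict_tags(quote, threshold=None):
    """Return the predicted tags for a single quote (illustrative helper)."""
    threshold = opt_thresh if threshold is None else threshold
    encoded = Bert_tokenizer.encode_plus(
        quote,
        add_special_tokens=True,
        max_length=MAX_LEN,
        padding='max_length',
        truncation=True,
        return_attention_mask=True,
        return_tensors='pt'
    )
    model.eval()
    with torch.no_grad():
        logits = model(encoded['input_ids'].to(device),
                       encoded['attention_mask'].to(device))
    probs = torch.sigmoid(logits).squeeze(0).cpu().numpy()
    # Keep the tags whose predicted probability exceeds the chosen threshold
    return [tag for tag, prob in zip(classes, probs) if prob >= threshold]

print(predict_tags("Be yourself; everyone else is already taken."))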
Conclusions
The performance score is not the best, but we did not work on fine-tuning (this will be covered in another post). Moreover, the example was run on the free instances provided by Google, which imposes several limitations, such as the batch size.
Understanding how Transformers work can be tricky, but implementing a showcase is not rocket science at all!
References
- Wikipedia contributors. (2021, October 30). Natural language processing. In Wikipedia, The Free Encyclopedia. Retrieved November 2, 2021, from https://en.wikipedia.org/w/index.php?title=Natural_language_processing&oldid=1052735380
- Maher. (2021, October 19). Five contemporary natural language processing problems pushed forward by DL. Medium. Retrieved November 2, 2021, from https://heartbeat.comet.ml/five-contemporary-natural-language-processing-problems-pushed-forward-by-dl-61989682cad5.
- Muñoz, E. (2021, February 11). Attention is all you need: Discovering the transformer paper. Medium. Retrieved November 2, 2021, from https://towardsdatascience.com/attention-is-all-you-need-discovering-the-transformer-paper-73e5ff5e0634.
- Goled, S. (2021, March 16). Why transformers are increasingly becoming as important as RNN and CNN? Analytics India Magazine. Retrieved November 2, 2021, from https://analyticsindiamag.com/why-transformers-are-increasingly-becoming-as-important-as-rnn-and-cnn/.
- Transformers in NLP: State-of-the-art-models. Analytics Vidhya. (2021, July 23). Retrieved November 2, 2021, from https://www.analyticsvidhya.com/blog/2019/06/understanding-transformers-nlp-state-of-the-art-models/.
- Montantes, J. (2020, July 21). A deep dive into the transformer architecture — the development of Transformer models. Medium. Retrieved November 2, 2021, from https://towardsdatascience.com/a-deep-dive-into-the-transformer-architecture-the-development-of-transformer-models-acbdf7ca34e0.
- Alammar, J. (n.d.). The illustrated transformer. The Illustrated Transformer — Jay Alammar — Visualizing machine learning one concept at a time. Retrieved November 2, 2021, from https://jalammar.github.io/illustrated-transformer/.
- Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł. & Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems (p./pp. 5998–6008), https://arxiv.org/pdf/1706.03762.pdf.
- Mittal, A. (2018, March 18). Quotes dataset. Kaggle. Retrieved November 2, 2021, from https://www.kaggle.com/akmittal/quotes-dataset.