Transformers for Natural Language Processing (NLP)

Iván Corrales Solera
10 min read · Nov 3, 2021


Whether you’re an experienced Artificial Intelligence (AI) developer or a newcomer to the field, this post gives you the knowledge you need to build your own Transformer implementations for solving Natural Language Processing (NLP) challenges.

Natural Language Processing

Figure 1: Natural Language Processing

Natural Language Processing (NLP) is one of the major fields in AI. It is a subfield not only of AI but also of linguistics and computer science. NLP is concerned with the interactions between computers and human language, in particular how to program computers to process and analyze large amounts of natural language data.

NLP provides mechanisms to solve a wide range of challenges, such as Named Entity Recognition, text translation, sentiment analysis, text generation, question answering, and document classification.

In this article, we will address a multi-label text classification problem.

What do the Transformers do?

Figure 2: Attention mechanism (Source: )

The Transformer is a Deep Learning (DL) model built around the attention mechanism. The architecture was introduced in the paper “Attention Is All You Need” (Vaswani, Shazeer, Parmar, Uszkoreit, Jones, Gomez, Kaiser & Polosukhin, 2017)[1], which marked a major step forward for NLP. Since then, dozens of Transformer models have emerged, the best known being GPT-2, GPT-3, and BERT. There are also variants of these models; for instance, BETO is a BERT model trained on Spanish text.

We will use BERT to implement our solution.

Before the Transformer emerged, most of the models used in NLP were based on Recurrent Neural Networks (RNNs). RNNs are not efficient at handling long sequences of text: by the time they make a decision, they tend to have forgotten the content of the first words.

The Transformer avoids recurrence by processing each sentence as a whole, using attention mechanisms and positional embeddings.
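To make the idea of attention more concrete, here is a minimal NumPy sketch of scaled dot-product attention, the core operation described in the paper. It is an illustration under simplifying assumptions, not the code used later in this post: the learned query/key/value projections, multiple heads, and positional embeddings are all left out, and the input is random toy data.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtract the row-wise maximum for numerical stability
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(q, k, v):
    """Return (output, weights) for query, key and value matrices.

    q, k: arrays of shape (seq_len, d_k); v: array of shape (seq_len, d_v).
    """
    d_k = q.shape[-1]
    scores = q @ k.T / np.sqrt(d_k)      # similarity of every pair of tokens
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ v, weights

# Toy input: 4 tokens with 8-dimensional embeddings (random, illustrative values)
rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))

# Self-attention: the same sequence plays the role of query, key and value
output, weights = scaled_dot_product_attention(x, x, x)
print(output.shape)   # (4, 8)
```

Every token attends to every other token in a single matrix product, which is why no step-by-step recurrence over the sequence is needed.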

The architecture of Transformers

Figure 3: The Transformer — Model architecture (Source:

Many articles on the Internet already explain the architecture of Transformers, so I’m not going to spend time doing the same. Below is a list of my favorite ones:

Showcase: Quotes classification

As mentioned above, we will use Transformers to address a multi-label text classification problem.

In this post, we will use the “Quotes Dataset”, provided by Amit Mittal. This dataset is hosted on Kaggle. It is a >22MB dataset in JSON format, and each quote contains the author, the tags, the popularity, and the category.

"Quote": "Don't cry because it's over, smile because it happened.",
"Author":"Dr. Seuss",
"Tags":[ "attributed-no-source", "cry", "crying", ...],
"Popularity": 0.15566615566615566,

Our goal will be to predict the tags for a given quote, so we can omit the other attributes.

In our implementation, we will use PyTorch. PyTorch is an open-source machine learning library based on the Torch library.

Next, we will go through the different parts of our Notebook. The full Notebook can be downloaded here

Install and import libraries

Install the dependencies

!pip install torch
!pip install numpy
!pip install pandas
!pip install matplotlib
!pip install seaborn
!pip install scikit-learn
!pip install transformers
!pip install pytorch-lightning

Import the required dependencies

# Import all libraries
import pandas as pd
import numpy as np
import re
import os

# Huggingface transformers
import transformers
from transformers import BertModel,BertTokenizer,AdamW, get_linear_schedule_with_warmup

import torch
from torch import nn, cuda
from torch.utils.data import DataLoader, Dataset, RandomSampler, SequentialSampler

import pytorch_lightning as pl
from pytorch_lightning.callbacks import ModelCheckpoint

from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_recall_fscore_support

import seaborn as sns
from pylab import rcParams
import matplotlib.pyplot as plt
from matplotlib import rc
%matplotlib inline

is_gpu_available = torch.cuda.is_available()
device = torch.device("cuda:0" if is_gpu_available else "cpu")
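if is_gpu_available:
    # The body of this branch is missing from the source; printing the GPU name is an assumption
    print(f'GPU in use: {torch.cuda.get_device_name(0)}')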

Download the dataset

We will make use of a dataset that is hosted on Kaggle. We need to follow the steps below to download and use Kaggle data within Google Colab:

  1. Sign in to Kaggle, then click on your profile picture on the top right and select “My Account” from the menu.
  2. Scroll down to the “API” section and click “Create New API Token”. This will download a file kaggle.json.
  3. Upload the downloaded kaggle.json file in the next cell.
"""The Kaggle dataset path"""
KAGGLE_DATASET ='akmittal/quotes-dataset'

!pip install -q kaggle
from google.colab import files

# Upload the kaggle.json API token downloaded in step 2
files.upload()

!mkdir -p ~/.kaggle
!mv kaggle.json ~/.kaggle/
!ls ~/.kaggle
!chmod 600 /root/.kaggle/kaggle.json
!kaggle datasets download "{KAGGLE_DATASET}"
!mkdir /content/dataset
!unzip -q /content/ -d /content/dataset
!ls /content/dataset

Load dataset

Load the JSON file into a Pandas DataFrame

import pandas as pd

df = pd.read_json('/content/dataset/quotes.json')
# Take only the required attributes and discard the others
df = df[['Quote','Tags']]
# Drop duplicate records from our dataset
df = df.drop_duplicates(['Quote'])
print(f'The data frame contains {len(df)} records.')

Normalize dataset

In this step, we will normalize the tags (convert them to lowercase and strip surrounding whitespace). Additionally, we will work only with the 15 most frequently used tags.

df.Tags = df.Tags.transform(lambda tags: [tag.lower().strip() for tag in tags])

tags = [element for list_ in df.Tags for element in list_]
tags = [tag.lower().strip() for tag in tags]

print(f'There are {len(tags)} tags.')
# There are 215664 tags.

As the cell above shows, the dataset contains more than 200k tag occurrences. Since this Notebook is mainly educational, we will keep only the 15 most frequent tags, discard the rest, and drop the quotes left without any tag.

classes = pd.Series(tags).value_counts()[:15].index
# Sort the classes so their order matches the alphabetical order used by MultiLabelBinarizer
classes = sorted(classes)
df['Tags'] = df.Tags.transform(lambda tags: list(set(tags).intersection(classes)))
df = df[df.Tags.transform(lambda tags: len(tags)>0)]

print(f'We will only consider the following tags: {classes}.')
print(f'The data frame contains {len(df)} records with one or more tags.')
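The filtering step above can be sketched on a tiny example using only the standard library. The tag lists and `TOP_K` below are hypothetical toy values, chosen so the result is easy to check by hand; the article itself keeps 15 tags and works on the Pandas DataFrame.

```python
from collections import Counter

# Hypothetical toy data: each inner list holds the tags of one quote
quotes_tags = [
    ["love", "life"],
    ["love", "humor"],
    ["love", "life", "wisdom"],
    ["philosophy"],
]

TOP_K = 2  # the article keeps 15; 2 keeps the toy example readable

# Count how often each tag occurs across all quotes
counts = Counter(tag for tags in quotes_tags for tag in tags)

# Keep the TOP_K most frequent tags, in alphabetical order
classes = sorted(tag for tag, _ in counts.most_common(TOP_K))

# Remove every other tag, then drop quotes that end up with no tag at all
filtered = [[tag for tag in tags if tag in classes] for tags in quotes_tags]
kept = [tags for tags in filtered if tags]

print(classes)  # ['life', 'love']
print(kept)     # [['love', 'life'], ['love'], ['love', 'life']]
```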

To work on our multi-label tag classification we will convert the field ‘Tags’ (with arrays of tags) into 15 columns (one per tag) with binary values.

from sklearn.preprocessing import MultiLabelBinarizer
mlb = MultiLabelBinarizer()
# Replace the 'Tags' column with one binary column per tag
df = df.join(pd.DataFrame(mlb.fit_transform(df.pop('Tags')),
                          columns=mlb.classes_,
                          index=df.index))
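To illustrate what this binary encoding looks like, here is a small dependency-free sketch. It is not scikit-learn's implementation, only a hand-written equivalent for a toy case; `multi_hot`, the three class names, and the sample tag lists are hypothetical.

```python
# Classes in alphabetical order, which is how MultiLabelBinarizer orders them by default
classes = ["humor", "life", "love"]

# Hypothetical tag lists for three quotes
samples = [["love", "life"], ["humor"], ["love"]]

def multi_hot(tags, classes):
    """Return a 0/1 list with a 1 at the position of every tag that is present."""
    return [1 if c in tags else 0 for c in classes]

encoded = [multi_hot(tags, classes) for tags in samples]

for tags, row in zip(samples, encoded):
    print(tags, "->", row)
# ['love', 'life'] -> [0, 1, 1]
# ['humor'] -> [1, 0, 0]
# ['love'] -> [0, 0, 1]
```

Each row has one position per class, so a quote with several tags simply has several ones. This is what makes the problem multi-label rather than multi-class.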

Splitting dataset into train, test, and validation data

We have more than 20k records in our data frame. Our intention is purely educational, so let’s take a subset with “only” 2k records to work with.

df = df[:2000]

70% of our records will be used for training the model and the other 30% will be distributed between test and validation.


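# 70% of the records go to training; the remaining 30% is held out.
# The arguments are missing from the source: random_state=42 is an assumption.
train_data, temp_data = train_test_split(df, test_size=0.3, random_state=42)

# Split the held-out records between test and validation (an even split is assumed)
test_data, val_data = train_test_split(temp_data, test_size=0.5, random_state=42)
print(f' - records for training the model: {train_data.shape[0]}.')
print(f' - records for validating the trained model: {val_data.shape[0]}.')
print(f' - records for testing the model: {test_data.shape[0]}.')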

Preparing the Dataset and DataModule

class QuoteTagDataset(Dataset):
    def __init__(self, data, tokenizer, max_len):
        self.tokenizer = tokenizer
        self.data = data
        self.max_len = max_len

    def __len__(self):
        return len(self.data)

    def __getitem__(self, item_idx):
        item = self.data.iloc[item_idx]
        quote = item['Quote']
        # Binary label vector, one position per tag
        labels = item[classes].values.astype(np.float32)

        # Tokenize the quote, padding or truncating it to max_len tokens
        inputs = self.tokenizer.encode_plus(
            quote,
            max_length=self.max_len,
            padding='max_length',
            truncation=True,
            return_token_type_ids=False,
            return_attention_mask=True,
            return_tensors='pt'
        )

        input_ids = inputs['input_ids'].flatten()
        attn_mask = inputs['attention_mask'].flatten()

        return {
            'input_ids': input_ids,
            'attention_mask': attn_mask,
            'label': torch.tensor(labels, dtype=torch.float)
        }
class QuoteTagDataModule(pl.LightningDataModule):

    def __init__(self, train_data, val_data, test_data, tokenizer, train_batch_size=8, val_batch_size=8, test_batch_size=8, max_token_len=150):
        super().__init__()
        self.train_data = train_data
        self.test_data = test_data
        self.val_data = val_data
        self.tokenizer = tokenizer
        self.train_batch_size = train_batch_size
        self.test_batch_size = test_batch_size
        self.val_batch_size = val_batch_size
        self.max_token_len = max_token_len

    def setup(self, stage=None):
        # Lightning passes a `stage` argument when it calls setup()
        self.train_dataset = QuoteTagDataset(data=self.train_data, tokenizer=self.tokenizer, max_len=self.max_token_len)
        self.val_dataset = QuoteTagDataset(data=self.val_data, tokenizer=self.tokenizer, max_len=self.max_token_len)
        self.test_dataset = QuoteTagDataset(data=self.test_data, tokenizer=self.tokenizer, max_len=self.max_token_len)

    def train_dataloader(self):
        return DataLoader(self.train_dataset, batch_size=self.train_batch_size, shuffle=True, num_workers=0)

    def val_dataloader(self):
        return DataLoader(self.val_dataset, batch_size=self.val_batch_size)

    def test_dataloader(self):
        return DataLoader(self.test_dataset, batch_size=self.test_batch_size)

# Initialize the Bert tokenizer
BERT_MODEL_NAME = 'bert-base-cased'
Bert_tokenizer = BertTokenizer.from_pretrained(BERT_MODEL_NAME)

# Initialize the parameters that will be used for training
MAX_LEN = 128
# The batch sizes are not shown in the source; the DataModule defaults (8) are assumed here
TRAIN_BATCH_SIZE = 8
VAL_BATCH_SIZE = 8
TEST_BATCH_SIZE = 8

# Instantiate and set up the data_module
data_module = QuoteTagDataModule(train_data, val_data, test_data, Bert_tokenizer, TRAIN_BATCH_SIZE, VAL_BATCH_SIZE, TEST_BATCH_SIZE, MAX_LEN)
data_module.setup()

Train the Model

class QuoteTagClassifier(pl.LightningModule):

    def __init__(self, n_classes=15, steps_per_epoch=None, n_epochs=3, lr=2e-5):
        super().__init__()

        self.bert = BertModel.from_pretrained(BERT_MODEL_NAME, return_dict=True)
        self.classifier = nn.Linear(self.bert.config.hidden_size, n_classes)  # outputs = number of labels
        self.steps_per_epoch = steps_per_epoch
        self.n_epochs = n_epochs
        self.lr = lr
        # Sigmoid + binary cross-entropy, applied independently to each tag
        self.criterion = nn.BCEWithLogitsLoss()

    def forward(self, input_ids, attn_mask):
        output = self.bert(input_ids=input_ids, attention_mask=attn_mask)
        output = self.classifier(output.pooler_output)
        return output

    def _shared_step(self, batch, stage):
        # Common logic for the training, validation and test steps
        outputs = self(batch['input_ids'], batch['attention_mask'])
        loss = self.criterion(outputs, batch['label'])
        self.log(f'{stage}_loss', loss, prog_bar=True, logger=True)
        return loss, outputs

    def training_step(self, batch, batch_idx):
        loss, outputs = self._shared_step(batch, 'train')
        return {"loss": loss, "predictions": outputs, "labels": batch['label']}

    def validation_step(self, batch, batch_idx):
        loss, _ = self._shared_step(batch, 'val')
        return loss

    def test_step(self, batch, batch_idx):
        loss, _ = self._shared_step(batch, 'test')
        return loss

    def configure_optimizers(self):
        optimizer = AdamW(self.parameters(), lr=self.lr)
        warmup_steps = self.steps_per_epoch // 3
        # The scheduler expects the total number of training steps, warm-up included
        total_steps = self.steps_per_epoch * self.n_epochs
        scheduler = get_linear_schedule_with_warmup(optimizer, warmup_steps, total_steps)
        # Step the scheduler after every batch; Lightning's default is once per epoch
        return {
            'optimizer': optimizer,
            'lr_scheduler': {'scheduler': scheduler, 'interval': 'step'},
        }

# N_EPOCHS is not shown in the source; the classifier's default (3) is assumed here
N_EPOCHS = 3
LR = 2e-05
steps_per_epoch = len(train_data) // TRAIN_BATCH_SIZE

model = QuoteTagClassifier(n_classes=len(classes), steps_per_epoch=steps_per_epoch, n_epochs=N_EPOCHS, lr=LR)

The step below can take a long time, as it is the part in which we train our model.

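# Instantiate the Model Trainer
# (`gpus` and `progress_bar_refresh_rate` are arguments of the PyTorch Lightning
# releases current when this post was written; newer releases have removed them)
trainer = pl.Trainer(max_epochs=N_EPOCHS, gpus=1, progress_bar_refresh_rate=20)
# Train the Classifier Model
trainer.fit(model, data_module)
# Evaluate the model performance on the test dataset
trainer.test(model, datamodule=data_module)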

Evaluate Model Performance on Test Set

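from torch.utils.data import TensorDataset

# Tokenize all quotes in test_data
input_ids = []
attention_masks = []

for quote in test_data.Quote:
    encoded_quote = Bert_tokenizer.encode_plus(
        quote,
        max_length=MAX_LEN,
        padding='max_length',
        truncation=True,
        return_token_type_ids=False,
        return_attention_mask=True,
        return_tensors='pt'
    )

    # Collect the encoded quote and its attention mask
    input_ids.append(encoded_quote['input_ids'])
    attention_masks.append(encoded_quote['attention_mask'])

# Now convert the lists into tensors.
input_ids = torch.cat(input_ids, dim=0)
attention_masks = torch.cat(attention_masks, dim=0)
labels = torch.tensor(test_data[classes].values)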

# Create the DataLoader.
pred_data = TensorDataset(input_ids, attention_masks, labels)
pred_sampler = SequentialSampler(pred_data)
pred_dataloader = DataLoader(pred_data, sampler=pred_sampler, batch_size=TEST_BATCH_SIZE)
flat_pred_outs = 0
flat_true_labels = 0
# Put model in evaluation mode
model = model.to(device)  # moving model to cuda
model.eval()

# Tracking variables
pred_outs, true_labels = [], []
# Predict
for batch in pred_dataloader:
    # Add batch to GPU
    batch = tuple(t.to(device) for t in batch)

    # Unpack the inputs from our dataloader
    b_input_ids, b_attn_mask, b_labels = batch

    with torch.no_grad():
        pred_out = model(b_input_ids, b_attn_mask)
        # Turn the logits into per-tag probabilities
        pred_out = torch.sigmoid(pred_out)
        pred_out = pred_out.detach().cpu().numpy()
        label_ids = b_labels.to('cpu').numpy()

    # Store the predictions and true labels of this batch
    pred_outs.append(pred_out)
    true_labels.append(label_ids)

flat_pred_outs = np.concatenate(pred_outs, axis=0)
flat_true_labels = np.concatenate(true_labels, axis=0)

Predictions of Tags in the Test set

First of all, we need to identify the threshold that performs best on the test dataset.

threshold  = np.arange(0.4,0.51,0.01)

Let’s define a function that takes a threshold value and uses it to convert probabilities into 1 or 0.

# convert probabilities into 0 or 1 based on a threshold value
def classify(pred_prob, thresh):
    y_pred = []

    for tag_label_row in pred_prob:
        temp = []
        for tag_label in tag_label_row:
            if tag_label >= thresh:
                temp.append(1)  # tag predicted
            else:
                temp.append(0)  # tag not predicted
        y_pred.append(temp)

    return y_pred

from sklearn import metrics
scores = []  # Store the list of f1 scores for prediction on each threshold

# convert labels to 1D array
y_true = flat_true_labels.ravel()

for thresh in threshold:

    # classes for each threshold
    pred_bin_label = classify(flat_pred_outs, thresh)

    # convert to 1D array
    y_pred = np.array(pred_bin_label).ravel()

    # F1 score obtained with this threshold
    scores.append(metrics.f1_score(y_true, y_pred))

# find the optimal threshold
opt_thresh = threshold[scores.index(max(scores))]
print(f'Optimal Threshold Value = {opt_thresh}')
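The same threshold search can be written more compactly with NumPy. The sketch below is only an illustration on made-up numbers: `probs` and `labels` are hypothetical, and the hand-written `f1_binary` helper stands in for `sklearn.metrics.f1_score` so that the example depends on NumPy alone.

```python
import numpy as np

def f1_binary(y_true, y_pred):
    """F1 score for flat arrays of 0/1 labels (positive class = 1)."""
    tp = np.sum((y_true == 1) & (y_pred == 1))
    fp = np.sum((y_true == 0) & (y_pred == 1))
    fn = np.sum((y_true == 1) & (y_pred == 0))
    denom = 2 * tp + fp + fn
    return float(2 * tp / denom) if denom else 0.0

# Hypothetical probabilities and true labels for 3 quotes and 2 tags
probs = np.array([[0.9, 0.2], [0.475, 0.6], [0.1, 0.415]])
labels = np.array([[1, 0], [1, 1], [0, 0]])

thresholds = np.arange(0.4, 0.51, 0.01)

# Comparing the whole array against the threshold replaces the nested loops
scores = [
    f1_binary(labels.ravel(), (probs >= t).astype(int).ravel())
    for t in thresholds
]

# np.argmax returns the first threshold that reaches the best score
best = thresholds[int(np.argmax(scores))]
print(f"Best threshold: {best:.2f}")
```

With these toy values, thresholds below 0.415 produce a false positive and thresholds above 0.475 produce a false negative, so the best F1 is reached in between.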

Performance Score Evaluation

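# predictions for optimal threshold
y_pred_labels = classify(flat_pred_outs, opt_thresh)
y_pred = np.array(y_pred_labels).ravel()  # Flatten

# Precision, recall and F1 on the flattened labels
# (this report call is not present in the source; it is added here as an assumption)
print(metrics.classification_report(y_true, y_pred))

from sklearn.preprocessing import MultiLabelBinarizer

# Pass `classes` explicitly so the decoded columns follow the same order as the labels
mlb = MultiLabelBinarizer(classes=classes)
yt = mlb.fit_transform([classes])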

y_pred = mlb.inverse_transform(np.array(y_pred_labels))
y_act = mlb.inverse_transform(flat_true_labels)

df = pd.DataFrame({'Body':test_data['Quote'],'Actual Tags':y_act,'Predicted Tags':y_pred})


The performance score is not the best, but we didn’t work on fine-tuning (this will be covered in another post). Moreover, the example was run on the free instances provided by Google, which impose several limitations, such as the batch size.

Understanding how Transformers work can be tricky, but implementing a showcase is not rocket science at all!