Document Classification using LayoutLM

Lucky Verma
4 min read · Apr 8, 2022



Why LayoutLM?

Pre-training techniques have proven successful for a variety of NLP tasks in recent years. Despite their widespread use in NLP applications, however, pre-trained models almost exclusively focus on text-level manipulation and neglect the layout and style information that is vital for document image understanding. LayoutLM jointly models the interactions between text and layout information across scanned document images, which benefits a great number of real-world document image understanding tasks, such as information extraction from scanned documents.

Furthermore, it makes use of image features to incorporate the visual appearance of words into the model. According to the authors, this is the first time that text and layout are jointly learned in a single framework for document-level pre-training. LayoutLM achieves new state-of-the-art results on a variety of downstream tasks, including form understanding, receipt understanding, and document image classification.

LayoutLM in action, with 2-D layout and image embeddings integrated into the original BERT architecture. The LayoutLM embeddings and image embeddings from Faster R-CNN work together for downstream tasks.

Although BERT-like models have become the state of the art for a variety of difficult NLP tasks, they typically rely on text alone for all of their inputs. For visually rich documents, there is much more information a pre-trained model could encode. The authors therefore propose extracting the visually rich information from document layouts and aligning it with the input text.

Our plan:

We plan to build a Streamlit app for document classification that takes an image of a document as input and outputs document-type probabilities.

Let’s start by setting up the data-processing part:

We can use the Tesseract OCR engine to turn the image into a list of recognized words.

Sample image
Words output from Tesseract OCR
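A minimal sketch of this step using pytesseract (the image path below is a placeholder, not the exact one from my repo):

import pytesseract
from PIL import Image

# Hypothetical path to one scanned document from the dataset
image = Image.open('labeled/invoice/sample.png').convert('RGB')

# image_to_string runs Tesseract and returns the raw text,
# which we split into a simple list of words
text = pytesseract.image_to_string(image)
words = text.split()
print(words[:10])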

We can also visualize the bounding boxes of the recognized words, as follows:
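One way to do this is with Pillow’s ImageDraw on top of pytesseract’s image_to_data output; a rough sketch (the path is again a placeholder):

import pytesseract
from PIL import Image, ImageDraw

image = Image.open('labeled/invoice/sample.png').convert('RGB')  # placeholder path
ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

draw = ImageDraw.Draw(image)
for i, word in enumerate(ocr['text']):
    if not word.strip():
        continue  # skip empty detections
    x, y, w, h = ocr['left'][i], ocr['top'][i], ocr['width'][i], ocr['height'][i]
    draw.rectangle([x, y, x + w, y + h], outline='red', width=2)

image.save('boxes.png')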

Preprocessing the data using 🤗 datasets

First, we convert the dataset into a Pandas dataframe with two columns: image_path and label.

The labels, taken from the dataset’s labeled folder:
{'bill': 0, 'invoice': 1, 'others': 2, 'Purchase_Order': 3, 'remittance': 4}
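A minimal sketch of building that dataframe, assuming the images live in labeled/<class_name>/ subfolders (the exact folder layout is an assumption):

import os
import pandas as pd

# Label mapping, matching the dataset's labeled folder
label2id = {'bill': 0, 'invoice': 1, 'others': 2,
            'Purchase_Order': 3, 'remittance': 4}

rows = []
for label, idx in label2id.items():
    folder = os.path.join('labeled', label)  # assumed folder layout
    for fname in os.listdir(folder):
        rows.append({'image_path': os.path.join(folder, fname), 'label': idx})

df = pd.DataFrame(rows)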

Now, let’s apply OCR to get the words and bounding boxes of every image. To do this efficiently, we turn our Pandas dataframe into a HuggingFace Dataset object and use the .map() functionality to get the words and normalized bounding boxes of every image. Note that this can take a while to run (Tesseract seems a bit slow).
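Here is a sketch of that step. LayoutLM expects bounding boxes normalized to a 0–1000 scale, so we rescale Tesseract’s pixel coordinates; normalize_box and apply_ocr are my own helper names:

import pytesseract
from datasets import Dataset
from PIL import Image

def normalize_box(box, width, height):
    # LayoutLM expects coordinates scaled to the range 0-1000
    return [int(1000 * box[0] / width), int(1000 * box[1] / height),
            int(1000 * box[2] / width), int(1000 * box[3] / height)]

def apply_ocr(example):
    image = Image.open(example['image_path']).convert('RGB')
    width, height = image.size

    ocr = pytesseract.image_to_data(image, output_type=pytesseract.Output.DICT)

    words, boxes = [], []
    for i, word in enumerate(ocr['text']):
        if not word.strip():
            continue  # drop empty detections
        x, y, w, h = ocr['left'][i], ocr['top'][i], ocr['width'][i], ocr['height'][i]
        words.append(word)
        boxes.append(normalize_box([x, y, x + w, y + h], width, height))

    example['words'] = words
    example['bbox'] = boxes
    return example

dataset = Dataset.from_pandas(df)
updated_dataset = dataset.map(apply_ocr)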

updated_dataset
Dataset({
    features: ['bbox', 'image_path', 'label', 'words'],
    num_rows: 449
})

Let’s look at some of the data we get after OCR:

df = pd.DataFrame.from_dict(updated_dataset)
print(len(df["words"][0]))
print(df["words"][0])

Next, we can turn the word-level ‘words’ and ‘bbox’ columns into token-level input_ids, attention_mask, bbox and token_type_ids using LayoutLMTokenizer.
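A sketch of the encoding step. Each word-level box is repeated for every sub-word token, and boxes are added for the [CLS] and [SEP] tokens; encode_example is my own helper name:

from transformers import LayoutLMTokenizer

tokenizer = LayoutLMTokenizer.from_pretrained('microsoft/layoutlm-base-uncased')

def encode_example(example, max_seq_length=512, pad_token_box=[0, 0, 0, 0]):
    words, word_boxes = example['words'], example['bbox']

    # Repeat each word-level box once per sub-word token
    token_boxes = []
    for word, box in zip(words, word_boxes):
        token_boxes.extend([box] * len(tokenizer.tokenize(word)))

    # Truncate, leaving room for [CLS] and [SEP], then add their boxes
    token_boxes = token_boxes[: max_seq_length - 2]
    token_boxes = [[0, 0, 0, 0]] + token_boxes + [[1000, 1000, 1000, 1000]]

    encoding = tokenizer(' '.join(words), padding='max_length',
                         truncation=True, max_length=max_seq_length)

    # Pad the boxes up to the full sequence length
    token_boxes += [pad_token_box] * (max_seq_length - len(token_boxes))
    encoding['bbox'] = token_boxes
    encoding['label'] = example['label']
    return encoding

encoded_dataset = updated_dataset.map(encode_example,
                                      remove_columns=['image_path', 'words'])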

Finally, we set the format to PyTorch because the Transformers library’s LayoutLM implementation is in PyTorch. We also specify which columns will be used.
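Something like this, with a small batch size since the dataset is tiny (the batch size here is illustrative):

import torch
from torch.utils.data import DataLoader

encoded_dataset.set_format(
    type='torch',
    columns=['input_ids', 'attention_mask', 'token_type_ids', 'bbox', 'label'])

dataloader = DataLoader(encoded_dataset, batch_size=4, shuffle=True)
batch = next(iter(dataloader))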

Let’s verify whether the input ids are created correctly by decoding them back to the text:

tokenizer.decode(batch['input_ids'][0].tolist())

Define the model

Here we define the model, namely LayoutLMForSequenceClassification. We initialize it with the weights of the pre-trained base model (LayoutLMModel). The weights of the classification head are randomly initialized and will be fine-tuned together with the weights of the base model on our tiny dataset. Once loaded, we move it to the GPU.
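In code, that amounts to something like:

import torch
from transformers import LayoutLMForSequenceClassification

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Pre-trained base weights; the classification head is randomly initialized
model = LayoutLMForSequenceClassification.from_pretrained(
    'microsoft/layoutlm-base-uncased', num_labels=len(label2id))
model.to(device)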

Train the model

Here we train the model in familiar PyTorch fashion. We use the Adam optimizer with a weight-decay fix (AdamW; normally you would also specify which parameters should and should not receive weight decay, plus a learning-rate scheduler; see the LayoutLM repository for how the authors did this) and train for 30 epochs. If the model is able to overfit this tiny dataset, the setup has no obvious issues and we can train it on the entire dataset.
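A minimal version of that loop (the learning rate is illustrative):

from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=5e-5)

model.train()
for epoch in range(30):
    running_loss, correct = 0.0, 0
    for batch in dataloader:
        input_ids = batch['input_ids'].to(device)
        bbox = batch['bbox'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        token_type_ids = batch['token_type_ids'].to(device)
        labels = batch['label'].to(device)

        # Forward pass; passing labels makes the model return the loss
        outputs = model(input_ids=input_ids, bbox=bbox,
                        attention_mask=attention_mask,
                        token_type_ids=token_type_ids, labels=labels)
        loss = outputs.loss

        running_loss += loss.item()
        correct += (outputs.logits.argmax(-1) == labels).sum().item()

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

    print(f'Epoch: {epoch}')
    print(f'Loss: {running_loss}')
    print(f'Training accuracy: {100 * correct / len(encoded_dataset)}')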

Epoch: 0
Loss: 272.3193453811109
Training accuracy: 82.18263244628906
Epoch: 1
Loss: 137.21977971307933
Training accuracy: 90.42316436767578
Epoch: 2
...
...
...
Epoch: 28
Loss: 0.023347617649051244
Training accuracy: 100.0
Epoch: 29
Loss: 0.019624314532848075
Training accuracy: 100.0

Saving the model

model.save_pretrained('saved_model/')

Inference
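To classify a new document, we reuse apply_ocr and encode_example from above, run a forward pass, and softmax the logits; a sketch with a placeholder image path:

import torch.nn.functional as F

example = {'image_path': 'test_doc.png', 'label': 0}  # placeholder path; label is a dummy
encoding = encode_example(apply_ocr(example))

input_ids = torch.tensor(encoding['input_ids']).unsqueeze(0).to(device)
bbox = torch.tensor(encoding['bbox']).unsqueeze(0).to(device)
attention_mask = torch.tensor(encoding['attention_mask']).unsqueeze(0).to(device)
token_type_ids = torch.tensor(encoding['token_type_ids']).unsqueeze(0).to(device)

model.eval()
with torch.no_grad():
    outputs = model(input_ids=input_ids, bbox=bbox,
                    attention_mask=attention_mask, token_type_ids=token_type_ids)
    probs = F.softmax(outputs.logits, dim=-1)[0]

id2label = {v: k for k, v in label2id.items()}
for idx, p in enumerate(probs):
    print(f'{id2label[idx]}: {100 * p.item():.0f}%')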

bill: 0%
invoice: 0%
others: 100%
Purchase_Order: 0%
remittance: 0%

Thank you!

The complete version of my code can be found here:

GitHub repo: https://github.com/lucky-verma/Document-Classification-using-LayoutLM

Paper: https://arxiv.org/abs/1912.13318

UniLM repo: https://github.com/microsoft/unilm/tree/master/layoutlm

You can contact me on LinkedIn.
