Is BERT Always the Better Cheaper Faster Answer in NLP? Apparently Not.

Author: William Vorhies

Summary: Since Google introduced BERT in 2018, BERT-based NLP models have become the go-to choice. New evidence, however, shows that LSTM models may outperform BERT by a wide margin, meaning you may need to evaluate both approaches for your NLP project.

Over the last year or two, if you needed to bring in an NLP project quickly and with SOTA (state of the art) performance, increasingly you reached for a pretrained BERT module as the starting point.

BERT (Bidirectional Encoder Representations from Transformers) has been heralded as the go-to replacement for LSTM models for several reasons:

It’s available as off-the-shelf modules, especially from the TensorFlow Hub library, that have been trained and tested over large open datasets. These serve as the starting point for transfer learning, in which the pretrained model is fine-tuned for your NLP application.
Because they are based on the transformer architecture rather than the recurrent structure of RNNs / LSTMs, they lend themselves to parallelization, which can both speed up modeling and reduce cost.
Also, the transformer’s attention mechanism takes into account the context of all the words in the sentence at one time, as opposed to LSTMs, which must step along the line of words one at a time (forward and backward, in the bidirectional case) to extract meaning. This is believed to give BERT an advantage in accuracy.
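
The contrast between attention and recurrence in the last two points can be illustrated with a toy sketch. This is not BERT or an LSTM: the function names are made up for illustration, the attention has no learned projections, and the recurrence is a plain tanh update rather than a gated LSTM cell. It only shows that each attention output looks at every word at once, while a recurrent state can only see the words processed so far.

```python
import math

def softmax(scores):
    # Numerically stable softmax over a list of floats.
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def self_attention(vectors):
    # Toy self-attention (assumption: no learned query/key/value weights).
    # Each output is a weighted average of ALL input vectors, and each
    # loop iteration is independent, so the work could run in parallel.
    scale = math.sqrt(len(vectors[0]))
    outputs = []
    for q in vectors:
        weights = softmax([dot(q, k) / scale for k in vectors])
        outputs.append([
            sum(w * v[i] for w, v in zip(weights, vectors))
            for i in range(len(q))
        ])
    return outputs

def recurrent_pass(vectors, decay=0.5):
    # Toy recurrence (assumption: simple tanh update, not a real LSTM cell).
    # The state at step t depends on the state at step t-1, so the steps
    # must be computed in order.
    state = [0.0] * len(vectors[0])
    states = []
    for x in vectors:
        state = [math.tanh(decay * s + xi) for s, xi in zip(state, x)]
        states.append(state)
    return states

tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
changed = [[1.0, 0.0], [0.0, 1.0], [-1.0, 2.0]]  # only the last word differs

# Attention: the first word's output changes when the last word changes.
print(self_attention(tokens)[0] != self_attention(changed)[0])
# Forward recurrence: the first state never sees the last word.
print(recurrent_pass(tokens)[0] == recurrent_pass(changed)[0])
```

A bidirectional LSTM addresses the second limitation by running a second pass over the reversed sequence, but both passes remain sequential.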

Recently, however, there has been growing evidence that BERT may not always give the best performance.

In their recently released arXiv paper, Victor Makarenkov and Lior Rokach of Ben-Gurion University share the results of their controlled experiment contrasting transfer-learning-based BERT models with LSTM models trained from scratch.

They used a number of different pretrained BERT modules from the TensorFlow Hub, fine-tuned for the downstream task. In both of their experiments, the LSTM models outperformed BERT with transfer learning.

Experiment 1

The first test was a proper word choice test, for example selecting between alternatives such as “my wife thinks I am a BEAUTIFUL / HANDSOME guy”.

They used a corpus of 30,000 academic articles to train the bidirectional LSTM model (their baseline) from scratch, and the same set to fine-tune two off-the-shelf pretrained BERT models, one general purpose and the other trained on the same domain material.

Accuracy was evaluated with the Mean Reciprocal Rank (MRR) metric. MRR is a typical metric for any process that produces a list of candidate answers ordered by probability of correctness, such as choosing the most appropriate word. It averages, over all test cases, the reciprocal of the rank at which the correct answer appears.
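
As a minimal sketch of how MRR is computed (the function name and the example word lists are hypothetical, not taken from the paper): for each test case, find the position of the correct answer in the ranked list, take 1 divided by that position, and average across cases. A correct answer missing from the list is scored as 0 here, which is a common convention but an assumption on my part.

```python
def mean_reciprocal_rank(ranked_lists, correct_answers):
    # ranked_lists: one list of candidates per test case, best guess first.
    # correct_answers: the single correct candidate for each test case.
    if len(ranked_lists) != len(correct_answers):
        raise ValueError("need exactly one correct answer per ranked list")
    if not ranked_lists:
        raise ValueError("need at least one test case")
    total = 0.0
    for candidates, answer in zip(ranked_lists, correct_answers):
        if answer in candidates:
            # Ranks are 1-based: first place scores 1, second place 1/2, ...
            total += 1.0 / (candidates.index(answer) + 1)
        # Assumption: a missing answer contributes 0 to the sum.
    return total / len(ranked_lists)

# Hypothetical word-choice rankings; the correct word is "handsome" each time.
rankings = [
    ["handsome", "beautiful", "pretty"],   # correct word at rank 1
    ["beautiful", "handsome"],             # correct word at rank 2
    ["pretty", "lovely", "handsome"],      # correct word at rank 3
]
answers = ["handsome", "handsome", "handsome"]
print(mean_reciprocal_rank(rankings, answers))
```

Here the reciprocal ranks are 1, 1/2 and 1/3, so the MRR is 11/18, roughly 0.61. A model that always ranks the correct word first scores 1.0.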

The articles contained domain specific terms which the LSTM learned from scratch but the fine-tuned BERTs failed to differentiate as well. In fact the BERT models fell well short of the LSTM models.

Experiment 2

In the second instance, a binary classification test, the models were tasked to differentiate the political perspective of test articles from the US and British press to determine if they represented a fundamentally Israeli or Palestinian point of view. The articles were manually annotated by different annotators who agreed on the labels applied.

The general purpose BERT model had been trained on the English Wikipedia and Books. The LSTM model was also off the shelf and was a general purpose news-domain model.

Once again, in this classification problem, the LSTM was clearly superior.

These results call into question whether you should reach first for a BERT transfer learning approach for your next NLP project. If performance differences as great as those described above are important to your project’s success, you may need to look at both approaches.


About the author: Bill is Contributing Editor for Data Science Central. Bill is also President & Chief Data Scientist at Data-Magnum and has practiced as a data scientist since 2001. His articles have been read more than 2.1 million times.

Bill@DataScienceCentral.com or Bill@Data-Magnum.com
