Improve your Large Language Models (LLMs) with Active Learning

Fine-tuning LLMs not only takes significant computing resources but also requires the use of annotated data. But does all data matter equally?

At Lightly, we focus on making machine learning models more data efficient by leveraging self-supervised and active learning. Trusted by Fortune 500 companies and working with academic institutes and research teams, Lightly is considered the industry leader in active learning for computer vision. Let us explore how to transfer the knowledge we gained in computer vision to NLP use cases.

2023 has witnessed a significant surge in the widespread adoption of AI. We have observed major advancements in computer vision with models such as SAM and VideoLDM. Additionally, in the field of NLP, open-source models have finally reached a level of proficiency that allows anyone to develop LLM applications, such as chatbots, for various domains.

Recent LLMs can easily contain tens of billions of parameters. A recent snapshot of publicly available models is summarised in the following paper: https://arxiv.org/pdf/2303.18223.pdf

Still, working with LLMs brings several challenges. These models are massive, with parameter counts that easily reach the tens of billions, such as GPT-3 with 175 billion parameters. Recent research addresses this by making it possible to fine-tune such models on regular consumer hardware instead of massive multi-GPU clusters. One popular framework is LoRA (Low-Rank Adaptation of Large Language Models, 2021). LoRA uses a clever technique to fine-tune these billion-parameter models without having to optimize all of their parameters. We will use the LoRA framework in our experiments in this blog post, as it makes fine-tuning of these models fast and efficient on consumer-grade hardware.
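To give a feel for what LoRA-style fine-tuning looks like in practice, here is a minimal sketch using the Hugging Face `peft` library (the experiments later in this post use the original microsoft/LoRA codebase instead); the rank and other hyperparameters shown are illustrative, not the exact values we used.

```python
# Minimal sketch: wrapping a pre-trained classifier with LoRA adapters.
# Uses the Hugging Face `peft` library for illustration; the experiments in
# this post use the original microsoft/LoRA codebase instead.
from transformers import AutoModelForSequenceClassification
from peft import LoraConfig, get_peft_model

base_model = AutoModelForSequenceClassification.from_pretrained(
    "roberta-base", num_labels=2
)

# Only the small low-rank adapter matrices (and the classifier head) are trained;
# the ~125M base parameters stay frozen.
lora_config = LoraConfig(
    task_type="SEQ_CLS",
    r=8,                # rank of the low-rank update matrices (illustrative value)
    lora_alpha=16,      # scaling factor for the adapter updates
    lora_dropout=0.1,
    target_modules=["query", "value"],  # attention projections to adapt
)
model = get_peft_model(base_model, lora_config)
model.print_trainable_parameters()  # typically well under 1% of all parameters
```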

In this post, we focus on the SST2 dataset. The dataset consists of movie reviews and has labels for positive and negative reviews. The goal is to train a binary classifier that achieves high accuracy for sentiment prediction. We will use a pre-trained RoBERTa model and fine-tune it using LoRA.

We show that ML teams can save up to 78% of labeling costs or improve the model by up to 4.6x per additional labeled batch when using active learning for NLP compared to random selection!

Furthermore, we will evaluate different data selection strategies and how they impact the final model accuracy for this task. For these experiments, we freeze the validation set and the model training routine and vary only the training data we use. We rely solely on predictions or embeddings to drive the data selection strategies. This allows us to transfer the learnings to real-world applications where data collection and labeling are expensive.

In case you read my other blog post about Active Learning strategies compared in Computer Vision, you will be familiar with the format of this post.

The SST2 Dataset

The SST2 Dataset is part of GLUE (General Language Understanding Evaluation benchmark), a well-known benchmark in NLP.

The data consists of phrases like:

`A masterful film from a master filmmaker , unique in its deceptive grimness , compelling in its fatalist worldview .` which is labeled as positive or

`Far from perfect , but its heart is in the right place … innocent and well-meaning .` which is labeled negative.

In total, we have almost 70k samples in the dataset.

  • Training set: 67 349 samples
  • Validation set: 872 samples
  • Test set: 1 821 samples
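
For readers who want to reproduce the setup, the dataset can be pulled in one line with the Hugging Face `datasets` library (just one convenient way to get the data; the LoRA codebase ships its own data loading):

```python
# Load SST-2 from the GLUE benchmark via the Hugging Face `datasets` library.
from datasets import load_dataset

sst2 = load_dataset("glue", "sst2")
print(sst2)              # DatasetDict with train / validation / test splits
print(sst2["train"][0])  # {'sentence': ..., 'label': 0 or 1, 'idx': ...}
# Note: labels of the GLUE test split are withheld (set to -1), so evaluation
# here happens on the validation split.
```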

Experiments

Typically, one would train a model either directly on this task or, nowadays, fine-tune a pre-trained large language model on it. In this experiment, we follow the latter approach and fine-tune a RoBERTa model on the SST2 dataset. The model is trained as a binary classifier that labels phrases as positive or negative sentiment.

The dataset is not very challenging, and according to PapersWithCode, the best-performing models have exceeded the 90% accuracy mark since around 2017. We use this dataset for our experiments for two main reasons:

  1. The sentiment classification task requires a similar understanding of the sentence as popular tasks such as content moderation.
  2. The small size of the dataset makes it easy to run several experiments with multiple seeds.

We also need a model for our experiments. We will use the RoBERTa base model (125 million parameters), which was introduced in a paper from 2019. Since our goal is not to reach state-of-the-art performance but to evaluate the impact of Active Learning, this model, already available in the LoRA codebase, fits our needs perfectly.

All experiments have been conducted with the LoRA codebase here: https://github.com/microsoft/LoRA

Note that there might be a small degradation in performance, as we used only a single GPU and therefore set num_gpus to 1 instead of 8. This also resulted in a smaller batch size of 16 instead of 128.

For data selection, we use a not yet released version of Lightly that focuses on Active Learning for NLP. This solution is currently tested in a private beta. Early access is by invitation only. If you would like to get access, please reach out to nlp@lightly.ai.

We fix the validation set and the training procedure for the following experiments. The only parameter we vary is how the subset of the training data is sampled.

Active Learning Strategies for LLMs

As expected, there is a huge overlap in the literature between Active Learning on vision and text data. Nevertheless, we want to use the available selection strategies LightlyOne supports out of the box. Similar to the related blog post for computer vision, we focus on methods that neither require ensemble models nor data augmentation tricks. Both would introduce significantly increased costs due to multiple model forward passes.

Let’s have a look at the different strategies. There are two main groups of methods we will further look at:

  • Embedding based methods to, for example, diversify data.
  • Prediction based methods to find difficult examples along the decision boundaries.

Using Embeddings

We have seen great success with LightlyOne when working with images and using embeddings for diversity sampling. Why not replicate that in NLP? The approach is simple: we train an embedding model using self-supervised learning, embed all the training samples, and then use an algorithm to diversify the set. Self-supervised learning has been a well-known concept in NLP since the famous BERT paper was published in 2018.

We use a model based on MiniLMv2 that generates 384-dimensional embeddings. We then use LightlyOne's diversity-based selection criterion.
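
As a rough illustration of what diversity selection in embedding space does, here is a minimal sketch that embeds phrases with a MiniLM-based sentence-transformer and greedily picks a diverse subset with a farthest-point heuristic. The checkpoint name is a stand-in (we do not list the exact MiniLMv2 model here), and LightlyOne's internal selection algorithm may differ.

```python
# Sketch: embed phrases and greedily pick a diverse subset (farthest-point heuristic).
import numpy as np
from sentence_transformers import SentenceTransformer

def diverse_subset(embeddings: np.ndarray, n_select: int) -> list[int]:
    """Greedy farthest-point selection: repeatedly add the sample that is
    farthest from everything selected so far."""
    selected = [0]  # start from an arbitrary sample
    # distance of every point to its nearest already-selected point
    dists = np.linalg.norm(embeddings - embeddings[0], axis=1)
    for _ in range(n_select - 1):
        idx = int(np.argmax(dists))  # farthest point from the current selection
        selected.append(idx)
        dists = np.minimum(dists, np.linalg.norm(embeddings - embeddings[idx], axis=1))
    return selected

# "all-MiniLM-L6-v2" is a 384-dimensional stand-in for the MiniLMv2-based model mentioned above.
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
pool = ["a masterful film from a master filmmaker", "far from perfect , but its heart is in the right place"]
embeddings = np.asarray(model.encode(pool, normalize_embeddings=True))
subset_indices = diverse_subset(embeddings, n_select=2)
```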

Using Predictions

For uncertainty sampling, we also take the model's predictions into account. For this post, we evaluate the following strategy:

  • prediction margin: the score is high if the gap between the predicted probabilities of the two classes (positive and negative) is small (note that we use 1 - margin as the score); see the sketch below
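
To make the score concrete, here is a minimal sketch of how such an uncertainty-margin score can be computed from the model's softmax probabilities; the exact implementation in LightlyOne may differ.

```python
# Sketch of an uncertainty-margin score for binary sentiment predictions.
# score = 1 - (p_top1 - p_top2): close to 1 when the model is torn between
# positive and negative, close to 0 when it is confident.
import numpy as np

def uncertainty_margin(probs: np.ndarray) -> np.ndarray:
    """probs: (n_samples, 2) softmax probabilities for [negative, positive]."""
    sorted_p = np.sort(probs, axis=1)           # ascending per row
    margin = sorted_p[:, -1] - sorted_p[:, -2]  # top-1 minus top-2 probability
    return 1.0 - margin

probs = np.array([[0.51, 0.49],   # ambiguous  -> score ~0.98
                  [0.95, 0.05]])  # confident  -> score ~0.10
scores = uncertainty_margin(probs)
```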

Combining different Selection Strategies

At Lightly, we have worked on active learning for over three years now. We have collaborated closely with academia and industry, and if there is one common theme for success we have noticed, it's combining different strategies instead of relying on a single one for all use cases.
For some use cases, one cares more about outliers and special situations. That’s when we rely on embeddings and diversity. For other use cases, we want to refine the decision boundary of the model and reduce confusion in frequent situations. That’s where the prediction margin should be used. We can combine both if we want to focus on improving the decision boundary while sampling more edge cases.
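
As an illustration of how such a combination could look, the sketch below reuses the `uncertainty_margin` and `diverse_subset` helpers from the earlier sketches: it first keeps the most uncertain part of the pool and then diversifies within it. This two-stage filter is only one possible way to combine the signals, not how LightlyOne implements it internally.

```python
# Sketch: combine uncertainty and diversity by filtering, then diversifying.
# Reuses uncertainty_margin() and diverse_subset() from the sketches above.
import numpy as np

def combined_selection(embeddings, probs, n_select, uncertain_fraction=0.5):
    """Keep the most uncertain fraction of the pool, then pick a diverse subset from it."""
    scores = uncertainty_margin(probs)                       # higher = more uncertain
    n_keep = max(n_select, int(len(scores) * uncertain_fraction))
    candidates = np.argsort(scores)[::-1][:n_keep]           # most uncertain samples first
    local_indices = diverse_subset(embeddings[candidates], n_select)
    return candidates[np.array(local_indices)].tolist()
```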

We built LightlyOne to do all of this with just a few lines of code, making it easy for you to adapt your data selection strategy to the use case at hand.

Results

The whole training set consists of 67 349 samples. We picked 40k and 50k as operating points for our experiments. Note that both numbers have been picked arbitrarily.

We use random selection as a baseline. We randomly select 40k and 50k samples from the training set and train a model. We fine-tune the RoBERTa base model for our experiments using LoRA for four epochs.
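
For completeness, here is what the random baseline amounts to in code, a minimal sketch with a fixed seed for reproducibility (the actual seeds used in our runs are not listed in this post):

```python
# Sketch of the random-selection baseline: draw 40k / 50k training indices with a fixed seed.
import numpy as np

rng = np.random.default_rng(seed=0)                            # one seed per run; we average over two seeds
subset_40k = rng.choice(67_349, size=40_000, replace=False)    # 40k operating point
subset_50k = rng.choice(67_349, size=50_000, replace=False)    # 50k operating point
```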

We show different plots from the tensorboard for our experiment by randomly selecting data (Turquoise) and the best-performing method — diversity (Pink). Left: Training loss. Center: Validation accuracy. Right: Validation loss

Looking at the training logs plotted above, we can make a few interesting observations:

  • The training loss is higher for the diversity based method compared to random.
  • At the same time, the validation loss for the diversity based method is lower. This might indicate that selecting diverse data acts similarly to a regularisation method and reduces overfitting. Further investigation and experiments would be needed.

Now, let’s look at the validation accuracy numbers. We report the peak accuracy on the validation set among all epochs. We use two seeds for all experiments.

Comparison of different data selection strategies. We plot the validation accuracy and standard deviation based on two seeds. We use the operating point at 40 000 training samples to make predictions on the whole dataset for uncertainty margin as this method requires predictions. The approach of picking diverse samples outperforms all other methods in this case.

There are a few interesting observations. The different strategies yield different increases in accuracy. Adding more edge cases using diversity sampling can yield a huge boost in this case. We can get over four times more value for labeling the same number of phrases. In return, we could label 78% fewer samples for the same gain.

Comparison of the different methods against the baseline (randomly selecting 50k samples). We show the absolute increase in accuracy and the improvement factor over randomly adding more samples. When selecting data based on diversity, we see an absolute gain of 0.8% (4.6 times higher than with random selection).

The advantage of diversity sampling is reduced when we combine it with uncertainty sampling methods.

On the other hand, only relying on the prediction probability using methods like uncertainty margin is on par with random sampling.

We also note that the standard deviation of some experiments is too high to draw firm conclusions. Further experiments on larger datasets and/or with more seeds could bring additional clarity.

0-Shot Comparison

If you wonder how a 0-shot model would perform, I found a nice write-up evaluating the BigScience BLOOM model on the same validation set.

Although the SST2 dataset is considered simple for fine-tuned models, large LLMs still struggle to achieve good 0-shot performance on it.

How to integrate Active Learning into your Project?

To add active learning to your existing LLM project, you can take inspiration from the architecture OpenAI uses for its content moderation setup. In the illustration below, you see a regular training loop where data is collected from different sources and then used to train a model. The three Lightly blue blocks create a feedback loop from production data, where new data is selected and fed back into the training set.

Active learning helps your deployment stay up to date with changes in data (data drift) and continuously improves the model performance to reduce false positives and false negatives.

Example of a content moderation pipeline with Active Learning (modified illustration from https://arxiv.org/pdf/2208.03274.pdf)
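
In code, such a feedback loop can be as simple as the sketch below. Every function name here is a placeholder rather than a real Lightly or OpenAI API, and the selection step could be any of the strategies discussed above (the sketch reuses the hypothetical `combined_selection` helper).

```python
# Sketch of the feedback loop in the illustration: score new production data with the
# deployed model, select the most valuable samples, and hand them to the labeling team.
def active_learning_iteration(model, production_texts, embed_fn, labeling_queue, n_select):
    probs = model.predict_proba(production_texts)   # predictions of the deployed model
    embeddings = embed_fn(production_texts)         # e.g. MiniLM embeddings as above
    chosen = combined_selection(embeddings, probs, n_select)
    labeling_queue.extend(production_texts[i] for i in chosen)
    # Once labeled, these samples are added to the training set and the model is fine-tuned again.
```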

Conclusion

In this post, we looked at different strategies supported by LightlyOne to select training data and their impact on model accuracy for fine-tuning LLMs. Even though these are early results, we are very excited to see huge gains in data efficiency.

The learnings of these experiments can be transferred to other NLP use cases, such as building LLM safety layers and using LLMs for content moderation.

Igor Susmelj,
Co-Founder Lightly