Few Shot Learning and Zero Shot Learning

Large Language Models (LLMs) are among the most talked-about topics in NLP these days. Few shot learning and zero shot learning are transfer learning techniques that make the most of the vast pre-trained knowledge of these LLMs. In this blog post, let us understand what these terms mean and see them in action.

Zero Shot Learning (ZSL)

Zero shot learning is a process in which a language model (or any machine learning model) is expected to classify or generate text for input samples and target classes it has never seen before. The main objective here is to check how well the language model has acquired generic knowledge that can be applied to entirely new scenarios.

How does this work under the hood? One common approach relies on cosine similarity. The model produces an embedding for the test sentence and an embedding for each potential class label, and these embeddings are aligned in the same vector space. Then, the cosine similarity between the sentence and each label is calculated, and the class with the highest similarity to the input sentence is reported as the output class. For more detail, see: Zero shot Learning in NLP.
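The embedding-based idea can be sketched in a few lines. This is a minimal illustration, not the method of any particular library: the 3-dimensional vectors are made up for the example, and the helper names `cosine_similarity` and `zero_shot_classify` are hypothetical. In practice the embeddings would come from a sentence encoder.

```python
import numpy as np

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def zero_shot_classify(sentence_embedding, label_embeddings):
    """Return the most similar label and the similarity score of every label."""
    scores = {
        label: cosine_similarity(sentence_embedding, embedding)
        for label, embedding in label_embeddings.items()
    }
    best_label = max(scores, key=scores.get)
    return best_label, scores

# Toy embeddings, invented for illustration only (assumption, not model output).
label_embeddings = {
    "sports": [0.9, 0.1, 0.0],
    "politics": [0.1, 0.9, 0.1],
    "science": [0.0, 0.2, 0.9],
}
sentence_embedding = [0.8, 0.2, 0.1]  # stands in for an encoded sentence

best_label, scores = zero_shot_classify(sentence_embedding, label_embeddings)
print(best_label)
for label, score in scores.items():
    print(f"{label}: {score:.3f}")
```

No label-specific training happens here: adding a new class only requires an embedding for its name.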

Few Shot Learning (FSL)

Few shot learning is a process in which a model is expected to perform classification or text generation after being shown only a few data samples. Here, more context is provided to the model than in zero shot learning. We can generalize this as n-shot learning, where n = 0 gives zero shot learning and n = 1 gives one shot learning. The value of n is the number of examples you provide to the model.
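For the prompting setting, the n-shot idea can be sketched as a small helper that places the first n labelled examples in front of the query. The function name `build_n_shot_prompt` and the "Text:/Label:" layout are assumptions made for this illustration, not a required format.

```python
def build_n_shot_prompt(task, examples, query, n):
    """Build a prompt that shows the first n labelled examples before the query."""
    blocks = [task]
    for text, label in examples[:n]:
        blocks.append(f"Text: {text}\nLabel: {label}")
    # The query is left unlabelled so that the model completes it.
    blocks.append(f"Text: {query}\nLabel:")
    return "\n\n".join(blocks)

examples = [
    ("I loved the movie! It was amazing.", "Positive"),
    ("This book is terrible. I hated it.", "Negative"),
]
task = "Classify the sentiment of the text as Positive or Negative."
query = "The restaurant was fantastic."

zero_shot = build_n_shot_prompt(task, examples, query, n=0)  # no examples
one_shot = build_n_shot_prompt(task, examples, query, n=1)   # one example
few_shot = build_n_shot_prompt(task, examples, query, n=2)   # two examples
print(few_shot)
```

With n = 0 the prompt contains only the task description and the query, which is the zero shot case.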

The two main applications of these techniques are Classification and Prompting. Let us discuss them with code and examples.


In zero shot classification, a model that was trained on samples from a set of known classes has to predict samples from classes it has not seen. In standard supervised classification, the training and test samples come from the same label space. Here, the test samples come from a different label space than the training samples.

For example, imagine you run a new startup where people submit news articles from various fields. You want a system that places each article into a category automatically. Since you do not have much data to train a new model, you want to make use of the pre-trained knowledge of large language models. Let us see how this is achieved with zero shot classification using Transformers.

!pip install transformers

The HuggingFace transformers library provides a zero-shot-classification pipeline that can perform this task in a few lines of code. Load the facebook/bart-large-mnli model into the zero shot classification pipeline. Then define a few example news sentences and the corresponding target classes.

from transformers import pipeline

# Load the zero-shot classification pipeline
classifier = pipeline("zero-shot-classification", model="facebook/bart-large-mnli")

# Define candidate class labels
candidate_labels = ["sports", "politics", "science", "business"]

# Text samples to classify
text_samples = [
    "The team played exceptionally well and won the championship.",
    "The government announced new policies to boost the economy.",
    "Scientists discovered a new species of marine life in the deep sea.",
    "A new tech startup just received a significant investment.",
]

Now, we run the classifier for each of the news articles and print the results.

# Perform zero-shot classification
for text_sample in text_samples:
    classification_result = classifier(
        text_sample,
        candidate_labels,
        multi_label=False,  # Set to False for single-label classification
    )

    # Extract the predicted label and confidence score (labels are sorted by score)
    predicted_label = classification_result["labels"][0]
    confidence_score = classification_result["scores"][0]

    print(f"Text: {text_sample}")
    print(f"Predicted Class: {predicted_label}")
    print(f"Confidence Score: {confidence_score:.4f}\n")

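From the output below, we can see that the model classified all 4 examples correctly without any fine-tuning. The confidence score quantifies how confident the model is that the sample belongs to a particular class. Technically speaking, facebook/bart-large-mnli is a natural language inference (NLI) model: the pipeline turns each candidate label into a hypothesis and, with multi_label=False, applies a softmax over the entailment scores of all labels. The score is therefore not a cosine similarity between sentence and label embeddings. This is an example of zero shot learning with the BART model.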

Text: The team played exceptionally well and won the championship.
Predicted Class: sports
Confidence Score: 0.9589

Text: The government announced new policies to boost the economy.
Predicted Class: politics
Confidence Score: 0.7458

Text: Scientists discovered a new species of marine life in the deep sea.
Predicted Class: science
Confidence Score: 0.9851

Text: A new tech startup just received a significant investment.
Predicted Class: business
Confidence Score: 0.9603
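To see where such confidence scores come from, here is a rough sketch of the single-label scoring step used with an NLI model: each label becomes a hypothesis (the pipeline's default template is "This example is {}."), the model produces one entailment logit per hypothesis, and a softmax over those logits gives the scores. The logits below are hypothetical numbers, not real model output, and `scores_from_entailment_logits` is a name invented for this illustration.

```python
import numpy as np

def scores_from_entailment_logits(entailment_logits):
    """Softmax over the per-label entailment logits (single-label case)."""
    logits = np.asarray(entailment_logits, dtype=float)
    shifted = logits - logits.max()  # subtract the max for numerical stability
    exp = np.exp(shifted)
    return exp / exp.sum()

candidate_labels = ["sports", "politics", "science", "business"]

# Each label is turned into a hypothesis that the NLI model scores against the text.
hypotheses = [f"This example is {label}." for label in candidate_labels]

# Hypothetical entailment logits, one per hypothesis (assumption for illustration).
entailment_logits = [4.0, 0.5, 0.2, 0.8]

scores = scores_from_entailment_logits(entailment_logits)
ranked = sorted(zip(candidate_labels, scores), key=lambda pair: pair[1], reverse=True)
for label, score in ranked:
    print(f"{label}: {score:.4f}")
```

Because of the softmax, the scores of all candidate labels sum to 1, so adding or removing a label changes the score of every other label.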

Few shot classification is a form of supervised fine-tuning that uses a similar approach but with much less training data. Consider the task of sentiment classification of user reviews with the BERT model. Let us define a function that runs the classifier and prints the sentiment and the confidence scores.

from transformers import BertForSequenceClassification, BertTokenizer, pipeline

def get_few_shot_output(model_directory):
    # Load the model from a directory (or a model name on the HuggingFace Hub)
    tokenizer = BertTokenizer.from_pretrained(model_directory)
    model = BertForSequenceClassification.from_pretrained(model_directory)

    # Define the text classification pipeline with the given model
    classifier = pipeline("text-classification", model=model, tokenizer=tokenizer)

    # Map the default label names of the model to readable class names
    class_label = {
        "LABEL_0": "Positive",
        "LABEL_1": "Negative",
    }

    # Text samples to classify
    text_samples = [
        "I loved the movie! It was amazing.",
        "This book is terrible. I hated it.",
        "The restaurant was fantastic. I had a great time.",
        "The service at the hotel was awful. I had a terrible experience.",
    ]

    # Classify each sample and print the result
    for text_sample in text_samples:
        classification_result = classifier(text_sample)
        # Extract the predicted label and confidence score
        predicted_label = class_label[classification_result[0]["label"]]
        confidence_score = classification_result[0]["score"]

        print(f"Text: {text_sample}")
        print(f"Predicted Label: {predicted_label}")
        print(f"Confidence Score: {confidence_score}")

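As we can see here, before fine-tuning the model predicts Positive for all 4 sentences, which is incorrect for two of them. This is expected: the classification head of a pre-trained BERT model is randomly initialized until it is fine-tuned, so its predictions are essentially arbitrary.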


Text: I loved the movie! It was amazing.
Predicted Label: Positive
Confidence Score: 0.6037021279335022

Text: This book is terrible. I hated it.
Predicted Label: Positive
Confidence Score: 0.6179338693618774

Text: The restaurant was fantastic. I had a great time.
Predicted Label: Positive
Confidence Score: 0.6017237305641174

Text: The service at the hotel was awful. I had a terrible experience.
Predicted Label: Positive
Confidence Score: 0.6129955649375916

Now, let us fine-tune and save a new BERT model with a few similar examples and see if the results improve. For this, we prepare a dataset with 4 samples and convert it into a PyTorch dataset.

from transformers import BertTokenizer

# Prepare a small dataset for few-shot classification
# Ideally, you would have more examples, but for this example, we'll use a small dataset
dataset = [
    {"text": "I loved the book! It was amazing.", "label": "Positive"},
    {"text": "This book is terrible. I hated it.", "label": "Negative"},
    {"text": "I had a terrible experience with the service there.", "label": "Negative"},
    {"text": "The ambience at the store was fantastic.", "label": "Positive"},
]

# Define the class labels; the list index of each label is its numeric id
candidate_labels = ["Positive", "Negative"]

# Tokenize the dataset with the tokenizer of the base model
tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
tokenized_dataset = tokenizer([item["text"] for item in dataset], truncation=True, padding=True)
tokenized_dataset["labels"] = [candidate_labels.index(item["label"]) for item in dataset]

# Convert to PyTorch Dataset
import torch
from torch.utils.data import Dataset, DataLoader

class CustomDataset(Dataset):
    def __init__(self, tokenized_dataset):
        self.input_ids = torch.tensor(tokenized_dataset["input_ids"])
        self.attention_mask = torch.tensor(tokenized_dataset["attention_mask"])
        self.labels = torch.tensor(tokenized_dataset["labels"])

    def __len__(self):
        return len(self.input_ids)

    def __getitem__(self, idx):
        # Return one sample in the format expected by the model
        return {
            "input_ids": self.input_ids[idx],
            "attention_mask": self.attention_mask[idx],
            "labels": self.labels[idx],
        }

custom_dataset = CustomDataset(tokenized_dataset)

Next, we can use this custom dataset to train the bert-base-uncased model and save it in a new folder. The Trainer class from the HuggingFace transformers library is used to perform the training. With the function defined at the beginning, we can then compare the outputs of the original and the fine-tuned model.

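from transformers import BertForSequenceClassification, Trainer, TrainingArguments

# Load the base model with a two-class head (the original loading code is not shown)
model = BertForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)
# Define training arguments
training_args = TrainingArguments(
    output_dir="./few_shot_model",  # assumed path, not from the original
    num_train_epochs=10,  # assumed value, not from the original
    eval_steps=100,  # Evaluate every 100 steps
    save_steps=100,  # Save every 100 steps
)
# Initialize the Trainer
trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=custom_dataset,
    data_collator=lambda features: {
        "input_ids": torch.stack([feature["input_ids"] for feature in features]),
        "attention_mask": torch.stack([feature["attention_mask"] for feature in features]),
        "labels": torch.stack([feature["labels"] for feature in features]),
    },
)
# Fine-tune the model, then save it together with the tokenizer
trainer.train()
trainer.save_model(training_args.output_dir)
tokenizer.save_pretrained(training_args.output_dir)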

The saved model is now used for classification. After tuning on a dataset of merely 4 samples, the classification is correct for all 4 sentences, as we can see below.


Text: I loved the movie! It was amazing.
Predicted Label: Positive
Confidence Score: 0.7981343269348145

Text: This book is terrible. I hated it.
Predicted Label: Negative
Confidence Score: 0.8204653263092041

Text: The restaurant was fantastic. I had a great time.
Predicted Label: Positive
Confidence Score: 0.7870051860809326

Text: The service at the hotel was awful. I had a terrible experience.
Predicted Label: Negative
Confidence Score: 0.7408967614173889

Note: The same can be achieved by prompting a large language model, but here we use the BERT model for our classification.


Another important application of n-shot learning paradigms is prompting. Prompt engineering is a hot topic nowadays thanks to the AI revolution. Let us explore, with an example, how few shot prompting can improve on the output an LLM gives for a zero shot prompt.

Zero shot prompting means giving a prompt without any additional context or examples. Here, you expect the model to use its existing pre-trained knowledge to perform your task, just as in the zero shot classification task.

For example, let us say you are a fan of fantasy stories and you want to generate a new name for yourself that resembles names in this genre. You could do this with a general zero shot prompt like the one below, for which the model generates “Balthazar”. The name is good, but it is generic and may not be to your taste.

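import openai  # Legacy SDK interface (openai<1.0)

openai.api_key = "<API_KEY>"  # Replace with your own API key

prompt = "Create a unique fantasy character name."
response_zero_shot = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=20,  # assumed value, not from the original
)

print("Zero-Shot Prompt:")
print(response_zero_shot.choices[0].text.strip())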

Output: Balthazar

Now, imagine that you want a unique name that resembles the names in the famous HBO fantasy series House of the Dragon. OpenAI’s text-davinci-002 model may not be familiar with the names in House of the Dragon. In this scenario, we can steer the model toward the desired style by specifying example names, i.e., by performing few shot prompting as shown below.

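prompt = """
Create a new unique fantasy character name:
1. Visenya
2. Rhaenyra
3. Rheana
"""
response_few_shot = openai.Completion.create(
    model="text-davinci-002",
    prompt=prompt,
    max_tokens=50,  # assumed value, not from the original
)

print("\nFew-Shot Prompt (With Examples):")
print(response_few_shot.choices[0].text.strip())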

4. Jyana
5. Vayana
6. Sareyna
7. Rydera

The output names now sound more like the ones in House of the Dragon, and you can choose one of them as your alias. Here, we did not have to fine-tune the model with the scripts from the show to generate these names. This example illustrates the power of few shot learning with LLMs.

Pros and Cons

These techniques usually work well with large language models because some existing knowledge is needed for a model to learn quickly. The paper "Language Models are Few-Shot Learners" reports that models like GPT-3 can handle simple tasks in a few shot setting using only text prompts, without any additional fine-tuning.

These techniques are usually used for smaller datasets and simple tasks, because the amount of context a model needs grows with the complexity of the task. For example, they are great for text classification but might struggle with complex reasoning tasks.

An additional disadvantage of these approaches is the number of tokens needed for such prompts. Consider the case of using OpenAI’s GPT-3 or a later model for an NLP task. The cost depends on the number of tokens in the prompt you send and in the output you receive. If you have to include few shot examples in every prompt, the costs add up quickly. In this scenario, it might be cheaper to go for fine-tuning.
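The cost argument can be made concrete with a little arithmetic. The sketch below uses entirely hypothetical numbers (request volume, token counts and a flat price per 1,000 tokens); real pricing differs by model and often distinguishes input from output tokens. The function name `prompting_cost` is invented for this illustration.

```python
def prompting_cost(num_requests, instruction_tokens, example_tokens,
                   num_examples, output_tokens, price_per_1k_tokens):
    """Estimated cost when every request repeats the same few shot examples."""
    tokens_per_request = (
        instruction_tokens + num_examples * example_tokens + output_tokens
    )
    return num_requests * tokens_per_request * price_per_1k_tokens / 1000

# All numbers below are hypothetical, chosen only to show the arithmetic.
common = dict(
    num_requests=100_000,
    instruction_tokens=20,
    example_tokens=30,
    output_tokens=10,
    price_per_1k_tokens=0.002,
)
zero_shot_cost = prompting_cost(num_examples=0, **common)
few_shot_cost = prompting_cost(num_examples=8, **common)

print(f"Zero shot: ${zero_shot_cost:.2f}")
print(f"Few shot:  ${few_shot_cost:.2f}")
```

Under these assumed numbers, eight examples per request make the few shot setup nine times as expensive as the zero shot one, which is the kind of overhead that can make a one-time fine-tuning run worthwhile.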


Finally, in this blog we have introduced the concept of zero and few shot learning with the applications of classification and prompting.

  • Zero shot and few shot learning are techniques that help make use of the pre-trained knowledge of large language models.
  • They have applications in NLP and computer vision such as image classification, text classification and text generation.
  • However, they are less suitable for complex tasks and may not generalize very well. Beyond a certain point, fine-tuning with a larger dataset may be necessary.

