Presented by Dr. Pulkit Agrawal, MIT, Nov 23 2021
Notes by Emily Liu
The general motivation for transfer and few-shot learning is that, after pretraining a model, we should be able to learn a new, unseen task faster, or perform an unseen task that is more complex than the training tasks.
Previously, we covered multitask and transfer learning. In multitask learning, we learn $N$ i.i.d. tasks simultaneously, reasoning that many features will be shared between them. In transfer learning, there is a domain shift in the data, so we finetune our pretrained model to fit the new domain.
In few-shot learning, we update model parameters at test time, often to handle new object categories. As the name suggests, this is (intuitively) doable with far fewer data points than training from scratch requires (in practice, roughly 50 to a few hundred labelled finetuning examples).
When finetuning a network, there is a question of which layers we should update. Empirically, when there is little data, finetuning the last two layers is sufficient, but with more data we can finetune all the layers.
Suppose we have a two-class few-shot learning problem where the training set consists of one sample from each class, $(x_1, y_1)$ and $(x_2, y_2)$. The test set is a single sample belonging to one of the two classes, and we need to figure out which.
One strategy is to use the neural network to compute embeddings for the training points ($f(x_1)$, $f(x_2)$) and for the test point, then assign the test point to the class whose embedding is closest.
One way to train for this is the Siamese network: evaluate pairs of similar or dissimilar samples on the same (weight-shared) neural network. If the samples are similar, the cosine similarity between their embeddings should be close to 1, and if the samples are different, it should be close to 0.
Concepts from the Siamese network can be extended to multiple classes using the matching network. Matching networks take a support set $S = \{(x_i, y_i)\}_{i=1}^{k}$ of labelled examples together with a query image $\hat{x}$, and embed both the support examples and the query.
The likelihood of the query belonging to class $c$ is an attention-weighted sum over the support labels, $P(\hat{y} = c \mid \hat{x}, S) = \sum_i a(\hat{x}, x_i)\,\mathbf{1}[y_i = c]$, where the attention $a(\hat{x}, x_i)$ is a softmax over the cosine similarities between the query embedding and the support embeddings.
Additionally, it has been noted that the useful features of the support set can depend on what images we want to classify. Therefore, performance can be improved further with contextual embeddings that change per task.
We can think of the pretrained parameters as encoding a prior learned from large-scale data; with a strong enough prior, some new tasks can be performed zero-shot, with no finetuning at all. Contrastive pretraining of an image encoder and a text encoder is one example.
The task is to match images with captions. We pass the images through an image encoder and the captions through a text encoder, training both so that matching pairs capture similar information. We then take the cosine similarity between the image and text encodings; higher similarity means a more likely image-caption match.
Given a new (uncaptioned) image, we pass it through the image encoder and find the closest text encoding, thereby "generating" a caption for the image without any extra training.
In sequential learning tasks, where we take a pretrained set of parameters and continually finetune on new tasks, we run into the issue of catastrophic forgetting: the network trained on new tasks is no longer able to perform the old tasks.
To deal with catastrophic forgetting, we can remember the weights from each task and feed features from the previously trained networks as additional inputs into each layer of the new network (ProgressiveNet).
- Few-shot learning is finetuning on a few unseen examples. In practice, few-shot learning can be implemented via Siamese networks or matching networks.
- Contrastive pretraining/zero-shot learning leverages big data to "generate" labels for unseen test images.
- Catastrophic forgetting occurs when a finetuned network is no longer able to "remember" old tasks. It can be mitigated by feeding features from previously trained networks into the new network (as in ProgressiveNet).