Presented by Dr. Suvrit Sra, MIT, Oct 19 2021
Notes by Emily Liu
Currently, neural networks can perform well on supervised learning tasks but fall short on tasks with unlabeled data. However, many machine learning problems have large amounts of unlabeled data. How do we leverage these data in our models?
In some tasks, it is possible to generate metrics from the data that are correlated with the target label. Because these labels are proxies for the target as opposed to preexisting ground truths, we refer to this class of techniques as weakly supervised learning.
Metric learning is an example of similarity-driven learning. Intuitively, if we can learn the distance between two data points (really their dissimilarity, since distance is inversely related to similarity), then points that are close under this distance are likely to share the same label.
- This idea is the forerunner of "modern representation learning".
Let $d_A(x, y) = (x - y)^\top A (x - y)$ define a family of (squared) distances parameterized by a symmetric matrix $A \succeq 0$.
Another way to think about this is in terms of pairs of points. If you randomly draw two data points, they can either be similar (for example, belong to the same classification category) or dissimilar.
Now, we are able to define two sets: $S$, the set of similar pairs $(x_i, x_j)$, and $D$, the set of dissimilar pairs.
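To make the setup concrete, here is a minimal sketch (assuming class labels are available, as in the classification example above; the helper name is my own) of how the similar and dissimilar pair sets could be enumerated:

```python
from itertools import combinations

def similar_dissimilar_pairs(labels):
    # S: index pairs with matching labels; D: index pairs with differing labels
    S, D = [], []
    for (i, yi), (j, yj) in combinations(enumerate(labels), 2):
        (S if yi == yj else D).append((i, j))
    return S, D

S_pairs, D_pairs = similar_dissimilar_pairs([0, 0, 1, 1])
```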
We want to learn a linear transformation $L$ of the data such that $d_A$, with $A = L^\top L$, is small on similar pairs and large on dissimilar pairs; equivalently, $d_A(x, y) = \|Lx - Ly\|_2^2$.
We note that in the expression $(x - y)^\top A (x - y)$, taking $A = I$ recovers the ordinary squared Euclidean distance. The matrix $A$ must be symmetric positive semidefinite for $d_A$ to be a valid (pseudo)distance.
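As a quick sketch, the distance $d_A$ used below can be computed with a small helper (hypothetical name, using the squared form $(x - y)^\top A (x - y)$):

```python
import numpy as np

def d_A(x, y, A):
    # squared Mahalanobis-style distance (x - y)^T A (x - y); A must be PSD
    diff = x - y
    return diff @ A @ diff

x, y = np.array([1.0, 2.0]), np.array([3.0, 1.0])
dist = d_A(x, y, np.eye(2))  # A = I reduces to squared Euclidean distance
```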
First, we note that a naive formulation of this task
$$
\min_{A \succeq 0} \sum_{(x_i, x_j) \in S} d_A(x_i, x_j) - \lambda \sum_{(x_i, x_j) \in D} d_A(x_i, x_j)
$$
fails empirically because poor scaling or a poor choice of $\lambda$ leads to degenerate solutions: rescaling $A$ rescales both sums by the same factor, so the objective is either driven to $A = 0$ or unbounded below.
Instead, we write the minimization task as follows: $$ \min_{A \succeq 0} \sum_{(x_i, x_j) \in S} d_A(x_i, x_j) + \sum_{(x_i, x_j) \in D} d_{A^{-1}}(x_i, x_j) $$
Intuitively, the inverse of a matrix induces the opposite behavior: pairs that are far apart under $A$ are close under $A^{-1}$. So instead of subtracting the dissimilar-pair distances, we add them measured under $A^{-1}$; minimizing $d_{A^{-1}}$ over dissimilar pairs pushes those pairs apart under $A$.
We then define two matrices (overloading notation slightly) $S := \sum_{(x_i, x_j) \in S} (x_i - x_j)(x_i - x_j)^\top$ and $D := \sum_{(x_i, x_j) \in D} (x_i - x_j)(x_i - x_j)^\top$, so the two sums above become $\operatorname{tr}(AS)$ and $\operatorname{tr}(A^{-1}D)$.
Given this formulation, we reach an equivalent optimization problem with a closed-form solution: $$ \min_{A \succ 0} \; \operatorname{tr}(AS) + \operatorname{tr}(A^{-1}D), $$ whose optimality condition $S = A^{-1} D A^{-1}$ (equivalently, $ASA = D$) is solved by $$ A = S^{-1/2} \left( S^{1/2} D S^{1/2} \right)^{1/2} S^{-1/2}, $$ where $A$ is the matrix geometric mean of $S^{-1}$ and $D$.
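A minimal numerical sketch of this closed form (assuming the scatter-matrix formulation, with hypothetical helpers `psd_sqrt` and `metric_geometric_mean`), built on random stand-ins for $S$ and $D$:

```python
import numpy as np

def psd_sqrt(M):
    # symmetric PSD square root via eigendecomposition
    w, V = np.linalg.eigh(M)
    return (V * np.sqrt(np.clip(w, 0.0, None))) @ V.T

def metric_geometric_mean(S, D):
    # A = S^{-1/2} (S^{1/2} D S^{1/2})^{1/2} S^{-1/2}, which solves A S A = D
    S_half = psd_sqrt(S)
    S_half_inv = np.linalg.inv(S_half)
    return S_half_inv @ psd_sqrt(S_half @ D @ S_half) @ S_half_inv

# stand-in scatter matrices built from random pair differences (ridge added
# so both matrices are strictly positive definite)
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 5))
S = sum(np.outer(d, d) for d in np.diff(X[:10], axis=0)) + 1e-3 * np.eye(5)
D = sum(np.outer(d, d) for d in np.diff(X[10:], axis=0)) + 1e-3 * np.eye(5)
A = metric_geometric_mean(S, D)
```

The returned $A$ can be checked against the stationarity condition $ASA = D$ directly.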
In a task where we have lots of unlabelled data, we would likely want to use self-supervised representation learning. In self-supervised representation learning, our goal is to learn a lower-dimensional representation (which we will call $f(x)$) of the input data that captures structure useful for downstream tasks.
Since we don't have a supervised task (labels), we will need to come up with a pretext task on the unlabeled data: a task whose labels can be generated automatically from the data itself.
For example, in vision tasks, we can use predicting relative patch locations as a pretext for object detection, visual data mining, etc. In language tasks, we can use word prediction as a pretext for question answering or sentiment analysis.
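As a concrete sketch of the relative-patch-location pretext, one training triple could be generated roughly as follows (a hypothetical minimal version; published pipelines additionally add gaps and jitter between patches to block trivial shortcuts):

```python
import numpy as np

def patch_location_example(image, patch=8, rng=None):
    # Build one (anchor patch, neighbor patch, position label) triple.
    # The neighbor is one of the 8 patches surrounding the anchor in a
    # 3x3 grid; the label says which of the 8 positions it came from.
    if rng is None:
        rng = np.random.default_rng()
    H, W = image.shape[:2]
    # pick an anchor whose full 3x3 patch neighborhood fits in the image
    r = rng.integers(patch, H - 2 * patch + 1)
    c = rng.integers(patch, W - 2 * patch + 1)
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, -1),
               (0, 1), (1, -1), (1, 0), (1, 1)]
    label = int(rng.integers(8))
    dr, dc = offsets[label]
    anchor = image[r:r + patch, c:c + patch]
    neighbor = image[r + dr * patch:r + (dr + 1) * patch,
                     c + dc * patch:c + (dc + 1) * patch]
    return anchor, neighbor, label

anchor, neighbor, label = patch_location_example(np.zeros((32, 32)), patch=8)
```

A model trained to predict `label` from the two patches must learn spatial structure, which is the representation we then reuse.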
More generally, we want our pretext tasks to reflect contrastive learning: positive pairs (often generated from the same original data point) should be embedded close together and far from negative samples. This property should hold even under slight perturbations of the data (invariance of the pretext features).
Question: why do pretrained representations help?
(Robinson et al 2020)
If the central condition holds and the pretext task has a fast learning rate (e.g., excess risk decaying as $O(1/n)$ in the number of pretext samples $n$), then the target task built on the learned representation inherits a fast rate in the number of labeled samples as well.
At a high level, this means that if the pretext task can be trained well on the large pool of unlabeled data, then only a comparatively small number of labeled examples is needed to learn the target task.
We want to learn similarity scores such that positive pairs score more similar to each other than negatives:
$$ \min_f \; \mathbb{E}_{x,\, x^+,\, \{x_i^-\}_{i=1}^N}\left[-\log \frac{e^{f(x)^\top f(x^+)}}{e^{f(x)^\top f(x^+)} + \sum_{i=1}^N e^{f(x)^\top f(x_i^-)}}\right] $$
In this formulation, $x$ is the anchor point, $x^+$ is a positive sample, $x_i^-$ are negative samples, and $f$ is the embedding function we are learning.
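The loss above can be sketched directly in NumPy (a toy version on fixed embedding vectors; real training backpropagates through $f$):

```python
import numpy as np

def info_nce(f_x, f_pos, f_negs):
    # -log( e^{f(x)^T f(x+)} / ( e^{f(x)^T f(x+)} + sum_i e^{f(x)^T f(x_i^-)} ) )
    logits = np.concatenate(([f_x @ f_pos], f_negs @ f_x))
    logits = logits - logits.max()  # subtract max for numerical stability
    return -np.log(np.exp(logits[0]) / np.exp(logits).sum())

rng = np.random.default_rng(0)
d, N = 16, 10
f_x = rng.normal(size=d)
f_negs = rng.normal(size=(N, d))
loss_aligned = info_nce(f_x, f_x, f_negs)                # positive matches anchor
loss_random = info_nce(f_x, rng.normal(size=d), f_negs)  # unrelated "positive"
```

When the positive embedding matches the anchor, the loss is much smaller than when the "positive" is an unrelated vector, which is exactly the behavior the objective rewards.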
With this formulation, you will need to generate positive and negative examples from your anchor point. Positive examples are easily generated via a random combination of data augmentations (since a slightly perturbed version of $x$ should remain similar to $x$). Negative examples are typically drawn uniformly from the dataset, but this can yield false negatives: samples that are actually similar to the anchor.
Removing false negatives improves the generalization capabilities of the model. However, since the training data is unlabelled, we are not able to identify whether or not a given negative is a false negative. What we can do is use the positive and uniform samples to approximate the distributions of true and false negatives.
On the other hand, easy negatives (which would have a very low similarity score) are not useful for improving model performance. It is more useful to sample "hard" negatives: negative samples whose similarity scores are closer to those of positives, and which the model is more likely to be "wrong" on.
- Uniform: sample negatives from the marginal distribution $p(x)$.
- Debiased negatives (no false negatives): sample from the true-negative distribution $p(x^- \mid x^- \text{ dissimilar to } x)$, approximated using the positive and uniform samples.
- Hard negatives: sample from a tilted distribution $q_\beta(x^-) \propto e^{\beta f(x)^\top f(x^-)}\, p(x^-)$, which upweights negatives embedded close to the anchor.
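One way to implement hard-negative sampling, as a hedged sketch: reweight uniformly drawn negatives in proportion to $e^{\beta f(x)^\top f(x^-)}$, so that negatives scoring closer to the anchor count more (the exponential tilt and helper name here are assumptions of this sketch):

```python
import numpy as np

def hard_negative_weights(f_x, f_negs, beta=1.0):
    # importance weights proportional to exp(beta * similarity to anchor),
    # normalized to sum to 1 (a softmax over negative similarities)
    logits = beta * (f_negs @ f_x)
    w = np.exp(logits - logits.max())  # subtract max for stability
    return w / w.sum()

f_x = np.array([1.0, 0.0])
f_negs = np.array([[0.9, 0.1],    # hard negative: near the anchor
                   [-1.0, 0.0]])  # easy negative: far from the anchor
w = hard_negative_weights(f_x, f_negs, beta=2.0)
```

The hard negative receives a larger weight, so it is sampled (or counted in the loss) more often than the easy one.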
- Semi-supervised learning is used when only a fraction of the dataset is labelled.
- Weakly supervised learning uses proxy labels generated from the data (e.g., via a pretext task) as a stand-in for the target labels.
- We want similar points to be close in embedding space and dissimilar points to be farther away.
- Linear metric learning is a precursor to modern representation learning; it applies a learned linear transformation to the points, and the optimal metric can be found analytically.
- In self-supervised learning, it is possible to pretrain a model on a large unlabelled dataset using a pretext task and then fine-tune the same model on the target task to achieve reasonable performance.
- In contrastive learning, we generate positive and negative points from an anchor point in the dataset. The positive point is an augmented or perturbed version of the anchor point. The negative point is selected to be dissimilar from the anchor. However, hard (less dissimilar) negatives improve model performance more than easy (very dissimilar) negatives.