For training purposes, the paper proposed the following steps:
3. Then use the fine-tuned model on the object detection dataset.

The following is a basic pipeline of RCNN
![rcnn](https://huggingface.co/datasets/hf-vision/course-assets/resolve/main/RCNN.png)


#### Fast RCNN
YOLO is a single-step detector, where the bounding box and the class of the object are predicted at the same time.
#### Reframing Object Detection
YOLO reframes the Object Detection task as a single regression problem, which predicts bounding box coordinates and class probabilities.

In this design, we divide the image into an $$S \times S$$ grid. If the center of an object falls into a grid cell, that grid cell is responsible for detecting that object. We can define $$B$$ as the maximum number of objects to be detected in each cell. So each grid cell predicts $$B$$ bounding boxes, including a confidence score for each box.

#### Confidence
The confidence score of a bounding box should reflect how accurately the box was predicted. It should be close to the IOU (Intersection over Union) of the ground truth box versus the predicted box. If the grid was not supposed to predict a box, then it should be zero. So this should encode the probability of the center of the box being present in the grid and the correctness of the bounding box.
Expand All @@ -86,10 +86,10 @@ Formally,
$$\text{confidence} := P(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}}$$
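Since confidence is defined through IOU, it helps to see how IOU is computed. Below is a minimal sketch (the helper name and the `(x1, y1, x2, y2)` corner format are our own choices, not from the paper):

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes in (x1, y1, x2, y2) corner format."""
    # Corners of the intersection rectangle
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp to zero so disjoint boxes give zero intersection
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

print(iou((0, 0, 2, 2), (1, 1, 3, 3)))  # intersection 1, union 7 -> ~0.1429
```

A perfect prediction gives IOU 1, disjoint boxes give 0, so the confidence target above ranges between 0 and 1.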

#### Coordinates
The coordinates of a bounding box are encoded in 4 numbers $$(x, y, w, h)$$. The $$(x, y)$$ coordinates represent the center of the box relative to the bounds of the grid cell. The width and height are normalized to image dimensions.
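To make the two normalization schemes concrete, here is a sketch of decoding one prediction back to pixel coordinates (the function name and argument layout are hypothetical, chosen for illustration):

```python
def decode_box(pred, row, col, S, img_w, img_h):
    """Convert a YOLO-style (x, y, w, h) prediction to absolute pixel values.

    x, y: box center relative to its grid cell (0..1).
    w, h: width and height normalized by the full image dimensions.
    """
    x, y, w, h = pred
    cx = (col + x) / S * img_w  # absolute center x in pixels
    cy = (row + y) / S * img_h  # absolute center y in pixels
    return cx, cy, w * img_w, h * img_h

# A box centered in cell (3, 3) of a 7x7 grid on a 448x448 image
print(decode_box((0.5, 0.5, 0.25, 0.5), row=3, col=3, S=7, img_w=448, img_h=448))
# -> (224.0, 224.0, 112.0, 224.0), i.e. the image center
```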

#### Class
The class probabilities form a $$C$$-long vector, representing the conditional probability of each class given that an object exists in the cell. Each grid cell predicts only one such vector, i.e. a single class is assigned to each grid cell, so all $$B$$ bounding boxes predicted by that grid cell share the same class.

Formally,
$$C_i = P(\text{class}_i \mid \text{Object})$$
$$
\begin{align}
C_i \times \text{confidence} &= P(\text{class}_i \mid \text{Object}) \times P(\text{Object}) \times \text{IOU}_{\text{pred}}^{\text{truth}} \\
&= P(\text{class}_i) \times \text{IOU}_{\text{pred}}^{\text{truth}}
\end{align}
$$

To recap, we have an image divided into an $$S \times S$$ grid. Each grid cell contains $$B$$ bounding boxes, each consisting of 5 values (confidence + 4 coordinates), plus a $$C$$-long vector of conditional class probabilities. So each grid cell is a $$B \times 5 + C$$ long vector, and the whole grid is $$S \times S \times (B \times 5 + C)$$.
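The shape arithmetic is worth checking once by hand, using the YOLOv1 values introduced later in this section:

```python
S, B, C = 7, 2, 20  # YOLOv1 defaults: 7x7 grid, 2 boxes per cell, 20 classes

per_cell = B * 5 + C   # 2 boxes x (4 coordinates + confidence) + class vector
total = S * S * per_cell

print(per_cell, total)  # 30 values per cell, 1470 for the whole grid
```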

So if we have a learnable system which converts an image to an $$S \times S \times (B \times 5 + C)$$ feature map, we are one step closer to the task.

#### Network Architecture
In the original YOLOv1 design, the input is an RGB image of size $$448 \times 448$$. The image is divided into an $$S \times S = 7 \times 7$$ grid, where each grid cell is responsible for detecting $$B = 2$$ bounding boxes and $$C = 20$$ classes.

The network architecture is a simple convolutional neural network. The input image is passed through a series of convolutional layers, followed by a fully connected layer. The final layer outputs are reshaped to $$7 \times 7 \times (2 \times 5 + 20) = 7 \times 7 \times 30$$.

The YOLOv1 design took inspiration from GoogLeNet, which used $$1 \times 1$$ convolutions to reduce the depth of the feature maps. This was done to reduce the number of parameters and the amount of computation in the network. The network has 24 convolutional layers followed by 2 fully connected layers. It uses a linear activation function for the final layer, and all other layers use the leaky rectified linear activation:

$$
\phi(x) = \begin{cases} x, & \text{if } x > 0 \\ 0.1x, & \text{otherwise} \end{cases}
$$

See the figure below for the architecture of YOLOv1.
#### Training
The network is trained end-to-end on the image and the ground truth bounding boxes. The loss function is a sum of squared error loss. The loss function is designed to penalize the network for incorrect predictions of bounding box coordinates, confidence and class probabilities. We will discuss the loss function in the next section.

YOLO predicts multiple bounding boxes per grid cell. At training time, we only want one bounding box predictor to be responsible for each object. We assign one predictor to be “responsible” for predicting an object based on which prediction has the highest current IOU with the ground truth. This leads to specialization between the bounding box predictors. Each predictor gets better at predicting certain sizes, aspect ratios, or classes of objects, improving overall recall. We will encode this information in the loss function for grid cell $$i$$ and bounding box $$b$$ using $$\mathbb{1}_{ib}^{\text{obj}}$$. $$\mathbb{1}_{ib}^{\text{noobj}}$$ is the opposite of $$\mathbb{1}_{ib}^{\text{obj}}$$.
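The "responsible predictor" assignment can be sketched as picking, among the $$B$$ predicted boxes of a cell, the one with the highest IOU against the ground truth (the helper names here are our own; boxes use `(x1, y1, x2, y2)` corners):

```python
def iou(a, b):
    # a, b in (x1, y1, x2, y2) corner format
    iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = iw * ih
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0

def responsible_box(predicted_boxes, gt_box):
    """Index of the predictor with the highest IOU against the ground truth."""
    return max(range(len(predicted_boxes)), key=lambda b: iou(predicted_boxes[b], gt_box))

preds = [(0, 0, 10, 10), (2, 2, 12, 12)]
print(responsible_box(preds, (3, 3, 12, 12)))  # second box overlaps more -> 1
```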

##### Loss Function
Now that we have a learnable system which converts an image to an $$S \times S \times (B \times 5 + C)$$ feature map, we need to train it.

A simple function to train such a system is to use a sum of squared error loss. We can use the squared error between the predicted values and the true values, i.e for bounding box coordinates, confidence and class probabilities.

The loss for each grid cell $$(i)$$ can look like this:

$$
\mathcal{L}^{i} = \mathcal{L}^{i}_{\text{coord}} + \mathcal{L}^{i}_{\text{conf}} + \mathcal{L}^{i}_{\text{class}}
$$

where
- $$\mathbb{1}_{ib}^{\text{obj}}$$ is 1 if the $$b$$-th bounding box in the $$i$$-th grid cell is responsible for detecting the object, 0 otherwise.
- $$\mathbb{1}_i^\text{obj}$$ is 1 if the $$i$$-th grid cell contains an object, 0 otherwise.

But this loss function does not necessarily align well with the task of object detection. The simple addition of losses for both tasks (classification and localization) weights the loss equally.

To rectify, YOLOv1 uses a weighted sum of squared error loss. First, we assign a separate weight to localization error called $$\lambda_{\text{coord}}$$. It is usually set to 5.

So the loss for each grid cell $$(i)$$ can look like this:
$$
\mathcal{L}^{i} = \lambda_{\text{coord}} \mathcal{L}^{i}_{\text{coord}} + \mathcal{L}^{i}_{\text{conf}} + \mathcal{L}^{i}_{\text{class}}
$$

In addition, many grid cells do not contain objects. The confidence is close to zero and thus the grid cells containing the objects often overpower the gradients. This makes the network unstable during training.

To rectify, we also weigh the loss from the confidence predictions in the grid cells that do not contain objects lower than in the grid cells that contain objects. We use a separate weight for the confidence loss called $$\lambda_{\text{noobj}}$$, which is usually set to 0.5.

So the confidence loss for each grid cell $$(i)$$ can look like this:
$$
\mathcal{L}^{i}_{\text{conf}} = \sum_{b=0}^{B} \left[
\mathbb{1}_{ib}^{\text{obj}} \left( \hat{\text{conf}}_{i} - \text{conf}_{i} \right)^2 +
\lambda_{\text{noobj}} \, \mathbb{1}_{ib}^{\text{noobj}} \left( \hat{\text{conf}}_{i} - \text{conf}_{i} \right)^2
\right]
$$

The sum of squared error for bounding box coordinates can be problematic. It equally weights errors in large and small boxes, even though the same absolute deviation matters far more for a small box than for a large one.
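The weighted confidence term can be sketched for a single grid cell as follows (a minimal illustration with hypothetical names; per-box responsibility flags stand in for the indicator functions):

```python
LAMBDA_NOOBJ = 0.5  # down-weights boxes that are not responsible for any object

def confidence_loss(pred_conf, true_conf, responsible):
    """Weighted squared-error confidence loss for one grid cell.

    pred_conf / true_conf: per-box confidence values.
    responsible: per-box booleans, True for the box assigned to an object.
    """
    loss = 0.0
    for p, t, r in zip(pred_conf, true_conf, responsible):
        weight = 1.0 if r else LAMBDA_NOOBJ
        loss += weight * (p - t) ** 2
    return loss

# One responsible box (target confidence 1) and one empty box (target 0)
print(confidence_loss([0.9, 0.2], [1.0, 0.0], [True, False]))  # 0.01 + 0.5 * 0.04
```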

To rectify, YOLOv1 uses a sum of squared error loss for the **square root** of the bounding box width and height. This makes the loss function scale invariant.

So the localization loss for each grid cell $$(i)$$ can look like this:

$$
\mathcal{L}^{i}_{\text{coord}} = \sum_{b=0}^{B} \mathbb{1}_{ib}^{\text{obj}} \left[ \left( \hat{x}_{ib} - x_{ib} \right)^2 + \left( \hat{y}_{ib} - y_{ib} \right)^2 + \left( \sqrt{\hat{w}_{ib}} - \sqrt{w_{ib}} \right)^2 + \left( \sqrt{\hat{h}_{ib}} - \sqrt{h_{ib}} \right)^2 \right]
$$

#### Inference
Inference is simple. We pass the image through the network and get the $$S \times S \times (B \times 5 + C)$$ feature map. We then filter out the boxes which have confidence scores less than a threshold.
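The filtering step is a one-liner; here is a sketch (the 0.25 threshold is an illustrative assumption, not a value from the paper):

```python
def filter_boxes(boxes, confidences, threshold=0.25):
    """Keep only the boxes whose confidence exceeds the threshold."""
    return [b for b, c in zip(boxes, confidences) if c > threshold]

boxes = [(0, 0, 10, 10), (5, 5, 20, 20), (1, 1, 4, 4)]
print(filter_boxes(boxes, [0.9, 0.1, 0.6]))  # drops the 0.1-confidence box
```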

##### Non-Maximum Suppression
In rare cases, for large objects, the network tends to predict multiple boxes from multiple grid cells. To eliminate duplicate detections, we use a technique called Non-Maximum Suppression (NMS). NMS works by selecting the box with the highest confidence score and eliminating all other boxes with an IOU greater than a threshold. This is done iteratively until no overlapping boxes remain.
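The greedy NMS procedure described above can be sketched like this (boxes as `(x1, y1, x2, y2)` corners; the 0.5 IOU threshold is a common default, not mandated by the paper):

```python
def nms(boxes, scores, iou_threshold=0.5):
    """Greedy Non-Maximum Suppression.

    boxes: (x1, y1, x2, y2) corner boxes; scores: confidence per box.
    Returns indices of the surviving boxes, highest score first.
    """
    def iou(a, b):
        iw = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
        ih = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
        inter = iw * ih
        union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
        return inter / union if union > 0 else 0.0

    order = sorted(range(len(boxes)), key=lambda i: scores[i], reverse=True)
    keep = []
    while order:
        best = order.pop(0)        # highest-scoring remaining box survives
        keep.append(best)
        # suppress every remaining box that overlaps it too much
        order = [i for i in order if iou(boxes[best], boxes[i]) <= iou_threshold]
    return keep

boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (50, 50, 60, 60)]
print(nms(boxes, [0.9, 0.8, 0.7]))  # [0, 2]: box 1 overlaps box 0 too much
```

In practice one would use a library routine such as `torchvision.ops.nms`, but the logic is exactly this greedy loop.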