I want to understand YOLO9000 (v2)

I read the YOLO9000 (v2) paper.
I understood most of it, but a few points remain unclear, so I would like to ask about them.

Does the universality theorem hold for layers other than fully connected layers (i.e., convolutional layers)?

In the original YOLO (v1), the output (S * S * (B * 5 + C)) was obtained by passing the feature map through two fully connected layers.

I assumed this was possible because of the universality theorem.

In YOLO9000 (v2), however, the fully connected layers are eliminated, and the output (S * S * B * (5 + C)) is obtained using only convolutional and pooling layers.

What puzzled me here is whether convolutional and pooling layers alone can produce the desired output from the extracted feature map.
I can see that convolutional and pooling layers can transform the feature map into the desired output format (number of dimensions and shape), but this time the output is in the S * S * B * (5 + C) format, where each element has its own meaning:
(grid height) * (grid width) * (number of bounding boxes) * ((x, y, w, h, objectness) + probabilities for each class)

[Figure: the output of YOLO (v1)]

In any case, producing meaningful predictions should require realizing some function, not merely reshaping the tensor.
I could not tell whether such a function can be realized with only convolutional and pooling layers, and I could not find an answer even after searching.
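
For reference, here is a minimal sketch in PyTorch (my own illustration, not the actual Darknet code; the channel count and the S, B, C values are assumptions matching the VOC setup) of how a single 1 x 1 convolution can map a backbone feature map onto the S * S * B * (5 + C) output. Whether those numbers take on their intended meaning is then entirely up to the training loss:

import torch
import torch.nn as nn

S, B, C = 13, 5, 20           # grid size, boxes per cell, classes (VOC)
feature_channels = 1024       # assumed channel count of the backbone feature map

# The detection "head" is a single 1x1 convolution: each spatial position
# (grid cell) independently regresses B * (5 + C) values.
head = nn.Conv2d(feature_channels, B * (5 + C), kernel_size=1)

features = torch.randn(1, feature_channels, S, S)   # dummy backbone output
out = head(features)                                # (1, B*(5+C), S, S)
out = out.permute(0, 2, 3, 1).reshape(1, S, S, B, 5 + C)
print(out.shape)   # torch.Size([1, 13, 13, 5, 25])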

Multi-scale training (YOLOv2)

In YOLO9000 (v2), training is performed at various input resolutions.
To train at multiples of 32, {320, 352, ..., 608}, the paper says the network is resized, but I do not understand how this resizing is done.

However, since our model only uses convolutional and pooling layers it can be resized on the fly. We want YOLOv2 to be robust to running on images of different sizes so we train this into the model. Instead of fixing the input image size we change the network every few iterations. Every 10 batches our network randomly chooses a new image dimension size. Since our model downsamples by a factor of 32, we pull from the following multiples of 32: {320, 352, ..., 608}. Thus the smallest option is 320 x 320 and the largest is 608 x 608. We resize the network to that dimension and continue training.

Since there is no fully connected layer, I can see that the input size can be changed freely, but then the output size should change as well.
I found a similar question elsewhere, but its content was slightly different, so I still do not understand.
Is the output size adjusted with padding or stride?

Postscript (2020/11/23)
I have started to wonder whether the output size simply does not need to be unified.
→ If so, how should the bold part of the paper ("We resize the network ...") be interpreted?
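
To check this for myself, the following is a minimal sketch assuming a toy fully convolutional network of convolutional and pooling layers that downsamples by a factor of 32 (the real backbone is Darknet-19; the layer widths here are placeholders). The same weights accept any input whose side is a multiple of 32, and only the output grid size changes:

import torch
import torch.nn as nn

B, C = 5, 20
layers, ch = [], 3
for out_ch in (32, 64, 128, 256, 512):   # five pooling stages -> downsample by 32
    layers += [nn.Conv2d(ch, out_ch, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2, 2)]
    ch = out_ch
layers.append(nn.Conv2d(ch, B * (5 + C), kernel_size=1))   # detection head
net = nn.Sequential(*layers)

for size in (320, 416, 608):
    out = net(torch.randn(1, 3, size, size))
    print(size, "->", tuple(out.shape))
# 320 -> (1, 125, 10, 10)
# 416 -> (1, 125, 13, 13)
# 608 -> (1, 125, 19, 19)

If this reading is right, "resizing the network" only means feeding it a different input size; no layer is added, removed, or reshaped.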

Finally

I did my own research but could not arrive at an answer.
My research may be insufficient, or these may be common-sense questions, but I would appreciate it if anyone could enlighten me.

  • Answer # 1

    Does the universality theorem hold for layers other than fully connected layers (i.e., convolutional layers)?

    Shouldn't this be thought of as a regression problem that infers the coordinates of rectangles?
    In practice, not only YOLO but object detection models in general can be trained well under that formulation.
    In general, mathematical guarantees of when deep learning works (or does not work) are not well established.

    Since there is no fully connected layer, I can see that the input size can be changed freely, but then the output size should change as well.

    The output size will change. Since each grid cell of the output feature map carries rectangle information, the larger the output, the more rectangles are inferred.
    The loss is computed based on whether the predictions match the ground-truth rectangles, so training proceeds without problems even if the number of predicted rectangles changes with the input size.
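
    To illustrate (my own simplification, not Darknet's actual loss code; responsible_cell is a hypothetical helper): the ground-truth box center is given in normalized [0, 1] coordinates, so it can be mapped onto whatever grid the network currently outputs, and the loss is then computed cell by cell:

    # Hypothetical helper, not from the YOLO code base: map a normalized
    # box center onto the grid cell responsible for predicting it.
    def responsible_cell(box_cx, box_cy, S):
        return int(box_cy * S), int(box_cx * S)

    for S in (10, 13, 19):   # output grids for 320, 416, and 608 inputs
        print(S, responsible_cell(0.5, 0.25, S))
    # 10 (2, 5), 13 (3, 6), 19 (4, 9): the same target lands in a valid
    # cell at every scale, so the loss adapts to the grid automatically.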