A Brief Summary of YOLO's Core Code

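I wanted to build something with YOLO, but I didn't dare touch the original implementation, which is written in pure C. There are a few TensorFlow implementations of YOLO online, but they either don't work or are written so cryptically that I can't follow or modify them. So I recklessly built my own wheel, as a way to understand YOLO in depth.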

YOLO can be roughly divided into a feature extraction network and a detection network.

Feature extraction network

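There isn't much to say about this part. The one thing worth noting is that it adds a reorg layer, which shrinks the feature map's spatial size while increasing the number of channels, so it “loses almost no information”.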
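
To make the reorg idea concrete, here is a minimal NumPy sketch of a space-to-depth rearrangement. The function name `reorg`, the stride of 2, and the 26×26×512 input size are assumptions chosen for illustration, and the channel ordering here is not guaranteed to match Darknet's own reorg layer. TensorFlow offers a similar operation as `tf.nn.space_to_depth`.

```python
import numpy as np

def reorg(feature_map, stride=2):
    """Space-to-depth: fold each stride x stride spatial block into the channel axis."""
    b, h, w, c = feature_map.shape
    assert h % stride == 0 and w % stride == 0
    # Split both spatial axes into (block index, position inside block).
    x = feature_map.reshape(b, h // stride, stride, w // stride, stride, c)
    # Move the two "position inside block" axes next to the channel axis.
    x = x.transpose(0, 1, 3, 2, 4, 5)
    return x.reshape(b, h // stride, w // stride, stride * stride * c)

# Assumed example size: a 26x26x512 feature map.
features = np.arange(1 * 26 * 26 * 512, dtype=np.float32).reshape(1, 26, 26, 512)
out = reorg(features)
print(out.shape)  # (1, 13, 13, 2048)
```

Every input value appears exactly once in the output, which is why the layer loses no information: it only trades spatial resolution for channels.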

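In fact, this part can be replaced with another network: take a general-purpose backbone and append a few convolutional layers to transform the features.

Detection network

The feature map we get is fairly small (for example, 13×13). Each “pixel” is a “cell” (so a 13×13×1024 feature map is 13×13 cells, each with a 1024-dimensional feature vector).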

reshape

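We want each cell to produce several predicted boxes (for example, 5) for us to choose from. Each predicted box has these 6 attributes: x y w h confidence [class] (note the order). Here xywh are the coordinates of the box, and confidence is “how sure we are that it accurately frames an object”. The trailing [class] holds the probabilities that the framed object belongs to each class (say, 7 classes in total). So the data length of one predicted box is 4+1+len(class), and because each cell produces several boxes, the number of filters needed is box_per_grid * (4 + 1 + len(class)). For convenience, we reshape: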

B, H, W, C = net.shape.as_list()
net = tf.reshape(net, [-1, H * W, box_per_grid, (4 + 1 + num_classes)])
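
As a quick sanity check of the arithmetic, the NumPy sketch below plugs in the example numbers from the text (5 boxes per cell, 7 classes, a 13×13 grid). The batch size of 2 and the all-zero tensor are placeholders, not values from the original code.

```python
import numpy as np

box_per_grid = 5   # example value from the text
num_classes = 7    # example value from the text
H, W = 13, 13

# Number of channels the last convolution has to output for every cell.
num_filters = box_per_grid * (4 + 1 + num_classes)

# Placeholder network output with an assumed batch size of 2.
net = np.zeros((2, H, W, num_filters), dtype=np.float32)
net = net.reshape(-1, H * W, box_per_grid, 4 + 1 + num_classes)
print(num_filters, net.shape)  # 60 (2, 169, 5, 12)
```

After the reshape, the last axis holds one box's x y w h confidence [class] values, so slices such as `net[..., 0:2]` address the same attribute of every box at once.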

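Getting the detection results

YOLO uses anchors, which can be thought of as “default sizes”, or simply as “aspect ratios”. When reading out the results, just multiply them in.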

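net_coords_xy = tf.sigmoid(net[..., 0:2])  # center offset inside the cell, in (0, 1)
net_coords_wh = tf.sqrt(tf.exp(net[..., 2:4]) * anchors)  # square root of the anchor-scaled size
net_confidence = tf.sigmoid(net[..., 4])  # objectness
net_class = tf.sigmoid(net[..., 5:])  # class scores start at index 5, after confidence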

However, the xy here is the offset relative to the cell, not the position in the whole image. That's easy to fix with a meshgrid:

mesh_num_x, mesh_num_y = tf.range(W, dtype=tf.float32), tf.range(H, dtype=tf.float32)
mesh = tf.stack(tf.meshgrid(mesh_num_x, mesh_num_y), axis=-1)
mesh = tf.reshape(mesh, [1, -1, 1, 2])
mesh = tf.tile(mesh, [1, 1, box_per_grid, 1])
net_coords_xy += mesh
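
The same meshgrid trick can be checked on a tiny grid with NumPy. This is only a sketch: the 2×3 grid, the single box per cell, and the constant offset of 0.5 are made-up values, and broadcasting stands in for the explicit `tf.tile`.

```python
import numpy as np

H, W, box_per_grid = 2, 3, 1
xs, ys = np.meshgrid(np.arange(W, dtype=np.float32), np.arange(H, dtype=np.float32))
# Flatten so that cell index = y * W + x, matching the [B, H * W, ...] layout.
mesh = np.stack([xs, ys], axis=-1).reshape(1, H * W, 1, 2)

# Pretend every box predicts an offset of (0.5, 0.5) inside its cell.
offsets = np.full((1, H * W, box_per_grid, 2), 0.5, dtype=np.float32)
centers = offsets + mesh  # broadcasting over the box axis replaces the tile

print(centers[0, :, 0, :].tolist())
# [[0.5, 0.5], [1.5, 0.5], [2.5, 0.5], [0.5, 1.5], [1.5, 1.5], [2.5, 1.5]]
```

The resulting centers are in grid units; dividing by [W, H] would turn them into fractions of the whole image.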

Then filter by the confidence and class predictions, and that's it.
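
One common way to do this filtering is to score each box by confidence times class probability and keep the boxes above a threshold. The sketch below assumes that scoring rule; the helper name `filter_boxes`, the threshold of 0.5, and the sample numbers are all hypothetical and not taken from the original code.

```python
import numpy as np

def filter_boxes(coords, confidence, class_probs, threshold=0.5):
    """Keep boxes whose best class score (confidence * class probability) passes the threshold."""
    scores = confidence[..., np.newaxis] * class_probs  # [N, num_classes]
    best_class = scores.argmax(axis=-1)
    best_score = scores.max(axis=-1)
    keep = best_score >= threshold
    return coords[keep], best_class[keep], best_score[keep]

# Three made-up boxes (x, y, w, h) with two classes.
coords = np.array([[1.5, 2.5, 0.3, 0.4], [4.5, 6.5, 0.2, 0.2], [7.5, 7.5, 0.1, 0.1]])
confidence = np.array([0.9, 0.2, 0.8])
class_probs = np.array([[0.1, 0.9], [0.8, 0.2], [0.7, 0.3]])

boxes, classes, scores = filter_boxes(coords, confidence, class_probs)
print(classes.tolist())  # [1, 0]
```

The second box is dropped because its confidence is low, even though its class prediction is fairly certain.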

Loss

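Computing the loss is a bit more involved, but not by much.

For each image we know the bboxes and their labels. The bbox uses YOLO's own format, “the center of a box as a position relative to the whole image, and its side lengths as fractions of the whole image”, whereas the network output needs values relative to the cell. So we simply do the conversion at the feature-map level: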

center_xy = target_coord_float[..., 0:2] * [W, H]
center_xy_int = tf.floor(center_xy)
center_xy_offset = center_xy - center_xy_int
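
A worked example with concrete numbers may help. The NumPy sketch below uses a 13×13 grid and a made-up box center at 40% of the image width and 75% of the image height.

```python
import numpy as np

H, W = 13, 13
# Made-up box center: 40% of the image width, 75% of the image height.
target_xy = np.array([0.40, 0.75])

center_xy = target_xy * [W, H]                 # position in grid units, about [5.2, 9.75]
center_xy_int = np.floor(center_xy)            # owning cell: column 5, row 9
center_xy_offset = center_xy - center_xy_int   # offset inside that cell, about [0.2, 0.75]

# Flat cell index in the [H * W] layout: pos = y * W + x.
flat_index = int(center_xy_int[1] * W + center_xy_int[0])
print(flat_index)  # 122
```

The offset is what the sigmoid output for xy is trained against, and the flat index says which cell is responsible for the box.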

The annotations we get are “flat” lists, so we need to convert them back into labels with the same layout as the feature map.

grid_x = center_xy_int[..., 1] * W + center_xy_int[..., 0]  # pos = y * W + x
grid_y = tf.ones_like(grid_x) * tf.expand_dims(tf.range(B, dtype=tf.float32), axis=-1)
grid_pos = tf.reshape(tf.stack([grid_y, grid_x], axis=-1), [-1, 2])  # Like a sparse matrix index
grid_pos = tf.cast(grid_pos, tf.int64)
target_fix_yolo2 = tf.concat([center_xy_offset, target_coord_float[..., 2:4] * np.sqrt(2)], axis=-1)
mask_coord_yolo2 = tf.scatter_nd(grid_pos, tf.reshape(target_fix_yolo2, [-1, 4]), [B, H * W, 4])

Here we can also build, in passing, a mask for “does this cell contain an object”.
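
The original post does not show the mask code, so the following is only a NumPy sketch of one way to build it from (batch index, flat cell index) pairs like those in grid_pos. The tiny sizes and the index values are made up.

```python
import numpy as np

B, H, W = 2, 2, 3
# Hypothetical (batch index, flat cell index) pairs, one per ground-truth box.
grid_pos = np.array([[0, 4], [1, 0], [1, 5]])

# A cell is marked True when at least one ground-truth center falls inside it.
object_mask = np.zeros((B, H * W), dtype=bool)
object_mask[grid_pos[:, 0], grid_pos[:, 1]] = True

print(object_mask.astype(int).tolist())
# [[0, 0, 0, 0, 1, 0], [1, 0, 0, 0, 0, 1]]
```

In TensorFlow the same mask could presumably be produced by scattering ones with `tf.scatter_nd`. Note that `tf.scatter_nd` sums values at duplicate indices, so two boxes landing in the same cell would need extra handling.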

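Good, the preparation is basically done, and we can start computing the loss.

Find the box with the largest IoU and use only that one in the computation: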

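iou = tf_box_iou(predict_coords_voc, gt_coords_voc)
best_box = tf.reduce_max(iou, axis=-1, keep_dims=True)  # highest IoU among the boxes of each cell
# Keep the best box only in cells that actually contain an object.
best_box = tf.logical_and(tf.equal(iou, best_box), object_mask)
best_box = tf.to_float(best_box)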
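
The helper tf_box_iou is not shown in the post. As a rough illustration of what such a helper might compute, here is a NumPy sketch that assumes the “voc” suffix means corner format (xmin, ymin, xmax, ymax). The function name `box_iou` and the sample boxes are made up.

```python
import numpy as np

def box_iou(boxes_a, boxes_b):
    """Element-wise IoU for boxes given as (xmin, ymin, xmax, ymax)."""
    inter_min = np.maximum(boxes_a[..., 0:2], boxes_b[..., 0:2])
    inter_max = np.minimum(boxes_a[..., 2:4], boxes_b[..., 2:4])
    inter_wh = np.clip(inter_max - inter_min, 0.0, None)  # no overlap gives zero area
    inter_area = inter_wh[..., 0] * inter_wh[..., 1]
    area_a = (boxes_a[..., 2] - boxes_a[..., 0]) * (boxes_a[..., 3] - boxes_a[..., 1])
    area_b = (boxes_b[..., 2] - boxes_b[..., 0]) * (boxes_b[..., 3] - boxes_b[..., 1])
    union = area_a + area_b - inter_area
    return inter_area / np.maximum(union, 1e-9)

# One cell with three predicted boxes and one ground-truth box.
predicted = np.array([[[0., 0., 2., 2.], [1., 1., 3., 3.], [0., 0., 4., 4.]]])  # [1, 3, 4]
ground_truth = np.array([[[0., 0., 2., 3.]]])                                   # [1, 1, 4]
object_mask = np.array([[True]])                                                # [1, 1]

iou = box_iou(predicted, ground_truth)  # [1, 3]
best_box = (iou == iou.max(axis=-1, keepdims=True)) & object_mask
print(best_box.tolist())  # [[True, False, False]]
```

Comparing against the per-cell maximum selects every box that ties for the best IoU, so exact ties would mark more than one box in a cell.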

Next comes the subtraction and addition to compute the total loss:

weight_coord = args.loss.coord_scale * best_box_aligned
weight_object = args.loss.object_scale * best_box + args.loss.noobject_scale * (1. - best_box)

_mask_coord_yolo2_aligned = tf.tile(tf.expand_dims(_mask_coord_yolo2, -2), [1, 1, box_per_grid, 1])
coords_yolo2 = tf.concat([net_coords_xy, net_coords_wh * np.sqrt(2)], axis=-1)
loss_coord = tf.square(coords_yolo2 - _mask_coord_yolo2_aligned) * weight_coord
loss_coord = reduce_sum_mean(loss_coord, [1, 2, 3])

loss_conf = tf.square(net_confidence - best_box) * weight_object
loss_conf = reduce_sum_mean(loss_conf, [1, 2])
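
The helper reduce_sum_mean is not defined in the post. Judging only from its name and the axes passed to it, a plausible reading is “sum over the given axes within each sample, then average over the batch”. The NumPy sketch below implements that guess and should be treated as an assumption rather than the original helper.

```python
import numpy as np

def reduce_sum_mean(values, axes):
    """Sum over `axes` within each sample, then average over the batch (assumed behaviour)."""
    per_sample = values.sum(axis=tuple(axes))
    return per_sample.mean()

# Made-up squared errors: 2 samples, 2 cells, 1 box, 4 coordinates.
loss_coord = np.ones((2, 2, 1, 4))
loss_coord[1] *= 3.0  # the second sample has three times the error

print(reduce_sum_mean(loss_coord, [1, 2, 3]))  # 16.0
```

Under this reading the per-sample sums are 8 and 24, and the batch average of 16 is the scalar that would be added into the total loss.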

That's roughly it.
