Paper link: Deep Residual Learning for Image Recognition

The Origin of Residual Networks

As research on neural networks has deepened, it has become clear that increasing depth can bring a large boost in model capability.

However, increasing network depth also brings the problem of vanishing/exploding gradients.
Fortunately, this problem has largely been addressed by normalized initialization and normalization layers, which allow such networks to converge.

However, deep networks that no longer suffer from vanishing gradients expose a degradation problem:
once the network reaches a certain depth, making it even deeper actually increases the training error.
Notably, experiments show that this problem is not caused by overfitting.

Yet the degradation problem should not, in principle, happen.
Consider an existing shallower model and a deeper model built by stacking additional layers on top of the same architecture.
If the newly added layers in the deeper model all represent identity mappings, the two models should perform at least equally well.
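
The following is a tiny illustration of this constructed solution (a sketch in PyTorch; the specific layers are arbitrary): appending layers that compute the identity leaves the output, and hence the training error, unchanged.

import torch
import torch.nn as nn

# A shallow model and a deeper counterpart whose extra layers are identity mappings.
shallow = nn.Sequential(nn.Linear(8, 8), nn.ReLU(), nn.Linear(8, 8))
deeper = nn.Sequential(*shallow, nn.Identity(), nn.Identity())

x = torch.randn(4, 8)
assert torch.equal(shallow(x), deeper(x))  # identical outputs, so the deeper model can be no worse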

Kaiming He attributes this degradation to the target underlying mapping being hard for existing optimizers to approximate,
so he proposed the residual structure to address the problem.

The Residual Structure

The residual structure is shown in the figure below:

[figure: resblock]

Suppose the target underlying mapping of a stack of layers in the network is $H(x)$.
In the residual structure, the target of this stack becomes $F(x)=H(x)-x$ instead,
while the input x is connected directly to the output through an identity shortcut, so the output of the residual structure is $F(x)+x$.
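
As a minimal sketch of this idea (in PyTorch, with the residual branch $F$ taken to be two 3x3 convolutions, similar to the basic block described later; the class name is illustrative only):

import torch
import torch.nn as nn

class MinimalResidual(nn.Module):
    """Computes y = F(x) + x, where F is a small stack of layers (here two 3x3 convolutions)."""
    def __init__(self, channels):
        super().__init__()
        self.F = nn.Sequential(
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
            nn.ReLU(),
            nn.Conv2d(channels, channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.F(x) + x  # the identity shortcut is added back onto the residual branch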

Based on the earlier argument, when an identity mapping really is the optimal choice for some layers of a deep network,
the layers inside the residual structure only need to drive their weights to zero.

Even when the target mapping is not actually an identity, the residual structure is found to help the network converge more effectively.

Another important point is that the residual structure introduces no extra parameters.

Residual Networks

He provides several ResNet models of different depths in the paper:

[figure: resnet]

along with an illustration of the 34-layer ResNet:

[figure: resnet-34]

Note that for the case where the input and output of an identity shortcut have different dimensions, the author gives two options:

  • Pad the extra dimensions with zeros
  • Replace the identity shortcut with a separate layer (e.g. a 1x1 convolution)

The author's experiments show that the two options perform similarly, but the former introduces no parameters and is computationally cheaper.
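
In the paper's notation, the second option turns the shortcut into a linear projection, $y=F(x)+W_s x$, where $W_s$ (e.g. a 1x1 convolution with stride 2) is used only to match dimensions; the first option keeps the shortcut parameter-free by subsampling the identity and padding the extra channels with zeros.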

Also, a residual block should not contain just a single layer;
otherwise it degenerates into $\boldsymbol{y}=\boldsymbol{W}\boldsymbol{x}+\boldsymbol{x}$, and the residual structure becomes pointless.

Keras Implementation

To keep the implementation simple, the shortcuts where the dimensions change use a 1x1 convolution.
The zero-padding described in the paper is not that easy to implement in Keras; in fact, the example in the official Keras documentation also uses a convolution.
For a zero-padding shortcut, refer to the GitHub discussion Zero-padding for ResNet shortcut connections when channel number increase.
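
For reference only, here is a minimal sketch of such a parameter-free shortcut. It is not used in the code below; it assumes a TensorFlow backend and the usual ResNet case where the channel count doubles at each downsampling stage, and zero_pad_shortcut is a hypothetical helper:

from keras import backend as K
from keras.layers import Lambda, MaxPool2D

def zero_pad_shortcut(x):
    # Subsample spatially with stride 2, then double the channel count by concatenating zeros.
    x = MaxPool2D(pool_size=1, strides=2)(x)
    return Lambda(lambda t: K.concatenate([t, K.zeros_like(t)], axis=-1))(x)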

from keras.models import Model
from keras.layers import Dense, Conv2D, Input, Activation, BatchNormalization, \
    MaxPool2D, GlobalAveragePooling2D, Add
from keras.optimizers import Adam

class ResNet:
    def __init__(self):
        self.input_shape = (224, 224, 3)
        self.resnet = self.buildNet()
        self.resnet.compile(optimizer=Adam(1e-4), loss='categorical_crossentropy')

        self.resnet.summary()

    def buildNet(self):
        # ResNet-34: conv1 -> max pooling -> 4 stages of basic blocks -> global average pooling -> softmax
        inputs = Input(self.input_shape)

        x = Conv2D(64, kernel_size=7, strides=2, padding='same')(inputs)
        x = MaxPool2D(pool_size=(3, 3), strides=2)(x)

        # [3, 4, 6, 3] blocks with 64, 128, 256, 512 filters;
        # the first block of every stage except the first downsamples with stride 2
        num_block = [3, 4, 6, 3]
        for k in range(4):
            for idx in range(num_block[k]):
                downsample = (idx == 0 and k != 0)
                x = self.resblock(inputs=x, filters=64*2**k, downsample=downsample)

        x = GlobalAveragePooling2D()(x)
        outputs = Dense(1000, activation='softmax')(x)

        return Model(inputs, outputs)

    def resblock(self, inputs, filters, downsample=False):
        strides = 2 if downsample else 1

        # residual branch F(x): two 3x3 convolutions
        x = Conv2D(filters, kernel_size=3, strides=strides, padding='same')(inputs)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)

        x = Conv2D(filters, kernel_size=3, strides=1, padding='same')(x)
        x = BatchNormalization()(x)
        x = Activation('relu')(x)

        # projection shortcut (1x1 convolution, stride 2) when the dimensions change
        if downsample:
            inputs = Conv2D(filters, kernel_size=1, strides=2, padding='same', activation='relu')(inputs)

        # F(x) + x, followed by ReLU
        x = Add()([inputs, x])

        outputs = Activation('relu')(x)
        return outputs

if __name__ == '__main__':
    resnet = ResNet()

PyTorch Implementation

import torch
import torch.nn as nn

class ResBlock(nn.Module):
    """A basic residual block: two 3x3 convolutions plus a shortcut connection."""
    def __init__(self, input_channel, filters, downsample=False):
        super().__init__()

        self.downsample = downsample
        stride = 2 if downsample else 1

        # residual branch F(x)
        self.layer1 = nn.Sequential(
            nn.Conv2d(input_channel, filters, kernel_size=3, stride=stride, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(filters)
        )

        self.layer2 = nn.Sequential(
            nn.Conv2d(filters, filters, kernel_size=3, stride=1, padding=1),
            nn.ReLU(),
            nn.BatchNorm2d(filters)
        )

        # projection shortcut (1x1 convolution, stride 2) when the dimensions change
        if downsample:
            self.shortcut = nn.Sequential(
                nn.Conv2d(input_channel, filters, kernel_size=1, stride=2, padding=0),
                nn.ReLU(),
            )

    def forward(self, inputs):
        x = self.layer1(inputs)
        x = self.layer2(x)
        if self.downsample:
            inputs = self.shortcut(inputs)

        # F(x) + x
        outputs = torch.add(inputs, x)
        return outputs

class ResNet(nn.Module):
    """ResNet-34: conv1 -> max pooling -> 4 stages of basic blocks -> global average pooling -> linear classifier."""
    def __init__(self):
        super().__init__()

        self.input_channel = 3
        self.buildNet()

    def buildNet(self):
        def buildBlock(num, input_channel, filters, downsample=True):
            # Stack `num` residual blocks; only the first one (if requested) downsamples.
            layer = []
            for idx in range(num):
                downsample = (idx == 0 and downsample)
                input_channel = filters if idx != 0 else input_channel
                layer.append(ResBlock(input_channel=input_channel, filters=filters, downsample=downsample))

            return nn.Sequential(*layer)

        self.conv1 = nn.Conv2d(self.input_channel, 64, kernel_size=7, stride=2, padding=3)
        self.maxpool = nn.MaxPool2d(3, stride=2)

        # [3, 4, 6, 3] blocks with 64, 128, 256, 512 filters
        num_block = [3, 4, 6, 3]

        self.conv2 = buildBlock(num_block[0], input_channel=64, filters=64, downsample=False)
        self.conv3 = buildBlock(num_block[1], input_channel=64, filters=128)
        self.conv4 = buildBlock(num_block[2], input_channel=128, filters=256)
        self.conv5 = buildBlock(num_block[3], input_channel=256, filters=512)

        self.global_avgpool = nn.AdaptiveAvgPool2d((1, 1))  # global average pooling
        self.classifier = nn.Linear(512, 10)                # 10-way classifier here; use 1000 for ImageNet

    def forward(self, inputs):
        x = self.conv1(inputs)
        x = self.maxpool(x)

        x = self.conv2(x)
        x = self.conv3(x)
        x = self.conv4(x)
        x = self.conv5(x)

        x = self.global_avgpool(x)
        x = x.view(x.size(0), -1)

        outputs = self.classifier(x)  # no softmax here: CrossEntropyLoss applies it internally
        return outputs
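
For a quick sanity check, one might instantiate the model and push a dummy batch through it (assuming 224x224 RGB input, as in the Keras version above):

if __name__ == '__main__':
    resnet = ResNet()
    dummy = torch.randn(2, 3, 224, 224)  # a batch of two RGB images
    logits = resnet(dummy)
    print(logits.shape)                  # torch.Size([2, 10])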