Cosine Annealing Warm Restart: A Paper Walkthrough

SGDR: STOCHASTIC GRADIENT DESCENT WITH WARM RESTARTS

0. Abstract

Restart techniques are common in gradient-free optimization to deal with multi-modal functions.

In this paper, we propose a simple warm restart technique for stochastic gradient descent to improve its anytime performance when training deep neural networks.

1. Introduction

In this paper, we propose to periodically simulate warm restarts of SGD, where in each restart the learning rate is initialized to some value and is scheduled to decrease.

1. Faster model convergence
2. Higher model accuracy

3. SGDR (Stochastic Gradient Descent with Warm Restarts)

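Within the $i$-th run, the learning rate is annealed with a cosine schedule:

$$\eta_t = \eta^i_{min} + \frac{1}{2}(\eta^i_{max} - \eta^i_{min}) \left(1 + \cos\left(\frac{T_{cur}}{T_i}\pi\right)\right)$$

Here $\eta^i_{min}$ and $\eta^i_{max}$ bound the learning rate in the $i$-th run, $T_{cur}$ is the number of epochs since the last restart, and $T_i$ is the length of the $i$-th run in epochs.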
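To make the formula concrete, here is a minimal sketch that evaluates it for a single run. The function name `sgdr_lr` and the default values `eta_min=0.0` and `eta_max=0.1` are assumptions chosen for illustration, not values fixed by the paper.

```python
import math


def sgdr_lr(t_cur, t_i, eta_min=0.0, eta_max=0.1):
    """Cosine-annealed learning rate inside one run of length t_i.

    t_cur is the number of epochs since the last restart (0 <= t_cur <= t_i).
    """
    return eta_min + 0.5 * (eta_max - eta_min) * (1 + math.cos(math.pi * t_cur / t_i))


# Start, midpoint, and end of a 10-epoch run
for t_cur in (0, 5, 10):
    print(t_cur, round(sgdr_lr(t_cur, t_i=10), 6))
```

At `t_cur = 0` the cosine term equals 1, so the rate starts at `eta_max`; at `t_cur = t_i` it equals -1, so the rate reaches `eta_min`.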

$$\begin{cases} T_0 = 1, \ T_{mult} = 2 \\ T_0 = 10, \ T_{mult} = 2 \end{cases}$$
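The two $(T_0, T_{mult})$ settings above can be compared by generating the learning rate epoch by epoch. This is only a sketch under simplifying assumptions: $T_{cur}$ advances in whole epochs (the paper also allows updating it at every batch), and the name `sgdr_schedule` and the range `eta_min=0.0`, `eta_max=0.1` are hypothetical choices for illustration.

```python
import math


def sgdr_schedule(num_epochs, t_0, t_mult, eta_min=0.0, eta_max=0.1):
    """Return one learning rate per epoch, restarting whenever a run ends."""
    lrs = []
    t_i, t_cur = t_0, 0  # current run length and epochs since the last restart
    for _ in range(num_epochs):
        cosine = math.cos(math.pi * t_cur / t_i)
        lrs.append(eta_min + 0.5 * (eta_max - eta_min) * (1 + cosine))
        t_cur += 1
        if t_cur >= t_i:  # run finished: restart and lengthen the next run
            t_cur = 0
            t_i *= t_mult
    return lrs


for t_0 in (1, 10):
    lrs = sgdr_schedule(12, t_0=t_0, t_mult=2)
    print(f"T_0={t_0}:", [round(lr, 4) for lr in lrs])
```

With $T_0 = 1$ the restarts come early and often, while $T_0 = 10$ spends the first ten epochs on a single smooth decay.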

4. Experiments

Figure 2: Test errors on CIFAR-10 (left column) and CIFAR-100 (right column) datasets. Note that for SGDR we only plot the recommended solutions. The top and middle rows show the same results on WRN-28-10, with the middle row zooming into the good performance region of low test error. The bottom row shows performance with a wider network, WRN-28-20.

5. Summary

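SGDR's warm-restart learning rate schedule is indeed effective, especially for residual architectures. Compared with a hand-designed step-decay learning rate schedule, SGDR reaches the same accuracy earlier (roughly 2 to 4 times faster).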

6. PyTorch code

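# Imports
from torch import optim
from torch.optim import lr_scheduler

# Build the model (generate_model and opt come from the surrounding project and are not defined here)
model, parameters = generate_model(opt)

# Define the optimizer; PyTorch requires dampening = 0 when Nesterov momentum is enabled
if opt.nesterov:
    dampening = 0
else:
    dampening = 0.9
optimizer = optim.SGD(parameters, lr=0.1, momentum=0.9, dampening=dampening, weight_decay=1e-3, nesterov=opt.nesterov)

# Define the warm-restart learning rate schedule
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0, last_epoch=-1)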
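The snippet above only constructs the scheduler. How it might be stepped during training can be sketched as follows. The tiny `nn.Linear` model, the random input batch, and the one-step-per-epoch loop are hypothetical stand-ins for the real network and data loader; only the `optim.SGD` and `CosineAnnealingWarmRestarts` calls mirror the original code.

```python
import torch
from torch import nn, optim
from torch.optim import lr_scheduler

model = nn.Linear(4, 2)  # hypothetical stand-in for the real network
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9)
scheduler = lr_scheduler.CosineAnnealingWarmRestarts(optimizer, T_0=10, T_mult=2, eta_min=0)

lrs = []
for epoch in range(31):
    # Learning rate that will be used during this epoch
    lrs.append(optimizer.param_groups[0]["lr"])

    # One dummy training step per epoch (a real loop would iterate over batches)
    loss = model(torch.randn(8, 4)).pow(2).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Advance the schedule once per epoch, after the optimizer step
    scheduler.step()

print([round(lr, 4) for lr in lrs[:12]])
```

In this sketch the learning rate decays over epochs 0 to 9 and jumps back to 0.1 at epoch 10, where the first restart occurs.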


7. Computing the restart epochs

The epoch at which each restart occurs, in general and for T_0 = 10, T_mult = 2 (where b reduces to a × 3):

Restart   Epoch                 With T_0 = 10, T_mult = 2
a         T_0                   10
b         a × T_mult + a        10 × 2 + 10 = 30
c         b × T_mult + a        30 × 2 + 10 = 70
d         c × T_mult + a        70 × 2 + 10 = 150
e         d × T_mult + a        150 × 2 + 10 = 310
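The same restart epochs can be checked by summing the run lengths directly. This is a small sketch in which the helper name `restart_epochs` is an assumption for illustration: each run is `t_mult` times longer than the previous one, and a restart happens at the end of every run.

```python
def restart_epochs(t_0, t_mult, num_restarts):
    """Return the epochs at which the first num_restarts restarts occur."""
    epochs = []
    run_length, total = t_0, 0
    for _ in range(num_restarts):
        total += run_length       # the current run ends here, so a restart follows
        epochs.append(total)
        run_length *= t_mult      # the next run is t_mult times longer
    return epochs


print(restart_epochs(10, 2, 5))  # [10, 30, 70, 150, 310]
```

For T_mult > 1 this agrees with the closed form T_0 × (T_mult^k − 1) / (T_mult − 1) for the k-th restart, for example 10 × (2^3 − 1) = 70 for the third.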