Improving Generalization Performance by Switching from Adam to SGD
Nitish Shirish Keskar¹, Richard Socher¹
Abstract
Despite superior training outcomes, adaptive op-
timization methods such as Adam, Adagrad or
RMSprop have been found to generalize poorly
compared to Stochastic gradient descent (SGD).
These methods tend to perform well in the initial
portion of training but are outperformed by SGD
at later stages of training. We investigate a hy-
brid strategy that begins training with an adaptive
method and switches to SGD when appropriate.
Concretely, we propose SWATS, a simple strat-
egy which Switches from Adam to SGD when
a triggering condition is satisfied. The condi-
tion we propose relates to the projection of Adam
steps on the gradient subspace. By design, the
monitoring process for this condition adds very
little overhead and does not increase the num-
ber of hyperparameters in the optimizer. We re-
port experiments on several standard benchmarks
such as: ResNet, SENet, DenseNet and Pyramid-
Net for the CIFAR-10 and CIFAR-100 data sets,
ResNet on the tiny-ImageNet data set and lan-
guage modeling with recurrent networks on the
PTB and WT2 data sets. The results show that
our strategy is capable of closing the generaliza-
tion gap between SGD and Adam on a majority
of the tasks.
1. Introduction
Stochastic gradient descent (SGD) (Robbins & Monro,
1951) has emerged as one of the most used training al-
gorithms for deep neural networks. Despite its simplic-
ity, SGD performs well empirically across a variety of
applications but also has strong theoretical foundations.
This includes, but is not limited to, guarantees of saddle
point avoidance (Lee et al., 2016), improved generalization
(Hardt et al., 2015; Wilson et al., 2017) and interpretations
as Bayesian inference (Mandt et al., 2017).
¹Salesforce Research, Palo Alto, CA 94301. Correspondence to: Nitish Shirish Keskar <[anonimizat]>.

Training neural networks is equivalent to solving the following non-convex optimization problem,
$$\min_{w \in \mathbb{R}^n} f(w),$$
where $f$ is a loss function. The iterations of SGD can be described as:
$$w_k = w_{k-1} - \alpha_{k-1}\, \hat{\nabla} f(w_{k-1}),$$
where $w_k$ denotes the $k$-th iterate, $\alpha_k$ is a (tuned) step size sequence, also called the learning rate, and $\hat{\nabla} f(w_k)$ denotes the stochastic gradient computed at $w_k$. A variant of SGD (SGDM), which uses the inertia of the iterates to accelerate the training process, has also been found to be successful in practice (Sutskever et al., 2013). The iterations of SGDM can be described as:
$$v_k = \mu v_{k-1} + \hat{\nabla} f(w_{k-1}),$$
$$w_k = w_{k-1} - \alpha_{k-1} v_k,$$
where $\mu \in [0, 1)$ is a momentum parameter and $v_0$ is initialized to 0.
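As a point of reference, these two updates can be written as the following minimal numpy sketch (the function and variable names are ours, purely for illustration):

    import numpy as np

    def sgd_step(w, grad, lr):
        # w_k = w_{k-1} - alpha_{k-1} * grad
        return w - lr * grad

    def sgdm_step(w, v, grad, lr, mu=0.9):
        # v_k = mu * v_{k-1} + grad;  w_k = w_{k-1} - alpha_{k-1} * v_k
        v = mu * v + grad
        return w - lr * v, v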
One disadvantage of SGD is that it scales the gradient uni-
formly in all directions; this can be particularly detrimental
for ill-scaled problems. This also makes the process of tun-
ing the learning rate circumstantially laborious.
To correct for these shortcomings, several adaptive meth-
ods have been proposed which diagonally scale the gradi-
ent via estimates of the function’s curvature. Examples of
such methods include Adam (Kingma & Ba, 2015), Ada-
grad (Duchi et al., 2011) and RMSprop (Tieleman & Hin-
ton, 2012). These methods can be interpreted as methods
that use a vector of learning rates, one for each parameter,
that are adapted as the training algorithm progresses. This
is in contrast to SGD and SGDM which use a scalar learn-
ing rate uniformly for all parameters.
Adagrad takes steps of the form
wk=wk1 k1^rf(wk1)pvk1+;where (1)
vk1=k1X
j=1^rf(wj)2:arXiv:1712.07628v1 [cs.LG] 20 Dec 2017

RMSProp uses the same update rule as (1), but instead of accumulating $v_k$ in a monotonically increasing fashion, uses an RMS-based approximation, i.e.,
$$v_{k-1} = \beta v_{k-2} + (1 - \beta)\, \hat{\nabla} f(w_{k-1})^2.$$
In both Adagrad and RMSProp, the accumulator $v$ is initialized to 0. Owing to the fact that $v_k$ is monotonically increasing in each dimension for Adagrad, the scaling factor for $\hat{\nabla} f(w_{k-1})$ monotonically decreases, leading to slow progress. RMSProp corrects for this behavior by employing an average scale instead of a cumulative scale. However, because $v$ is initialized to 0, the initial updates tend to be noisy given that the scaling estimate is biased by its initialization. This behavior is rectified in Adam by employing a bias correction. Further, Adam uses an exponential moving average for the step in lieu of the gradient. Mathematically, the Adam update equation can be represented as:
$$w_k = w_{k-1} - \alpha_{k-1}\, \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k}\, \frac{m_{k-1}}{\sqrt{v_{k-1}} + \epsilon}, \quad \text{where} \tag{2}$$
$$m_{k-1} = \beta_1 m_{k-2} + (1 - \beta_1)\, \hat{\nabla} f(w_{k-1}),$$
$$v_{k-1} = \beta_2 v_{k-2} + (1 - \beta_2)\, \hat{\nabla} f(w_{k-1})^2. \tag{3}$$
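A minimal sketch of one Adam update in the form of equations (2)-(3) is given below; the defaults mirror commonly recommended values, and the accumulators m and v must be carried between calls (the names are ours):

    import numpy as np

    def adam_step(w, m, v, grad, k, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-9):
        # k is the 1-based iteration counter.
        # Exponential moving averages of the gradient and its element-wise square.
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Bias-corrected step, cf. equation (2).
        step = lr * (np.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)) * m / (np.sqrt(v) + eps)
        return w - step, m, v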
Adam has been used in many applications owing to its com-
petitive performance and its ability to work well despite
minimal tuning (Karpathy, 2017). Recent work, however,
highlights the possible inability of adaptive methods to per-
form on par with SGD when measured by their ability to
generalize (Wilson et al., 2017).
Furthermore, the authors also show that even for simple quadratic problems, adaptive methods find solutions that
can be orders-of-magnitude worse at generalization than
those found by SGD(M).
Indeed, for several state-of-the-art results in language mod-
eling and computer vision, the optimizer of choice is SGD
(Merity et al., 2017; Loshchilov & Hutter, 2016; He et al.,
2015). Interestingly however, in these and other instances,
Adam outperforms SGD in both training and generaliza-
tion metrics in the initial portion of the training, but then
the performance stagnates. This motivates the investigation
of a strategy that combines the benefits of Adam, viz. good
performance with default hyperparameters and fast initial
progress, and the generalization properties of SGD. Given
the insights of Wilson et al. (2017) which suggest that the
lack of generalization performance of adaptive methods
stems from the non-uniform scaling of the gradient, a nat-
ural hybrid strategy would begin the training process with
Adam and switch to SGD when appropriate. To investi-
gate this further, we propose SWATS, a simple strategy that
combines the best of both worlds by Switching from Adam to SGD. The switch is designed to be automatic and one
that does not introduce any more hyper-parameters. The
choice of not adding additional hyperparameters is deliber-
ate since it allows for a fair comparison between Adam and
SWATS. Our experiments on several architectures and data
sets suggest that such a strategy is indeed effective.
Several attempts have been made at improving the con-
vergence and generalization performance of Adam. The
closest to our proposed approach is (Zhang et al., 2017) in
which the authors propose ND-Adam, a variant of Adam
which preserves the gradient direction by a nested opti-
mization procedure. This, however, introduces an addi-
tional hyperparameter along with the $(\alpha, \beta_1, \beta_2)$ used in
Adam. Further, empirically, this adaptation sacrifices the
rapid initial progress typically observed for Adam. In
Anonymous (2018), the authors investigate Adam and as-
cribe the poor generalization performance to training issues
arising from the non-monotonic nature of the steps. The
authors propose a variant of Adam called AMSGrad which
monotonically reduces the step sizes and possesses theoret-
ical convergence guarantees. Despite these guarantees, we
empirically found the generalization performance of AMS-
Grad to be similar to that of Adam on problems where a
generalization gap exists between Adam and SGD. We note
that in the context of the hypothesis of Wilson et al. (2017),
all of the aforementioned methods would still yield poor
generalization given that the scaling of the gradient is non-
uniform.
The idea of switching from an adaptive method to SGD
is not novel and has been explored previously in the con-
text of machine translation (Wu et al., 2016) and ImageNet
training (Akiba et al., 2017). Wu et al. (2016) use such a
mixed strategy for training and tune both the switchover
point and the learning rate for SGD after the switch. Akiba
et al. (2017) use a similar strategy but use a convex com-
bination of RMSProp and SGD steps whose contributions
and learning rates are tuned.
In our strategy, the switchover point and the SGD learn-
ing rate are both learned as a part of the training process.
We monitor a projection of the Adam step on the gradi-
ent subspace and use its exponential average as an estimate
for the SGD learning rate after the switchover. Further, the
switchover is triggered when no change in this monitored
quantity is detected. We describe this strategy in detail in
Section 2. In Section 3, we describe our experiments com-
paring Adam, SGD and SWATS on a host of benchmark
problems. Finally, in Section 4, we present ideas for future
research and concluding remarks. We conclude this section
by emphasizing the goal of this work is less to propose a
new training algorithm but rather to empirically investigate
the viability of hybrid training for improving generaliza-
tion.

2. SWATS
To investigate the generalization gap between Adam and
SGD, let us consider the training of the CIFAR-10 data
set (Krizhevsky & Hinton, 2009) on the DenseNet archi-
tecture (Iandola et al., 2014). This is an example of an
instance where a significant generalization gap exists be-
tween Adam and SGD. We plot the performance of Adam
and SGD on this task but also consider a variant of Adam
which we call Adam-Clip$(p, q)$. Given $(p, q)$ such that $p < q$, the iterates for this variant take the form
$$w_k = w_{k-1} - \operatorname{clip}\!\left(\frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k}\,\frac{\alpha_{k-1}}{\sqrt{v_{k-1}}+\epsilon},\; p\,\alpha_{\text{sgd}},\; q\,\alpha_{\text{sgd}}\right) m_{k-1}.$$
Here, $\alpha_{\text{sgd}}$ is the tuned value of the learning rate for SGD that leads to the best performance for the same task. The function $\operatorname{clip}(x, a, b)$ clips the vector $x$ element-wise such that the output is constrained to be in $[a, b]$. Note that Adam-Clip$(1, 1)$ would correspond to SGD. The network is trained using Adam, SGD and two variants, Adam-Clip$(1, \infty)$ and Adam-Clip$(0, 1)$, with tuned learning rates for 200 epochs, reducing the learning rate by a factor of 10 after 150
epochs. The goal of this experiment is to investigate the effect of constraining the large and small step sizes that Adam implicitly learns, i.e., $\frac{\sqrt{1-\beta_2^k}}{1-\beta_1^k}\,\frac{\alpha_{k-1}}{\sqrt{v_{k-1}}+\epsilon}$, on the generalization performance of the network. We present the results in Figure 1.
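To make the construction concrete, a hedged numpy sketch of one Adam-Clip(p, q) step follows; lr_sgd stands for the tuned SGD learning rate $\alpha_{\text{sgd}}$ and all other names are illustrative:

    import numpy as np

    def adam_clip_step(w, m, v, grad, k, p, q, lr_sgd, lr=1e-3,
                       beta1=0.9, beta2=0.999, eps=1e-9):
        m = beta1 * m + (1 - beta1) * grad
        v = beta2 * v + (1 - beta2) * grad ** 2
        # Per-parameter step sizes that Adam implicitly learns.
        step_size = lr * (np.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)) / (np.sqrt(v) + eps)
        # Constrain them to [p * lr_sgd, q * lr_sgd]; q = np.inf leaves them unbounded above.
        step_size = np.clip(step_size, p * lr_sgd, q * lr_sgd)
        return w - step_size * m, m, v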
As seen from Figure 1, SGD converges to the expected testing error of 5% while Adam stagnates in performance at around 7% error. We note that fine-tuning of the learning rate schedule (primarily the initial value, reduction amount and the timing) did not lead to better performance. Also, note the rapid initial progress of Adam relative to SGD. This experiment is in agreement with the experimental observations of Wilson et al. (2017). Interestingly, Adam-Clip$(0, 1)$ has no tangible effect on the final generalization performance, while Adam-Clip$(1, \infty)$ partially closes the generalization gap by achieving a final error of 6%.
We observe similar results for several architectures, data
sets and modalities whenever a generalization gap exists
between SGD and Adam. This stands as evidence that the
step sizes learned by Adam could circumstantially be too
small for effective convergence. This observation regarding
the need to lower-bound the step sizes of Adam is similar
to the one made in Anonymous (2018), where the authors
devise a one-dimensional example in which infrequent but
large gradients are not emphasized sufficiently causing the
non-convergence of Adam.
Given the potential insufficiency of Adam, even when con-
straining one side of the accumulator, we consider switch-
ing to SGD once we have reaped the benefits of Adam’s
Figure 1. Training the DenseNet architecture on the CIFAR-10 data set with four optimizers: SGD, Adam, Adam-Clip$(1, \infty)$ and Adam-Clip$(0, 1)$. SGD achieves the best testing accuracy while training with Adam leads to a generalization gap of roughly 2%. Setting a minimum learning rate for each parameter of Adam partially closes the generalization gap.
rapid initial progress. This raises two questions: (a) when
to switch over from Adam to SGD, and (b) what learn-
ing rate to use for SGD after the switch. Assuming that
the learning rate of SGD after the switchover is tuned, we
found that switching too late does not yield generalization
improvements while switching too early may cause the hy-
brid optimizer to not benefit from Adam’s initial progress.
Indeed, as shown in Figure 2, switching after 10 epochs leads to a learning curve very similar to that of SGD, while switching after 80 epochs leads to an inferior testing error of 6.5%. To investigate the efficacy of a hybrid strat-
egy whilst ensuring no increase in the number of hyperpa-
rameters (a necessity for fair comparison with Adam), we
propose SWATS, a strategy that automates the process of
switching over by determining both the switchover point
and the learning rate of SGD after the switch.
2.1. Learning rate for SGD after the switch
Consider an iterate $w_k$ with a stochastic gradient $g_k$ and a step computed by Adam, $p_k$. For the sake of simplicity, assume that $p_k \neq 0$ and $p_k^T g_k < 0$. This is a common requirement imposed on directions to derive convergence (Nocedal & Wright, 2006). In the case when $\beta_1 = 0$ for Adam, i.e., no first-order exponential averaging is used, this

Figure 2. Training the DenseNet architecture on the CIFAR-10 data set using Adam and switching to SGD with learning rate 0.1 and momentum 0.9 after (10, 40, 80) epochs; the switchover point is denoted by Sw@ in the figure. Switching early enables the model to achieve testing accuracy comparable to SGD but switching too late in the training process leads to a generalization gap similar to Adam.
is trivially true since
$$p_k = -\underbrace{\frac{\sqrt{1-\beta_2^{k+1}}}{1-\beta_1^{k+1}}\,\frac{\alpha_k}{\sqrt{v_k}+\epsilon}}_{:=\operatorname{diag}(H_k)}\, g_k,$$
with $H_k \succ 0$, where $\operatorname{diag}(A)$ denotes the vector constructed from the diagonal of $A$. Ordinarily, to train using Adam, we would update the iterate as:
$$w_{k+1} = w_k + p_k.$$
To determine a feasible learning rate for SGD, $\gamma_k$, we propose solving the subproblem of finding a $\gamma_k$ such that
$$\operatorname{proj}_{\gamma_k g_k} p_k = p_k,$$
where $\operatorname{proj}_a b$ denotes the orthogonal projection of $a$ onto $b$. This scalar optimization problem can be solved in closed form to yield:
$$\gamma_k = -\frac{p_k^T p_k}{p_k^T g_k},$$
since $p_k = \operatorname{proj}_{\gamma_k g_k} p_k = \gamma_k \frac{g_k^T p_k}{p_k^T p_k}\, p_k$ implies the above equality.
Geometrically, this can be interpreted as the scaling necessary for the gradient that leads to its projection on the Adam step $p_k$ to be $p_k$ itself; see Figure 3. Note that this is not the same as an orthogonal projection of $p_k$ on $g_k$. Empirically, we found that an orthogonal projection consistently underestimates the SGD learning rate necessary, leading to much smaller SGD steps. Indeed, the $\ell_2$ norm of an orthogonally projected step will always be less than or equal to that of $p_k$, which is undesirable given our needs. The non-orthogonal projection proposed above does not suffer from this problem, and empirically we found that it estimates the SGD learning rate well. A simple scaling rule of $\gamma_k = \|p_k\| / \|g_k\|$ was also not found to be successful. We attribute this to the fact that a scaling rule of this form ignores the relative importance of the coordinate directions and tends to amplify the importance of directions with a large step $p$ but small first-order importance $g$, and vice versa.
Note again that if no momentum ($\beta_1 = 0$) is employed in Adam, then necessarily $\gamma_k > 0$ since $H_k \succ 0$. We should mention in passing that in this case $\gamma_k$ is equivalent to the reciprocal of the Rayleigh quotient of $H_k^{-1}$ with respect to the vector $p_k$.
Since $\gamma_k$ is a noisy estimate of the scaling needed, we maintain an exponential average initialized at 0, denoted by $\lambda_k$, such that
$$\lambda_k = \beta_2 \lambda_{k-1} + (1 - \beta_2)\, \gamma_k.$$
We use the $\beta_2$ of Adam, see (3), as the averaging coefficient
since this reuse avoids another hyperparameter and also be-
cause the performance is relatively invariant to fine-grained
specification of this parameter.
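Assuming flattened parameter vectors, the per-iteration estimate and its exponential average can be computed as in the following minimal numpy sketch (the function and variable names are ours):

    import numpy as np

    def estimate_sgd_lr(p, g, lam, beta2, k):
        # p: Adam step at iteration k (includes the minus sign), g: stochastic gradient,
        # lam: exponential average lambda_{k-1} (initialized at 0), k: 1-based iteration.
        gamma = -np.dot(p, p) / np.dot(p, g)       # gamma_k = -p^T p / (p^T g)
        lam = beta2 * lam + (1 - beta2) * gamma    # lambda_k
        lam_corrected = lam / (1 - beta2 ** k)     # bias-corrected estimate
        return gamma, lam, lam_corrected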
2.2. Switchover Point
Having answered the question of what learning rate to choose for SGD after the switch, we now discuss when to switch. We propose checking a simple, yet powerful, criterion:
$$\left| \frac{\lambda_k}{1 - \beta_2^k} - \gamma_k \right| < \epsilon, \tag{4}$$
at every iteration with $k > 1$. The condition compares the bias-corrected exponential averaged value and the current value ($\gamma_k$). The bias correction is necessary to prevent the influence of the zero initialization during the initial portion of training. Once this condition is true, we switch over to SGD with learning rate $\Lambda := \lambda_k / (1 - \beta_2^k)$. We also experi-
mented with more complex criteria including those involv-
ing monitoring of gradient norms. However, we found that
this simple un-normalized criterion works well across a va-
riety of different applications.
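As a concrete illustration, the switching test can be written as in the sketch below, reusing the $\epsilon = 10^{-9}$ value listed in Algorithm 1; this is an illustrative sketch, not the authors' implementation:

    def should_switch(lam, gamma, beta2, k, eps=1e-9):
        # Switch once the bias-corrected average agrees with the current estimate, cf. (4).
        lam_corrected = lam / (1 - beta2 ** k)
        if k > 1 and abs(lam_corrected - gamma) < eps:
            return True, lam_corrected   # lam_corrected becomes the SGD learning rate Lambda
        return False, None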

Figure 3. Illustrating the learning rate for SGD ($\gamma_k$) estimated by our proposed projection, given an iterate $w_k$, a stochastic gradient $g_k$ and the Adam step $p_k$.
In the case when $\beta_1 > 0$, we switch to SGDM with learning rate $(1 - \beta_1)\Lambda$ and momentum parameter $\beta_1$. The $(1 - \beta_1)$ factor is the common momentum correction. Refer to Algorithm 1 for a unified view of the algorithm.
3. Numerical Results
To demonstrate the efficacy of our approach, we present nu-
merical experiments comparing the proposed strategy with
Adam and SGD. We consider the problems of image clas-
sification and language modeling.
For the former, we experiment with four architectures:
ResNet-32 (He et al., 2015), DenseNet (Iandola et al.,
2014), PyramidNet (Han et al., 2016), and SENet (Hu
et al., 2017) on the CIFAR-10 and CIFAR-100 data sets
(Krizhevsky & Hinton, 2009). The goal is to classify im-
ages into one of 10 classes for CIFAR-10 and 100 classes
for CIFAR-100. The data sets contain 50,000 32×32 RGB images in the training set and 10,000 images in the testing set. We choose these architectures given their superior
performance on several image classification benchmarking
tasks. For a large-scale image classification experiment,
we experiment with the Tiny-ImageNet data set¹ on the ResNet-18 architecture (He et al., 2015). This data set is a subset of the ILSVRC 2012 data set (Deng et al., 2009) and contains 200 classes with 500 224×224 RGB images per class in the training set and 50 per class in the valida-
tion and testing sets. We choose this data set given that it is
a good proxy for the performance on the larger ImageNet
data set.
We also present results for word-level language modeling
where the task is to take as inputs a sequence of words and
predict the next word. We choose this task because of its
¹https://tiny-imagenet.herokuapp.com/

Algorithm 1 SWATS
Inputs: Objective function $f$, initial point $w_0$, learning rate $\alpha = 10^{-3}$, accumulator coefficients $(\beta_1, \beta_2) = (0.9, 0.999)$, $\epsilon = 10^{-9}$, phase = Adam.
 1: Initialize $k \leftarrow 0$, $m_k \leftarrow 0$, $a_k \leftarrow 0$, $\lambda_k \leftarrow 0$
 2: while stopping criterion not met do
 3:   $k = k + 1$
 4:   Compute stochastic gradient $g_k = \hat{\nabla} f(w_{k-1})$
 5:   if phase = SGD then
 6:     $v_k = \beta_1 v_{k-1} + g_k$
 7:     $w_k = w_{k-1} - (1 - \beta_1)\Lambda\, v_k$
 8:     continue
 9:   end if
10:   $m_k = \beta_1 m_{k-1} + (1 - \beta_1)\, g_k$
11:   $a_k = \beta_2 a_{k-1} + (1 - \beta_2)\, g_k^2$
12:   $p_k = -\alpha_k\, \frac{\sqrt{1 - \beta_2^k}}{1 - \beta_1^k}\, \frac{m_k}{\sqrt{a_k} + \epsilon}$
13:   $w_k = w_{k-1} + p_k$
14:   if $p_k^T g_k \neq 0$ then
15:     $\gamma_k = -p_k^T p_k / (p_k^T g_k)$
16:     $\lambda_k = \beta_2 \lambda_{k-1} + (1 - \beta_2)\, \gamma_k$
17:     if $k > 1$ and $|\lambda_k / (1 - \beta_2^k) - \gamma_k| < \epsilon$ then
18:       phase = SGD
19:       $v_k = 0$
20:       $\Lambda = \lambda_k / (1 - \beta_2^k)$
21:     end if
22:   else
23:     $\lambda_k = \lambda_{k-1}$
24:   end if
25: end while
return $w_k$
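For concreteness, the following is a compact numpy rendering of Algorithm 1 on flattened parameter vectors. It is an illustrative sketch of the procedure described above, not the authors' reference implementation; the toy usage at the end is ours:

    import numpy as np

    def swats(grad_fn, w, n_iters=10000, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-9):
        # grad_fn(w) returns a stochastic gradient at w.
        m, a, v = np.zeros_like(w), np.zeros_like(w), np.zeros_like(w)
        lam, Lam, phase = 0.0, None, "adam"
        for k in range(1, n_iters + 1):
            g = grad_fn(w)
            if phase == "sgd":
                v = beta1 * v + g                   # SGDM buffer with momentum beta1
                w = w - (1 - beta1) * Lam * v       # (1 - beta1) is the momentum correction
                continue
            # Adam phase
            m = beta1 * m + (1 - beta1) * g
            a = beta2 * a + (1 - beta2) * g ** 2
            p = -lr * (np.sqrt(1 - beta2 ** k) / (1 - beta1 ** k)) * m / (np.sqrt(a) + eps)
            w = w + p
            if np.dot(p, g) != 0:
                gamma = -np.dot(p, p) / np.dot(p, g)
                lam = beta2 * lam + (1 - beta2) * gamma
                if k > 1 and abs(lam / (1 - beta2 ** k) - gamma) < eps:
                    phase, v, Lam = "sgd", np.zeros_like(w), lam / (1 - beta2 ** k)
        return w

    # Toy usage: a noisy quadratic whose stochastic gradient is w plus Gaussian noise.
    rng = np.random.default_rng(0)
    w_final = swats(lambda w: w + 0.01 * rng.standard_normal(w.shape), np.ones(5))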
broad importance, the inherent difficulties that arise due
to long term dependencies (Hochreiter & Schmidhuber,
1997), and since it is a proxy for other sequence learning
tasks such as machine translation (Bahdanau et al., 2014).
We use the Penn Treebank (PTB) (Mikolov et al., 2011) and
the larger WikiText-2 (WT-2) (Merity et al., 2016) data sets
and experimented with the AWD-LSTM and AWD-QRNN
architectures. In the case of SGD, we clip the gradients to a norm of 0.25, while we perform no such clipping for Adam and SWATS. We found that the performance of SGD deteriorates without clipping, and that of Adam and SWATS deteriorates with it.
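In PyTorch, which was used for the experiments, such norm-based clipping is typically applied between the backward pass and the optimizer step; the model, dimensions and learning rate in the sketch below are placeholders rather than the exact experimental setup:

    import torch

    model = torch.nn.LSTM(input_size=400, hidden_size=1150)   # placeholder recurrent model
    optimizer = torch.optim.SGD(model.parameters(), lr=55.0)   # e.g., the tuned PTB value in Table 1

    output, _ = model(torch.randn(35, 20, 400))                # dummy batch (seq_len, batch, emb)
    output.sum().backward()                                    # dummy loss
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.25)  # clip gradients to norm 0.25
    optimizer.step()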
The AWD-LSTM architecture uses a multi-layered LSTM
network with learned embeddings while the AWD-QRNN
replaces the expensive LSTM layer by the cheaper QRNN
layer (Bradbury et al., 2016) which uses convolutions in-
stead of recurrences. The model is regularized with Drop-
Connect (Wan et al., 2013) on the hidden-to-hidden con-
nections as well as other strategies such as weight decay,
embedding-softmax weight tying, activity regularization
and temporal activity regularization. We refer the reader

to (Merity et al., 2016) for additional details regarding the
data sets including the sizes of the training, validation and
testing sets, size of the vocabulary, and source of the data.
For our experiments, we tuned the learning rate of all optimizers, and report the best-performing configuration in terms of generalization. The learning rates of Adam and SWATS were chosen from a grid of {0.0005, 0.0007, 0.001, 0.002, 0.003, 0.004, 0.005}. For both optimizers, we use the (default) recommended values $(\beta_1, \beta_2) = (0.9, 0.999)$. Note that this implies that, in all cases, we switch from Adam to SGDM with a momentum coefficient of 0.9. For tuning the learning rate for the SGD(M) optimizer, we first coarsely tune the learning rate on a logarithmic scale from $10^{-3}$ to $10^{2}$ and then fine-tune the learning rate. For all cases, we experiment with and without employing momentum but don't tune this parameter ($\mu = 0.9$). We found this overall procedure to perform better than a generic grid search or hyperparameter optimization given the vastly different scales of learning rates needed for different modalities. For instance, SGD with learning rate 0.7 performed best for the DenseNet task on CIFAR-10, but for the PTB language modeling task using the LSTM architecture, a learning rate of 50 for SGD was necessary. Hyperparameters such as batch size, dropout probability, $\ell_2$-norm decay, etc. were chosen to match the recommendations of the respective base architectures. We trained all networks for a total of 300 epochs and reduced the learning rate by a factor of 10 at epochs 150, 225 and 262. This scheme was surprisingly powerful at obtaining good performance across the different modalities and architectures.
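The step-decay schedule described above can be expressed with PyTorch's built-in MultiStepLR scheduler. The sketch below is illustrative: the model is a placeholder and the learning rate is the DenseNet/CIFAR-10 value quoted above:

    import torch

    model = torch.nn.Linear(10, 10)    # placeholder for the actual network
    optimizer = torch.optim.SGD(model.parameters(), lr=0.7, momentum=0.9)
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[150, 225, 262], gamma=0.1)   # divide the lr by 10 at these epochs

    for epoch in range(300):
        # ... one epoch of training over the data set would go here ...
        scheduler.step()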
The experiments were coded in PyTorch² and conducted using job scheduling on 16 NVIDIA Tesla K80 GPUs for roughly 3 weeks.

²pytorch.org
The experiments comparing SGD, Adam and SWATS on
the CIFAR and Tiny-ImageNet data sets are presented in
Figures 4 and 5, respectively. The experiments compar-
ing the optimizers on the language modeling tasks are pre-
sented in Figure 6. In Table 1, we summarize the meta-data
concerning our experiments including the learning rates
that achieved the best performance, and, in the case of
SWATS, the number of epochs before the switch occurred
and the learning rate ($\Lambda$) for SGD after the switch. Finally,
in Figure 7, we depict the evolution of the estimated SGD
learning rate ($\gamma_k$) as the algorithm progresses on two representative tasks.
With respect to the image classification data sets, it is ev-
ident that, across different architectures, on all three data
sets, Adam fails to find solutions that generalize well de-
spite making good initial progress. This is in agreement
with the findings of (Wilson et al., 2017). As can be
seen from Table 1, the switch from Adam to SGD happens within the first 20 epochs for most CIFAR data sets and at epoch 49 for Tiny-ImageNet. Curiously, in the case of the Tiny-ImageNet problem, the switch from Adam to SGD leads to a significant but temporary degradation in performance. Despite the testing accuracy dropping from 80% to 52% immediately after the switch, the model recovers
and achieves a better peak testing accuracy compared to
Adam. We observed similar outcomes for several other ar-
chitectures on this data set.
In the language modeling tasks, Adam outperforms SGD
not only in final generalization performance but also in
the number of epochs necessary to attain that performance.
This is not entirely surprising given that Merity et al. (2017)
required iterate averaging for SGD to achieve state-of-the-
art performance despite gradient clipping or learning rate
decay rules. In this case, SWATS switches over to SGD,
albeit later in the training process, but achieves compara-
ble generalization performance to Adam as measured by
the lowest validation perplexity achieved in the experiment.
Again, as in the case of the Tiny-ImageNet experiment
(Figure 5), the switch may cause a temporary degradation
in performance from which the model is able to recover.
These experiments suggest that it is indeed possible to com-
bine the best of both worlds for these tasks: in all the tasks
described, SWATS performs almost as well as the best al-
gorithm amongst SGD and Adam, and in several cases
achieves a good initial decrease in the error metric.
Figure 7 shows that the estimated learning rate for SGD
($\gamma_k$) is noisy but convergent (in mean), and that it converges
to a value of similar scale as the value obtained by tuning
the SGD optimizer (see Table 1). We emphasize that other
than the learning rate, no other hyperparameters were tuned
between the experiments.
4. Discussion and Conclusion
Wilson et al. (2017) pointed to the insufficiency of adaptive
methods, such as Adam, Adagrad and RMSProp, at gener-
alizing in a fashion comparable to that of SGD. In the case
of a convex quadratic function, the authors demonstrate that
adaptive methods provably converge to a point with orders-
of-magnitude worse generalization performance than SGD.
The authors attribute this generalization gap to the scaling of the per-variable learning rates characteristic of adaptive methods, as we explain below.
Nevertheless, adaptive methods are important given their
rapid initial progress, relative insensitivity to hyperparam-
eters, and ability to deal with ill-scaled problems. Several
recent papers have attempted to explain and improve adap-
tive methods (Loshchilov & Hutter, 2017; Anonymous,
2018; Zhang et al., 2017). However, given that they retain
the adaptivity and non-uniform gradient scaling, they too

Figure 4. Numerical experiments comparing SGD(M), Adam and SWATS with tuned learning rates on the ResNet-32, DenseNet, PyramidNet and SENet architectures on the CIFAR-10 and CIFAR-100 data sets. Panels: (a) ResNet-32 on CIFAR-10, (b) DenseNet on CIFAR-10, (c) PyramidNet on CIFAR-10, (d) SENet on CIFAR-10, (e) ResNet-32 on CIFAR-100, (f) DenseNet on CIFAR-100, (g) PyramidNet on CIFAR-100, (h) SENet on CIFAR-100.
Model        Data Set       SGDM   Adam    SWATS    Λ       Switchover Point (epochs)
ResNet-32    CIFAR-10       0.1    0.001   0.001    0.52    1.37
DenseNet     CIFAR-10       0.1    0.001   0.001    0.79    11.54
PyramidNet   CIFAR-10       0.1    0.001   0.0007   0.85    4.94
SENet        CIFAR-10       0.1    0.001   0.001    0.54    24.19
ResNet-32    CIFAR-100      0.3    0.002   0.002    1.22    10.42
DenseNet     CIFAR-100      0.1    0.001   0.001    0.51    11.81
PyramidNet   CIFAR-100      0.1    0.001   0.001    0.76    18.54
SENet        CIFAR-100      0.1    0.001   0.001    1.39    2.04
LSTM         PTB            55†    0.003   0.003    7.52    186.03
QRNN         PTB            35†    0.002   0.002    4.61    184.14
LSTM         WT-2           60†    0.003   0.003    1.11    259.47
QRNN         WT-2           60†    0.003   0.004    14.46   295.71
ResNet-18    Tiny-ImageNet  0.2    0.001   0.0007   1.71    48.91

Table 1. Summarizing the optimal learning rates for SGD(M), Adam and SWATS for all experiments and, in the case of SWATS, the value of the estimated learning rate Λ for SGD after the switch and the switchover point in epochs. † denotes that no momentum was employed for SGDM.

Figure 5. Numerical experiments comparing SGD(M), Adam and SWATS with tuned learning rates on the ResNet-18 architecture on the Tiny-ImageNet data set.
are expected to suffer from similar generalization issues as
Adam. Motivated by this observation, we investigate the
question of using a hybrid training strategy that starts with
an adaptive method and switches to SGD. By design, both
the switchover point and the learning rate for SGD after
the switch, are determined as a part of the algorithm and
as such require no added tuning effort. We demonstrate
the efficacy of this approach on several standard bench-
marks, including a host of architectures, on the Penn Treebank, WikiText-2, Tiny-ImageNet, CIFAR-10 and CIFAR-
100 data sets. In summary, our results show that the pro-
posed strategy leads to results comparable to SGD while
retaining the beneficial properties of Adam such as hyper-
parameter insensitivity and rapid initial progress.
The success of our strategy motivates a deeper exploration
into the interplay between the dynamics of the optimizer
and the generalization performance. Recent theoretical
work analyzing generalization for deep learning suggests
coupling generalization arguments with the training pro-
cess (Soudry et al., 2017; Hardt et al., 2015; Zhang et al.,
2016; Wilson et al., 2017). The optimizers choose differ-
ent trajectories in the parameter space and are attracted to
different basins of attractions, with vastly different general-
ization performance. Even for a simple least-squares prob-
lem, $\min_w \|Xw - y\|_2^2$ with $w_0 = 0$, SGD recovers the minimum-norm solution, with its associated margin benefits, whereas adaptive methods do not. The fundamental reason for this is that SGD ensures that the iterates remain in the row space of $X$, and that only one optimum exists in that row space, viz. the minimum-norm solution. On the other hand, adaptive methods do not necessarily stay in the row space of $X$. Similar arguments
can be constructed for logistic regression problems (Soudry
et al., 2017), but an analogous treatment for deep networks
is, to the best of our knowledge, an open question. We
hypothesize that a successful implementation of a hybrid
strategy, such as SWATS, suggests that in the case of deep networks, despite training with Adam for only a few epochs before switching to SGD, the model is able to navigate towards a basin with
better generalization performance. However, further em-
pirical and theoretical evidence is necessary to buttress this
hypothesis, and is a topic of future research.
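To make the least-squares argument above concrete, the small numpy experiment below (ours, with randomly generated data) shows that gradient descent started from zero recovers the pseudoinverse, i.e., minimum-norm, solution of an under-determined system:

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 20, 100                          # under-determined: more parameters than observations
    X = rng.standard_normal((n, d))
    y = rng.standard_normal(n)

    w = np.zeros(d)                         # w_0 = 0, so the iterates stay in the row space of X
    lr = 1.0 / np.linalg.norm(X, 2) ** 2    # step size below 2 / sigma_max(X)^2 for convergence
    for _ in range(5000):
        w -= lr * X.T @ (X @ w - y)         # gradient of 0.5 * ||Xw - y||^2

    w_min_norm = np.linalg.pinv(X) @ y      # minimum-norm interpolating solution
    print(np.linalg.norm(X @ w - y))        # ~0: the iterate interpolates the data
    print(np.linalg.norm(w - w_min_norm))   # ~0: gradient descent found the min-norm solution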
While the focus of this work has been on Adam, the strat-
egy proposed is generally applicable and can be analo-
gously employed to other adaptive methods such as Ada-
grad and RMSProp. A viable research direction includes
exploring the possibility of switching back-and-forth, as
needed, from Adam to SGD. Indeed, in our preliminary
experiments, we found that switching back from SGD to
Adam at the end of a 300 epoch run for any of the ex-
periments on the CIFAR-10 data set yielded slightly better
performance. Along the same line, a future research direc-
tion includes a smoother transition from Adam to SGD as
opposed to the hard switch proposed in this paper, which
may cause short-term performance degradation. This can
be achieved by using a convex combination of the SGD and
Adam directions as in the case of Akiba et al. (2017), and
gradually increasing the weight of the SGD contribution according to a suitable criterion. Finally, we note that the strategy proposed
in this paper does not preclude the use of those proposed in
Zhang et al. (2017); Loshchilov & Hutter (2017); Anony-
mous (2018). In the future, we plan to investigate the performance of the algorithm obtained by mixing in these strategies, such as monotonic increase guarantees on the second-order moment, cosine annealing, and $\ell_2$-norm correction.
References
Akiba, T., Suzuki, S., and Fukuda, K. Extremely large
minibatch SGD: Training resnet-50 on ImageNet in 15
minutes. arXiv preprint arXiv:1711.04325 , 2017.
Anonymous. On the convergence of Adam and beyond. International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=ryQu7f-RZ.
Bahdanau, D., Cho, K., and Bengio, Y. Neural machine
translation by jointly learning to align and translate.
arXiv preprint arXiv:1409.0473 , 2014.
Figure 6. Numerical experiments comparing SGD(M), Adam and SWATS with tuned learning rates on the AWD-LSTM and AWD-QRNN architectures on the PTB and WT-2 data sets. Panels: (a) LSTM on PTB, (b) LSTM on WT2, (c) QRNN on PTB, (d) QRNN on WT2.

Figure 7. Evolution of the estimated SGD learning rate ($\gamma_k$) on two representative tasks: (a) DenseNet on CIFAR-100, (b) QRNN on PTB.

Bradbury, J., Merity, S., Xiong, C., and Socher, R. Quasi-Recurrent Neural Networks. arXiv preprint arXiv:1611.01576, 2016.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-
Fei, L. ImageNet: A Large-Scale Hierarchical Image
Database. In CVPR09 , 2009.
Duchi, J., Hazan, E., and Singer, Y. Adaptive subgradient
methods for online learning and stochastic optimization.
The Journal of Machine Learning Research , 12:2121–
2159, 2011.
Han, D., Kim, J., and Kim, J. Deep pyramidal residual
networks. arXiv preprint arXiv:1610.02915 , 2016.
Hardt, M., Recht, B., and Singer, Y. Train faster, generalize
better: Stability of stochastic gradient descent. arXiv
preprint arXiv:1509.01240 , 2015.
He, K., Zhang, X., Ren, S., and Sun, J. Deep resid-
ual learning for image recognition. arXiv preprint
arXiv:1512.03385 , 2015.
Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

Hu, J., Shen, L., and Sun, G. Squeeze-and-excitation networks. arXiv preprint arXiv:1709.01507, 2017.
Iandola, F., Moskewicz, M., Karayev, S., Girshick, R.,
Darrell, T., and Keutzer, K. Densenet: Implementing
efficient convnet descriptor pyramids. arXiv preprint
arXiv:1404.1869 , 2014.
Karpathy, A. A Peek at Trends in Machine Learning. https://medium.com/@karpathy/a-peek-at-trends-in-machine-learning-ab8a1085a106, 2017. [Online; accessed 12-Dec-2017].
Kingma, D. and Ba, J. Adam: A method for stochastic
optimization. In International Conference on Learning
Representations (ICLR 2015) , 2015.
Krizhevsky, A. and Hinton, G. Learning multiple layers of
features from tiny images. 2009.
Lee, J., Simchowitz, M., Jordan, M. I., and Recht, B. Gradi-
ent descent converges to minimizers. University of Cal-
ifornia, Berkeley , 1050:16, 2016.
Loshchilov, I. and Hutter, F. SGDR: Stochastic gradient
descent with warm restarts. 2016.
Loshchilov, I. and Hutter, F. Fixing Weight Decay Regu-
larization in Adam. ArXiv e-prints , November 2017.
Mandt, S., Hoffman, M. D., and Blei, D. M. Stochastic
Gradient Descent as Approximate Bayesian Inference.
ArXiv e-prints , April 2017.
Merity, S., Xiong, C., Bradbury, J., and Socher, R.
Pointer sentinel mixture models. arXiv preprint
arXiv:1609.07843 , 2016.
Merity, S., Keskar, N., and Socher, R. Regularizing and
Optimizing LSTM Language Models. arXiv preprint
arXiv:1708.02182 , 2017.

Mikolov, T., Kombrink, S., Deoras, A., Burget, L., and Cer-
nocky, J. RNNLM-recurrent neural network language
modeling toolkit. In Proc. of the 2011 ASRU Workshop ,
pp. 196–201, 2011.
Nocedal, J. and Wright, S. Numerical optimization .
Springer Science & Business Media, 2006.
Robbins, H. and Monro, S. A stochastic approximation method. The Annals of Mathematical Statistics, pp. 400–407, 1951.
Soudry, D., Hoffer, E., and Srebro, N. The implicit bias
of gradient descent on separable data. arXiv preprint
arXiv:1710.10345 , 2017.
Sutskever, I., Martens, J., Dahl, G., and Hinton, G. On
the importance of initialization and momentum in deep
learning. In International conference on machine learn-
ing, pp. 1139–1147, 2013.
Tieleman, T. and Hinton, G. Lecture 6.5-RMSProp: Divide
the gradient by a running average of its recent magni-
tude. COURSERA: Neural Networks for Machine Learn-
ing, 4, 2012.
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R.
Regularization of neural networks using dropconnect. In
Proceedings of the 30th international conference on ma-
chine learning (ICML-13) , pp. 1058–1066, 2013.
Wilson, A. C., Roelofs, R., Stern, M., Srebro, N., and
Recht, B. The Marginal Value of Adaptive Gradient
Methods in Machine Learning. ArXiv e-prints , May
2017.
Wu, Y., Schuster, M., Chen, Z., Le, Q., Norouzi, M., Macherey, W., Krikun, M., Cao, Y., Gao, Q., Macherey,
K., et al. Google’s neural machine translation system:
Bridging the gap between human and machine transla-
tion. arXiv preprint arXiv:1609.08144 , 2016.
Zhang, C., Bengio, S., Hardt, M., Recht, B., and Vinyals,
O. Understanding deep learning requires rethinking gen-
eralization. arXiv preprint arXiv:1611.03530 , 2016.
Zhang, Z., Ma, L., Li, Z., and Wu, C. Nor-
malized direction-preserving Adam. arXiv preprint
arXiv:1709.04546 , 2017.
