Deep Learning: Optimization for Training Deep Models (Part 2)

Challenges in Neural Network Optimization

When training neural networks, we must confront the general non-convex case. Even convex optimization is not without its complications. In this section, we summarize several of the most prominent challenges involved in optimization for training deep models.


Ill-Conditioning

Some challenges arise even when optimizing convex functions. Of these, the most prominent is ill-conditioning of the Hessian matrix $H$.
The ill-conditioning problem is generally believed to be present in neural network training problems. Ill-conditioning can manifest by causing SGD to get “stuck” in the sense that even very small steps increase the cost function.
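A second-order Taylor series expansion of the cost function predicts that a gradient descent step of $-\epsilon g$ will add

$$\frac{1}{2}\epsilon^2 g^\top H g - \epsilon g^\top g$$

to the cost. Ill-conditioning of the gradient becomes a problem when $\frac{1}{2}\epsilon^2 g^\top H g$ exceeds $\epsilon g^\top g$.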
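
To make the condition above concrete, here is a minimal sketch on a toy quadratic cost $f(x) = \frac{1}{2} x^\top H x$, for which the second-order expansion is exact. The Hessian `diag(1, 100)`, the starting point, and the step sizes are assumptions chosen purely for illustration; they are not taken from any real network.

```python
# Toy quadratic cost f(x) = 0.5 * x^T H x with a diagonal, ill-conditioned Hessian.
H_DIAG = [1.0, 100.0]  # assumed eigenvalues of H (condition number 100)

def cost(x):
    return 0.5 * sum(h * xi * xi for h, xi in zip(H_DIAG, x))

def grad(x):
    return [h * xi for h, xi in zip(H_DIAG, x)]

def predicted_change(x, eps):
    """Second-order Taylor prediction for the step x - eps * g."""
    g = grad(x)
    gTg = sum(gi * gi for gi in g)
    gTHg = sum(h * gi * gi for h, gi in zip(H_DIAG, g))
    return 0.5 * eps ** 2 * gTHg - eps * gTg

def actual_change(x, eps):
    """Measured change in cost after taking the step x - eps * g."""
    g = grad(x)
    x_new = [xi - eps * gi for xi, gi in zip(x, g)]
    return cost(x_new) - cost(x)

x0 = [1.0, 1.0]
for eps in (0.001, 0.01, 0.03):
    print(eps, predicted_change(x0, eps), actual_change(x0, eps))
```

In this toy setting the two smaller step sizes decrease the cost, while the largest one increases it, because the curvature term $\frac{1}{2}\epsilon^2 g^\top H g$ has overtaken $\epsilon g^\top g$.
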
To determine whether ill-conditioning is detrimental to a neural network training task, one can monitor the squared gradient norm $g^\top g$ and the $g^\top H g$ term. In many cases, the gradient norm does not shrink significantly throughout learning, but the $g^\top H g$ term grows by more than an order of magnitude. The result is that learning becomes very slow despite the presence of a strong gradient, because the learning rate must be shrunk to compensate for even stronger curvature.
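
One possible way to monitor these two quantities without forming the full Hessian is sketched below. It approximates the Hessian-vector product $Hg$ with a central finite difference of the gradient. That scheme, the helper names `hessian_vector_product` and `curvature_diagnostics`, and the stand-in gradient `toy_grad` are all assumptions made for this illustration. In practice, an automatic differentiation library would normally supply the gradient and the Hessian-vector product.

```python
def hessian_vector_product(grad_fn, x, v, r=1e-5):
    """Approximate H v by a central finite difference of the gradient (assumed scheme)."""
    g_plus = grad_fn([xi + r * vi for xi, vi in zip(x, v)])
    g_minus = grad_fn([xi - r * vi for xi, vi in zip(x, v)])
    return [(gp - gm) / (2.0 * r) for gp, gm in zip(g_plus, g_minus)]

def curvature_diagnostics(grad_fn, x):
    """Return the pair (g^T g, g^T H g) evaluated at the point x."""
    g = grad_fn(x)
    Hg = hessian_vector_product(grad_fn, x, g)
    gTg = sum(gi * gi for gi in g)
    gTHg = sum(gi * hgi for gi, hgi in zip(g, Hg))
    return gTg, gTHg

# Hypothetical stand-in for a network's gradient: a quadratic with Hessian diag(1, 100).
def toy_grad(x):
    return [1.0 * x[0], 100.0 * x[1]]

gTg, gTHg = curvature_diagnostics(toy_grad, [1.0, 1.0])
print("g^T g =", gTg, " g^T H g =", gTHg, " ratio =", gTHg / gTg)
```

Logging the ratio `gTHg / gTg` over the course of training gives a rough sense of how small the learning rate must be for a gradient step to keep decreasing the cost.
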
Though ill-conditioning is present in other settings besides neural network training, some of the techniques used to combat it in other contexts are less applicable to neural networks. For example, Newton’s method is an excellent tool for minimizing convex functions with poorly conditioned Hessian matrices, but in the subsequent sections we will argue that Newton’s method requires significant modification before it can be applied to neural networks.
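
The convex half of this claim can be illustrated with a short sketch. It compares plain gradient descent with a single Newton step on a toy quadratic whose Hessian is assumed to be `diag(1, 100)`. The learning rate and step count are arbitrary choices. The example says nothing about the non-convex neural network case, where the modifications mentioned above become necessary.

```python
H_DIAG = [1.0, 100.0]  # assumed diagonal Hessian with condition number 100

def grad(x):
    return [h * xi for h, xi in zip(H_DIAG, x)]

def gradient_descent(x, eps, steps):
    """Run a fixed number of plain gradient descent steps."""
    for _ in range(steps):
        x = [xi - eps * gi for xi, gi in zip(x, grad(x))]
    return x

def newton_step(x):
    """One Newton update x - H^{-1} g; for a diagonal H this is elementwise division."""
    return [xi - gi / h for xi, gi, h in zip(x, grad(x), H_DIAG)]

x0 = [1.0, 1.0]
x_gd = gradient_descent(x0, eps=0.019, steps=100)  # eps kept just below 2 / 100
x_newton = newton_step(x0)
print("gradient descent after 100 steps:", x_gd)
print("Newton after 1 step:", x_newton)
```

The learning rate is capped by the largest curvature, so gradient descent makes slow progress along the low-curvature direction. The Newton step rescales each direction by its curvature and reaches the minimum of this quadratic directly.
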

Local Minima

With non-convex functions, such as neural nets, it is possible to have many local minima. Indeed, nearly any deep model is essentially guaranteed to have an extremely large number of local minima.
Neural networks and any models with multiple equivalently parametrized latent variables all have multiple local minima because of the model identifiability problem.
A model is said to be identifiable if a sufficiently large training set can rule out all but one setting of the model’s parameters. Models with latent variables are often not identifiable because we can obtain equivalent models by exchanging latent variables with each other.
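
For a neural network, one concrete instance of this exchange is permuting the hidden units of a layer. The sketch below uses a hypothetical one-hidden-layer `tanh` network with arbitrary, made-up weights. Reordering the hidden units, together with their incoming and outgoing weights, yields a different parameter setting that computes the same function.

```python
import math

def mlp(x, W1, b1, w2):
    """One-hidden-layer network: y = sum_j w2[j] * tanh(W1[j] . x + b1[j])."""
    hidden = [math.tanh(sum(w * xi for w, xi in zip(row, x)) + b)
              for row, b in zip(W1, b1)]
    return sum(w * h for w, h in zip(w2, hidden))

# Arbitrary illustrative parameters: 2 inputs, 3 hidden units, 1 output.
W1 = [[0.5, -1.0], [1.5, 0.3], [-0.7, 0.8]]
b1 = [0.1, -0.2, 0.05]
w2 = [1.0, -2.0, 0.5]

def permute(params, order):
    """Reorder per-hidden-unit parameters according to `order`."""
    return [params[i] for i in order]

order = [2, 0, 1]  # exchange the hidden units with each other
W1_p, b1_p, w2_p = permute(W1, order), permute(b1, order), permute(w2, order)

x = [0.3, -0.6]
print(mlp(x, W1, b1, w2), mlp(x, W1_p, b1_p, w2_p))
```

The two printed outputs agree up to floating-point rounding even though the parameter lists differ, so no amount of training data could distinguish the two settings.
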