It has been empirically observed that different local optima, obtained from training deep neural networks don't generalize in the same way for the unseen data sets, even if they achieve the same training loss. Recently, this problem has received renewed interest from both the empirical and theoretical deep learning communities. From the theoretical standpoint, most of the generalization bounds developed for explaining this phenomenon only consider a worst-case scenario, thus ignoring the generalization capabilities among different solutions. In this blog we are interested in the following question:
How is the model generalization related to the local "smoothness" of the solution?
We discuss several related proposals and introduce our recent work that builds a connection between the model generalization and the Hessian of the loss function.
Hochreiter
and Schmidhuber, and more recently, Chaudhari et. al. and Keskar et al. argue that the
local curvature, or "sharpness", of the converged solutions for deep
networks is closely related to the generalization property of the
resulting classifier. The sharp minimizers, which led to lack of
generalization ability, are characterized by a significant number of large
positive eigenvalues in \(\nabla^2 \hat{L}(x)\), the loss function being
minimized.
Neyshabur et al.
point out the sharpness itself may not be enough to determine the
generalization capability. They argue that it is necessary to include
the scales of the solution with sharpness to be able to explain
generalization. To this end, they suggest an "expected sharpness" based
on the PAC-Bayes bound: $$E_{u\sim N(0,\sigma^2)^m}[\hat{L}(w+u)] -
\hat{L}(w)$$
As shown in the figure below, intuitively if the local minima is
"smooth" then perturbing the model won't cause the objective to change
much. Thus it provides a way of measuring the smoothness of the local
minima.
It is well known that adding noise
to the model during training can help improving the model generalization.
However, how to set proper noise levels remains unknown. Most
state-of-the-art methods assume same level of perturbation along all
directions, but intuition suggests that this may not be appropriate.
Let's look at a toy example. We construct a small 2-dimensional sample set
from a mixture of \(3\) Gaussians, and binarize the labels by thresholding
them from the median value. Then we use a \(5\)-layer MLP model with only
two parameters \(w_1\) and \(w_2\) for prediction, and the cross entropy
as the loss. The variables from different layers are shared. Sigmoid is
used as the activation function. The loss landscape of a model trained
using \(100\) samples is plotted below:
In order to get
a quantative evaluation of the local optimum and the model perturbation, we
need to look at the PAC-Bayes bound:
[PAC-Bayes-Hoeffding
Bound] Let \(l(f, x,y)\in [0,1]\), and \(\pi\) be any fixed distribution
over the parameters \(\mathcal{W}\). For any \(\delta>0\) and \(\eta>0\),
with probability at least \(1-\delta\) over the draw of \(n\) samples, for
any \(w\) and any random perturbation \(u\), $$\mathbb{E}_u [L(w+u)]\leq
\mathbb{E}_u [\hat{L}(w+u)] + \frac{KL(w+u||\pi) + \log
\frac{1}{\delta}}{\eta} + \frac{\eta}{2n}$$
Suppose \(\hat{L}(w^\ast)\) is the loss function at a local optima
\(w^\ast\), when the perturbation level of \(u\) is small,
\(\mathbb{E}_u[\hat{L}(w^\ast+u)]\) tends to be small, but
\(KL(w^\ast+u|\pi)\) may be large since the posterior is too focused on a
small neighboring area around \(w^\ast\), and vice versa. As a consequence,
we may need to search for an optimal perturbation level for \(u\) so that
the bound is minimized.
For that we need some assumptions on the smoothness of the Hessian:
[Hessian Lipschitz] A twice differentiable function
\(f(\cdot)\) is \(\rho\)-Hessian Lipschitz if:
$$\forall w_1, w_2, \|\nabla^2 f(w_1) - \nabla^2 f(w_2)\|\leq \rho
\|w_1-w_2\|,$$
where \(\|\cdot\|\) is the operator norm.
Suppose the empirical loss function \(\hat{L}(w)\) satisfies the
Hessian Lipschitz condition in some local neighborhood \(Neigh_{\gamma,
\epsilon}(w)\), then the perturbation of the function around a fixed
point can be bounded by terms up to the third-order,
(Nesterov) $$\hat{L}(w+u) \leq \hat{L}(w) + \nabla \hat{L}(w)^T u +
\frac{1}{2}u^T \nabla^2 \hat{L}(w) u + \frac{1}{6}\rho\|u\|^3~~~~\forall
u~~s.t.~~ w+u\in Neigh_{\gamma,\epsilon}(w)$$
Combine the PAC-Bayes bound and we get $$\mathbb{E}_u [L(w+u)]\leq \hat{L}(w)
+\frac{1}{2} \sum_i \nabla_{i,i}^2 \hat{L}(w)\mathbb{E} [u_i^2] +
\frac{\rho}{6}\mathbb{E}[\|u\|^3] + \frac{KL(w+u||\pi) + \log
\frac{1}{\delta}}{\eta} + \frac{\eta}{2n}$$
Pluging in different perturbation distributions we can explicitly calculate the
KL divergence and solve for the best level of perturbation for each parameter.
PAC-Bayes bound suggests we should optimize the perturbed loss instead of the true loss for a better generalization. In particular, the level of perturbation on each parameter is set according to the local smoothness property. We observe improved performance for the perturbed model on CIFAR-10, CIFAR-100, as well as the tiny Imagenet data sets.
We connect the smoothness of the solution with the model generalization in the PAC-Bayes framework. We theoretically show that the generalization power of a model is related to the Hessian and the smoothness of the solution, the scales of the parameters, as well as the number of training samples. Based on our generalization bound, we propose a new metric to test the model generalization and a new perturbation algorithm that adjusts the perturbation levels according to the Hessian. Finally, we empirically demonstrate the effect of our algorithm is similar to a regularizer in its ability to attain better performance on unseen data. For more details, including details regarding the proof and its assumptions, please refer to our pre-print.
Identifying Generalization Properties in Neural Networks Huan Wang, Nitish Shirish Keskar, Caiming Xiong and Richard Socher.