## Thursday, March 3, 2016

### The two cultures of optimization

The standard high-school math curriculum includes elementary calculus, and with it methods for finding the minima, maxima and saddle points of a one-dimensional function. In higher dimensions, the problem moves beyond the reach of a typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics.

Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, the most popular are first-order gradient-based methods, owing to their very low per-iteration complexity.

However, physicists have needed to solve large-scale optimization problems since well before these methods became prominent - since the time of the Manhattan Project at Los Alamos. The problems they faced looked very different - essentially simulations of physical experiments - as did the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms.

To use the terminology of Breiman's famous paper, the two cultures of optimization have independently developed into fields of study in their own right.

The readers of this particular blog may be more familiar with the literature on iterative methods. The hallmark of polynomial-time methods for mathematical optimization is the class of so-called "interior point methods", based on Newton's method. Of these, the most efficient and well-studied are "central path following" methods, established in the pioneering work of Nesterov and Nemirovski (see this excellent introductory text).

On the other side of the world, the early physicists joined forces with engineers and other scientists to develop highly successful Monte Carlo simulation algorithms. This approach grew out of work in statistical mechanics, and a thorough description is given in this 1983 Science paper by Kirkpatrick, Gelatt, and Vecchi. One of the central questions of this area of research is what happens to particular samples of liquid or solid matter as the temperature of the matter is dropped. As the sample is cooled, the authors ask:

> [...] for example, whether the atoms remain fluid or solidify, and if they solidify, whether they form a crystalline solid or a glass. Ground states and configurations close to them in energy are extremely rare among all the configurations of a macroscopic body, yet they dominate its properties at low temperatures because as T is lowered the Boltzmann distribution collapses into the lowest energy state or states.

In this groundbreaking work drawing out the connections between the cooling of matter and the problem of combinatorial optimization, a new algorithmic approach was proposed which we now refer to as Simulated Annealing. In more Computer Science-y terminology, the key idea is to use random walk sampling methods to select a point in the convex set from a distribution which has high entropy (i.e. at a "high temperature"), and then to iteratively modify the distribution to concentrate around the optimum (i.e. sampling at "low temperature"), where we need to perform additional random walk steps after each update to the distribution to guarantee "mixing" (more below). There has been a great deal of work understanding annealing methods, and they are quite popular in practice.

In a recent paper, we have discovered a close connection between the two methodologies. Roughly speaking, for constrained convex optimization problems, the iterates of Newton's method lie on the same path as the means of the consecutive distributions of simulated annealing. This is depicted in the picture below. Besides the intriguing connection, this discovery yields algorithmic benefits as well: a faster convex optimization algorithm for the most general input access model, and a resolution of an open problem in interior point methods.

We continue with a more detailed description of path-following interior point methods, simulated annealing, and finally a discussion of the consequences.

### An overview of the Statistical Mechanics approach to optimization

The Simulated Annealing approach to optimization is to reduce a given formulation to a sampling problem.  Consider the following distribution over a set $K \subseteq \mathbb{R}^n$, for a given function $f: \mathbb{R}^n \to \mathbb{R}$:

$$P_{f,t}(x) := \frac{\exp(-f(x)/t)}{\int_K \exp(-f(x')/t)\, dx'}.$$

This is often referred to as the Boltzmann distribution.  Clearly, as $t$ approaches zero, this distribution approaches a point mass concentrated on the global minimizer of $f$.
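To make the collapse concrete, here is a tiny numerical sketch (not from the paper; the quadratic $f$, the grid, and the temperatures are all illustrative) that discretizes a one-dimensional Boltzmann distribution and watches the mass concentrate at the minimizer as $t$ drops:

```python
import numpy as np

def boltzmann(f_vals, t):
    """Discrete Boltzmann distribution P(x) ∝ exp(-f(x)/t) on a grid."""
    w = np.exp(-(f_vals - f_vals.min()) / t)  # shift by the min for numerical stability
    return w / w.sum()

# illustrative f(x) = (x - 1)^2 on a grid over [-3, 3]; minimizer at x = 1
xs = np.linspace(-3.0, 3.0, 601)
f_vals = (xs - 1.0) ** 2

mass_near_min = {t: boltzmann(f_vals, t)[np.abs(xs - 1.0) < 0.2].sum()
                 for t in (10.0, 1.0, 0.01)}
# as t drops, essentially all probability mass collapses onto the minimizer
```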

The question is now - how do we sample from the Boltzmann distribution? Here comes the fantastic observation of the Los Alamos scientists:  formulate a random walk such that its stationary distribution is exactly the one we wish to sample from. Then simulate the random walk till it mixes - and voila - you have a sample!

The Metropolis algorithm does exactly that - it is a general recipe for creating Markov chains whose stationary distribution is whatever we want. The crucial observation is that simulating a random walk is often significantly easier than reasoning about the distribution itself. Furthermore, it so happens that the mixing time of such chains, although notoriously difficult to analyze rigorously, is very small in practice.
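As an illustration, a minimal Metropolis sampler might look as follows; the target $f$, the Gaussian proposal width, and the step count are hypothetical choices, not part of the original work:

```python
import math, random

def metropolis(f, x0, t, steps, step_size=0.5, rng=None):
    """Metropolis chain whose stationary distribution is ∝ exp(-f(x)/t)."""
    rng = rng or random.Random(0)
    x = x0
    samples = []
    for _ in range(steps):
        y = x + rng.gauss(0.0, step_size)          # symmetric proposal
        # accept with probability min(1, exp(-(f(y) - f(x)) / t))
        if math.log(rng.random()) < -(f(y) - f(x)) / t:
            x = y
        samples.append(x)
    return samples

# sampling from ∝ exp(-x^2 / 2) (t = 1) should recover a standard normal
xs = metropolis(lambda x: x * x / 2, x0=0.0, t=1.0, steps=20000)
mean = sum(xs) / len(xs)
```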

The analogy to physics is as follows - given certain initial conditions for a complex system along with evolution rules, one is interested in system behavior at "steady state".  Such methods were used to evaluate complex systems during WWII in the infamous Manhattan project.

For optimization, one would use a random walk technique to iteratively sample at ever-decreasing temperatures $t \to 0$. At $t = 0$, sampling amounts to optimization, and is thus hard.

The simulated annealing method slowly changes the temperature. For $t = \infty$, the Boltzmann distribution amounts to uniform sampling, and it is reasonable to assume this can be done efficiently. Then, exploiting the similarity of consecutive distributions, one can use samples from one temperature (called a "warm start") to efficiently sample at a cooler one, till a near-zero temperature is reached.  But how does one sample from even one Boltzmann distribution, given a warm start? There is a vast literature on random walks and their properties. Popular walks include "hit and run from a corner", the "ball walk", and many more.
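Putting the pieces together, a bare-bones annealing loop could be sketched as below. Every constant here (starting temperature, cooling factor, walk length) is illustrative; rigorous analyses use carefully chosen schedules and walks such as hit-and-run rather than this naive Gaussian-proposal walk:

```python
import math, random

def anneal(f, t_init=10.0, t_final=1e-3, cooling=0.8,
           walk_steps=200, step_size=0.5, seed=0):
    """Simulated annealing sketch: at each temperature, run a short
    Metropolis walk warm-started from the previous temperature's sample."""
    rng = random.Random(seed)
    x = 0.0                          # arbitrary starting point
    t = t_init
    while t > t_final:
        for _ in range(walk_steps):  # mix at temperature t
            y = x + rng.gauss(0.0, step_size)
            if math.log(rng.random()) < -(f(y) - f(x)) / t:
                x = y
        t *= cooling                 # cool, reusing x as a warm start
    return x

# minimize the illustrative function f(x) = (x - 2)^2
x_min = anneal(lambda x: (x - 2.0) ** 2)
```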

We henceforth define the Heat Path as the deterministic curve which maps a temperature to the mean of the Boltzmann distribution for that particular temperature.
$$\mu_t = E_{x \sim P_{f,t}} [ x]$$
Incredibly, to the best of our knowledge, this fascinating curve had not been studied before as a deterministic object in the random walks literature.

### An overview of the Mathematics approach to optimization

It is of course infeasible to do justice to the large body of work addressing these ideas within the mathematics/ML literature. Let us instead give a brief intro to Interior Point Methods (IPM) - the state of the art in polynomial time methods for convex optimization.

The setting of IPM is that of constrained convex optimization, i.e. minimizing a convex function subject to convex constraints.

The basic idea of IPM is to reduce constrained optimization to unconstrained optimization, similarly to Lagrangian relaxation. The constraints are folded into a single penalty function, called a "barrier function", which guarantees three properties:

1. The barrier approaches infinity near the border of the convex set, thereby ensuring that the optimum is obtained inside the convex set described by the constraints.
2. The barrier doesn't distort the objective function too much, so that solving for the new unconstrained problem will give us a close solution to the original formulation.
3. The barrier function allows for efficient optimization via iterative methods, namely Newton's method.

The hallmark of IPM is the existence of such barrier functions with all three properties, called self-concordant barrier functions, which allow efficient deterministic polynomial-time algorithms for many (but not all) convex optimization formulations.

Iteratively, a "temperature" parameter multiplied by the barrier function is added to the objective, and the resulting unconstrained formulation is solved by Newton's method.  The temperature is reduced until the barrier's effect becomes negligible compared to the original objective, and a solution is obtained.

The curve mapping the temperature parameter to the solution of the unconstrained problem is called the central path, and the overall methodology is called "central path following".
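A one-dimensional sketch of path following (minimizing $f(x) = x$ over $[0,1]$ with the log barrier $\phi(x) = -\log x - \log(1-x)$; the temperature schedule and Newton iteration counts are illustrative, and real IPMs use more careful step rules) might look like:

```python
def central_path_1d(t_init=1.0, t_final=1e-6, shrink=0.5, newton_iters=20):
    """Barrier-method sketch: minimize f(x) = x over [0, 1] by following
    the minimizers of F(x) = x + t * phi(x) as the temperature t shrinks."""
    x = 0.5                                      # analytic center of [0, 1]
    t = t_init
    path = []
    while t > t_final:
        for _ in range(newton_iters):            # Newton's method on F
            g = 1.0 - t / x + t / (1.0 - x)      # F'(x)
            h = t / x**2 + t / (1.0 - x)**2      # F''(x) > 0
            step = g / h
            while not (0.0 < x - step < 1.0):    # damp to stay feasible
                step *= 0.5
            x -= step
        path.append((t, x))                      # a point on the central path
        t *= shrink                              # cool: barrier matters less
    return x, path

# as t -> 0, the central path converges to the constrained minimizer x* = 0
x_star, path = central_path_1d()
```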

### The connection

In our recent paper, we show that for convex optimization, the heat path and the central path of IPM for a particular barrier function (called the entropic barrier, following the terminology of the recent excellent work of Bubeck and Eldan) are identical! Thus, in some precise sense, the two cultures of optimization have been studying the same object in disguise, using different techniques.

Can this observation give any new insight into the design of efficient algorithms? The answer turns out to be affirmative.

First, we resolve a long-standing question in IPM on the existence of an efficiently computable self-concordant barrier for general convex sets. We show that the covariance of the Boltzmann distribution is related to the Hessian of a self-concordant barrier (known as the entropic barrier) for any convex set. For this computation, all we need is a membership oracle for the convex set.

Second, our observation gives rise to a faster annealing temperature schedule for convex optimization, and a much simpler analysis (building on the seminal work of Kalai and Vempala). This gives a faster polynomial-time algorithm for convex optimization in the membership oracle model - a model in which the access to the convex set is only through an oracle answering questions of the form "is a certain point $x$ inside the set?".

## Tuesday, March 1, 2016

### Making second order methods practical for machine learning

First-order methods such as Gradient Descent, AdaGrad, SVRG, etc. dominate the landscape of optimization for machine learning due to their extremely low per-iteration computational cost. Second order methods have largely been ignored in this context due to their prohibitively large time complexity. As a general rule, any super-linear time operation is prohibitively expensive for large scale data analysis. In our recent paper we attempt to bridge this divide by exhibiting an efficient linear time second order algorithm for typical (convex) optimization problems arising in machine learning.

Previously, second order methods were successfully implemented in linear time for special optimization problems such as maximum flow [see e.g. this paper] using very different techniques. In the optimization community, Quasi-Newton methods (such as BFGS) have been developed which use first order information to approximate a second order step. Efficient implementations of these methods (such as L-BFGS) have been proposed, mostly without theoretical guarantees.

Concretely, consider the PAC learning model where given $m$ samples $\{(\mathbf{x}_k, y_k)\}_{k\in \{1,\dots, m\}}$ the objective is to produce a hypothesis from a hypothesis class that minimizes the overall error. This optimization problem, known as  Empirical Risk Minimization, can be written as follows:
$\min_{\theta \in \mathbb{R}^d} f(\theta) = \min_{\theta \in \mathbb{R}^d} \left\{\frac{1}{m}\sum\limits_{k=1}^m f_k(\theta) + R(\theta) \right\}.$
In typical examples of interest, such as Logistic Regression, soft-margin Support Vector Machines, etc., we have that each $f_k(\theta)$ is a convex function of the form $\ell(y_k, \theta^{\top}\mathbf{x}_k)$ and $R(\theta)$ is a regularization function. For simplicity, consider the Euclidean regularization given by $R(\theta) = \lambda \|\theta\|^2$.

Gradient Descent is a natural first approach to optimizing a function: repeatedly take steps in the direction opposite to the gradient at the current iterate, $\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t).$ Under certain conditions on the function, and with an appropriate choice of $\eta$, it can be shown that this process converges linearly to the true minimizer $\mathbf{x}^*$. If $f$ has condition number $\kappa$, then $O(\kappa \log (1/\epsilon))$ iterations are needed to be $\epsilon$-close to the minimizer, with each iteration taking $O(md)$ time, so the total running time is $O(md \kappa \log(1/ \epsilon))$. Numerous extensions of this basic method have been proposed in the ML literature in recent years - a subject for a future blog post.
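As a sanity check of the linear convergence claim, here is a minimal sketch on a toy quadratic with condition number $\kappa = 10$ (the matrix, step size, and iteration count are illustrative, not from the text):

```python
import numpy as np

def gradient_descent(grad, x0, eta, iters):
    """Plain gradient descent: x_{t+1} = x_t - eta * grad(x_t)."""
    x = x0
    for _ in range(iters):
        x = x - eta * grad(x)
    return x

# toy strongly convex quadratic f(x) = 0.5 x^T A x - b^T x, minimizer A^{-1} b
A = np.diag([1.0, 10.0])           # condition number kappa = 10
b = np.array([1.0, 1.0])
x_star = np.linalg.solve(A, b)

eta = 2.0 / (1.0 + 10.0)           # classical step size 2 / (mu + L)
x_hat = gradient_descent(lambda x: A @ x - b, np.zeros(2), eta, iters=200)
# error shrinks by a factor ((kappa - 1) / (kappa + 1)) per step
```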

Given the success of first order methods, a natural extension is to incorporate second order information in deciding what direction to take at each iteration. Newton's Method does exactly this, where, denoting $\nabla^{-2}f(\mathbf{x}) := [\nabla^2 f(\mathbf{x})]^{-1}$, the update takes the form
$\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla^{-2}f(\mathbf{x}_t) \nabla f(\mathbf{x}_t).$
Although Newton's Method can converge to the optimum of a quadratic function in a single iteration and can achieve quadratic convergence for general functions (when close enough to the optimum), the per-iteration costs can be prohibitively expensive. Each step requires calculation of the Hessian (which is $O(md^2)$ for functions of the form $f(\mathbf{x}) = \sum\limits_{k=1}^m f_k(\mathbf{x})$) as well as a matrix inversion, which naively takes $O(d^3)$ time.
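The one-iteration claim for quadratics is easy to verify numerically; in this sketch (with an arbitrary illustrative positive definite $A$ and starting point), a single Newton step lands exactly on the minimizer:

```python
import numpy as np

def newton_step(grad, hess, x):
    """One Newton step: x - [∇²f(x)]^{-1} ∇f(x), via a linear solve."""
    return x - np.linalg.solve(hess(x), grad(x))

# quadratic f(x) = 0.5 x^T A x - b^T x: gradient A x - b, Hessian A
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
x0 = np.array([10.0, -7.0])            # arbitrary starting point

# one step from anywhere reaches the minimizer A^{-1} b
x1 = newton_step(lambda x: A @ x - b, lambda x: A, x0)
```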

Drawing inspiration from the success of using stochastic gradient estimates, a natural step forward is to use stochastic second order information to estimate the Newton direction. One of the key hurdles to a direct application of this approach is the lack of an immediate unbiased estimator of the inverse Hessian. Indeed, it is not the case that the inverse of an unbiased estimator of the Hessian is an unbiased estimator of the Hessian inverse. In a recent paper, Erdogdu and Montanari propose an algorithm called NewSamp, which first obtains an estimate of the Hessian by aggregating multiple samples, and then computes the matrix inverse. Unfortunately, this method still suffers from the prohibitive run-time of the matrix inversion. One can take a spectral approximation to speed up the inversion, but this comes at the cost of losing some spectral information, and thus losing some nice theoretical guarantees.

Our method tackles these issues by making use of the Neumann series. Suppose that, for all $\mathbf{x}$, $\|\nabla^2 f(\mathbf{x})\|_2 \leq 1$ and $\nabla^2 f(\mathbf{x}) \succ 0$. Then, considering the Neumann series of the Hessian matrix, we have that $\nabla^{-2}f(\mathbf{x}) = \sum\limits_{i=0}^{\infty} \left(I - \nabla^2 f(\mathbf{x})\right)^i.$
The above representation can be used to construct an unbiased estimator in the following way. The key component is an unbiased estimator of a single term $\left(I - \nabla^2 f(\mathbf{x})\right)^i$ in the above summation. This can be obtained by taking independent and unbiased samples $\{ \nabla^2 f_{[j]} \}$ and multiplying the individual terms. Formally, pick a probability distribution $\{p_i\}_{i \geq 0}$ over the non-negative integers (representing a probability distribution over the individual terms in the series) and sample an index $\hat{i}$. Then draw $\hat{i}$ independent and unbiased Hessian samples $\left\{\nabla^2 f_{[j]}(\mathbf{x})\right\}_{j \in \{1, \dots, \hat{i}\}}$ and define the estimator: $\tilde{\nabla}^{-2} f(\mathbf{x}) = \frac{1}{p_{\hat{i}}} \prod\limits_{j=1}^{\hat{i}} \left(I - \nabla^2 f_{[j]}(\mathbf{x})\right).$
Note that the above is indeed an unbiased estimator, which follows from the fact that the samples are independent and unbiased.
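A scalar sketch of this estimator (with a hypothetical distribution of Hessian samples and a geometric choice of $\{p_i\}$; none of these constants come from the paper) demonstrates the unbiasedness empirically:

```python
import random

def inv_hessian_estimator(hess_sample, q=0.5, rng=None):
    """Scalar sketch of the single-term estimator: pick i with probability
    p_i = (1-q) q^i, multiply i independent samples of (1 - H), and
    reweight by 1 / p_i, yielding an unbiased estimate of 1 / E[H]."""
    rng = rng or random.Random(0)
    i = 0
    while rng.random() < q:       # i ~ p_i = (1-q) q^i
        i += 1
    prod = 1.0
    for _ in range(i):
        prod *= 1.0 - hess_sample(rng)
    return prod / ((1.0 - q) * q ** i)

rng = random.Random(0)
draw = lambda r: r.uniform(0.4, 0.8)   # illustrative Hessian samples, mean 0.6
est = sum(inv_hessian_estimator(draw, rng=rng) for _ in range(20000)) / 20000
# E[estimator] = sum_i (1 - E[H])^i = 1 / E[H] = 1 / 0.6
```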

We consider the above estimator in our paper and analyze its convergence. Unfortunately, the estimator has the disadvantage of capturing only a single term of the summation, and it introduces the need to pick a distribution over the terms. However, this issue can be circumvented by making the following observation about a recursive reformulation of the above series:

$\nabla^{-2} f = I + \left(I - \nabla^2 f\right)\left( I + \left(I - \nabla^2 f\right) ( \ldots)\right).$
If we truncate the Neumann series to contain the first $t$ terms (which is equivalent to recursion depth $t$ in the above formulation), we can see that an unbiased estimator of the truncated series can easily be constructed given $t$ unbiased and independent samples, as follows:

$\tilde{\nabla}^{-2}_t f = I + \left(I - \nabla^2 f_{[t]}\right)\left( I + \left(I - \nabla^2 f_{[t-1]}\right) ( \ldots)\right).$
Note that as $t \rightarrow \infty$, our estimator becomes an unbiased estimator of the inverse. We use this estimator in place of the true inverse in the Newton step of our new algorithm, dubbed LiSSA.
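The truncated recursion is straightforward to implement. The following sketch (with an illustrative fixed Hessian standing in for the stochastic samples) checks that the recursion $u_j = g + (I - H_{[j]}) u_{j-1}$ converges to the true Newton direction:

```python
import numpy as np

def lissa_direction(grad, hess_sample, t, rng):
    """Estimate the Newton direction H^{-1} g via the truncated recursion
    u_j = g + (I - H_j) u_{j-1}, drawing a fresh Hessian sample each step."""
    u = grad.copy()
    for _ in range(t):
        u = grad + u - hess_sample(rng) @ u
    return u

# sanity check with exact "samples": the recursion converges to H^{-1} g
rng = np.random.default_rng(0)
H = np.array([[0.8, 0.1], [0.1, 0.6]])   # illustrative Hessian, eigenvalues in (0, 1)
g = np.array([1.0, -2.0])
u = lissa_direction(g, lambda r: H, t=200, rng=rng)
```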

The next key observation we make about our estimator is that, since we are interested in computing the Newton direction $\nabla^{-2} f \, \nabla f$, we need to repeatedly compute products of the form $(I - \nabla^2 f_{[i]})v$, where $\nabla^2 f_{[i]}$ is a sample of the Hessian and $v$ is a vector. As remarked earlier, in common machine learning settings the loss function is usually of the form $\ell_k(y_k, \theta^{\top} \mathbf{x}_k)$, and therefore the Hessian is usually of the form $c \, \mathbf{x}_k\mathbf{x}_k^{\top}$, i.e. a rank-one matrix. It can now be seen that the product $(I - \nabla^2 f_{[i]})v$ can be computed in time $O(d)$, giving us a linear-time update step. For full details, we refer the reader to Algorithm 1 in the paper.
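The rank-one trick can be sketched in a few lines; the vectors and the constant $c$ below are arbitrary illustrations:

```python
import numpy as np

def rank_one_step(x_k, c, v):
    """Compute (I - c * x_k x_k^T) v in O(d), without forming the d x d
    matrix: v - c * (x_k . v) * x_k uses only inner products and scaling."""
    return v - c * (x_k @ v) * x_k

d = 5
rng = np.random.default_rng(1)
x_k, v, c = rng.standard_normal(d), rng.standard_normal(d), 0.3

fast = rank_one_step(x_k, c, v)                   # O(d)
slow = (np.eye(d) - c * np.outer(x_k, x_k)) @ v   # explicit O(d^2) check
```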

We would like to point out an alternative interpretation of our estimator. Consider the second order approximation $g(\mathbf{y})$ to the function $f(\mathbf{x})$ at any point $\mathbf{z}$, where $\mathbf{y} = \mathbf{x} - \mathbf{z}$, given by
$g(\mathbf{y}) = f(\mathbf{z}) + \nabla f(\mathbf{z})^{\top}\mathbf{y} + \frac{1}{2}\mathbf{y}^{\top}\nabla^2 f(\mathbf{z})\mathbf{y}.$
Indeed, a single step of Newton's method takes us to the minimizer of the quadratic. Let's consider what a gradient descent step for the above quadratic looks like:
$\mathbf{x}_{t+1} = (I - \nabla^2 f(\mathbf{z}))\mathbf{x}_t - \nabla f(\mathbf{z}).$
It can now be seen that using $t$ terms of the recursive Neumann series as the inverse of the Hessian corresponds exactly to taking $t$ gradient descent steps on the quadratic approximation. Continuing the analogy, our stochastic estimator is a specific form of SGD on the above quadratic. Instead of SGD, one can use any of the improved/accelerated/variance-reduced versions too. Although the above connection is precise, we note that it is important that the gradient of $f$ in the expression above is exact (and not stochastic). Indeed, known lower bounds on stochastic optimization imply that it is impossible to achieve the kind of guarantees we achieve with purely stochastic first and second order information. This is the reason one cannot adapt the standard SGD analysis to our scenario, and the connection remains an intuitive explanation only.
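The correspondence can be checked numerically: $t$ gradient descent steps on the quadratic model, started at zero, coincide exactly with applying the $t$-term truncated series to the negated gradient. The matrix and gradient below are illustrative:

```python
import numpy as np

H = np.array([[0.7, 0.2], [0.2, 0.5]])   # Hessian with eigenvalues in (0, 1)
g = np.array([1.0, -1.0])                # exact gradient at the expansion point
t = 15

# t gradient-descent steps (step size 1) on the quadratic model, from y = 0:
# y_{j+1} = (I - H) y_j - g
y = np.zeros(2)
for _ in range(t):
    y = (np.eye(2) - H) @ y - g

# t-term truncated Neumann series applied to the gradient
neumann = sum(np.linalg.matrix_power(np.eye(2) - H, i) @ g for i in range(t))
# y equals -neumann exactly: GD on the quadratic realizes the truncated inverse
```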

Before presenting the theorem governing the convergence of LiSSA, we will introduce some useful notation. It is convenient to first assume WLOG that every function $f_k(\mathbf{x})$ has been scaled such that $\|\nabla^2 f_k(\mathbf{x})\| \leq 1.$ We also denote the condition number of $f(\mathbf{x})$ as $\kappa$, and we define $\kappa_{max}$ to be such that
$\lambda_{min}(\nabla^2 f_k(\mathbf{x})) \geq \frac{1}{\kappa_{max}}.$

Theorem:  LiSSA returns a point $\mathbf{x}_t$ such that with probability at least $1 - \delta$, $f(\mathbf{x}_t) \leq \min\limits_{\mathbf{x}^*} f(\mathbf{x}^*) + \epsilon$ in $O(\log \frac{1}{\epsilon})$ iterations, and total time $O \left( \left( md + \kappa_{max}^2 \kappa d \right) \log \frac{1}{\epsilon}\right).$

At this point we would like to comment on the running times described above. Our theoretical guarantees are similar to those of methods such as SVRG and SDCA, which achieve running time $O((m + \kappa)d\log(1/\epsilon))$. The additional factor of $\kappa^2_{max}$ in our case is due to the concentration step, as we prove high-probability bounds on our per-step guarantee. In practice we observe that this concentration step is not required, i.e. applying our estimator only once performs well, as demonstrated by the graphs below. It is remarkable that the number of iterations is independent of any condition number - this is the significant advantage of second order methods.

These preliminary experimental graphs compare LiSSA with standard algorithms such as AdaGrad, Gradient Descent, BFGS and SVRG on a binary Logistic Regression task on three datasets obtained from LibSVM. The metric used in the above graphs is log(Value - Optimum) vs Time Elapsed.

Next, we include a comparison between LiSSA and second order methods: vanilla Newton's method and NewSamp. We observe that, in terms of iterations, Newton's method exhibits quadratic convergence, whereas NewSamp and LiSSA converge linearly, at similar rates. As expected, a significant performance advantage for LiSSA is observed when we compare with respect to time.

To summarize, in this blog post we described a linear-time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning, while maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem, which allows our unbiased Hessian estimator to be implemented efficiently, using only vector-vector products.