tag:blogger.com,1999:blog-72335534922531014902018-06-02T13:53:02.068-04:00Minimizing RegretSelected topics on, but not restricted to, machine learning, optimization and game theory <br>
by <a href="http://cs.princeton.edu/~ehazan">Elad Hazan</a> and group @ Princeton UniversityElad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.comBlogger13125tag:blogger.com,1999:blog-7233553492253101490.post-15463851356657623272018-05-29T14:48:00.001-04:002018-05-29T14:48:57.947-04:00Taking control by convex optimization<br /><br /><img alt="Image result for rudolph kalman" class="rg_ic rg_i" jsaction="load:str.tbn" name="_oATemYgAbiLYM:" src="" style="height: 190px; margin-left: -7px; margin-right: 0px; margin-top: 0px; width: 254px;" /><br /><br /><br />The handsome fellow above is the late Rudolf Kalman, a pioneer of modern control, receiving the presidential medal of freedom from president Barack Obama.<br /><br />The setting which Kalman considered, and has been a cornerstone of modern control, is that of stateful time-series. More precisely, consider a sequence of inputs and outputs, or measurements, coming from some real-world machine<br />$$ x_1,x_2,...,x_T,... \ \ \Rightarrow \ \ \ y_1,y_2, y_3 , ..., y_T , ... $$<br />Examples include:<br /><br /><ul><li>$y_t$ are measurements of ocean temperature, in a study of global warming. $x_t$ are various environmental knowns, such as time of the year, water acidity, etc. </li><li>Language translation, $x_t$ are words in Hebrew, and $y_t$ is their English translation</li><li>$x_t$ are physical controls of a robot, i.e. force in various directions, and $y_t$ are the location to which it moves</li></ul>Such systems are generally called dynamical systems, and one of the simplest and most useful mathematical models to capture a subset of them is called linear dynamical systems (LDS), in which the output is generated according to the following equations:<br />\begin{align*}<br />h_{t+1} & = A h_t + B x_t + \eta_t \\<br />y_{t+1} & = C h_t + D x_t + \zeta_t<br />\end{align*}<br />Here $h_t$ is a hidden state, which can have very large or even infinite dimensionality, $A,B,C,D$ are linear transformations and $\eta_t,\zeta_t$ are noise vectors.<br /><br />This setting is general enough to capture a variety of machine learning models previously considered in isolation, such as HMM (hidden Markov models), principle and independent component analysis, mixture of Gaussian clusters and many more, see this <a href="https://cs.nyu.edu/~roweis/papers/NC110201.pdf">survey by Roweis and Ghahramani</a>.<br /><br />Alas - without any further assumptions the Kalman filtering model is non-convex and computationally hard, thus general polynomial time algorithms for efficient worst-case prediction were not known, till some exciting recent developments I'll survey below.<br /><br />If the linear transformations $A,B,C,D$ are known, then it is possible to efficiently predict $y_t$ given all previous inputs and outputs of the system using convex optimization. Indeed, this is what the Kalman filter famously achieves, and in an optimal way. Similarly, if there is no hidden state, the problem becomes trivially convex, and thus easily solvable (see recent blog posts and papers by the Recht group further analyzing this case).<br /><br />However, if the system given by the transformations A,B,C,D is unknown, recovering them from observations is called "system identification", a notoriously hard problem. Even for HMMs, which can be seen as a special case of LDS (see the <a href="http://mlg.eng.cam.ac.uk/zoubin/papers/lds.pdf">RG</a> survey), the identification problem is known to be <a href="http://www.math.ru.nl/~terwijn/publications/icgiFinal.pdf">NP-hard</a> without additional assumptions. (Aside: HMMs <b>are</b> identifiable with further distributional assumption, see the breakthrough paper of<a href="http://www.cs.columbia.edu/~djhsu/papers/hmm.pdf"> Hsu, Kakade and Zhang</a>, which started the recent tensor revolution in theoretical ML).<br /><br />In the face of NP-hardness, state-of-the-art reduces to a set of heuristics. Amongst them are:<br /><ul><li>By far the most common is the use of EM, or expectation maximization, to alternately solve for the system and the hidden states. Since this is a heuristic for non-convex optimization, global optimality is not guaranteed, and indeed EM can settle on a local optimum, as shown <a href="https://arxiv.org/abs/1711.00946">here</a>.</li><li>Gradient descent: this wonderful heuristic can be applied for the non-convex problem of finding both the system and hidden states simultaneously. In a striking recent result, <a href="https://arxiv.org/abs/1609.05191">Hardt and Ma</a> showed that under some generative assumptions, gradient descent converges to the global minimum. </li></ul><div><br /></div>This is where we go back to the convex path: using the failsafe methodology of convex relaxation and regret minimization in games, in a <a href="https://arxiv.org/abs/1711.00946">recent</a> set of <a href="https://arxiv.org/abs/1802.03981">results</a> our group has found assumption-free, efficient algorithms for predicting in general LDS, as well as some other control extensions. At the heart of these results is a new method of spectral filtering. This method requires some mathematical exposition, we'll defer it to a future longer post. For the rest of this post, we'll describe the autoregressive (a.k.a. regression) methodology for LDS, which similar to our result, is a worst-case and convex-relaxation based method.<br /><br />To see how this works, consider the LDS equations above, and notice that for zero-mean noise, the expected output at time $t+1$ is given by: (we take $D=0$ for simplicity)<br />$$ E[y_{t+1} ] = \sum_{i=1}^t C A^i B x_{t-i} $$<br />This has few parameters to learn, namely only three matrices $A,B,C$, but has a very visible non-convex component of raising $A$ to high powers.<br /><br />The obvious convex relaxation takes $M_i = C A^i B$ as a set of matrices for $i=1,...,t$. We now have a set of linear equalities of the form:<br />$$ y_{t+1} = \sum_{i=1}^t M_i x_{t-1} $$<br />which can be solved by your favorite convex linear-regression optimizer. The fallacy is also clear: we have parameters that scale as the number of observations, and thus cannot hope to generalize.<br /><br />Luckily, many systems are well behaved in the following way. First, notice that the very definition of an LDS requires the spectral norm of $A$ to be bounded by one, otherwise we have signal explosion - it's magnitude grows indefinitely with time. It is thus not too unreasonable to assume a strict bound away from one, say $|A| \leq 1- \delta$, for some small constant $\delta >0$. This allows us to bound the magnitude of the $i$'th matrix $M_i$, that is supposed to represent $C A^i B$, by $(1-\delta)^i$.<br /><br />With this assumption, we can then take only $ k = O(\frac{1}{\delta} \log \frac{1}{\epsilon})$ matrices, and write<br />$$ y_{t+1} = \sum_{i=1}^k M_i x_{t-1} + O(\epsilon) $$<br /><br />This folklore method, called an autoregressive model or simply learning LDS by regression, is very powerful despite its simplicity. It has the usual wonderful merits of online learning by convex relaxation: is worst-case (no generative assumptions), agnostic, and online. Caveats? we have a linear dependence on the eigengap of the system, which can be very large for many interesting systems.<br /><br />However, it is possible to remove this gap completely, and learn arbitrarily ill-defined LDS without dependence on the condition number. This method is also online, agnostic, and worst case, and is based on a new tool: wave-filtering. In essence it applies the online regression method on a specially-designed fixed filter of the input. To see how these filters are designed, and their theoretical properties, check out our recent <a href="https://arxiv.org/abs/1711.00946">papers</a>! <br /><br /><br /><br /><br /><br /><br />Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com0tag:blogger.com,1999:blog-7233553492253101490.post-34184048006759703022018-03-01T14:48:00.000-05:002018-05-24T11:22:45.598-04:00unsupervised learning III: worst-case compression-based metrics that generalize<script src="http://cdn.mathjax.org/mathjax/latest/MathJax.js" type="text/javascript">MathJax.Hub.Config({ extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"], jax: ["input/TeX", "output/HTML-CSS"], tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ], displayMath: [ ['$$','$$'], ["\\[","\\]"] ], }, "HTML-CSS": { availableFonts: ["TeX"] } }); </script> <br /><div dir="ltr" style="text-align: left;" trbidi="on">This post is a long-overdue sequel to parts I and II of previous posts on unsupervised learning. The long time it took me to publish this one is not by chance - I am still ambivalent on the merit, or lack of, for the notion of a worst-case PAC/statistical learnability (or, god forbid, regret minimization) without supervision.<br /><br />And yet,<br />1. I owe a followup to the stubborn few readers of the blog who have made it thus far.<br />2. The algorithms we'll discuss are mathematically correct, and even if completely useless in practice, can serve as amusement to us, fans of regret minimization in convex games.<br />3. I really want to get on to more exciting recent developments in control that have kept me busy at night...<br /><br />So, here goes:<br /><br />The idea is to define unsupervised learning as PAC learning in the usual sense, with a real-valued objective function, and specially-structured hypothesis classes.<br /><br /><ul><li>A hypothesis is a pair of functions, one from domain to range and the other vice versa, called encoder-decoder pair. </li><li>The real-valued loss is also composed of two parts, one of which is the reconstruction error of the encoding-decoding scheme, and the other proportional to encoding length. </li></ul>Et-voila, all notions of generalizability carry over from statistical learning theory together with dimensionality concepts such as Rademacher complexity.<br /><br />More significantly - these worst-case definitions allow improper learning of NP-hard concepts. For the fully-precise definitions, see the <a href="https://arxiv.org/abs/1610.01132">paper with Tengyu Ma</a>. The latter also gives a striking example of a very well-studied NP-hard problem, namely dictionary learning, showing it can be learned in polynomial time using convex relaxations.<br /><br />Henceforth, I'll give a much simpler example that perhaps illustrates the concepts in an easier way, one that arose in conversations with our post-docs <a href="http://www.cs.princeton.edu/~rlivni/">Roi Livni</a> and <a href="http://www.cs.princeton.edu/~kothari/">Pravesh Kothari</a>.<br /><br />This example is on learning the Mixture-Of-Gaussians (MOG) unsupervised learning model, one of the most well-studied problems in machine learning. In this setting, a given distribution over data is assumed to be from a mixture distribution over normal random variables, and the goal is to associate each data point with the corresponding Gaussian, as well as to identify these Gaussians.<br /><br />The MOG model has long served as a tool for scientific discovery, see e.g. this <a href="http://blog.mrtz.org/2014/04/22/pearsons-polynomial.html">blog post by Moritz Hardt</a> and links therein. However, computationally it is not well behaved in terms of worst-case complexity. Even the simpler "k-means" model is known to be NP-hard.<br /><br />Here is where things get interesting: using reconstruction error and compression as a metric, we can learn MOG by a seemingly different unsupervised learning method - Principle Component Analysis (PCA)! The latter admits very efficient linear-algebra-based algorithms.<br /><br />More formally: let $X \in R^{m \times d}$ be a data matrix sampled i.i.d from some arbitrary distribution $x \sim D$. Assume w.l.o.g. that all norms are at most one. Given a set of $k$ centers of Gaussians, $\mu_1,...,\mu_k$, minimize reconstruction error given by<br />$$ E_{x \sim D} [ \min_i \| \mu_i - x \|^2 ] $$<br />A more general formulation, allowing several means to contribute to the reconstruction error, is<br />$$ E_{x \sim D} [ \min_{ \| \alpha_x\|_2 \leq 1 } \| \sum_{j} \alpha_j \mu_j - x \|^2 ] $$<br />Let $M \in R^{d \times k}$ be the matrix of all $\mu$ vectors. Then we can write the optimization problem as<br />$$ \min_{M} E_{x \sim D} [ \min_{ \| \alpha_x\|_2 \leq 1 } \| x - M \alpha_x \|^2 ] $$<br />This corresponds to an <b>encoding by vector $\alpha$ rather than by a single coordinate</b>. We can write the closed form solution to $\alpha_x$ as:<br />$$ \arg \min_{\alpha_x } \| x - M \alpha_x \|^2 = M^{-1} x $$<br />and the objective becomes<br />$$ \min_{\alpha_x } \| x - M \alpha_x \|^2 = \| x - M M^{-1} x\|_\star^2 $$<br />for $\| \|_\star$ being the spectral norm.<br />Thus, we're left with the optimization problem of<br />$$ \min_{M} E_{x \sim D} [ \| x - M M^{\dagger} x \|_\star^2 ] $$<br />which amounts to PCA.<br /><br />We have just semi-formally seen how MOG can be learned by PCA!<br /><br />A more general theorem applies also to MOG that are not spherical, some details appear in the paper with Tengyu, in the section on spectral auto-encoders.<br /><br />The keen readers will notice we lost something in compression along the way: the encoding is now a k-dimensional vector as opposed to a 1-hot k-dimensional vector, and thus takes $k \log \frac{1}{\epsilon}$ bits to represent in binary, as opposed to $O(\log k)$. We leave it as an open question to come up with an <i>efficient</i> poly-log(k) compression that matches MOG. The solver gets a free humus at Sayeed's, my treat!<br /><br /><br /><br /><br /><br /><br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com0tag:blogger.com,1999:blog-7233553492253101490.post-6512241948087377732017-06-08T13:56:00.001-04:002018-02-22T16:00:34.972-05:00Hyperparameter optimization and the analysis of boolean functions<div dir="ltr" style="text-align: left;" trbidi="on">I've always admired the beautiful theory on the analysis of boolean functions (see the excellent <a href="http://dl.acm.org/citation.cfm?id=2683783">book</a> of Ryan O'Donnell, as well <a href="http://www.ma.huji.ac.il/~kalai/">Gil Kalai's lectures</a>). Heck, it used to be my main area of investigation back in the early grad-school days we were studying hardness of approximation, the PCP theorem, and the "new" (back at the time) technology of Fourier analysis for boolean functions. This technology also gave rise to seminal results on learnability with respect to the uniform distribution. Uniform distribution learning has been widely criticized as unrealistic, and the focus of the theoretical machine learning community has shifted to (mostly) convex, real-valued, agnostic learning in recent years.<br /><br />It is thus particularly exciting to me that some of the amazing work on boolean learning can be used for the very practical problem of Hyperparameter Optimization (HO), which has on the face of it nothing to do with the uniform distribution. <a href="https://arxiv.org/abs/1706.00764">Here is the draft</a>, joint work with <a href="https://www.cs.utexas.edu/~klivans/">Adam Klivans</a> and <a href="http://www.callowbird.com/">Yang Yuan</a>.<br /><br /><a name='more'></a><br /><br />To describe hyperparameter optimization: many software packages, in particular with the rise of deep learning, come with gazillions of input parameters. An example is the TensorFlow software package for training deep nets. To actually use it the user needs to specify how many layers, which layers are convolutional / full connected, which optimization algorithm to use (stochastic gradient descent, AdaGrad, etc.), whether to add momentum to the optimization or not.... You get the picture - choosing the parameters is by itself an optimization problem!<br /><br />It is a highly non-convex optimization problem, over mostly discrete (but some continuous) choices. Evaluating the function, i.e. training a deep net over a large dataset with a specific configuration, is very expensive, and you cannot assume any other information about the function such as gradients (that are not even defined for discrete functions). In other words, sample complexity is of the essence, whereas computation complexity can be relaxed as long as no function evaluations are required.<br /><br />The automatic choice of hyperparameters has been hyped recently with various names such as "auto tuning", "auto ML", and so forth. For an excellent post on existing approaches and their downsides see these posts <a href="http://www.argmin.net/2016/06/20/hypertuning/">Ben Recht's blog</a>.<br /><br />This is where harmonic analysis and compressed sensing can be of help! While in general hyperparameter optimization is computationally hard, we assume a sparse Fourier representation of the objective function. Under this assumption, one can use compressed sensing techniques to recover and optimize the underlying function with low sample complexity.<br /><br />Surprisingly, using sparse recovery techniques one can even improve the known sample complexity results for well-studied problems such as learning decision trees under the uniform distribution, improving upon classical results by <a href="http://dl.acm.org/citation.cfm?id=174138">Linial, Mansour and Nisan</a>.<br /><br />Experiments show that the sparsity assumptions in the Fourier domain are justified for certain hyperparameter optimization settings, in particular, when training of deep nets for vision tasks. The harmonic approach (we call the algorithm "Harmonica") attains a better solution than existing approaches based on multi-armed bandits and Bayesian optimization, and perhaps more importantly, significantly faster. It also beats random search 5X (i.e. random sampling with 5 times as many samples allowed as for our own method, a smart benchmark proposed by <a href="https://people.eecs.berkeley.edu/~kjamieson/hyperband.html">Jamieson</a>).<br /><br />Those of you interested in code, my collaborator <a href="http://www.callowbird.com/">Yang Yuan</a> has promised to release it on GitHub, so please go ahead and email him ;-)<br /><br /><br />PS. We're grateful to Sanjeev Arora for helpful discussions on this topic.<br /><br />PPS. It has been a long time since my last post, and I still owe the readers an exposition on the data compression approach to unsupervised learning. In the mean time, you may want to read this <a href="https://arxiv.org/abs/1610.01132">paper with Tengyu Ma</a> in NIPS 2016.</div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com0tag:blogger.com,1999:blog-7233553492253101490.post-13994100809767551412016-10-29T21:34:00.000-04:002018-02-22T22:12:30.046-05:00Unsupervised Learning II: the power of improper convex relaxations<script type="text/javascript" src="http://cdn.mathjax.org/mathjax/latest/MathJax.js">MathJax.Hub.Config({ extensions: ["tex2jax.js","TeX/AMSmath.js","TeX/AMSsymbols.js"], jax: ["input/TeX", "output/HTML-CSS"], tex2jax: { inlineMath: [ ['$','$'], ["\\(","\\)"] ], displayMath: [ ['$$','$$'], ["\\[","\\]"] ], }, "HTML-CSS": { availableFonts: ["TeX"] } }); </script> <div dir="ltr" style="text-align: left;" trbidi="on">In this continuation post I'd like to bring out a critical component of learning theory that is essentially missing in today's approaches for unsupervised learning: improper learning by convex relaxations.<br /><br />Consider the task of learning a sparse linear separator. The sparsity, or $\ell_0$, constraint is non-convex and computationally hard. Here comes the <a href="http://statweb.stanford.edu/~tibs/lasso.html">Lasso</a> - replace the $\ell_0$ constraint with $\ell_1$ convex relaxation --- et voila --- you've enabled polynomial-time solvable convex optimization.<br /><br /><a name='more'></a><br /><br />Another more timely example: latent variable models for recommender systems, a.k.a. the matrix completion problem. Think of a huge matrix, with one dimension corresponding to users and the other corresponding to media items (e.g. movies as in the Netflix problem). Given a set of entries in the matrix, corresponding to ranking of movies that the users already gave feedback on, the problem is to complete the entire matrix and figure out the preferences of people on movies they haven't seen.<br />This is of course ill posed without further assumptions. The low-rank assumption intuitively states that people's preferences are governed by few factors (say genre, director, etc.). This corresponds to the user-movie matrix having low algebraic rank.<br /><br />Completing a low-rank matrix is NP-hard in general. However, stemming from the compressed-sensing literature, the statistical-reconstruction approach is to assume additional statistical and structural properties over the user-movie matrix. For example, if this matrix is "incoherent", and the observations of the entires are obtained via a uniform distribution, then <a href="http://pages.cs.wisc.edu/~brecht/papers/08.Candes.Recht.MatrixCompletion.pdf">this paper by Emmanuel Candes and Ben Recht</a> shows efficient reconstruction is still possible.<br /><br />But is there an alternative to incoherent assumptions such as incoherence and i.i.d. uniform observations?<br /><br />A long line of research has taken the "improper learning by convex relaxation approach" and applied it to matrix completion in works such as <a href="http://ttic.uchicago.edu/~nati/Publications/SrebroShraibmanCOLT05.pdf">Srebro and Shraibman</a>, considering convex relaxations to rank such as the trace norm and the max norm. Finally, in joint work with <a href="https://arxiv.org/abs/1204.0136">Satyen Kale and Shai Shalev-Shwartz</a>, we take this approach to the extreme by not requiring any distribution at all, giving regret bounds in the online convex optimization model (see <a href="http://www.minimizingregret.com/2016/07/more-than-decade-of-online-convex.html">previous post</a>).<br /><br /><br />By the exposition above one may think that this technique of improper convex relaxations applies only to problems whose hardness comes from "sparsity". This is far from the truth, and in the very same <a href="https://arxiv.org/abs/1204.0136">paper</a> referenced above, we show how the technique can be applied to combinatorial problems such as predicting cuts in graphs, and outcomes of matches in tournaments.<br /><br />In fact, improper learning is such a powerful concept, that problems such that the complexity of problems such as learning DNF formulas has remained open for quite a long time, despite the fact that showing proper learnability was long known to be NP-hard.<br /><br />On the other hand, improper convex relaxation is not an all powerful magic pill. When designing convex relaxations to learning problems, special care is required so not to increase sample complexity. Consider for example the question of predicting tournaments, as given in this <a href="http://web.eecs.umich.edu/~jabernet/OpenProblemAbernethy.pdf">COLT open problem formulation by Jake Abernethy</a>. The problem, loosely stated, is to iteratively predict the outcome of a match between two competing teams from a league of N teams. The goal is to compete, or make as few mistakes, as the best ranking of teams in hindsight. Since the number of rankings scales as $N!$, the <a href="http://theoryofcomputing.org/articles/v008a006/">multiplicative weights update method</a> can be used to guarantee regret that scales as $\sqrt{T \log N!} = O(\sqrt{T N \log N})$. However, the latter, naively implemented, runs in time proportional to $N!$. Is there an efficient algorithm for the problem attaining similar regret bounds?<br /><br />A naive improper learning relaxation would treat each pair of competing teams as an instance of binary prediction, for which we have efficient algorithms. The resulting regret bound, however, would scale as the number of pairs of teams over $N$ candidate teams, or as $\sqrt{T N^2}$, essentially removing any meaningful prediction property. What has gone wrong?<br /><br />By removing the restriction of focus from rankings to pairs of teams, we have enlarged the decision set significantly, and increased the number of examples needed to learn. This is a general and important concern for improper convex relaxations: one needs to relax the problem in such a way that sample complexity (usually measured in terms of Rademacher complexity) doesn't explode. For the aforementioned problem of predicting tournaments, a convex relaxation that preserves sample complexity up to logarithmic factors is indeed possible, and described in the same <a href="https://arxiv.org/abs/1204.0136">paper</a>.<br /><br />In the coming posts we'll describe a framework for unsupervised learning that allows us to use the power of improper convex relaxations.<br /><br /><br /><br /><br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com1tag:blogger.com,1999:blog-7233553492253101490.post-10320287154600707662016-10-06T19:32:00.000-04:002018-02-22T16:01:18.997-05:00A mathematical definition of unsupervised learning?<div dir="ltr" style="text-align: left;" trbidi="on">Extracting structure from data is considered by many to be the frontier of machine learning. Yet even defining "structure", or the very goal of learning structure from unlabelled data, is not well understood or rigorously defined.<br /><br />In this post we'll give a little background to unsupervised learning and argue that, compared to supervised learning, we lack a well-founded theory to tackle even the most basic questions. In the next post, we'll introduce a new candidate framework.<br /><br /><a name='more'></a><br /><br />Unsupervised learning is a commonplace component of many undergraduate machine learning courses and books. At the core, the treatment of unsupervised learning is a collection of methods for analyzing data and extracting hidden structure from this data. Take for example <a href="http://cs229.stanford.edu/schedule.html">John Duchi's "Machine Learning" course at Stanford</a>. Under "Unsupervised learning", we have the topics: Clustering, K-means, EM. Mixture of Gaussians, Factor analysis, PCA (Principal components analysis), ICA (Independent components analysis). This is an excellent representation of the current state-of-the-art:<br /><br /><br /><ul style="text-align: left;"><li>Clustering and K-means: by grouping "similar" elements together, usually according to some underlying metric, we can create artificial "labels" for data elements in hope that this rough labelling will be of future use, either in a supervised learning task on the same dataset, or in "understanding" the data. K-means is a widely used algorithm for finding k representative centers of the data, allowing for k "labels".</li><li>Mixture of Gaussians: an exemplary and widely used method stemming from the generative approach to unsupervised learning. This approach stipulates that the data is generated from a distribution, in this case a mixture of normal random variables. There are numerous algorithms for finding the centers of these Gaussians, thereby learning the structure of the data.</li><li>Factor analysis, PCA and ICA: the spectral approach to unsupervised learning is perhaps the most widely used, in which the covariance matrix of the data is approximated by the largest few singular vectors. This type of clustering is widely successful in text (word embeddings) and many other forms of data. </li></ul><br /><br /><br />The above techniques are widely used and successful in practice, and under suitable conditions polynomial-time algorithms are known. The common thread amongst them is finding structure in the data, which turns out for many practical applications to be useful in a variety of supervised learning tasks.<br /><br />Notable unsupervised learning techniques that are missing from the above list are Restricted Boltzman machines (RBMs), Autoencoders and Dictionary Learning (which was conceived in the context of deep neural networks). The latter techniques attempt to find a succinct representation of the data. Various algorithms and heuristics for RBMs, Autotencoders and Dictionary Learning exist, but these satisfy at least one of the following:<br /><br />1) come without rigorous performance guarantees.<br /><br />2) run in exponential time in the worst case.<br /><br />3) assume strong generative assumptions about the input.<br /><br />Recent breakthroughs in polynomial time unsupervised learning due to Sanjeev and his group address points (1) & (2) above and require only (3). Independently the same is also achievable by the method of moments, see e.g. <a href="http://arxiv.org/abs/1210.7559">this paper</a>, originally from Sham's group @ MSR New England, and many more recent advances. What's the downside of this approach? The Arora-Hazan debate over this point, which the theory-ML students are exposed to in our weekly seminar, is subject for a longer post...<br /><br />What is missing then? Compare the above syllabus with that of supervised learning, in most undergrad courses and books, the difference in terms of theoretical foundations is stark. For supervised learning we have Vapnik's statistical learning theory - a deep and elegant mathematical theory that classifies exactly the learnable from labeled real-valued examples. Valiant's computational learning theory adds the crucial element of computation, and over the combined result we have hundreds of scholarly articles describing in detail problems that are learnable efficiently / learnable but inefficiently / improperly learnable / various other notions.<br /><br />Is there meaning, in pure mathematical sense such as Vapnik and Valiant instilled into supervised learning, to the unsupervised world?<br /><br />I like to point out in my ML courses that computational learning theory say something philosophical about science: the concept of "learning", or a "theory" such as a physical theory of the world, has a precise mathematical meaning independent of humans. While many of us scientists seek a "meaningful" theory in the human sense, there need not be one! It could very well be that a physical theory, for example that of condensed matter, has a "theory" in the Vapnik-Valiant sense, but not one that would be as "meaningful" and "elegant" in the <a href="https://www.ted.com/talks/richard_feynman">Feynman</a> sense. <br /><br /><br />How can we then, give mathematical meaning to unsupervised learning in a way that:<br /><br /><ol style="text-align: left;"><li>Allows natural notions of generalization from examples</li><li>Improves future supervised learning tasks regardless of their nature</li><li>Allows for natural family of algorithms (hopefully but not necessarily - based on convex optimization)</li></ol><br /><br />This will be the starting point for our next post...<br /><br /><br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com5tag:blogger.com,1999:blog-7233553492253101490.post-88901167253311951712016-07-31T22:44:00.000-04:002018-02-22T16:01:32.817-05:00Faster Than SGD 1: Variance Reduction<div dir="ltr" style="text-align: left;" trbidi="on">SGD is well-known for large-scale optimization. In my mind, there are so-far two fundamentally different improvements since the original introduction of SGD: (1) variance reduction, and (2) acceleration. In this post I'd love to conduct a survey regarding (1), and I'd like to especially thank those ICML'16 participants who pushed me to write this post.<br /><br /><a name='more'></a><br /><br />Consider the famous composite convex minimization problem<br />\begin{equation}\label{eqn:the-problem}<br />\min_{x\in \mathbb{R}^d} \Big\{ F(x) := f(x) + \psi(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \Big\} \enspace, \tag{1}<br />\end{equation}<br />in which $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ is a finite average of $n$ functions, and $\psi(x)$ is simple "proximal" function such as the $\ell_1$ or $\ell_2$ norm. In this finite-sum form, each function $f_i(x)$ usually represents the loss function with respect to the $i$-th data vector. <span style="color: #cccccc;">Problem \ref{eqn:the-problem} arises in many places:</span><br /><ul><li><span style="color: #cccccc;">convex classification and regression problems (e.g. Lasso, SVM, Logistic Regression) <a href="http://www.minimizingregret.com/2016/06/how-to-solve-classification-and-regression.html" target="_blank">fall into</a> </span><span style="color: #cccccc;">\ref{eqn:the-problem}.</span></li><li><span style="color: #cccccc;">some notable non-convex problems <a href="http://arxiv.org/abs/1607.06017" target="_blank">including PCA, SVD, CCA</a> can be reduced to \ref{eqn:the-problem}.</span></li><li><span style="color: #cccccc;">the neural net objective can be written in </span><span style="color: #cccccc;">\ref{eqn:the-problem} as well although the function $F(x)$ becomes non-convex; in any case, methods solving convex versions of \ref{eqn:the-problem} <a href="http://arxiv.org/abs/1603.05643" target="_blank">sometimes do generalize</a> to non-convex settings.</span></li></ul><h2>Recall: Stochastic Gradient Descent (SGD)</h2><div><div>To minimize objective $F(x)$, stochastic gradient methods iteratively perform the following update</div><div>$$x_{k+1} \gets \mathrm{argmin}_{y\in \mathbb{R}^d} \Big\{ \frac{1}{2 \eta } \|y-x_k\|_2^2 + \langle \tilde{\nabla}_k, y \rangle + \psi(y) \Big\} \enspace,$$</div><div>where $\eta$ is the step length and $\tilde{\nabla}_k$ is a random vector satisfying $\mathbb{E}[\tilde{\nabla}_k] = \nabla f(x_k)$ and is referred to as the <i>gradient estimator</i>. If the proximal function $\psi(y)$ equals zero, the update reduces to $x_{k+1} \gets x_k - \eta \tilde{\nabla}_k$.</div><div>A popular choice for the gradient estimator is $\tilde{\nabla}_k = \nabla f_i(x_k)$ for some random index $i \in [n]$ per iteration, and methods based on this choice are known as <i>stochastic gradient descent (SGD)</i>. Since computing $\nabla f_i(x)$ is usually $n$ times faster than that of $\nabla f(x)$, SGD enjoys a low per-iteration cost as compared to full-gradient methods; however, SGD cannot converge at a rate faster than $1/\varepsilon$ even if $F(\cdot)$ is very nice.</div></div><div><h2>Key Idea: Variance Reduction Gives Faster SGD</h2>The theory of <i>variance reduction</i> states that, SGD can converge much faster if one makes a better choice of the gradient estimator $\tilde{\nabla}_k$, so that its variance "reduces as $k$ increases". Of course, such a better choice must have (asymptotically) the same per-iteration cost as compared with SGD.<br /><br />There are two fundamentally different ways to choose a better gradient estimator, the first one is known as SVRG, and the second one is known as SAGA (which is built on top of SAG). Both of them require each function $f_i(x)$ to be smooth<span style="color: #cccccc;">, but such a requirement is not intrinsic and <a href="http://www.minimizingregret.com/2016/05/the-complexity-zoo-and-reductions-in.html" target="_blank">can be somehow removed</a></span>.<br /><br /><ul><li>Choice 1: the SVRG estimator (my favorite)<br />Keep a snapshot vector $\tilde{x} = x_k$ every $m$ iterations (where $m$ is some parameter usually around $2n$), and compute the full gradient $\nabla f(\tilde{x})$ only for such snapshots. Then, set <br />$$\tilde{\nabla}_k := \nabla f_i (x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$$<br />where $i$ is randomly chosen from $1,\dots,n$. The amortized cost of computing $\tilde{\nabla}_k$ is only 3/2 times bigger than SGD if $m=2n$ and if we store $\nabla f_i(\tilde{x})$ in memory.</li><li>Choice 2: the SAGA estimator.<br />Store in memory $n$ vectors $\phi_1,\dots,\phi_n$ and set all of them to be zero at the beginning. Then, in each iteration $k$, set<br />$$\tilde{\nabla}_k := \nabla f_i(x_k) - \nabla f_i(\phi_i) + \frac{1}{n} \sum_{j=1}^n \nabla f_j(\phi_j)$$<br />where $i$ is randomly chosen from $1,\dots,n$. Then, very importantly, update $\phi_i \gets x_k$ for this $i$. If properly implemented, the per-iteration cost to compute $\tilde{\nabla}_k$ is the same as SGD.</li></ul><br /><h2>How Variance is Reduced?</h2>Both choices of gradient estimators ensure that the variance of $\tilde{\nabla}_k$ approaches to zero as $k$ grows. In a rough sense, both of them ensure that $\mathbb{E}[\|\tilde{\nabla}_k - \nabla f(x_k)\|^2] \leq O(f(x_k) - f(x^*))$ so the variance decreases as we approach to the minimizer $x^*$. <span style="color: #cccccc;">The proof of this is two-lined if $\psi(x)=0$ and requires a little more effort in the general setting, see for instance Lemma A.2 of <a href="http://arxiv.org/pdf/1506.01972v3.pdf" target="_blank">this paper</a>. </span><br /><br />Using this key observation one can prove that, if $F(x)$ is $\sigma$-strongly convex and if each $f_i(x)$ is $L$-smooth, then the "gradient complexity" (i.e., # of computations of $\nabla f_i (x)$) of variance-reduced SGD methods to minimize Problem \ref{eqn:the-problem} is only $O\big( \big(n + \frac{L}{\sigma} \big) \log \frac{1}{\varepsilon}\big)$. This is much faster than the original SGD method.<br /><h2>Is Variance Reduction Significant?</h2>Short answer: NO when first introduced, but becoming YES, YES and YES.<br /><br />Arguably the original purpose of variance reduction is to make SGD run faster on convex classification / regression problems. However, variance-reduction methods cannot beat the slightly-earlier introduced coordinate-descent method SDCA, and performs worse than its accelerated variant AccSDCA (see for instance <a href="http://www.minimizingregret.com/2016/06/how-to-solve-classification-and-regression.html" target="_blank">the comparison here</a>).<br /><br />Then why is variance reduction useful at all? The answer is on the generality of Problem \eqref{eqn:the-problem}. In all classification and regression problems, each $f_i(x)$ is of a restricted form $loss(\langle a_i, x\rangle; b_i)$ where $a_i$ is the $i$-th data vector and $b_i$ is its label. However, Problem \eqref{eqn:the-problem} is a much bigger class, and each function $f_i(x)$ can encode a complicated structure of the learning problem. <span style="color: #cccccc;">In the extreme case, $f_i(x)$ could encode a neural network where $x$ characterize the weights of the connections (so becoming nonconvex). </span>For such general problems, SDCA does not work at all.<br /><br />In sum, variance reduction methods, although converging in the same speed as SDCA, applies more widely.<br /><h2>The History of Variance Reduction</h2><div>There are too many variance-reduction papers that even an expert sometime can't keep track of all of them. Below, let me point out some interesting papers that one should definitely cite:</div><div><ul><li>The first variance-reduction method is <a href="http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives.pdf" target="_blank">SAG</a>. <br />However, SAG is not known to work in the full proximal setting and thus (in principle) does not apply to for instance Lasso or anything L1-regularized. I conjecture that SAG also works in the proximal setting, although some of my earlier experiments seem to suggest that SAG is outperformed by its gradient-unbiased version SAGA in the proximal setting.</li><li><a href="http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives" target="_blank">SAGA</a> is a simple unbiased fix of SAG, and gives a much simpler proof than SAG. In my experiments, SAGA seems performing never worse than SAG.</li><li>SVRG was actually discovered independently by two groups of authors, <a href="http://papers.nips.cc/paper/4940-linear-convergence-with-condition-number-independent-access-of-full-gradients" target="_blank">group 1</a> and <a href="http://papers.nips.cc/paper/4937-accelerating" target="_blank">group 2</a>. Perhaps because there is no experiment, the first group's paper quickly got unnoticed (cited by 24) and the second one becomes very famous (cited by 200+). What a pity.</li></ul><div>Because SAGA and SVRG are the popular choices, one may ask which one runs faster? My answer is </div><div><ul><li>It depends on the structure of the dataset: a <a href="http://arxiv.org/abs/1602.02151" target="_blank">corollary of this paper</a> suggests that if the feature vectors are pairwisely close, then SVRG is better, and vice versa. </li><li>Experiments seem to suggest that if all vectors are normalized to norm 1, then SVRG performs better.</li><li>If the objective is not strongly convex (such as Lasso), then a <a href="http://arxiv.org/abs/1506.01972" target="_blank">simple modification of SVRG</a> outperforms both SVRG and SAGA. </li><li>Also, when $f_i(x)$ is a general function, SAGA requires $O(nd)$ memory storage which could be too large to load into memory; SVRG only needs $O(d)$. </li></ul></div></div><h2>What's Next Beyond Variance Reduction?</h2><div>There are many works that tried to extend SVRG to other settings. Most of them are no-so-interesting tweaks, but there are three fundamental extensions. </div><div><ul><li>Shalev-Shwartz <a href="http://arxiv.org/abs/1502.06177" target="_blank">first studied</a> Problem \ref{eqn:the-problem} but each $f_i(x)$ is non-convex (although the summation $f(x)$ is convex). He showed that SVRG also works there, and this has been later better <a href="http://arxiv.org/pdf/1506.01972v3.pdf" target="_blank">formalized and slightly improved</a>. This class of problems has given rise to the fastest low-rank solvers (in theory) on SVD and related problems.</li><li>Elad and I showed that SVRG also <a href="http://arxiv.org/abs/1603.05643" target="_blank">works for totally non-convex functions</a> $F(x)$. This is independently discovered by <a href="http://suvrit.de/papers/nonconvex_svrg.pdf" target="_blank">another group of authors</a>. <span style="color: #cccccc;">They published at least two more papers on this problem too, one supporting proximal, and one proving SAGA's variant.</span></li></ul>The above two improvements are regarding what will happen if we enlarge the class of Problem \ref{eqn:the-problem}. The next improvement is regarding the same Problem \ref{eqn:the-problem} but an even faster running time:<br /><ul><li>This March, I <a href="http://arxiv.org/abs/1603.05953" target="_blank">found a way to design accelerated SGD</a>, and the technique is based on variance reduction. (Recall that SGD, when equipped with Nesterov's momentum, does not perform well; it has been open regarding how to fix that.) I will talk about <a href="https://zeyuan.wordpress.com/2016/11/20/faster-than-sgd-2-the-katyusha-acceleration/" target="_blank">this result in my next post</a>.</li></ul></div></div></div>Zeyuan Allen-Zhunoreply@blogger.com2tag:blogger.com,1999:blog-7233553492253101490.post-42000717461783991502016-07-06T22:09:00.002-04:002018-02-22T16:02:07.927-05:00More than a decade of online convex optimization<div dir="ltr" style="text-align: left;" trbidi="on">This nostalgic post is written after a <a href="http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm">tutorial in ICML 2016</a> as a recollection of a few memories with my friend <a href="http://www.satyenkale.com/">Satyen Kale</a>.<br /><br /><a name='more'></a><br /><br />In ICML 2003 Zinkevich published his paper "<a href="https://www.aaai.org/Papers/ICML/2003/ICML03-120.pdf">Online Convex Programming and Generalized Infinitesimal Gradient Ascent</a>" analyzing the performance of the popular gradient descent method in an online decision-making framework.<br /><br />The framework addressed in his paper was an iterative game, in which a player chooses a point in a convex decision set, an adversary chooses a cost function, and the player suffers the cost which is the value of the cost function evaluated at the point she chose. The performance metric in this setting is taken from game theory: minimize the <b>regret</b> of the player - which is defined to be the difference of the total cost suffered by the player and that of the best <b>fixed</b> decision in hindsight.<br /><br />A couple of years later, circa 2004-2005, a group of theory students at Princeton decide to hedge their bets in the research world. At that time, finding an academic position in theoretical computer science was extremely challenging, and looking at other options was a reasonable thing to do. These were the days before the financial meltdown, when a Wall-Street job was the dream of Ivy League graduates.<br /><br />In our case - hedging our bets meant taking a course in finance at the ORFE department and to look at research problems in finance. We fell upon Tom Cover's timeless paper "<a href="http://www-isl.stanford.edu/~cover/papers/paper93.pdf">universal portfolios</a>" (I was very fortunate to talk with the great information theorist a few years later in San Diego and him tell about his influence in machine learning). As good theorists, our first stab at the problem was to obtain a polynomial time algorithm for universal portfolio selection, which we did. Our paper didn't get accepted to the main theory venues at the time, which turned out for the best in hindsight, pun intended :-) <br /><br />Cover's paper on universal portfolios was written in the language of information theory and universal sequences, and applied to wealth which is multiplicatively changing. This was very different than the additive, regret-based and optimization-based paper of Zinkevich.<br /><br />One of my best memories of all times is the moment in which the connection between optimization and Cover's method came to mind. It was more than a "guess" at first: if online gradient descent is effective in online optimization, and if Newton's method is even better for offline optimization, why can we use Newton's method in the online world? Better yet - why can't we use it for portfolio selection?<br /><br />It turns out that indeed it can, thereby the <a href="http://cs.princeton.edu/~ehazan/papers/log-journal.pdf">Online Newton Step</a> algorithm came to life, applied to portfolio selection, and presented in COLT 2016 (along with a follow-up paper devoted only to portfolio selection, with <a href="http://rob.schapire.net/">Rob Schapire</a>. Satyen and me had the nerve to climb up to Rob's office and waste his time for hours at a time, and Rob was too nice to kick us out...). <br /><br />The connection between optimization, online learning, and the game theoretic notion of regret has been very fruitful since, giving rise to a multitude of applications, algorithms and settings. To mention a few areas that spawned off:<br /><br /><ul style="text-align: left;"><li>Bandit convex optimization - in which the cost value is the only information available to the online player (rather than the entire cost function, or its derivatives). <br />This setting is useful to model a host of limited-observation problems common in online routing and reinforcement learning.</li><li>Matrix learning (also called "local learning") - for capturing problems such as recommendation systems and the matrix completion problem, online gambling and online constraint-satisfaction problems such as online max-cut.</li><li>Projection free methods - motivated by the high computational cost of projections of first order methods, the Frank-Wolfe algorithm came into renewed interest in recent years. The <a href="http://icml.cc/2012/papers/292.pdf">online version</a> is particularly useful for problems whose decision set is hard to project upon, but easy to perform linear optimization over. Examples include the spectahedron for various matrix problems, the flow polytope for various graph problems, the cube for submodular optimization, etc.<br /> </li><li>Fast first-order methods - the connection of online learning to optimization introduced some new ideas into optimization for machine learning. One of the first examples is the <a href="http://ttic.uchicago.edu/~nati/Publications/PegasosMPB.pdf">Pegasus</a> paper. By now there is a flurry of optimization papers in each and every major ML conference, some incorporate ideas from online convex optimization such as adaptive regularization, introduced in the <a href="http://www.magicbroom.info/Papers/DuchiHaSi10.pdf">AdaGrad</a> paper. </li></ul><div>There are a multitude of other connections that should be mentioned here, such as the recent literature on adversarial MDPs and online learning, connections to game theory and equilibrium in online games, and many more. For more (partial) information, see our <a href="http://www.cs.princeton.edu/~ehazan/tutorial/tutorial.htm">tutorial webpage</a> and this <a href="http://ocobook.cs.princeton.edu/OCObook.pdf">book draft</a>. </div><br /><div>It was a wild ride! What's next in store for online learning? Some exciting new directions in future posts...</div><div><br /></div><div><br /></div><div><br /></div></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com11tag:blogger.com,1999:blog-7233553492253101490.post-64733696148002162152016-06-04T03:17:00.000-04:002018-02-22T16:02:16.927-05:00How to solve classification and regression fast, faster, and fastest<div dir="ltr" style="text-align: left;" trbidi="on">(post by Zeyuan Allen-Zhu)<br /><br />I am often asked what is the best algorithm to solve SVM, to solve Lasso Regression, to solve Logistic Regression, etc. At the same time, a growing number of first-order methods have been recently proposed, making it hard to track down the state-of-the-arts. I feel it perhaps a good idea to have a blog post to answer all these questions properly and simultaneously.<br /><br /><a name='more'></a><br /><br />Consider the general problem of empirical risk minimization (ERM):<br />\begin{equation}\label{eqn:primal}<br />\min_{x\in \mathbb{R}^d} \Big\{ F(x) := \frac{1}{n} \sum_{i=1}^n f_i (\langle a_i, x \rangle) + \psi(x) \Big\}\tag{1}\end{equation}<br />Here, each $a_i \in \mathbb{R}^d$ can be viewed as a feature vector, each $f_i(\cdot)$ is a unvariate loss function, and $\psi(x)$ can be viewed as regularizers such as the $\lambda \|x\|_1$ or $\frac{\lambda}{2}\|x\|_2^2$. There are naturally four classes of interesting classification or regression problems that fit into the above framework. Namely,<br /><ul><li>Case 1: $f_i(x)$ is smooth and $F(x)$ is strongly convex. Example: ridge regression.</li><li>Case 2: $f_i(x)$ is smooth and $F(x)$ is weakly convex. Example: Lasso regression.</li><li>Case 3: $f_i(x)$ is non-smooth and $F(x)$ is strongly convex. Example: SVM.</li><li>Case 4: $f_i(x)$ is non-smooth and $F(x)$ is weakly convex. Example: L1-SVM.</li></ul><div>Somewhat surprisingly, it is not necessary to design an algorithm to solve each of the four cases above. For instance, one can "optimally" transform an algorithm solving Case 1 to algorithms solving Case 2,3 and 4 (<a href="http://arxiv.org/abs/1603.05642" target="_blank">link to paper</a>). Since full gradient based methods are too slow for large-scale machine learning, in this post I'll summarize only stochastic methods.<br /><br /><span style="color: #cccccc;">[[edit remrak: one may also consider a few other interesting cases, including: Case 5, $f_i(x)$ is non-convex but $F(x)$ is strongly convex; Case 6, $f_i(x)$ is non-convex but $F(x)$ is weakly convex; Case 7, $f_i(x)$ is non-convex and $F(x)$ is non-convex too. Case 5 was first studied by <a href="http://arxiv.org/abs/1502.06177" target="_blank">Shai Shalev-Shwartz in this paper</a>. I have slightly better results for Case 5 and 6 <a href="http://arxiv.org/abs/1506.01972" target="_blank">in this paper</a>. As for Case 7, a recent progress <a href="http://arxiv.org/abs/1603.05643" target="_blank">is this paper</a>.]]</span></div><h2>Running-Time Summaries</h2>There are three classes of stochastic first-order methods, and the best known running times are respectively:<br /><table><tbody><tr> <td style="width: 60px;"></td> <td style="width: 100px;">Column 1: SGD Method (fast)</td> <td style="width: 220px;">Column 2: Non-Accelerated Methods (faster)</td> <td style="width: 200px;">Column 3: Accelerated Methods (fastest)</td> </tr><tr> <td>Case 1</td> <td>$O\Big(\frac{G d}{\sigma \varepsilon}\Big)$</td> <td>$O\Big(\big(nd + \frac{L d}{\sigma}\big) \log\frac{1}{\varepsilon} \Big)$</td> <td>$O\Big(\big(nd + \frac{\sqrt{n L} d}{\sigma}\big) \log\frac{1}{\varepsilon} \Big)$</td> </tr><tr> <td>Case 2</td> <td>$O\Big(\frac{G d}{\varepsilon^2}\Big)$</td> <td>$O\Big(nd \log\frac{1}{\varepsilon} + \frac{L d}{\varepsilon} \Big)$</td> <td>$O\Big(nd \log\frac{1}{\varepsilon} + \frac{\sqrt{n L} d}{\sqrt{\varepsilon}} \Big)$</td> </tr><tr> <td>Case 3</td> <td>$O\Big(\frac{G d}{\sigma \varepsilon}\Big)$</td> <td>$O\Big(nd \log\frac{1}{\varepsilon} + \frac{G d}{\sigma \varepsilon} \Big)$ (useless)</td> <td>$O\Big(nd \log\frac{1}{\varepsilon} + \frac{\sqrt{n G} d}{\sqrt{\sigma \varepsilon}} \Big)$</td> </tr><tr> <td>Case 4</td> <td>$O\Big(\frac{G d}{\varepsilon^2}\Big)$</td> <td>$O\Big(nd \log\frac{1}{\varepsilon} + \frac{G d}{\varepsilon^2} \Big)$ (useless)</td> <td>$O\Big(nd \log\frac{1}{\varepsilon} + \frac{\sqrt{n G} d}{\varepsilon} \Big)$</td> </tr></tbody></table><div><span style="color: #cccccc;">In the above table, $L$ is the smoothness parameter for Cases 1/2, and (square root of) $G$ is the Lipschitz continuity parameter of $f_i$ for Cases 3/4, and $\sigma$ is the strong convexity parameter for Cases 1/3. It is clear from the table above that non-accelerated methods should not be used for cases 3 and 4, and also clear (using AM-GM inequality) that column 3 is faster than column 2.</span><br /><br />If one simply wants to know how to obtain such running times, then the short answer is.<br /><ul><li>SGD results are more-or-less folklore, see for instance Section 3 of <a href="http://ocobook.cs.princeton.edu/OCObook.pdf" target="_blank">Elad Hazan's text book</a>.</li><li>Non-accelerated results: Case 1 is first obtained by <a href="http://arxiv.org/pdf/1309.2388v2.pdf" target="_blank">SAG</a> to the best of my knowledge. Case 2 is first obtained by <a href="http://arxiv.org/abs/1506.01972" target="_blank">Yang and me</a> (by shaving off a log factor from <a href="http://papers.nips.cc/paper/4937-accelerating" target="_blank">SVRG</a>). Case 3/4 are not interesting but anyways obtainable for instance using <a href="http://arxiv.org/abs/1603.05642" target="_blank">this reduction</a>. In practice, <a href="http://papers.nips.cc/paper/4937-accelerating" target="_blank">SVRG</a> and <a href="http://arxiv.org/abs/1506.01972" target="_blank">SVRG++</a> indeed outperform SGD for Cases 1 and 2 based on my experience.</li><li>Accelerated results: The tightest results for Cases 1/2/3/4 are first <a href="http://arxiv.org/abs/1603.05953" target="_blank">obtained by the new method Katyusha</a>. If one is willing to lose a few log factors, these cases were first obtained by <a href="http://arxiv.org/abs/1309.2375" target="_blank">AccSDCA</a> in 2013. Based on my experience, at least for Cases 1+2, Katyusha seems to give the best practical performance. For Cases 3+4, I would suggest <a href="http://arxiv.org/abs/1407.1296" target="_blank">APCG</a> (see later).</li></ul><div>If one is interested in, from a high level, how such results are obtained, I categorize all existing methods into dual-only methods, primal-only methods, and primal-dual methods.</div></div><h2>Dual-Only Methods (SDCA, APCG, etc.)</h2><div>Due to technical reasons, it is easier (and somehow earlier in history) to study ERM problems from the dual perspective. The dual problem of \eqref{eqn:primal}:</div>\begin{equation}\label{eqn:dual} \min_{y \in \mathbb{R}^n} \Big\{ D(y) := \frac{1}{n}\sum_{i=1}^n f_i^*(y_i) + r^*\Big(-\frac{1}{n} \sum_{i=1}^n y_i a_i \Big) \Big\}<br />\tag{2}<br />\end{equation} <br /><div>Above, $f_i^*$ and $r^*$ are respectively the so-called <i>Fenchel dual</i> of $f_i$ and $r$ respectively. For starters, Fenchel duals are easily computable for most applications; as a concrete example, if $r(x) = \frac{\lambda}{2}\|x\|_2^2$ is the L2 regularizer, then $r^*(x) = \frac{1}{2\lambda} \|x\|_2^2$. Section 5 of <a href="http://arxiv.org/abs/1309.2375" target="_blank">this paper</a> provides lots of examples.</div><div><span style="color: #cccccc;"><br /></span></div><div><span style="color: #cccccc;">Note that the dual objective $D(y)$ is </span><i style="color: #cccccc;">undefined</i><span style="color: #cccccc;"> for Cases 2 and 4. Although $D(y)$ is defined for Case 3, it is in theory impossible to translate an approximate dual solution $y$ to a primal solution. For such reasons, in theory people </span><span style="color: #cccccc;">directly analyze how to minimize $D(y)$ for Case 1</span><span style="color: #cccccc;">, and remember, such an algorithm can be turned into solvers for Cases 2/3/4 through </span><a href="http://arxiv.org/abs/1603.05642" style="color: #cccccc;" target="_blank">reduction</a><span style="color: #cccccc;">.</span></div><h2></h2>One can prove that if $F(x)$ is $\sigma$-strongly convex, then $D(y)$ is $(1/n + 1/sigma n^2)$-smooth with respect to each coordinate. For this reason, one can apply coordinate descent to minimize $D(y)$ directly, which results in the <a href="http://arxiv.org/pdf/1209.1873.pdf" target="_blank">SDCA</a> method; or apply accelerated coordinate descent to minimize $D(y)$ directly, which results in the <a href="http://arxiv.org/abs/1407.1296" target="_blank">APCG</a> method. Note that SDCA is a non-accelerated method (column 2 of the table) and APCG is an accelerated method (column 3).<br /><br /><span style="color: #cccccc;">Although (accelerated or not) coordinate descent was well-known in optimization, it is not immediately clear how to apply them to minimize $D(y)$ in \eqref{eqn:dual} at a first glance, mainly due to the existence of the Fenchel conjugates. The <a href="http://arxiv.org/abs/1407.1296" target="_blank">APCG</a> paper provides a nice summary for beginners. </span><br /><br />I have <a href="https://zeyuan.wordpress.com/2016/06/13/coordinate-descent/" target="_blank">another blog post</a> discussing coordinate descent methods in details.<br /><h2>Primal-Only Methods (SGD, SVRG, etc.)</h2><div>A stochastic method is call primal if it directly computes $f_i'(\cdot)$ for one random sample $i$, and update $x$ accordingly. <u>Primal-only methods are more desirable for lots of reasons.</u> <span style="color: #cccccc;">First, one primal methods avoid the accuracy loss when turning from dual variables to primal. Second, one can avoid the potentially involved Fenchel computation. Third, consider the more general problem of \eqref{eqn:primal} by replacing each $f_i(\langle a_i, x\rangle)$ with the more general form $f_i(x)$, this problem can only be solved by primal-only methods because its dual (in some sense) does not exist.</span><br /><br />The first stochastic method is stochastic gradient descent (SGD). Ignoring the existence of the regularizer $\psi(x)$, the SGD method iteratively updates $x_{k+1} \gets x_k - \eta \cdot f_i' (\langle a_i, x\rangle) a_i$ where $\eta$ is some learning rate and $i\in [n]$ is a random sample. It is clear from this formulation that SGD replaces the full gradient $\nabla := \frac{1}{n} \sum_{i=1}^n f_i' (\langle a_i, x\rangle) \cdot a_i$ with an unbiased random sample $\tilde{\nabla} := f_i' (\langle a_i, x\rangle) \cdot a_i$, which is $n$ times faster to compute.<br /><br />The first theoretical breakthrough on primal-only methods is <a href="http://arxiv.org/pdf/1309.2388v2.pdf" target="_blank">SAG</a> (to the best of my knowledge), which introduces the variance-reduction technique in order to match the running time of SDCA (and thus column 2 of the table). <span style="color: #cccccc;">Approximately a year later, <a href="http://papers.nips.cc/paper/4937-accelerating" target="_blank">SVRG</a> and <a href="http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives" target="_blank">SAGA </a>are introduced to replace SAG with not only a simpler proof, but much better practical performance. </span>The key idea of these methods is to design a better $\tilde{\nabla}$ so that its expectation $\mathbb{E}[\tilde{\nabla}]$ stills equals to the full gradient $\nabla$, but somehow approaching to zero. Unfortunately, all these variance-reduction based methods only match the non-accelerated running time (column 2).<br /><br />In this March 2016, I finally obtained the first accelerated method (thus column 3) that is primal-only. The key idea there is to introduce a momentum plus a negative momentum on top of variance-reduced $\tilde{\nabla}$. Interested readers can <a href="http://arxiv.org/abs/1603.05953" target="_blank">find my method here</a>, and I plan to give a more detailed survey on variance-reduction based methods later this summer.<br /><h2>Primal-Dual Methods (SPDC)</h2><div>One can also solve \eqref{eqn:primal} from a saddle-point perspective. Consider<br />$$ \min_{x\in \mathbb{R}^d} \max_{y \in \mathbb{R}^n} \Big\{ \phi(x,y) := \frac{1}{n} y^T A x + \psi(x) - \frac{1}{n}\sum_{i=1}^n f_i^*(y_i) \Big\}$$<br />where $A = [a_1,\dots,a_n]^T \in \mathbb{R}^{n\times d}$ is the data matrix. One can prove that $F(x) = \max_y \phi(x,y)$ and $D(y) = - \min_x \phi(x,y)$, and therefore to solve the original ERM problem it suffices to solve this saddle-point problem.<br /><br />In 2014, Zhang and Xiao provided an accelerated, stochastic method <a href="https://arxiv.org/abs/1409.3257" target="_blank">SPDC</a> (column 3) that directly solves this saddle-point problem. It is built on the so-called accelerated primal-dual methods of Chambolle and Pock, which I also plan to write a blog post about it some time in the future. Unfortunately, I know people who report to me that the parameter tuning steps for SPDC may be too complicated in practice, but I haven't tried it myself so I can't say for sure.<br /><h2 style="-webkit-text-stroke-width: 0px; color: black; font-family: 'Times New Roman'; font-style: normal; font-variant: normal; letter-spacing: normal; line-height: normal; orphans: auto; text-align: start; text-indent: 0px; text-transform: none; white-space: normal; widows: 1; word-spacing: 0px;">Final Remarks</h2></div></div>In the above summary I have injected lots of my own (and perhaps controversial) opinions into the discussions above. For instance, one can also regard SDCA as a primal-dual methods because it somehow "also maintains primal variables". At the same time, the running times provided in my table are tight and the (almost) matching lower bounds recently shown by <a href="http://arxiv.org/abs/1605.08003" target="_blank">Woodworth and Srebro</a>. Finally, let me conclude with a performance plot comparing Katyusha to other state-of-the-arts:<br /><img border="0" height="353" src="https://2.bp.blogspot.com/-pGqQ58yEvD0/V1J89Jj9SCI/AAAAAAAAKdo/Gq-q4oGx3hUZZGm_-Gi5n0cYV_H5Ml3CACK4B/s640/Untitled.png" width="640" /></div>Zeyuan Allen-Zhunoreply@blogger.com5tag:blogger.com,1999:blog-7233553492253101490.post-24470248654295449012016-05-26T09:37:00.001-04:002018-02-22T16:02:25.670-05:00The complexity zoo and reductions in optimization<div dir="ltr" style="text-align: left;" trbidi="on">(post by Zeyuan Allen-Zhu and Elad Hazan)<br /><br />The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as "should I define strong-convexity?", "discuss smoothness?", "Nesterov acceleration?", etc.<br /><br /><a name='more'></a><br /><br />This is especially acute for courses that do not deal directly with optimization, which is described as a tool for learning or as a basic building block for other algorithms. Some approaches:<br /><ul style="text-align: left;"><li>I teach online gradient descent, in the context of online convex optimization, and then derive the offline counterpart. This is non-standard, but permits an extremely simple derivation and gets to the online learning component first. </li><li>Sanjeev Arora teaches basic offline GD for the smooth and strongly-convex case first. </li><li>In OR/optimization courses the smooth (but not strongly-convex) case is many times taught first. </li></ul>All of these variants have different proofs whose connections are perhaps not immediate. If one wishes to go into more depth, usually in convex optimization courses one covers the full spectrum of different smoothness/strong-convexity/acceleration/stochasticity regimes, each with a separate analysis (a total of 16 possible configurations!)<br /><br />This year I've tried something different in COS511 @ Princeton, which turns out also to have research significance. We've covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then <b>extended</b> these result <b>by reduction</b> to all other cases!<br />A (simplified) outline of this teaching strategy is given in chapter 2 of this <a href="http://ocobook.cs.princeton.edu/">book</a>. <br /><h3></h3><h3>Classical Strong-Convexity and Smoothness Reductions</h3>Given any optimization algorithm A for the well-conditioned case (i.e., strongly convex and smooth case), we can derive an algorithm for smooth but not strongly functions as follows.<br /><br />Given a non-strongly convex but smooth objective $f$, define a objective by<br />$$ \text{strong-convexity reduction:} \qquad f_1(x) = f(x) + \epsilon \|x\|^2 $$<br />It is straightforward to see that $f_1$ differs from $f$ by at most $\epsilon$ times a distance factor, and in addition it is $\epsilon$-strongly convex. Thus, one can apply $A$ to minimize $f_1$ and get a solution which is not too far from the optimal solution for $f$ itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.<br /><br />Similar simplistic assumptions can be derived for (finite-sum forms of) non-smooth by strongly-convex functions (via randomized smoothing or Fenchel duality), and for functions that are neither smooth nor strongly-convex by just applying both reductions simultaneously. Notice that such classes of functions include famous machine learning problems such as SVM, Logistic Regression, SVM, L1-SVM, Lasso, and many others.<br /><h3></h3><h3>Necessity of Reductions</h3><div>This is not only a pedagogical question. In fact, very few algorithms apply to the entire spectrum of strong-convexity / smoothness regimes, and thus reductions are very often<i> intrinsically</i> necessary. To name a few examples,</div><div><ul><li>Variance-reduction methods such as SAG, SAGA and SVRG require the objective to be smooth, and do not work for non-smooth problems like SVM. This is because for loss functions such as hinge loss, no unbiased gradient estimator can achieve a variance that approaches to zero</li><li>Dual methods such as SDCA or APCG require the objective to be strongly convex, and do not directly apply to non-strongly convex problems. This is because for non-strongly convex objectives such as Lasso, their duals are not even be well-defined.</li><li>Primal-dual methods such as SPDC require the objective to be both smooth and SC.</li></ul></div><h3></h3><h3>Optimality and Practicality of Reductions</h3><div>The folklore strong-convexity and smoothing reductions are suboptimal. Focusing on the strong-convexity reduction for instance:</div><div><ul><li>It incurs a logarithmic factor log(1/ε) in the running time so leading to slower algorithms than direct methods. For instance, if $f(x)$ is smooth but non-strongly convex, gradient descent minimizes it to an $\epsilon$ accuracy in $O(1/\epsilon)$ iterations; if one uses the reduction instead, the complexity becomes $O(\log(1/\epsilon)/\epsilon)$.</li><li>More importantly, algorithms based on such reductions become <i><b>biased</b></i>: the original objective value $f(x)$ does not converge to the global minimum, and if the desired accuracy is changed, one has to change the weight of the regularizer $\|x\|^2$ and restart the algorithm. </li></ul></div>These theoretical concerns also translate into running time losses and parameter tuning difficulties in practice. For such reasons, researchers usually make efforts on designing unbiased methods instead.<br /><ul><li>For instance, <a href="http://papers.nips.cc/paper/4937-accelerating" target="_blank">SVRG</a> in theory solves the Lasso problem (with smoothness L) in time $O((nd+\frac{Ld}{\epsilon})\log\frac{1}{\epsilon})$ using reduction. Later, direct and unbiased methods for Lasso are introduced, including <a href="http://papers.nips.cc/paper/5258-saga-a-fast-incremental-gradient-method-with-support-for-non-strongly-convex-composite-objectives" target="_blank">SAGA</a> which has running time $O(\frac{nd+Ld}{\epsilon})$, and <a href="http://arxiv.org/abs/1506.01972" target="_blank">SVRG++</a> which has running time $O(nd \log\frac{1}{\epsilon} + \frac{Ld}{\epsilon})$.</li></ul>One can find academic papers derive various optimization improvements many times for only one of the settings, leaving the other settings desirable. An <b><i>optimal and unbiased</i></b> black-box reduction is thus a tool to extend optimization algorithms from one domain to the rest.<br /><h3></h3><h3>Optimal, Unbiased, and Practical Reductions</h3><div>In this <a href="http://arxiv.org/abs/1603.05642">paper</a>, we give optimal and unbiased reductions. For instance, the new reduction, when applied to SVRG, implies the same running time as SVRG++ up to constants, and is unbiased so converges to the global minimum. Perhaps more surprisingly, these new results imply<b><i> new</i></b> theoretical results that were not previously known by direct methods. To name two of such results:</div><div><ul><li>On Lasso, it gives an accelerated running time $O(nd \log\frac{1}{\epsilon} + \frac{\sqrt{nL}d}{\sqrt{\epsilon}})$ where the best known result was not only an biased algorithm but also slower $O( (nd + \frac{\sqrt{nL}d}{\sqrt{\epsilon}}) \log\frac{1}{\epsilon} )$.</li><li>On SVM with strong convexity $\sigma$, it gives an accelerated running time $O(nd \log\frac{1}{\epsilon} + \frac{\sqrt{n}d}{\sqrt{\sigma \epsilon}})$ where the best known result was not only an biased algorithm but also slower $O( (nd + \frac{\sqrt{n}d}{\sqrt{\sigma \epsilon}}) \log\frac{1}{\epsilon} )$.</li></ul>These reductions are surprisingly simple. In the language of strong-convexity reduction, the new algorithm starts with a regularizer $\lambda \|x\|^2$ of some large weight $\lambda$, and then keeps halving it throughout the convergence. Here, the time to decrease $\lambda$ can be either decided by theory or by practice (such as by computing duality gap).<br /><br />A figure to demonstrate the practical performance of our new reduction (red dotted curve) as compared to the classical biased reduction (blue curves, with different regularizer weights) are presented in the figure below.</div><div style="text-align: center;"><a href="http://2.bp.blogspot.com/-G-Y75QWFfTI/V0Zo9mrz-nI/AAAAAAAAKdQ/-HtalLmHkp0nd2RN0sQ4S5d-rC725q5PwCK4B/s1600/SDCA-8.png" imageanchor="1"><img border="0" height="182" src="https://2.bp.blogspot.com/-G-Y75QWFfTI/V0Zo9mrz-nI/AAAAAAAAKdQ/-HtalLmHkp0nd2RN0sQ4S5d-rC725q5PwCK4B/s320/SDCA-8.png" width="320" /></a><br /><br /><div style="text-align: left;">As a final word - if you were every debating whether to post your paper on ArXiV, yet another example of how quickly it helps research propagate: only a few weeks after our paper was made available online, Woodworth and Srebro have already made use of our reductions in their new <a href="https://arxiv.org/abs/1605.08003">paper</a>. </div></div><br /><br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com1tag:blogger.com,1999:blog-7233553492253101490.post-9360929683136214622016-03-03T12:56:00.001-05:002016-03-07T10:08:43.749-05:00The two cultures of optimization<div dir="ltr" style="text-align: left;" trbidi="on">The standard curriculum in high school math includes elementary functional analysis, and methods for finding the minima, maxima and saddle points of a single dimensional function. When moving to high dimensions, this becomes beyond the reach of your typical high-school student: mathematical optimization theory spans a multitude of involved techniques in virtually all areas of computer science and mathematics. <br /><br />Iterative methods, in particular, are the most successful algorithms for large-scale optimization and the most widely used in machine learning. Of these, most popular are first-order gradient-based methods due to their very low per-iteration complexity.<br /><br />However, way before these became prominent, physicists needed to solve large scale optimization problems, since the time of the Manhattan project at Los Alamos. The problems that they faced looked very different, essentially simulation of physical experiments, as were the solutions they developed. The Metropolis algorithm is the basis for randomized optimization methods and Markov Chain Monte Carlo algorithms.<br /><br />To use the terminology of Breiman's famous <a href="http://projecteuclid.org/euclid.ss/1009213726">paper</a>, the two cultures of optimization have independently developed to fields of study by themselves. <br /><br />The readers of this particular blog may be more familiar with the literature on iterative methods. The hallmark of polynomial time methods for mathematical optimization are so called "interior point methods", based on Newton's method. Of these, the most efficient and well-studied are "central path following" methods, established in the pioneering work of Nesterov and Nemirovski (see <a href="http://www2.isye.gatech.edu/~nemirovs/Lect_IPM.pdf">this</a> excellent introductory text).<br /><br />On the other side of the world, the early physicists joined forces with engineers and other scientists to develop highly successful Monte Carlo simulation algorithms. This approach grew out of work in statistical mechanics, and a thorough description is given in <a href="http://minds.jacobs-university.de/sites/default/files/uploads/teaching/share/KirkpatrickSimulatedAnnealing.pdf">this</a> 1983 <i>Science</i> paper by Kirkpatrick, Gelatt, and Vecchi. One of the central questions of this area of research is what happens to particular samples of liquid or solid matter as the temperature of the matter is dropped. As the sample is cooled, the authors ask:<br /><blockquote class="tr_bq">[...] for example, whether the atoms remain fluid or solidify, and if they solidify, whether they form a crystalline solid or a glass. Ground states and configurations close to them in energy are extremely rare among all the configurations of a macroscopic body, yet they dominate its properties at low temperatures because as T is lowered the Boltzmann distribution collapses into the lowest energy state or states.</blockquote>In this groundbreaking work drawing out the connections between the cooling of matter and the problem of combinatorial optimization, a new algorithmic approach was proposed which we now refer to as <i>Simulated Annealing</i>. In more Computer Science-y terminology, the key idea is to use random walk sampling methods to select a point in the convex set from a distribution which has high entropy (i.e. at a "high temperature"), and then to iteratively modify the distribution to concentrate around the optimum (i.e. sampling at "low temperature"), where we need to perform additional random walk steps after each update to the distribution to guarantee "mixing" (more below). There has been a great deal of work understanding annealing methods, and they are quite popular in practice.<br /><br />In a recent <a href="http://arxiv.org/abs/1507.02528">paper</a>, we have discovered a close connection between the two methodologies. Roughly speaking, for constrained <b>convex</b> optimization problems, the iterates of Newton's method lie on the same path as the means of the consecutive distributions of simulated annealing. This is depicted in the picture below. Besides the intriguing connection, this discovery yields algorithmic benefits as well: a faster convex optimization algorithm for the most general input access model, and a resolution of a open problem in interior point methods. <br /><br /><div style="text-align: center;"><a href="http://3.bp.blogspot.com/-_NisG0CvqR4/Vtbuj3EjvzI/AAAAAAABG2Q/bbwQGXfrZbE/s1600/heatpath_with_samples.png" imageanchor="1"><img border="0" height="265" src="https://3.bp.blogspot.com/-_NisG0CvqR4/Vtbuj3EjvzI/AAAAAAABG2Q/bbwQGXfrZbE/s320/heatpath_with_samples.png" width="320" /></a></div><br /><br />We continue with a more detailed description of path-following interior point methods, simulated annealing, and finally a discussion of the consequences.<br /><br /><h3 style="text-align: left;">An overview of the Statistical Mechanics approach to optimization </h3><div><br /></div>The Simulated Annealing approach to optimization is to reduce a given formulation to a sampling problem. Consider the following distribution over the set $K \subseteq R^n$ for a given function $f: R^n \mapsto R$.<br />\begin{equation}<br />\textstyle<br /> P_{f,t}(x) := \frac{\exp(- f(x)/t )}<br /> { \int_K \exp(-f(x')/t ) \, dx' }.<br />\end{equation}<br />This is often referred to as the Boltzmann distribution. Clearly - as $t$ approaches zero, this distribution approaches a point-mass concentrated on the global minimizer of $f$.<br /><br />The question is now - how do we sample from the Boltzman distribution? Here comes the fantastic observation of the Los Alamos scientists: formulate a random walk such that its stationary distribution is exactly the one we wish to sample from. Then simulate the random walk till it mixes - and voila - you have a sample!<br /><br />The Metropolis algorithm does exactly that - a general recipe for creating Markov Chains whose stationary distribution is whatever we want. The crucial observation is that simulating a random walk is many times significantly easier than reasoning about the distribution itself. Furthermore, it so happens that the mixing time in such chains, although notoriously difficult to analyze rigorously, are very small in practice.<br /><br />The analogy to physics is as follows - given certain initial conditions for a complex system along with evolution rules, one is interested in system behavior at "steady state". Such methods were used to evaluate complex systems during WWII in the infamous Manhattan project.<br /><br />For optimization, one would use a random walk technique to iteratively sample from ever decreasing temperature parameters $t \mapsto 0$. For $t=0$ sampling amounts to optimization, and thus hard.<br /><br />The simulated annealing method slowly changes the temperature. For $t = \infty$, the Boltzmann distribution amounts to uniform sampling, and it is reasonable to assume this can be done efficiently. Then, exploiting similarity in consecutive distributions, one can use samples from one temperature (called a "warm start") to efficiently sample from a cooler one, till a near-zero temperature is reached. But how to sample from even one Boltzman distributions given a warm start? There is vast literature on random walks and their properties. Popular walks include the "hit and run from a corner", "ball walk", and many more.<br /><br />We henceforth define the <b>Heat Path </b>as the deterministic curve which maps a temperature to the mean of the Boltzmann distribution for that particular temperature. <br />$$ \mu_t = E_{x \sim P_{f,t}} [ x] $$<br />Incredibly, to the best of our knowledge, this fascinating curve was not studied before as a deterministic object in the random walks literature.<br /><br /><h3 style="text-align: left;">An overview of the Mathematics approach to optimization </h3><br />It is of course infeasible to do justice to the large body of work addressing these ideas within the mathematics/ML literature. Let us instead give a brief intro to Interior Point Methods (IPM) - the state of the art in polynomial time methods for convex optimization.<br /><br />The setting of IPM is that of constrained convex optimization, i.e. minimizing a convex function subject to convex constraints.<br /><br />The basic idea of IPM is to reduce constrained optimization to unconstrained optimization, similarly to Lagrangian relaxation. The constraints are folded into a single penalty function, called a "barrier function", which guaranties three properties:<br /><br /><ol style="text-align: left;"><li>The barrier approaches infinity near the border of the convex set, thereby ensuring that the optimum is obtained inside the convex set described by the constraints. </li><li>The barrier doesn't distort the objective function too much, so that solving for the new unconstrained problem will give us a close solution to the original formulation.</li><li>The barrier function allows for efficient optimization via iterative methods, namely Newton's method.</li></ol>The hallmark of IPM is the existence of such barrier functions with all three properties, called self-concordant barrier functions, that allow efficient deterministic polynomial-time algorithms for many (but not all) convex optimization formulations. <br /><br />Iteratively, a "temperature" parameter multiplied by the barrier function is added to the objective, and the resulting unconstrained formulation solved by Newton's method. The temperature is reduced, till the barrier effect becomes negligible compared to the original objective, and a solution is obtained.<br /><br />The curve mapping the temperature parameter to the solution of the unconstrained problem is called the <b>central path</b>. and the overall methodology called "central path following methods". <br /><br /><br /><h3 style="text-align: left;">The connection</h3><div style="text-align: left;"><span style="font-weight: normal;"><br /></span><span style="font-weight: normal;">In our recent paper, we show that for </span><span style="font-weight: normal;"><b>convex</b></span><span style="font-weight: normal;"> optimization, the </span><span style="font-weight: normal;"><b>heat path</b></span><span style="font-weight: normal;"> and </span><span style="font-weight: normal;"><b>central path</b></span><span style="font-weight: normal;"> for IPM for a particular barrier function (called the <i>entropic barrier</i>, following the terminology of the recent excellent work of <a href="http://arxiv.org/abs/1412.1587">Bubeck and Eldan</a>) are identical! Thus, in some precise sense, the two cultures of optimization have been studied the same object in disguise and using different techniques. </span></div><div style="text-align: left;"><span style="font-weight: normal;"><br /></span></div><div style="text-align: left;"><span style="font-weight: normal;">Can this observation give any new insight to the design of efficient algorithm? The answer turns out to be affirmative. </span></div><div style="text-align: left;"><br /></div><div style="text-align: left;">First, we resolve the long standing question in IPM on the existence of an efficiently computable self-concordant barrier for general convex sets. We show that the covariance of the Boltzman distribution is related to the Hessian of a self-concordant barrier (known as the entropic barrier) for any convex set. For this computation, all we need is a membership oracle for the convex set. </div><div style="text-align: left;"><br /></div><div style="text-align: left;">Second, our observation gives rise to a faster annealing temperature for convex optimization, and a much simpler analysis (building on the seminal work of <a href="http://www.cc.gatech.edu/~vempala/papers/adamanneal.pdf">Kalai and Vempala</a>). This gives a faster polynomial-time algorithm for convex optimization in the membership oracle model - a model in which the access to the convex set is only through an oracle answering questions of the form "is a certain point x inside the set?". </div><div><br /></div><br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com0tag:blogger.com,1999:blog-7233553492253101490.post-42820217464035264412016-03-01T22:55:00.001-05:002016-03-02T14:43:39.937-05:00Making second order methods practical for machine learning<div dir="ltr" style="text-align: left;" trbidi="on">First-order methods such as Gradient Descent, AdaGrad, SVRG, etc. dominate the landscape of optimization for machine learning due to their extremely low per-iteration computational cost. Second order methods have largely been ignored in this context due to their prohibitively large time complexity. As a general rule, any super-linear time operation is prohibitively expensive for large scale data analysis. In our recent <a href="http://arxiv.org/abs/1602.03943">paper</a> we attempt to bridge this divide by exhibiting an efficient linear time second order algorithm for typical <b>(</b>convex<b>)</b> optimization problems arising in machine learning.<br /><br />Previously, second order methods were successfully implemented in linear time for special optimization problems such as maximum flow [see e.g. this <a href="http://arxiv.org/abs/1010.2921">paper</a>] using very different techniques. In the optimization community, Quasi-Newton methods (such as BFGS) have been developed which use first order information to approximate a second order step. Efficient implementations of these methods (such as L-BFGS) have been proposed, mostly without theoretical guarantees.<br /><br />Concretely, consider the PAC learning model where given $m$ samples $\{(\mathbf{x}_k, y_k)\}_{k\in \{1,\dots, m\}}$ the objective is to produce a hypothesis from a hypothesis class that minimizes the overall error. This optimization problem, known as Empirical Risk Minimization, can be written as follows:<br />\[\min_{\theta \in \mathbb{R}^d} f(\theta) = \min_{\theta \in \mathbb{R}^d} \left\{\frac{1}{m}\sum\limits_{k=1}^m f_k(\theta) + R(\theta) \right\}.\]<br />In typical examples of interest, such as Logistic Regression, soft-margin Support Vector Machines, etc., we have that each $f_k(\theta)$ is a convex function of the form $\ell(y_k, \theta^{\top}\mathbf{x}_k)$ and $R(\theta)$ is a regularization function. For simplicity, consider the Euclidean regularization given by $R(\theta) = \lambda \|\theta\|^2$.<br /><br />Gradient Descent is a natural first approach to optimize a function, whereby you repeatedly take steps in the direction opposite of the gradient at the current iterate \[ \mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla f(\mathbf{x}_t). \] Under certain conditions on the function and with an appropriate choice of $\eta$, it can be shown this process converges linearly to the true minimizer $\mathbf{x}^*$. If $f$ has condition number $\kappa$, then it can be shown that $O(\kappa \log (1/\epsilon))$ iterations are needed to be $\epsilon$-close to the minimizer, with each iteration taking $O(md)$ time, so the total running time is $O(md \kappa \log(1/ \epsilon))$. Numerous extensions to this basic method have been proposed in the ML literature in recent years, which is a subject for a future blog post.<br /><br />Given the success of first order methods, a natural extension is to incorporate second order information in deciding what direction to take at each iteration. Newton's Method does exactly this, where, denoting $\nabla^{-2}f(\mathbf{x}) := [\nabla^2 f(\mathbf{x})]^{-1}$, the update takes the form<br />\[\mathbf{x}_{t+1} = \mathbf{x}_t - \eta \nabla^{-2}f(\mathbf{x}_t) \nabla f(\mathbf{x}_t).\]<br />Although Newton's Method can converge to the optimum of a quadratic function in a single iteration and can achieve quadratic convergence for general functions (when close enough to the optimum), the per-iteration costs can be prohibitively expensive. Each step requires calculation of the Hessian (which is $O(md^2)$ for functions of the form $f(\mathbf{x}) = \sum\limits_{k=1}^m f_k(\mathbf{x})$) as well as a matrix inversion, which naively takes $O(d^3)$ time.<br /><br />Drawing inspiration from the success of using stochastic gradient estimates, a natural step forward is to use stochastic second order information in order to estimate the Newton's direction. One of the key hurdles towards a direct application of this approach is the lack of an immediate unbiased estimator of the inverse Hessian. Indeed, it is not the case that the inverse of an unbiased estimator of the Hessian is an unbiased estimator of the Hessian inverse. In a recent <a href="http://papers.nips.cc/paper/5918-convergence-rates-of-sub-sampled-newton-methods.pdf">paper</a> by Erdogdu and Montanari, they propose an algorithm called NewSamp, which first obtains an estimate of the Hessian by aggregating multiple samples, and then computes the matrix inverse. Unfortunately the above method still suffers due to the prohibitive run-time of the matrix inverse operation. One can take a spectral approximation to speed up inversion, however this comes at the cost of loosing some spectral information and thus loosing some nice theoretical guarantees.<br /><br />Our method tackles these issues by making use of the Neumann series. Suppose that, for all $\mathbf{x}$, $\|\nabla^2 f(\mathbf{x})\|_2 \leq 1$, $\nabla^2 f(\mathbf{x}) \succeq 0$. Then, considering the Neumann series of the Hessian matrix, we have that \[\nabla^{-2}f(\mathbf{x}) = \sum\limits_{i=0}^{\infty} \left(I - \nabla^2 f(\mathbf{x})\right)^i.\]<br />The above representation can be used to construct an unbiased estimator in the following way. The key component is an unbiased estimator of a single term $\left(I - \nabla^2 f(\mathbf{x})\right)^i$ in the above summation. This can obtained by taking independent and unbiased samples $\{ \nabla^2 f_{[j]} \}$ and creating the product by multiplying the individual terms. Formally, pick a probability distribution $\{p_i\}_{i \geq 0}$ over the non-negative integers (representing a probability distribution over the individual terms in the series) and sample an index $\hat{i}$. Sample $\hat{i}$ independent and unbiased Hessian samples $\left\{\nabla^2 f_{[j]}(\mathbf{x})\right\}_{j \in \{1, \dots, \hat{i}\}}$ and define the estimator: \[\tilde{\nabla}^{-2} f(\mathbf{x}) = \frac{1}{p_{\hat{i}}} \prod\limits_{j=1}^{\hat{i}} \left(I - \nabla^2 f_{[j]}(\mathbf{x})\right).\]<br />Note that the above estimator is an unbiased estimator which follows from the fact that the samples are independent and unbiased. <br /><br />We consider the above estimator in our paper and analyze its convergence. The above estimator unfortunately has the disadvantage that it captures one term in the summation and introduces the need for picking a distribution over the terms. However, this issue can be circumvented by making the following observation about a recursive reformulation of the above series:<br /><br />\[ \nabla^{-2} f = I + \left(I - \nabla^2 f\right)\left( I + \left(I - \nabla^2 f\right) ( \ldots)\right). \]<br />If we curtail the Taylor series to contain the first $t$ terms (which is equivalent to the recursive depth of $t$ in the above formulation) we can see that an unbiased estimator of the curtailed series can easily be constructed given $t$ unbiased and independent samples as follows:<br /><br />\[\nabla^{-2} f = I + \left(I - \nabla^2 f_{[t]}\right)\left( I + \left(I - \nabla^2 f_{[t-1]}\right) ( \ldots)\right).\]<br />Note that as $t \rightarrow \infty$ our estimator becomes an unbiased estimator of the inverse. We use the above estimator in place of the true inverse in the Newton Step in the new algorithm, dubbed LiSSA.<br /><br />The next key observation we make with regards to our estimator is that since we are interested in computing the Newton direction $(\nabla^{-2} f \nabla f)$, we need to repeatedly compute products of the form $(I - \nabla^2 f_{[i]})v$ where $\nabla^2 f_{[i]}$ is a sample of the Hessian and $v$ is a vector. As remarked earlier in common machine learning settings the loss function is usually of the form $\ell_k(y_k, \theta^{\top} \mathbf{x}_k)$ and therefore the hessian is usually of the form $c \mathbf{x}_k\mathbf{x}_k^{\top}$, i.e. it is a rank 1 matrix. It can now be seen that the product $(I - \nabla^2 f_{[i]})v$ can be performed in time $O(d)$, giving us a <b>linear time update step.</b> For full details, we refer the reader to Algorithm 1 in the paper.<br /><br />We would like to point out an alternative interpretation of our estimator. Consider the second order approximation $g(\mathbf{y})$ to the function $f(\mathbf{x})$ at any point $\mathbf{z}$, where $\mathbf{y} = \mathbf{x} - \mathbf{z}$, given by<br />\[g(\mathbf{y}) = f(\mathbf{z}) + \nabla f(\mathbf{z})^{\top}\mathbf{y} + \frac{1}{2}\mathbf{y}^{\top}\nabla^2 f(\mathbf{z})\mathbf{y}. \]<br />Indeed, a single step of Newton's method takes us to the minimizer of the quadratic. Let's consider what a gradient descent step for the above quadratic looks like:<br />\[ \mathbf{x}_{t+1} = (I - \nabla^2 f(\mathbf{z}))\mathbf{x}_t - \nabla f(\mathbf{z}).\]<br />It can now be seen that using $t$ terms of the recursive Taylor series as the inverse of the Hessian corresponds exactly to doing $t$ gradient descent steps on the quadratic approximation. Continuing the analogy we get that our stochastic estimator is a specific form of SGD on the above quadratic. Instead of SGD, one can use any of the improved/accelerated/variance-reduced versions too. Although the above connection is precise we note that it is important that in the expression above, the gradient of $f$ is exact (and not stochastic). Indeed, known lower bounds on stochastic optimization imply that it is impossible to achieve the kind of guarantees we achieve with purely stochastic first and second order information. This is the reason one cannot adapt the standard SGD analysis in our scenario, and it remains an intuitive explanation only.<br /><br />Before presenting the theorem governing the convergence of LiSSA, we will introduce some useful notation. It is convenient to first assume WLOG that every function $f_k(\mathbf{x})$ has been scaled such that $\|\nabla^2 f_k(\mathbf{x})\| \leq 1.$ We also denote the condition number of $f(\mathbf{x})$ as $\kappa$, and we define $\kappa_{max}$ to be such that<br />\[\lambda_{min}(\nabla^2 f_k(\mathbf{x})) \geq \frac{1}{\kappa_{max}}.\]<br /><br /><b>Theorem:</b> LiSSA returns a point $\mathbf{x}_t$ such that with probability at least $1 - \delta$, \[f(\mathbf{x}_t) \leq \min\limits_{\mathbf{x}^*} f(\mathbf{x}^*) + \epsilon \] in $O(\log \frac{1}{\epsilon})$ iterations, and total time $ O \left( \left( md + \kappa_{max}^2 \kappa d \right) \log \frac{1}{\epsilon}\right).$<br /><br />At this point we would like to comment on the running times described above. Our theoretical guarantees are similar to methods such as SVRG and SDCA which achieve running time $O(m + \kappa)d\log(1/\epsilon)$. The additional loss of $\kappa^2_{max}$ in our case is due to the concentration step as we prove high probability bounds on our per-step guarantee. In practice we observe that this concentration step is not required, i.e. applying our estimator only once, performs well as is demonstrated by the graphs below. It is remarkable that the number of <b>iterations </b>is independent of any condition number - this is the significant advantage of second order methods.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-ND6FiSB9FwM/VtYk3OPWfOI/AAAAAAAAAi8/D6TdVY_-qQ0/s1600/merged_image.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="424" src="https://4.bp.blogspot.com/-ND6FiSB9FwM/VtYk3OPWfOI/AAAAAAAAAi8/D6TdVY_-qQ0/s640/merged_image.png" width="640" /></a></div><br />These preliminary experimental graphs compare LiSSA with standard algorithms such as AdaGrad, Gradient Descent, BFGS and SVRG on a binary Logistic Regression task on three datasets obtained from LibSVM. The metric used in the above graphs is log(Value - Optimum) vs Time Elapsed.<br /><br />Next we include a comparison between LiSSA and second order methods, vanilla Newton's Method and NewSamp. We observe that in terms of iterations Newton's method shows a quadratic decline whereas NewSamp and LiSSA perform similarly in a linear fashion. As is expected, a significant performance advantage for LiSSA is observed when we consider a comparison with respect to time.<br /><br /><div class="separator" style="clear: both; text-align: center;"><a href="https://4.bp.blogspot.com/-ueT1Kos1yPE/VtYlKv8tHHI/AAAAAAAAAjA/aOShnbUTqSc/s1600/MergedNewtonImage.png" imageanchor="1" style="margin-left: 1em; margin-right: 1em;"><img border="0" height="248" src="https://4.bp.blogspot.com/-ueT1Kos1yPE/VtYlKv8tHHI/AAAAAAAAAjA/aOShnbUTqSc/s640/MergedNewtonImage.png" width="640" /></a></div><br />To summarize, in this blog post we described a linear time stochastic second order algorithm that achieves linear convergence for typical problems in machine learning while still maintaining run-times theoretically comparable to state-of-the-art first order algorithms. This relies heavily on the special structure of the optimization problem that allows our unbiased hessian estimator to be implemented efficiently, using only vector-vector products. </div>Brian Bullinsnoreply@blogger.com3tag:blogger.com,1999:blog-7233553492253101490.post-4585394218857437722011-02-21T14:16:00.000-05:002011-02-21T14:16:24.853-05:00Worst-case vs. average-case learning<div dir="ltr" style="text-align: left;" trbidi="on">Classical statistical learning theory deals with finding a consistent hypothesis given sufficiently many examples. Such a hypothesis may be a rule for classifying emails into spam/not-spam, and examples are classified emails.<br /><br />Statistical and computational learning theories tell us how many examples are needed to be able to generate a good hypothesis - one that can correctly classify instances that it had not seen, given that they are generated from the same distribution from which we have seen classified examples.<br /><br />This spam problem precisely demonstrates the caveat of making statistical assumptions - clearly the spammers are adaptive and adversarial. The characteristics of spam emails change over time, adapting to the filters.<br /><br />Here comes online learning theory: in online learning we do not make any assumptions about the data, but rather sequentially modify our hypothesis such that our average performance approaches that of the best hypothesis in hindsight. The difference between this average performance and the best-in-hindsight performance is called <i>regret</i>, and the foundation of online learning are algorithms whose regret is bounded by a sublinear function of the number of iterations (hence the average regret approaches zero).<br /><br />Can we interpolate between the two approaches ? That is, can we design algorithms which have a good regret-guarantee if the data is adversarial, but perform much better if there exists benign and simple distribution governing nature.<br /><br />This natural question was raised several times in the last few years, with some partial progress and many interesting open questions, one of which is made explicit with all required background <a href="http://ie.technion.ac.il/~ehazan/papers/openvar.pdf">here</a> (submitted to <a href="http://colt2011.sztaki.hu/cfop.html">COLT 2011 open problems session</a>).<br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com0tag:blogger.com,1999:blog-7233553492253101490.post-36258770569357753332011-02-03T02:49:00.000-05:002011-02-03T02:50:56.509-05:00Duality for undergrads<div dir="ltr" style="text-align: left;" trbidi="on">The <a href="http://en.wikipedia.org/wiki/Linear_programming">duality theorem</a> of linear programming and its extensions is a fundamental pillar of optimization. It lies at the heart of an elegant theory which allows us not only to reason about the structure of continuous optimization problems, but also design combinatorial optimization and approximation algorithms.<br /><br />John von Neumann said that <span class="Apple-style-span" style="-webkit-border-horizontal-spacing: 2px; -webkit-border-vertical-spacing: 2px; color: #454545; font-family: 'Times New Roman', Times, serif; font-size: 17px; line-height: 19px;"><i>In mathematics you don't understand things. You just get used to them.</i></span> This is particularly true for duality (which von Neumann was the first to conjecture and essentially prove). I recall encountering linear programming duality as part of a graduate course. The standard pedagogy is first to try and <i>write</i> the dual program from the primal, and much later prove weak and strong duality, a very technical way indeed. All these years since, I still haven't gotten completely used to duality, but as close as I came is in teaching it this last semester, via a non-standard way, to undergrads, many of whom have never taken a course in basic linear programming.<br /><br />I'm familiar with three ways to prove duality. The first and hardest is via fixed-point theorems, as pioneered by von-Neumann in his proof of the minimax theorem. Don't try this on your undergrads.<br /><br />The second, and most popular, is a geometric proof via the <a href="http://en.wikipedia.org/wiki/Farkas'_lemma">Farkas' lemma</a>. This is usually grad-student material.<br /><br />The way we tried this last semester proceeds as follows:<br /><br />1. Define zero-sum games, and claim equivalence to linear programming. Zero sum games are clearly a special case, to show the other direction without resorting to heavy tools requires some care, as rigorously done by <a href="http://www.optimization-online.org/DB_FILE/2010/06/2659.pdf">Adler</a>. However, the intuition is very clear, and we did not go into details.<br /><br />2. Explain the minimax theorem. Again - a very intuitive theorem, easier to grasp than duaity - and equivalent by (1).<br /><br />3. Prove the minimax theorem on the board. We are not going to use toplogy (fixed-point theorems) nor geometry (Farkas), but an elementary construct called experts algorithms. The method of proving the minimax theorem via experts algorithms is due to <a href="http://www.sciencedirect.com/science?_ob=ArticleURL&_udi=B6WFW-45GMDYD-5&_user=10&_coverDate=10/31/1999&_rdoc=1&_fmt=high&_orig=search&_origin=search&_sort=d&_docanchor=&view=c&_searchStrId=1620650650&_rerunOrigin=google&_acct=C000050221&_version=1&_urlVersion=0&_userid=10&md5=eaaa00e3349ee127d6bd7307262ba2b3&searchtype=a">Freund and Schapire</a>.<br /><br />The last step is of course the main one. Here's an almost precise account of the proof: Recall that the minimax theorem for zero sum games tells us that<br /><br />$$ \lambda_A = \max_p \min_q p^T M q = \min_q \max_p p^T M q = \lambda_B $$<br /><br />Here $M$ is the game payoff matrix (payoffs to the row player and costs to the column player), $p$ and $q$ are distributions over the strategies of the players. The relation $ \max_p \min_q p^T M q \le \min_q \max_p p^T M q $ is called "weak duality", and follows easily from reasoning. The other direction, "strong duality", is where we need experts algorithms.<br /><br />Consider a repeated game, in which the row player constructs $p_t$ at each iteration $t \in [T]$ according to an experts' algorithm. An experts algorithm guarantees the following: (this is the definition of an experts' algorithm)<br /><br />$$ \frac{1}{T} \sum_t p_t^T M q_t \geq \frac{1}{T} \max_{p_*} \sum_t p_*^T M q_t - o(1) $$<br /><br />Here $t=1,2,...,T$ are the iterations of the repeated game, $p_t$ is the experts' mixed strategy at time $t$, and $q_t$ can be anything.<br /><br />Define the average distributions to be: $\bar{p} = \frac{1}{T} \sum_t p_t$ and $\bar{q} = \frac{1}{T} \sum_t q_t$.<br /><br />We have:<br /><br />$$ \lambda_A = \max_p \min_q p^T M q \geq \min_q \bar{p}^T M q = \min_q \frac{1}{T} \sum_t p_t^T M q $$<br /><br />By minimizing over q at each game iteration instead of minimizing over q on all<br />game rounds, player B can only decrease his loss (and decrease player A's profit) and<br />hence:<br /><br />$$ \min_q \bar{p}^T M q \geq \frac{1}{T} \sum_t p_t M q_t$$<br /><br />where $q_t = \arg \min_q p_t^T M q$.<br />By the low regret property we have<br /><br />$$ \frac{1}{T} \sum_t p_t^T M q_t \geq \max_p \frac{1}{T} \sum_t p^T M q_t - o(1) $$<br /><br />Hence:<br />$$ \frac{1}{T} \sum_t p_t M q_t \geq \max_p \frac{1}{T} \sum_t p M q_t - o(1) \geq \min_q \max_p p^T M q - o(1) $$<br /><br />Overall we got:<br /><br />$$\lambda_A \geq \lambda_B - o(1)$$<br />as needed.<br /><br /><br /><br /></div>Elad Hazanhttps://plus.google.com/111108599717601698901noreply@blogger.com0