(post by Zeyuan Allen-Zhu and Elad Hazan)

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as "should I define strong-convexity?", "discuss smoothness?", "Nesterov acceleration?", etc.

This is especially acute for courses that do not deal directly with optimization, which is described as a tool for learning or as a basic building block for other algorithms. Some approaches:

This year I've tried something different in COS511 @ Princeton, which turns out also to have research significance. We've covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then

A (simplified) outline of this teaching strategy is given in chapter 2 of this book.

Given a non-strongly convex but smooth objective $f$, define a objective by

$$ \text{strong-convexity reduction:} \qquad f_1(x) = f(x) + \epsilon \|x\|^2 $$

It is straightforward to see that $f_1$ differs from $f$ by at most $\epsilon$ times a distance factor, and in addition it is $\epsilon$-strongly convex. Thus, one can apply $A$ to minimize $f_1$ and get a solution which is not too far from the optimal solution for $f$ itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.

Similar simplistic assumptions can be derived for (finite-sum forms of) non-smooth by strongly-convex functions (via randomized smoothing or Fenchel duality), and for functions that are neither smooth nor strongly-convex by just applying both reductions simultaneously. Notice that such classes of functions include famous machine learning problems such as SVM, Logistic Regression, SVM, L1-SVM, Lasso, and many others.

These theoretical concerns also translate into running time losses and parameter tuning difficulties in practice. For such reasons, researchers usually make efforts on designing unbiased methods instead.

A figure to demonstrate the practical performance of our new reduction (red dotted curve) as compared to the classical biased reduction (blue curves, with different regularizer weights) are presented in the figure below.

The following dilemma is encountered by many of my friends when teaching basic optimization: which variant/proof of gradient descent should one start with? Of course, one needs to decide on which depth of convex analysis one should dive into, and decide on issues such as "should I define strong-convexity?", "discuss smoothness?", "Nesterov acceleration?", etc.

This is especially acute for courses that do not deal directly with optimization, which is described as a tool for learning or as a basic building block for other algorithms. Some approaches:

- I teach online gradient descent, in the context of online convex optimization, and then derive the offline counterpart. This is non-standard, but permits an extremely simple derivation and gets to the online learning component first.
- Sanjeev Arora teaches basic offline GD for the smooth and strongly-convex case first.
- In OR/optimization courses the smooth (but not strongly-convex) case is many times taught first.

This year I've tried something different in COS511 @ Princeton, which turns out also to have research significance. We've covered basic GD for well-conditioned functions, i.e. smooth and strongly-convex functions, and then

**extended**these result**by reduction**to all other cases!A (simplified) outline of this teaching strategy is given in chapter 2 of this book.

### Classical Strong-Convexity and Smoothness Reductions

Given any optimization algorithm A for the well-conditioned case (i.e., strongly convex and smooth case), we can derive an algorithm for smooth but not strongly functions as follows.Given a non-strongly convex but smooth objective $f$, define a objective by

$$ \text{strong-convexity reduction:} \qquad f_1(x) = f(x) + \epsilon \|x\|^2 $$

It is straightforward to see that $f_1$ differs from $f$ by at most $\epsilon$ times a distance factor, and in addition it is $\epsilon$-strongly convex. Thus, one can apply $A$ to minimize $f_1$ and get a solution which is not too far from the optimal solution for $f$ itself. This simplistic reduction yields an almost optimal rate, up to logarithmic factors.

Similar simplistic assumptions can be derived for (finite-sum forms of) non-smooth by strongly-convex functions (via randomized smoothing or Fenchel duality), and for functions that are neither smooth nor strongly-convex by just applying both reductions simultaneously. Notice that such classes of functions include famous machine learning problems such as SVM, Logistic Regression, SVM, L1-SVM, Lasso, and many others.

### Necessity of Reductions

This is not only a pedagogical question. In fact, very few algorithms apply to the entire spectrum of strong-convexity / smoothness regimes, and thus reductions are very often

*intrinsically*necessary. To name a few examples,- Variance-reduction methods such as SAG, SAGA and SVRG require the objective to be smooth, and do not work for non-smooth problems like SVM. This is because for loss functions such as hinge loss, no unbiased gradient estimator can achieve a variance that approaches to zero
- Dual methods such as SDCA or APCG require the objective to be strongly convex, and do not directly apply to non-strongly convex problems. This is because for non-strongly convex objectives such as Lasso, their duals are not even be well-defined.
- Primal-dual methods such as SPDC require the objective to be both smooth and SC.

### Optimality and Practicality of Reductions

The folklore strong-convexity and smoothing reductions are suboptimal. Focusing on the strong-convexity reduction for instance:

- It incurs a logarithmic factor log(1/ε) in the running time so leading to slower algorithms than direct methods. For instance, if $f(x)$ is smooth but non-strongly convex, gradient descent minimizes it to an $\epsilon$ accuracy in $O(1/\epsilon)$ iterations; if one uses the reduction instead, the complexity becomes $O(\log(1/\epsilon)/\epsilon)$.
- More importantly, algorithms based on such reductions become
: the original objective value $f(x)$ does not converge to the global minimum, and if the desired accuracy is changed, one has to change the weight of the regularizer $\|x\|^2$ and restart the algorithm.**biased**

- For instance, SVRG in theory solves the Lasso problem (with smoothness L) in time $O((nd+\frac{Ld}{\epsilon})\log\frac{1}{\epsilon})$ using reduction. Later, direct and unbiased methods for Lasso are introduced, including SAGA which has running time $O(\frac{nd+Ld}{\epsilon})$, and SVRG++ which has running time $O(nd \log\frac{1}{\epsilon} + \frac{Ld}{\epsilon})$.

**black-box reduction is thus a tool to extend optimization algorithms from one domain to the rest.***optimal and unbiased*### Optimal, Unbiased, and Practical Reductions

In this paper, we give optimal and unbiased reductions. For instance, the new reduction, when applied to SVRG, implies the same running time as SVRG++ up to constants, and is unbiased so converges to the global minimum. Perhaps more surprisingly, these new results imply

**theoretical results that were not previously known by direct methods. To name two of such results:***new*- On Lasso, it gives an accelerated running time $O(nd \log\frac{1}{\epsilon} + \frac{\sqrt{nL}d}{\sqrt{\epsilon}})$ where the best known result was not only an biased algorithm but also slower $O( (nd + \frac{\sqrt{nL}d}{\sqrt{\epsilon}}) \log\frac{1}{\epsilon} )$.
- On SVM with strong convexity $\sigma$, it gives an accelerated running time $O(nd \log\frac{1}{\epsilon} + \frac{\sqrt{n}d}{\sqrt{\sigma \epsilon}})$ where the best known result was not only an biased algorithm but also slower $O( (nd + \frac{\sqrt{n}d}{\sqrt{\sigma \epsilon}}) \log\frac{1}{\epsilon} )$.

A figure to demonstrate the practical performance of our new reduction (red dotted curve) as compared to the classical biased reduction (blue curves, with different regularizer weights) are presented in the figure below.

As a final word - if you were every debating whether to post your paper on ArXiV, yet another example of how quickly it helps research propagate: only a few weeks after our paper was made available online, Woodworth and Srebro have already made use of our reductions in their new paper.

I think that this is really complicated to teach basic optimization: which variant/proof of gradient descent should one start with? This is especially acute for online translation services reviews courses that do not deal directly with optimization, which is described as a tool for learning or as a basic building block for other algorithms. One can find academic papers derive various optimization improvements many times for only one of the settings, leaving the other settings desirable.

ReplyDeleteI want to we appreciate you this passion you cash in on throughout establishing the next few paragraphs.Naturally I’ll give you a link in your internet blog. Thanks for sharing.

ReplyDeletebest autoketing, Shopify free shipping bar app, shipping bar shopify

This is a great inspiring article. I have to thank you for ones time for this especially astonishing read! You can do something much better but i still say this perfect.Keep trying for the best.

ReplyDeletefacebook live chat help, facebook messenger online, autoketing.com

email with love online

ReplyDeleteapps autoketing

email with love free

Hey there! Do you know if they make any plugins to safeguard against hackers?

I’m kinda paranoid about losing everything I’ve worked hard on. Any suggestions?

I accidentally read this and it really made me think. I will have to work hard to make my website useful and more popular. Everyone please refer to offline !!! thank you very much

ReplyDeletecurrency converter master, currency converter box by autoketing, best autoketing app

iogames for school

ReplyDeletefriv land now

abcya games for boy

This pass practice is commonly costly as professional web developers can warfare you each and every share of one from a hundred bucks to thousands of dollars, depending not in the isolate away away off from how perplexing you nonappearance your website to be

I want to read your articles.Thanks to your sharing, I can enrich my knowledge. thank you for the wonderful sharing. They are useful pieces of advice. abcya3 Games for boys juegos motox3m3 gogy for school

ReplyDeleteI'm a decent implied for all articles or blog entries, That I emphatically relished, We'd to a great degree pick much more records identified with the, because of the reality it is in reality okay

ReplyDeleteThe story of cinderella Games to play

Crime city 3d 2 Games online

Top burger Games for girls