Consider the famous composite convex minimization problem

\begin{equation}\label{eqn:the-problem}

\min_{x\in \mathbb{R}^d} \Big\{ F(x) := f(x) + \psi(x) := \frac{1}{n}\sum_{i=1}^n f_i(x) + \psi(x) \Big\} \enspace, \tag{1}

\end{equation}

in which $f(x) = \frac{1}{n}\sum_{i=1}^n f_i(x)$ is a finite average of $n$ functions, and $\psi(x)$ is simple "proximal" function such as the $\ell_1$ or $\ell_2$ norm. In this finite-sum form, each function $f_i(x)$ usually represents the loss function with respect to the $i$-th data vector. Problem \ref{eqn:the-problem} arises in many places:

- convex classification and regression problems (e.g. Lasso, SVM, Logistic Regression) fall into \ref{eqn:the-problem}.
- some notable non-convex problems including PCA, SVD, CCA can be reduced to \ref{eqn:the-problem}.
- the neural net objective can be written in \ref{eqn:the-problem} as well although the function $F(x)$ becomes non-convex; in any case, methods solving convex versions of \ref{eqn:the-problem} sometimes do generalize to non-convex settings.

## Recall: Stochastic Gradient Descent (SGD)

To minimize objective $F(x)$, stochastic gradient methods iteratively perform the following update

$$x_{k+1} \gets \mathrm{argmin}_{y\in \mathbb{R}^d} \Big\{ \frac{1}{2 \eta } \|y-x_k\|_2^2 + \langle \tilde{\nabla}_k, y \rangle + \psi(y) \Big\} \enspace,$$

where $\eta$ is the step length and $\tilde{\nabla}_k$ is a random vector satisfying $\mathbb{E}[\tilde{\nabla}_k] = \nabla f(x_k)$ and is referred to as the

*gradient estimator*. If the proximal function $\psi(y)$ equals zero, the update reduces to $x_{k+1} \gets x_k - \eta \tilde{\nabla}_k$.
A popular choice for the gradient estimator is $\tilde{\nabla}_k = \nabla f_i(x_k)$ for some random index $i \in [n]$ per iteration, and methods based on this choice are known as

*stochastic gradient descent (SGD)*. Since computing $\nabla f_i(x)$ is usually $n$ times faster than that of $\nabla f(x)$, SGD enjoys a low per-iteration cost as compared to full-gradient methods; however, SGD cannot converge at a rate faster than $1/\varepsilon$ even if $F(\cdot)$ is very nice.## Key Idea: Variance Reduction Gives Faster SGD

The theory of*variance reduction*states that, SGD can converge much faster if one makes a better choice of the gradient estimator $\tilde{\nabla}_k$, so that its variance "reduces as $k$ increases". Of course, such a better choice must have (asymptotically) the same per-iteration cost as compared with SGD.

There are two (but only two!) fundamentally different ways to choose a better gradient estimator, the first one is known as SVRG, and the second one is known as SAGA. Both of them require each function $f_i(x)$ to be smooth, but such a requirement is not intrinsic and can be somehow removed.

- Choice 1: the SVRG estimator (my favorite!)

Keep a snapshot vector $\tilde{x} = x_k$ every $m$ iterations (where $m$ is some parameter usually around $2n$), and compute the full gradient $\nabla f(\tilde{x})$ only for such snapshots. Then, set

$$\tilde{\nabla}_k := \nabla f_i (x_k) - \nabla f_i(\tilde{x}) + \nabla f(\tilde{x})$$

where $i$ is randomly chosen from $1,\dots,n$. The amortized cost of computing $\tilde{\nabla}_k$ is only 3/2 times bigger than SGD if $m=2n$ and if we store $\nabla f_i(\tilde{x})$ in memory. - Choice 2: the SAGA estimator.

Store in memory $n$ vectors $\phi_1,\dots,\phi_n$ and set all of them to be zero at the beginning. Then, in each iteration $k$, set

$$\tilde{\nabla}_k := \nabla f_i(x_k) - \nabla f_i(\phi_i) + \frac{1}{n} \sum_{j=1}^n \nabla f_j(\phi_j)$$

where $i$ is randomly chosen from $1,\dots,n$. Then, very importantly, update $\phi_i \gets x_k$ for this $i$. If properly implemented, the per-iteration cost to compute $\tilde{\nabla}_k$ is the same as SGD.

## How Variance is Reduced?

Both choices of gradient estimators ensure that the variance of $\tilde{\nabla}_k$ approaches to zero as $k$ grows. In a rough sense, both of them ensure that $\mathbb{E}[\|\tilde{\nabla}_k - \nabla f(x_k)\|^2] \leq O(f(x_k) - f(x^*))$ so the variance decreases as we approach to the minimizer $x^*$. The proof of this is two-lined if $\psi(x)=0$ and requires a little more effort in the general setting, see for instance Lemma A.2 of this paper.Using this key observation one can prove that, if $F(x)$ is $\sigma$-strongly convex and if each $f_i(x)$ is $L$-smooth, then the "gradient complexity" (i.e., # of computations of $\nabla f_i (x)$) of variance-reduced SGD methods to minimize Problem \ref{eqn:the-problem} is only $O\big( \big(n + \frac{L}{\sigma} \big) \log \frac{1}{\varepsilon}\big)$. This is much faster than the original SGD method.

## Is Variance Reduction Significant?

Short answer: NO when first introduced, but gradually becoming YES, YES and YES.Arguably the original purpose of variance reduction is to make SGD run faster on convex classification / regression problems. However, variance-reduction methods cannot beat the earlier proposed coordinate-descent method SDCA, and performs much worse than its accelerated variant too (see for instance the comparison here.)

Then why is variance reduction useful at all? The answer is on the generality of Problem \eqref{eqn:the-problem}. In all classification and regression problems, each $f_i(x)$ is of a restricted form $loss(\langle a_i, x\rangle; b_i)$ where $a_i$ is the $i$-th data vector and $b_i$ is its label. However, Problem \eqref{eqn:the-problem} is a much bigger class, and each function $f_i(x)$ can encode a complicated structure of the learning problem. In the extreme case, $f_i(x)$ could encode a neural network where $x$ characterize the weights of the connections (so becoming nonconvex). For such general problems, SDCA does not work at all.

In sum, variance reduction methods, although converging in the same speed as SDCA, applies more widely.

## The History of Variance Reduction

There are too many variance-reduction papers that even an expert sometime can't keep track of all of them. Below, let me point out the only interesting papers that one should cite (others prove similar results)

- The first variance-reduction method is SAG. However, in my experiments SAG performs very poor, partially because its gradient estimator is biased, i.e., $\mathbb{E}[\tilde{\nabla}_k] \neq \nabla f(x_k)$.
- SAGA is a simple unbiased fix of SAG, and performs very great in practice.
- SVRG was actually independently discovered by two groups of authors, group 1 and group 2. Perhaps because there's no experiment in it, the first group's paper quickly got unnoticed (cited by 24) and the second one becomes very famous (cited by 200+). What a pity.

Because the only two practical methods are SAGA and SVRG, one may ask which one runs faster? My answer is

- It depends on the structure of the dataset: a corollary of this paper suggests that if the feature vectors are pairwisely close, then SVRG is better, and vice versa.
- If the objective is not strongly convex (such as Lasso), then a simple modification of SVRG outperforms both SVRG and SAGA.
- Also, when $f_i(x)$ is a general function, SAGA requires $O(nd)$ memory storage which could be too large to load into memory; SVRG only needs $O(d)$.

## What's Next Beyond Variance Reduction?

There are many works that tried to extend SVRG to other settings. Most of them are no-so-interesting tweaks, but there are three fundamental extensions.

- Shalev-Shwartz first studied Problem \ref{eqn:the-problem} but each $f_i(x)$ is non-convex (although the summation $f(x)$ is convex). He showed that SVRG also works there, and this has been later better formalized and slightly improved. This class of problems has given rise to the fastest low-rank solvers (in theory) on SVD and related problems.
- Elad and I showed that SVRG also works for totally non-convex functions $F(x)$. This is independently discovered by another group of authors. They published at least two more papers on this problem too, one supporting proximal, and one proving SAGA's variant.

- This March, I found a way to design accelerated SGD, and the technique is based on variance reduction. (Recall that SGD, when equipped with Nesterov's momentum, does not perform well; it has been an open question regarding how to fix that, and in this paper I found out that variance reduction is the key.) I plan to talk about it in my next post in one week or two.