Tuesday, May 29, 2018

Taking control by convex optimization



[Photo: Rudolf Kalman receiving the Presidential Medal of Freedom]


The handsome fellow above is the late Rudolf Kalman, a pioneer of modern control, receiving the Presidential Medal of Freedom from President Barack Obama.

The setting Kalman considered, which has since been a cornerstone of modern control, is that of stateful time series. More precisely, consider a sequence of inputs and outputs (or measurements) coming from some real-world machine:
$$ x_1,x_2,...,x_T,...   \ \ \Rightarrow  \ \ \ y_1,y_2, y_3 , ..., y_T , ... $$
Examples include:

  • $y_t$ are measurements of ocean temperature in a study of global warming, and $x_t$ are various environmental knowns, such as the time of year, water acidity, etc. 
  • Language translation: $x_t$ are words in Hebrew, and $y_t$ are their English translations.
  • $x_t$ are the physical controls of a robot, i.e. forces in various directions, and $y_t$ is the location to which it moves.
Such systems are generally called dynamical systems, and one of the simplest and most useful mathematical models for capturing a subset of them is the linear dynamical system (LDS), in which the output is generated according to the following equations:
\begin{align*}
h_{t+1} & = A h_t + B x_t + \eta_t \\
y_{t+1} &  = C h_t + D x_t + \zeta_t
\end{align*}
Here $h_t$ is a hidden state, which can have very large or even infinite dimensionality, $A,B,C,D$ are linear transformations and $\eta_t,\zeta_t$ are noise vectors.
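
To make the model concrete, here is a minimal numpy sketch that simulates a small LDS; the dimensions, noise scale, and random system matrices are illustrative choices of mine, not anything specified above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions: hidden state in R^4, inputs in R^2, outputs in R^3.
n, d_in, d_out = 4, 2, 3

A = rng.normal(size=(n, n))
A *= 0.9 / np.linalg.norm(A, 2)     # keep the spectral norm below 1 (see below)
B = rng.normal(size=(n, d_in))
C = rng.normal(size=(d_out, n))
D = np.zeros((d_out, d_in))         # the post later takes D = 0 anyway

def simulate(xs, noise=0.01):
    """Roll out h_{t+1} = A h_t + B x_t + eta_t,  y_{t+1} = C h_t + D x_t + zeta_t."""
    h = np.zeros(n)
    ys = []
    for x in xs:
        y = C @ h + D @ x + noise * rng.normal(size=d_out)
        h = A @ h + B @ x + noise * rng.normal(size=n)
        ys.append(y)
    return np.array(ys)

xs = rng.normal(size=(100, d_in))
ys = simulate(xs)
```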

This setting is general enough to capture a variety of machine learning models previously considered in isolation, such as HMMs (hidden Markov models), principal and independent component analysis, mixtures of Gaussians, and many more; see this survey by Roweis and Ghahramani.

Alas, without any further assumptions the Kalman filtering model is non-convex and computationally hard, so general polynomial-time algorithms for efficient worst-case prediction were not known until the exciting recent developments I'll survey below.

If the linear transformations $A,B,C,D$ are known, then it is possible to efficiently predict $y_t$ given all previous inputs and outputs of the system using convex optimization. Indeed, this is what the Kalman filter famously achieves, and in an optimal way. Similarly, if there is no hidden state, the problem becomes trivially convex, and thus easily solvable (see recent blog posts and papers by the Recht group further analyzing this case).
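
For intuition, here is a minimal, textbook-style sketch of the Kalman filter for this known-system case, taking $D=0$ and assuming known noise covariances $Q, R$ (standard assumptions I am adding for the sketch, not details from the post):

```python
import numpy as np

def kalman_predict(A, B, C, Q, R, xs, ys, n):
    """One-step output predictions for a known LDS via the standard Kalman filter.

    Q, R: covariances of the state noise eta_t and the observation noise zeta_t.
    Returns the predicted outputs C @ h_t made before each measurement update.
    """
    h = np.zeros(n)                  # state estimate
    P = np.eye(n)                    # state-estimate covariance
    preds = []
    for x, y in zip(xs, ys):
        # Predict step: propagate the state and its uncertainty.
        h = A @ h + B @ x
        P = A @ P @ A.T + Q
        preds.append(C @ h)
        # Update step: fold in the observed y via the Kalman gain.
        S = C @ P @ C.T + R
        K = P @ C.T @ np.linalg.inv(S)
        h = h + K @ (y - C @ h)
        P = (np.eye(n) - K @ C) @ P
    return np.array(preds)
```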

However, if the system given by the transformations A,B,C,D is unknown, recovering them from observations is called "system identification", a notoriously hard problem. Even for HMMs, which can be seen as a special case of LDS (see the RG survey), the identification problem is known to be NP-hard without additional assumptions. (Aside: HMMs are identifiable with further distributional assumption, see the breakthrough paper of Hsu, Kakade and Zhang, which started the recent tensor revolution in theoretical ML).

In the face of NP-hardness, the state of the art reduces to a set of heuristics. Amongst them are:
  • By far the most common is the use of EM, or expectation maximization, to alternately solve for the system and the hidden states. Since this is a heuristic for non-convex optimization, global optimality is not guaranteed, and indeed EM can settle on a local optimum, as shown here.
  • Gradient descent: this wonderful heuristic can be applied to the non-convex problem of finding both the system and the hidden states simultaneously. In a striking recent result, Hardt and Ma showed that under some generative assumptions, gradient descent converges to the global minimum. 

This is where we return to the convex path: using the failsafe methodology of convex relaxation and regret minimization in games, in a recent set of results our group has found assumption-free, efficient algorithms for prediction in general LDS, as well as some other control extensions. At the heart of these results is a new method of spectral filtering. This method requires some mathematical exposition, which we'll defer to a future, longer post. For the rest of this post, we'll describe the autoregressive (a.k.a. regression) methodology for LDS, which, like our result, is a worst-case, convex-relaxation-based method.

To see how this works, consider the LDS equations above, and notice that for zero-mean noise the expected output at time $t+1$ is given by (we take $D=0$ for simplicity):
$$ E[y_{t+1} ] =  \sum_{i=1}^t  C A^i B x_{t-i} $$
This formulation has few parameters to learn, namely only the three matrices $A,B,C$, but it has a very visible non-convex component: $A$ raised to high powers.

The obvious convex relaxation takes $M_i = C A^i B$ as a set of matrices for $i=1,...,t$. We now have a set of linear equations of the form:
$$ y_{t+1} = \sum_{i=1}^t M_i x_{t-i} $$
which can be solved by your favorite convex linear-regression optimizer. The flaw is also clear: the number of parameters scales with the number of observations, and thus we cannot hope to generalize.

Luckily, many systems are well behaved in the following way. First, notice that the very definition of an LDS requires the spectral norm of $A$ to be bounded by one; otherwise we have signal explosion, as its magnitude grows indefinitely with time. It is thus not too unreasonable to assume a strict bound away from one, say $\|A\| \leq 1- \delta$, for some small constant $\delta >0$. This allows us to bound the magnitude of the $i$'th matrix $M_i$, which is supposed to represent $C A^i B$, by $(1-\delta)^i$.

With this assumption, and since $(1-\delta)^k \leq \epsilon$ as soon as $k \geq \frac{1}{\delta} \log \frac{1}{\epsilon}$, we can take only $ k = O(\frac{1}{\delta} \log \frac{1}{\epsilon})$ matrices and write
$$ y_{t+1} = \sum_{i=1}^k M_i x_{t-i} + O(\epsilon) $$

This folklore method, called an autoregressive model, or simply learning an LDS by regression, is very powerful despite its simplicity. It has the usual wonderful merits of online learning by convex relaxation: it is worst-case (no generative assumptions), agnostic, and online. Caveats? We have a linear dependence on $\frac{1}{\delta}$, the inverse spectral gap of the system, which can be very large for many interesting systems.
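
Here is a minimal sketch of this autoregressive method, fitting the $M_i$ by online gradient descent on the squared loss; the learning rate and the exact indexing convention are my own illustrative choices:

```python
import numpy as np

def ar_predictor(xs, ys, k, lr=0.05):
    """Learn M_1..M_k online for the model y_t ~ sum_i M_i x_{t-i}
    (0-based arrays here, so the time index is shifted relative to the post)."""
    T, d_in = xs.shape
    d_out = ys.shape[1]
    M = np.zeros((k, d_out, d_in))
    preds = np.zeros_like(ys)
    for t in range(k, T):
        window = xs[t - k : t][::-1]           # x_{t-1}, x_{t-2}, ..., x_{t-k}
        pred = np.einsum('kij,kj->i', M, window)
        preds[t] = pred
        grad = pred - ys[t]                    # gradient of 0.5 * squared loss wrt pred
        M -= lr * np.einsum('i,kj->kij', grad, window)
    return preds, M

# E.g., on the xs, ys from the simulation sketch earlier: ar_predictor(xs, ys, k=20)
```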

However, it is possible to remove this gap completely and learn arbitrarily ill-conditioned LDS without any dependence on the condition number. This method is also online, agnostic, and worst-case, and is based on a new tool: wave-filtering. In essence, it applies the online regression method to a specially designed fixed filtering of the input. To see how these filters are designed, and their theoretical properties, check out our recent papers.
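
At the risk of preempting that longer post: as I understand the construction, the filters are the top eigenvectors of a fixed Hankel matrix, and one then runs online regression on the projections of the input history onto these filters. A rough numerical sketch, with all specifics (the matrix, the number of filters) to be checked against the papers:

```python
import numpy as np

def wave_filters(T, num_filters):
    """Top eigenvectors of the fixed Hankel matrix Z_ij = 2 / ((i+j)^3 - (i+j)),
    1-indexed; this is my reading of the construction, not a verbatim quote."""
    idx = np.arange(1, T + 1)
    s = idx[:, None] + idx[None, :]
    Z = 2.0 / (s**3 - s)
    eigvals, eigvecs = np.linalg.eigh(Z)      # ascending eigenvalues
    return eigvecs[:, -num_filters:][:, ::-1] # largest first

def featurize(xs, filters):
    """Project the (reversed) input history onto each filter; regression is then
    run on these fixed features instead of on the raw inputs."""
    return np.array([filters[:, j] @ xs[::-1] for j in range(filters.shape[1])])
```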






Thursday, March 1, 2018

Unsupervised learning III: worst-case compression-based metrics that generalize


This post is a long-overdue sequel to parts I and II of my previous posts on unsupervised learning.  The long time it took me to publish it is not by chance: I am still ambivalent about the merit, or lack thereof, of the notion of worst-case PAC/statistical learnability (or, god forbid, regret minimization) without supervision.

And yet,
1. I owe a followup to the stubborn few readers of the blog who have made it thus far.
2. The algorithms we'll discuss are mathematically correct, and even if completely useless in practice, can serve as amusement to us, fans of regret minimization in convex games.
3. I really want to get on to more exciting recent developments in control that have kept me busy at night...

So, here goes:

The idea is to define unsupervised learning as PAC learning in the usual sense, with a real-valued objective function, and specially-structured hypothesis classes.

  • A hypothesis is a pair of functions, one mapping the domain to the range and the other vice versa, called an encoder-decoder pair. 
  • The real-valued loss is also composed of two parts: the reconstruction error of the encoding-decoding scheme, and a term proportional to the encoding length. 
Et voilà, all notions of generalization carry over from statistical learning theory, together with capacity concepts such as Rademacher complexity.
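
As a toy rendering of the definition (my own illustration; the trade-off weight `beta` is not part of the formal definition in the paper):

```python
import numpy as np

def unsupervised_loss(encode, decode, code_length, xs, beta=1.0):
    """Loss of an encoder-decoder hypothesis: average reconstruction error
    plus a term proportional to the encoding length."""
    recon = np.mean([np.linalg.norm(x - decode(encode(x))) ** 2 for x in xs])
    length = np.mean([code_length(encode(x)) for x in xs])
    return recon + beta * length

# E.g., a rank-k linear auto-encoder with a d x k matrix V:
# encode = lambda x: V.T @ x; decode = lambda c: V @ c; code_length = lambda c: len(c)
```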

More significantly, these worst-case definitions allow improper learning of NP-hard concepts. For the fully precise definitions, see the paper with Tengyu Ma. The latter also gives a striking example of a very well-studied NP-hard problem, namely dictionary learning, showing that it can be learned in polynomial time using convex relaxations.

Below, I'll give a much simpler example that perhaps illustrates the concepts more easily, one that arose in conversations with our postdocs Roi Livni and Pravesh Kothari.

This example concerns learning the mixture-of-Gaussians (MOG) model, one of the most well-studied problems in unsupervised machine learning. In this setting, a given distribution over data is assumed to be a mixture of normal distributions, and the goal is to associate each data point with the corresponding Gaussian, as well as to identify these Gaussians.

The MOG model has long served as a tool for scientific discovery, see e.g. this blog post by Moritz Hardt and links therein. However, computationally it is not well behaved in terms of worst-case complexity. Even the simpler "k-means" model is known to be NP-hard.

Here is where things get interesting: using reconstruction error and compression as a metric, we can learn MOG by a seemingly different unsupervised learning method - Principal Component Analysis (PCA)! The latter admits very efficient linear-algebra-based algorithms.

More formally: let $X \in R^{m \times d}$ be a data matrix whose rows are sampled i.i.d. from some arbitrary distribution $x \sim D$. Assume w.l.o.g. that all norms are at most one. Given a set of $k$ centers of Gaussians, $\mu_1,...,\mu_k$, we wish to minimize the reconstruction error given by
$$ E_{x \sim D} [ \min_i \| \mu_i - x \|^2  ]  $$
A more general formulation, allowing several means to contribute to the reconstruction error, is
$$ E_{x \sim D} [ \min_{ \| \alpha_x \|_2 \leq 1 } \| \sum_{j} (\alpha_x)_j \mu_j - x \|^2 ] $$
Let $M \in R^{d \times k}$ be the matrix of all $\mu$ vectors. Then we can write the optimization problem as
$$  \min_{M} E_{x \sim D} [ \min_{ \| \alpha_x\|_2 \leq 1  }  \|   x  - M \alpha_x \|^2  ]  $$
This corresponds to encoding by the vector $\alpha_x$ rather than by a single coordinate. Ignoring the norm constraint for a moment (this is part of why the argument is only semi-formal), the closed-form least-squares solution for $\alpha_x$ is given by the Moore-Penrose pseudo-inverse:
$$ \arg \min_{\alpha_x } \| x - M \alpha_x \|^2 = M^{\dagger} x $$
and the objective becomes
$$ \min_{\alpha_x } \| x - M \alpha_x \|^2 = \| x - M M^{\dagger} x \|^2 , $$
the squared Euclidean distance from $x$ to the column span of $M$ (the matrix $M M^{\dagger}$ is exactly the orthogonal projection onto that span).
Thus, we're left with the optimization problem
$$ \min_{M} E_{x \sim D} [ \| x - M M^{\dagger} x \|^2 ] $$
which amounts to PCA: the optimal span of $M$ is the subspace of the top $k$ principal components of the data.
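
A minimal sketch of this MOG-by-PCA pipeline; the data-generating parameters are arbitrary illustrations, and I skip the normalization assumed above:

```python
import numpy as np

rng = np.random.default_rng(0)

# Sample from a mixture of k spherical Gaussians in R^d (illustrative parameters).
d, k, m = 20, 3, 2000
centers = rng.normal(size=(k, d))
X = centers[rng.integers(k, size=m)] + 0.1 * rng.normal(size=(m, d))

# Uncentered PCA, matching the objective above: the best k-dimensional subspace
# in expected squared reconstruction error is spanned by the top-k right
# singular vectors of the data matrix.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt[:k].T                                   # d x k orthonormal basis for span(M)

# M M^+ is the orthogonal projection onto span(M); here it equals V V^T.
residual = X - X @ V @ V.T
print("mean squared reconstruction error:", np.mean(np.sum(residual ** 2, axis=1)))
```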

We have just semi-formally seen how MOG can be learned by PCA!

A more general theorem applies also to MOGs that are not spherical; some details appear in the paper with Tengyu, in the section on spectral auto-encoders.

Keen readers will notice that we lost something in compression along the way: the encoding is now a real $k$-dimensional vector, as opposed to a 1-hot $k$-dimensional vector, and thus takes $k \log \frac{1}{\epsilon}$ bits to represent in binary (to accuracy $\epsilon$), as opposed to $O(\log k)$. We leave it as an open question to come up with an efficient poly-$\log(k)$ compression that matches MOG. The solver gets free hummus at Sayeed's, my treat!







Thursday, June 8, 2017

Hyperparameter optimization and the analysis of Boolean functions

I've always admired the beautiful theory of the analysis of Boolean functions (see the excellent book by Ryan O'Donnell, as well as Gil Kalai's lectures). Heck, it used to be my main area of investigation back in the early grad-school days, when we were studying hardness of approximation, the PCP theorem, and the "new" (at the time) technology of Fourier analysis for Boolean functions. This technology also gave rise to seminal results on learnability with respect to the uniform distribution. Uniform-distribution learning has been widely criticized as unrealistic, and the focus of the theoretical machine learning community has shifted in recent years to (mostly) convex, real-valued, agnostic learning.

It is thus particularly exciting to me that some of the amazing work on Boolean learning can be used for the very practical problem of hyperparameter optimization (HO), which on the face of it has nothing to do with the uniform distribution. Here is the draft, joint work with Adam Klivans and Yang Yuan.
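
As I read the draft, the approach fits a sparse, low-degree polynomial in the Fourier (parity) basis to noisy evaluations of the objective, and then recurses on the few variables that matter. A rough one-stage sketch, with illustrative parameters and sklearn's Lasso standing in for the sparse-recovery step:

```python
import itertools
import numpy as np
from sklearn.linear_model import Lasso

def parity_features(Z, degree=2):
    """Fourier (parity) basis chi_S(z) = prod_{i in S} z_i, over all |S| <= degree."""
    n = Z.shape[1]
    subsets = [S for r in range(degree + 1)
                 for S in itertools.combinations(range(n), r)]
    feats = np.column_stack([np.prod(Z[:, list(S)], axis=1) for S in subsets])
    return feats, subsets

# Illustrative use; f stands in for an expensive validation loss of z in {-1,1}^n.
rng = np.random.default_rng(0)
n, samples = 10, 200
Z = rng.choice([-1.0, 1.0], size=(samples, n))
f = lambda z: z[0] * z[1] - z[2] + 0.1 * rng.normal()
y = np.array([f(z) for z in Z])

feats, subsets = parity_features(Z)
model = Lasso(alpha=0.05).fit(feats, y)
# The few large coefficients point at the parities (hence hyperparameters)
# that actually matter.
```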

Saturday, October 29, 2016

Unsupervised Learning II: the power of improper convex relaxations

In this continuation post I'd like to bring out a critical component of learning theory that is essentially missing from today's approaches to unsupervised learning: improper learning by convex relaxations.

Consider the task of learning a sparse linear separator. The sparsity, or $\ell_0$, constraint is non-convex and computationally hard. Here comes the Lasso: replace the $\ell_0$ constraint with its $\ell_1$ convex relaxation --- et voilà --- you've enabled polynomial-time solvable convex optimization.
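
A minimal sketch of the relaxation in action, via plain ISTA (iterative soft-thresholding) rather than a library call, so the convexity is in plain view; the step size and regularization are illustrative:

```python
import numpy as np

def soft_threshold(v, tau):
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def lasso_ista(X, y, lam=0.1, iters=500):
    """min_w 0.5*||Xw - y||^2 + lam*||w||_1 via proximal gradient descent."""
    step = 1.0 / np.linalg.norm(X, 2) ** 2     # 1 / Lipschitz constant of the gradient
    w = np.zeros(X.shape[1])
    for _ in range(iters):
        w = soft_threshold(w - step * X.T @ (X @ w - y), step * lam)
    return w

# Recovering a sparse signal (illustrative sizes):
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))
w_true = np.zeros(50); w_true[:3] = 1.0        # 3-sparse ground truth
y = X @ w_true + 0.01 * rng.normal(size=100)
w_hat = lasso_ista(X, y)
```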

Thursday, October 6, 2016

A mathematical definition of unsupervised learning?

Extracting structure from data is considered by many to be the frontier of machine learning. Yet even defining "structure", or the very goal of learning structure from unlabelled data, is not well understood or rigorously defined.

In this post we'll give a little background to unsupervised learning and argue that, compared to supervised learning, we lack a well-founded theory to tackle even the most basic questions. In the next post, we'll introduce a new candidate framework.

Sunday, July 31, 2016

Faster Than SGD 1: Variance Reduction

SGD is the workhorse of large-scale optimization. In my mind, there have so far been two fundamentally different improvements since its original introduction: (1) variance reduction, and (2) acceleration. In this post I'd like to survey (1), and I'd especially like to thank those ICML'16 participants who pushed me to write this post.
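
As a teaser, here is a minimal SVRG-flavored sketch of what variance reduction means in code (my own illustration, with arbitrary parameter choices):

```python
import numpy as np

def svrg(grad_i, w0, n, lr=0.1, epochs=20, inner_steps=None):
    """SVRG-style variance reduction: stochastic gradients are recentered around
    a full gradient computed at a periodic snapshot, so the estimate stays
    unbiased while its variance vanishes as w approaches the snapshot/optimum."""
    rng = np.random.default_rng(0)
    w = w0.copy()
    m = inner_steps or n
    for _ in range(epochs):
        snap = w.copy()
        full_grad = np.mean([grad_i(snap, i) for i in range(n)], axis=0)
        for _ in range(m):
            i = rng.integers(n)
            g = grad_i(w, i) - grad_i(snap, i) + full_grad   # variance-reduced estimate
            w -= lr * g
    return w

# Illustrative use on least squares, f_i(w) = 0.5 * (a_i . w - b_i)^2:
rng = np.random.default_rng(1)
A, b = rng.normal(size=(200, 10)), rng.normal(size=200)
w_opt = svrg(lambda w, i: (A[i] @ w - b[i]) * A[i], np.zeros(10), n=200, lr=0.05)
```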

Wednesday, July 6, 2016