Mirror descent

In mathematics, mirror descent is an iterative optimization algorithm for finding a local minimum of a differentiable function. It generalizes algorithms such as gradient descent and multiplicative weights.

History

Mirror descent was originally proposed by Nemirovski and Yudin in 1983.[1]

Motivation

In gradient descent with the sequence of learning rates [math]\displaystyle{ (\eta_n)_{n \geq 0} }[/math] applied to a differentiable function [math]\displaystyle{ F }[/math], one starts with a guess [math]\displaystyle{ \mathbf{x}_0 }[/math] for a local minimum of [math]\displaystyle{ F, }[/math] and considers the sequence [math]\displaystyle{ \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots }[/math] such that

[math]\displaystyle{ \mathbf{x}_{n+1}=\mathbf{x}_n-\eta_n \nabla F(\mathbf{x}_n),\ n \ge 0. }[/math]

This can be reformulated by noting that

[math]\displaystyle{ \mathbf{x}_{n+1}=\arg \min_{\mathbf{x}} \left(F(\mathbf{x}_n) + \nabla F(\mathbf{x}_n)^T (\mathbf{x} - \mathbf{x}_n) + \frac{1}{2 \eta_n}\|\mathbf{x} - \mathbf{x}_n\|^2\right) }[/math]

In other words, [math]\displaystyle{ \mathbf{x}_{n+1} }[/math] minimizes the first-order approximation to [math]\displaystyle{ F }[/math] at [math]\displaystyle{ \mathbf{x}_n }[/math] with an added proximity term [math]\displaystyle{ \|\mathbf{x} - \mathbf{x}_n\|^2 }[/math].

This squared Euclidean distance term is a particular example of a Bregman distance. Using other Bregman distances yields other algorithms, such as Hedge, which may be better suited to optimization over particular geometries.[2][3]
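As a numerical illustration (a sketch, not part of the article; the one-dimensional test function and step size are arbitrary choices), one can check that the minimizer of the first-order surrogate with the squared-distance proximity term coincides with the plain gradient-descent step [math]\displaystyle{ \mathbf{x}_n - \eta_n \nabla F(\mathbf{x}_n) }[/math]:

```python
import numpy as np

# Check numerically that minimizing the surrogate
#   F(x_n) + F'(x_n) (x - x_n) + (x - x_n)^2 / (2 eta)
# reproduces the gradient-descent step x_n - eta * F'(x_n).

F = lambda x: x**4 - 3 * x**2 + x     # arbitrary smooth test function
dF = lambda x: 4 * x**3 - 6 * x + 1   # its derivative

eta, x_n = 0.05, 0.8

# Minimize the surrogate on a fine grid.
grid = np.linspace(-3.0, 3.0, 2_000_001)
surrogate = F(x_n) + dF(x_n) * (grid - x_n) + (grid - x_n) ** 2 / (2 * eta)
x_surrogate = grid[np.argmin(surrogate)]

# Explicit gradient-descent step.
x_gd = x_n - eta * dF(x_n)

assert abs(x_surrogate - x_gd) < 1e-4
```

Setting the surrogate's derivative to zero gives [math]\displaystyle{ \nabla F(\mathbf{x}_n) + (\mathbf{x} - \mathbf{x}_n)/\eta_n = 0 }[/math], i.e. exactly the gradient-descent update, which the grid search confirms.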

Formulation

We are given a convex function [math]\displaystyle{ f }[/math] to optimize over a convex set [math]\displaystyle{ K \subset \mathbb{R}^n }[/math], and a norm [math]\displaystyle{ \|\cdot\| }[/math] on [math]\displaystyle{ \mathbb{R}^n }[/math].

We are also given a differentiable convex function [math]\displaystyle{ h \colon \mathbb{R}^n \to \mathbb{R} }[/math] that is [math]\displaystyle{ \alpha }[/math]-strongly convex with respect to the given norm. This is called the distance-generating function, and its gradient [math]\displaystyle{ \nabla h \colon \mathbb{R}^n \to \mathbb{R}^n }[/math] is known as the mirror map.

Starting from an initial [math]\displaystyle{ x_0 \in K }[/math], each iteration of mirror descent proceeds as follows:

  • Map to the dual space: [math]\displaystyle{ \theta_t \leftarrow \nabla h (x_t) }[/math]
  • Update in the dual space using a gradient step: [math]\displaystyle{ \theta_{t+1} \leftarrow \theta_t - \eta_t \nabla f(x_t) }[/math]
  • Map back to the primal space: [math]\displaystyle{ x'_{t+1} \leftarrow (\nabla h)^{-1}(\theta_{t+1}) }[/math]
  • Project back to the feasible region [math]\displaystyle{ K }[/math]: [math]\displaystyle{ x_{t+1} \leftarrow \arg\min_{x \in K} D_h(x \,\|\, x'_{t+1}) }[/math], where [math]\displaystyle{ D_h }[/math] is the Bregman divergence.
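The four steps above can be sketched in code for one concrete choice of distance-generating function (an illustrative assumption, not prescribed by the article): taking [math]\displaystyle{ h(x) = \sum_i x_i \log x_i }[/math] on the probability simplex gives the mirror map [math]\displaystyle{ (\nabla h(x))_i = 1 + \log x_i }[/math], makes the Bregman (KL) projection onto the simplex a simple renormalization, and recovers the multiplicative-weights update.

```python
import numpy as np

# Mirror descent with the negative-entropy distance-generating function
# h(x) = sum_i x_i log x_i over the probability simplex K.

def mirror_descent_entropy(grad_f, x0, eta, steps):
    x = np.asarray(x0, dtype=float)
    for t in range(steps):
        theta = 1.0 + np.log(x)          # map to the dual space: grad h(x)
        theta = theta - eta * grad_f(x)  # gradient step in the dual space
        x = np.exp(theta - 1.0)          # map back: (grad h)^{-1}(theta)
        x = x / x.sum()                  # Bregman (KL) projection onto K
    return x

# Example (an assumed test problem): minimize the linear function
# f(x) = c^T x over the simplex; the minimum sits at the vertex with
# the smallest coefficient c_i.
c = np.array([0.9, 0.5, 0.1])
x = mirror_descent_entropy(lambda x: c, np.full(3, 1 / 3), eta=0.5, steps=200)
assert np.argmax(x) == np.argmin(c)
```

Composing the three non-projection steps gives [math]\displaystyle{ x_i \propto x_i e^{-\eta c_i} }[/math] per iteration, which is exactly the multiplicative-weights rule mentioned in the introduction.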

Extensions

Mirror descent in the online optimization setting is known as Online Mirror Descent (OMD).[4]

References

  1. Nemirovsky, Arkadi; Yudin, David (1983). Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons.
  2. Nemirovski, Arkadi (2012). "Tutorial: Mirror Descent Algorithms for Large-Scale Deterministic and Stochastic Convex Optimization". https://www2.isye.gatech.edu/~nemirovs/COLT2012Tut.pdf
  3. "Mirror descent algorithm". https://tlienart.github.io/posts/2018/10/27-mirror-descent-algorithm/
  4. Fang, Huang; Harvey, Nicholas J. A.; Portella, Victor S.; Friedlander, Michael P. (2021-09-03). "Online mirror descent and dual averaging: keeping pace in the dynamic case". arXiv:2006.02585 [cs.LG].