Mirror descent
In mathematics, mirror descent is an iterative first-order optimization algorithm for finding a local minimum of a differentiable function. It generalizes algorithms such as gradient descent and the multiplicative weights update method.
History
Mirror descent was originally proposed by Nemirovski and Yudin in 1983.[1]
Motivation
In gradient descent with the sequence of learning rates [math]\displaystyle{ (\eta_n)_{n \geq 0} }[/math] applied to a differentiable function [math]\displaystyle{ F }[/math], one starts with a guess [math]\displaystyle{ \mathbf{x}_0 }[/math] for a local minimum of [math]\displaystyle{ F, }[/math] and considers the sequence [math]\displaystyle{ \mathbf{x}_0, \mathbf{x}_1, \mathbf{x}_2, \ldots }[/math] such that
- [math]\displaystyle{ \mathbf{x}_{n+1}=\mathbf{x}_n-\eta_n \nabla F(\mathbf{x}_n),\ n \ge 0. }[/math]
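The iteration above can be sketched as follows; the quadratic objective used in the usage line is an illustrative choice, not from the source.

```python
import numpy as np

def gradient_descent(grad_F, x0, eta, n_steps):
    # Iterates x_{n+1} = x_n - eta_n * grad F(x_n),
    # here with a constant learning rate eta_n = eta.
    x = np.asarray(x0, dtype=float)
    for _ in range(n_steps):
        x = x - eta * grad_F(x)
    return x

# Example: F(x) = ||x||^2 / 2, so grad F(x) = x; the minimizer is 0.
x_min = gradient_descent(lambda v: v, [1.0, 1.0], eta=0.1, n_steps=200)
```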
This can be reformulated by noting that
- [math]\displaystyle{ \mathbf{x}_{n+1}=\arg \min_{\mathbf{x}} \left(F(\mathbf{x}_n) + \nabla F(\mathbf{x}_n)^T (\mathbf{x} - \mathbf{x}_n) + \frac{1}{2 \eta_n}\|\mathbf{x} - \mathbf{x}_n\|^2\right) }[/math]
In other words, [math]\displaystyle{ \mathbf{x}_{n+1} }[/math] minimizes the first-order approximation to [math]\displaystyle{ F }[/math] at [math]\displaystyle{ \mathbf{x}_n }[/math] with an added proximity term [math]\displaystyle{ \|\mathbf{x} - \mathbf{x}_n\|^2 }[/math].
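This equivalence can be checked numerically: minimizing the linearized objective with the squared-distance proximity term over a fine grid recovers the plain gradient step. The function [math]\displaystyle{ F(x) = x^4 }[/math] is an illustrative choice, not from the source.

```python
import numpy as np

def F(x):      return x**4          # example differentiable function
def grad_F(x): return 4 * x**3

x_n, eta = 1.5, 0.01
g = grad_F(x_n)

def surrogate(x):
    # F(x_n) + grad F(x_n)^T (x - x_n) + ||x - x_n||^2 / (2 eta)
    return F(x_n) + g * (x - x_n) + (x - x_n)**2 / (2 * eta)

# Minimize the surrogate over a fine grid around x_n.
grid = np.linspace(x_n - 1.0, x_n + 1.0, 2_000_001)
x_next_grid = grid[np.argmin(surrogate(grid))]

# Closed-form minimizer: the gradient step x_n - eta * grad F(x_n).
x_next_gd = x_n - eta * g
```

The two values agree up to the grid resolution, confirming that the proximal reformulation and the gradient step coincide.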
This squared Euclidean distance term is a particular example of a Bregman distance. Using other Bregman distances yields other algorithms, such as Hedge, which may be better suited to optimization over particular geometries.[2][3]
Formulation
We are given a convex function [math]\displaystyle{ f }[/math] to optimize over a convex set [math]\displaystyle{ K \subset \mathbb{R}^n }[/math], and a norm [math]\displaystyle{ \|\cdot\| }[/math] on [math]\displaystyle{ \mathbb{R}^n }[/math].
We are also given a differentiable convex function [math]\displaystyle{ h \colon \mathbb{R}^n \to \mathbb{R} }[/math] that is [math]\displaystyle{ \alpha }[/math]-strongly convex with respect to the given norm. This is called the distance-generating function, and its gradient [math]\displaystyle{ \nabla h \colon \mathbb{R}^n \to \mathbb{R}^n }[/math] is known as the mirror map.
Starting from initial [math]\displaystyle{ x_0 \in K }[/math], in each iteration of Mirror Descent:
- Map to the dual space: [math]\displaystyle{ \theta_t \leftarrow \nabla h (x_t) }[/math]
- Update in the dual space using a gradient step: [math]\displaystyle{ \theta_{t+1} \leftarrow \theta_t - \eta_t \nabla f(x_t) }[/math]
- Map back to the primal space: [math]\displaystyle{ x'_{t+1} \leftarrow (\nabla h)^{-1}(\theta_{t+1}) }[/math]
- Project back to the feasible region [math]\displaystyle{ K }[/math]: [math]\displaystyle{ x_{t+1} \leftarrow \mathrm{arg}\min_{x \in K}D_h(x||x'_{t+1}) }[/math], where [math]\displaystyle{ D_h }[/math] is the Bregman divergence.
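The four steps above can be sketched concretely. The sketch below assumes a particular (standard) instantiation not fixed by the source: [math]\displaystyle{ K }[/math] is the probability simplex and [math]\displaystyle{ h }[/math] is the negative entropy, for which the Bregman projection onto the simplex reduces to normalization and the overall update becomes the multiplicative-weights / exponentiated-gradient rule.

```python
import numpy as np

def mirror_descent_simplex(grad_f, x0, eta, steps):
    # Distance-generating function: h(x) = sum_i x_i log x_i (negative
    # entropy), which is 1-strongly convex w.r.t. ||.||_1 on the simplex.
    # Mirror map: grad h(x) = 1 + log x, with inverse exp(theta - 1).
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        theta = 1.0 + np.log(x)          # map to the dual space
        theta = theta - eta * grad_f(x)  # gradient step in the dual
        x = np.exp(theta - 1.0)          # map back to the primal space
        x = x / x.sum()                  # Bregman projection onto K
    return x

# Example: minimize the linear function f(x) = c . x over the simplex;
# the minimum concentrates on the coordinate with the smallest cost.
c = np.array([1.0, 0.1, 0.5])
x_star = mirror_descent_simplex(lambda x: c, np.ones(3) / 3, eta=0.5, steps=200)
```

With a linear objective, each iteration multiplies the iterate coordinate-wise by [math]\displaystyle{ e^{-\eta c_i} }[/math] and renormalizes, which is exactly the Hedge update mentioned above.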
Extensions
Mirror descent in the online optimization setting is known as Online Mirror Descent (OMD).[4]
See also
- Gradient descent
- Multiplicative weight update method
- Hedge algorithm
- Bregman divergence
References
- ↑ Arkadi Nemirovsky and David Yudin. Problem Complexity and Method Efficiency in Optimization. John Wiley & Sons, 1983
- ↑ Nemirovski, Arkadi (2012). Tutorial: mirror descent algorithms for large-scale deterministic and stochastic convex optimization. https://www2.isye.gatech.edu/~nemirovs/COLT2012Tut.pdf
- ↑ "Mirror descent algorithm". https://tlienart.github.io/posts/2018/10/27-mirror-descent-algorithm/.
- ↑ Fang, Huang; Harvey, Nicholas J. A.; Portella, Victor S.; Friedlander, Michael P. (2021-09-03). "Online mirror descent and dual averaging: keeping pace in the dynamic case". arXiv:2006.02585 [cs.LG].
Original source: https://en.wikipedia.org/wiki/Mirror_descent