Łojasiewicz inequality


In real algebraic geometry, the Łojasiewicz inequality, named after Stanisław Łojasiewicz, gives an upper bound for the distance of a point to the nearest zero of a given real analytic function. Specifically, let $f : U \to \mathbb{R}$ be a real analytic function on an open set $U$ in $\mathbb{R}^n$, and let $Z$ be the zero locus of $f$. Assume that $Z$ is not empty. Then for any compact set $K$ in $U$, there exist positive constants $\alpha$ and $C$ such that, for all $x$ in $K$,

$$\operatorname{dist}(x, Z)^\alpha \le C\,|f(x)|.$$

Here, $\alpha$ can be large.
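As a simple illustration (a standard example, not specific to the sources cited here), take $f(x) = x^2$ on $U = \mathbb{R}$, so that $Z = \{0\}$ and $\operatorname{dist}(x, Z) = |x|$. Then

$$\operatorname{dist}(x, Z)^2 = x^2 = |f(x)|,$$

so the inequality holds on every compact set with $\alpha = 2$ and $C = 1$. For $f(x) = x^{2m}$, which vanishes to higher order, any valid exponent must satisfy $\alpha \ge 2m$, which is why $\alpha$ may need to be large.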

The following form of this inequality is often seen in more analytic contexts: with the same assumptions on $f$, for every $p \in U$ there is a possibly smaller open neighborhood $W$ of $p$ and constants $\theta \in (0, 1)$ and $c > 0$ such that, for all $x \in W$,

$$|f(x) - f(p)|^\theta \le c\,|\nabla f(x)|.$$
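Continuing the toy example above, for $f(x) = x^2$ and $p = 0$ we have $\nabla f(x) = 2x$, so

$$|f(x) - f(0)|^{1/2} = |x| = \tfrac{1}{2}\,|\nabla f(x)|,$$

and the inequality holds globally with $\theta = 1/2$ and $c = 1/2$.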

Polyak inequality

A special case of the Łojasiewicz inequality, due to Boris Polyak (see condition C in [1]), is commonly used to prove linear convergence of gradient descent algorithms. This section is based on (Karimi, Nutini & Schmidt 2016) and (Liu, Zhu & Belkin 2022).

Definitions

$f : \mathbb{R}^d \to \mathbb{R}$ is a function with a continuous gradient $\nabla f$.

$X^*$ is the subset of $\mathbb{R}^d$ on which $f$ achieves its global minimum (if one exists). Throughout this section, we assume that such a global minimum value $f^*$ exists, unless otherwise stated. The optimization objective is to find some point $x$ in $X^*$.

$\mu, L > 0$ are constants.

$\nabla f$ is $L$-Lipschitz continuous iff

$$\|\nabla f(x) - \nabla f(y)\| \le L\,\|x - y\|, \quad \forall x, y.$$

$f$ is $\mu$-strongly convex iff

$$f(y) \ge f(x) + \nabla f(x)^T (y - x) + \frac{\mu}{2}\,\|y - x\|^2, \quad \forall x, y.$$

$f$ is $\mu$-PL (where "PL" means "Polyak-Łojasiewicz") iff

$$\frac{1}{2}\,\|\nabla f(x)\|^2 \ge \mu\,\bigl(f(x) - f^*\bigr), \quad \forall x.$$
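As a standard illustration (not from the sources cited here): the convex quadratic $f(x) = \frac{1}{2} x^T Q x$ with $Q \succeq 0$ symmetric has $f^* = 0$, and in an eigenbasis of $Q$,

$$\frac{1}{2}\,\|\nabla f(x)\|^2 = \frac{1}{2}\sum_i \lambda_i^2 x_i^2 \ge \lambda_{\min}^{+} \cdot \frac{1}{2}\sum_i \lambda_i x_i^2 = \lambda_{\min}^{+}\,\bigl(f(x) - f^*\bigr),$$

where $\lambda_{\min}^{+}$ is the smallest nonzero eigenvalue of $Q$. Hence $f$ is $\lambda_{\min}^{+}$-PL even when $Q$ is singular, in which case $f$ is not strongly convex for any $\mu > 0$.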

Basic properties

Theorem — 1. If $f$ is $\mu$-PL, then it is invex.

2. If $\nabla f$ is $L$-Lipschitz continuous, then $f(y) \le f(x) + \langle \nabla f(x), y - x \rangle + \frac{L}{2}\,\|y - x\|^2$.

3. If $f$ is $\mu$-strongly convex, then it is $\mu$-PL.

4. If $g$ is $\mu$-strongly convex and $A$ is linear, then $f := g \circ A$ is $(\mu\sigma^2)$-PL, where $\sigma$ is the smallest nonzero singular value of $A$.

5. (quadratic growth) If $f$ is $\mu$-PL, $x$ is a point, and $x^*$ is the point of the optimum set closest to $x$ in the $L^2$-norm, then $f(x) \ge f(x^*) + \frac{\mu}{2}\,\|x - x^*\|_2^2$.
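The PL inequality can also be checked numerically. The following is a minimal sketch (a hypothetical setup, not from the sources): by property 4, the rank-deficient least-squares objective $f(x) = \frac{1}{2}\|Ax - b\|^2 = (g \circ A)(x)$, with $g(y) = \frac{1}{2}\|y - b\|^2$ being $1$-strongly convex, is $\sigma^2$-PL, yet it is not strongly convex.

```python
import numpy as np

rng = np.random.default_rng(0)
# Rank-deficient A (shape 10x5, rank 3): f is PL but not strongly convex.
A = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 5))
b = rng.standard_normal(10)

f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)

# Global minimum value via a least-squares solution.
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

# By property 4 above: mu = (smallest nonzero singular value of A)^2.
svals = np.linalg.svd(A, compute_uv=False)
mu = min(s for s in svals if s > 1e-10) ** 2

# Check (1/2)*||grad f(x)||^2 >= mu*(f(x) - f*) at random points.
for _ in range(1000):
    x = rng.standard_normal(5)
    assert 0.5 * np.linalg.norm(grad_f(x)) ** 2 >= mu * (f(x) - f_star) - 1e-8
print("PL inequality held at 1000 random points")
```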


Gradient descent

Theorem (linear convergence of gradient descent) — If $f$ is $\mu$-PL and $\nabla f$ is $L$-Lipschitz, then gradient descent with constant step size $\eta$,

$$x_{k+1} = x_k - \eta\,\nabla f(x_k),$$

converges linearly as

$$f(x_k) - f(x^*) \le \bigl(1 - 2\mu\eta\,(1 - L\eta/2)\bigr)^k\,\bigl(f(x_0) - f(x^*)\bigr), \quad \forall \eta \in (0, 2/L).$$

The convergence is fastest when $\eta = 1/L$, at which point

$$f(x_k) - f(x^*) \le (1 - \mu/L)^k\,\bigl(f(x_0) - f(x^*)\bigr).$$
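A minimal sketch of this rate on the rank-deficient least-squares problem from the previous section (hypothetical setup; the constants $L$ and $\mu$ come from the singular values of $A$):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((10, 3)) @ rng.standard_normal((3, 5))  # rank-deficient
b = rng.standard_normal(10)
f = lambda x: 0.5 * np.linalg.norm(A @ x - b) ** 2
grad_f = lambda x: A.T @ (A @ x - b)

svals = np.linalg.svd(A, compute_uv=False)
L = svals[0] ** 2                             # Lipschitz constant of grad f
mu = min(s for s in svals if s > 1e-10) ** 2  # PL constant
x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

x = np.zeros(5)
gap0 = f(x) - f_star
for k in range(1, 101):
    x = x - (1.0 / L) * grad_f(x)  # constant step size eta = 1/L
    # The observed gap stays below the guaranteed (1 - mu/L)^k envelope.
    assert f(x) - f_star <= (1 - mu / L) ** k * gap0 + 1e-12
print("optimality gap after 100 steps:", f(x) - f_star)
```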


Corollary — 1. $x_k$ converges to the optimum set $X^*$ at a rate of $(1 - \mu\eta\,(2 - L\eta))$.

2. If $f$ is $\mu$-PL and not constant, and $\nabla f$ is $L$-Lipschitz, then $L \ge \mu$.

3. Under the same conditions, gradient descent with the optimal step size (which might be found by line search) satisfies

$$f(x_k) - f(x^*) \le (1 - \mu/L)^k\,\bigl(f(x_0) - f(x^*)\bigr).$$
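One way to realize the optimal step size in practice is backtracking (Armijo) line search. The following is a minimal sketch with hypothetical parameter choices, not the specific procedure analyzed in the sources:

```python
import numpy as np

def gd_with_backtracking(f, grad_f, x0, steps=100, eta0=1.0, beta=0.5, c=0.5):
    """Gradient descent where each step size is found by backtracking:
    shrink eta until the sufficient-decrease (Armijo) condition holds."""
    x = x0
    for _ in range(steps):
        g = grad_f(x)
        eta = eta0
        while f(x - eta * g) > f(x) - c * eta * np.dot(g, g):
            eta *= beta  # halve the candidate step
        x = x - eta * g
    return x
```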

Coordinate descent

The coordinate descent algorithm first samples a random coordinate $i_k$ uniformly, then performs a gradient descent step along that coordinate:

$$x_{k+1} = x_k - \eta\,\nabla_{i_k} f(x_k)\, e_{i_k}.$$
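A minimal sketch of this update rule (hypothetical names; for clarity it evaluates the full gradient and keeps one coordinate, whereas an efficient implementation would compute only the $i_k$-th partial derivative):

```python
import numpy as np

def coordinate_descent(grad_f, x0, eta, steps, seed=0):
    """Randomized coordinate descent: x_{k+1} = x_k - eta * grad_i f(x_k) * e_i."""
    rng = np.random.default_rng(seed)
    x = x0.copy()
    d = x.shape[0]
    for _ in range(steps):
        i = rng.integers(d)         # sample coordinate i_k uniformly
        x[i] -= eta * grad_f(x)[i]  # update only coordinate i_k
    return x
```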

Theorem — Assume that $f$ is $\mu$-PL, and that $\nabla f$ is $L$-Lipschitz in each coordinate, meaning that

$$|\nabla_i f(x + t e_i) - \nabla_i f(x)| \le L\,|t|.$$

Then $\mathbb{E}[f(x_k) - f(x^*)]$ converges linearly for all $\eta \in (0, 2/L)$:

$$\mathbb{E}[f(x_k) - f(x^*)] \le \left(1 - \frac{\mu\eta\,(2 - L\eta)}{d}\right)^{k}\,\bigl(f(x_0) - f(x^*)\bigr).$$


Stochastic gradient descent

In stochastic gradient descent, we have a function to minimize $f(x)$, but we cannot sample its gradient directly. Instead, we sample a random gradient $\nabla f_i(x)$, where the $f_i$ are such that $f(x) = \mathbb{E}_i[f_i(x)]$. For example, in typical machine learning, $x$ is the parameter vector of a neural network, $f_i(x)$ is the loss incurred on the $i$-th training data point, and $f(x)$ is the average loss over all training data points.

The gradient update step is

$$x_{k+1} = x_k - \eta_k\,\nabla f_{i_k}(x_k),$$

where $\eta_k > 0$ is a sequence of learning rates (the learning rate schedule).
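A minimal sketch of this update (hypothetical names: `grad_i(x, i)` stands for $\nabla f_i(x)$ and `schedule(k)` for $\eta_k$):

```python
import numpy as np

def sgd(grad_i, n, x0, schedule, steps, seed=0):
    """SGD: x_{k+1} = x_k - eta_k * grad f_{i_k}(x_k), i_k uniform in {0,...,n-1}."""
    rng = np.random.default_rng(seed)
    x = x0
    for k in range(steps):
        i = rng.integers(n)                 # sample a component f_i
        x = x - schedule(k) * grad_i(x, i)  # stochastic gradient step
    return x
```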

Theorem — If each $\nabla f_i$ is $L$-Lipschitz, $f$ is $\mu$-PL, and $f$ has global minimum $f^*$, then

$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - 2\eta_k\mu)\,[f(x_k) - f^*] + \frac{L\eta_k^2}{2}\,\mathbb{E}_i\bigl[\|\nabla f_i(x_k)\|^2\bigr].$$

We can also write it using the variance of the gradient's $L^2$ norm:

$$\mathbb{E}[f(x_{k+1}) - f^*] \le \bigl(1 - \mu\,(2\eta_k - L\eta_k^2)\bigr)\,[f(x_k) - f^*] + \frac{L\eta_k^2}{2}\,\mathbb{E}_i\bigl[\|\nabla f_i(x_k) - \nabla f(x_k)\|^2\bigr].$$


As stated, the proposition is difficult to use. It becomes easier to apply under some further assumptions.

The second moment on the right can be removed by assuming a uniform upper bound. That is, if there exists some $C > 0$ such that during the SGD process we have $\mathbb{E}_i[\|\nabla f_i(x_k)\|^2] \le C$ for all $k = 0, 1, \ldots$, then

$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - 2\eta_k\mu)\,[f(x_k) - f^*] + \frac{LC\eta_k^2}{2}.$$

Similarly, if $\mathbb{E}_i[\|\nabla f_i(x_k) - \nabla f(x_k)\|^2] \le C$ for all $k$, then

$$\mathbb{E}[f(x_{k+1}) - f^*] \le \bigl(1 - \mu\,(2\eta_k - L\eta_k^2)\bigr)\,[f(x_k) - f^*] + \frac{LC\eta_k^2}{2}.$$

Learning rate schedules

For a constant learning rate schedule, with $\eta_k = \eta = 1/L$, we have

$$\mathbb{E}[f(x_{k+1}) - f^*] \le (1 - \mu/L)\,[f(x_k) - f^*] + \frac{C}{2L}.$$

By induction, we have

$$\mathbb{E}[f(x_k) - f^*] \le (1 - \mu/L)^k\,[f(x_0) - f^*] + \frac{C}{2\mu}.$$

We see that the loss first decreases exponentially in expectation, but then stops decreasing, which is caused by the $\frac{C}{2L}$ term added at each step. In short, because the gradient descent steps are too large, the variance in the stochastic gradient starts to dominate, and $x_k$ starts doing a random walk in the vicinity of $X^*$.

For a decreasing learning rate schedule with $\eta_k = O(1/k)$, we have $\mathbb{E}[f(x_k) - f^*] = O(1/k)$.
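A sketch contrasting the two schedules on a toy finite-sum least-squares problem (all constants hypothetical): the constant rate plateaus at a noise floor, while the decaying rate keeps shrinking the expected gap.

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 50, 5
A = rng.standard_normal((n, d))
b = rng.standard_normal(n)

f = lambda x: 0.5 * np.mean((A @ x - b) ** 2)   # f(x) = E_i[f_i(x)]
grad_i = lambda x, i: A[i] * (A[i] @ x - b[i])  # gradient of f_i(x) = (a_i.x - b_i)^2 / 2

x_star, *_ = np.linalg.lstsq(A, b, rcond=None)
f_star = f(x_star)

def run(schedule, steps=20000):
    x = np.zeros(d)
    for k in range(steps):
        i = rng.integers(n)
        x = x - schedule(k) * grad_i(x, i)
    return f(x) - f_star

gap_const = run(lambda k: 0.01)                  # constant eta: plateaus at a floor
gap_decay = run(lambda k: 0.1 / (1 + 0.01 * k))  # eta_k = O(1/k): keeps improving
print(f"constant eta gap: {gap_const:.2e}, decaying eta gap: {gap_decay:.2e}")
```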

References

  1. Polyak, B. T. (1963). "Gradient methods for the minimisation of functionals". USSR Computational Mathematics and Mathematical Physics 3 (4): 864–878. doi:10.1016/0041-5553(63)90382-3.
  • Bierstone, Edward; Milman, Pierre D. (1988), "Semianalytic and subanalytic sets", Publications Mathématiques de l'IHÉS 67 (67): 5–42, doi:10.1007/BF02699126, ISSN 1618-1913, http://www.numdam.org/item?id=PMIHES_1988__67__5_0 
  • Ji, Shanyu; Kollár, János; Shiffman, Bernard (1992), "A global Łojasiewicz inequality for algebraic varieties", Transactions of the American Mathematical Society 329 (2): 813–818, doi:10.2307/2153965, ISSN 0002-9947, http://www.ams.org/journals/tran/1992-329-02/S0002-9947-1992-1046016-6/ 
  • Karimi, Hamed; Nutini, Julie; Schmidt, Mark (2016). "Linear Convergence of Gradient and Proximal-Gradient Methods Under the Polyak–Łojasiewicz Condition". arXiv:1608.04636 [cs.LG].
  • Liu, Chaoyue; Zhu, Libin; Belkin, Mikhail (2022-07-01). "Loss landscapes and optimization in over-parameterized non-linear systems and neural networks". Applied and Computational Harmonic Analysis. Special Issue on Harmonic Analysis and Machine Learning 59: 85–116. doi:10.1016/j.acha.2021.12.009. ISSN 1063-5203. https://www.sciencedirect.com/science/article/abs/pii/S106352032100110X.