Anderson acceleration

In mathematics, Anderson acceleration, also called Anderson mixing, is a method for accelerating the convergence of fixed-point iterations. Introduced by Donald G. Anderson,[1] this technique can be used to find solutions of fixed-point equations [math]\displaystyle{ f(x) = x }[/math] that often arise in computational science.

Definition

Given a function [math]\displaystyle{ f:\mathbb{R}^n \to \mathbb{R}^n }[/math], consider the problem of finding a fixed point of [math]\displaystyle{ f }[/math], which is a solution to the equation [math]\displaystyle{ f(x) = x }[/math]. A classical approach to the problem is to employ a fixed-point iteration scheme;[2] that is, given an initial guess [math]\displaystyle{ x_0 }[/math] for the solution, to compute the sequence [math]\displaystyle{ x_{i+1} = f(x_i) }[/math] until some convergence criterion is met. However, the convergence of such a scheme is not guaranteed in general; moreover, the rate of convergence is usually linear, which can become too slow if the evaluation of the function [math]\displaystyle{ f }[/math] is computationally expensive.[2] Anderson acceleration is a method to accelerate the convergence of the fixed-point sequence.[2]
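
For illustration (a minimal sketch, not taken from the cited references), the plain fixed-point iteration can be written in MATLAB as follows; the map and the numerical values are only an example, chosen to match the full implementation at the end of the article.

f = @(x) sin(x) + atan(x); % Example map, the same one used in the full implementation below.
x = 1;                     % Initial guess (illustrative).
tol = 1e-6;                % Tolerance on the size of the update.
for k = 1:1000
    x_new = f(x);          % Plain fixed-point update x_{k+1} = f(x_k).
    if abs(x_new - x) < tol
        break
    end
    x = x_new;
end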

Define the residual [math]\displaystyle{ g(x) = f(x) - x }[/math], and denote [math]\displaystyle{ f_k = f(x_k) }[/math] and [math]\displaystyle{ g_k = g(x_k) }[/math] (where [math]\displaystyle{ x_k }[/math] corresponds to the sequence of iterates from the previous paragraph). Given an initial guess [math]\displaystyle{ x_0 }[/math] and an integer parameter [math]\displaystyle{ m \geq 1 }[/math], the method can be formulated as follows:[3][note 1]

[math]\displaystyle{ x_1 = f(x_0) }[/math]
[math]\displaystyle{ \forall k = 1, 2, \dots }[/math]
[math]\displaystyle{ m_k = \min\{m, k\} }[/math]
[math]\displaystyle{ G_k = \begin{bmatrix} g_{k-m_k} & \dots & g_k \end{bmatrix} }[/math]
[math]\displaystyle{ \alpha_k = \operatorname{argmin}_{\alpha\in A_k} \|G_k \alpha\|_2, \quad \text{where}\;A_k = \{\alpha = (\alpha_0, \dots, \alpha_{m_k}) \in \mathbb{R}^{m_k+1} : \sum_{i=0}^{m_k}\alpha_i = 1\} }[/math]
[math]\displaystyle{ x_{k+1} = \sum_{i=0}^{m_k}(\alpha_k)_i f_{k-m_k+i} }[/math]

where the matrix–vector product is [math]\displaystyle{ G_k \alpha = \sum_{i=0}^{m_k}(\alpha)_i g_{k-m_k+i} }[/math], and [math]\displaystyle{ (\alpha)_i }[/math] is the [math]\displaystyle{ i }[/math]th element of [math]\displaystyle{ \alpha }[/math]. Conventional stopping criteria can be used to end the iterations of the method. For example, iterations can be stopped when [math]\displaystyle{ \|x_{k+1} - x_k\| }[/math] falls below a prescribed tolerance, or when the norm of the residual [math]\displaystyle{ \|g(x_k)\| }[/math] falls below a prescribed tolerance.[2]
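
As an illustration of one iteration of this scheme (a sketch added here, not part of the cited references), the constrained least-squares problem can be solved by eliminating the constraint, expressing the last coefficient as one minus the sum of the others. The helper name anderson_step and the matrices F and G, whose columns are assumed to hold [math]\displaystyle{ f_{k-m_k}, \dots, f_k }[/math] and [math]\displaystyle{ g_{k-m_k}, \dots, g_k }[/math], are hypothetical:

function x_next = anderson_step(F, G)
    % One Anderson step: F and G are n-by-(m_k + 1) matrices whose columns are
    % f_{k-m_k}, ..., f_k and g_{k-m_k}, ..., g_k (illustrative sketch).
    m_k = size(G, 2) - 1;
    if m_k == 0
        alpha = 1;                                        % Only one column: alpha = 1.
    else
        B = G(:, 1:m_k) - G(:, m_k + 1) * ones(1, m_k);   % Columns g_{k-m_k+i} - g_k, i = 0, ..., m_k - 1.
        beta = B \ (-G(:, m_k + 1));                      % Least-squares solution of min ||g_k + B*beta||_2.
        alpha = [beta; 1 - sum(beta)];                    % Recover alpha with sum(alpha) = 1.
    end
    x_next = F * alpha;                                   % x_{k+1} = sum_i alpha_i f_{k-m_k+i}.
end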

Compared with the standard fixed-point iteration, the method has been found to converge faster and to be more robust, and in some cases it avoids the divergence of the fixed-point sequence.[3][4]

Derivation

For the solution [math]\displaystyle{ x^* }[/math], we know that [math]\displaystyle{ f(x^*) = x^* }[/math], which is equivalent to saying that [math]\displaystyle{ g(x^*) = \vec{0} }[/math]. We can therefore rephrase the problem as an optimization problem where we want to minimize [math]\displaystyle{ \|g(x)\|_2 }[/math].

Instead of going directly from [math]\displaystyle{ x_k }[/math] to [math]\displaystyle{ x_{k+1} }[/math] by choosing [math]\displaystyle{ x_{k+1} = f(x_k) }[/math] as in fixed-point iteration, consider an intermediate point [math]\displaystyle{ x'_{k+1} }[/math] chosen as the linear combination [math]\displaystyle{ x'_{k+1} = X_k \alpha_k }[/math], where the coefficient vector [math]\displaystyle{ \alpha_k \in A_k }[/math] and [math]\displaystyle{ X_k = \begin{bmatrix} x_{k-m_k} & \dots & x_k \end{bmatrix} }[/math] is the matrix containing the last [math]\displaystyle{ m_k+1 }[/math] points, and choose [math]\displaystyle{ x'_{k+1} }[/math] so that it minimizes [math]\displaystyle{ \|g(x'_{k+1})\|_2 }[/math]. Since the elements of [math]\displaystyle{ \alpha_k }[/math] sum to one and [math]\displaystyle{ g }[/math] is approximately affine in a neighborhood of the iterates, we can make the first-order approximation [math]\displaystyle{ g(X_k\alpha_k) = g\left(\sum_{i=0}^{m_k} (\alpha_k)_i x_{k-m_k+i}\right) \approx \sum_{i=0}^{m_k} (\alpha_k)_i g(x_{k-m_k+i}) = G_k\alpha_k }[/math], so the problem becomes finding the [math]\displaystyle{ \alpha }[/math] that minimizes [math]\displaystyle{ \|G_k\alpha\|_2 }[/math]. After having found [math]\displaystyle{ \alpha_k }[/math], we could in principle calculate [math]\displaystyle{ x'_{k+1} }[/math].
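
As an illustrative remark (not part of the original derivation), the sum-to-one constraint is exactly what makes this approximation consistent: if [math]\displaystyle{ g }[/math] were affine, say [math]\displaystyle{ g(x) = Jx + c }[/math], then [math]\displaystyle{ g\left(\sum_{i=0}^{m_k}(\alpha_k)_i x_{k-m_k+i}\right) = J\sum_{i=0}^{m_k}(\alpha_k)_i x_{k-m_k+i} + c = \sum_{i=0}^{m_k}(\alpha_k)_i\left(Jx_{k-m_k+i} + c\right) = \sum_{i=0}^{m_k}(\alpha_k)_i\, g(x_{k-m_k+i}) }[/math], with equality rather than approximation, since the constant term [math]\displaystyle{ c }[/math] can be distributed over the sum only because the coefficients add up to one.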

However, since [math]\displaystyle{ f }[/math] is designed to bring a point closer to [math]\displaystyle{ x^* }[/math], [math]\displaystyle{ f(x'_{k+1}) }[/math] is probably closer to [math]\displaystyle{ x^* }[/math] than [math]\displaystyle{ x'_{k+1} }[/math] is, so it makes sense to choose [math]\displaystyle{ x_{k+1}=f(x'_{k+1}) }[/math] rather than [math]\displaystyle{ x_{k+1}=x'_{k+1} }[/math]. Furthermore, since the elements of [math]\displaystyle{ \alpha_k }[/math] sum to one, we can make the first-order approximation [math]\displaystyle{ f(x'_{k+1}) = f\left(\sum_{i=0}^{m_k}(\alpha_k)_i x_{k-m_k+i}\right) \approx \sum_{i=0}^{m_k}(\alpha_k)_i f(x_{k-m_k+i}) = \sum_{i=0}^{m_k}(\alpha_k)_i f_{k-m_k+i} }[/math]. We therefore choose

[math]\displaystyle{ x_{k+1} = \sum_{i=0}^{m_k}(\alpha_k)_i f_{k-m_k+i} }[/math].

Solution of the minimization problem

At each iteration of the algorithm, the constrained optimization problem [math]\displaystyle{ \operatorname{argmin}\|G_k \alpha\|_2 }[/math], subject to [math]\displaystyle{ \alpha\in A_k }[/math], needs to be solved. The problem can be recast in several equivalent formulations,[3] yielding different solution methods which may lead to a more convenient implementation:

  • defining the matrices [math]\displaystyle{ \mathcal{G}_k = \begin{bmatrix} g_{k-m_k+1} - g_{k-m_k} & \dots & g_{k} - g_{k-1}\end{bmatrix} }[/math] and [math]\displaystyle{ \mathcal{X}_k = \begin{bmatrix} x_{k-m_k+1} - x_{k-m_k} & \dots & x_{k} - x_{k-1} \end{bmatrix} }[/math], solve [math]\displaystyle{ \gamma_k = \operatorname{argmin}_{\gamma\in\mathbb{R}^{m_k}}\|g_k - \mathcal{G}_k\gamma\|_2 }[/math], and set [math]\displaystyle{ x_{k+1} = x_k + g_k - (\mathcal{X}_k + \mathcal{G}_k)\gamma_k }[/math] (the relation between [math]\displaystyle{ \gamma_k }[/math] and [math]\displaystyle{ \alpha_k }[/math] is made explicit after this list);[3][4]
  • solve [math]\displaystyle{ \theta_k = \{(\theta_k)_i\}_{i = 1}^{m_k} = \operatorname{argmin}_{\theta\in\mathbb{R}^{m_k}}\left\|g_k + \sum_{i=1}^{m_k}\theta_i(g_{k-i} - g_k)\right\|_2 }[/math], then set [math]\displaystyle{ x_{k+1} = x_k + g_k + \sum_{j = 1}^{m_k}(\theta_k)_j(x_{k-j} - x_k + g_{k-j} - g_k) }[/math].[1]
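
For the first of these formulations, the coefficient vectors of the constrained and unconstrained problems are related by a simple change of variables (made explicit here for clarity): [math]\displaystyle{ (\alpha_k)_0 = (\gamma_k)_1, \quad (\alpha_k)_i = (\gamma_k)_{i+1} - (\gamma_k)_i \;\;\text{for}\; 1 \leq i \leq m_k - 1, \quad (\alpha_k)_{m_k} = 1 - (\gamma_k)_{m_k}, }[/math] so that, using [math]\displaystyle{ f_j = x_j + g_j }[/math], the update [math]\displaystyle{ \sum_{i=0}^{m_k}(\alpha_k)_i f_{k-m_k+i} }[/math] indeed coincides with [math]\displaystyle{ x_k + g_k - (\mathcal{X}_k + \mathcal{G}_k)\gamma_k }[/math].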

For both choices, the optimization problem is in the form of an unconstrained linear least-squares problem, which can be solved by standard methods such as QR decomposition[3] or singular value decomposition,[4] possibly combined with regularization techniques to deal with rank deficiency and conditioning issues. Solving the least-squares problem through the normal equations is generally not advisable, due to potential numerical instabilities and the typically high computational cost.[4]
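
For instance, a Tikhonov-regularized variant of the first formulation can be solved through an augmented least-squares system. This is only a sketch of one possible regularization, with an illustrative parameter value and with G_cal and g_k standing for [math]\displaystyle{ \mathcal{G}_k }[/math] and the residual [math]\displaystyle{ g_k }[/math]:

lambda = 1e-10;                        % Illustrative regularization parameter.
m_cols = size(G_cal, 2);               % Number of columns of the matrix playing the role of curly-G_k.
A_aug = [G_cal; lambda * eye(m_cols)]; % Augmented matrix of the regularized problem.
b_aug = [g_k; zeros(m_cols, 1)];       % Augmented right-hand side.
gamma_k = A_aug \ b_aug;               % Minimizes ||g_k - G_cal*gamma||_2^2 + lambda^2*||gamma||_2^2.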

Stagnation of the method (i.e. two subsequent iterations with the same value, [math]\displaystyle{ x_{k+1} = x_k }[/math]) causes the method to break down, due to the singularity of the least-squares problem. Similarly, near-stagnation ([math]\displaystyle{ x_{k+1}\approx x_k }[/math]) results in an ill-conditioned least-squares problem. Moreover, the choice of the parameter [math]\displaystyle{ m }[/math] might be relevant in determining the conditioning of the least-squares problem, as discussed below.[3]

Relaxation

The algorithm can be modified by introducing a variable relaxation parameter (or mixing parameter) [math]\displaystyle{ \beta_k \gt 0 }[/math].[1][3][4] At each step, the new iterate is computed as [math]\displaystyle{ x_{k+1} = (1 - \beta_k)\sum_{i=0}^{m_k}(\alpha_k)_i x_{k-m_k+i} + \beta_k \sum_{i=0}^{m_k}(\alpha_k)_i f(x_{k-m_k+i})\;. }[/math] The choice of [math]\displaystyle{ \beta_k }[/math] is crucial to the convergence properties of the method; in principle, [math]\displaystyle{ \beta_k }[/math] might vary at each iteration, although it is often chosen to be constant.[4]
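
Combining this with the unconstrained formulation used in the example implementation at the end of the article, the relaxed update can be written as [math]\displaystyle{ x_{k+1} = x_k + \beta_k g_k - (\mathcal{X}_k + \beta_k\mathcal{G}_k)\gamma_k }[/math] (a reformulation added here for clarity). As a minimal sketch, with an illustrative constant value of the parameter, the update line of that implementation would become:

beta = 0.5;                                                   % Illustrative constant mixing parameter.
x(k + 1) = x(k) + beta * g(k) - (X_k + beta * G_k) * gamma_k; % Relaxed Anderson update.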

Choice of m

The parameter [math]\displaystyle{ m }[/math] determines how much information from previous iterations is used to compute the new iterate [math]\displaystyle{ x_{k+1} }[/math]. On the one hand, if [math]\displaystyle{ m }[/math] is chosen too small, too little information is used and convergence may be undesirably slow. On the other hand, if [math]\displaystyle{ m }[/math] is too large, information from old iterations may be retained for too many subsequent iterations, so that again convergence may be slow.[3] Moreover, the choice of [math]\displaystyle{ m }[/math] affects the size of the optimization problem: too large a value of [math]\displaystyle{ m }[/math] may worsen the conditioning of the least-squares problem and increase the cost of its solution.[3] In general, the particular problem to be solved determines the best choice of [math]\displaystyle{ m }[/math].[3]

Choice of m_k

With respect to the algorithm described above, the choice of [math]\displaystyle{ m_k }[/math] at each iteration can be modified. One possibility is to choose [math]\displaystyle{ m_k = k }[/math] at each iteration [math]\displaystyle{ k }[/math] (sometimes referred to as Anderson acceleration without truncation).[3] This way, every new iterate [math]\displaystyle{ x_{k+1} }[/math] is computed using all the previously computed iterates. A more sophisticated technique is based on choosing [math]\displaystyle{ m_k }[/math] so as to keep the conditioning of the least-squares problem small enough.[3]
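
As a sketch of one such strategy (an illustration, not prescribed by the cited reference), in a vector-valued setting one may drop the oldest columns of the increment matrices used in the example implementation below until the triangular QR factor is acceptably conditioned; the threshold value is illustrative:

[Q, R] = qr(G_k, 0);                       % Economy-size QR factorization of curly-G_k.
while size(G_k, 2) > 1 && cond(R) > 1e10   % 1e10 is an illustrative conditioning threshold.
    G_k = G_k(:, 2:end);                   % Drop the oldest residual increment.
    X_k = X_k(:, 2:end);                   % Drop the corresponding x increment.
    [Q, R] = qr(G_k, 0);                   % Refactorize the reduced matrix.
end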

Relations to other classes of methods

Newton's method can be applied to the solution of [math]\displaystyle{ f(x) - x = 0 }[/math] to compute a fixed point of [math]\displaystyle{ f(x) }[/math] with quadratic convergence. However, such a method requires the evaluation of the exact derivative of [math]\displaystyle{ f(x) }[/math], which can be very costly.[4] Approximating the derivative by means of finite differences is a possible alternative, but it requires multiple evaluations of [math]\displaystyle{ f(x) }[/math] at each iteration, which again can become very costly. Anderson acceleration requires only one evaluation of the function [math]\displaystyle{ f(x) }[/math] per iteration, and no evaluation of its derivative. On the other hand, the convergence of an Anderson-accelerated fixed-point sequence is still linear in general.[5]
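
For comparison, a minimal sketch of Newton's method applied to [math]\displaystyle{ g(x) = f(x) - x = 0 }[/math] for the scalar map used in the example at the end of the article might read as follows; the handle names fp and dfp and the starting guess are illustrative, and the exact derivative has to be supplied by hand:

fp  = @(x) sin(x) + atan(x);         % Example map (same as in the implementation below).
dfp = @(x) cos(x) + 1 / (1 + x^2);   % Exact derivative of fp, required by Newton's method.
x = 2.5;                             % Illustrative starting guess.
for k = 1:50
    gx = fp(x) - x;                  % Residual g(x) = f(x) - x.
    if abs(gx) < 1e-12
        break
    end
    x = x - gx / (dfp(x) - 1);       % Newton step on g: x <- x - g(x) / g'(x).
end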

Several authors have pointed out similarities between the Anderson acceleration scheme and other methods for the solution of non-linear equations. In particular:

  • Eyert[6] and Fang and Saad[4] interpreted the algorithm within the class of quasi-Newton and multisecant methods, which generalize the well-known secant method, for the solution of the non-linear equation [math]\displaystyle{ g(x) = 0 }[/math]; they also showed how the scheme can be seen as a method in the Broyden class;[7]
  • Walker and Ni[3][8] showed that the Anderson acceleration scheme is equivalent to the GMRES method in the case of linear problems (i.e. when [math]\displaystyle{ f }[/math] is an affine map [math]\displaystyle{ f(x) = A\mathbf{x} + \mathbf{b} }[/math], so that finding the fixed point amounts to solving the linear system [math]\displaystyle{ (I - A)\mathbf{x} = \mathbf{b} }[/math]), and can thus be seen as a generalization of GMRES to the non-linear case; a similar result was found by Washio and Oosterlee.[9]

Moreover, several equivalent or nearly equivalent methods have been independently developed by other authors,[9][10][11][12][13] although most often in the context of some specific application of interest rather than as a general method for fixed point equations.

Example MATLAB implementation

The following is an example implementation in MATLAB of the Anderson acceleration scheme for finding a fixed point of the function [math]\displaystyle{ f(x) = \sin(x) + \arctan(x) }[/math]. Notice that:

  • the optimization problem was solved in the form [math]\displaystyle{ \gamma_k = \operatorname{argmin}_{\gamma\in\mathbb{R}^{m_k}}\|g_k - \mathcal{G}_k\gamma\|_2 }[/math] using QR decomposition;
  • the computation of the QR decomposition is sub-optimal: indeed, at each iteration a single column is added to the matrix [math]\displaystyle{ \mathcal{G}_k }[/math], and possibly a single column is removed; this fact can be exploited to efficiently update the QR decomposition with less computational effort;[14]
  • the algorithm can be made more memory-efficient by storing only the latest few iterations and residuals, if the whole vector of iterations [math]\displaystyle{ x_k }[/math] is not needed;
  • the code is straightforwardly generalized to the case of a vector-valued [math]\displaystyle{ f(x) }[/math].
f = @(x) sin(x) + atan(x); % Function whose fixed point is to be computed.
x0 = 1; % Initial guess.

k_max = 100; % Maximum number of iterations.
tol_res = 1e-6; % Tolerance on the residual.
m = 3; % Parameter m.

x = [x0, f(x0)]; % Vector of iterates x.
g = f(x) - x; % Vector of residuals.

G_k = g(2) - g(1); % Matrix of increments in residuals.
X_k = x(2) - x(1); % Matrix of increments in x.

k = 2;
while k < k_max && abs(g(k)) > tol_res
    m_k = min(k, m);
 
    % Solve the optimization problem by QR decomposition.
    [Q, R] = qr(G_k);
    gamma_k = R \ (Q' * g(k));
 
    % Compute new iterate and new residual.
    x(k + 1) = x(k) + g(k) - (X_k + G_k) * gamma_k;
    g(k + 1) = f(x(k + 1)) - x(k + 1);
 
    % Update increment matrices with new elements.
    X_k = [X_k, x(k + 1) - x(k)];
    G_k = [G_k, g(k + 1) - g(k)];
 
    n = size(X_k, 2);
    if n > m_k
        X_k = X_k(:, n - m_k + 1:end);
        G_k = G_k(:, n - m_k + 1:end);
    end
 
    k = k + 1;
end

% Prints result: Computed fixed point 2.013444 after 9 iterations
fprintf("Computed fixed point %f after %d iterations\n", x(end), k);

Notes

  1. This formulation is not the same as given by the original author;[1] it is an equivalent, more explicit formulation given by Walker and Ni.[3]

References

  1. Anderson, Donald G. (October 1965). "Iterative Procedures for Nonlinear Integral Equations". Journal of the ACM 12 (4): 547–560. doi:10.1145/321296.321305.
  2. Quarteroni, Alfio; Sacco, Riccardo; Saleri, Fausto (2007). Numerical Mathematics (2nd ed.). Springer. ISBN 978-3-540-49809-4.
  3. Walker, Homer F.; Ni, Peng (January 2011). "Anderson Acceleration for Fixed-Point Iterations". SIAM Journal on Numerical Analysis 49 (4): 1715–1735. doi:10.1137/10078356X.
  4. Fang, Haw-ren; Saad, Yousef (March 2009). "Two classes of multisecant methods for nonlinear acceleration". Numerical Linear Algebra with Applications 16 (3): 197–221. doi:10.1002/nla.617.
  5. Evans, Claire; Pollock, Sara; Rebholz, Leo G.; Xiao, Mengying (20 February 2020). "A Proof That Anderson Acceleration Improves the Convergence Rate in Linearly Converging Fixed-Point Methods (But Not in Those Converging Quadratically)". SIAM Journal on Numerical Analysis 58 (1): 788–810. doi:10.1137/19M1245384.
  6. Eyert, V. (March 1996). "A Comparative Study on Methods for Convergence Acceleration of Iterative Vector Sequences". Journal of Computational Physics 124 (2): 271–285. doi:10.1006/jcph.1996.0059.
  7. Broyden, C. G. (1965). "A class of methods for solving nonlinear simultaneous equations". Mathematics of Computation 19 (92): 577–593. doi:10.1090/S0025-5718-1965-0198670-6.
  8. Ni, Peng (November 2009). Anderson Acceleration of Fixed-point Iteration with Applications to Electronic Structure Computations (PhD thesis).
  9. Oosterlee, C. W.; Washio, T. (January 2000). "Krylov Subspace Acceleration of Nonlinear Multigrid with Application to Recirculating Flows". SIAM Journal on Scientific Computing 21 (5): 1670–1690. doi:10.1137/S1064827598338093.
  10. Pulay, Péter (July 1980). "Convergence acceleration of iterative sequences. The case of SCF iteration". Chemical Physics Letters 73 (2): 393–398. doi:10.1016/0009-2614(80)80396-4.
  11. Pulay, P. (1982). "Improved SCF convergence acceleration". Journal of Computational Chemistry 3 (4): 556–560. doi:10.1002/jcc.540030413.
  12. Carlson, Neil N.; Miller, Keith (May 1998). "Design and Application of a Gradient-Weighted Moving Finite Element Code I: in One Dimension". SIAM Journal on Scientific Computing 19 (3): 728–765. doi:10.1137/S106482759426955X.
  13. Miller, Keith (November 2005). "Nonlinear Krylov and moving nodes in the method of lines". Journal of Computational and Applied Mathematics 183 (2): 275–287. doi:10.1016/j.cam.2004.12.032.
  14. Daniel, J. W.; Gragg, W. B.; Kaufman, L.; Stewart, G. W. (October 1976). "Reorthogonalization and stable algorithms for updating the Gram-Schmidt QR factorization". Mathematics of Computation 30 (136): 772–795. doi:10.1090/S0025-5718-1976-0431641-8.