EAS 657
Geophysical Inverse Theory
Robert L. Nowack
Lecture 12
From an earlier lecture, we showed that the stochastic inverse solution minimizes

F(x) = (y - g(x))^T C_D^{-1} (y - g(x)) + (x - x_0)^T C_M^{-1} (x - x_0)

where x are the model parameters, x_0 are the prior model parameters, C_D and C_M are the data and prior model covariances, and g(x) is the predicted data from the forward problem, which is, in general, nonlinear.
A linearization procedure involves expanding g(x) in a Taylor series about the current model and keeping only the linear term,

g(x) ≈ g(x_k) + A_k (x - x_k)

where y is the observed data and A is the Frechet derivative operator or the sensitivity operator, which in component form is A_ij = ∂g_i/∂x_j, with A_k denoting this operator evaluated at x_k.
If at the kth iteration the model is x_k, then

x_{k+1} = x_k + δx_k

where δx_k is chosen to minimize

F(x) ≈ [y - g(x_k) - A_k δx]^T C_D^{-1} [y - g(x_k) - A_k δx] + (x_k + δx - x_0)^T C_M^{-1} (x_k + δx - x_0).

Differentiating with respect to δx and setting this to zero gives

(A_k^T C_D^{-1} A_k + C_M^{-1}) δx_k = A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0).

This then results in

x_{k+1} = x_k + (A_k^T C_D^{-1} A_k + C_M^{-1})^{-1} { A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0) }

where in the linear case this converges in one iteration. In the nonlinear case, A_k is the sensitivity matrix ∂g/∂x evaluated at x_k.
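For concreteness, the following is a minimal numerical sketch of this update; the names g, jacobian, and the use of dense numpy arrays and explicit covariance inverses are assumptions for illustration, not part of the lecture notes.

import numpy as np

def stochastic_gauss_newton_step(x_k, x_0, y, g, jacobian, C_D, C_M):
    # One linearized stochastic inverse update:
    # x_{k+1} = x_k + (A^T C_D^{-1} A + C_M^{-1})^{-1}
    #                 [A^T C_D^{-1} (y - g(x_k)) - C_M^{-1} (x_k - x_0)]
    A = jacobian(x_k)                     # sensitivity (Frechet derivative) at x_k
    CDinv = np.linalg.inv(C_D)
    CMinv = np.linalg.inv(C_M)
    H = A.T @ CDinv @ A + CMinv           # approximate Hessian of F(x)
    rhs = A.T @ CDinv @ (y - g(x_k)) - CMinv @ (x_k - x_0)
    return x_k + np.linalg.solve(H, rhs)  # solve, rather than form H^{-1} explicitly

In the linear case g(x) = Ax, a single application of this step returns the stochastic inverse solution.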
Tarantola (1984), while investigating the seismic migration problem, noted that since for a nonlinear problem this is only a linearization, it may not be worthwhile to evaluate the operator (A_k^T C_D^{-1} A_k + C_M^{-1})^{-1} exactly. If this is just set to a small constant ε, then

x_{k+1} = x_k + ε { A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0) }.

This forms a descent algorithm where no matrix inversions are required. This is important since for large data sets it may be prohibitive to evaluate large matrix inversions.
Ex) In mantle tomography, there may be up to several million travel time observations. Also, the mantle is often divided into as many as 100,000 pixels. This would result in a 100,000 x 100,000 matrix inversion, which is prohibitive.
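A sketch of the corresponding inversion-free descent step follows; the step size eps is a hypothetical tuning constant, and CDinv and CMinv are assumed to be precomputed (often diagonal) inverse covariances.

def gradient_descent_step(x_k, x_0, y, g, jacobian, CDinv, CMinv, eps=1e-2):
    # Descent step with the Hessian inverse replaced by a small constant eps,
    # so only the adjoint (transpose) of A_k is applied and nothing is inverted.
    A = jacobian(x_k)
    grad = A.T @ CDinv @ (y - g(x_k)) - CMinv @ (x_k - x_0)
    return x_k + eps * grad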
Iterative procedures
We will confine our attention here to descent algorithms. We first look at the total error surface

F(x) = (y - g(x))^T C_D^{-1} (y - g(x)) + (x - x_0)^T C_M^{-1} (x - x_0)

where g(x) is the computed forward problem. The vector ∂F/∂x is the gradient of the total error surface. This is a vector which points in the direction of maximum ascent of the function F(x).
The gradient vector for a scalar temperature field points in the direction of maximum temperature change.
Ex) Direction of temperature gradient
Ex) The gradient vector for topography points in the direction of maximum ascent (perpendicular to contours of equal elevation).
Now, linearize the forward problem as

g(x) ≈ g(x_k) + A_k (x - x_k)

where A_k is the sensitivity or Frechet derivative operator ∂g/∂x evaluated at x_k. Then,

F(x) ≈ (1/2) |y - g(x_k) - A_k (x - x_k)|^2

where we have used a simplified objective function (unit data covariance and no prior term) for simplicity. The gradient vector is then

∂F/∂x = -A_k^* [y - g(x_k) - A_k (x - x_k)]

and points in the direction of maximum ascent. Let,

h_k = ∂F/∂x evaluated at x = x_k, so that h_k = -A_k^* [y - g(x_k)]

where A_k^* is the adjoint of A_k.
We want to construct a descent algorithm where at each iteration we move in the direction of maximum descent, thus

δx_k = -α_k h_k

where α_k is a scalar which tells us how far to move. Thus,

x_{k+1} = x_k - α_k h_k.
We now have to find a way of picking α_k. We want to pick α_k so that we move along the line until we reach a minimum and are tangent to some contour of F(x). We thus want dF(x_k - α h_k)/dα = 0, or

(A_k h_k)^T [y - g(x_k) + α A_k h_k] = 0

where we have used the linearized residual y - g(x_k - α h_k) ≈ y - g(x_k) + α A_k h_k. Then, since A_k^* [y - g(x_k)] = -h_k, this results in

-h_k^T h_k + α (A_k h_k)^T (A_k h_k) = 0.

Finally, we get

α_k = (h_k^T h_k) / [(A_k h_k)^T (A_k h_k)] = |h_k|^2 / |A_k h_k|^2.
This algorithm is called the Steepest Descent algorithm. Note that in the nonlinear case, and when applied to the full stochastic inverse objective function, this α_k can only be expected to be an approximation. The following steps are used:
1) Compute A_k, the sensitivity operator ∂g/∂x evaluated at x_k, and also compute g(x_k), the forward problem. Hopefully this can be done in the same step.
2) Compute h_k = -A_k^* [y - g(x_k)], where A_k^* is the adjoint and is easy to compute given A_k.
3) Compute the scalar weight α_k. We may opt for an approximate specification or computation of this scalar for nonlinear problems.
4) Update the model: x_{k+1} = x_k - α_k h_k.
5) Repeat until the gradient h_k is small. Thus, in the descent approach, no inverses are required. The figure below shows an illustration of this, where the contours of the objective function to be minimized are shown as well as the iterative descent path to a minimum.
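The following is a compact sketch of these five steps for the simplified objective F(x) = (1/2)|y - g(x)|^2, assuming dense numpy arrays and placeholder functions g and jacobian (the names and the stopping tolerance are illustrative, not from the lecture).

import numpy as np

def steepest_descent(x_start, y, g, jacobian, n_iter=50, tol=1e-8):
    # Steepest descent for F(x) = 0.5 * ||y - g(x)||^2
    x = x_start.copy()
    for _ in range(n_iter):
        A = jacobian(x)                 # 1) sensitivity operator A_k and forward problem g(x_k)
        r = y - g(x)                    #    data residual
        h = -A.T @ r                    # 2) gradient h_k = -A_k* (y - g(x_k))
        if np.linalg.norm(h) < tol:     # 5) stop when the gradient is small
            break
        Ah = A @ h
        alpha = (h @ h) / (Ah @ Ah)     # 3) line-search step for the linearized problem
        x = x - alpha * h               # 4) move downhill
    return x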
One problem with this approach is that for highly elongated error surfaces (characterized by a wide spread of singular values of A_k), many iterations may be required for convergence.
Convergence improvement can be obtained by using several techniques:
1) Scaling the problem to reduce the spread of the singular values of A_k.
2) Using a modified descent algorithm called the conjugate gradient method.
Ex) Given y and , as well as , and (no knowledge of the range of x), then,
Let,
, and
then
and
(Note cancels in the combined term )
Let Ax = y be,
The solution to these equations is . Now let,
.
The total error surface is then,
(Note the factor of 1/2 added for simplicity later), and
Thus,
The gradient vector is then,
or
Now for our case,
and
For the first iteration of a descent algorithm,
The first iteration then gives
Further iterations would then be required to converge to the solution.
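Since the numerical values of the original example are not reproduced here, the short run below applies the steepest_descent sketch above to a hypothetical 2x2 system; the matrix and data are stand-in values chosen only to show the iterative convergence toward the least-squares solution.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # stand-in sensitivity matrix
y = np.array([3.0, 4.0])                # stand-in data; exact solution is x = (1, 1)

x_est = steepest_descent(np.zeros(2), y,
                         g=lambda x: A @ x,
                         jacobian=lambda x: A,
                         n_iter=200)
print(x_est)                            # approaches np.linalg.solve(A, y)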
We will find that multiplying the data residual by the adjoint of the sensitivity matrix is kinematically equivalent to seismic migration. An iterative sequence of seismic migrations then results in a complete seismic inversion (Tarantola, 1984). For a complete seismic inversion, the ultimate goal is to obtain the model parameters at each point in the medium, which would imply many unknowns, and usually iterative methods would be required.
Conjugate Gradient Method
The idea of the conjugate gradient method is to modify the descent direction in order to reduce the number of iterations required. The process is then continued as before,

x_{k+1} = x_k + α_k p_k

where now p_k is a modified descent vector and α_k is chosen to minimize F along a line search in the direction p_k.
For simplicity, let's just look at the linear problem g(x_k) = A x_k and let C_D = I with C_M^{-1} = 0. Then, we will minimize

F(x) = (1/2) |y - A x|^2.

Letting Q = A^*A and b = A^*y, where Q is a symmetric and positive definite matrix (all eigenvalues > 0), then

F(x) = (1/2) x^T Q x - b^T x + (1/2) y^T y

so that minimizing F(x) is equivalent to solving Q x = b.
For a positive definite Q, a set of "Q-orthogonal" vectors p_0, ..., p_N can be constructed with a modified inner product such that

p_i^T Q p_j = 0, for i ≠ j.
The vector x can be expanded in a generalized Fourier series in terms of p_0, ..., p_N as,

x = Σ_k α_k p_k

where α_k = (p_k^T Q x) / (p_k^T Q p_k). But, we want Q x = b, or p_k^T Q x = p_k^T b, and so α_k = (p_k^T b) / (p_k^T Q p_k). Then,

x = Σ_k [(p_k^T b) / (p_k^T Q p_k)] p_k.
For x_k = Σ_{i<k} α_i p_i, then p_k^T Q x_k = 0, so that

α_k = (p_k^T b) / (p_k^T Q p_k) = -(p_k^T h_k) / (p_k^T Q p_k)

where h_k = Q x_k - b is the grad vector above. Then, an iterative sequence can be constructed as

x_{k+1} = x_k + α_k p_k

where

α_k = -(p_k^T h_k) / (p_k^T Q p_k).
Note that the p_k are a "Q-orthogonal" basis for the domain, where

p_i^T Q p_j = 0 for i ≠ j.
Note that this is just a modified Gram-Schmidt procedure. For the linear case, it is guaranteed to converge to the correct answer in at most N iterations for an N x N Q matrix. All we need to do is be able to compute the "Q-orthogonal" vectors p_i.
In addition, the set of partial solutions x_k, k < N, minimizes F(x_k) not just on each line segment, but on the whole linear subspace spanned by p_0, ..., p_{k-1} (at least for a quadratic F(x_k)). Thus, the gradient error vector h_k is perpendicular to all p_i, i = 0, ..., k-1. Thus,

p_i^T h_k = 0 for i = 0, ..., k-1.
This is a very useful property which allows us to compute the vectors pi from the gradient vectors hk.
The Conjugate Gradient Algorithm
1) Choose x_0 and set p_0 = -h_0 = b - Q x_0. Move to x_1 = x_0 + α_0 p_0, then compute the new gradient h_1.
2) The vector p_1 must be in the space spanned by {h_0, h_1} and must be "Q-orthogonal" to p_0.
3) Continue this procedure to find other pk.
In other words, the vectors {p_1, ..., p_k} are the Q-orthogonalized gradient vectors {h_1, ..., h_k}. This leads to the procedure:
For any x_0, define p_0 = -h_0 = b - Q x_0, then

x_{k+1} = x_k + α_k p_k, with α_k = -(h_k^T p_k) / (p_k^T Q p_k)

(where h_k = Q x_k - b is the computed gradient vector).
We construct p_{k+1} such that p_{k+1} is Q-orthogonal to p_i, i = 0, ..., k, where

p_{k+1} = -h_{k+1} + β_k p_k, with β_k = (h_{k+1}^T Q p_k) / (p_k^T Q p_k).

For i = k, the two terms in p_{k+1}^T Q p_i cancel, giving zero. For i < k, the second term is zero and the first term is zero since the subspace [p_0, ..., p_k] is also spanned by [h_0, ..., h_k].
In summary, for a linear problem Q x = b, with Q an (N x N) matrix, we can solve this using several approaches:
1) Using matrix inversion, this can be solved in one step.
2) Using the steepest descent method, there will be a geometric error reduction at each step.
3) Using the conjugate gradient method, this will converge in at most N steps.
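A minimal sketch of the linear conjugate gradient iteration described above is given below for Q x = b; here Q and b would be formed explicitly for clarity, although in practice only products with A and its adjoint are needed (variable names are illustrative).

import numpy as np

def conjugate_gradient(Q, b, x_start=None, tol=1e-10):
    # Conjugate gradient for Q x = b, Q symmetric positive definite.
    # In exact arithmetic this converges in at most N steps for an N x N matrix.
    n = len(b)
    x = np.zeros(n) if x_start is None else x_start.copy()
    h = Q @ x - b                       # gradient h_k = Q x_k - b
    p = -h                              # first descent direction p_0 = -h_0
    for _ in range(n):
        if np.linalg.norm(h) < tol:
            break
        Qp = Q @ p
        alpha = -(h @ p) / (p @ Qp)     # exact line-search step length
        x = x + alpha * p               # move along p_k
        h = Q @ x - b                   # new gradient h_{k+1}
        beta = (h @ Qp) / (p @ Qp)      # makes p_{k+1} Q-orthogonal to p_k
        p = -h + beta * p
    return x

For the least-squares problem above, conjugate_gradient(A.T @ A, A.T @ y) returns the same solution as the steepest descent example, in at most N iterations.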
Now for the more general error surface for the stochastic problem, then

F(x) = (y - g(x))^T C_D^{-1} (y - g(x)) + (x - x_0)^T C_M^{-1} (x - x_0)

and

h_k = -A_k^* C_D^{-1} [y - g(x_k)] + C_M^{-1} (x_k - x_0).

We can use this vector in the conjugate gradient algorithm above. For the general nonlinear case, it may not be necessary to compute α_k exactly and some approximations could be used.
The convergence of a single step in descent algorithms is related to the spread of the eigenvalues in Q = A^*A. In the simplified case, for F(x) = (1/2)|y - A x|^2, the error is reduced at each step by roughly a factor of [(λ_max - λ_min)/(λ_max + λ_min)]^2, where λ_max and λ_min are the largest and smallest eigenvalues of Q. Thus, we may want to look for "preconditioner operators" for Q.
For example, let

Q' = P_1 Q P_2.

If P_1 and P_2 are chosen appropriately, Q' will have compressed eigenvalues with values near unity. Preconditioner steps are extremely important for the convergence speed of descent algorithms.
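As one common illustration (a diagonal, or Jacobi, preconditioner with P_1 = P_2 = diag(Q)^(-1/2) is chosen here as an assumption; the lecture does not specify a particular operator), the sketch below symmetrically rescales Q by the inverse square roots of its diagonal, which compresses the eigenvalue spread when Q has strongly varying row scales.

import numpy as np

def jacobi_preconditioned_system(Q, b):
    # Symmetric diagonal (Jacobi) preconditioning:
    # Q' = P Q P and b' = P b, with P = diag(Q)^(-1/2).
    # Solve Q' z = b' (e.g., with conjugate_gradient above), then recover x = P z.
    P = np.diag(1.0 / np.sqrt(np.diag(Q)))    # assumes positive diagonal entries
    Qp = P @ Q @ P                            # rescaled operator with unit diagonal
    bp = P @ b
    return Qp, bp, P

# Hypothetical badly scaled example (illustrative values only)
A = np.array([[100.0, 0.0],
              [  1.0, 0.01]])
Q, b = A.T @ A, A.T @ np.array([1.0, 2.0])
Qp, bp, P = jacobi_preconditioned_system(Q, b)
print(np.linalg.cond(Q), np.linalg.cond(Qp))  # the eigenvalue spread is greatly reduced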