EAS 657
Geophysical Inverse Theory
Robert L. Nowack
Lecture 12
From an earlier lecture, we showed that the stochastic inverse solution minimizes

F(x) = (y - g(x))^T C_D^{-1} (y - g(x)) + (x - x_0)^T C_M^{-1} (x - x_0)

where x are the model parameters, x_0 are the prior model parameters, C_D and C_M are the data and prior model covariances, and g(x) is the predicted data from the forward problem, which is, in general, nonlinear.
A linearization procedure involves expanding g(x) in a Taylor series about the current model and keeping only the linear term,

g(x) ≈ g(x_k) + A_k (x - x_k)

where y is the observed data and A is the Frechet derivative operator or the sensitivity operator, which in component form is A_ij = ∂g_i/∂x_j, with A_k denoting this operator evaluated at x_k.
If at the kth iteration the model is x_k, then

x_{k+1} = x_k + δx_k

where δx_k is chosen to minimize

F(x) ≈ [y - g(x_k) - A_k δx]^T C_D^{-1} [y - g(x_k) - A_k δx] + (x_k + δx - x_0)^T C_M^{-1} (x_k + δx - x_0).

Differentiating with respect to δx and setting this to zero gives

(A_k^T C_D^{-1} A_k + C_M^{-1}) δx_k = A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0).

This then results in

x_{k+1} = x_k + (A_k^T C_D^{-1} A_k + C_M^{-1})^{-1} { A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0) }

where in the linear case this converges in one iteration. In the nonlinear case, A_k is the sensitivity matrix ∂g/∂x evaluated at x_k.
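For concreteness, the following is a minimal numerical sketch of this update; the names g, jacobian, and the use of dense numpy arrays and explicit covariance inverses are assumptions for illustration, not part of the lecture notes.

import numpy as np

def stochastic_gauss_newton_step(x_k, x_0, y, g, jacobian, C_D, C_M):
    # One linearized stochastic inverse update:
    # x_{k+1} = x_k + (A^T C_D^{-1} A + C_M^{-1})^{-1}
    #                 [A^T C_D^{-1} (y - g(x_k)) - C_M^{-1} (x_k - x_0)]
    A = jacobian(x_k)                     # sensitivity (Frechet derivative) at x_k
    CDinv = np.linalg.inv(C_D)
    CMinv = np.linalg.inv(C_M)
    H = A.T @ CDinv @ A + CMinv           # approximate Hessian of F(x)
    rhs = A.T @ CDinv @ (y - g(x_k)) - CMinv @ (x_k - x_0)
    return x_k + np.linalg.solve(H, rhs)  # solve, rather than form H^{-1} explicitly

In the linear case g(x) = Ax, a single application of this step returns the stochastic inverse solution.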
Tarantola (1984), while investigating the seismic migration problem, noted that since for a nonlinear problem this is only a linearization, it may not be worthwhile to evaluate the operator (A_k^T C_D^{-1} A_k + C_M^{-1})^{-1} exactly. If this is just set to a small constant ε, then

x_{k+1} = x_k + ε { A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0) }.

This forms a descent algorithm where no matrix inversions are required. This is important since for large data sets it may be prohibitive to evaluate large matrix inversions.
Ex) In mantle tomography, there may be up to several million travel time observations. Also, the mantle is often divided into as many as 100,000 pixels. This would result in a 100,000 x 100,000 matrix inversion, which is prohibitive.
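A sketch of the corresponding inversion-free descent step follows; the step size eps is a hypothetical tuning constant, and CDinv and CMinv are assumed to be precomputed (often diagonal) inverse covariances.

def gradient_descent_step(x_k, x_0, y, g, jacobian, CDinv, CMinv, eps=1e-2):
    # Descent step with the Hessian inverse replaced by a small constant eps,
    # so only the adjoint (transpose) of A_k is applied and nothing is inverted.
    A = jacobian(x_k)
    grad = A.T @ CDinv @ (y - g(x_k)) - CMinv @ (x_k - x_0)
    return x_k + eps * grad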
Iterative procedures
We will confine our attention here to descent algorithms. We first look at the total error surface

F(x) = (y - g(x))^T C_D^{-1} (y - g(x)) + (x - x_0)^T C_M^{-1} (x - x_0)

where g(x) is the computed forward problem. The vector ∂F/∂x is the gradient of the total error surface. This is a vector which points in the direction of maximum ascent of the function F(x).
The gradient vector for a scalar temperature field points in the direction of maximum temperature change.
Ex) Direction of temperature gradient
Ex) The gradient vector for topography points in the direction of maximum ascent (perpendicular to contours of equal elevation).
Now, linearize the forward problem as

g(x) ≈ g(x_k) + A_k (x - x_k)

where A_k is the sensitivity or Frechet derivative operator ∂g/∂x evaluated at x_k. Then,

F(x) ≈ (1/2) |y - g(x_k) - A_k (x - x_k)|^2

where we have used a simplified objective function (unit data covariance and no prior term) for simplicity. The gradient vector is then

∂F/∂x = -A_k^* [y - g(x_k) - A_k (x - x_k)]

and points in the direction of maximum ascent. Let,

h_k = ∂F/∂x evaluated at x = x_k, so that h_k = -A_k^* [y - g(x_k)]

where A_k^* is the adjoint of A_k.
We want to construct a descent algorithm where at each iteration we move in the direction of maximum descent, thus

δx_k = -α_k h_k

where α_k is a scalar which tells us how far to move. Thus,

x_{k+1} = x_k - α_k h_k.
We now have to find a way of picking α_k. We want to pick α_k so that we move along the line until we reach a minimum and are tangent to some contour of F(x). We thus want dF(x_k - α h_k)/dα = 0, or

(A_k h_k)^T [y - g(x_k) + α A_k h_k] = 0

where we have used the linearized residual y - g(x_k - α h_k) ≈ y - g(x_k) + α A_k h_k. Then, since A_k^* [y - g(x_k)] = -h_k, this results in

-h_k^T h_k + α (A_k h_k)^T (A_k h_k) = 0.

Finally, we get

α_k = (h_k^T h_k) / [(A_k h_k)^T (A_k h_k)] = |h_k|^2 / |A_k h_k|^2.
This algorithm is called the Steepest Descent algorithm. Note that in the nonlinear case, and when applied to the full stochastic inverse objective function, this α_k can only be expected to be an approximation. The following steps are used:
1) Compute A_k, the sensitivity operator ∂g/∂x evaluated at x_k, and also compute g(x_k), the forward problem. Hopefully this can be done in the same step.
2) Compute h_k = -A_k^* [y - g(x_k)], where A_k^* is the adjoint and is easy to compute given A_k.
3) Compute the scalar weight α_k. We may opt for an approximate specification or computation of this scalar for nonlinear problems.
4) Update the model: x_{k+1} = x_k - α_k h_k.
5) Repeat until the gradient h_k is small. Thus, in the descent approach, no inverses are required. The figure below shows an illustration of this, where the contours of the objective function to be minimized are shown as well as the iterative descent path to a minimum.
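The following is a compact sketch of these five steps for the simplified objective F(x) = (1/2)|y - g(x)|^2, assuming dense numpy arrays and placeholder functions g and jacobian (the names and the stopping tolerance are illustrative, not from the lecture).

import numpy as np

def steepest_descent(x_start, y, g, jacobian, n_iter=50, tol=1e-8):
    # Steepest descent for F(x) = 0.5 * ||y - g(x)||^2
    x = x_start.copy()
    for _ in range(n_iter):
        A = jacobian(x)                 # 1) sensitivity operator A_k and forward problem g(x_k)
        r = y - g(x)                    #    data residual
        h = -A.T @ r                    # 2) gradient h_k = -A_k* (y - g(x_k))
        if np.linalg.norm(h) < tol:     # 5) stop when the gradient is small
            break
        Ah = A @ h
        alpha = (h @ h) / (Ah @ Ah)     # 3) line-search step for the linearized problem
        x = x - alpha * h               # 4) move downhill
    return x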
One problem with this approach is that for highly elongated error surfaces (characterized by a wide spread of singular values of A_k), many iterations may be required for convergence.
Convergence improvement can be obtained by using several techniques:
1) Scaling the problem to reduce the spread of the singular values of A_k.
2) Using a modified descent algorithm called the conjugate gradient method.
Ex) Given y and , as well as , and (no knowledge of the range of x), then,
Let,
, and
then
and
(Note cancels in the combined term )
Let Ax = y be,
The solution to these equations is . Now let,
.
The total error surface is then,
(Note the factor of 1/2 added for simplicity later), and
Thus,
The gradient vector is then,
or
Now for our case,
and
For the first iteration of a descent algorithm,
The first iteration then gives
Further iterations would then be required to converge to the solution.
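Since the numerical values of the original example are not reproduced here, the short run below applies the steepest_descent sketch above to a hypothetical 2x2 system; the matrix and data are stand-in values chosen only to show the iterative convergence toward the least-squares solution.

import numpy as np

A = np.array([[2.0, 1.0],
              [1.0, 3.0]])              # stand-in sensitivity matrix
y = np.array([3.0, 4.0])                # stand-in data; exact solution is x = (1, 1)

x_est = steepest_descent(np.zeros(2), y,
                         g=lambda x: A @ x,
                         jacobian=lambda x: A,
                         n_iter=200)
print(x_est)                            # approaches np.linalg.solve(A, y)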
We will find that multiplying the data residual by the adjoint of the sensitivity matrix is kinematically equivalent to seismic migration. An iterative sequence of seismic migrations then results in a complete seismic inversion (Tarantola, 1984). For a complete seismic inversion, the ultimate goal is to obtain the model parameters at each point in the medium, which would imply many unknowns, and usually iterative methods would be required.
Conjugate Gradient Method
The idea of the conjugate gradient method is to modify the descent direction in order to reduce the number of iterations required. The process is then continued as before,

x_{k+1} = x_k + α_k p_k

where now p_k is a modified descent vector and α_k is chosen to minimize F along a line search in the direction p_k.
For simplicity, let's just look at the linear problem g(x_k) = A x_k and let C_D = I with C_M^{-1} = 0. Then, we will minimize

F(x) = (1/2) |y - A x|^2.

Letting Q = A^*A and b = A^*y, where Q is a symmetric and positive definite matrix (all eigenvalues > 0), then

F(x) = (1/2) x^T Q x - b^T x + (1/2) y^T y

so that minimizing F(x) is equivalent to solving Q x = b.
For a positive definite Q, a set of "Q-orthogonal" vectors p_0, ..., p_N can be constructed with a modified inner product such that

p_i^T Q p_j = 0, for i ≠ j.
The vector x can be expanded in a generalized Fourier series in terms of p_0, ..., p_N as,

x = Σ_k α_k p_k

where α_k = (p_k^T Q x) / (p_k^T Q p_k). But, we want Q x = b, or p_k^T Q x = p_k^T b, and so α_k = (p_k^T b) / (p_k^T Q p_k). Then,

x = Σ_k [(p_k^T b) / (p_k^T Q p_k)] p_k.
For x_k = Σ_{i<k} α_i p_i, then p_k^T Q x_k = 0, so that

α_k = (p_k^T b) / (p_k^T Q p_k) = -(p_k^T h_k) / (p_k^T Q p_k)

where h_k = Q x_k - b is the grad vector above. Then, an iterative sequence can be constructed as

x_{k+1} = x_k + α_k p_k

where

α_k = -(p_k^T h_k) / (p_k^T Q p_k).
Note that the p_k are a "Q-orthogonal" basis for the domain, where

p_i^T Q p_j = 0 for i ≠ j.
Note that this is just a modified Gram-Schmidt procedure. For the linear case, it is guaranteed to converge to the correct answer in at most N iterations for an N x N Q matrix. All we need to do is be able to compute the "Q-orthogonal" vectors p_i.
In addition, the set of partial solutions x_k, k < N, minimizes F(x_k) not just on each line segment, but on the whole linear subspace spanned by p_0, ..., p_{k-1} (at least for a quadratic F(x_k)). Thus, the gradient error vector h_k is perpendicular to all p_i, i = 0, ..., k-1. Thus,

p_i^T h_k = 0 for i = 0, ..., k-1.
This is a very useful property which allows us to compute the vectors pi from the gradient vectors hk.
The Conjugate Gradient Algorithm
1) Choose x_0 and set p_0 = -h_0 = b - Q x_0. Move to x_1 = x_0 + α_0 p_0, then compute the new gradient h_1.
2) The vector p_1 must be in the space spanned by {h_0, h_1} and must be "Q-orthogonal" to p_0.
3) Continue this procedure to find other pk.
In other words, the vectors {p_1, ..., p_k} are the Q-orthogonalized gradient vectors {h_1, ..., h_k}. This leads to the procedure:
For any x_0, define p_0 = -h_0 = b - Q x_0, then

x_{k+1} = x_k + α_k p_k, with α_k = -(h_k^T p_k) / (p_k^T Q p_k)

(where h_k = Q x_k - b is the computed gradient vector).
We construct p_{k+1} such that p_{k+1} is Q-orthogonal to p_i, i = 0, ..., k, where

p_{k+1} = -h_{k+1} + β_k p_k, with β_k = (h_{k+1}^T Q p_k) / (p_k^T Q p_k).

For i = k, the two terms in p_{k+1}^T Q p_i cancel, giving zero. For i < k, the second term is zero and the first term is zero since the subspace [p_0, ..., p_k] is also spanned by [h_0, ..., h_k].
In summary, for a linear problem Q x = b, with Q an (N x N) matrix, we can solve this using several approaches:
1) Using matrix inversion, this can be solved in one step.
2) Using the steepest descent method, there will be a geometric error reduction at each step.
3) Using the conjugate gradient method, this will converge in at most N steps.
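A minimal sketch of the linear conjugate gradient iteration described above is given below for Q x = b; here Q and b would be formed explicitly for clarity, although in practice only products with A and its adjoint are needed (variable names are illustrative).

import numpy as np

def conjugate_gradient(Q, b, x_start=None, tol=1e-10):
    # Conjugate gradient for Q x = b, Q symmetric positive definite.
    # In exact arithmetic this converges in at most N steps for an N x N matrix.
    n = len(b)
    x = np.zeros(n) if x_start is None else x_start.copy()
    h = Q @ x - b                       # gradient h_k = Q x_k - b
    p = -h                              # first descent direction p_0 = -h_0
    for _ in range(n):
        if np.linalg.norm(h) < tol:
            break
        Qp = Q @ p
        alpha = -(h @ p) / (p @ Qp)     # exact line-search step length
        x = x + alpha * p               # move along p_k
        h = Q @ x - b                   # new gradient h_{k+1}
        beta = (h @ Qp) / (p @ Qp)      # makes p_{k+1} Q-orthogonal to p_k
        p = -h + beta * p
    return x

For the least-squares problem above, conjugate_gradient(A.T @ A, A.T @ y) returns the same solution as the steepest descent example, in at most N iterations.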
Now for the more general error surface for the stochastic problem, then

F(x) = (y - g(x))^T C_D^{-1} (y - g(x)) + (x - x_0)^T C_M^{-1} (x - x_0)

and

h_k = -A_k^* C_D^{-1} [y - g(x_k)] + C_M^{-1} (x_k - x_0).

We can use this vector in the conjugate gradient algorithm above. For the general nonlinear case, it may not be necessary to compute α_k exactly and some approximations could be used.
The convergence of a single step in descent algorithms is related to the spread of the eigenvalues in Q = A^*A. In the simplified case, for F(x) = (1/2)|y - A x|^2, the error is reduced at each step by roughly a factor of [(λ_max - λ_min)/(λ_max + λ_min)]^2, where λ_max and λ_min are the largest and smallest eigenvalues of Q. Thus, we may want to look for "preconditioner operators" for Q.
For example, let

Q' = P_1 Q P_2.

If P_1 and P_2 are chosen appropriately, Q' will have compressed eigenvalues with values near unity. Preconditioner steps are extremely important for the convergence speed of descent algorithms.
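As one common illustration (a diagonal, or Jacobi, preconditioner with P_1 = P_2 = diag(Q)^(-1/2) is chosen here as an assumption; the lecture does not specify a particular operator), the sketch below symmetrically rescales Q by the inverse square roots of its diagonal, which compresses the eigenvalue spread when Q has strongly varying row scales.

import numpy as np

def jacobi_preconditioned_system(Q, b):
    # Symmetric diagonal (Jacobi) preconditioning:
    # Q' = P Q P and b' = P b, with P = diag(Q)^(-1/2).
    # Solve Q' z = b' (e.g., with conjugate_gradient above), then recover x = P z.
    P = np.diag(1.0 / np.sqrt(np.diag(Q)))    # assumes positive diagonal entries
    Qp = P @ Q @ P                            # rescaled operator with unit diagonal
    bp = P @ b
    return Qp, bp, P

# Hypothetical badly scaled example (illustrative values only)
A = np.array([[100.0, 0.0],
              [  1.0, 0.01]])
Q, b = A.T @ A, A.T @ np.array([1.0, 2.0])
Qp, bp, P = jacobi_preconditioned_system(Q, b)
print(np.linalg.cond(Q), np.linalg.cond(Qp))  # the eigenvalue spread is greatly reduced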