Purdue University

EAS 657

Geophysical Inverse Theory

Robert L. Nowack

Lecture 12

 

 

            From an earlier lecture, we showed that the stochastic inverse solution minimizes

F(x) = [y - g(x)]^T C_D^{-1} [y - g(x)] + (x - x_0)^T C_M^{-1} (x - x_0)

where x are the model parameters, x_0 are the prior model parameters, C_D and C_M are the data and prior model covariances, and g(x) is the predicted data from the forward problem, which is, in general, nonlinear.

 

A linearization procedure involves expanding g(x) in a Taylor series

g(x) ≈ g(x_0) + A (x - x_0),   or   y - g(x_0) ≈ A δx

where y is the observed data, A is the Frechet derivative or sensitivity operator, which in component form is A_ij = ∂g_i/∂x_j evaluated at x_0, and δx = x - x_0.

 

            If at the kth iteration the model is x_k, then

x_{k+1} = x_k + δx_k

where δx_k is chosen to minimize

F(δx_k) = [y - g(x_k) - A_k δx_k]^T C_D^{-1} [y - g(x_k) - A_k δx_k] + [x_k + δx_k - x_0]^T C_M^{-1} [x_k + δx_k - x_0]

Differentiating with respect to δx_k and setting the result to zero gives

A_k^T C_D^{-1} [y - g(x_k) - A_k δx_k] - C_M^{-1} [x_k + δx_k - x_0] = 0

This then results in

x_{k+1} = x_k + (A_k^T C_D^{-1} A_k + C_M^{-1})^{-1} { A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0) }

where in the linear case this converges in one iteration.  In the nonlinear case, A_k is the sensitivity matrix evaluated at the current model, (A_k)_ij = ∂g_i/∂x_j at x_k.
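
As a sketch of how this linearized update could be coded (the callables g and A_of and the dense covariance matrices below are placeholders, not anything defined in the notes), one iteration amounts to a single dense linear solve in Python:

import numpy as np

def gauss_newton_step(g, A_of, y, xk, x0, Cd_inv, Cm_inv):
    """One linearized stochastic-inverse update (requires a dense model-space solve).

    g      : callable, forward problem g(x) -> predicted data
    A_of   : callable, returns the sensitivity matrix A_k at x_k
    Cd_inv : inverse data covariance C_D^{-1} (matrix)
    Cm_inv : inverse prior model covariance C_M^{-1} (matrix)
    """
    Ak = A_of(xk)
    lhs = Ak.T @ Cd_inv @ Ak + Cm_inv                      # (A_k^T Cd^{-1} A_k + Cm^{-1})
    rhs = Ak.T @ Cd_inv @ (y - g(xk)) - Cm_inv @ (xk - x0)
    return xk + np.linalg.solve(lhs, rhs)                  # the matrix solve is the costly step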

 

            Tarantola (1984), while investigating the seismic migration problem, noted that since for a nonlinear problem this is only a linearization, it may not be worthwhile to evaluate the inverse operator (A_k^T C_D^{-1} A_k + C_M^{-1})^{-1} exactly.  If this operator is simply replaced by a small constant ε, then

x_{k+1} = x_k + ε { A_k^T C_D^{-1} [y - g(x_k)] - C_M^{-1} (x_k - x_0) }

This forms a descent algorithm in which no large matrix inversions are required (C_D and C_M are typically diagonal or easily inverted).  This is important since for large data sets it may be prohibitive to evaluate large matrix inversions.

 

            Ex)  In mantle tomography, there may be up to several million travel time observations.  Also, the mantle is often separated into up to 100,000 pixels.  This would result in a 100,000 x 100,000 matrix inversion which is prohibitive.
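
A minimal sketch of the corresponding constant-step iteration, with the same placeholder names as above and assuming the inverse covariances are stored as diagonal vectors:

import numpy as np

def constant_step_descent(g, A_of, y, x0, Cd_inv_diag, Cm_inv_diag, eps, n_iter=50):
    """Iterate x_{k+1} = x_k + eps*(A_k^T Cd^{-1}(y - g(x_k)) - Cm^{-1}(x_k - x_0)).

    Cd_inv_diag, Cm_inv_diag : diagonals of C_D^{-1} and C_M^{-1} (assumed diagonal)
    eps                      : small constant replacing the full inverse operator
    """
    x = x0.copy()
    for _ in range(n_iter):
        Ak = A_of(x)                                   # sensitivity at x_k
        r = y - g(x)                                   # data residual
        x = x + eps * (Ak.T @ (Cd_inv_diag * r) - Cm_inv_diag * (x - x0))
    return x                                           # no matrix inversion anywhere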

 

 

Iterative procedures

 

            We will confine our attention here to descent algorithms.  We first look at the total error surface

F(x) = [y - g(x)]^T C_D^{-1} [y - g(x)] + (x - x_0)^T C_M^{-1} (x - x_0)

where g(x) is the computed forward problem.  The vector ∂F/∂x is the gradient of the total error surface.  This is a vector which points in the direction of maximum ascent of the function F(x).

 

 

 

 

The gradient vector for a scalar temperature field points in the direction of maximum temperature change.

 

            Ex)  Direction of temperature gradient

 

            Ex)  The gradient vector for topography points in the direction of maximum ascent (perpendicular to contours of equal elevation).

 

Now, linearize the forward problem as

g(x) ≈ g(x_k) + A_k (x - x_k)

where A_k is the sensitivity or Frechet derivative operator evaluated at x_k, (A_k)_ij = ∂g_i/∂x_j at x_k.  Then,

F(x) ≈ ½ [y - g(x_k) - A_k (x - x_k)]^T C_D^{-1} [y - g(x_k) - A_k (x - x_k)] + ½ (x - x_0)^T C_M^{-1} (x - x_0)

where we have used a factor of ½ for simplicity (it rescales F but does not change the location of the minimum).  The gradient vector is then ∂F/∂x and points in the direction of maximum ascent.  Let,

h_k = ∂F/∂x at x_k = A_k^* C_D^{-1} [g(x_k) - y] + C_M^{-1} (x_k - x_0)

where A_k^* is the adjoint (transpose) of A_k.

 

            We want to construct a descent algorithm where at each iteration we move in the direction of maximum descent, thus

δx_k = - α_k h_k

where α_k is a scalar which tells us how far to move.  Thus,

x_{k+1} = x_k - α_k h_k

We now have to find a way of picking α_k.

 

 

 

 

We want to pick α_k so that we move along the -h_k line until we reach a minimum and are tangent to some contour of F(x).  We thus want

d F(x_k - α_k h_k) / d α_k = 0

where, using the linearized forward problem,

F(x_k - α_k h_k) ≈ ½ [y - g(x_k) + α_k A_k h_k]^T C_D^{-1} [y - g(x_k) + α_k A_k h_k] + ½ [x_k - α_k h_k - x_0]^T C_M^{-1} [x_k - α_k h_k - x_0]

Then,

d F / d α_k = (A_k h_k)^T C_D^{-1} [y - g(x_k) + α_k A_k h_k] - h_k^T C_M^{-1} [x_k - α_k h_k - x_0] = 0

This results in

- h_k^T h_k + α_k h_k^T (A_k^* C_D^{-1} A_k + C_M^{-1}) h_k = 0

where we have used h_k = - A_k^* C_D^{-1} [y - g(x_k)] + C_M^{-1} (x_k - x_0).  Finally, we get

α_k = h_k^T h_k / [ h_k^T (A_k^* C_D^{-1} A_k + C_M^{-1}) h_k ]

This algorithm is called the Steepest Descent algorithm.  Note that in the nonlinear case, where we have applied it to the stochastic inverse objective function, α_k can only be expected to be an approximation.  The following steps are used:

 

1)   Compute A_k, the sensitivity operator (A_k)_ij = ∂g_i/∂x_j evaluated at x_k, and also compute g(x_k), the forward problem.  Hopefully this can be done in the same step.

 

2)   Compute h_k = A_k^* C_D^{-1} [g(x_k) - y] + C_M^{-1} (x_k - x_0), where A_k^* is the adjoint and is easy to compute given A_k.

 

3)   Compute the scalar weight α_k.  We may opt for an approximate specification or computation of this scalar for nonlinear problems.

 

4)   Update the model:  x_{k+1} = x_k - α_k h_k

 

5)   Repeat until ||h_k|| is small.  Thus, in the descent approach, no matrix inverses are required.  A minimal code sketch of these steps is given below, and the figure that follows shows the contours of the objective function to be minimized as well as the iterative descent path to a minimum.
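
The sketch assumes the forward problem g, the sensitivity A_of, and diagonal inverse covariances are supplied by the user; the names are placeholders rather than anything from the original notes.

import numpy as np

def steepest_descent(g, A_of, y, x0, Cd_inv, Cm_inv, n_iter=100, tol=1e-8):
    """Steepest descent on the stochastic objective; no matrix inverses required.

    Cd_inv, Cm_inv : diagonals of C_D^{-1} and C_M^{-1} (assumed diagonal here)
    """
    x = x0.copy()
    for _ in range(n_iter):
        Ak = A_of(x)                                   # step 1: sensitivity at x_k
        r = g(x) - y                                   # step 1: forward problem and residual
        h = Ak.T @ (Cd_inv * r) + Cm_inv * (x - x0)    # step 2: gradient via the adjoint
        if np.linalg.norm(h) < tol:                    # step 5: stop when ||h_k|| is small
            break
        Ah = Ak @ h
        denom = Ah @ (Cd_inv * Ah) + h @ (Cm_inv * h)  # step 3: scalar weight alpha_k
        alpha = (h @ h) / denom
        x = x - alpha * h                              # step 4: model update
    return x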

 

 

 

 

One problem with this approach is that for highly elongated error surfaces (characterized by a wide spread of singular values of A_k), many iterations may be required for convergence.

 

            Convergence improvement can be obtained by using several techniques

 

1)   Scaling the problem to reduce the spread of the singular values of Ak.

 

2)   Using a modified descent algorithm called the conjugate gradient method.

 

            Ex)  Given y and C_D, as well as x_0, and C_M^{-1} ≈ 0 (no knowledge of the range of x), then,

F(x) = ½ [y - g(x)]^T C_D^{-1} [y - g(x)] + ½ (x - x_0)^T C_M^{-1} (x - x_0)

Let,

C_D = σ_d^2 I ,  and  C_M^{-1} = 0

then

F(x) = (1 / (2 σ_d^2)) [y - g(x)]^T [y - g(x)]

and

h_k = (1 / σ_d^2) A_k^T [g(x_k) - y]      (Note that σ_d^2 cancels in the combined term α_k h_k, so it may be dropped)

 

Let Ax = y be,

 

 

The solution to these equations is .  Now let,

 

.

 

The total error surface is then,

F(x) = ½ (y - A x)^T (y - A x)

(Note the factor of ½ is included for simplicity in what follows), and

 

 

Thus,

 

 

The gradient vector is then,

h = ∂F/∂x = A^T (A x - y)

or

 

 

Now for our case,

 

 

 

and

 

 

For the first iteration of a descent algorithm,

 

 

 

 

The first iteration then gives

 

 

Further iterations would then be required to converge to the solution .
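
Since the numerical values of this example did not survive in these notes, a small stand-in computation is sketched below; the 2x2 matrix and data are hypothetical illustrative values, not the original lecture numbers.

import numpy as np

# Hypothetical 2x2 system A x = y (stand-in values, solution is (1, 1))
A = np.array([[2.0, 1.0],
              [1.0, 3.0]])
y = np.array([3.0, 4.0])
x = np.zeros(2)                              # starting model x_0

for k in range(25):
    h = A.T @ (A @ x - y)                    # gradient of F(x) = 1/2 ||y - A x||^2
    if np.linalg.norm(h) < 1e-12:
        break
    alpha = (h @ h) / (h @ (A.T @ A @ h))    # exact line-search step alpha_k
    x = x - alpha * h                        # steepest-descent update

print(x, np.linalg.solve(A, y))              # iterates approach the exact solution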

 

            We will find that multiplying the data residual y - g(x_k) by the adjoint of the sensitivity matrix is kinematically equivalent to seismic migration.  An iterative sequence of seismic migrations then results in a complete seismic inversion (Tarantola, 1984).  For a complete seismic inversion, the ultimate goal is to obtain the model parameters at each point in the medium, which would imply many unknowns, and usually iterative methods would be required.

 

 

 

Conjugate Gradient Method

 

            The idea of the conjugate gradient method is to modify the descent direction h_k in order to reduce the number of iterations required.  The process is then continued as before

x_{k+1} = x_k + α_k p_k

where now p_k is a modified descent vector and α_k is chosen by a line search to minimize F along the direction p_k.

 

            For simplicity, let's just look at the linear problem g(x_k) = A x_k and let C_D = I with C_M^{-1} = 0.  Then, we will minimize

F(x) = ½ (y - A x)^T (y - A x) = ½ x^T A^* A x - (A^* y)^T x + ½ y^T y

Letting Q = A^*A and b = A^*y, where Q is a symmetric and positive definite matrix (all eigenvalues > 0), then

F(x) = ½ x^T Q x - b^T x + ½ y^T y

For a positive definite Q, a set of “Q-orthogonal” vectors p_0,…,p_N can be constructed with a modified inner product such that p_i^T Q p_j = 0, for i ≠ j.

 

The vector x can be expanded in a generalized Fourier series in terms of p_0,…,p_N as,

x = Σ_k α_k p_k

where α_k = p_k^T Q x / (p_k^T Q p_k).  But, we want the x for which Q x = b, or A^* A x = A^* y, so that p_k^T Q x = p_k^T b.  Then,

α_k = p_k^T b / (p_k^T Q p_k)

For the partial sum x_k = Σ_{i<k} α_i p_i, we have p_k^T Q x_k = 0, so that

α_k = p_k^T (b - Q x_k) / (p_k^T Q p_k) = - p_k^T h_k / (p_k^T Q p_k)

where h_k = Q x_k - b is the gradient vector defined above.  Then, an iterative sequence can be constructed as

x_{k+1} = x_k + α_k p_k

where

α_k = - p_k^T h_k / (p_k^T Q p_k)

Note that the p_k are a “Q-orthogonal” basis for the domain where

p_i^T Q p_j = 0   for i ≠ j

Note that this is just a modified Gram-Schmidt procedure.  For the linear case, it is guaranteed to converge to the correct answer in at most N iterations for an NxN Q matrix.  All we need to do is be able to compute the “Q-orthogonal” vectors p_i.
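
As a small numerical check of this idea (a sketch with a randomly generated matrix, not part of the original notes): Q-orthogonalize the coordinate basis with a Gram-Schmidt procedure in the Q inner product, then recover x from the coefficients α_k = p_k^T b / (p_k^T Q p_k).

import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(6, 4))
Q = A.T @ A                       # symmetric positive definite (full column rank assumed)
b = A.T @ rng.normal(size=6)

# Gram-Schmidt with the Q inner product: p_i^T Q p_j = 0 for i != j
P = []
for v in np.eye(4):
    p = v.copy()
    for pj in P:
        p -= (pj @ Q @ v) / (pj @ Q @ pj) * pj
    P.append(p)

# Generalized Fourier expansion: x = sum_k alpha_k p_k with alpha_k = p_k^T b / p_k^T Q p_k
x = sum((p @ b) / (p @ Q @ p) * p for p in P)
print(np.allclose(Q @ x, b))      # True: the expansion solves Q x = b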

 

            In addition, the set of partial solutions x_k, k < N, minimizes F(x_k) not just on each line segment, but on the whole linear subspace spanned by p_0,…,p_{k-1} (at least for a quadratic F(x)).  Thus, the gradient vector h_k is perpendicular to all p_i, i = 0,…,k-1.  Thus,

h_k^T p_i = 0   for  i < k

 

This is a very useful property which allows us to compute the vectors pi from the gradient vectors hk.

 

 

 

The Conjugate Gradient Algorithm

 

1)   Choose p_0 = -h_0.  Move to x_1 = x_0 + α_0 p_0, then compute h_1.

 

2)   The vector p_1 must be in the space spanned by {h_0, h_1} and must be “Q-orthogonal” to p_0.

 

3)   Continue this procedure to find other pk.

 

In other words, the vectors {p_1,…,p_k} are the Q-orthogonalized versions of the gradient vectors {h_1,…,h_k}.  This leads to the procedure:

            For any starting model x_0, define p_0 = -h_0, then

x_{k+1} = x_k + α_k p_k

α_k = - h_k^T p_k / (p_k^T Q p_k)

p_{k+1} = - h_{k+1} + β_k p_k

β_k = h_{k+1}^T Q p_k / (p_k^T Q p_k)

(where h_{k+1} = Q x_{k+1} - b is the computed gradient vector).

 

 

            We construct p_{k+1} such that p_{k+1} is Q-orthogonal to p_i, i = 0,…,k, where

p_{k+1}^T Q p_i = - h_{k+1}^T Q p_i + β_k p_k^T Q p_i

For i = k, the two terms cancel, giving zero.  For i < k, the second term is zero, and the first term is zero since Q p_i lies in the subspace spanned by [p_0,…,p_{i+1}], which is also spanned by [h_0,…,h_{i+1}], and h_{k+1} is perpendicular to this subspace.
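
A compact sketch of this recursion for the linear problem Q x = b (a minimal illustration; the variable names follow the notation above):

import numpy as np

def conjugate_gradient(Q, b, x0, tol=1e-10):
    """Conjugate gradient for Q x = b with Q symmetric positive definite.

    In exact arithmetic this converges in at most N steps for an NxN Q.
    """
    x = x0.copy()
    h = Q @ x - b                        # gradient h_k = Q x_k - b
    p = -h                               # p_0 = -h_0
    for _ in range(len(b)):
        if np.linalg.norm(h) < tol:
            break
        Qp = Q @ p
        alpha = -(h @ p) / (p @ Qp)      # alpha_k = -h_k^T p_k / p_k^T Q p_k
        x = x + alpha * p                # x_{k+1} = x_k + alpha_k p_k
        h = Q @ x - b                    # new gradient h_{k+1}
        beta = (h @ Qp) / (p @ Qp)       # beta_k = h_{k+1}^T Q p_k / p_k^T Q p_k
        p = -h + beta * p                # p_{k+1} = -h_{k+1} + beta_k p_k
    return x

With Q = A^*A and b = A^*y this solves the least-squares problem without forming any matrix inverse.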

 

            In summary, for a linear problem Q x = b, with Q an (NxN) matrix, we can solve this using several approaches:

 

1)   Using matrix inversion, this can be solved in one step.

2)   Using the steepest descent method, there will be a geometric error reduction at each step.

3)   Using the conjugate gradient method, this will converge in at most N steps.

 

 

Now, for the more general error surface for the stochastic problem,

F(x) = ½ [y - g(x)]^T C_D^{-1} [y - g(x)] + ½ (x - x_0)^T C_M^{-1} (x - x_0)

and

h_k = A_k^* C_D^{-1} [g(x_k) - y] + C_M^{-1} (x_k - x_0)

We can use this vector in the conjugate gradient algorithm above.  For the general nonlinear case, it may not be necessary to compute α_k exactly and some approximations could be used.
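
One way this gradient might be used is sketched below.  Since the notes leave the nonlinear details open, a Fletcher-Reeves-type β_k and a fixed approximate step length are used purely as an illustration, and all function names are placeholders.

import numpy as np

def stochastic_gradient(g, A_of, y, x, x0, Cd_inv, Cm_inv):
    """h_k = A_k^* C_D^{-1}(g(x_k) - y) + C_M^{-1}(x_k - x_0), with diagonal covariances."""
    Ak = A_of(x)
    return Ak.T @ (Cd_inv * (g(x) - y)) + Cm_inv * (x - x0)

def nonlinear_cg(g, A_of, y, x0, Cd_inv, Cm_inv, alpha=1e-2, n_iter=100):
    """Illustrative nonlinear CG with a fixed step length as a crude line-search approximation."""
    x = x0.copy()
    h = stochastic_gradient(g, A_of, y, x, x0, Cd_inv, Cm_inv)
    p = -h
    for _ in range(n_iter):
        x = x + alpha * p                     # approximate line step
        h_new = stochastic_gradient(g, A_of, y, x, x0, Cd_inv, Cm_inv)
        beta = (h_new @ h_new) / (h @ h)      # Fletcher-Reeves beta (an approximation to beta_k)
        p = -h_new + beta * p
        h = h_new
    return x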

 

            The convergence of a single step in descent algorithms is related to the spread of the eigenvalues of Q.  In the simplified case, for Q = A^*A, the error reduction per step is governed by the ratio (λ_max - λ_min) / (λ_max + λ_min), where λ_max and λ_min are the largest and smallest eigenvalues of Q.  Thus, we may want to look for “preconditioner operators” for

Q x = b

For example, let

x' = S_M x ,   y' = S_D y ,   A' = S_D A S_M^{-1}

If S_D and S_M are chosen appropriately, Q' = A'^* A' will have a compressed spread of eigenvalues, with values near unity.  Preconditioner steps are extremely important for the convergence speed of descent algorithms.
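
As an illustration of the effect of such scaling (the diagonal, Jacobi-style model scaling below is a hypothetical choice for a randomly generated matrix, not necessarily the operator intended in the notes):

import numpy as np

rng = np.random.default_rng(1)
A = rng.normal(size=(50, 10)) * np.logspace(0, 3, 10)   # badly scaled columns
Q = A.T @ A

# Diagonal (Jacobi-style) model scaling: S_M = diag(sqrt(diag(Q)))
Sm_inv = 1.0 / np.sqrt(np.diag(Q))
Ap = A * Sm_inv                    # A' = A S_M^{-1}
Qp = Ap.T @ Ap                     # scaled normal matrix with unit diagonal

for M, name in [(Q, "Q"), (Qp, "Q'")]:
    lam = np.linalg.eigvalsh(M)
    print(name, "eigenvalue spread:", lam.max() / lam.min())   # spread is much smaller for Q'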