Purdue University

EAS 657

Geophysical Inverse Theory

Robert L. Nowack

Lecture 7

The Generalized Inverse Operator

The singular value decomposition of a general linear operator can be written

where

are the eigenvectors of (AA*) associated with the non-zero eigenvalues and span R(A) = R(AA*).

are eigenvectors of (A*A) associated with the non-zero eigenvalues that span N^perp(A) = N(A*A). The singular values are,

where are the non-zero eigenvalues of A*A, AA*. Also, are the non-zero positive eigenvalues of the larger matrix .

Now, we are interested in solving the linear matrix equation

Ax = y

We have previously classified problems Ax = y into the following categories.

1) N(A) = {0}, R^perp(A) = {0}. For this case, A is then one-to-one and invertible and

Recall that

with

where Det M_ij is the minor formed by deleting row i and column j of A. A is in invertible if Det A 0.

In terms of the SVD, M=N=P, and

where U_p is MxP, is PxP, and is PxN. Because U_p and V_p now completely span the domain and range, then

U_p*U_p = U_pU_p* = 1 V_p*V_p = V_pV_p* = 1

thus,

The solution is then,

(Note in general, for

Note that for the case where N(A) = 0 and R^perp(A) = 0, instead of explicitly computing A^-1, we can also factor A as

A = L U

where L is a lower triangular matrix and U is an upper triangular matrix .

Then, Ax = y can be written as L Ux = y.

Now solve and by back substitution. This is called Gaussian elimination. The number of algebraic operation counts is

backsubstitution ~ N²

We can derive this factorization of A with householder transformations and QR algorithms later. Note, to directly compute, A^-1 is on the order of N³ operations which is larger. Thus, inverses are computed only when they are needed.

2) Purely underdetermined systems. For this case, a nonunique solution to Ax = y results where

N(A) 0

We know from the previous sections that if R^perp(A) = {0}, then the minimum norm solution is

X_MN = A*(AA*)^-1y

Recall that (ABC)* = C*B*C*, then

If U₀ = {0} or R^perp(A) = {0}, then

Thus,

This is the generalized inverse for purely underdetermined case in terms of the SVD.

3a) Purely overdetermined system. For this case, the system is incompatible, but the solution is unique, where

R^perp(A) 0, N(A) = {0}

We need to find the best approximation solution. From previous sections, this can be found from the least squares solution

If V₀ = {0} or N(A) = {0}, then

thus

3b) Both N(A) 0 and R^perp(A) 0. For this case, we want to find the best approximation – minimum norm solution since the system is incompatible and the solution is nonunique. In this case, the generalized inverse can directly be written in terms of the SVD as

The solution to Ax = y for all cases can then be written as

where V₀ are the basis vectors spanning the nullspace N(A). The first term on the right is the minimum norm solution and the second term results from the nullspace required by additional constraints to the problem. However, in many cases, if it is known that either R^perp(A) = {0} or N(A) = {0}, it may be computationally more efficient to compute the specialized forms for .

But for the specialized solutions, there may be numerical difficulties when dealing with small eigenvalues for (A*A) and (AA*) since these have eigenvalues that are squared . For the earlier example, Ax = y, then

then,

and

Thus,

and

This is the best approximation – minimum norm solution.

Resolution Matrix

The best approximation-minimum norm solution is given by

Let x by any solution to y = Ax, then

where is called the resolution matrix and is the operator that determines the relationship between any possible solution x and the particular solution, x_g. In terms of the SVD, R can be written

Thus,

where V_p includes the vectors in N^perp(A) and are coefficients. This operator projects x onto the V_p space resulting in the minimum norm solution in N^perp(A). Note that R is an NxN matrix. If P = N, then is an identity matrix.

If a particular row of R is “-like” with a one on the main diagonal, then that component of x_g is called “well resolved”. If a row of R is not “-like”, then certain components of x_g are not well resolved. Since

then,

For P=N, then R is –like and x_g = x.

The so-called Backus-Gilbert Method finds solutions that minimize various norms of the “off diagonal” components of R in order to obtain a generalized inverse and solution x_g.

As an example of computing R, consider

with

Thus,

Note that the “trace” of R (sum of diagonal elements) = P, which is the rank of A. This shows that the x₃ component is well separated and uniquely resolved, but only the average of the components x₁ and x₂ is well resolved. The value of the R_ii is often used to quantify the degree of resolution of model parameter x_i.

For an incompatible system Ax y, then

where y_g is component of the data in R(A). Then, we have

where U_p are vectors in R(A) and are the coefficients. N is called the relevancy matrix and is MxM. If N=I, then all the data are sensitive to the operator A and P=M.

Next we look at the errors in the solution. Assume the data have errors , then

This is the part of the error in the solution mapped from the data errors , but doesn’t include the nonuniqueness ambiguity. Let’s look at the “covariance matrix”. Recall that the average of a random variable x is

where f(x) is called the probability density function or pdf.

The variance measures the spread about the mean <x>, where

The covariance of two random variables is defined as

where f(x,y) is the joint pdf of random variables x and y.

If x and y are formally independent and uncorrelated, then f(x,y) = f(x)f(y) and <xy> = <x><y>; then = 0.

For a vector of random variables

then we can define a covariance matrix about the mean as

where is the outer product.

2-D Example

The linear correlation coefficient between two random variables x₁ and x₂ is

In the extreme cases, , then x₁ and x₂ are linearly related to one another. The figure below shows the pdf of two random variables with a positive correlation coefficient.

The figure below shows an example of the pdf of two random variable x₁ and x₂ which are uncorrelated with = 0.

The covariance matrix of two random variables can be written

and is a symmetric matrix.

1) The diagonal elements of C measure the “width” of the joint pdf along the axes.

2) The off-diagonal elements of C indicate the degree of correlation between pairs of random variables.

For the special case of Gaussian or normally distributed random variables, the mean and variances are all that are needed to specify the entire pdf.

1) 1-D case

2) N-D case

For our case, and . The model covariance matrix can be written as

where is the average. Now

where = C_d is the data covariance. If all are identically distributed and uncorrelated independent random variables, then

where is the individual component of the data variances Then,

where is called the unit covariance matrix for the solution due to errors in the data. It amplifies data errors back to the model.

Summary

For each of the cases, we will write down , R, N, and C_u.

1) R^perp(A) = {0}, N(A) = {0}. This is the invertible case where

C_u = A^-1A^-1* = (A*A)^-1

R = A^-1A = I

N = AA^-1 = I

2) R^perp(A) = {0}, N(A) {0}. This is the purely undetermined case and we want the minimum norm solution.

3a) R^perp(A) {0}, N(A) = {0}. This is the purely overdetermined case and we want the best approximation or least squares solution.

3b) R^perp(A) {0}, N(A) {0}. This is the general case and we want the best approximation – minimum norm solution.

(This is the most general form)

Note that

Tradeoffs Between Variance and Resolution

If is small but non-zero, then will be large. Thus, even if the data errors are small, a small will magnify these tremendously and will disturb the solution.

We would like to eliminate the effects of small singular values on the solution. We could just eliminate small ’s, but this would reduce P resulting in a poorer resolution matrix R (or a less -like resolution). Thus, if we want stability in the solution due to perturbations or errors in the data, we must sacrifice resolution or resolving power of the various components of the solution. Recall

and for uncorrelated data errors , then

For our example case,

then,

The cross correlation of x₁ and x₂ is

resulting in perfect correlation between x₁ and x₂

Also,

Thus, . Note that C_x is not a measure of the absolute error in the model since the true model can also differ by any amount in N(A). Recall the original observation equation x₁ + x₂ = 1 which does not require x₁ = x₂ which is only the minimum norm solution.

Thus, C_x will underestimate the true model errors if

1) There exists a non-zero N(A).

2) There are additional modeling errors.

3) If nonlinearities exist and the problem was linearized.

There is also a tradeoff between –like resolution and the covariance of the solution resulting from the data errors

As a note on notation, we have used for the singular values

these are the singular values of matrix A

where

where are the eigenvalues of A*A and AA* for i=1,P

We also use and for variances of random variables x₁ and x₂, as well as and for the variances of the data and model parameters. You need to watch carefully for the context that these symbols are used.

The R (resolution) and N (relevance) are matrices which are just projection operators, thus,

The unit covariance operator is . The data covariance is .

For independent identically distributed data, then

The model errors resulting from data errors are then

For , then as shown before.

The Approximate Generalized Inverse and Damped Least Squares

The generalized inverse takes care of any singularities due to zero singular values. But, in practice, we can have severe difficulties with small, but non-zero singular values, especially with numerical round-off errors.

One way around this is to set a cutoff in the singular values and place all singular value vectors with into an extended null space . A very similar effect can be attained without resorting to SVD. by a procedure called “damping”. This involves solving the extended system

This is formally overdetermined where the first M rows are just the system Ax = y. The last N rows are the system Ix ~ 0. These last equations try to force the solution x to zero. The constant relatively weights the equations Ax = y with equations that force x as close as possible to zero. For , one gets the original system. For , then x goes to zero.

Solving this extended system by standard least squares results in

and

this is the damped least squares solution. Using the SVD, then

and

Note that

since [V_pV₀] is complete and orthonormal. Then,

and

This is the damped least squares operator written in terms of the SVD.

The figure below shows the effect of truncating the singular values in SVD or by filtering the singular values smoothly using damped least squares.

The resolution operator for damped least squares becomes

For increasing , reduces the overall resolution which now only approximates to , but is reasonably close for small .

The approximate unit covariance operator for damped least squares becomes

and with , then

Thus, increasing artificially reduces the variance of the model parameters resulting from variance in the data.

Note that the damped least squares solution is a straightforward way of adding constraints. We can add prior information as “data” to our problem in a similar fashion.

Ex) We want to solve Ax = y and we want the solution x to be as “smooth” as possible.

Let’s look at a 1-D model vector. An augmented system of equations can be written as

For this case, we will attempt to simultaneously solve Ax = y and force the first derivative to be small as (x_i+1– x_i) ~ 0. The weighting parameter is .

If we want to minimize the second derivative of the model, then

where the a values in the first and last rows depend on the boundary conditions.

We can also solve the equations in terms of data residuals as

where x₀ is an initial solution where y₀ = Ax₀. The generalized solution can be written

In terms of the SVD,

For small errors in the data, and small errors in the initial solution, then

Forming the model covariance results in

where the first term results from errors in the data and the second term results from the projection of errors in the initial model onto N(A). If there were errors in the A matrix itself, this would result in a third term related to modeling errors.

If we assume a general damped inverse of the form

then

and

This will be further discussed in a later chapter (see also Tarantola, 1987).