Homework Hints and FAQ

Homework #12

Problems 3 & 4: Remember the detergent machine/carton question?  Think about it.  If your dfE = 0, you haven’t thought about this problem enough.

Problems 5 & 6:  Yes, I want you to check assumptions for both analyses.  Note that an additional assumption of ANCOVA is that the lines are parallel.  Also see the hint for 3 & 4.

Homework #11

Problems 1 & 2:  Ideally, for checking the assumptions you should have 8 plots for each question. I’ll bend the rules a little bit here, but please provide for each question, at minimum, a plot of residuals vs. each of the 3 factors, the predicted values, and a qqplot.  That’s 5 plots – I’ll let you skip the plots vs the 3 combined factors. You should be able to fit all the necessary plots on one page per question.

 Problem 3:  How many plots?  As many as you feel are necessary to make the point.  Make sure you can use the plots to say something about all significant factors and interactions.  I believe that at least two plots are needed to make the points clearly (more is okay).

 Problem 4:  Please consider the combined factor for the significant interaction, as well as the main effects (as we did with fs in the notes).

Homework #10

Problem 1:  To input the actual amounts, you could edit the file, but it is easier to use arithmetic in a data step.  Note that if A and B are the original factor levels, then amt1 = 5*A; and amt2 = 5+2.5*B;  To center the terms before squaring and making the interaction term, you can either use proc standard, as we did in Topic 4, or just subtract the mean from each variable in a data step (this is simpler).  It is not necessary to center the linear terms, because that would only change the equation by a constant, which would be absorbed into the intercept term.
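The data step arithmetic described above might look roughly like this.  It is only a sketch: the dataset, the response y, the coding of A and B as 1, 2, 3, and the means of 10 used for centering are all made up for illustration, so substitute the means from your own data.

```sas
/* Hypothetical data: factor levels A and B coded 1-3, response y invented */
data orig;
    input A B y;
    datalines;
1 1 10
1 2 12
1 3 13
2 1 14
2 2 17
2 3 18
3 1 15
3 2 19
3 3 24
;
run;

data amounts;
    set orig;
    amt1 = 5*A;          /* actual amount for factor A */
    amt2 = 5 + 2.5*B;    /* actual amount for factor B */
    c1 = amt1 - 10;      /* 10 is the mean of amt1 in this made-up data */
    c2 = amt2 - 10;      /* 10 is the mean of amt2 in this made-up data */
    c1sq = c1*c1;        /* squared terms built from the centered variables */
    c2sq = c2*c2;
    c1c2 = c1*c2;        /* interaction term built from the centered variables */
run;

proc reg data=amounts;
    model y = amt1 amt2 c1sq c2sq c1c2;   /* linear terms left uncentered */
run;
```

You can get the means to subtract from proc means or proc univariate output.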

 Problem 3:  Look at the results of your regression and compare it to the results of the ANOVA from HW9.  Which approach do you prefer and why?  By now you should know enough about these things to have an opinion.

Homework #9

Problem 1b):  “Obtain the main effects of factor A” is the same as “obtain estimates for the α_i”.  Remember that the α_i and β_j are called the main effects of factors A and B respectively.  Use the zero-sum constraint system.
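For reference, the usual zero-sum estimates for a balanced two-factor design can be written as below.  The dot-subscript notation for averages is the standard textbook convention and is assumed here; it does not come from this page.

```latex
\hat{\mu} = \bar{Y}_{\cdot\cdot\cdot}, \qquad
\hat{\alpha}_i = \bar{Y}_{i\cdot\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \qquad
\hat{\beta}_j = \bar{Y}_{\cdot j\cdot} - \bar{Y}_{\cdot\cdot\cdot}, \qquad
\sum_i \hat{\alpha}_i = 0, \quad \sum_j \hat{\beta}_j = 0
```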

Problem 1d):  A “treatment means plot” is the same as an interaction plot.

Problem 2:  Are the cartons being re-used?  Read the data description given with problem 16.14 on page 705.

Problem 3:  It is sufficient to highlight the relevant part of the means statement output.

Problem 4:  It is not sufficient to just give the SAS output.  You must first write the model (hint: it starts with Y, not μ), then give the parameter estimates, stating a value for each estimated parameter and using its symbol (μ-hat, α_1-hat, α_2-hat, …, (αβ)_33-hat).  Also you need to show that the constraints are satisfied.  If you did it by hand, show your work along with the means statement output you started with.  If you did it all in SAS, support your answer with the SAS output; since each estimate is repeated for every observation in the treatment, you can edit your output to show only one observation per treatment.

Problem 7:  Yes, you should include the plots.

Homework #8

General comment on Tukey:  In class I neglected to mention something about the tukey option to the means statement in proc glm.  As shown in the notes, there are two ways the output can be displayed:  a  short version with the Tukey groupings labelled A, B, C, etc., and a long version which gives the confidence interval for each difference.  To get the short version you add the word lines to the option ( / lines tukey) and to get the long version add the word cldiff to the option ( / cldiff tukey).  For balanced designs (equal sample sizes), lines is the default, whereas for unbalanced designs (unequal sample sizes), cldiff is the default.  Thus you only need to add the extra word if what you want is not the default.  In the cereal example in the notes, the design was unbalanced, which is why I got the confidence intervals without specifying cldiff, but since the detergent dataset is balanced, in problem 1 you will get just the tukey grouping unless you specify cldiff.  Note that the question does not ask you for confidence intervals, so that is not a problem.
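Here is a sketch of the two forms of the option.  The dataset name, the factor name brand, and the numbers are invented for illustration; only the lines, cldiff, and tukey keywords come from the discussion above.

```sas
/* Hypothetical balanced data: 3 levels, 3 observations each */
data detergent;
    input brand y;
    datalines;
1 10
1 12
1 11
2 15
2 17
2 16
3 11
3 13
3 12
;
run;

proc glm data=detergent;
    class brand;
    model y = brand;
    means brand / lines tukey;    /* short version: Tukey groupings A, B, C, ... */
    means brand / cldiff tukey;   /* long version: confidence interval for each difference */
run;
quit;
```

In practice you would keep only the means statement that gives the display you want.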

Problem 2:  The “first two machines” are machines 1 and 2.  The “last four” are machines 3, 4, 5, 6.

Problem 4a):  You can give me the predicted values since there are only four of them.

Problem 4c):  You can use α = 0.05 or 0.10.  You do not need to tell SAS which value of α you are using.  SAS reports the p-value, so you can just compare it to 0.1 or 0.05 in your head.

Problem 5:  Just as in the real Box-Cox, the transformation is Y to the power λ.  So when you get λ-hat you can say what transformation it corresponds to.  You do not need to give separate answers to parts (i) – (iv).  I was just breaking down the steps for you.

Problem 7a):  Give me the proc glm output (ANOVA table, R², etc.) and the four μ_i-hats.

Problem 8:  The transformed data referred to in the question is the square root transformation of problem 7, not the “plus 1” from problem 6.  Do not add 1 to Y. 

Homework #7

Problem 1: “Treatment means” are μ_1, μ_2, and μ_3.  You have to figure out how to express what is being estimated in terms of these (i.e. some combination of the μ_i’s).  The two constraint systems are, as done in class, the “zero sum” constraint and the τ_r = 0 constraint.  It is your job to figure out which constraint system corresponds to each parameterization.

 

Problem 3:  You will need to merge the original dataset and the dataset containing the means in order to get the overlay plot.  The merge is done by machine. An example of merging is on pages 27-28 of Topic 6.  It works like this: 

data combined;
    merge origdataset datasetwithmeans;   /* both must already be sorted by machine */
    by machine;
run;

Also you may find that you need to set the colours of the lines (using c=) in order to get the two symbols to actually be different.  This is not because you are doing anything wrong but because, when no colour is given, SAS cycles the first symbol through its whole colour list before moving on to the second symbol.  You can even set both colours to black if you want.  For example:

symbol1 v=circle i=none c=black;

symbol2 v=plus i=join c=red;

See page 19 of topic 5 for an example of an overlay plot.  You also did one on HW #3.
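Putting the merge and the symbol statements together, the whole overlay plot might be sketched as follows.  The dataset names, the response y, the mean variable ybar, and the data values are all hypothetical; the notes may organize the steps differently.

```sas
/* Hypothetical data: machine number and a made-up response y */
data orig;
    input machine y;
    datalines;
1 10
1 12
2 15
2 17
3 11
3 13
;
run;

proc sort data=orig;
    by machine;
run;

/* One row per machine, holding that machine's mean */
proc means data=orig noprint;
    by machine;
    var y;
    output out=means mean=ybar;
run;

/* Attach each machine's mean to its observations */
data combined;
    merge orig means;
    by machine;
run;

symbol1 v=circle i=none c=black;   /* raw observations */
symbol2 v=plus i=join c=red;       /* means, joined by a line */
proc gplot data=combined;
    plot y*machine=1 ybar*machine=2 / overlay;
run;
```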

Homework #6

Problem 1:  The sign of the residuals, dffits, and dfbetas indicates the direction of the effect, and the absolute value indicates the magnitude of the effect.  Therefore it is the absolute value of the quantity that is compared to the cutoff to determine whether the effect is large.   Dffits and dfbetas are calculated as ((value with point) – (value without point))/(something positive) so that a positive value means the predicted value or parameter estimate is larger with than without the data point, while a negative value means the reverse.  A positive value of a residual (regular, studentized, or studentized deleted) means that the observed value is greater than the predicted, and a negative value means the reverse.  See formulas 9.21, 9.30, and 9.34.
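Written out in the standard textbook form, the "with minus without, over something positive" structure looks like this.  The notation is assumed, not taken from this page: the subscript (i) means "computed with case i deleted", h_ii is the leverage, and c_kk is the k-th diagonal element of (X'X)^{-1}.

```latex
\mathrm{DFFITS}_i = \frac{\hat{Y}_i - \hat{Y}_{i(i)}}{\sqrt{MSE_{(i)}\, h_{ii}}},
\qquad
\mathrm{DFBETAS}_{k(i)} = \frac{b_k - b_{k(i)}}{\sqrt{MSE_{(i)}\, c_{kk}}}
```

Both denominators are positive, so the sign comes entirely from the numerator.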

Homework #5

Problem 1a): Do the actual subtraction according to the definition of extra SS.  By all means verify your answer using type II SS but you must include the actual subtraction in your answer.
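As a reminder, the definition of an extra sum of squares has the following shape.  Here X_1 and X_2 are placeholders for whichever predictors (or sets of predictors) the problem names.

```latex
SSR(X_2 \mid X_1) = SSR(X_1, X_2) - SSR(X_1) = SSE(X_1) - SSE(X_1, X_2)
```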

Problem 1b): When it says “the same test statistic”, that means the same one as calculated in part a).

 

Problem 2a):  “for the five predictor variables” means do not include the intercept SS in your addition.

Homework #4

Problem 1:  You do not need to calculate the standard error for the CI or the predicted value of Y when X=5 since they are given in the output.  You can get the value of t to use from the table in the book, or from tinv(area, df) in SAS.  The first sentence of the problem states that you are considering confidence intervals for the mean of Y.  This problem has nothing to do with computing CI’s for β_0 or β_1.
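A minimal sketch of the tinv call follows.  The 95% level (area 0.975) and the 18 degrees of freedom are made-up values; use the ones that match your problem.

```sas
/* Sketch only: area and df below are invented for illustration */
data _null_;
    tcrit = tinv(0.975, 18);   /* area to the left, then degrees of freedom */
    put tcrit=;                /* writes the critical value to the log */
run;
```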

 

Problem 2:  Don’t forget that the design matrix has two columns.  If you do things correctly, you should get a determinant of 200 for X’X, which should make the arithmetic not too messy (I planned it this way).  To check your work, you can try running SAS on the data and make sure you get the same b0 and b1; also try the /xpx i; options to the model statement.  However, no SAS is necessary with this problem and none should be handed in.
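If you want to check your hand calculation, the options go on the model statement like this.  The data below are invented and are not the homework data, so the matrices will not match yours.

```sas
/* Hypothetical data, for illustration only */
data check;
    input x y;
    datalines;
1 3
2 5
3 4
4 8
5 9
;
run;

proc reg data=check;
    model y = x / xpx i;   /* xpx prints the X'X crossproducts, i prints the inverse of X'X */
run;
```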

 

Problem 3:  If you do not have the text CD, the class website has a link to "Datasets".  Go to "Chapter 6" and then right-click and do "Save Target As" on "CH06PR18" and save it as CH06PR19.TXT.  The variables are in the following order: rate age expense vacancy squarefoot.

 

Problem 4:  The three CI’s are obtained from the same multiple linear regression model as in problem 3, i.e. one call to proc reg with three X variables in the model statement.  Do not run proc reg three times.

 

Problem 6:  Most people are finding the qqplot more informative.

 

Problem 7:  You could get the prediction by hand if you really wanted to, but getting the standard error for the CI would involve a lot of matrix calculations.  So use SAS for this problem and make your life easier.  To make the fake observation, you will need semicolons between the variable assignments, followed by one output statement.
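The fake-observation trick can be sketched like this.  The variable names, the data, and the X values for the new point are all hypothetical; the idea is that an observation with a missing Y is not used in fitting but still gets a predicted value and interval.

```sas
/* Hypothetical data and variable names */
data orig;
    input x1 x2 y;
    datalines;
1 5 10
2 7 13
3 6 15
4 9 20
5 8 21
6 12 27
;
run;

/* The fake observation: assignments separated by semicolons, then one output statement */
data fake;
    x1 = 3.5; x2 = 8; y = .;
    output;
run;

/* Append the fake observation to the real data */
data both;
    set orig fake;
run;

proc reg data=both;
    model y = x1 x2 / clm;   /* confidence limits for the mean at each observation */
run;
```

Replace clm with cli if the problem asks for a prediction interval for a new observation instead.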

Homework #3

Problem 1:  This is an open-ended problem which has more than one correct answer.  Most people find that proc univariate output and a histogram are useful.  Some also find stem-and-leaf and boxplots useful.

 

Problem 2c):  See top of page 2, topic 2, for how to create a variable “seq” that gives the position (line number) of each observation in the file, so that you can plot the residuals versus that number. 
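One common way to do this uses the automatic variable _n_, which may or may not be exactly what the notes show.  The data and variable names below are invented.

```sas
/* Hypothetical data */
data orig;
    input x y;
    seq = _n_;    /* _n_ counts data step iterations, so here it is the line number */
    datalines;
1 3
2 5
3 4
4 8
5 9
;
run;

proc reg data=orig;
    model y = x;
    output out=resids r=resid;   /* seq is carried along into the output dataset */
run;

symbol1 v=circle i=join;
proc gplot data=resids;
    plot resid*seq;
run;
```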

 

Problem 4a):  The problem does not specify the type of line (if any) to put on the plot, so straight or smoothed line, or none, are all acceptable.

Problem 4b):  Again the type of line for the scatterplot is not specified, but a straight line is most helpful in answering the question “Does the fit appear linear?”  The question about assumptions in this part refers only to those that can be examined using this type of plot.

Problem 4c):  Provide summaries of the variables as in problem 1; in addition, it is helpful to notice any gaps in the distributions and where they appear.

Problem 4d):  This question is asking for the equation of the line, as well as the residual plot and its interpretation.

Problem 4e):  For “Comment on the fit” please include an explanation of why some points fit the line better than others, for which the previous parts of this question are helpful.

Homework #2

Problem 4c):  Yes, you do need to change the axis on one of the plots because it says “using the same scales”.  It is actually the axis on the first plot you will need to change.  To plot yhat - ybar, it will be necessary to create a new dataset and a new variable.  This is done in a data step.  Remember as in the power calculation you can do arithmetic in a data step.  The steps are 1) begin a new dataset, 2) copy the old dataset into the new dataset (you need to copy the one that has the predicted values in it), 3) make a new variable using arithmetic.  Note that in step 3) it is not necessary to do the arithmetic in a loop;  with just one subtraction SAS will do it for all rows, and use the numerical value of the mean of y, which you can get from the proc reg output under “Dependent mean”.
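The three steps could be sketched as follows.  The data, the dataset names, and the variable name diff are hypothetical, and the value 4 is simply the mean of y in this made-up data; use the “Dependent mean” from your own output.

```sas
/* Hypothetical data; the mean of y here is 4 */
data orig;
    input x y;
    datalines;
1 2
2 4
3 5
4 4
5 5
;
run;

proc reg data=orig;
    model y = x;
    output out=preds p=yhat;   /* dataset containing the predicted values */
run;

data new;               /* 1) begin a new dataset */
    set preds;          /* 2) copy the old dataset into it */
    diff = yhat - 4;    /* 3) one subtraction, applied to every row */
run;
```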

 

Problem 5b):  If your plot looks too jagged, there are two possibilities.  Either you need to decrease the increment between successive power calculations (e.g. use "do beta1 = -2.5 to 2.5 by 0.25;" instead of "by 1"), or you are joining the points with straight lines instead of a smooth curve (e.g. use i=sm30 instead of i=join).
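Both fixes appear in the sketch below.  The values of n, sig2, ssx, and alpha are made up, and the power formula is the usual noncentral-t calculation for a two-sided test of the slope, which is assumed to match the one in the notes.

```sas
/* Sketch only: n, sig2, ssx, and alpha are invented values */
data power;
    n = 25; sig2 = 2500; ssx = 19800; alpha = 0.05;
    df = n - 2;
    sdb1 = sqrt(sig2 / ssx);            /* standard deviation of b1 */
    tstar = tinv(1 - alpha/2, df);      /* two-sided critical value */
    do beta1 = -2.5 to 2.5 by 0.25;     /* small increment gives more points */
        delta = abs(beta1) / sdb1;      /* noncentrality parameter */
        power = 1 - probt(tstar, df, delta) + probt(-tstar, df, delta);
        output;
    end;
run;

symbol1 v=none i=sm30;                  /* smooth curve instead of i=join */
proc gplot data=power;
    plot power*beta1;
run;
```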

 

Problem 6:  Another way to think about the problem is this.  Suppose that you had a dataset of size 30 for which r² was non-zero.  Could you pick out 10 points for which r² was zero?  Or, suppose you had a dataset of size 30 for which r² was zero.  Could you pick out 10 points for which r² was non-zero?  What if those 10 happened to come first?

 

Problem 8:  There is actually nothing special about zero, at least as far as df and "reduction" are concerned.  A parameter is either fixed (at, say, 0, or 5, or 2), or it is not fixed (or "free"), in which case it is estimated from the data.  You could imagine replacing the "2" and "5" in the problem with "0"'s, and the answer would be the same. A reduced model is simply a special case of a more general (or full) model, which is obtained by fixing one or more of the parameters.  Most often we are interested in the special cases which involve terms in the model disappearing (i.e. setting a parameter to 0), but in principle the method applies to any fixed values.

Homework #1

Problem 1c):  Remember that slope = rise / run.

 

Problem 4):  You can use the table in the back of the book to get an idea of the P-value.

 

Problem 5c):  A “point estimate” just means an estimate that is a single number (as opposed to an interval).

 

Problem 5d) and 6d):  Ask yourself i) would it be interesting and ii) do I have data near X=0?