Important Dates/Information

Due date: Friday, February 23, 2018, 11:59 PM. No late submissions allowed! Please submit to Blackboard.

Update to submission guidelines: Submit both the RMD and PDF files. R script files are not necessary, nor will they be accepted. Please name all files in the following format: _lab#_mmdd.extension

Introduction

Overview

Learning Goals

  1. Understand backfitting
  2. Use gam to build a Generalized Additive Model (GAM)

Generalized Additive Models

Thus far, we have learned about linear models, ridge regression, and lasso regression. Your understanding should be general enough that if I told you something called polynomial regression existed, then regardless of how it is fit computationally you would understand that, given a set of covariates, we would regress on them to best fit the response variable (potentially using polynomials to improve the fit). The point I am trying to impress upon you is that we already have many ways to regress on data, and each represents an incremental improvement over the others. What can lead to large gains in model performance now is using an additive combination of general functions of the covariates. Let’s dig in.

Assume for a second we have some data which appears to have a non-linear relationship between covariates and response. We want to model it as a sum of general functions, each a function of a single covariate. For our covariates \(X_1 \dots X_p\), we define a set of \(p\) functions \(\{ f_1(X_1),f_2(X_2),\dots,f_p(X_p) \}\) (or a sequence, if we are being precise). We let our response, \(Y\), be the sum of these \(p\) functions (plus an intercept, \(\alpha\), and an error term, \(\epsilon\)).

\[ Y=\alpha + f_1(X_1) +f_2(X_2)+\dots+f_p(X_p) + \epsilon= \alpha + \sum_{j=1}^{p} f_j(X_j)+\epsilon\] We have an additive model of general functions. Straightforward? Good. Now the question is how we fit the respective \(f_j\)s. To compute them we use a technique called backfitting. Backfitting updates the fit for each predictor in turn while holding the others fixed. We do this by fitting to a partial residual: that is, we fit the term for variable 1, subtract that fit from the response, and use the resulting residual to fit the term for variable 2, and so on. Let’s go through an example:

First we will create some fake data. The beauty of this is that we know a priori what our \(\beta\) values are (and that the true model really is a linear additive one).

# three covariates
x1 <- 1:500
x2 <- runif(500,45,95)
x3 <- runif(500,0.001,900)

# true coefficients and noise level
b0 <- 17
b1 <- 0.5
b2 <- 0.037
b3 <- -5.2
sigma <- 0.2

# generate the response from the true linear additive model
eps <- rnorm(length(x1),0,sigma)
y <- b0 + b1*x1  + b2*x2  + b3*x3 + eps

df <- as.data.frame(cbind(y,x1,x2,x3))

Now we want to start estimating the coefficients (we will get to \(\beta_0\) in a second). We begin by specifying an arbitrary starting value for \(\beta_1\) (this is a convergent process, so the starting value need not be particularly close). In this case we let \(\beta_1 = 0\). We then subtract out the ‘predicted’ value of \(f_1(X_1)\), which here is simply \(\beta_1 X_1\) because we are choosing linear additive terms. We call this first partial residual a. We then fit a linear model for a as a function of x2 and pull out the slope coefficient from that fit (the second coefficient of the model, which plays the role of \(\beta_2\) in our parlance), giving us an updated estimate. We also need an intercept, so for simplicity we use the mean of y.

# estimate beta1
beta0 <-mean(df$y)
beta1 <-0
beta2 <-0
beta3 <-0

# get our first partial residual based on our estimate of beta1
a <- df$y-beta0

# now get beta2 by fitting to the first partial
beta2 <- lm(a~df$x2)$coef[2]

# do it once more (calculate second partial residual)
# note: reusing variables like I am is bad practice
a <- df$y-beta0-beta1*df$x1 

# get beta3 
beta3 <- lm(a~df$x3)$coef[2]

beta0
## [1] -2125.673
beta1
## [1] 0
beta2
##      df$x2 
## -0.4165455
beta3
##     df$x3 
## -5.217721

Some of our estimates are already quite close after only one pass of this algorithm! We still need to update the estimate for the first coefficient. On every subsequent iteration the process is: for the first variable, use the current estimates for all the other predictors to form a partial residual, then fit a model (in this case linear) that predicts this residual from the first variable and update its coefficient. Move on to the next variable: use the best current estimates for the other variables to form the second partial residual, then fit a model that predicts it from the second variable. Continue cycling through the variables until you are satisfied.

# iterate again for x1
a <- df$y -beta0 -beta2*df$x2 -beta3*df$x3

# pull out beta1 again
beta1 <- lm(a~x1)$coef[2]

# compute for x2
a <- df$y - beta0 - beta1*df$x1 -beta3*df$x3

# pull out beta2
beta2 <- lm(a~df$x2)$coef[2]

# compute for x3
a <-df$y-beta0-beta1*df$x1-beta2*df$x2

# pull out beta3
beta3 <- lm(a~df$x3)$coef[2]


beta1
##        x1 
## 0.4971731
beta2
##      df$x2 
## 0.03818379
beta3
##     df$x3 
## -5.200143

Above, we ran the algorithm one more time. Knowing the values that went in, we recovered them quite well in just this second pass. In practice you keep repeating this iterative process until the change in the estimates between successive iterations is sufficiently small.
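For completeness, here is a minimal sketch of wrapping these updates in a loop with a simple convergence check (not part of the original walkthrough; the tolerance tol and the 100-iteration cap are arbitrary choices):

# restart from the same initial values as before
beta0 <- mean(df$y)
beta1 <- 0
beta2 <- 0
beta3 <- 0
tol <- 1e-8

for (it in 1:100) {
  old <- c(beta1, beta2, beta3)

  # cycle through the coefficients, refitting each to the partial
  # residual formed with the current estimates of the others
  a <- df$y - beta0 - beta2*df$x2 - beta3*df$x3
  beta1 <- lm(a ~ df$x1)$coef[2]

  a <- df$y - beta0 - beta1*df$x1 - beta3*df$x3
  beta2 <- lm(a ~ df$x2)$coef[2]

  a <- df$y - beta0 - beta1*df$x1 - beta2*df$x2
  beta3 <- lm(a ~ df$x3)$coef[2]

  # stop once the estimates have essentially stopped changing
  if (max(abs(c(beta1, beta2, beta3) - old)) < tol) break
}

c(beta0, beta1, beta2, beta3)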

GAM In Practice

Backfitting (which is in fact a grouped cyclic coordinate descent algorithm) is useful to understand at a practical level. However, when actually using additive models we will want to leverage the power of R. In R, additive models are fit with the gam function, provided we have loaded the gam package.

library('gam')
## Loading required package: splines
## Loading required package: foreach
## Loaded gam 1.14-4
data("kyphosis")

# build up an example (excuse the pun)
ex.gam.ple  <- gam(Kyphosis~s(Age,4)+s(Start,4)+Number,family=binomial,data=kyphosis)

In the gam world, we apply smoothing functions to the individual variables; these play the role of the per-variable linear fits in the backfitting example above. Smoothing here means applying a smoothing (cubic) spline to the data, and the syntax, as seen in the example, is to wrap the variable in s. The resulting model is an additive combination of these smoothed terms.
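A quick way to inspect what the smooths actually look like (a sketch, not part of the original lab; the panel layout is an arbitrary choice) is to plot each fitted term with its standard-error bands:

# one panel per term; the Age and Start panels reveal any non-linear shape
par(mfrow = c(1, 3))
plot(ex.gam.ple, se = TRUE)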

However, it is important to know whether we should smooth at all. Perhaps a variable contributes nothing, or perhaps it is better left un-smoothed. For this, gam lets us compare models against each other. Here we evaluate whether Age is important and, if so, whether it should be smoothed.

gam.1 <- gam(Kyphosis~Number,family=binomial,data = kyphosis)
gam.2 <- gam(Kyphosis~Age+Number,family=binomial,data = kyphosis)
gam.3 <- gam(Kyphosis~s(Age,4)+Number,family=binomial,data = kyphosis)

anova(gam.1,gam.2,gam.3,test = "F")
## Warning: using F test with a 'binomial' family is inappropriate
## Analysis of Deviance Table
## 
## Model 1: Kyphosis ~ Number
## Model 2: Kyphosis ~ Age + Number
## Model 3: Kyphosis ~ s(Age, 4) + Number
##   Resid. Df Resid. Dev Df Deviance      F  Pr(>F)  
## 1        79     73.357                             
## 2        78     71.627  1   1.7305 1.7305 0.18835  
## 3        75     63.107  3   8.5193 2.8398 0.03641 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

From this output we can see that the third model (gam.3) performs significantly better than the model without Age or with Age entered linearly (un-smoothed). (The warning reminds us that, for a binomial family, a chi-squared test on the deviance is the more appropriate comparison.) Remember also that we are dealing with generalized additive models, so we could also perform local regression using the lo function instead of s.
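As a side note (not part of the original analysis), here is a minimal sketch of both ideas: repeating the comparison with a chi-squared test, and smoothing Age with local regression via lo instead of a smoothing spline. The span value of 0.5 is an arbitrary illustrative choice.

# chi-squared test is the appropriate deviance comparison for a binomial family
anova(gam.1, gam.2, gam.3, test = "Chisq")

# same model idea as gam.3, but smoothing Age with local regression (lo)
gam.3.lo <- gam(Kyphosis ~ lo(Age, span = 0.5) + Number, family = binomial, data = kyphosis)
summary(gam.3.lo)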

What if, however, we have many variables and we need to determine (1) whether we should include them at all and (2) whether we should smooth them? We are certainly not the first people to face this dilemma, so naturally the package has a component that will step through these combinations for us. We can run a stepwise search with step.gam using the syntax below. For each variable we consider excluding it, including it linearly, or including it via a smooth or local-regression term. There are a variety of ways to write this syntactically, but what follows is one simple example:

gam.obj <- gam(Kyphosis~Age+Number+Start,family=binomial,data=kyphosis)
gam.step <- step.gam(gam.obj,scope = list("Age" =~ 1+ Age + s(Age, df = 2)+ s(Age, df = 3)+ s(Age, df = 5)+ lo(Age),
        "Number" =~ 1+ Number+ s(Number, df = 2)+ s(Number, df = 3)+ s(Number, df = 5)+ lo(Number),
        "Start" =~ 1+ Start+ s(Start, df = 2)+ s(Start, df = 3)+ s(Start, df = 5)+ lo(Start)),
        direction = "both",
        trace=2)
## Start:  Kyphosis ~ Age + Number + Start; AIC= 69.3799 
## Trial:  Kyphosis ~  1 + Number + Start ; AIC= 70.5365 
## Trial:  Kyphosis ~  s(Age, df = 2) + Number + Start ; AIC= 65.3886 
## Trial:  Kyphosis ~  Age + 1 + Start ; AIC= 71.2991 
## Trial:  Kyphosis ~  Age + s(Number, df = 2) + Start ; AIC= 69.6724 
## Trial:  Kyphosis ~  Age + Number + 1 ; AIC= 77.6266 
## Trial:  Kyphosis ~  Age + Number + s(Start, df = 2) ; AIC= 67.3732 
## Step:1 Kyphosis ~ s(Age, df = 2) + Number + Start ; AIC= 65.3886 
## Trial:  Kyphosis ~  s(Age, df = 3) + Number + Start ; AIC= 66.1469 
## Trial:  Kyphosis ~  s(Age, df = 2) + 1 + Start ; AIC= 67.4701 
## Trial:  Kyphosis ~  s(Age, df = 2) + s(Number, df = 2) + Start ; AIC= 65.771 
## Trial:  Kyphosis ~  s(Age, df = 2) + Number + 1 ; AIC= 72.9099 
## Trial:  Kyphosis ~  s(Age, df = 2) + Number + s(Start, df = 2) ; AIC= 63.0675 
## Step:2 Kyphosis ~ s(Age, df = 2) + Number + s(Start, df = 2) ; AIC= 63.0675 
## Trial:  Kyphosis ~  s(Age, df = 3) + Number + s(Start, df = 2) ; AIC= 63.6246 
## Trial:  Kyphosis ~  s(Age, df = 2) + 1 + s(Start, df = 2) ; AIC= 63.3467 
## Trial:  Kyphosis ~  s(Age, df = 2) + s(Number, df = 2) + s(Start, df = 2) ; AIC= 63.5205 
## Trial:  Kyphosis ~  s(Age, df = 2) + Number + s(Start, df = 3) ; AIC= 63.6165

This shows us which combinations were tried and the AIC for each (lower is better). From this we can see which combinations, and which smoothings, of the data produce the best models! This, of course, should be done with validation in mind, i.e., performing this selection only on training data.
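For instance, here is a minimal sketch of what that could look like on the kyphosis data. The 80/20 split, the seed, the trimmed scope, the object names (kyph.train, kyph.test, gam.step.train), and the Brier-score check are all illustrative choices, not part of the original lab.

# hold out a test set so model selection only ever sees training data
set.seed(1)
train.idx <- sample(nrow(kyphosis), size = floor(0.8 * nrow(kyphosis)))
kyph.train <- kyphosis[train.idx, ]
kyph.test  <- kyphosis[-train.idx, ]

# run the stepwise search on the training data only (scope trimmed for brevity)
gam.train <- gam(Kyphosis ~ Age + Number + Start, family = binomial, data = kyph.train)
gam.step.train <- step.gam(gam.train,
        scope = list("Age"    =~ 1 + Age + s(Age, df = 2),
                     "Number" =~ 1 + Number + s(Number, df = 2),
                     "Start"  =~ 1 + Start + s(Start, df = 2)),
        direction = "both",
        trace = FALSE)

# judge the selected model on the held-out test set (Brier score)
test.pred <- predict(gam.step.train, newdata = kyph.test, type = "response")
mean((as.numeric(kyph.test$Kyphosis == "present") - test.pred)^2)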


Questions

Suppose that we would like to perform multiple linear regression, but we do not have software to do so. Instead, we only have software to perform simple linear regression. Therefore, we take the following iterative approach: we repeatedly hold all but one coefficient estimate fixed at its current value, and update only that coefficient estimate using a simple linear regression. The process is continued until convergence—that is, until the coefficient estimates stop changing. (5 points)

Perform your own backfitting calculations to create an additive linear model for the ToyExample.csv data. Follow these steps

  1. Initialize beta1 to a value of your choice. It does not matter what value you choose.
  2. Keeping beta1 fixed, fit the model to the first partial residual, e.g. a=y-beta1*x1; beta2=lm(a~x2)$coef[2]
  3. Now keeping beta2 fixed, fit the model for beta1: a=y-beta2*x2; beta1=lm(a~x1)$coef[2]

Loop through this process 100 times. Report the beta estimates at each iteration. Plot each coefficient's path on the same plot in different colors. Additionally, estimate the parameters using ordinary linear regression. How many iterations were required to obtain 'good' estimates?


Now load in the previously used College dataset and answer the following questions (5 pts)

  1. Split the data into an 85% training set and a 15% test set. Fit a GAM using gam on the training data, using out-of-state tuition as the response. Plot the results and explain your findings. For which variables, if any, is there evidence of a non-linear relationship with the response?

  2. Tune the parameters of the GAM based on the results of the previous part and create 4 different models (4 models using gam). Use anova() to determine which of these models is best. Compare your best models with gam.selection() and step.gam().

  3. Interpret why one model may have performed better than the others.