# CMDA-4654

CMDA-4654 HW4
Due March 8th, 2019

Instructions

You must use R Markdown to do the assignment. Failure to compile your document results in a zero. You must turn in both your .Rmd file and your .pdf, named in the format Lastname_Firstname_HW4.pdf; please put both together in a .zip file with the same naming convention.

Generalized Least Squares (GLS)
The general linear model states that
$$Y = X\beta + \varepsilon, \quad \text{with} \quad \varepsilon \sim N(0, \sigma^2 V)$$

The matrix $V$ is a (symmetric) positive definite matrix, which means that we can factorize it as
$$V = LL^T$$
where $L$ is a lower-triangular matrix coming from the Cholesky decomposition. We can then multiply both sides of the linear model equation by $L^{-1}$, which gives
$$L^{-1}Y = L^{-1}X\beta + L^{-1}\varepsilon \quad\Longrightarrow\quad Z = B\beta + \eta$$
where $Z = L^{-1}Y$, $B = L^{-1}X$, and $\eta = L^{-1}\varepsilon$.

The generalized least squares estimate $\hat{\beta}$ of $\beta$ is the ordinary least squares estimate of $\beta$ in this new model, and it is given by
$$\hat{\beta} = (B^T B)^{-1} B^T Z = (X^T V^{-1} X)^{-1} X^T V^{-1} Y$$
which is found by minimizing the error sum of squares in the form
$$SSE = (Z - \hat{Z})^T (Z - \hat{Z}) = \cdots = (Y - X\beta)^T V^{-1} (Y - X\beta)$$

The purpose of this Cholesky transform is that under it
$$E[\eta] = 0$$
and
$$Var[\eta] = Var[L^{-1}\varepsilon] = L^{-1}[\sigma^2 V] L^{-T} = \sigma^2 L^{-1} L L^T L^{-T} = \sigma^2 I.$$
It can be shown that the estimator $\hat{\beta}$ is unbiased and that
$$Var(\hat{\beta}) = \sigma^2 (X^T V^{-1} X)^{-1}$$

Therefore, by using the transformation, we arrive at an ordinary least squares problem with homoscedastic observations $z_i$ and $\eta_i$, even though the original noise observations are correlated.
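As a quick numerical sanity check of this whitening argument, we can verify on a small hand-built positive definite matrix (an illustrative example, not the $V$ used later) that $L^{-1} V L^{-T} = I$:

```r
# A small symmetric positive definite matrix (illustrative only)
V <- matrix(c(1.00, 0.50, 0.25,
              0.50, 1.00, 0.50,
              0.25, 0.50, 1.00), nrow = 3)
L <- t(chol(V))      # chol() returns the upper-triangular factor, so transpose
Linv <- solve(L)
round(Linv %*% V %*% t(Linv), 10)  # the 3x3 identity matrix
```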

To get practice with this concept, let’s work through it for the Orange dataset, which is in the nlme package in R.

library(nlme)
Orange

   Tree  age circumference
1     1  118            30
2     1  484            58
3     1  664            87
4     1 1004           115
5     1 1231           120
6     1 1372           142
7     1 1582           145
8     2  118            33
9     2  484            69
10    2  664           111
(first 10 of 35 rows shown)
The Orange dataset contains the growth records of 5 different orange trees: the circumference of each tree was recorded at 7 different ages as the trees grew older and bigger over time. Let’s start out by fitting an ordinary least squares regression model to the data

fit <- lm(circumference ~ age, data = Orange)
raw_resids <- resid(fit)
The residuals from longitudinal data are correlated, so fitting an ordinary linear model to these data is incorrect: it doesn’t account for the correlation between measurements that arises from taking 7 different measurements on each tree in the study.

We can spread the residuals out for each of the 5 trees by creating a long data set and then reshaping it into a wide dataset as follows

library(reshape2)
dat <- data.frame(Orange, res = round(raw_resids, 4), time = rep(1:7, 5) )
D <- reshape(dat, idvar = "Tree", v.names = "res", times=1:7, direction = "wide")

# Did the below to fit on the screen

DT::datatable(D, options = list(dom = "t"))
Tree age circumference res.1 res.2 res.3 res.4 res.5 res.6 res.7
1 1 118 30 0.0015 -11.0765 -1.2951 -9.5971 -28.8339 -21.8885 -41.3103
8 2 118 33 3.0015 -0.0765 22.7049 31.4029 23.1661 39.1115 16.6897
15 3 118 30 0.0015 -18.0765 -13.2951 -16.5971 -33.8339 -24.8885 -46.3103
22 4 118 32 2.0015 -7.0765 23.7049 42.4029 30.1661 45.1115 27.6897
29 5 118 30 0.0015 -20.0765 -7.2951 0.4029 -6.8339 10.1115 -9.3103
Notice that the residuals for tree 1 are mostly negative, and the same is true for tree 3, while the residuals for tree 2 are mostly positive. If the OLS regression model assumptions held, the signs of the residuals would change randomly; since they do not, it is clear that the OLS regression model fails to account for the fact that the data are positively correlated for repeated observations taken on the same tree.

In this example, we will assume that the correlation matrix has an autocorrelation pattern. This is the so-called first-order autoregressive (AR1) structure, and the AR1 correlation matrix looks like the following
$$R = \begin{bmatrix} 1 & \rho & \rho^2 & \cdots & \rho^6 \\ \rho & 1 & \rho & \cdots & \rho^5 \\ \rho^2 & \rho & 1 & \cdots & \rho^4 \\ \vdots & \vdots & \vdots & \ddots & \vdots \\ \rho^6 & \rho^5 & \rho^4 & \cdots & 1 \end{bmatrix}$$

To estimate the lag-1 correlation present within observations from the same tree, we use the cor function.

r <- diag( cor( t(D[ ,4:9]), t(D[ ,5:10] ) ) );

print( rave <- mean(r) )
 0.3052431
The average correlation is approximately $\hat{\rho} = 0.305$. We can then create a $(7 \times 7)$ AR1-type correlation matrix where $R_{ij} = \hat{\rho}^{|i-j|}$.

R <- diag(7)
R <- 0.305^abs(row(R)-col(R))
round(R, 4)

``   [,1]   [,2]   [,3]   [,4]   [,5]   [,6]   [,7]``

[1,] 1.0000 0.3050 0.0930 0.0284 0.0087 0.0026 0.0008
[2,] 0.3050 1.0000 0.3050 0.0930 0.0284 0.0087 0.0026
[3,] 0.0930 0.3050 1.0000 0.3050 0.0930 0.0284 0.0087
[4,] 0.0284 0.0930 0.3050 1.0000 0.3050 0.0930 0.0284
[5,] 0.0087 0.0284 0.0930 0.3050 1.0000 0.3050 0.0930
[6,] 0.0026 0.0087 0.0284 0.0930 0.3050 1.0000 0.3050
[7,] 0.0008 0.0026 0.0087 0.0284 0.0930 0.3050 1.0000
Now let’s create a $(35 \times 35)$ matrix which is the Kronecker product $I_5 \otimes R$, which can be obtained as follows

I <- diag(5)
V <- I %x% R

# You can view this on your own and verify that it looks like the form below.

The Kronecker product $I_5 \otimes R$ creates a $(35 \times 35)$ matrix that looks like the following in block form

$$V = \begin{bmatrix} R & 0 & 0 & 0 & 0 \\ 0 & R & 0 & 0 & 0 \\ 0 & 0 & R & 0 & 0 \\ 0 & 0 & 0 & R & 0 \\ 0 & 0 & 0 & 0 & R \end{bmatrix}$$

Why this structure? Well, the observations from the same tree are correlated, but the trees themselves are assumed to be independent of one another. So we have these AR1 correlation sub-matrices along the diagonal of $V$.

We can then take the Cholesky decomposition and get the lower-triangular matrix LL as follows

L <- t( chol(V) ) # t() for transpose to lower-triangular
Linv <- solve(L)

# Check the dimension

dim(Linv)
 35 35
Now let’s transform the linear model by multiplying both sides of the equation by $L^{-1}$. This gives us $Z = L^{-1}Y$, $B = L^{-1}X$, and $\eta = L^{-1}\varepsilon$, which we can obtain as follows in R

Yvec <- Orange$circumference

# Get our model matrix with a built-in function

Xmat <- model.matrix(~age, data = Orange)

Zvec <- Linv %*% Yvec
Bmat <- Linv %*% Xmat
We can then apply ordinary regression using the model matrix $B$ in place of $X$ and the response $Z$ in place of $Y$. Here are the GLS regression coefficients

BTBinv <- solve( t(Bmat) %*% Bmat )
betahat_GLS <- BTBinv %*% t(Bmat) %*% Zvec
betahat_GLS

``              [,1]``

(Intercept) 10.3617701
age 0.1451746
These regression coefficients are preferable to the OLS regression coefficients because they account for the longitudinal (repeated measures) nature of the data. Compare these to the regression coefficients from OLS regression

betahat_OLS <- solve( t(Xmat) %*% Xmat ) %*% t(Xmat) %*% Yvec
betahat_OLS

``              [,1]``

(Intercept) 17.3996502
age 0.1067703
Notice that the slope and $y$-intercept differ between the GLS and OLS fits.

In practice, we don’t know the value of $\rho$, so we would estimate $\rho$ again from the new residuals, refit the model, and iterate until convergence.
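If you wanted to program this iteration yourself, a minimal sketch for the Orange example might look like the following (it rebuilds the objects from above for completeness; this is a moment-based iteration, not the likelihood-based procedure that gls() uses, so the two need not agree exactly):

```r
# Iterate: fit GLS at the current rho, then re-estimate rho from the residuals.
Yvec <- Orange$circumference
Xmat <- model.matrix(~ age, data = Orange)
rho  <- 0.305                              # starting value from above
for (iter in 1:25) {
  R    <- rho^abs(outer(1:7, 1:7, "-"))    # 7x7 AR1 correlation block
  V    <- diag(5) %x% R                    # block-diagonal: 5 independent trees
  Linv <- solve(t(chol(V)))
  B    <- Linv %*% Xmat
  Z    <- Linv %*% Yvec
  beta <- solve(t(B) %*% B, t(B) %*% Z)    # GLS coefficients at current rho
  res  <- matrix(Yvec - Xmat %*% beta, nrow = 7)  # residuals, one column per tree
  rho_new <- mean(diag(cor(t(res)[, 1:6], t(res)[, 2:7])))  # lag-1 correlation
  if (abs(rho_new - rho) < 1e-6) break     # stop once rho stabilizes
  rho <- rho_new
}
c(beta, rho = rho)
```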

While this is something you could potentially program yourself, let’s just use a function in R.

Using R to obtain a GLS fit.

Now we can obtain the regression coefficients and other model parameters more easily, without all of this work, by using the gls() function from the nlme library. The gls() function is designed for fitting linear models by generalized least squares, and it estimates both the variance-covariance parameters and the regression coefficients. We can obtain the fit using the following commands in R.

modfit_gls <- gls( circumference ~ age, data = Orange,
                   correlation = corAR1(form = ~1|Tree) )

summary(modfit_gls)
Generalized least squares fit by REML
Model: circumference ~ age
Data: Orange

``   AIC      BIC    logLik``

299.1256 305.1116 -145.5628

Correlation Structure: AR(1)
Formula: ~1 | Tree
Parameter estimate(s):

``  Phi ``

0.8182431

Coefficients:

``            Value Std.Error   t-value p-value``

(Intercept) 23.790723 11.776313 2.020218 0.0515
age 0.096486 0.008783 10.985747 0.0000

Correlation:

``(Intr)``

age -0.657

Standardized residuals:

``   Min         Q1        Med         Q3        Max ``

-1.4517369 -0.5084699 -0.1265608 0.9420953 2.1051595

Residual standard error: 25.09551
Degrees of freedom: 35 total; 33 residual
The correlation parameter was computed iteratively, so using the gls() function is much better than the work we did by hand. The within-tree measurements are correlated with parameter $\hat{\rho} = 0.818$, which corresponds to the AR1 correlation parameter (labeled Phi in the output). Now the variance-covariance matrix for this problem is $\Sigma = \sigma^2 V$, where $V$ contains AR1 correlation sub-matrices for the repeated measures on the trees. The parameter $\sigma$ is the residual standard error, estimated to be $\hat{\sigma} = 25.09$. We can calculate the estimated variance-covariance matrix by

rho <- 0.81
sigma <- 25.09
V <- diag(7)

# R[i,j] = rho^|i - j| # Autoregressive correlation submatrix.

V <- rho^abs(row(V)-col(V))
I <- diag(5)
V <- kronecker(I,V)
Sigma <- sigma^2*V
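As an aside, rather than typing the estimates in by hand as above, we can extract them directly from the fitted gls object (the `$modelStruct$corStruct` accessor is part of nlme; the fit is repeated here for completeness):

```r
library(nlme)
modfit_gls <- gls(circumference ~ age, data = Orange,
                  correlation = corAR1(form = ~1 | Tree))
sigma_hat <- modfit_gls$sigma                  # residual standard error
phi_hat   <- coef(modfit_gls$modelStruct$corStruct,
                  unconstrained = FALSE)       # the AR1 parameter Phi
c(sigma = sigma_hat, phi = phi_hat)
```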
Problem 1 (25 points)
Now let’s do the following generalized least squares regression problem using the sleepstudy data in the lme4 package in R.

library(lme4)

  Reaction Days Subject
1 249.5600    0     308
2 258.7047    1     308
3 250.8006    2     308
4 321.4398    3     308
5 356.8519    4     308
6 414.6901    5     308
(first 6 rows of the sleepstudy data shown)
Problem 1 (a)
Find an OLS fit with $y$ = Reaction and $x$ = Days. Check the assumptions of the model. Clearly the observations are correlated. Provide a scatterplot that uses different colors (and symbols if you like) based upon the Subject.

Problem 1 (b)
Use the gls() function to find a generalized least squares fit. How does this fit compare to what you found using OLS in part (a)?

Problem 2 (Weighted Least Squares) (25 points)
Weighted Least Squares is a special case of GLS where the errors are uncorrelated but have unequal variances, and the form of the inequality is known.

In this case $Var(\varepsilon_i) = \sigma_i^2$ for $i = 1, \ldots, n$.

Hence the variance-covariance matrix is of the form
$$Var(\varepsilon) = \begin{bmatrix} \sigma_1^2 & & \\ & \ddots & \\ & & \sigma_n^2 \end{bmatrix}$$
If we define the reciprocal of each variance $\sigma_i^2$ as the weight, $w_i = 1/\sigma_i^2$, then let $W$ be a diagonal matrix containing these weights:
$$W = \operatorname{diag}(w_1, \ldots, w_n)$$
The weighted least-squares estimate is then
$$\hat{\beta}_{WLS} = (X^T W X)^{-1} X^T W Y$$
This is equivalent to carrying out regression on the model
$$W^{1/2} Y = W^{1/2} X \beta + W^{1/2} \varepsilon$$
where $W^{1/2}$ is a diagonal matrix containing the square roots of the diagonals of $W$. Hopefully you can begin to see how this is a special case of GLS.
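To see the equivalence concretely, here is a quick check on simulated data (the model, seed, and weights $w_i = 1/x_i$ are illustrative choices, not part of the assignment): the matrix formula for $\hat{\beta}_{WLS}$ reproduces what lm() returns through its weights argument.

```r
set.seed(1)
x <- 1:20
y <- 2 + 0.5 * x + rnorm(20, sd = sqrt(x))  # error variance grows with x
w <- 1 / x                                  # illustrative weights
X <- cbind(1, x)
W <- diag(w)
beta_wls <- solve(t(X) %*% W %*% X) %*% t(X) %*% W %*% y
cbind(matrix_formula = beta_wls, lm_weights = coef(lm(y ~ x, weights = w)))
```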

Since each weight is inversely proportional to the error variance, it reflects the information in that observation. So, an observation with small error variance has a large weight since it contains relatively more information than an observation with large error variance (small weight).

When it comes to weighted least-squares, the most important thing to consider is how to determine the weights.

There are several common ways to obtain the weights depending on the type of information that we have.

Option 1: The errors are proportional to the predictor variable, $Var(\varepsilon_i) \propto x_i$; then $w_i = x_i^{-1}$. We often choose this option after observing a positive relationship in the plot of $|e_i|$ (the absolute values of the residuals) versus $x_i$. (Another option might be $w_i = x_i^{-2}$, for the case $Var(\varepsilon_i) \propto x_i^2$.)

Option 2: When the $Y_i$ are the averages of $n_i$ observations, then $Var(Y_i) = Var(\varepsilon_i) = \sigma^2/n_i$, so $w_i = n_i$.

Option 3: When the observed responses are known to be of varying quality, weights may be assigned as
$$w_i = \frac{1}{\hat{\sigma}_i^2} = \frac{1}{sd(y_i)^2}$$

Option 4: This is a combination of options 2 and 3. Suppose the $Y_i$ are the averages of $n_i$ observations, and $Var(Y_i) = Var(\varepsilon_i) = \sigma_i^2/n_i$; then
$$w_i = \frac{n_i}{\hat{\sigma}_i^2}$$

Option 5: (This is what is done most often in practice.) Suppose we have visual evidence that the variance is not constant; generally this comes from a plot of the residuals $e_i$ versus the fitted values $\hat{y}_i$. We can use our data to estimate weights by following the procedure below.

(a) Store the residuals and fitted values from the OLS regression.

(b) Calculate the absolute values of the residuals, $|e_i|$.

(c) Regress the absolute values $|e_i|$ on the fitted values $\hat{y}_i$ and store the fitted values from this regression. These fitted values are an estimate of the error standard deviations, i.e. $\sqrt{Var(Y \mid X = x)}$.

(d) Calculate the weights $w_i = \dfrac{1}{(\text{fitted values})^2} = \dfrac{1}{\widehat{SD}^2}$.

This process can be repeated if the WLS regression residuals still show evidence of heteroscedasticity (i.e. nonconstant variation). If we repeat steps (a)-(d) in order to stabilize the variance then we are actually performing a version of iteratively reweighted least squares (IRLS) regression. This procedure could be employed in a multiple regression setting as well.
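Steps (a)–(d) can be sketched in R on simulated data (simulated here so as not to give away the pipeline analysis below; the model and seed are arbitrary):

```r
set.seed(2)
x <- runif(100, 1, 10)
y <- 1 + 2 * x + rnorm(100, sd = 1 + 0.5 * x)      # nonconstant error variance
fit_ols <- lm(y ~ x)                               # (a) OLS fit
abs_res <- abs(resid(fit_ols))                     # (b) |e_i|
sd_hat  <- fitted(lm(abs_res ~ fitted(fit_ols)))   # (c) estimated error SDs
wts     <- 1 / sd_hat^2                            # (d) weights
fit_wls <- lm(y ~ x, weights = wts)
coef(fit_wls)
```

Repeating steps (b)–(d) with the residuals from fit_wls would give one pass of the IRLS idea described above.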

Problem 2 (a)
Researchers at the National Institute of Standards and Technology (NIST) collected pipeline data on ultrasonic measurements of the depths of defects in the Alaska pipeline in the field. The depths of the defects were then remeasured in the laboratory. These measurements were performed in six different batches. It turns out that this batch effect is not significant and can be ignored in the analysis that follows. The laboratory measurements are more accurate than the in-field measurements, but also more time consuming and expensive. We want to develop a regression equation for correcting the in-field measurements.

The data can be found in the faraway library in the dataframe called pipeline.

Fit a regression model with $y$ = Lab and $x$ = Field. Check for non-constant variance.

Problem 2 (b)
Use the method of weighted least-squares to obtain a WLS fit of the variables in (a). You’ll have to find the weights based upon the method outlined in option 5. Once you have the weights, to obtain a weighted OLS fit in R, simply do the following:

modfit_weighted <- lm(Lab ~ Field, weights = wts, data = pipeline)
Problem 2 (c)
Overlay the two regression fit lines from (a) and (b). Comment on any differences you may see.

Problem 2 (d)
The point of using weights is to account for the non-constant variance.

Below, we split the range of Field into 12 groups of size nine (except for the last group which has only eight values). Within each group, we compute the variance of Lab as varlab and the mean of Field as meanfield. Supposing pipeline is the name of your data frame, the following R code will make the needed computations:

i <- order(pipeline$Field)
npipe <- pipeline[i, ]
ff <- gl(12, 9)[-108]
meanfield <- unlist( lapply(split(npipe$Field, ff), mean) )
varlab <- unlist( lapply(split(npipe$Lab, ff), var) )
Suppose we guess that the variance in the response is linked to the predictor in the following way:
$$Var(Lab) = a_0 \, Field^{a_1}$$
Regress log(varlab) on log(meanfield) to estimate $a_0$ and $a_1$.

Use this to determine appropriate weights in the WLS fit of Lab on Field. Show the regression summary and compare with the result you got from (b) which uses the method outlined in Option 5.

Problem 3 (Weighted Least Squares on Aggregated Data) (25 points)
The crawl data set gives us an example of data where the variation is different in each group. The data come from a study that investigated whether babies take longer to learn to crawl in cold months, when they are often bundled in clothes that restrict their movement, than in warmer months. The study sought an association between babies' first crawling age and the average temperature during the month they first try to crawl (about 6 months after birth). Parents brought their babies into the University of Denver Infant Study Center between 1988 and 1991 for the study. The parents reported the birth month and the age at which their child was first able to creep or crawl a distance of four feet in one minute. Data were collected on 208 boys and 206 girls (40 pairs of which were twins).

library(faraway)
crawl

          crawling   SD  n temperature
January      29.84 7.08 32          66
February     30.52 6.96 36          73
March        29.70 8.33 23          72
April        31.84 6.21 26          63
May          28.58 8.07 27          52
June         31.44 8.10 29          39
July         33.64 6.91 21          33
August       32.82 7.61 45          30
September    33.83 6.93 38          33
October      33.35 7.29 44          37
(first 10 of 12 rows shown)
Notice that the data are aggregated into groups according to the month in which the baby was first able to crawl, and that both the mean and the standard deviation vary by month. According to WLS theory, the optimal weight is the inverse of the squared standard error, $w_i = n_i/\hat{\sigma}_i^2$, where $n_i$ and $\hat{\sigma}_i$ are the sample size and sample standard deviation for the $i$th group.

Shown below is a plot where the size of each data point varies based upon the optimal weight $w_i = n_i/\hat{\sigma}_i^2$.

library(ggplot2)
wt <- crawl$n/(crawl$SD^2)
ggplot(crawl, aes(y=crawling, x=temperature, size = wt, col=wt)) + geom_point()

Problem (3) Part (a)
Fit a SLR model using ordinary least squares (OLS) with $y$ = crawling and $x$ = temperature.

Problem (3) Part (b)
Fit a new model using weighted least squares (WLS) by choosing appropriate weights.

Problem (3) Part (c)
Compare the fits from (a) and (b) by producing a plot for each model and then comment upon which model best fits the data from a graphical standpoint.

Problem (3) Part (d)
Change the SDs for January–April to 50, 65, 55, and 66, respectively, and refit using WLS. Compare this fit to the OLS fit.

Problem 4 (Robust Regression) (25 points)
Consider the Animals dataset from the MASS library.

Problem 4(a)
Fit an OLS model using $y = \ln(\text{brain})$ and $x = \ln(\text{body})$.

Problem 4(b)
Use diagnostics to check model assumptions and to find outliers and influential points.

Problem 4(c)
Use the Huber loss function to fit a robust regression model on the transformed variables from (a).

Problem 4(d)
Use the biweight loss function to fit a robust regression model on the transformed variables from (a).

Problem 4(e)
Compare the three model fits to the data. Which weight function do you think best fits the data?