Discussion 3: Linear Models and Matrix Algebra in R
In this discussion, we will explore how to fit linear models using
matrix algebra in R. A matrix is a rectangular array of numbers, and
every matrix has a dimension: we say the matrix A is an m x n
("m by n") matrix if it has m rows and n columns. A column vector has
dimension m x 1, and a row vector has dimension 1 x n. In general,
whenever we define a vector y, assume it is a column vector unless
told otherwise. Matrix operations differ from scalar operations, so
if you need a refresher, review matrix addition/subtraction,
multiplication, transposition, and inverses before proceeding.
Matrix Operations in R
The following R code demonstrates basic matrix operations in R.
First, let's define some matrices.
A = matrix(1,3,2) # Creates a 3x2 matrix of ones
B = matrix(rbinom(9,20,0.5),3,3) # Generates a 3x3 matrix of random Binomial variables with n=20, p=0.5
C = matrix(seq(0,10,length=6),2,3) # Creates a 2x3 matrix of a sequence between 0 and 10 of length 6
y = matrix(1:3,3,1) # Creates a column vector
Next, let's perform some matrix operations.
t(A) # Transpose of A
t(A) + C # Matrix addition
t(A) - C # Matrix subtraction
C %*% A # Matrix multiplication
B %*% y # Matrix times a column vector
solve(B) # Matrix inverse
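As a quick sanity check, we can confirm that the dimensions behave as
described and that solve really produces an inverse. This is a minimal
sketch; it assumes the randomly generated B happens to be nonsingular
(very likely here, but not guaranteed).
dim(A) # 3 2: A is 3x2
dim(t(A)) # 2 3: transposing swaps the dimensions
dim(C %*% A) # 2 2: a (2x3) matrix times a (3x2) matrix is 2x2
round(solve(B) %*% B, 10) # Should be the 3x3 identity matrix, up to rounding error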
Using Matrix Algebra to Fit Linear Models
We are going to use the 1985 Current Population Survey (CPS) data.
These data consist of a random sample of 534 persons from the CPS,
with information on wages and other characteristics of the workers,
including sex, number of years of education, years of work
experience, occupational status, region of residence and union
membership. The variables are as follows:
Variable  Description
educ      Number of years of education
south     Indicator variable for living in a southern region: S = lives in south, NS = does not live in south
sex       Gender: M = male, F = female
exper     Number of years of work experience (inferred from age and education)
union     Indicator variable for union membership: Union or Not
wage      Wage (dollars per hour)
age       Age (years)
race      Race: W = white, NW = not white
sector    Sector of the economy: clerical, const (construction), management, manufacturing, professional, sales, service, other
married   Marital status: Married or Single
Sources: The data are from Statistical Modeling: A Fresh Approach
(2010) by Danny Kaplan. The original source is The Practice of
Econometrics (1991) by E.R. Berndt.
We can import the data directly from the internet using the
following command:
cps = read.csv("http://www.macalester.edu/~kaplan/ism/datasets/cps.csv")
Now attach the data to our R session so we can refer to the
variables directly by name:
attach(cps)
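Before modeling, it is worth a quick look at the data to confirm it
loaded as expected:
head(cps) # First six rows
dim(cps) # Should report 534 rows (one per person) and one column per variable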
First consider the model we used in class yesterday:
E(wage) = b0 + b1*age + b2*sex + b3*age*sex
Create the observed response vector:
y = matrix(wage)
Then, the design matrix X:
n = dim(cps)[1] # Sample size = number of rows in cps
x0 = matrix(rep(1,n))
x1 = matrix(age)
x2 = matrix(as.numeric(sex=="M")) # Transforms sex to an indicator variable for "M"
x3 = matrix(as.numeric(sex=="M")*age)
X = cbind(x0,x1,x2,x3) # "cbind" = column bind
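As an aside, R builds this same design matrix internally whenever it
fits a model from a formula. As a check on our hand-built X (assuming
the default treatment contrasts, under which "M" becomes the indicator
level because the levels of sex sort as F, M):
X.check = model.matrix(~age*sex, data=cps) # The design matrix R constructs from the formula
head(X.check) # Columns: intercept, age, sexM, age:sexM
head(X) # Should match, apart from column names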
We can now calculate the following using matrix operations:
# Fitted coefficients
betahat = solve(t(X)%*%X)%*%t(X)%*%y
# Fitted values
yhat = X%*%betahat
# Vector of residuals
res = y - yhat
# Sum of squared error (residuals)
sse = sum(res^2)
# Estimate of sigma^2 (variance of the errors); the residual df is n-4 because the model has 4 coefficients
mse = sse/(n-4)
# Variance matrix of betahat
var.betahat = mse*solve(t(X)%*%X)
Now let's check our calculations using the lm function:
mod = lm(wage~age*sex)
summary(mod)
# See what we can get out of mod and its summary:
attributes(mod)
attributes(summary(mod))
# Fitted coefficients
betahat
mod$coef
# Fitted values (the first six)
head(yhat)
head(mod$fitted)
# Residuals (the first six)
head(res)
head(mod$resid)
# Residual degrees of freedom
n-4
mod$df.residual
# Estimate of sigma
sqrt(mse)
summary(mod)$sigma
# Standard errors of the coefficients
sqrt(diag(var.betahat))
summary(mod)$coef[,2]
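We can also verify the agreement all at once; a minimal sketch using
all.equal, which compares values up to floating-point tolerance
(as.numeric strips the names and dimensions so only the numbers are
compared):
all.equal(as.numeric(betahat), as.numeric(mod$coef)) # Coefficients agree
all.equal(as.numeric(yhat), as.numeric(mod$fitted)) # Fitted values agree
all.equal(sqrt(mse), summary(mod)$sigma) # Estimates of sigma agree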
Inference in Linear Models
Now that we have our fitted model, we can perform inference on the
coefficients. For example, we can do a t-test for H0: b1 = 0 versus
Ha: b1 != 0:
summary(mod)$coef[2,]
# Estimate
b1 = betahat[2]
b1
# Standard error
se = sqrt(diag(var.betahat))[2]
se
# Test statistic (named tstat so we do not mask R's transpose function t)
tstat = b1/se
tstat
# p-value
2*pt(abs(tstat),n-4,lower.tail=FALSE)
The p-value is greater than any reasonable significance level, but
that does not mean that age is not useful in predicting
wage. Why?
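From the same pieces we can also build a 95% confidence interval for
b1: the estimate plus or minus a t critical value times the standard
error. A minimal sketch, checked against R's confint:
b1 + c(-1,1)*qt(0.975,n-4)*se # 95% CI for b1 by hand
confint(mod)[2,] # The same interval from the fitted lm object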
Let's try an F-test comparing the full model above to the reduced
model
E(wage) = b0 + b2*sex
Rather than calling a built-in model-comparison function, we can carry
out this test with our own matrix calculations:
# SSE from full model (calculated previously)
sse
# df from full model (calculated previously)
df = n-4
# SSE from reduced model
X.r = cbind(x0,x2) # Reduced model design matrix
betahat.r = solve(t(X.r)%*%X.r)%*%t(X.r)%*%y # Reduced model coefficient estimates
res.r = y - X.r%*%betahat.r # Reduced model residuals
sse.r = sum(res.r^2) # Reduced model SSE
# df from reduced model
df.r = n-2
# Test statistic comparing the two models (named Fstat so we do not mask F, R's shorthand for FALSE)
Fstat = ((sse.r - sse)/(df.r - df))/(sse/df)
Fstat
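Under the null hypothesis that the reduced model is adequate, this
statistic follows an F distribution with df.r - df numerator and df
denominator degrees of freedom, so the p-value is the upper tail
probability:
pf(Fstat, df.r-df, df, lower.tail=FALSE) # p-value for the F-test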
The R function anova will perform F-tests for comparing two nested
models:
mod.r = lm(wage~sex)
anova(mod.r,mod)
The small p-value indicates that the full model has more predictive
power than the reduced model; i.e., adding age and the interaction
between age and sex to the reduced model improves its wage
predictions.
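Finally, since we attached cps at the start of the session, it is good
practice to detach it when finished so its variable names no longer
mask other objects:
detach(cps)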