Discussion 3: Linear Models and Matrix Algebra in R


In this discussion, we will explore how to fit linear models using matrix algebra in R. A matrix is just an array of numbers. Every matrix has a dimension. We say the matrix A is an m x n ("m by n") matrix if it has m rows and n columns. A column vector has dimension m x 1. A row vector has dimension 1 x n. In general, whenever we define a vector y, assume it is a column vector unless told otherwise.
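For example, we can create a 3 x 2 matrix in R and confirm its dimension (D is just a throwaway name for illustration):

D = matrix(0, 3, 2) # A 3 x 2 matrix of zeros: 3 rows, 2 columns
dim(D)              # Returns the dimension: 3 2
nrow(D)             # Number of rows: 3
ncol(D)             # Number of columns: 2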

Matrix operations differ from scalar operations. In particular, it is worth reviewing matrix addition/subtraction, multiplication, transposition, and inverses before proceeding.
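One difference worth flagging immediately: in R, the * operator multiplies matrices elementwise, while %*% performs matrix multiplication. A minimal illustration, using a throwaway matrix M:

M = matrix(1:4, 2, 2) # A 2x2 matrix, filled column by column
M * M                 # Elementwise product: each entry is squared
M %*% M               # Matrix product: rows of M times columns of M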

Matrix Operations in R

The following R code demonstrates basic matrix operations in R. First, let's define some matrices.
A = matrix(1,3,2)                      # Creates a 3x2 matrix of ones
B = matrix(rbinom(9,20,0.5),3,3)       # Creates a 3x3 matrix of random Binomial(n=20, p=0.5) draws
C = matrix(seq(0,10,length.out=6),2,3) # Creates a 2x3 matrix from a sequence of 6 evenly spaced values between 0 and 10
y = matrix(1:3,3,1)                    # Creates a 3x1 column vector
Next, let's perform some matrix operations.
t(A)     # Transpose of A
t(A) + C # Matrix addition
t(A) - C # Matrix subtraction
C %*% A  # Matrix multiplication
B %*% y  # Matrix-vector multiplication
solve(B) # Matrix inverse
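As an optional sanity check, a matrix multiplied by its inverse should return the identity matrix. (Since B is random, in rare cases rbinom could produce a singular matrix, in which case solve would fail.)

solve(B) %*% B            # Should be the 3x3 identity matrix
round(solve(B) %*% B, 10) # Rounding clears away floating-point noise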

Using Matrix Algebra to Fit Linear Models

We are going to use the 1985 Current Population Survey (CPS) data. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. The variables are as follows:

Variable   Description
educ       Number of years of education
south      Indicator variable for living in a southern region: S = lives in south, NS = does not live in south
sex        Gender: M = male, F = female
exper      Number of years of work experience (inferred from age and education)
union      Indicator variable for union membership: Union or Not
wage       Wage (dollars per hour)
age        Age (years)
race       Race: W = white, NW = not white
sector     Sector of the economy: clerical, const (construction), management, manufacturing, professional, sales, service, other
married    Marital status: Married or Single
Sources: The data are from Statistical Modeling: A Fresh Approach (2010) by Danny Kaplan. The original source is The Practice of Econometrics (1991) by E.R. Berndt.

We can import the data directly from the internet using the following command:
cps = read.csv("http://www.macalester.edu/~kaplan/ism/datasets/cps.csv")
Now attach the data to our R session so we can refer to the variables directly by name:
attach(cps)
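It is worth taking a quick look at the data before fitting anything; head, str, and dim are standard R functions for inspecting a data frame:

head(cps) # First six rows of the data
str(cps)  # Variable names, types, and a preview of each column
dim(cps)  # 534 rows (workers) and 10 columns (variables)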
First consider the model we used in class yesterday:

E(wage) = b0 + b1*age + b2*sex + b3*age*sex

Create the observed response vector:
y = matrix(wage)
Then, the design matrix X:
n = dim(cps)[1]                       # Sample size = number of rows in cps
x0 = matrix(rep(1,n))                 # Column of ones for the intercept
x1 = matrix(age)                      # Age
x2 = matrix(as.numeric(sex=="M"))     # Transforms sex to an indicator variable for "M"
x3 = matrix(as.numeric(sex=="M")*age) # Interaction between the male indicator and age

X = cbind(x0,x1,x2,x3) # "cbind" = column bind
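A quick check that the design matrix has the expected shape, one row per worker and one column per coefficient:

dim(X)  # Should be n x 4
head(X) # First six rows: intercept, age, male indicator, interaction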
We can now calculate the following using matrix operations:
# Fitted coefficients
betahat = solve(t(X)%*%X)%*%t(X)%*%y

# Fitted values
yhat = X%*%betahat

# Vector of residuals
res = y - yhat

# Sum of squared error (residuals)
sse = sum(res^2)

# Estimate of sigma^2 (variance of the errors)
mse = sse/(n-4)

# Variance matrix of betahat
var.betahat = mse*solve(t(X)%*%X)
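As an aside, the same estimate can be computed with R's crossprod function, which evaluates t(X) %*% X directly, combined with solve(A, b), which solves the linear system A %*% x = b without explicitly forming the inverse. This is an equivalent alternative that is generally preferred numerically, not a different estimator:

# crossprod(X) = t(X) %*% X and crossprod(X, y) = t(X) %*% y
betahat2 = solve(crossprod(X), crossprod(X, y))
max(abs(betahat2 - betahat)) # Should be essentially zero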
Now let's check our calculations using the lm function:
mod = lm(wage~age*sex)
summary(mod)

# See what we can get out of mod and its summary:
attributes(mod)
attributes(summary(mod))

# Fitted coefficients
betahat
mod$coef

# Fitted values (the first six)
head(yhat)
head(mod$fitted)

# Residuals (the first six)
head(res)
head(mod$resid)

# Residual degrees of freedom
n-4
mod$df.residual

# Estimate of sigma
sqrt(mse)
summary(mod)$sigma

# Standard errors of the coefficients
sqrt(diag(var.betahat))
summary(mod)$coef[,2]
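We can also compare the entire variance matrix against the one R computes; vcov returns the estimated variance-covariance matrix of the fitted coefficients:

# Variance matrix of the coefficients
var.betahat
vcov(mod) # Should match our mse * (X'X)^{-1} calculation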

Inference in Linear Models

Now that we have our fitted model, we can perform inference on the coefficients. For example, we can do a t-test for H0: b1 = 0 versus Ha: b1 != 0:
summary(mod)$coef[2,]

# Estimate
b1 = betahat[2]
b1

# Standard error
se = sqrt(diag(var.betahat))[2]
se

# Test statistic (named tstat to avoid masking R's transpose function t)
tstat = b1/se
tstat

# p-value
2*pt(abs(tstat),n-4,lower.tail=FALSE)
The p-value is greater than any reasonable significance level, but that does not mean that age is not useful in predicting wage. Why?

Let's try an F-test comparing the full model above to the reduced model

E(wage) = b0 + b2*sex

Without using an R function, we can do this with our matrix calculations:
# SSE from full model (calculated previously)
sse

# df from full model (calculated previously)
df = n-4

# SSE from reduced model
X.r = cbind(x0,x2) # Reduced model design matrix
betahat.r = solve(t(X.r)%*%X.r)%*%t(X.r)%*%y # Reduced model coefficient estimates
res.r = y - X.r%*%betahat.r # Reduced model residuals
sse.r = sum(res.r^2) # Reduced model SSE

# df from reduced model
df.r = n-2

# Test statistic comparing the two models
# (named Fstat to avoid masking F, R's shorthand for FALSE)
Fstat = ((sse.r - sse)/(df.r - df))/(sse/df)
Fstat
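To finish the hand calculation, the p-value comes from an F distribution with df.r - df numerator and df denominator degrees of freedom; pf gives the F-distribution CDF:

# p-value for the F-test
pf(Fstat, df.r - df, df, lower.tail=FALSE)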
The R function anova will perform F-tests for comparing two nested models:
mod.r = lm(wage~sex)
anova(mod.r,mod)
The small p-value indicates that the full model has more predictive power than the reduced model; i.e., adding age and the interaction between age and sex to the reduced model improves its wage predictions.

