Linear Models and Matrix Algebra in R


In this discussion, we will explore how to manipulate matrices in R. A matrix is a rectangular array of numbers. Every matrix has a dimension: we say the matrix A is an m x n ("m by n") matrix if it has m rows and n columns. A column vector has dimension m x 1, and a row vector has dimension 1 x n. In general, whenever we define a vector y, assume it is a column vector unless told otherwise.

Matrix operations differ from scalar operations. Before continuing, make sure you are comfortable with the basics of matrix algebra, in particular matrix addition/subtraction, multiplication, transposition, and inverses.

Matrix Operations in R

The following R code demonstrates basic matrix operations in R. First, let's define some matrices.
A = matrix(1,3,2)                  # Creates a 3x2 matrix of ones
B = matrix(rbinom(9,20,0.5),3,3)   # Creates a 3x3 matrix of Binomial(n=20, p=0.5) random variables
C = matrix(seq(0,10,length=6),2,3) # Creates a 2x3 matrix from a sequence of length 6 between 0 and 10
y = matrix(c(7,5,3),3,1)           # Creates a 3x1 column vector
Next, let's perform some matrix operations.
t(A)       # Transpose of A
t(A) + C # Matrix addition
t(A) - C # Matrix subtraction
C %*% A # Matrix multiplication
B %*% y
solve(B) # Matrix inverse
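As a quick sanity check on these operations (using a small toy matrix of our own, not one of the matrices above), a matrix multiplied by its inverse should return the identity matrix, up to rounding error:

```r
# Toy 2x2 matrix with a nonzero determinant (values chosen for illustration)
M = matrix(c(2,1,1,3), 2, 2)
Minv = solve(M)          # its inverse
round(M %*% Minv, 10)    # should print the 2x2 identity matrix
dim(t(M))                # the transpose of an m x n matrix is n x m
```

This kind of check is a handy habit whenever you compute an inverse by hand or in code.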

Using Matrix Algebra to Fit Linear Models

We are going to use the 1985 Current Population Survey (CPS) data. These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. The variables are as follows:

Variable   Description
educ       Number of years of education
south      Indicator variable for living in a southern region: S = lives in south, NS = does not live in south
sex        Gender: M = male, F = female
exper      Number of years of work experience (inferred from age and education)
union      Indicator variable for union membership: Union or Not
wage       Wage (dollars per hour)
age        Age (years)
race       Race: W = white, NW = not white
sector     Sector of the economy: clerical, const (construction), management, manufacturing, professional, sales, service, other
married    Marital status: Married or Single
Sources: The data are from Statistical Modeling: A Fresh Approach (2010) by Danny Kaplan. The original source is The Practice of Econometrics (1991) by E.R. Berndt.

We can import the data directly from the internet using the following command:
cps = read.csv("http://www.macalester.edu/~kaplan/ism/datasets/cps.csv")
Now attach the data to our R session so we can refer to the variables directly by name:
attach(cps)

Data Exploration

Let's look at the data for a minute:
names(cps)        # variable names
dim(cps) # dimension of the data set (no. of rows, no. of cols)
class(cps) # R views the object "cps" as a data.frame class

plot(wage ~ age) # scatterplot of wage vs. age
plot(wage ~ sex) # side-by-side boxplots of wage vs. sex

Variable Types in R

R will automatically choose a plot appropriate for the type of variable. Since wage and age are both numeric, it plots a scatterplot. Since sex is a factor (categorical), it plots side-by-side boxplots:
is.numeric(wage)
is.numeric(age)
is.factor(sex)

If a variable is not of the desired class, we can coerce the variable into the desired class:
new.variable = rbinom(20,1,0.25)   # Creates a random vector of zeroes and ones
new.variable # Prints vector
is.numeric(new.variable)
factor(new.variable, levels=c(0,1), labels=c("Male","Female"))
Note that a "vector" in R is different from a "matrix" in R:
is.vector(new.variable)
matrix(new.variable) # Coerces vector into a column vector
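One concrete way to see the difference (a short illustration with toy values, not the CPS data) is that a plain vector has no dim attribute, while a matrix always does:

```r
v = c(5, 10, 15)
dim(v)           # NULL -- a plain vector has no dimension attribute
m = matrix(v)    # coerces v into a 3x1 column vector (a matrix)
dim(m)           # 3 1
is.matrix(m)     # TRUE
```

This matters below: the matrix formulas for regression require genuine matrices so that %*% and t() behave as expected.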

Fitting Linear Models

First consider the simple linear regression model:

E(wage) = b0 + b1*age

Recall that the normal equations derived from the method of least squares, expressed in matrix form, are:

(X'X)b = X'y

where X is the design matrix, y is the vector of responses, and b is the vector of regression parameters. The solution (assuming the inverse exists) for b is

b = (X'X)^(-1)X'y

Create the observed response vector:
y = matrix(wage)
Then, the design matrix X (be sure this code makes sense!):
n = nrow(cps)          # Number of observations (534)
x0 = matrix(rep(1,n))  # First column of X (corresponds to intercept)
x1 = matrix(age)       # Second column of X (corresponds to slope)

X = cbind(x0,x1) # "cbind" = column bind
X
We can now calculate the fitted regression coefficients using matrix operations:
# Fitted coefficients
betahat = solve(t(X)%*%X)%*%t(X)%*%y
betahat

# Fitted values
yhat = X%*%betahat
head(yhat)

# Vector of residuals
res = y - yhat
head(res)

# Sum of squared errors (residuals)
sse = sum(res^2)
sse
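The same matrix quantities also give the usual standard errors, since the estimated covariance matrix of betahat is s^2 (X'X)^(-1), where s^2 = SSE/(n - p). Here is a sketch on a tiny toy data set of our own (the values and object names are ours, not the handout's), so you can verify it against lm directly:

```r
# Toy data: 4 observations, one predictor plus an intercept
y  = c(3, 5, 4, 8)
x1 = c(1, 2, 3, 4)
X  = cbind(1, x1)                        # design matrix with intercept column

betahat = solve(t(X) %*% X) %*% t(X) %*% y
res = y - X %*% betahat
sse = sum(res^2)

n = nrow(X); p = ncol(X)
s2 = sse / (n - p)                       # estimate of the error variance
se = sqrt(diag(s2 * solve(t(X) %*% X))) # standard errors of the coefficients
se                                       # compare with summary(lm(y ~ x1))
```

The values in se should match the "Std. Error" column produced by summary(lm(y ~ x1)).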
Now let's check our calculations against R's built-in lm function.
mod = lm(wage ~ 1+age)
summary(mod)

# See what we can get out of mod and its summary:
attributes(mod)
attributes(summary(mod))

# Fitted coefficients
betahat
mod$coef

# Fitted values (the first six)
head(yhat)
head(mod$fitted)

# Residuals (the first six)
head(res)
head(mod$resid)

What does our model look like?
plot(wage ~ age, xlab="Age (years)", ylab="Wage ($/hr)", main = "Wage vs. Age")
abline(mod$coef[1], mod$coef[2]) # Adds line with intercept mod$coef[1] and slope mod$coef[2]
# Could also have just typed: abline(mod)
Now let's consider some more complicated linear models. How about fitting a quadratic function in age?
age2 = age^2
mod2 = lm(wage ~ 1+age+age2)
mod2
# Add fitted quadratic curve to previous plot:
curve(mod2$coef[1] + mod2$coef[2]*x + mod2$coef[3]*x^2, add=T, col="red", lty=2)
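An alternative that avoids writing out the polynomial by hand is to fit the quadratic with I() inside the formula and draw the curve with predict() on a grid. This is a sketch with toy stand-in data (the object names age.toy, wage.toy, and grid are ours) so the block runs on its own:

```r
age.toy  = c(20, 25, 30, 40, 50, 60)
wage.toy = c(6, 9, 11, 13, 12, 10)

plot(wage.toy ~ age.toy)
fit = lm(wage.toy ~ age.toy + I(age.toy^2))   # I() lets us square inside a formula
grid = data.frame(age.toy = seq(20, 60, length = 100))
lines(grid$age.toy, predict(fit, newdata = grid), col = "red", lty = 2)
```

With the CPS data you would use wage and age in place of the toy vectors; the I() form also means predict() handles the squared term for you.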
Let's try fitting a separate regression line to each sex. This model has a regression equation:

E(wage) = b0 + b1*age + b2*sex + b3*age*sex

where "sex" equals 1 if Male, and 0 if Female. Thus, the regression line for females is:

E(wage) = b0 + b1*age + b2*(0) + b3*age*(0) = b0 + b1*age

while the regression line for males is:

E(wage) = b0 + b1*age + b2*(1) + b3*age*(1) = (b0+b2) + (b1+b3)*age

Let's see what this model looks like:
mod3 = lm(wage ~ 1+age+sex+age:sex)
# Shortcut formula: wage ~ age*sex
mod3$coef

plot(wage ~ age, xlab="Age (years)", ylab="Wage ($/hr)", main = "Wage vs. Age")

# Add least-squares regression line for Females:
abline(mod3$coef[1], mod3$coef[2], col="blue")
# Add least-squares regression line for Males:
abline(mod3$coef[1]+mod3$coef[3], mod3$coef[2]+mod3$coef[4], col="hotpink", lty=2)
# Add legend:
legend(50,40,c("Females","Males"),col=c("blue","hotpink"),lty=c(1,2))

# To see all the possible color names in R:
colors()
What do the fitted regression lines imply about the relationship between wage, age, and sex?



Stats 120C/Math 131C Webpage