Linear Models and Matrix Algebra in R
In this discussion, we will explore how to manipulate matrices in R.
A matrix is just an array of numbers. Every matrix has a dimension.
We say the matrix A is an m x n ("m by n")
matrix if it has m rows and n columns. A column
vector has dimension m x 1. A row vector has dimension 1 x n.
In general, whenever we define a vector y, assume it is a
column vector unless told otherwise.
Matrix operations differ from scalar operations. For a quick review
of matrix algebra, in particular matrix addition/subtraction,
multiplication, transposition, and inverses, consult your linear
algebra notes or textbook.
Matrix Operations in R
The following R code demonstrates basic matrix operations in R.
First, let's define some matrices.
A = matrix(1,3,2) # Creates a 3x2 matrix of ones
B = matrix(rbinom(9,20,0.5),3,3) # Generates a 3x3 matrix of random Binomial variables with n=20, p=0.5
C = matrix(seq(0,10,length=6),2,3) # Creates a 2x3 matrix of a sequence between 0 and 10 of length 6
y = matrix(c(7,5,3),3,1) # Creates a 3x1 column vector
Next, let's perform some matrix operations.
t(A) # Transpose of A
t(A) + C # Matrix addition
t(A) - C # Matrix subtraction
C %*% A # Matrix multiplication (2x3 times 3x2 gives 2x2)
B %*% y # Matrix-vector multiplication (3x3 times 3x1 gives 3x1)
solve(B) # Matrix inverse (throws an error if B happens to be singular)
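As a sanity check, we can verify that solve() really returns the inverse: multiplying a matrix by its inverse should give the identity. A minimal sketch, using a small fixed matrix rather than the random B above (a random integer matrix can occasionally be singular, in which case solve() fails):

```r
B = matrix(c(2,1,0, 1,3,1, 0,1,2), 3, 3)  # A fixed, invertible 3x3 matrix
Binv = solve(B)                           # Its inverse
round(B %*% Binv, 10)                     # Numerically the 3x3 identity
all.equal(B %*% Binv, diag(3))            # TRUE up to floating-point tolerance
```

The `round()` call hides tiny floating-point noise; `all.equal()` does the same comparison with a built-in tolerance.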
Using Matrix Algebra to Fit Linear Models
We are going to use the 1985 Current Population Survey (CPS) data.
These data consist of a random sample of 534 persons from the CPS,
with information on wages and other characteristics of the workers,
including sex, number of years of education, years of work
experience, occupational status, region of residence and union
membership. The variables are as follows:
Variable   Description
--------   -----------
educ       Number of years of education
south      Indicator variable for living in a southern region: S = lives in south, NS = does not live in south
sex        Gender: M = male, F = female
exper      Number of years of work experience (inferred from age and education)
union      Indicator variable for union membership: Union or Not
wage       Wage (dollars per hour)
age        Age (years)
race       Race: W = white, NW = not white
sector     Sector of the economy: clerical, const (construction), management, manufacturing, professional, sales, service, other
married    Marital status: Married or Single
Sources: The data are from Statistical Modeling: A Fresh Approach
(2010) by Danny Kaplan. The original source is The Practice of
Econometrics (1991) by E.R. Berndt.
We can import the data directly from the internet using the
following command:
cps = read.csv("http://www.macalester.edu/~kaplan/ism/datasets/cps.csv")
Now attach the data to our R session so we can refer to the
variables directly by name:
attach(cps)
Data Exploration
Let's look at the data for a minute:
names(cps) # variable names
dim(cps) # dimension of the data set (no. of rows, no. of cols)
class(cps) # R views the object "cps" as a data.frame class
plot(wage ~ age) # scatterplot of wage vs. age
plot(wage ~ sex) # side-by-side boxplots of wage vs. sex
Variable Types in R
R will automatically choose a plot appropriate for the type of
variable. Since wage and age are both numeric, it plots a
scatterplot. Since sex is a factor (categorical), it plots
side-by-side boxplots:
is.numeric(wage)
is.numeric(age)
is.factor(sex)
If a variable is not of the desired class, we can coerce the
variable into the desired class:
new.variable = rbinom(20,1,0.25) # Creates a random vector of zeroes and ones
new.variable # Prints vector
is.numeric(new.variable)
factor(new.variable, levels=c(0,1), labels=c("Male","Female"))
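One common pitfall when coercing in the other direction: calling as.numeric() on a factor returns the internal level codes, not the original values. A small sketch of the gotcha (hypothetical vector, not one of the CPS variables):

```r
f = factor(c(10, 20, 20, 10))   # A factor whose labels look numeric
as.numeric(f)                   # Level codes: 1 2 2 1 -- not 10 20 20 10!
as.numeric(as.character(f))     # Correct: convert via the labels first
```

Going through as.character() recovers the printed labels before converting to numbers.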
Note that a "vector" in R is different from a "matrix" in R:
is.vector(new.variable)
matrix(new.variable) # Coerces vector into a column vector
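A quick way to see the difference: a plain vector has no dim attribute, while a matrix does. A minimal sketch:

```r
v = c(7, 5, 3)   # A plain vector
dim(v)           # NULL: no dimension attribute
m = matrix(v)    # Coerced to a 3x1 column vector (a matrix)
dim(m)           # 3 1
t(m)             # A 1x3 row vector
```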
Fitting Linear Models
First consider the simple linear regression model:
E(wage) = b0 + b1*age
Recall that the normal equations derived from the method of
least squares, expressed in matrix form, are:
(X'X)b = X'y
where X is the design matrix, y is the vector of
responses, and b is the vector of regression parameters. The
solution (assuming the inverse exists) for b is
b = (X'X)^(-1)X'y
Create the observed response vector:
y = matrix(wage)
Then, the design matrix X (be sure this code makes sense!):
x0 = matrix(rep(1,length(wage))) # First column of X, all ones (corresponds to intercept)
x1 = matrix(age) # Second column of X (corresponds to slope)
X = cbind(x0,x1) # "cbind" = column bind
X
We can now calculate the fitted regression coefficients using matrix
operations:
# Fitted coefficients
betahat = solve(t(X)%*%X)%*%t(X)%*%y
betahat
# Fitted values
yhat = X%*%betahat
head(yhat)
# Vector of residuals
res = y - yhat
head(res)
# Sum of squared errors (residuals)
sse = sum(res^2)
sse
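The same matrix algebra also gives the usual standard errors: the estimated variance-covariance matrix of b is s^2 (X'X)^(-1), where s^2 = SSE/(n - p) and p is the number of columns of X. A self-contained sketch using simulated data (a hypothetical x and y, not the CPS variables), which you can check against summary(lm(y ~ x)):

```r
set.seed(42)                                # Assumed seed, for reproducibility
n = 100
x = runif(n, 20, 60)                        # Hypothetical predictor
y = 2 + 0.3*x + rnorm(n, sd = 4)            # Hypothetical response
X = cbind(rep(1, n), x)                     # Design matrix: intercept and slope columns

betahat = solve(t(X) %*% X) %*% t(X) %*% y  # Fitted coefficients
res = y - X %*% betahat                     # Residuals
s2 = sum(res^2) / (n - ncol(X))             # Estimate of the error variance
se = sqrt(diag(s2 * solve(t(X) %*% X)))     # Standard errors of the coefficients

cbind(betahat, se)                          # Compare with summary(lm(y ~ x))
```

The diagonal of s^2 (X'X)^(-1) holds the coefficient variances; their square roots are the standard errors reported by summary().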
Now let's check our calculations using the lm function:
mod = lm(wage ~ 1+age)
summary(mod)
# See what we can get out of mod and its summary:
attributes(mod)
attributes(summary(mod))
# Fitted coefficients
betahat
mod$coef
# Fitted values (the first six)
head(yhat)
head(mod$fitted)
# Residuals (the first six)
head(res)
head(mod$resid)
What does our model look like?
plot(wage ~ age, xlab="Age (years)", ylab="Wage ($/hr)", main = "Wage vs. Age")
abline(mod$coef[1], mod$coef[2]) # Adds line with intercept mod$coef[1] and slope mod$coef[2]
# Could also have just typed: abline(mod)
Now let's consider some more complicated linear models. How about
fitting a quadratic function in age?
age2 = age^2
mod2 = lm(wage ~ 1+age+age2)
mod2
# Add fitted quadratic curve to previous plot:
curve(mod2$coef[1] + mod2$coef[2]*x + mod2$coef[3]*x^2, add=T, col="red", lty=2)
Let's try fitting a separate regression line to each sex. This model
has a regression equation:
E(wage) = b0 + b1*age + b2*sex + b3*age*sex
where "sex" equals 1 if Male, and 0 if Female. Thus, the
regression line for females is:
E(wage) = b0 + b1*age + b2*(0) + b3*age*(0) = b0 + b1*age
while the regression line for males is:
E(wage) = b0 + b1*age + b2*(1) + b3*age*(1) = (b0+b2) + (b1+b3)*age
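To connect this model back to the matrix formulation, we can build its design matrix by hand: a column of ones, age, a 0/1 dummy for sex, and their product. A sketch with simulated data (hypothetical values, not the CPS file); the normal-equations solution should match the lm fit:

```r
set.seed(7)                                   # Assumed seed, for reproducibility
n = 50
age = sample(20:60, n, replace = TRUE)        # Hypothetical ages
sex = factor(sample(c("F", "M"), n, replace = TRUE))
male = as.numeric(sex == "M")                 # Dummy variable: 1 = Male, 0 = Female
wage = 5 + 0.2*age + 2*male + 0.1*age*male + rnorm(n)

X = cbind(1, age, male, age*male)             # Design matrix for the interaction model
b = solve(t(X) %*% X) %*% t(X) %*% wage       # Normal-equations solution
cbind(b, coef(lm(wage ~ age*sex)))            # The two columns should agree
```

Note that lm() constructs exactly this dummy coding internally: with levels F and M, the sexM column equals 1 for males, so the coefficient order matches our hand-built X.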
Let's see what this model looks like:
mod3 = lm(wage ~ 1+age+sex+age:sex)
# Shortcut formula: wage ~ age*sex
mod3$coef
plot(wage ~ age, xlab="Age (years)", ylab="Wage ($/hr)", main = "Wage vs. Age")
# Add least-squares regression line for Females:
abline(mod3$coef[1], mod3$coef[2], col="blue")
# Add least-squares regression line for Males:
abline(mod3$coef[1]+mod3$coef[3], mod3$coef[2]+mod3$coef[4], col="hotpink", lty=2)
# Add legend:
legend(50,40,c("Females","Males"),col=c("blue","hotpink"),lty=c(1,2))
# To see all the possible color names in R:
colors()
What do the fitted regression lines imply about the relationship
between wage, age, and sex?