Discussion 7: Data Visualization in R


We're going to practice plotting in R using the 1985 Current Population Survey (CPS) data set. Use the following command to read the data into R and attach the data set.
cps = read.csv("http://www.macalester.edu/~kaplan/ism/datasets/cps.csv")
attach(cps)
These data consist of a random sample of 534 persons from the CPS, with information on wages and other characteristics of the workers, including sex, number of years of education, years of work experience, occupational status, region of residence and union membership. The variables are as follows:

Variable
Description
educ
Number of years of education
south
Indicator variable for living in a southern region:
S = lives in south, NS = does not live in south
sex
Gender: M = male, F = female
exper
Number of years of work experience (inferred from age and education)
union
Indicator variable for union membership: Union or Not
wage
Wage (dollars per hour)
age
Age (years)
race
Race: W = white, NW = not white
sector
Sector of the economy: clerical, const (construction), management,
manufacturing, professional, sales, service, other

married
Marital status: Married or Single
Sources: The data are from Statistical Modeling: A Fresh Approach (2010) by Danny Kaplan. The original source is The Practice of Econometrics (1991) by E.R. Berndt.

Before we create plots, here are some useful commands to explore a data set:
head(cps)	# Prints the first few lines of a data set.
tail(cps) # Prints the last few lines of a data set.
names(cps) # Variable names.
dim(cps) # Size of the data set (no. rows, no. columns)
is.data.frame(cps) # You want this to say TRUE


If you have a plot window open, R by default will write over that plot when you ask it for another plot. If you'd like to open another plot window, type the following command:

Mac:
quartz()

PC:
x11()

Univariate Plots

Histograms

Let's take a look at the wage distribution using a histogram:
hist(wage)
Notice that the labels on the vertical axis are counts. To look at the "density" histogram, use
hist(wage,freq=F)
When you use the freq=F argument in hist(), you are asking for the density histogram, which has total area 1. Area is proportional to relative frequency. For example, in the interval from 5 to 10, the frequency histogram shows a count of 243. The height of that interval in the density histogram is about .091, and the width is 5. Thus the area for that interval is about .091*5 = 0.455 (45.5% of the sample), and 0.455*534 is about 243.

It is worth looking at the same data with different break points for the intervals. The seq() function produces a regularly spaced sequence with a given starting value, ending value, and interval size (specified in that order, e.g., seq(3,12,1) is the same as 3:12). Try the following commands:
par(mfrow=c(1,2))	# Tells R to plot 1 row, 2 columns in plot window.
hist(wage,breaks=seq(0,45,3),freq=F)
hist(wage,breaks=20,freq=F)
What did the option  breaks=20 tell R to do? Try it with breaks=5. Now try breaks=100. Which plot looks best?

In every plot function, there are various arguments that add to the figure, such as adding titles, axes labels, or color:
hist(wage,breaks=20,col="seagreen",main="Wage Distribution",xlab="$/hr")
Would you say the distribution of wages is symmetric? If not, which direction is the skewness?

Density Plots

R has several functions for generating smoothed histograms, or density plots. We will focus on the "density" function.

Basic usage is straightforward:
plot(density(wage))
The density function produces a data structure with several components - the first ($x) is the set of points on the x-axis where the density was evaluated; the second ($y) is the height of the density curve at those points.
d <- density(wage)
names(d)
The main issue with density estimates is how much smoothing to do. There is a "bandwidth" parameter (bw) in the density function that determines this.

If the bandwidth parameter is too small, you get a punk hair style:
plot(density(wage,bw=.5))
If the width parameter is too large, you get a nice smooth curve that looks the same no matter what your data were!
plot(density(wage,bw=100))
Try a few different values.

It is often useful to combine a density plot and some other plot such as a histogram or dotplot. The "lines" function adds a lineplot to an existing plot:
hist(wage,freq=F)
lines(density(wage),lty=2) # lty = "line type"
Note that the histogram must be on the density scale to match the density plot. Our density plot is a little higher than our plot boundaries. Let's fix this:
hist(wage,freq=F,ylim=c(0,.11))
lines(density(wage),lty=2)
We can compare two variables by using overlaid density plots:
plot(density(wage[sex=="F"]), col="darkgreen", lty=1, main="Distribution of Wages for Each Gender")
lines(density(wage[sex=="M"]), col="maroon", lty=2)
and with a bit of finesse, we can add a legend. Type the following command, hit Enter, and then click on the plot where you would like to place the legend.
legend(locator(1), legend=c("Female","Male"), col=c("darkgreen","maroon"), lty=c(1,2))
To see what the locator function is doing, try using the locator function all by itself in an existing plot:
locator(1)

Box Plots

A boxplot is really not much more than a graphical display of a 5-number summary (min, 1st quartile, median, 3rd quartile, max). The body of the box represents the location of the quartiles, with a line added at the median. The "whiskers", or lines extending out from the box, display the distance to the furthest observations which are no more than 1.5 times the inner-quartile range (Q3-Q1) from the quartiles. Outliers are displyed as points or lines beyond the whiskers.
boxplot(wage)

Bar Charts

For categorical data, the most common way to graphically display the sample is a barchart. We could use the "barplot" function in the base library, but the "lattice" library has a much nicer looking one through its function "barchart". Let's look at the sector variable:
library(lattice)
barchart(sector,horizontal=F)
The "lattice" library also has several nice functions for box plots, density plots, histograms, and scatterplots:
help(lattice)
For example,
densityplot(wage)
bwplot(wage)
histogram(wage)
xyplot(wage~age)


Bivariate/Multivariate Plots

For the most part, we are interested not in just one variable, but the relationship between two or more variables. Depending on if the variables are quantitative or categorical, this can be done in a variety of ways. The "plot" function is a generic plotting function. Its output depends on its input.

We can feed it one quantitative variable:
plot(wage)    # Plots wage vs. observation number
or one categorical variable:
plot(married) # Produces a bar graph.
or two quantitative variables (scatterplot):
plot(age,wage)    # Order of arguments are x-axis variable, y-axis variable. 
# (Could instead use argument "wage~age".)
or one quantitative response and one categorical explanatory variable (side-by-side boxplots):
plot(wage~sex)
or an entire data set:
plot(cps)

You can also feed the "plot" function other R objects. For instance, if you input a fitted linear model, the "plot" function will produce four diagnostic plots:
par(mfrow=c(2,2))  # Set up a plot window with 2 rows, 2 columns of plots.
mod = lm(wage~age+married)
plot(mod)

To create side-by-side boxplots directly, we can also use the "boxplot" function:
boxplot(wage~sector,ylab="Hourly Wage")  # You may need to resize the window to see all categories
# of sector on the x-axis.
boxplot(wage~married,ylab="Hourly Wage")
This command is useful if the categorical variable is coded in zeros and ones, and read in by R as a numerical variable. In that case, the "plot" function will plot actual points, whereas the "boxplot" function will produce boxplots. For example,
married.quant = as.numeric(married=="Married") # Change married into a quantitative binary variable.
head(married.quant)
plot(wage~married.quant)
boxplot(wage~married.quant)

What if we want to compare more than two variables? If they are all quantitative, we would need a 3-D scatterplot. However, if there are two quantitative variables and one categorical variable, we can use a scatterplot of the two quantitative variables with plot symbols denoting the levels of the categorical variable. Try this with wage, exper, and sex:
plot(wage~exper,type="n",main="Wage vs. Experience by Gender",
xlab="Experience (Years)",ylab="Wage ($/hr)")
# type="n" tells R just to set up the plot window
# and don't plot the points (think of "n" standing for "nothing")
points(exper[sex=="F"],wage[sex=="F"],pch=15,col="hotpink") # Plot females
points(exper[sex=="M"],wage[sex=="M"],pch=18,col="rosybrown") # Plot males
legend(locator(1),c("Female","Male"),pch=c(15,18),col=c("hotpink","rosybrown")) # Add legend
abline(lm(wage[sex=="F"]~age[sex=="F"]),col="hotpink",lty=2) # Add least squares regression lines
abline(lm(wage[sex=="M"]~age[sex=="M"]),col="rosybrown",lty=3)

The xyplot in the lattice package will plot each sex separately:
xyplot(wage~exper|sex)

Quantile-Quantile Plots

We saw earlier that the wage variable was skewed right (positive). Comparing the sample quantiles for wage against theoretical normal quantiles would also demonstrate this characteristic:
qqnorm(wage)  # Creates Normal Q-Q Plot
qqline(wage) # Adds line to plot
Does the shape in the normal quantile-quantile plot make sense?

You can also produce quantile-quantile plots for two quantitative samples:
qqplot(wage[sex=="F"], wage[sex=="M"])
abline(0,1)
For more details on quantile-quantile plots (and other plots), see Albyn Jones' "Graphical Methods" page (Reed College): http://people.reed.edu/~jones/141/dist1.html.



Summary of R Plotting Commands

Arguments when using the plot function:

type= : Denotes the type of plot (e.g., "p" for points, "l" for lines, "n" for no plotting)
main= : Main title (e.g., "Histogram of Wage")
sub= : Sub-title (e.g., "Fitted regression line: y=2+3x")
xlab= : x-axis label (e.g., "Wage")
ylab= : y-axis label (e.g., "Frequency")
xlim= : x-axis limits (e.g., c(0,50))
ylim= : y-axis limits (e.g., c(0,250))

pch: point character (given as a number)
lty: line type (given as a number)
col: color (given as a color name in quotes, or RGB components)

Functions to add features to a plot:

Each of the following R functions will add to an existing plot. All of them can also take any of the arguments from the plot function listed above.
lines(x, y, ...)  : Adds connecting lines at (x,y) points.
points(x, y, ...) : Adds points with (x,y) coordinates.

abline(a, b) : Adds a line with intercept a and slope b.
curve(expr, from, to, add=TRUE) : Adds a curve with expression expr (in terms of x).

legend(x, y, legend, ...) : Adds a legend where the top left corner of the legend
is located at point (x,y).
or
legend(locator(1), legend, ...) : Adds a legend once you click on the plot where
the top left corner should be located.

For further details and more arguments to the plot function, look at the following help files:
?plot
?par



Return to Stat 111/202 Course Webpage