Discussion 7: Data
Visualization in R
|
cps = read.csv("http://www.macalester.edu/~kaplan/ism/datasets/cps.csv")
attach(cps)
These data consist of a random sample of 534 persons from the CPS,
with information on wages and other characteristics of the workers,
including sex, number of years of education, years of work
experience, occupational status, region of residence and union
membership. The variables are as follows:| Variable |
Description |
| educ |
Number of years of education |
| south |
Indicator variable for living
in a southern region: S = lives in south, NS = does not live in south |
| sex |
Gender: M = male, F = female |
| exper |
Number of years of work
experience (inferred from age and education) |
| union |
Indicator variable for union
membership: Union or Not |
| wage |
Wage (dollars per hour) |
| age |
Age (years) |
| race |
Race: W = white, NW = not
white |
| sector |
Sector of the
economy: clerical, const (construction), management, manufacturing, professional, sales, service, other |
| married |
Marital status: Married or
Single |
head(cps) # Prints the first few lines of a data set.
tail(cps) # Prints the last few lines of a data set.
names(cps) # Variable names.
dim(cps) # Size of the data set (no. rows, no. columns)
is.data.frame(cps) # You want this to say TRUE
| If you have a
plot window open, R by default will write over that plot
when you ask it for another plot. If you'd like to open
another plot window, type the following command: Mac: quartz() PC: x11() |
hist(wage)Notice that the labels on the vertical axis are counts. To look at the "density" histogram, use
hist(wage,freq=F)When you use the freq=F argument in hist(), you are asking for the density histogram, which has total area 1. Area is proportional to relative frequency. For example, in the interval from 5 to 10, the frequency histogram shows a count of 243. The height of that interval in the density histogram is about .091, and the width is 5. Thus the area for that interval is about .091*5 = 0.455 (45.5% of the sample), and 0.455*534 is about 243.
par(mfrow=c(1,2)) # Tells R to plot 1 row, 2 columns in plot window.What did the option breaks=20 tell R to do? Try it with breaks=5. Now try breaks=100. Which plot looks best?
hist(wage,breaks=seq(0,45,3),freq=F)
hist(wage,breaks=20,freq=F)
hist(wage,breaks=20,col="seagreen",main="Wage Distribution",xlab="$/hr")Would you say the distribution of wages is symmetric? If not, which direction is the skewness?
plot(density(wage))The density function produces a data structure with several components - the first ($x) is the set of points on the x-axis where the density was evaluated; the second ($y) is the height of the density curve at those points.
d <- density(wage)The main issue with density estimates is how much smoothing to do. There is a "bandwidth" parameter (bw) in the density function that determines this.
names(d)
plot(density(wage,bw=.5))If the width parameter is too large, you get a nice smooth curve that looks the same no matter what your data were!
plot(density(wage,bw=100))Try a few different values.
hist(wage,freq=F)Note that the histogram must be on the density scale to match the density plot. Our density plot is a little higher than our plot boundaries. Let's fix this:
lines(density(wage),lty=2) # lty = "line type"
hist(wage,freq=F,ylim=c(0,.11))We can compare two variables by using overlaid density plots:
lines(density(wage),lty=2)
plot(density(wage[sex=="F"]), col="darkgreen", lty=1, main="Distribution of Wages for Each Gender")and with a bit of finesse, we can add a legend. Type the following command, hit Enter, and then click on the plot where you would like to place the legend.
lines(density(wage[sex=="M"]), col="maroon", lty=2)
legend(locator(1), legend=c("Female","Male"), col=c("darkgreen","maroon"), lty=c(1,2))
To see what the locator function is doing, try using the locator
function all by itself in an existing plot:locator(1)
boxplot(wage)
library(lattice)The "lattice" library also has several nice functions for box plots, density plots, histograms, and scatterplots:
barchart(sector,horizontal=F)
help(lattice)For example,
densityplot(wage)
bwplot(wage)
histogram(wage)
xyplot(wage~age)
plot(wage) # Plots wage vs. observation numberor one categorical variable:
or two quantitative variables (scatterplot):plot(married) # Produces a bar graph.
plot(age,wage) # Order of arguments are x-axis variable, y-axis variable.or one quantitative response and one categorical explanatory variable (side-by-side boxplots):
# (Could instead use argument "wage~age".)
plot(wage~sex)or an entire data set:
plot(cps)You can also feed the "plot" function other R objects. For instance, if you input a fitted linear model, the "plot" function will produce four diagnostic plots:
par(mfrow=c(2,2)) # Set up a plot window with 2 rows, 2 columns of plots.
mod = lm(wage~age+married)
plot(mod)
This command is useful if the categorical variable is coded in zeros and ones, and read in by R as a numerical variable. In that case, the "plot" function will plot actual points, whereas the "boxplot" function will produce boxplots. For example,boxplot(wage~sector,ylab="Hourly Wage") # You may need to resize the window to see all categories
# of sector on the x-axis.
boxplot(wage~married,ylab="Hourly Wage")
married.quant = as.numeric(married=="Married") # Change married into a quantitative binary variable.
head(married.quant)
plot(wage~married.quant)
boxplot(wage~married.quant)
plot(wage~exper,type="n",main="Wage vs. Experience by Gender",
xlab="Experience (Years)",ylab="Wage ($/hr)")
# type="n" tells R just to set up the plot window
# and don't plot the points (think of "n" standing for "nothing")
points(exper[sex=="F"],wage[sex=="F"],pch=15,col="hotpink") # Plot females
points(exper[sex=="M"],wage[sex=="M"],pch=18,col="rosybrown") # Plot males
legend(locator(1),c("Female","Male"),pch=c(15,18),col=c("hotpink","rosybrown")) # Add legend
abline(lm(wage[sex=="F"]~age[sex=="F"]),col="hotpink",lty=2) # Add least squares regression lines
abline(lm(wage[sex=="M"]~age[sex=="M"]),col="rosybrown",lty=3)
xyplot(wage~exper|sex)
qqnorm(wage) # Creates Normal Q-Q PlotDoes the shape in the normal quantile-quantile plot make sense?
qqline(wage) # Adds line to plot
For more details on quantile-quantile plots (and other plots), see Albyn Jones' "Graphical Methods" page (Reed College): http://people.reed.edu/~jones/141/dist1.html.qqplot(wage[sex=="F"], wage[sex=="M"])
abline(0,1)
type= : Denotes the type of plot (e.g., "p" for points, "l" for lines, "n" for no plotting)
main= : Main title (e.g., "Histogram of Wage")
sub= : Sub-title (e.g., "Fitted regression line: y=2+3x")
xlab= : x-axis label (e.g., "Wage")
ylab= : y-axis label (e.g., "Frequency")
xlim= : x-axis limits (e.g., c(0,50))
ylim= : y-axis limits (e.g., c(0,250))
pch: point character (given as a number)
lty: line type (given as a number)
col: color (given as a color name in quotes, or RGB components)
lines(x, y, ...) : Adds connecting lines at (x,y) points.
points(x, y, ...) : Adds points with (x,y) coordinates.
abline(a, b) : Adds a line with intercept a and slope b.
curve(expr, from, to, add=TRUE) : Adds a curve with expression expr (in terms of x).
legend(x, y, legend, ...) : Adds a legend where the top left corner of the legend
is located at point (x,y).
or
legend(locator(1), legend, ...) : Adds a legend once you click on the plot where
the top left corner should be located.
?plot
?par