Updated: 2020
            More Examples of Pivot Tables/Charts + Simpson's Paradox


0) Introduction:
1) Analyzing the Car-Sales Database
2) Analyzing the Books Database
3) Simpson's Paradox and Analyzing Databases
4) Summary

--------------------------------------------------------------------------------

In this lecture we continue examining examples of pivot tables used to quickly
analyze/summarize data. Ultimately, we look at more complicated pivot tables,
with multiple selections in rows and the use of page fields for filtering. The
end of this lecture discusses Simpson's Paradox, which examines differences
when looking at parts of the data vs. looking at all of the data.

--------------------------------------------------------------------------------

1) Analyzing The Car-Sales Database

Examine the database on the Cars worksheet for this lecture. This data records
whether or not (Yes or No) over 300 different families (of a size characterized
as either Small or Large and of family income/salary characterized as either
High or Low) decided to purchase a Station Wagon. Imagine that we want to use
this information to determine what characteristics make a family more or less
likely to purchase a station wagon.

Following the steps from the previous lecture, let's create a pivot table with
this data. Drop the "Salary" first and "Family Size" second into the Row
Labels of the PivotTable itself or the PivotTable Fields. Notice these labels
appear in this order: left to right in two columns on the left of the Pivot
table. We can change their order by dragging and dropping these labels in the
table or into he PivotTable Fields or selecting from the pulldown list for
Salaray/Family Size in the rows on the index. Thus, we can put multiple data
-any amount- in rows (and can do so for columns too).

Drop the "Station Wagon?" into the Column Labels AND into the Sigma Values
areas of the of the PivotTable itself or the PivotTable Fields. Because the
data items are non-numeric, Excel defaults to count these items: how many No
and Yes entries appear.

It is difficult to compare these counts directly, so instead we want to report 
these counts as percents. To do so we right-click the data items, and click
"Value Field Settings..." There we click the "Show values as" and in the "Show
values as" box we click the "% of row" entry.

Looking at the pivot table, we see that for high salaried families, if the 
family size is large, 75% did (and 25% didn't) buy a station wagon.  And, for
high salaried families, if the family size is small, 7% did (and 93% didn't)
buy a station wagon. Overall, the High Total row (produced because we have 2
criteria in the rows: Salary first and then Family Size) shows 56% bought
(and 44% didn't buy) a station wagon. 

Looking at the pivot table, we see that for low salaried families, if the 
family size is large, 73% did (and 27% didn't) buy a station wagon.  And, for
low salaried families, if the family size is small, 4% did (and 96% didn't)
buy a station wagon. Overall, the Low Total row (produced because we have 2
criteria in the rows: Salary first and then Family Size) shows 59% bought
(and 41% didn't buy) a station wagon.

So, although the totals are about the same, each near 50%, we can see that
whether a family has a low or high salary, they are about equally likely to
buy or not buy a station wagon (when family size is not taken into account).

Let's analyze the data another way, dragging/dropping the Family Size in front
to the Salary (done as the second pivot table on the worksheet). Notice now
instead of High and Low totals (for Salary) the pivot table shows Large and
Small totals (for Family Size). The Large Total row shows 74% bought (and 26%
didn't buy) a station wagon. Similarly, the Small Total row shows 6% bought
(and 94% didn't buy) a station wagon.

So, here we have an easier to see correlation between family size and station
wagon purchases. If the family size is large, independent of their income, 3/4
buy a station wagon; and if the family size is small, independent of their
income, less than 1/10 buy a station wagon.

I have created pivot charts for both pivot tables. The top is shown on Car
Chart1 and the bottom one is shown on Car Chart2. The second one more easily
shows a stronger effect for family size than the first one does for income.

Finally, in the second pivot table, drag and drop Salary from the "Choose
fields to add to report" section on the top of the PivotTable Fields into
the "Drop Report Filters Here" in the Pivot table itself, or move the Salary
icon from the Row Labels Section to the Report Filter section in the PivotTable
Field List. This simplifies the pivot table, as it now has just one selection
for the row (Family Size). We can recalculate the data inside the pivot table
by selecting which "page" to base the calculations on (High Salary, Low Salary,
or both) in the filter activated by the disclosure triangle for Salary. Also
see how the pivot chart changes what it displays when filtering Salary.

In fact, we can manipulate the pivot chart directly (which also changes the
pivot table it came from). Go to the Cars Chart1 worksheet. Click on the
chart and the PivotChart Fields will appear. Then hover over Family Size
(in the "Choose fileds to add to report" window) and click the disclose
triangle, at which point we can filter these values: let's choose to display
information for only Large family sizes. Click OK and watch the chart change to
show High and Low salaries, but display only the values for Large families (see
the legend on the bottom).

After doing so, if we go back to the Cars tab, we will see a condensed Pivot
table, with Family Size filtered to include only large familes. We can unfilter
this category and it will change both on this pivot table and the pivot chart
associated with it.

--------------------------------------------------------------------------------

2) Analyzing The Books Database

In the books database, let's create pivot table for the amount of money spent
on books in different categories by the years they were bought. Because this
database has so many columns, and fills what we can see on the worksheet, let's
create the pivot table on its own worksheet.

We drop Category into the Rows and (the year) Bought into the Columns. Typically
we put the item with the most values in the row (here there are more categories
than years bought). We drop Cost into the Sigma Values, and specify a field
setting of sum for the summary (the default is count, because not all cells
contain numbers: some cells are blank) and a Number with a comma  and no decimal
digits.

Notice that the years 1984 and 1998 have very few entries, and (blank) has
none. So, we will use the filter for Bought to eliminate those columns.
Likewise, there is a Category that is (blank), so we use the filter for
Category to elimate that row.

Let's make a pivot chart of this data. The heights of each category show the
cost of the books purchased (the various years show in different colors), with
the total height indicating on which category of books the most money was
spent.

In this database most book purchases were in the General category, followed by
Programming and Theory. Also by inspection, 2002 was a year a lot of money was
spent on Programming and Theory books. Note that I specified an Alignment on
the X-axis to be horizontal for the Category names, but I changed the angle.

It is easy to see a breakdown on the chart by each year Bought. Click on the
chart and the PivotTable Fields will appear. Notice that a funnel already
appears beside the year Bought checked box in the "Choose fields to add to
report" window). Hover over Bought and click the disclosure triangle, at which
point we can refilter these values. Click Select All twice (to select none) and
then click any particular year, say 2002, and click OK. Watch the chart change
to show the categories and total purchases for just that year.

To make this pivot table a bit more useful, we can put the Publisher icon/label
as a Page Field/Report Filter. Its default is (All), but we can select
individual publishers (and arbitrary combinations) to see whether a publisher
(or publishers) mostly publishes books in some category. For example, if we
click the Books Chart chart, we can click on Publisher (in the "Choose fields
to add to report" in the upper box) and click the disclose triangle to allow us
to filter by publisher. We can specify Addison-Wesley, click OK, and see that
most books from that publisher are Programming and then Language books.

Likewise, we could have selected just this publisher on the pivot chart to see
the results graphically, just as we did for year Bought above.

Finally, also on the Books worksheet, note that the Grand Total on the bottom
row shows the amount spent during each year; the Grand Total on the right
column shows the amount spent on books in each category. If we wanted to see
this data as percentages, we could either use specify the options row % or
col %.

As one last change, click on the Books Chart tab and the pivot chart and 
interchange Category and Bought. Now the chart nicely shows how much money
was spent on these books each year (but the legend is huge, since it contains
all the categories: too big to display so it is truncated; we can make the
legend bigger to show all the categories). To not see this breakdown, click the
checked box Category, to make all these categories disappear, leaving behind
just the total amount spent each year.

Finally, suppose we wanted a chart whose columns went from the category of the
most money spent to the category of the least money spent. We cannot specify
this ordering with the standard Pivot tools. But, we can copy/paste this Pivot
Table into some Excel region and then just insert a Chart of this data with
the reordered data.

Show how to copy+sort by Grand Total Columns and then chart 

--------------------------------------------------------------------------------

3) Simpson's Paradox and Analyzing Databases

In Lab 10 I'm going to ask you to analyze some databases that exhibit Simpson's
Paradox, so I thought I would talk about it here first. It is not a true
paradox: it just shows something that is counter-intuitive: something that can
confuse/surprise people, but can be explained away by simple calculations.
But it is worth knowing about so you won't be (too) confused in the situations
where it occurs.

Simpson's Paradox generally is about differences when looking at parts of the
data vs. looking at all of the data. Specifically it is about averages and
combining averages; it is often stated using the shooting averages for
basketball players. It states. If Player 1 has a better shooting average than
Player 2 in both the first and second half of a game (that is, looking at parts
of the data), it is possible that Player 2 has a better shooting average over
the entire game (that is looking at all the data). That doesn't seem right or
possible, but look at the numbers in the following example and follow the
calculations.

          Player 1	              Player 2
1st half:   1 basket  for  7 shots      0 basket  for  3 shots
2nd half:   3 baskets for  3 shots      5 baskets for  7 shots
------------------------------------------------------------------
Whole Game: 4 baskets for 10 shots      5 baskets for 10 shots

In the first  half Player 1 out shoots Player 2:  14% to  0%
In the second half Player 1 out shoots Player 2: 100% to 71%
In the entire game Player 2 out shoots Player 1:  50% to 40%

It is counterintuitive that Player 1's average is better than Player 2's average
in both halfs, but when both halves are taken together, Player 2's average is
better. But if you compute the averages for these numbers, that is exactly what
you discover.

Mathematically what Simpson's law is saying is that if
  a1/b1 > x1/y1 and
  a2/b2 > x2/y2 then it is not necessarily the case that their combined averages
  (a1+a2)/(b1+b2) > (x1+x2)/(y1+y2)
It is true that a1/b1 + a2/b2 > x1/y1 + x2/y2, but that just represents the SUMS
of their averages, not their COMBINED averages. 

Let's look at a more extreme example by skewing the averages to see what might
be going on: more obvious here, more subtle in the example above.

          Player 1	               Player 2
1st half:   1 basket  for   2 shots      45 baskets for  91 shots
2nd half:   1 baskets for  98 shots       0 baskets for   9 shots
------------------------------------------------------------------
Whole Game: 2 baskets for 100 shots      45 baskets for 100 shots

Here, Player 1's average in the 1st half is 50% (but this "high" average doesn't
count many of her shots in the game). Player 2's average in the 1st half is a
lower than 50% (and this less "high" average counts most of her shots in the
game).

Then, Player 1's average in the 2nd half is just over 1% (and this "low" average
counts most of her shots in the game). Player 2's average in the 2nd half is
again lower, 0% (but this "low" average doesn't count many of her shots in the
game).

So both players have high averages in the first half but very low ones in the
second half: but the halves have a very different number of shots). Player 2
shoots most of her shots in the first half, so her overall final average will
be close to 50%. Both players have low averages in the second half (and again,
a very different number of shots). Player 1 shoots most of her shots in the
second half, so her overall final average will be close to 1%).

Sometimes we can find this same paradox when breaking down data into more than
two parts or having more than two players. The examples abover are just the
simplest way to illustrate the paradox.
 
Now, it is OFTEN the case that a player doing better in each half will also do
better overall, which is intuitive.

            Player 1	                 Player 2
1st half  :   40 baskets for  45 shots      20 baskets for  45 shots
2nd half  :   50 baskets for  55 shots      25 baskets for  55 shots
------------------------------------------------------------------
Whole Game:   90 baskets for 100 shots      45 baskets for 100 shots

But Simpson's paradox says that we might find numbers that go against our
intutition.

How might this paradox show up when analyzing something important: a medical
database in a drug trial? Well suppose that we look at a database where we look
at wellness outcomes (cure rate) for men and women in an study taking Drug 1 or
Drug 2. We can find that on average women do better on Drug 1 than Drug 2, and
on average men also do better on Drug 1 than Drug 2...but on average over all
the people in the study, people do better on Drug 2 than Drug 1!

We can show an example by just switching around the meanings of the numbers in
the first example we studied.

        Drug 1	              Drug 2
Men   :   1 cured of 7 men      0 cured of 3 men    (10 men in the study)
Women :   3 cured of 3 women    5 cured of 7 women  (10 women in the study)
------------------------------------------------------------------
People:   4 cured of 10 people  5 cured of 10 people

It looks like a paradox that Drug 1 has a higer cure rate for men and a higher
cure rate for women, but Drug 2 has a higher cure rate for people!

So, the manufacturer of Drug 1 gets to advertise: our drug is proven to work
better for men and work better for women; the manufacturer of Drug 2 gets to
advertise: overall, our drug works better than Drug 1 when people take it.
They both have made correct claims! Blame Simpson's paradox for the confusion.

--------------------------------------------------------------------------------

4) Summary

Not much new here, besides being aware of Simpson's "paradox".