Updated: 2020 More Examples of Pivot Tables/Charts + Simpson's Paradox 0) Introduction: 1) Analyzing the Car-Sales Database 2) Analyzing the Books Database 3) Simpson's Paradox and Analyzing Databases 4) Summary -------------------------------------------------------------------------------- In this lecture we continue examining examples of pivot tables used to quickly analyze/summarize data. Ultimately, we look at more complicated pivot tables, with multiple selections in rows and the use of page fields for filtering. The end of this lecture discusses Simpson's Paradox, which examines differences when looking at parts of the data vs. looking at all of the data. -------------------------------------------------------------------------------- 1) Analyzing The Car-Sales Database Examine the database on the Cars worksheet for this lecture. This data records whether or not (Yes or No) over 300 different families (of a size characterized as either Small or Large and of family income/salary characterized as either High or Low) decided to purchase a Station Wagon. Imagine that we want to use this information to determine what characteristics make a family more or less likely to purchase a station wagon. Following the steps from the previous lecture, let's create a pivot table with this data. Drop the "Salary" first and "Family Size" second into the Row Labels of the PivotTable itself or the PivotTable Fields. Notice these labels appear in this order: left to right in two columns on the left of the Pivot table. We can change their order by dragging and dropping these labels in the table or into he PivotTable Fields or selecting from the pulldown list for Salaray/Family Size in the rows on the index. Thus, we can put multiple data -any amount- in rows (and can do so for columns too). Drop the "Station Wagon?" into the Column Labels AND into the Sigma Values areas of the of the PivotTable itself or the PivotTable Fields. Because the data items are non-numeric, Excel defaults to count these items: how many No and Yes entries appear. It is difficult to compare these counts directly, so instead we want to report these counts as percents. To do so we right-click the data items, and click "Value Field Settings..." There we click the "Show values as" and in the "Show values as" box we click the "% of row" entry. Looking at the pivot table, we see that for high salaried families, if the family size is large, 75% did (and 25% didn't) buy a station wagon. And, for high salaried families, if the family size is small, 7% did (and 93% didn't) buy a station wagon. Overall, the High Total row (produced because we have 2 criteria in the rows: Salary first and then Family Size) shows 56% bought (and 44% didn't buy) a station wagon. Looking at the pivot table, we see that for low salaried families, if the family size is large, 73% did (and 27% didn't) buy a station wagon. And, for low salaried families, if the family size is small, 4% did (and 96% didn't) buy a station wagon. Overall, the Low Total row (produced because we have 2 criteria in the rows: Salary first and then Family Size) shows 59% bought (and 41% didn't buy) a station wagon. So, although the totals are about the same, each near 50%, we can see that whether a family has a low or high salary, they are about equally likely to buy or not buy a station wagon (when family size is not taken into account). Let's analyze the data another way, dragging/dropping the Family Size in front to the Salary (done as the second pivot table on the worksheet). Notice now instead of High and Low totals (for Salary) the pivot table shows Large and Small totals (for Family Size). The Large Total row shows 74% bought (and 26% didn't buy) a station wagon. Similarly, the Small Total row shows 6% bought (and 94% didn't buy) a station wagon. So, here we have an easier to see correlation between family size and station wagon purchases. If the family size is large, independent of their income, 3/4 buy a station wagon; and if the family size is small, independent of their income, less than 1/10 buy a station wagon. I have created pivot charts for both pivot tables. The top is shown on Car Chart1 and the bottom one is shown on Car Chart2. The second one more easily shows a stronger effect for family size than the first one does for income. Finally, in the second pivot table, drag and drop Salary from the "Choose fields to add to report" section on the top of the PivotTable Fields into the "Drop Report Filters Here" in the Pivot table itself, or move the Salary icon from the Row Labels Section to the Report Filter section in the PivotTable Field List. This simplifies the pivot table, as it now has just one selection for the row (Family Size). We can recalculate the data inside the pivot table by selecting which "page" to base the calculations on (High Salary, Low Salary, or both) in the filter activated by the disclosure triangle for Salary. Also see how the pivot chart changes what it displays when filtering Salary. In fact, we can manipulate the pivot chart directly (which also changes the pivot table it came from). Go to the Cars Chart1 worksheet. Click on the chart and the PivotChart Fields will appear. Then hover over Family Size (in the "Choose fileds to add to report" window) and click the disclose triangle, at which point we can filter these values: let's choose to display information for only Large family sizes. Click OK and watch the chart change to show High and Low salaries, but display only the values for Large families (see the legend on the bottom). After doing so, if we go back to the Cars tab, we will see a condensed Pivot table, with Family Size filtered to include only large familes. We can unfilter this category and it will change both on this pivot table and the pivot chart associated with it. -------------------------------------------------------------------------------- 2) Analyzing The Books Database In the books database, let's create pivot table for the amount of money spent on books in different categories by the years they were bought. Because this database has so many columns, and fills what we can see on the worksheet, let's create the pivot table on its own worksheet. We drop Category into the Rows and (the year) Bought into the Columns. Typically we put the item with the most values in the row (here there are more categories than years bought). We drop Cost into the Sigma Values, and specify a field setting of sum for the summary (the default is count, because not all cells contain numbers: some cells are blank) and a Number with a comma and no decimal digits. Notice that the years 1984 and 1998 have very few entries, and (blank) has none. So, we will use the filter for Bought to eliminate those columns. Likewise, there is a Category that is (blank), so we use the filter for Category to elimate that row. Let's make a pivot chart of this data. The heights of each category show the cost of the books purchased (the various years show in different colors), with the total height indicating on which category of books the most money was spent. In this database most book purchases were in the General category, followed by Programming and Theory. Also by inspection, 2002 was a year a lot of money was spent on Programming and Theory books. Note that I specified an Alignment on the X-axis to be horizontal for the Category names, but I changed the angle. It is easy to see a breakdown on the chart by each year Bought. Click on the chart and the PivotTable Fields will appear. Notice that a funnel already appears beside the year Bought checked box in the "Choose fields to add to report" window). Hover over Bought and click the disclosure triangle, at which point we can refilter these values. Click Select All twice (to select none) and then click any particular year, say 2002, and click OK. Watch the chart change to show the categories and total purchases for just that year. To make this pivot table a bit more useful, we can put the Publisher icon/label as a Page Field/Report Filter. Its default is (All), but we can select individual publishers (and arbitrary combinations) to see whether a publisher (or publishers) mostly publishes books in some category. For example, if we click the Books Chart chart, we can click on Publisher (in the "Choose fields to add to report" in the upper box) and click the disclose triangle to allow us to filter by publisher. We can specify Addison-Wesley, click OK, and see that most books from that publisher are Programming and then Language books. Likewise, we could have selected just this publisher on the pivot chart to see the results graphically, just as we did for year Bought above. Finally, also on the Books worksheet, note that the Grand Total on the bottom row shows the amount spent during each year; the Grand Total on the right column shows the amount spent on books in each category. If we wanted to see this data as percentages, we could either use specify the options row % or col %. As one last change, click on the Books Chart tab and the pivot chart and interchange Category and Bought. Now the chart nicely shows how much money was spent on these books each year (but the legend is huge, since it contains all the categories: too big to display so it is truncated; we can make the legend bigger to show all the categories). To not see this breakdown, click the checked box Category, to make all these categories disappear, leaving behind just the total amount spent each year. Finally, suppose we wanted a chart whose columns went from the category of the most money spent to the category of the least money spent. We cannot specify this ordering with the standard Pivot tools. But, we can copy/paste this Pivot Table into some Excel region and then just insert a Chart of this data with the reordered data. Show how to copy+sort by Grand Total Columns and then chart -------------------------------------------------------------------------------- 3) Simpson's Paradox and Analyzing Databases In Lab 10 I'm going to ask you to analyze some databases that exhibit Simpson's Paradox, so I thought I would talk about it here first. It is not a true paradox: it just shows something that is counter-intuitive: something that can confuse/surprise people, but can be explained away by simple calculations. But it is worth knowing about so you won't be (too) confused in the situations where it occurs. Simpson's Paradox generally is about differences when looking at parts of the data vs. looking at all of the data. Specifically it is about averages and combining averages; it is often stated using the shooting averages for basketball players. It states. If Player 1 has a better shooting average than Player 2 in both the first and second half of a game (that is, looking at parts of the data), it is possible that Player 2 has a better shooting average over the entire game (that is looking at all the data). That doesn't seem right or possible, but look at the numbers in the following example and follow the calculations. Player 1 Player 2 1st half: 1 basket for 7 shots 0 basket for 3 shots 2nd half: 3 baskets for 3 shots 5 baskets for 7 shots ------------------------------------------------------------------ Whole Game: 4 baskets for 10 shots 5 baskets for 10 shots In the first half Player 1 out shoots Player 2: 14% to 0% In the second half Player 1 out shoots Player 2: 100% to 71% In the entire game Player 2 out shoots Player 1: 50% to 40% It is counterintuitive that Player 1's average is better than Player 2's average in both halfs, but when both halves are taken together, Player 2's average is better. But if you compute the averages for these numbers, that is exactly what you discover. Mathematically what Simpson's law is saying is that if a1/b1 > x1/y1 and a2/b2 > x2/y2 then it is not necessarily the case that their combined averages (a1+a2)/(b1+b2) > (x1+x2)/(y1+y2) It is true that a1/b1 + a2/b2 > x1/y1 + x2/y2, but that just represents the SUMS of their averages, not their COMBINED averages. Let's look at a more extreme example by skewing the averages to see what might be going on: more obvious here, more subtle in the example above. Player 1 Player 2 1st half: 1 basket for 2 shots 45 baskets for 91 shots 2nd half: 1 baskets for 98 shots 0 baskets for 9 shots ------------------------------------------------------------------ Whole Game: 2 baskets for 100 shots 45 baskets for 100 shots Here, Player 1's average in the 1st half is 50% (but this "high" average doesn't count many of her shots in the game). Player 2's average in the 1st half is a lower than 50% (and this less "high" average counts most of her shots in the game). Then, Player 1's average in the 2nd half is just over 1% (and this "low" average counts most of her shots in the game). Player 2's average in the 2nd half is again lower, 0% (but this "low" average doesn't count many of her shots in the game). So both players have high averages in the first half but very low ones in the second half: but the halves have a very different number of shots). Player 2 shoots most of her shots in the first half, so her overall final average will be close to 50%. Both players have low averages in the second half (and again, a very different number of shots). Player 1 shoots most of her shots in the second half, so her overall final average will be close to 1%). Sometimes we can find this same paradox when breaking down data into more than two parts or having more than two players. The examples abover are just the simplest way to illustrate the paradox. Now, it is OFTEN the case that a player doing better in each half will also do better overall, which is intuitive. Player 1 Player 2 1st half : 40 baskets for 45 shots 20 baskets for 45 shots 2nd half : 50 baskets for 55 shots 25 baskets for 55 shots ------------------------------------------------------------------ Whole Game: 90 baskets for 100 shots 45 baskets for 100 shots But Simpson's paradox says that we might find numbers that go against our intutition. How might this paradox show up when analyzing something important: a medical database in a drug trial? Well suppose that we look at a database where we look at wellness outcomes (cure rate) for men and women in an study taking Drug 1 or Drug 2. We can find that on average women do better on Drug 1 than Drug 2, and on average men also do better on Drug 1 than Drug 2...but on average over all the people in the study, people do better on Drug 2 than Drug 1! We can show an example by just switching around the meanings of the numbers in the first example we studied. Drug 1 Drug 2 Men : 1 cured of 7 men 0 cured of 3 men (10 men in the study) Women : 3 cured of 3 women 5 cured of 7 women (10 women in the study) ------------------------------------------------------------------ People: 4 cured of 10 people 5 cured of 10 people It looks like a paradox that Drug 1 has a higer cure rate for men and a higher cure rate for women, but Drug 2 has a higher cure rate for people! So, the manufacturer of Drug 1 gets to advertise: our drug is proven to work better for men and work better for women; the manufacturer of Drug 2 gets to advertise: overall, our drug works better than Drug 1 when people take it. They both have made correct claims! Blame Simpson's paradox for the confusion. -------------------------------------------------------------------------------- 4) Summary Not much new here, besides being aware of Simpson's "paradox".