Naked Statistics

Stripping the Dread from the Data

by Charles Wheelan

  • On Amazon
  • ISBN: 978-0393071955
  • My Rating: 7/10

What is it about?

Naked Statistics is an introduction to statistics, with a focus on concepts and not on mathematical formulas.

My impression

I found Naked Statistics an informative book. The author does a good job of introducing the concepts in an approachable way, sprinkled with a bit of humor. The examples are somewhat US-specific, which makes some of them (e.g. those related to baseball) a bit harder to follow for readers outside the USA.

My notes

Introduction: Why I hated calculus but love statistics

Statistics is like a high-caliber weapon: helpful when used correctly and potentially disastrous in the wrong hands.

It's easy to lie with statistics, but it's hard to tell the truth without them.

What's the Point?

[...] an overreliance on any descriptive statistic can lead to misleading conclusions, or cause undesirable behavior.

One key function of statistics is to use the data we have to make informed conjectures about larger questions for which we do not have full information. In short, we can use data from the "known world" to make informed inferences about the "unknown world".

The reality is that you can lie with statistics. Or you can make inadvertent errors. In either case, the mathematical precision attached to statistical analysis can dress up some serious nonsense.

Descriptive Statistics: Who was the best baseball player of all time?

[...] the most basic task when working with data is to summarize a great deal of information.

The irony is that more data can often present less clarity. So we simplify. We perform calculations that reduce a complex array of data into a handful of numbers that describe those data, just as we might encapsulate a complex, multifaceted Olympic gymnastics performance with one number: 9.8.

The first descriptive task is often to find some measure of the "middle" of a set of data, or what statisticians might describe as its "central tendency". [...] The most basic measure of the "middle" of a distribution is the mean, or average.

The mean, or average, turns out to have some problems [...], namely, that it is prone to distortion by "outliers", which are observations that lie farther from the center. [...] For this reason, we have another statistic that also signals the "middle" of a distribution, albeit differently: the median. The median is the point that divides a distribution in half, meaning that half of the observations lie above the median and half lie below. (If there is an even number of observations, the median is the midpoint between the two middle observations.)

For distributions without serious outliers, the median and the mean will be similar.
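
As a quick illustration (a minimal Python sketch with made-up income figures, not from the book), a single extreme observation drags the mean upward while the median barely moves:

  import statistics

  incomes = [35_000, 42_000, 48_000, 51_000, 60_000]
  print(statistics.mean(incomes), statistics.median(incomes))   # 47200 and 48000

  incomes.append(10_000_000)   # one extreme outlier joins the sample
  print(statistics.mean(incomes), statistics.median(incomes))   # mean jumps to 1706000; median is now 49500.0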

Neither the median nor the mean is hard to calculate; the key is determining which measure of the "middle" is more accurate in a particular situation [...]. Meanwhile, the median has some useful relatives. [...] the median divides a distribution in half. The distribution can be further divided into quarters, or quartiles. The first quartile consists of the bottom 25 percent of the observations; the second quartile consists of the next 25 percent of the observations; and so on. Or the distribution can be divided into deciles, each with 10 percent of the observations. [...] We can go even further and divide the distribution into hundredths, or percentiles. Each percentile represents 1 percent of the distribution, so that the 1st percentile represents the bottom 1 percent of the distribution and the 99th percentile represents the top 1 percent of the distribution. The benefit of these kinds of descriptive statistics is that they describe where a particular observation lies compared with everyone else.
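
The same idea in a small sketch (simulated exam scores; numpy assumed available): quartile cut points summarize the distribution, and a percentile rank tells you where one observation sits relative to the rest.

  import numpy as np

  scores = np.random.default_rng(0).normal(loc=500, scale=100, size=10_000)
  print(np.percentile(scores, [25, 50, 75]))        # the quartile cut points

  my_score = 640
  rank = (scores < my_score).mean() * 100           # share of observations below this score
  print(f"roughly the {rank:.0f}th percentile")     # around the 92nd percentile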

Another statistic that can help us describe what might otherwise be a jumble of numbers is the standard deviation, which is a measure of how dispersed the data are from their mean. In other words, how spread out are the observations?

For many typical distributions of data, a high proportion of the observations lie within one standard deviation of the mean (meaning that they are in the range from one standard deviation below the mean to one standard deviation above the mean).

[...] one of the most important, helpful, and common distributions in statistics: the normal distribution. Data that are distributed normally are symmetrical around their mean in a bell shape that will look familiar to you.

The beauty of the normal distribution [...] comes from the fact that we know by definition exactly what proportion of the observations in a normal distribution lie within one standard deviation of the mean (68.2 percent), within two standard deviations of the mean (95.4 percent), within three standard deviations (99.7 percent), and so on.
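
This 68-95-99.7 pattern is easy to sanity-check by simulation (a throwaway sketch; numpy assumed):

  import numpy as np

  x = np.random.default_rng(42).normal(loc=0.0, scale=1.0, size=1_000_000)
  for k in (1, 2, 3):
      share = np.mean(np.abs(x) <= k)
      print(f"within {k} standard deviation(s): {share:.3f}")
  # expect roughly 0.682, 0.954, 0.997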

Percentages are useful – but also potentially confusing or even deceptive. The formula for calculating a percentage difference (or change) is the following: (new figure - original figure)/original figure. The numerator (the part on the top of the fraction) gives us the size of the change in absolute terms; the denominator (the bottom of the fraction) is what puts this change in context by comparing it with our starting point.

Percentage change must not be confused with a change in percentage points.
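
A tiny worked example of the difference (illustrative tax rates, not from the book): a rate going from 3 percent to 5 percent is a 2 percentage point increase, but a roughly 67 percent increase.

  old_rate, new_rate = 0.03, 0.05
  pct_change = (new_rate - old_rate) / old_rate * 100     # (new - original) / original
  point_change = (new_rate - old_rate) * 100
  print(f"{pct_change:.0f} percent increase")             # 67 percent increase
  print(f"{point_change:.0f} percentage point increase")  # 2 percentage point increase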

Any index is highly sensitive to the descriptive statistics that are cobbled together to build it, and to the weight given to each of those components. As a result, indices range from useful but imperfect tools to complete charades.

Deceptive Description: "He's got a great personality!" and other true but grossly misleading statements

Although the field of statistics is rooted in mathematics, and mathematics is exact, the use of statistics to describe complex phenomena is not exact. That leaves plenty of room for shading the truth.

Once there are multiple ways of describing the same thing (e.g., "he's got a great personality" or "he was convicted of securities fraud"), the descriptive statistics that we choose to use (or not to use) will have a profound impact on the impression that we leave. Someone with nefarious motives can use perfectly good facts and figures to support entirely disputable or illegitimate conclusions.

Accuracy is a measure of whether a figure is broadly consistent with the truth – hence the danger of confusing precision with accuracy. If an answer is accurate, then more precision is usually better. But no amount of precision can make up for inaccuracy. In fact, precision can mask inaccuracy by giving us a false sense of certainty, either inadvertently or quite deliberately.

[...] even the most precise measurements or calculations should be checked against common sense.

Even the most precise and accurate descriptive statistics can suffer from a more fundamental problem: a lack of clarity over what exactly we are trying to define, describe, or explain.

From the standpoint of accuracy, the median versus mean question revolves around whether the outliers in a distribution distort what is being described or are instead an important part of the message. [...] Of course, nothing says that you must choose the median or the mean. Any comprehensive statistical analysis would likely present both. When just the median or the mean appears, it may be for the sake of brevity – or it may be because someone is seeking to "persuade" with statistics.

[...] a great many statistical shenanigans arise from "apples and oranges" comparisons.

[...] a more subtle example of apples and oranges: inflation. A dollar today is not the same as a dollar sixty years ago; it buys much less. [...] This is such an important phenomenon that economists have terms to denote whether figures have been adjusted for inflation or not. Nominal figures are not adjusted for inflation. [...] Real figures, on the other hand, are adjusted for inflation. The most commonly accepted methodology is to convert all of the figures into a single unit, such as 2011 dollars, to make an "apples and apples" comparison.

Percentages don't lie – but they can exaggerate. One way to make growth look explosive is to use percentage change to describe some change relative to a very low starting point. [...] Researchers will sometimes qualify a growth figure by pointing out that it is "from a low base", meaning that any increase is going to look large by comparison. Obviously the flip side is true. A small percentage of an enormous sum can be a big number.

There is a common business aphorism: "You can't manage what you can't measure." True. But you had better be darn sure that what you are measuring is really what you are trying to manage.

Statistics measure the outcomes that matter; incentives give us a reason to improve those outcomes. Or, in some cases, just to make the statistics look better.

Correlation: How does Netflix know what movies I like?

Correlation measures the degree to which two phenomena are related to one another. For example, there is a correlation between summer temperatures and ice cream sales. When one goes up, so does the other. Two variables are positively correlated if a change in one is associated with a change in the other in the same direction, such as the relationship between height and weight. Taller people weigh more (on average); shorter people weigh less. A correlation is negative if a positive change in one variable is associated with a negative change in the other, such as the relationship between exercise and weight.

[...] the power of correlation as a statistical tool is that we can encapsulate an association between two variables in a single descriptive statistic: the correlation coefficient.

The correlation coefficient has two fabulously attractive characteristics. First, [...] it is a single number ranging from -1 to 1. A correlation of 1, often described as perfect correlation, means that every change in one variable is associated with an equivalent change in the other variable in the same direction. A correlation of -1, or perfect negative correlation, means that every change in one variable is associated with an equivalent change in the other variable in the opposite direction. The closer the correlation is to 1 or -1, the stronger the association. A correlation of 0 (or close to it) means that the variables have no meaningful association with one another [...].

The second attractive feature of the correlation coefficient is that it has no units attached to it.
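
A small sketch of both properties (made-up height/weight data; numpy assumed): the coefficient lands between -1 and 1, and converting centimeters to inches leaves it untouched, precisely because it has no units.

  import numpy as np

  rng = np.random.default_rng(1)
  height_cm = rng.normal(175, 10, size=500)
  weight_kg = 0.9 * (height_cm - 100) + rng.normal(0, 8, size=500)   # noisy positive relationship

  r = np.corrcoef(height_cm, weight_kg)[0, 1]
  r_inches = np.corrcoef(height_cm / 2.54, weight_kg)[0, 1]          # same data, different units
  print(round(r, 3), round(r_inches, 3))                             # identical, roughly 0.7-0.8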

One crucial point [...] is that correlation does not imply causation; a positive or negative association between two variables does not necessarily mean that a change in one of the variables is causing the change in the other.

Basic Probability: Don't buy the extended warranty on your $99 printer

Probability is the study of events and outcomes involving an element of uncertainty.

Probabilities do not tell us what will happen for sure; they tell us what is likely to happen and what is less likely to happen.

Probability can also sometimes tell us after the fact what likely happened and what likely did not happen – as in the case of DNA analysis.

Often it is extremely valuable to know the likelihood of multiple events' happening. What is the probability that the electricity goes out and the generator doesn't work? The probability of two independent events' both happening is the product of their respective probabilities.

Suppose you are interested in the probability that one event happens or another event happens: outcome A or outcome B (this time assuming that they are mutually exclusive, meaning they cannot both happen). In this case, the probability of getting A or B consists of the sum of their individual probabilities: the probability of A plus the probability of B.
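
In code, with made-up probabilities (a minimal sketch, not the book's numbers):

  p_power_out = 0.01        # assumed probability that the electricity goes out
  p_generator_fails = 0.05  # assumed probability that the generator doesn't work

  # independent events both happening: multiply
  print(p_power_out * p_generator_fails)   # about 0.0005

  # mutually exclusive outcomes, A or B: add
  p_roll_1, p_roll_2 = 1 / 6, 1 / 6
  print(p_roll_1 + p_roll_2)               # a die shows 1 or 2, about 0.333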

Probability also enables us to calculate what might be the most useful tool in all of managerial decision making, particularly finance: expected value. The expected value takes basic probability one step further. The expected value or payoff from some event, say purchasing a lottery ticket, is the sum of all the different outcomes, each weighted by its probability and payoff.

This is one of the crucial lessons of probability. Good decisions – as measured by the underlying probabilities – can turn out badly. And bad decisions [...] can still turn out well, at least in the short run. But probability triumphs in the end. An important theorem known as the law of large numbers tells us that as the number of trials increases, the average of the outcomes will get closer and closer to its expected value.
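
A quick simulation of both ideas (an illustrative $1 lottery ticket that pays $5 with probability 0.1; numpy assumed): the expected payoff is -$0.50, and the average outcome drifts toward that value as the number of trials grows.

  import numpy as np

  rng = np.random.default_rng(7)
  cost, prize, p_win = 1.0, 5.0, 0.1
  expected_value = p_win * prize - cost            # -0.5

  for n in (100, 10_000, 1_000_000):
      wins = rng.random(n) < p_win
      print(n, round((wins * prize - cost).mean(), 3))   # approaches -0.5 as n grows
  print("expected value:", expected_value)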

The Monty Hall Problem

In short, if you ever find yourself as a contestant on Let's Make a Deal, you should definitely switch doors when Monty Hall (or his replacement) gives you the option. The more broadly applicable lesson is that your gut instinct on probability can sometimes steer you astray.
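
A brute-force simulation makes the counterintuitive answer easy to believe (a minimal sketch, standard library only): switching wins about two-thirds of the time.

  import random

  def play(switch: bool) -> bool:
      doors = [0, 1, 2]
      car = random.choice(doors)
      pick = random.choice(doors)
      # Monty opens a door that hides a goat and is not the contestant's pick
      opened = random.choice([d for d in doors if d != pick and d != car])
      if switch:
          pick = next(d for d in doors if d != pick and d != opened)
      return pick == car

  trials = 100_000
  print("stay:  ", sum(play(False) for _ in range(trials)) / trials)   # about 1/3
  print("switch:", sum(play(True) for _ in range(trials)) / trials)    # about 2/3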

Problems with Probability: How overconfident math geeks nearly destroyed the global financial system

The greatest risks are never the ones you can see and measure, but the ones you can't see and therefore can never measure. The ones that seem so far outside the boundary of normal probability that you can't imagine they could happen in your lifetime – even though, of course, they do happen, more often than you care to realize.

Nassim Nicholas Taleb

Unlikely things happen. In fact, over a long enough period of time, they are not even that unlikely.

Probability doesn't make mistakes; people using probability make mistakes.

Probability tells us that any outlier – an observation that is particularly far from the mean in one direction or the other – is likely to be followed by outcomes that are more consistent with the long-term average.

We like to think of numbers as "cold, hard facts". If we do the calculations right, then we must have the right answer. The more interesting and dangerous reality is that we can sometimes do the calculations correctly and end up blundering in a dangerous direction. We can blow up the financial system or harass a twenty-two-year-old white guy standing on a particular street corner at a particular time of day, because, according to our statistical model, he is almost certainly there to buy drugs. For all the elegance and precision of probability, there is no substitute for thinking about what calculations we are doing and why we are doing them.

The Importance of Data: "Garbage in, garbage out"

[...] no amount of fancy analysis can make up for fundamentally flawed data. Hence the expression "garbage in, garbage out".

(1) A representative sample is a fabulously important thing, for it opens the door to some of the most powerful tools that statistics has to offer. (2) Getting a good sample is harder than it looks. (3) Many of the most egregious statistical assertions are caused by good statistical methods applied to bad samples, not the opposite. (4) Size matters, and bigger is better.

One crucial caveat is that a bigger sample will not make up for errors in its composition, or "bias". A bad sample is a bad sample. [...] In fact, a large, biased sample is arguably worse than a small, biased sample because it will give a false sense of confidence regarding the results.

The Central Limit Theorem: The LeBron James of statistics

The core principle underlying the central limit theorem is that a large, properly drawn sample will resemble the population from which it is drawn. Obviously there will be variation from sample to sample [...], but the probability that any sample will deviate massively from the underlying population is very low.

The central limit theorem enables us to make the following inferences [...].

  1. If we have detailed information about some population, then we can make powerful inferences about any properly drawn sample from that population.
  2. If we have detailed information about a properly drawn sample (mean and standard deviation), we can make strikingly accurate inferences about the population from which that sample was drawn.
  3. If we have data describing a particular sample, and data on a particular population, we can infer whether or not that sample is consistent with a sample that is likely to be drawn from that population. [...] The central limit theorem enables us to calculate the probability that a particular sample [...] was drawn from a given population [...]. If that probability is low, then we can conclude with a high degree of confidence that the sample was not drawn from the population in question [...].
  4. Last, if we know the underlying characteristics of two samples, we can infer whether or not both samples were likely drawn from the same population.

According to the central limit theorem, the sample means for any population will be distributed roughly as a normal distribution around the population mean.

The standard error measures the dispersion of the sample means. How tightly do we expect the sample means to cluster around the population mean?

A large standard error means that the sample means are spread out widely around the population mean; a small standard error means that they are clustered relatively tightly.

The "big picture" here is simple and massively powerful:

  1. If you draw large, random samples from any population, the means of those samples will be distributed normally around the population mean (regardless of what the distribution of the underlying population looks like).
  2. Most sample means will lie reasonably close to the population mean; the standard error is what defines "reasonably close".
  3. The central limit theorem tells us the probability that a sample mean will lie within a certain distance of the population mean. It is relatively unlikely that a sample mean will lie more than two standard errors from the population mean, and extremely unlikely that it will lie three or more standard errors from the population mean.
  4. The less likely it is that an outcome has been observed by chance, the more confident we can be in surmising that some other factor is in play.
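
A minimal simulation of points 1-3 above (numpy assumed; the skewed population is made up): sample means drawn from a decidedly non-normal population still cluster normally around the population mean, with a spread given by the standard error.

  import numpy as np

  rng = np.random.default_rng(3)
  population = rng.exponential(scale=10.0, size=1_000_000)   # skewed, nothing like a bell curve

  n = 100   # size of each sample
  sample_means = np.array([rng.choice(population, size=n).mean() for _ in range(5_000)])

  standard_error = population.std() / np.sqrt(n)
  print("population mean:", round(population.mean(), 2))
  print("mean of the sample means:", round(sample_means.mean(), 2))
  print("predicted standard error:", round(standard_error, 2))
  print("observed spread of the sample means:", round(sample_means.std(), 2))

  # roughly 95% of sample means land within two standard errors of the population mean
  within = np.mean(np.abs(sample_means - population.mean()) <= 2 * standard_error)
  print("within two standard errors:", round(within, 3))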

Inference: Why my statistics professor thought I might have cheated

Statistics cannot prove anything with certainty. Instead, the power of statistical inference derives from observing some pattern or outcome and then using probability to determine the most likely explanation for that outcome.

Of course, the most likely explanation is not always the right explanation. Extremely rare things happen.

One of the most common tools in statistical inference is hypothesis testing. [...] statistics alone cannot prove anything; instead, we use statistical inference to accept or reject explanations on the basis of their relative likelihood. To be more precise, any statistical inference begins with an implicit or explicit null hypothesis. This is our starting assumption, which will be rejected or not on the basis of subsequent statistical analysis. If we reject the null hypothesis, then we typically accept some alternative hypothesis that is more consistent with the data observed. For example, in a court of law the starting assumption, or null hypothesis, is that the defendant is innocent. The job of the prosecution is to persuade the judge or jury to reject that assumption and accept the alternative hypothesis, which is that the defendant is guilty. As a matter of logic, the alternative hypothesis is a conclusion that must be true if we can reject the null hypothesis.

One of the most common thresholds that researchers use for rejecting a null hypothesis is 5 percent, which is often written in decimal form: .05. This probability is known as a significance level, and it represents the upper bound for the likelihood of observing some pattern of data if the null hypothesis were true.

If the .05 significance level seems somewhat arbitrary, that's because it is. There is no single standardized statistical threshold for rejecting a null hypothesis.

[...] when we can reject a null hypothesis at some reasonable significance level, the results are said to be "statistically significant".

A Type I error involves wrongly rejecting a null hypothesis. Though the terminology is somewhat counterintuitive, this is also known as a "false positive". Here is one way to reconcile the jargon. When you go to the doctor to get tested for some disease, the null hypothesis is that you do not have that disease. If the lab results can be used to reject the null hypothesis, then you are said to test positive. And if you test positive but are not really sick, then it's a false positive.

In any case, the lower our statistical burden for rejecting the null hypothesis, the more likely it is to happen. [...] But there is a tension here. The higher the threshold for rejecting the null hypothesis, the more likely it is that we will fail to reject a null hypothesis that ought to be rejected. [...] This is known as a Type II error, or false negative.
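
A hand-rolled example of the machinery (made-up data; numpy assumed): is a coin that produced 62 heads in 100 flips fair? The null hypothesis is that it is; we estimate how often a result at least this extreme would occur if the null were true, and reject at the .05 level.

  import numpy as np

  rng = np.random.default_rng(11)
  observed_heads, flips = 62, 100

  null_heads = rng.binomial(n=flips, p=0.5, size=200_000)   # a world where the null is true
  p_value = np.mean(np.abs(null_heads - flips / 2) >= abs(observed_heads - flips / 2))

  print(round(p_value, 4))   # around 0.02
  print("reject the null at .05" if p_value < 0.05 else "fail to reject the null")
  # rejecting a true null here would be a Type I error (false positive);
  # failing to reject a false null would be a Type II error (false negative)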

Polling: How we know that 64 percent of Americans support the death penalty (with a sampling error of +/- 3 percent)

[...] the methodology of polling is just one more form of statistical inference. A poll (or survey) is an inference about the opinions of some population that is based on the views expressed by some sample drawn from that population.

One fundamental difference between a poll and other forms of sampling is that the sample statistic we care about will be not a mean (e.g., 187 pounds) but rather a percentage or proportion (e.g., 47 percent of voters, or .47). In other respects, the process is identical.

[...] a bigger sample makes for a shrinking standard error, which is how large national polls can end up with shockingly accurate results.
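
A small sketch of why (the usual normal approximation for a sampled proportion; the 64 percent figure is just reused as an illustration): the margin of error shrinks roughly with the square root of the sample size.

  import math

  def margin_of_error(p: float, n: int, z: float = 1.96) -> float:
      """Half-width of an approximate 95% confidence interval for a proportion."""
      return z * math.sqrt(p * (1 - p) / n)

  for n in (100, 1_000, 2_500):
      print(n, round(margin_of_error(0.64, n) * 100, 1), "percentage points")
  # 100 -> ~9.4, 1000 -> ~3.0, 2500 -> ~1.9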

Bad polling results do not typically stem from bad math when calculating the standard errors. Bad polling results typically stem from a biased sample, or bad questions, or both. The mantra "garbage in, garbage out" applies in spades when it comes to sampling public opinion.

Any poll that depends on individuals who select into the sample, such as a radio call-in show or a voluntary Internet survey, will capture only the views of those who make the effort to voice their opinions. These are likely to be the people who feel particularly strongly about an issue, or those who happen to have a lot of free time on their hands. Neither of these groups is likely to be representative of the public at large.

Any method of gathering opinion that systematically excludes some segment of the population is also prone to bias.

One indicator of a poll's validity is the response rate: What proportion of the people chosen to be contacted ultimately completed the poll or survey? A low response rate can be a warning sign for potential sampling bias. The more people there are who opt not to answer the poll, or who just can't be reached, the greater the possibility that this large group is different in some material way from those who did answer the questions.

Survey results can be extremely sensitive to the way a question is asked.

We know that people shade the truth, particularly when the questions asked are embarrassing or sensitive. Respondents may overstate their income, or inflate the number of times they have sex in a typical month. They may not admit that they do not vote. They may hesitate to express views that are unpopular or socially unacceptable. For all these reasons, even the most carefully designed poll is dependent on the integrity of the respondents' answers.

Regression Analysis: The miracle elixir

[...] regression analysis allows us to quantify the relationship between a particular variable and an outcome that we care about while controlling for other factors. In other words, we can isolate the effect of one variable, such as having a certain kind of job, while holding the effects of other variables constant.

At its core, regression analysis seeks to find the "best fit" for a linear relationship between two variables.

Regression analysis typically uses a methodology called ordinary least squares, or OLS. [...] The key point lies in the "least squares" part of the name; OLS fits the line that minimizes the sum of the squared residuals. [...] Each observation [...] has a residual, which is its vertical distance from the regression line, except for those observations that lie directly on the line, for which the residual equals zero [...].

[...] ordinary least squares gives us the best description of a linear relationship between two variables. The result is not only a line but [...] an equation describing that line. This is known as the regression equation, and it takes the following form: y = a + bx, [...] a is the y-intercept of the line (the value for y when x = 0); b is the slope of the line [...] The slope of the line [...] b, describes the "best" linear relationship [...].
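
A minimal sketch of an OLS fit on made-up data (numpy assumed), recovering a and b along with the squared residuals that the method minimizes:

  import numpy as np

  rng = np.random.default_rng(5)
  x = rng.uniform(0, 10, size=200)                  # explanatory variable
  y = 2.0 + 1.5 * x + rng.normal(0, 2, size=200)    # dependent variable with noise

  b, a = np.polyfit(x, y, deg=1)                    # slope first, then intercept
  print(f"y = {a:.2f} + {b:.2f}x")                  # close to the true 2.0 and 1.5

  residuals = y - (a + b * x)
  print("sum of squared residuals:", round(np.sum(residuals ** 2), 1))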

The variable that is being explained [...] is known as the dependent variable (because it depends on other factors). The variables that we are using to explain our dependent variable are known as explanatory variables since they explain the outcome that we care about.

When we include multiple variables in the regression equation, the analysis gives us an estimate of the linear association between each explanatory variable and the dependent variable while holding the other explanatory variables constant, or "controlling for" these other factors.
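
"Controlling for" other factors looks like this in a minimal multiple-regression sketch (entirely made-up income data; numpy assumed): each coefficient is the association with income while the other variable is held constant.

  import numpy as np

  rng = np.random.default_rng(9)
  n = 1_000
  education = rng.normal(14, 2, size=n)     # years of schooling
  experience = rng.normal(10, 4, size=n)    # years of work experience
  income = 5 + 3 * education + 1 * experience + rng.normal(0, 5, size=n)

  X = np.column_stack([np.ones(n), education, experience])   # intercept plus explanatory variables
  coef, *_ = np.linalg.lstsq(X, income, rcond=None)
  print(np.round(coef, 2))   # roughly [5, 3, 1]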

Common Regression Mistakes: The mandatory warning label

If regression analysis had a [...] warning label, it would say, Do Not Use When There Is Not a Linear Association between the Variables That You Are Analyzing. Remember, a regression coefficient describes the slope of the "line of best fit" for the data; a line that is not straight will have a different slope in different places.

Regression analysis is meant to be used when the relationship between variables is linear.

Regression analysis can only demonstrate an association between two variables. [...] we cannot prove with statistics alone that a change in one variable is causing a change in the other. In fact, a sloppy regression equation can produce a large and statistically significant association between two variables that have nothing to do with one another.

A statistical association between A and B does not prove that A causes B. In fact, it's entirely plausible that B is causing A.

[...] we should not use explanatory variables that might be affected by the outcome that we are trying to explain, or else the results will become hopelessly tangled. For example, it would be inappropriate to use the unemployment rate in a regression equation explaining GDP growth, since unemployment is clearly affected by the rate of GDP growth. [...] We should have reason to believe that our explanatory variables affect the dependent variable, and not the other way around.

Regression results will be misleading and inaccurate if the regression equation leaves out an important explanatory variable, particularly if other variables in the equation "pick up" that effect.

If a regression equation includes two or more explanatory variables that are highly correlated with each other, the analysis will not necessarily be able to discern the true relationship between each of those variables and the outcome that we are trying to explain.

When two explanatory variables are highly correlated, researchers will usually use one or the other in the regression equation, or they may create some kind of composite variable, such as "used cocaine or heroin".

Regression analysis, like all forms of statistical inference, is designed to offer us insights into the world around us. We seek patterns that will hold true for the larger population. However, our results are valid only for a population that is similar to the sample on which the analysis has been done.

Your results can be compromised if you include too many variables, particularly extraneous explanatory variables with no theoretical justification. For example, one should not design a research strategy built around the following premise: Since we don't know what causes autism, we should put as many potential explanatory variables as possible in the regression equation just to see what might turn up as statistically significant; then maybe we'll get some answers. If you put enough junk variables in a regression equation, one of them is bound to meet the threshold for statistical significance just by chance.
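
A quick simulation of that trap (pure noise on both sides; numpy and scipy assumed): regress an outcome that is random noise against forty unrelated junk variables, and on average a couple of them will clear the .05 threshold anyway.

  import numpy as np
  from scipy import stats

  rng = np.random.default_rng(13)
  n, n_junk = 100, 40
  outcome = rng.normal(size=n)                  # pure noise, unrelated to anything

  false_positives = 0
  for _ in range(n_junk):
      junk = rng.normal(size=n)                 # an unrelated "explanatory" variable
      _, p_value = stats.pearsonr(junk, outcome)
      if p_value < 0.05:
          false_positives += 1

  print(false_positives, "of", n_junk, "junk variables look statistically significant")
  # on average about 5 percent of them do, purely by chance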

[...] designing a good regression equation – figuring out what variables should be examined and where the data should come from – is more important than the underlying statistical calculations. This process is referred to as estimating the equation, or specifying a good regression equation.

[...] like most other statistical inference, regression analysis builds only a circumstantial case. An association between two variables is like a fingerprint at the scene of the crime. It points us in the right direction, but it's rarely enough to convict. [...] Any regression analysis needs a theoretical underpinning: Why are the explanatory variables in the equation? What phenomena from other disciplines can explain the observed results?

Program Evaluation: Will going to Harvard change your life?

[...] program evaluation [...] is the process by which we seek to measure the causal effect of some intervention – anything from a new cancer drug to a job placement program for high school dropouts. [...] The intervention that we care about is typically called the "treatment", though that word is used more expansively in a statistical context than in normal parlance. A treatment can be a literal treatment, as in some kind of medical intervention, or it can be something like attending college or receiving job training upon release from prison. The point is that we are seeking to isolate the effect of that single factor; ideally we would like to know how the group receiving that treatment fares compared with some other group whose members are identical in all other respects but for the treatment.

The most straightforward way to create a treatment and control group is to [...] create a treatment and control group. There are two big challenges to this approach. First, there are many kinds of experiments that we cannot perform on people. This constraint [...] is not going away anytime soon. As a result, we can do controlled experiments on human subjects only when there is reason to believe that the treatment effect has a potentially positive outcome. [...] Second, there is a lot more variation among people than among laboratory rats. The treatment effect that we are testing could easily be confounded by other variations in the treatment and control groups [...].

The optimal way to create any treatment and control group is to distribute the study participants randomly across the two groups. The beauty of randomization is that it will generally distribute the non-treatment-related variables more or less evenly between the two groups – both the characteristics that are obvious, such as sex, race, age, and education, and the nonobservable characteristics that might otherwise mess up the results.

Not everybody has millions of dollars lying around to create a large, randomized trial. A more economical alternative is to exploit a natural experiment, which happens when random circumstances somehow create something approximating a randomized, controlled experiment.

Sometimes the best available option for studying a treatment effect is to create nonrandomized treatment and control groups. Our hope/expectation is that the two groups are broadly similar even though circumstances have not allowed us the statistical luxury of randomizing. The good news is that we have a treatment and a control group. The bad news is that any nonrandom assignment creates at least the potential for bias. There may be unobserved differences between the treatment and control groups related to how participants are assigned to one group or the other. Hence the name "nonequivalent control".

The challenge with any "before and after" kind of analysis is that just because one thing follows another does not mean that there is a causal relationship between the two.

A "difference in differences" approach can help us identify the effects of some intervention by doing two things. First, we examine the "before" and "after" data for whatever group or jurisdiction has received the treatment, such as the unemployment figures for a county that has implemented a job training program. Second, we compare those data with the unemployment figures over the same time period for a similar county that did not implement any such program. The important assumption is that the two groups used for the analysis are largely comparable except for the treatment; as a result, any significant difference in outcomes between the two groups can be attributed to the program or policy being evaluated.

One way to create a treatment and control group is to compare the outcomes for some group that barely qualified for an intervention or treatment with the outcomes for a group that just missed the cutoff for eligibility and did not receive the treatment. Those individuals who fall just above and just below some arbitrary cutoff, such as an exam score or a minimum household income, will be nearly identical in many important aspects; the fact that one group received the treatment and the other didn't is essentially arbitrary. As a result, we can compare their outcomes in ways that provide meaningful results about the effectiveness of the relevant intervention.

To understand the true impact of a treatment, we need to know the "counterfactual", which is what would have happened in the absence of that treatment or intervention. Often the counterfactual is difficult or impossible to observe.

The purpose of any program evaluation is to provide some kind of counterfactual against which a treatment or intervention can be measured. In the case of a randomized, controlled experiment, the control group is the counterfactual. In cases where a controlled experiment is impractical or immoral, we need to find some other way of approximating the counterfactual.

Conclusion: Five questions that statistics can help answer

Statistics is more important than ever before because we have more meaningful opportunities to make use of data. Yet the formulas will not tell us which uses of data are appropriate and which are not. Math cannot supplant judgment.