Making Mistakes with Mathematical Certainty
Written by Aaron Douglas Thursday, 06 May 2010 12:05
"It is easy to lie with statistics, but easier to lie without them." - Frederick Mosteller*
Data allows us to observe the world in a way that helps us understand and predict what is happening around us. Statistics allows us to describe how the world is behaving, and can even tell us how to influence the world to behave the way we want. The modern age is defined by statistics - air traffic control, GPS, spam filters, Google, intensive care units - all rely on statistical data and inference. Data is evidence, but without statistics data is anecdotal at best. Statistics is a misunderstood, often maligned technology that, despite being responsible for most major scientific discoveries in the last century, is also the source of some of the worst decisions people have made. You may be interested to know that research has shown that over 95% of statistics are completely fictitious.**
Statistical methods can often be an (unintentional) source of misleading information. Researchers looking to measure trends in the economy may perform a cross-sectional analysis of "same-store growth", in which each store's annual financial performance is compared to the previous year's. The aggregate of this data (hundreds of stores, often in a variety of markets) is often used as an indicator of the broader economic climate. One would assume that, given a sufficiently large sample, an upward trend in the data would indicate an upward trend in the economy. However, if the methodology were examined more closely, we would see that only the stores that stayed open for the full course of the study (two years) remained in the data set. This means that all the stores that closed during that time due to economic hardship are not represented in the sample. If the stores that went out of business far outnumbered the stores that remained solvent, how meaningful are the results?
In addition to being complicated, statistics can often be counter-intuitive. The Prosecutor's Fallacy is a classic example from criminal law: it confuses the probability that an individual matches a description with the probability that an individual who does match the description is guilty.
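Both pitfalls above come down to simple arithmetic, and a rough sketch can make them concrete. The store revenues, the match rate, and the population size below are invented purely for illustration, and the function names are hypothetical; the guilt calculation also assumes exactly one guilty person who is certain to match and that everyone in the population is equally likely to be the culprit beforehand.

```python
from typing import List, Optional, Tuple

# (revenue last year, revenue this year); None means the store closed.
# These figures are made up for illustration only.
Store = Tuple[float, Optional[float]]

STORES: List[Store] = [
    (100.0, 110.0),
    (200.0, 210.0),
    (150.0, 160.0),
    (120.0, None),
    (180.0, None),
    (90.0, None),
]


def same_store_growth(stores: List[Store]) -> float:
    """Growth measured only over stores that stayed open (the biased view)."""
    survivors = [(prev, cur) for prev, cur in stores if cur is not None]
    prev_total = sum(prev for prev, _ in survivors)
    cur_total = sum(cur for _, cur in survivors)
    return (cur_total - prev_total) / prev_total


def all_store_growth(stores: List[Store]) -> float:
    """Growth over every store, counting closed stores as zero revenue."""
    prev_total = sum(prev for prev, _ in stores)
    cur_total = sum(cur if cur is not None else 0.0 for _, cur in stores)
    return (cur_total - prev_total) / prev_total


def prob_guilty_given_match(match_rate: float, population: int) -> float:
    """P(guilty | matches description), assuming one guilty person who
    always matches and innocents who match independently at match_rate."""
    expected_innocent_matches = match_rate * (population - 1)
    return 1.0 / (1.0 + expected_innocent_matches)


if __name__ == "__main__":
    print(f"Same-store growth (survivors only): {same_store_growth(STORES):.1%}")
    print(f"Growth across all stores:           {all_store_growth(STORES):.1%}")
    # A 1-in-10,000 match sounds damning, but in a city of a million
    # people roughly a hundred innocent residents also match.
    print(f"P(guilty | match): {prob_guilty_given_match(1e-4, 1_000_000):.2%}")
```

With these made-up numbers, the surviving stores show growth of about 6.7% while revenue across all stores falls by about 42.9%, and a match that occurs by chance in only 1 of 10,000 people implies guilt with a probability of only about 1%.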
When a particular measure is used as an indicator of the performance of a system, people may choose to target that measure, improving its value at the cost of other aspects of the system. The chosen measure then improves disproportionately and becomes useless as a measure of the system's performance***. A past example from the search engine optimization industry is "keyword stuffing". Keyword stuffing (increasing the frequency and variations of a word or phrase in an effort to artificially inflate the page's relevance to search engines) could historically improve the search engine ranking of a particular page (a primary metric of SEO), but it detracted from the searcher's experience, resulting in a higher "bounce rate" (abandoning a site after viewing only a single page) and fewer return visits (i.e. repeat customers).
Modern statistics is more than the calculation of data sets (the computers do most of that); it is investigation, using a wide variety of tools to capture and conceptualize the world. Medicine, government policy, agriculture, engineering, pharmaceuticals, marketing - statistics underlies nearly every aspect of our lives. We can choose to harness the available tools, using them to shape our present and future through the power of statistics, or... we can just be a statistic.
*Founder, Harvard University's statistics department
**also, 99% of statisticians are facetious
***David J. Hand, "Statistics: A Brief Insight"
