Data Distortion: What is it? And how is it misleading?

Have you ever seen a graph that seemed too good to be true? Or heard a percent-based fact that felt too high or too low? You’re not the only one. The reality is that unbelievable graphs and statistics are often correct, but warped. Their creators use a series of techniques to represent the data in a way that favors them. We call this process data distortion.

In a sentence, data distortion is the intentional or unintentional misrepresentation of a dataset through the use of cherry picking, statistical over-engineering, or graphical complexity and proportional warping.

Cherry picking selects unrepresentative portions of a full data set. Statistical over-engineering misuses basic techniques (e.g., mean, median, mode) to convey a biased message. Graphical complexity and proportional warping play with chart elements, such as the axis ranges, to bend the viewer’s first impression.

In the academic world, the term “data distortion” most commonly refers to statistical misuses. The range of statistical misuses is vast, but most of them don’t apply to business. In this article, we’ll focus on what’s most common in business (and to some degree, in popular culture).

Definition

The word “distortion” refers to changing any object or idea from its original, or natural, state. A melted crayon or your reflection in a broken mirror (similar to the cover photo) are examples of distortion. In the case of data, some claim that any change to an original dataset is distortion — even when done without bias.

I disagree with this claim. Data is, on the one hand, the unique IDs and attributes found in tabular form, but it is equally the relationships between, and holistic interpretation of, those IDs and attributes.

Data distortion, therefore, is the misrepresentation of IDs and attributes, or of the relationships between them and their holistic interpretation. It happens through cherry picking (which affects the IDs and attributes themselves), statistical misuse (relationships and interpretation), and graphical warping (relationships and interpretation).

Misrepresentation of Data

Before we look at examples, a small note on the misrepresentation of data. Data distortion is a subset of a larger concept often referred to as the misrepresentation of data. While data distortion deals directly with misrepresentations in source sets and in graphs, the misrepresentation of data as a general field delves into other categories such as the oral communication of data, data shared through various mediums such as television or social media, and the impact of misrepresented data in specific social contexts such as business and personal settings.

Cherry Picking Example

Cherry picking is the use of an unrepresentative subset of data from a larger table to communicate an idea about the whole set. A classic example of unrepresentative data is the bucket of pennies vs the bucket of quarters, dimes, and nickels.

If you blindfold someone and ask them to select a coin from the bucket of pennies, they will draw a penny and assume the whole bucket is full of pennies. They are right, since pennies make up 100% of the bucket.

However, the same action with the bucket of mixed coins would not be representative. If the person draws a nickel and assumes the whole bucket is full of nickels, they would be wrong. Instead, they would need to take a handful of coins, count how many of each coin there are in their hand, and then draw a conclusion about the whole bucket.

This same principle applies to datasets. They are never like the bucket of pennies and always like the bucket of other coins. You should always try to analyze the whole set, and never just a sample of the data. However, on some occasions, you have no choice.

When you’re obligated to select a sample, you should prefer the largest one possible. In fact, the Law of Large Numbers says that, for randomly drawn samples, the larger the sample, the more closely its mean tends to match the mean of the whole.
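The effect of sample size can be sketched in a few lines of Python. This is only an illustration: the bucket composition (500 nickels, 300 dimes, 200 quarters) and the helper name `sample_mean_error` are assumptions made for the example, not figures from a real data set.

```python
import random
from statistics import mean

# Hypothetical bucket of mixed coins, values in cents (assumed composition).
bucket = [5] * 500 + [10] * 300 + [25] * 200


def sample_mean_error(population, sample_size, rng):
    """Absolute gap between a random sample's mean and the population mean."""
    sample = rng.sample(population, sample_size)  # draw without replacement
    return abs(mean(sample) - mean(population))


rng = random.Random(0)  # fixed seed so the run is repeatable
for size in (1, 10, 100, 1000):
    error = sample_mean_error(bucket, size, rng)
    print(f"sample of {size:>4} coins: off by {error:.2f} cents")
```

Individual runs vary, but the gap tends to shrink as the handful grows, and it disappears entirely once the sample is the whole bucket.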

Cherry picking occurs when an analyst — intentionally or unintentionally — selects a sample set of data that does not represent the whole. For example, imagine there’s a survey on the ages of citizens in the town of Oldville. The town’s population is 1,000, but the surveyor is only able to get a sample of 400 people to come out for a night of bingo. Of those 400, 380 are over the age of 65. The surveyor thus concludes that 95% of the total population is 65+.

However, a few years later, another surveyor gets a turnout of 400 to bingo and a turnout of 500 to an all-night dance party. The second surveyor’s data shows that only 40% of the population is aged 65+, and another 50% is younger than 27.

This is a silly, fictional example, but it drives the point home: cherry picking distorts the reality of data. In the above example, the problem involved data collection. In the world of data analysis, most cases of cherry picking are intentional table slicing to provide a biased view of a dataset.
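To see how the choice of sample drives the result, here is a rough Python sketch of the Oldville survey. It assumes, purely for illustration, that the second surveyor's 40% figure is the true share of residents aged 65+; the `residents` list and the `share_65_plus` helper are hypothetical.

```python
import random

# Hypothetical Oldville: 1,000 residents, 40% of them aged 65+ (assumption).
residents = ["65+"] * 400 + ["under 65"] * 600


def share_65_plus(sample):
    """Fraction of people in the sample who are aged 65+."""
    return sum(1 for person in sample if person == "65+") / len(sample)


# Unrepresentative sample: bingo night, where 380 of 400 attendees are 65+.
bingo_night = ["65+"] * 380 + ["under 65"] * 20

# Same sample size, but drawn at random from the whole town.
rng = random.Random(42)
random_sample = rng.sample(residents, 400)

print(f"Bingo-night estimate:   {share_65_plus(bingo_night):.0%}")
print(f"Random-sample estimate: {share_65_plus(random_sample):.0%}")
```

Both samples contain 400 people, so the gap between the two estimates comes from who was sampled, not from how many.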

Statistical Misuse Example

Statistical misuse is the act of blending, merging, or overlapping (generally, over-engineering) statistical metrics to distort the meaning of the whole dataset. A classic example of this is the expression that says, “I’m 100% right, 10% of the time.” The use of “100%” up-front makes the listener believe the speaker is always right, when in reality, he could have said “I’m right 10% of the time.”

In terms of data analysis, the most common example of statistical misuse is an average of an average, or its close cousin, a percentage of a percentage. For example, imagine you’re looking at a watch company’s revenues. You want to know what percent of the company’s revenue comes from the sale of Luxury watches because you want to convince it to buy luxury watch supplies from your company.

In basic terms, only 20% of their sales come from Luxury watches. That’s not a very convincing number, so you take another approach. The company sells to retailers (80% of revenue) and directly to customers (10% of revenue). They’re very keen on increasing direct customer sales. In this case, you might say that customers only buy luxury watches directly from the source — they don’t like retail.

You argue that “90% of direct customer sales come from Luxury watches.” In relative terms, this is true, but it does not represent the whole. A more holistic claim would be “90% of the 10% of revenue that comes from direct sales is from Luxury watches, which is 9% of total revenue.”
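The arithmetic behind that holistic claim is easy to check. The short Python sketch below uses the figures from the watch example; the function name `share_of_total` is just an illustrative choice.

```python
def share_of_total(segment_share_pct, share_within_segment_pct):
    """Convert a percentage of a segment into a percentage of the whole."""
    return segment_share_pct * share_within_segment_pct / 100


direct_sales_pct = 10          # direct sales as a share of total revenue
luxury_within_direct_pct = 90  # Luxury watches as a share of direct sales

luxury_direct_pct = share_of_total(direct_sales_pct, luxury_within_direct_pct)
print(f"Luxury watches sold directly: {luxury_direct_pct}% of total revenue")
# prints: Luxury watches sold directly: 9.0% of total revenue
```

Whenever a percentage is quoted for a segment, multiplying it by the segment's own share of the whole gives the figure that can be compared with other whole-population numbers.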

In this example, I’ve provided the full context. But most of the time when statistical misuses come into play, we don’t have the full context. Here are a few examples you may have heard before:

“The average New Yorker in the state earns 10% less than the average Texan.” Here, “New Yorkers in the state” excludes those working in New York City, so this claim is not representative.

“President Jane Doe improved rural literacy rates by 80% in Brevard County.” Only 5% of Brevard County’s population is rural, so the improvement concerns just 5% of the county’s residents, not the county as a whole.

“December revenues have increased 80% since October.” The company provides Christmas light decorations, so an off-season monthly reference does not provide insights about the whole year.

Graphical Warping Example

Graphical warping is the use of complexity or warped proportions in a data visualization to distort the meaning of the underlying data. A classic example is the use of truncated or stretched axes to exaggerate a change. See the chart below:

Good news: America’s high school graduation rate has increased to an all-time high.🎓 https://t.co/Ih564hAo2u pic.twitter.com/C4h5JdIvwQ
— White House Archived (@ObamaWhiteHouse) December 16, 2015

You might think that this shows a huge improvement, but think for a second. The growth is from 75% in 2007 to 82% in 2013. That’s an increase of only 7 percentage points in 6 years, or about 1 point per year. It’s less flattering from this point of view. Another way of showing the same data is the following graphic:

As you can see, when we look at the whole spectrum, from 0% to 100%, the effect is quite different. In addition to the warped axis in the Twitter chart, we also see the stacked books from an angle. This is common for visualization warping in social media. Because the books seem to approach the viewer as the stacks grow over time, we have the impression that growth is greater than the reality.
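A rough way to quantify the axis effect is to compare how tall the 2013 bar looks relative to the 2007 bar under different axis minimums. In the Python sketch below, the 75% and 82% rates come from the example above, while the axis minimum of 70 is a hypothetical value chosen for illustration, not a measurement of the tweeted chart.

```python
def apparent_ratio(start_value, end_value, axis_min=0.0):
    """Ratio of the drawn bar heights when the axis starts at axis_min."""
    return (end_value - axis_min) / (start_value - axis_min)


rate_2007, rate_2013 = 75.0, 82.0  # graduation rates quoted in the example

point_change = rate_2013 - rate_2007              # change in percentage points
relative_change = point_change / rate_2007 * 100  # change relative to 2007

full_axis = apparent_ratio(rate_2007, rate_2013, axis_min=0.0)
cut_axis = apparent_ratio(rate_2007, rate_2013, axis_min=70.0)  # hypothetical

print(f"Change: {point_change:.0f} points ({relative_change:.1f}% relative)")
print(f"Axis from 0:  2013 bar looks {full_axis:.2f}x the 2007 bar")
print(f"Axis from 70: 2013 bar looks {cut_axis:.2f}x the 2007 bar")
```

With the axis starting at 0, the two bars differ by less than 10% in height; with the assumed cut at 70, the later bar is drawn more than twice as tall, even though the data is identical.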

The reality of the data is distorted because the graphical proportions are warped.

Data Distortion vs. Data Distraction

Now that we’ve seen examples of cherry picking, statistical misuse, and graphical warping, it’s important to distinguish two easily confused terms: data distortion and data distraction.

Data distraction is the idea that an over-emphasis on data deters viewers from thinking about the real substance behind it: the challenge or opportunity available to them. For example, a company that focuses on the complete data set it has been given might miss the opportunity to home in on the few pieces of information pertinent to its success.

For example, instead of analyzing user behaviors or any number of other metrics, the watch company should focus on locating customers with the money to buy Luxury watches.

Data distraction differs from data distortion in that distortion limits the sample behind one metric, whereas the remedy for distraction is simply to eliminate superfluous metrics.

About the Author

Noah

Noah is the founder & Editor-in-Chief at AnalystAnswers. He is a transatlantic professional and entrepreneur with 5+ years of corporate finance and data analytics experience, as well as 3+ years in consumer financial products and business software. He started AnalystAnswers to provide aspiring professionals with accessible explanations of otherwise dense finance and data concepts. Noah believes everyone can benefit from an analytical mindset in a growing digital world. When he's not busy at work, Noah likes to explore new European cities, exercise, and spend time with friends and family.