Data fishing is a real threat to the reputation and integrity of data analysis, which explains why analysts get frustrated when they see fishing used to persuade audiences who don’t understand what it is.
To understand why data fishing is so frustrating for analysts, just take a look at the history of analytic decision making:
Before there was data, decisions were made on intuition and hearsay, which often led to poor outcomes. When data arrived on the scene, decisions could be made with fuller knowledge of the available choices and their consequences. In other words, good data leads to better decisions.
Data fishing (or data dredging) poses an existential threat to the integrity of data decision making because it ignores the principle of representative samples and twists data to the will of the analyst, rather than the analyst to the will of the data.
Definition
In a sentence, data fishing (aka data dredging) is defined as one of two cases:
1. the misuse of data mining techniques to falsely suggest trends and correlations within a sub-dataset that do not appear in the full dataset, or
2. applying multiple statistical significance tests (e.g., t-tests) to a number of variables within a dataset to establish correlations that may actually occur purely by chance (a quick simulation of this case follows the lists below).
Another way of framing the two kinds of data fishing is:
- Data Fishing by Unrepresentative Sampling
- Data Fishing by Unscrutinized Statistical Significance Testing
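To see why the second case is so easy to fall into, here is a minimal R sketch using purely simulated data (every number is made up for illustration). It tests 20 noise variables against an unrelated outcome and counts how many look “significant” at the 5% level purely by chance.

```r
# Minimal sketch (simulated data): run many significance tests on pure noise
# and see how often we get "significant" results purely by chance.
set.seed(42)

n_vars  <- 20    # number of unrelated variables we fish through
n_obs   <- 100   # observations per variable
outcome <- rnorm(n_obs)

# Test each random variable against the outcome and collect the p-values
p_values <- sapply(1:n_vars, function(i) {
  x <- rnorm(n_obs)               # noise, unrelated to the outcome
  cor.test(x, outcome)$p.value
})

# With 20 tests at a 5% threshold, we expect roughly one false "discovery"
sum(p_values < 0.05)
```

The more variables you test, the more of these chance “discoveries” you will reel in, even when nothing real is there.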
To give a little background, a “dredge” in the construction world refers to a crane-like machine used to retrieve objects from a body of water. Just as a dredge operator scrapes the bottom of a river to retrieve objects, so do analysts scrape parts of datasets to retrieve what they want to see with data fishing. The dredge machine operator doesn’t see everything under the water, and the analyst doesn’t see everything about the dataset.
Examples of Data Fishing
Since data fishing shows up in two ways, let’s look at an example of each. The first concerns unrepresentative sampling, and the second concerns unscrutinized statistical significance testing.
1. Data Fishing by Unrepresentative Sampling
Imagine you work for a political campaign, and you want to show how well your candidate did during her last term. The campaign asks you to look for ways the candidate performed well in terms of social metrics. How did literacy rates, poverty rates, medical debt rates, and hunger rates change over her term?
Obviously, you want to show some positive information here — but you can’t, of course, lie. So you go to a few different sources to collect data on the relevant metrics and show how they changed over the 8 years your candidate was in office. Imagine our sample data set looks like the following:
It’s hard to see the trends in the raw data, so you decide to visualize it with the following line graph:
To put this into words, it looks like during her 8-year tenure the candidate slightly improved literacy rates and slightly decreased the poverty rate. At the same time, the average medical debt rate and the hunger rate both rose slightly. In other words, she really didn’t do much, and none of this will help her campaign.
Under pressure from your boss, you decide to look at some of the better metrics. It looks like from late 2017 through 2018, literacy rates increased significantly. But that’s a bit outdated. In fact, every metric except the poverty rate looks negative in the most recent data, so you decide to focus on the poverty rate.
You isolate the last two quarters of 2020, where the drop in poverty is the only story, and you show only this graph:
We clearly see how this graph uses selective sampling to communicate a positive message that does not reflect the truth of the full dataset.
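As a rough illustration of the mechanics, here is a minimal R sketch. The quarterly poverty figures below are entirely made up (not the campaign’s actual numbers); the point is only the difference between plotting the full series and plotting a cherry-picked window.

```r
# Minimal sketch with made-up quarterly data: full series vs. cherry-picked window.
set.seed(3)
poverty <- data.frame(
  quarter = seq(as.Date("2013-01-01"), as.Date("2020-10-01"), by = "quarter"),
  rate    = 12 + cumsum(rnorm(32, mean = 0, sd = 0.2))   # 8 years of quarters
)

# Full-tenure view: the honest picture
plot(poverty$quarter, poverty$rate, type = "l",
     xlab = "Quarter", ylab = "Poverty rate (%)")

# Fished view: only the last two quarters of 2020
subset_2020 <- poverty[poverty$quarter >= as.Date("2020-07-01"), ]
plot(subset_2020$quarter, subset_2020$rate, type = "l",
     xlab = "Quarter", ylab = "Poverty rate (%)")
```

The code is identical except for one filter, which is exactly what makes this kind of fishing so easy to do and so hard to spot from the final chart alone.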
2. Data Fishing by Unscrutinized Statistical Significance Testing
Imagine you’re working for a company whose managers want to learn more about the impact of market metrics on the performance of the company. They ask you to set out to identify any possible correlations between a set of market indicators and company revenue.
Right away, you realize the challenge of such a large request. How will you go about tackling so many different correlations? You realize there’s no way to deliver on time with a truly thorough, quality study, so you decide to use the R programming language to rapidly run single-variable and multivariable regressions between a set of market indicators and company revenue.
At the same time, you know you have to check for collinearity between the independent variables in all of your multivariable regressions. Thank goodness you have R, or that would have taken a long time. You set out a plan to test the following indicators against your revenue, both independently and as a collection (a rough sketch of the workflow follows the list):
- GDP
- GNI
- Disposable Income
- Employment Rates
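Here is a minimal sketch of how that workflow might look in R. It assumes a hypothetical data frame `df` with columns `revenue`, `gdp`, `gni`, `disposable_income`, and `employment_rate`; the column names and the collinearity check are illustrative, not a prescribed method.

```r
# Hypothetical indicator columns in a data frame `df` alongside `revenue`
indicators <- c("gdp", "gni", "disposable_income", "employment_rate")

# Single-variable regressions: one model per indicator
single_models <- lapply(indicators, function(ind) {
  lm(reformulate(ind, response = "revenue"), data = df)
})

# Multivariable regression: all indicators together
full_model <- lm(revenue ~ gdp + gni + disposable_income + employment_rate,
                 data = df)

# Rough collinearity check: pairwise correlations between the indicators.
# (In practice you might also compute variance inflation factors, e.g. with
# car::vif(full_model), if the car package is available.)
cor(df[, indicators])
```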
Here’s where the risk of data fishing comes into play. Let’s imagine you run 4 single-variable regressions, all of which show a p-value greater than 10% against your revenues. A p-value above the conventional 5% threshold is generally read as a lack of statistically significant evidence for a relationship between the two variables.
However, when you run combinations of those independent variables against revenue in multivariable regressions, you’re able to get GDP and employment rate correlations with p-values below 5%, which is conventionally treated as statistically significant.
This is strange. Given that none of the variables alone has a statistically significant correlation to revenue, how can GDP and Employment rate have a strong relationship when paired with other variables? In short, the answer is collinearity — when two of the independent variables in a multivariable regression are strongly correlated, they can warp each other’s coefficients and p-values.
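To make the collinearity point concrete, here is a small simulated R example (the data are generated on the spot and are not the scenario above): two nearly identical predictors behave sensibly on their own, but once they enter the same model their standard errors balloon and their coefficients and p-values become unstable.

```r
# Minimal sketch (simulated data) of collinearity warping coefficients and p-values
set.seed(1)
n  <- 200
x1 <- rnorm(n)
x2 <- x1 + rnorm(n, sd = 0.05)   # x2 is almost a copy of x1
y  <- 0.3 * x1 + rnorm(n)

# Alone, x1's effect is estimated cleanly
summary(lm(y ~ x1))$coefficients

# Together, the two collinear predictors split the same signal: standard errors
# inflate and the individual coefficients and p-values become unreliable
summary(lm(y ~ x1 + x2))$coefficients
```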
Since you’re in a bind, you decide to claim that GDP and employment rate have a strong relationship to revenue, and that you’re going to explore this further. You have just data fished. You had conflicting evidence about the statistical significance of two correlations, but since one side of the evidence supports a claim you would like to make, you decide to accept it. An honest analyst would not make this claim, but would instead report that the evidence does not support a reliable relationship.
This could have gone another way…
BUT wait! This could have gone another way. Imagine that instead of choosing only four macroeconomic indicators, you chose 100. You download a database from a government website and decide to run a series of automated regressions. Of the 100, two show statistical significance: interest rates on boat loans in south Louisiana and household debt in north Ohio.
You scratch your head. These relationships seem very strange. Your company has no business in Ohio or Louisiana, and there is no reasonable explanation for why these correlations exist, even though the p-values are less than 5%. You would like to show progress, so you decide to include them in your report to management. You have just fished for data.
Because you did not set out to test these two variables and their correlation to company revenue, you did not believe there was a real correlation. Even after the fact, you realize there is not one. It’s just a numerical coincidence that the two are closely related. This is data fishing: you hook onto something and want to tell the story of catching the fish, even if it turns out to be an old tire.
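A quick simulated R sketch shows why a couple of “hits” out of 100 are exactly what chance predicts. The “revenue” series and the indicators below are random numbers, not real data.

```r
# Minimal sketch (simulated data): regress a "revenue" series on 100 indicators
# that are pure noise, and count how many clear the 5% significance bar anyway.
set.seed(7)
n_obs    <- 40                       # e.g. 10 years of quarterly revenue
revenue  <- rnorm(n_obs)
p_values <- replicate(100, {
  indicator <- rnorm(n_obs)          # unrelated to revenue by construction
  coef(summary(lm(revenue ~ indicator)))["indicator", "Pr(>|t|)"]
})

# On average about 5 of the 100 noise indicators will look "significant"
sum(p_values < 0.05)
```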
Here’s a funny example of another data fishing correlation:
Data Fishing: Not Always Intentional
I know I talk down on analysts who use data fishing, but there’s a big difference between those who do it accidentally and those who do it intentionally.
Sometimes it’s easy to identify and avoid data fishing, and sometimes it’s not. I remember the first time I made this mistake. I was working as a market intelligence financial analyst, and one of my big tasks was to identify correlations between company spending and macroeconomic indicators.
I made every mistake in the book, but I did not realize it at the time. While we should be critical of data fishing, and should fight it at all costs when it falsely guides our opinions, we must also realize that it is human nature.
This raises the question: how can we avoid data dredging?
How to Avoid Data Dredging
My personal advice to analysts is to trust your gut. Most of us are familiar with the scientific principle of burden of proof, and we know when something doesn’t smell right.
When you get that feeling, you should ask yourself the following questions:
- Does all of the evidence, both statistically and intuitively, support the claim I’m making?
- Do I want to make a claim because I’m under pressure to perform?
- Am I using a subset of data to produce a claim I want to make, even if this subset does not represent the full population?
- Am I using a statistical significance test on many different variables, in different combinations, to get a result I want?
If you answer No to question 1, or Yes to questions 2-4, then you might want to consider whether you’re fishing for data, and what that might mean for your results.
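For the last question in particular, one common safeguard (not the only one) is to adjust your raw p-values for the number of tests you ran before claiming a discovery. A minimal R sketch, using hypothetical p-values standing in for whatever your own tests produced:

```r
# Hypothetical raw p-values from a batch of significance tests
p_values <- c(0.004, 0.03, 0.20, 0.47, 0.62)

# Bonferroni is conservative; Benjamini-Hochberg ("BH") controls the false
# discovery rate and is a gentler alternative.
p.adjust(p_values, method = "bonferroni")
p.adjust(p_values, method = "BH")
```

If a result only looks significant before the adjustment, that is a strong hint you were fishing.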
Data Fishing Is Also Known As…
Synonyms for data fishing include the following:
- Data dredging
- Data snooping
- Data butchery
- P-hacking
- Data slicing
- Data angling
- Data trawling