Is Data Visualization a Data Cleaning Task?

When we’re learning about data, it’s easy to get overwhelmed with the terminology — there’s cleaning, transforming, querying, normalizing, visualizing, and more. Don’t let the jargon confuse you.

If you’ve ever played with a small dataset in Excel, then you have probably completed most of these tasks, even if you didn’t realize it. That’s because most work with data is intuitive. We give our data “intuitions” names like transforming to make it easier to talk about them, but in practice the ideas are natural.

Data cleaning and data visualization are no exceptions! Intuitively, data cleaning requires modifying and cross-checking. Since we often use visualization as a cross-check, many people wonder if data visualization is considered a cleaning task. This is understandable.

Data visualization is not a data cleaning task. However, it can be used to ensure data cleaning has been done correctly. In this way, the relationship between visualization and cleaning is similar to that of quality assurance testing and product development.

Data Visualization as a Quality Assurance Tool

Data visualization refers to the use of charts, graphs, and pictures to communicate data and ideas. While the term conjures images of design-motivated infographics, visualizations can be a simple as a line graph used to identify a trend in Excel.

Personally, I use the quick charts in Excel on a daily basis not just for presentation purposes, but also to get a quick view of what a massive amount of data is telling me. I can’t check a big database line-by-line, but I can pop the data in a graph to quickly understand if I’ve made a mistake.

In other words, data visualization is a great way to cross-check my work on a rolling basis. This is similar to how it can be used to check that data cleaning has been done correctly. But in order to understand precisely how visualization can substantiate data cleaning, you need to know the 5 steps of cleaning.

5 Steps of Data Cleaning

Data cleaning consists of:

  1. Remove duplicate value
  2. Replace incorrect values
  3. Fix structural errors
  4. Filter outliers
  5. Eliminate or substitute for missing values

The way in which visualization can be used to support data cleaning depends on which of these 5 steps we’re checking. Let’s look at each of them with short examples.

Data Viz of Removed Duplicate Data

Imagine you have a dataset on your company’s product sales. The dataset looks like this:

ProductColorQuantityPrice (USD)Customer IDDate
DynamoBlack129.99C101/02/2021
RoseRed129.99C201/05/2021
MagnetoSilver133.99C301/01/2021
OspreyGreen1C401/03/2021
SkylineBlue112.99C501/04/2021
RoseRed129.99C612/11/2020
EmpireGold199.99C701/06/2021
EmpireGold199.99C701/06/2021

The first step in data cleaning is to remove duplicate data. We can plug this data into a spreadsheet program such as Excel to get a quick look if there are duplicates. I did it by looking at customer IDs and the number of appearances.

I can see in my table that customer C7 has more entries than the others. This helps me identify that in fact there are two entries of single purchases of the same for C7. This is duplicate data since it should have been entered as one line with a Quantity of 2.

Once you remove the duplicate line, you will see that your bar chart is flat — confirming you have done the cleaning correctly. This amount of work to visualize may seem excessive for such a small dataset, and it is. But it’s extremely useful when you’re dealing with large tables.

Data Viz of Replaced Incorrect Values

Let’s say we know that our company doesn’t sell any watches for less than 20 USD. This means that is we find the base price for all of our watches sold, none should be below the 20 USD line. However, we can see in our database that the Skyline watch is priced at 12.99. This is a mistake — it should be 21.99.

Once we’ve corrected it, we can use a simple bar chart to check, as in the below picture:

Data Viz of Fixed Structural Errors

Imagine our table looked a bit different, and rather than periods to indicate the space between the 1s and 1/10th place (or the decimal) there are commas. This is a common challenge that companies working between the US and Europe face, since in many European countries like France, the convention is to use a comma.

I’ve modified the table to reflect this for the C1 and C2 price entries:

ProductColorQuantityPrice (USD)Customer IDDate
DynamoBlack129,99C101/02/2021
RoseRed129,99C201/05/2021
MagnetoSilver133.99C301/01/2021
OspreyGreen1C401/03/2021
SkylineBlue112.99C501/04/2021
RoseRed129.99C612/11/2020
EmpireGold199.99C701/06/2021
EmpireGold199.99C701/06/2021

If we try to visualize this, the prices will be recognized by Excel as TEXT data, which means they will not show up in our visualization. We need to correct them back to periods, in which case we could visualize to confirm the cleaning worked. A viz here would look like the visualization of prices shown in the previous section.

Data Viz of Filtered Outliers

Sometimes when we’re condensing a dataset based on filters, some outliers show up. In our database, this shows up in the “Date” attribute, where we have C6’s purchase outside the date range we were looking for.

We only want transactions from January 1, 2021 – January 7, 2021, but C6’s purchase was in December 2020. We can filter this out and show a simple visualization to be sure it’s removed:

We can see that no transaction took place outside of out date range once we filtered for C6.

Data Viz of Eliminated or Filled Missing Values

Now let’s say we want to see a simple revenue calculation consisting of volume * unit value. To do so, we need to clean our dataset of any missing values.

You probably noticed earlier that we’re missing the price for our Osprey watch purchased by C4. We cannot leave it empty, or it will corrupt our analysis.

ProductColorQuantityPrice (USD)Customer IDDate
DynamoBlack129.99C101/02/2021
RoseRed129.99C201/05/2021
MagnetoSilver133.99C301/01/2021
OspreyGreen1C401/03/2021
SkylineBlue112.99C501/04/2021
EmpireGold199.99C701/06/2021

There are three ways we can handle a missing data point.

  1. We remove the entire line.
  2. We replace it using known data.
  3. We substitute using dummy data.

In almost all cases, you don’t want to delete the entry because this removes any insights we could get from the correct data fields. So, we want to use real data. In this case, we know what the price is since we own the company, so let’s put it in as 24.99 USD.

(NOTE: sometimes we don’t know what value to fill in, in that case we need to use dummy data. You can learn more about how to create dummy data here.)

ProductColorQuantityPrice (USD)Customer IDDate
DynamoBlack129.99C101/02/2021
RoseRed129.99C201/05/2021
MagnetoSilver133.99C301/01/2021
OspreyGreen124.99C401/03/2021
SkylineBlue112.99C501/04/2021
EmpireGold199.99C701/06/2021

Now, if we use a simple bar graph to check for missing data, we see there is none. We have therefore correctly used a visualization to cross check our cleaning.

To complete our volume * quantity analysis, we would multiply each line’s price by the quantity, then take the sum of these numbers. In this case, it’s 240.94 USD.

Conclusion

Data visualization is not a data cleaning task, but it it useful as a cross-check throughout the cleaning process. It’s always a good idea to visualize your work as a QA tool. I do it every day.

If you liked this article, check out more free content for data, finance, and business analysts at the AnalystAnswers.com homepage.

About the Author

Noah

Noah is the founder & Editor-in-Chief at AnalystAnswers. He is a transatlantic professional and entrepreneur with 5+ years of corporate finance and data analytics experience, as well as 3+ years in consumer financial products and business software. He started AnalystAnswers to provide aspiring professionals with accessible explanations of otherwise dense finance and data concepts. Noah believes everyone can benefit from an analytical mindset in growing digital world. When he's not busy at work, Noah likes to explore new European cities, exercise, and spend time with friends and family.

LinkedIn

Scroll to Top