Data normalization is a crucial element of data analysis. It’s what allows analysts to compile and compare numbers of different magnitudes from various data sources. And yet, normalization is little understood and little used.
The reason normalization goes under-appreciated is probably linked to confusion surrounding what it actually is. There are easy normalization techniques, such as standardizing decimal places, and there are advanced normalization techniques, such as z-score normalization.
Analysts can quickly execute the easy ones in Excel or SQL, while advanced techniques require the use of statistical calculations. But don’t fret — even the “advanced” techniques are easy to conceptualize.
This article covers the six core data normalization techniques:
- Easy (manual) techniques:
  - Decimal place normalization
  - Data type normalization
  - Formatting normalization (date abbreviations, date order, & delimiters)
- Advanced (automated) techniques:
  - Z-score normalization
  - Linear normalization (or “Max-Min,” & how to normalize to 100)
  - Clipping normalization
Decimal place normalization (easy)
Decimal place normalization occurs in data tables with numerical data types. If you’ve ever played with Excel, you know how this happens. By default, Excel displays two digits after the decimal for comma-style numbers. You have to decide how many decimals you want and apply that choice consistently throughout the table.
To do so, use Home > Increase Decimal or Decrease Decimal on the ribbon.
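If you’re doing the same cleanup in code, here’s a minimal pandas sketch (the table and column names are hypothetical). Note that unlike Excel’s buttons, which only change the display, round() changes the stored values:

```python
import pandas as pd

# Hypothetical table with inconsistent decimal places
df = pd.DataFrame({
    "product": ["A", "B", "C"],
    "price": [19.5, 20.125, 18.0],
})

# Normalize every price to two decimal places (e.g., 20.125 -> 20.12)
df["price"] = df["price"].round(2)
print(df)
```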
Data type normalization (easy)
Another common normalization target is data types, and more specifically, subtypes of numerical data. When you build a data table in Excel, or in a SQL-queried database, you may find yourself looking at numerical data that’s recognized sometimes as a currency, sometimes as an accounting number, sometimes as text, sometimes as general, sometimes as a number, and sometimes as comma-style. As a list, the data type possibilities for numbers are:
- Currency
- Accounting number
- Text
- General
- Number
- Comma-style
The problem is that these subtypes of numerical data respond differently to formulas and various analytical treatments. In other words, you need them to be normalized to the same type.
In my experience, the best type to use by default is comma-style. It’s the easiest to read, and it can be labelled as a currency or accounting number in a presentation later. Moreover, it undergoes the fewest updates over time, so your Excel file stays relevant across programs and across time.
You can change non-comma-style numbers in Excel using the Home > Number Format drop-down.
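Outside Excel, the same cleanup usually means coercing every representation (currency strings, comma-style text, plain numbers) into a single numeric type. Here’s a minimal pandas sketch, assuming raw values like the hypothetical ones below:

```python
import pandas as pd

# Hypothetical column mixing currency text, comma-style text, and plain numbers
raw = pd.Series(["$1,050", "20,500", 3800, "86,346.00"])

# Strip currency symbols and thousands separators, then coerce to numeric
clean = pd.to_numeric(
    raw.astype(str).str.replace(r"[$,]", "", regex=True),
    errors="coerce",  # anything unparseable becomes NaN instead of raising
)
print(clean)
```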
Formatting normalization (easy)
A final easy normalization technique is formatting. While this is most common for strings (text, not numbers), it can happen with numbers as well. In most cases of formatting inconsistencies, the culprit is typographical emphasis such as italics.
This is an easy fix, and most digital natives are familiar with it. Just highlight your italicized, bolded, or underlined cells and navigate to Home>Bold/Italics/Underline.
Having one of these typographical emphases will not disturb your analysis. However, it can be distracting and prevent you from catching more significant inconsistencies, such as decimal places and data types.
In addition, if you choose to abbreviate months, make sure to use the first three letters (March becomes Mar, for example). As a general principle, dates should be ordered as “YYYY-MM-DD.” When in doubt, use underscores as delimiters, and avoid spaces at all costs.
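If you’re normalizing dates in code rather than by hand, a small Python sketch like this one (the input formats shown are hypothetical) parses mixed date strings and re-emits them in one canonical form:

```python
from datetime import datetime

# Hypothetical dates arriving in mixed formats
raw_dates = ["Mar 5, 2021", "05/03/2021", "2021-03-05"]
formats = ["%b %d, %Y", "%d/%m/%Y", "%Y-%m-%d"]

def normalize_date(date_str):
    """Try each known format and return the date as YYYY-MM-DD."""
    for fmt in formats:
        try:
            return datetime.strptime(date_str, fmt).strftime("%Y-%m-%d")
        except ValueError:
            continue
    raise ValueError(f"Unrecognized date format: {date_str}")

print([normalize_date(d) for d in raw_dates])
# ['2021-03-05', '2021-03-05', '2021-03-05']
```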
Z-score normalization (advanced)
We’ve talked about inconsistencies in datasets, but what about when you have numerical values in multiple dimensions with significant differences in size?
For example, if you have values ranging from 10 to 100 in one dimension and values ranging from 100 to 100,000 in another, it’s difficult to compare the relative changes of both. When we face this challenge, the solution is normalization.
Perhaps the most common type of normalization is z-scores. In simple terms, a z-score expresses each data point as the number of standard deviations it sits from the mean. The formula is the following:

z = (X - μ) / σ

where X is the data value, μ is the mean of the dataset, and σ is the standard deviation.
Let’s look at an example. Imagine you run a wholesale watch company and you would like to normalize data on the number of orders your customers place in a year and the average price of those orders. A snippet of your database looks like the following:
customer-id | #_of_orders | avg_price |
---|---|---|
c1 | 10 | 1,050 |
c2 | 51 | 20,500 |
c3 | 42 | 3,800 |
c4 | 29 | 86,346 |
c5 | 15 | 512 |
Most computer programs will calculate the z-scores for you. Let’s look at how to do this easily in Excel.
The formula you’ll use is STANDARDIZE(x, mean, standard_dev). It applies the formula shown above. As you can see, the three arguments needed are the data point (x), the mean of the data set (mean), and the standard deviation of the data set (standard_dev).
Using our dataset, you would compute each column’s mean with AVERAGE and its standard deviation with STDEV.S, then pass those into STANDARDIZE for each value.
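If you’d rather script the same calculation, here’s a minimal Python sketch using the sample table above (it uses the sample standard deviation, matching Excel’s STDEV.S):

```python
from statistics import mean, stdev

# The #_of_orders and avg_price columns from the sample table
orders = [10, 51, 42, 29, 15]
prices = [1050, 20500, 3800, 86346, 512]

def z_scores(values):
    """Standardize each value: subtract the mean, divide by the standard deviation."""
    mu, sigma = mean(values), stdev(values)
    return [round((x - mu) / sigma, 2) for x in values]

print(z_scores(orders))  # [-1.12, 1.24, 0.72, -0.02, -0.83]
print(z_scores(prices))
```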
As you can see, now all of our numbers fall roughly within the -2 to +2 range. This is because we subtract each column’s mean and divide by its standard deviation; if you remember from statistics 101, about 99.7% of normally distributed data falls within three standard deviations of the mean, which makes the standard deviation a natural yardstick for the denominator.
Once standardized, it’s much easier to work with these numbers.
Linear normalization, or “Max-Min” & how to normalize to 100 (advanced)
Linear normalization is arguably the easiest and most flexible normalization technique. In layman’s terms, it consists of establishing a new “base” of reference for each data point. Often called “max-min” normalization, this technique lets analysts take the difference between the maximum and minimum x values in the set and use it as the base.
This is a good starting strategy, and in reality, analysts can normalize data points to any base once they have completed linear normalization. Here’s the formula for linear normalization:

x′ = (x - min) / (max - min)
Normalizing to a specific base, such as 100, is then a question of simple arithmetic. For example, imagine you’ve got an x value of 20, a max of 55, and a min of 5. Start by calculating the base: max - min = 55 - 5 = 50. Now modify the numerator with the same idea: x - min = 20 - 5 = 15. So our normalized x, or x′, is 15/50 = 0.3.
Of course, if we want to normalize to 100, we just multiply the numerator and denominator by whatever gets the denominator to 100. In this case, that’s 2: we multiply 50 × 2 to get 100 and 15 × 2 to get 30, and the result is still 30/100 = 0.3. Do this to all numbers in the dataset to get a base-100 normalization.
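Here’s the same arithmetic as a short Python sketch, with an optional base argument for expressing results on a 0-to-100 scale:

```python
def min_max_normalize(values, base=1):
    """Rescale each value to base * (x - min) / (max - min)."""
    lo, hi = min(values), max(values)
    return [base * (x - lo) / (hi - lo) for x in values]

data = [5, 20, 55]  # min = 5, x = 20, max = 55, as in the example above
print(min_max_normalize(data))            # [0.0, 0.3, 1.0]
print(min_max_normalize(data, base=100))  # [0.0, 30.0, 100.0]; 30 out of 100 is still 0.3
```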
Clipping normalization (advanced)
Clipping is not exactly a normalization technique, but it’s a tool analysts use before or after applying normalization techniques. In short, clipping consists of establishing maximum and minimum values for the dataset and reassigning outliers to those new maximums or minimums.
Imagine you have a dataset consisting of the numbers [14, 12, 19, 11, 15, 17, 18, 95]. As you can see, the number 95 is a big outlier. We can clip it out of the data by reassigning a new high. Since the range without 95 is 11–19, you could reassign 95 a value of 19.
It’s important to note that clipping does NOT remove points from a dataset; it REASSIGNS them. A quick check to make sure you’ve done it right is to confirm that the data population N is the same before and after clipping, but that no outliers remain.
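Here’s a minimal Python sketch of clipping on the dataset above; note the final check that N is unchanged:

```python
def clip(values, lo, hi):
    """Reassign values below lo to lo and values above hi to hi."""
    return [min(max(x, lo), hi) for x in values]

data = [14, 12, 19, 11, 15, 17, 18, 95]
clipped = clip(data, lo=11, hi=19)
print(clipped)  # [14, 12, 19, 11, 15, 17, 18, 19]
assert len(clipped) == len(data)  # clipping reassigns points; it never removes them
```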
Best Data Normalization Techniques
In my opinion, the best normalization technique is linear normalization (max – min). It’s by far the easiest, most flexible, and most intuitive.
This means you can execute it quickly on the job as part of a bigger analysis, you can mold it to the particular needs of any project, and you can easily explain it to other data analysts and non-analysts alike.
That said, decimal place, data type, and formatting normalizations are of the utmost importance. While you may pick and choose among the advanced techniques, you always need to normalize data with the easy ones.
Normalization Techniques in Database Management Systems (DBMS)
When we speak of normalization techniques in database management systems, we refer to all of the techniques mentioned in this article. As a quick reference, here they are again:
- Decimal place normalization
- Data type normalization
- Formatting normalization (date abbreviations, date order, & delimiters)
- Z-Score normalization
- Linear normalization (or “Max-Min,” & how to normalize to 100)
- Clipping normalization
In this article, the examples I’ve shown are in Excel. That said, Excel is not a good database management tool. It’s more of a data wrangling tool.
Eventually, what you really want is a database management system like IBM Db2, SAP HANA, or Microsoft SQL Server to store the data, and SQL to query, modify, and normalize points.
However, because SQL can cause permanent changes to the database, you should practice many times before executing a normalization technique. Moreover, for wrangling purposes, Excel may even work better.
Data Normalization vs Data Standardization
You may hear the words normalization and standardization used interchangeably. In reality, the difference between them is subtle but important.
In short, normalization refers to modifying a dataset so values across variables can be easily compared. Standardization, on the other hand, is the precise step used on each value in the dataset to implement the common standard.
For example, subtracting the min and dividing by max - min is the standardization applied to each value, whereas normalization describes the resulting dataset as a whole after that standardization.