Clusters are everywhere. In school, students are placed in different grades and classes. In business, employees belong to different departments. How do we decide who goes where? Shared characteristics such as age, subject matter, or skill tell us who belongs together.
In the same way, data analysts cluster data based on similarities among data points. Though data clustering is more complex than “clustering” students or employees, the goal is the same. Data clusters show which data points are closely related so we can structure, analyze, and understand the dataset better.
But what exactly are data clusters? And how do we create them? This article defines data clusters, provides examples, and explains how we make them.
Note: data clusters are a slightly advanced topic. If this article is challenging, I recommend reading our free Intro to Data Analysis eBook first to cover the basics.
Data Cluster Definition
Written formally, a data cluster is a subpopulation of a larger dataset in which each data point is closer to its own cluster center than to the center of any other cluster in the dataset — a closeness determined by iteratively minimizing squared distances in a process called cluster analysis.
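If it helps to see that definition concretely, here is a minimal sketch in Python. The centers and the data point are made-up values for illustration only: a point belongs to a cluster when its squared distance to that cluster's center is smaller than its squared distance to every other center.

```python
# A minimal sketch of the definition, with made-up 2D values (age, weight).
def squared_distance(point, center):
    return sum((p - c) ** 2 for p, c in zip(point, center))

def nearest_center(point, centers):
    # The point "belongs" to whichever center it is closest to.
    return min(range(len(centers)), key=lambda i: squared_distance(point, centers[i]))

centers = [(2.0, 30.0), (5.0, 80.0)]   # hypothetical cluster centers
point = (4.5, 75.0)                    # one hypothetical observation
print(nearest_center(point, centers))  # -> 1, the second center
```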
In addition to the above definition, it’s imperative to keep in mind the following truths about data clusters:
- Data clusters can be complex or simple. A complicated example is a multidimensional group of observations based on a number of continuous or binary variables, or a combination of both. A simple example is a two-dimensional group based on visual closeness between points on a graph. The number of dimensions determines the complexity of the cluster, and thereby the type of cluster analysis needed to determine it.
- Data clusters in a single dataset can vary depending on the type of cluster analysis used to calculate them. The most common type of data cluster is a k-means cluster, which is created by minimizing the Euclidean distance between a cluster center (itself produced by the iterative analysis) and the points in the cluster. If you use a different kind of analysis, the clusters will look different. We’ll look at the different types of analysis below, so don’t worry; it’s easier to understand with an example.
- Data clusters will change based on the number of calculation iterations. We use computers to calculate the minimum distance between points and cluster centers, and the number of iterations we have the computer run determines how well those distances are minimized. It’s also rare that two runs produce exactly the same result.
- Data clusters in a two-dimensional space appear obvious, so it may seem like the statistical analysis used to obtain them is overkill. However, this is a perception trap. While you can “get away with” visual clusters in simple analyses, you can’t with complicated clusters. Data clusters in multidimensional spaces (i.e., those with 4 or more dimensions) are nearly impossible for the human mind to conceptualize, so you need to use statistics (see the short sketch after this list: the distance math doesn’t change, only the number of dimensions does). Don’t worry, it’s not hard with Excel. Once you wrap your head around the concept of minimizing distances, you’ll get it. And I’ll show you how it works in Excel so it’s easy to understand.
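To make that last point concrete, here is a tiny Python illustration with made-up coordinate values: the same one-line distance calculation works whether a point has two dimensions or six, even though only the first case can be drawn on a graph.

```python
from math import dist  # Euclidean distance between two points, any number of dimensions

# Two dimensions (age, weight): easy to plot and judge by eye.
print(dist((4.5, 75.0), (5.0, 80.0)))

# Six dimensions: impossible to draw, but the exact same calculation still works.
print(dist((4.5, 75.0, 1.2, 0.0, 3.3, 9.1),
           (5.0, 80.0, 1.0, 1.0, 3.0, 8.0)))
```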
Data “Clusters” in SQL Databases
It’s important to note that the term “cluster” can also refer to data that are stored close together in a dataset. For example, SQL analysts may refer to a row in a dataset as a data cluster because it groups related data points.
Database engineers often group multiple datasets together for ease of access, and they refer to these as data clusters as well. If you’ve ever seen a data model, you can get a good idea of why data engineers call these clusters. Here’s an example data model:
Data Table A could be considered a cluster. Moreover, each of the data models and the database as a whole could also be considered “clusters.”
In most cases, a “cluster” refers to data points whose values are close together, but you should always keep in mind that professionals in various fields apply the word to their jobs in special ways — such as a database analyst who refers to a row as a cluster.
Data Clusters Example
Imagine you have a pig farm with 15 pigs. You want to cluster them together based on age and weight. This means you want to minimize the distance between each pig and a cluster center. Let’s imagine you have a graph of the pigs’ age and weight that looks like this:
You can already see how some of the points fit closely together. Visually, we can group some of them together, thereby creating data clusters. Here’s the same data with circles around the clusters.
It’s as simple as that. Each circle is an example of a data cluster. Keep in mind, however, that this is a relatively easy example for the following reasons:
- For starters, we only have two dimensions. More complex analysis may have a larger number of dimensions.
- In addition, the number of observations (pigs in our case) is small, making it easy to conceptualize the results. If we had 10,000 pigs, it would be much harder to determine data clusters visually.
- Finally, we’re starting and stopping with 4 data clusters. In a more thorough analysis, we would need to evaluate the optimal number of clusters. The ultimate goal is to minimize the distance between points and cluster centers, and there’s always an optimal number of data clusters to do so.
So data clusters are pretty easy, right? We’ve created them visually, and it looks clear. However, we can’t prove these clusters are right, and if our example were more complex, we wouldn’t be able to do it visually at all. So let’s talk about formal cluster analysis now.
Cluster Analysis: How to Create Data Clusters
To really understand data clusters, we need to know how they’re created: through cluster analysis. Cluster analysis is the process of creating data clusters by minimizing the distance between data points and a reference.
There are several types of cluster analysis:
- Density clustering. Data clusters are determined by how densely packed the points are: regions where many points sit close together (minimized distance) form clusters.
- Distribution clustering. Data clusters are determined by the probability that each point belongs to a given cluster’s distribution.
- Connectivity clustering. Data clusters are determined by initially assuming each data point is its own cluster, then repeatedly linking together the clusters that are closest to one another.
- K-means clustering. Data clusters are determined by minimizing the distance between data points and a predetermined number, k, of cluster centers.
Each type of analysis has its advantages and disadvantages, but in industry the most common and most useful one is k-means clustering. Let’s look at the data clusters in our pig example to understand better.
Most Popular Clustering Analysis: K-Means Clustering Example
K-means clustering uses a presupposed number of clusters, then minimizes the distance between each data point in the set and the nearest of those cluster centers. The key concept to understand in k-means clustering is that only the number of cluster centers is predetermined. It’s only when a computer algorithm starts to minimize distances that we find out where those centers are located.
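We’ll do this in Excel below, but if it helps to see the bare algorithm first, here is a rough Python sketch of the same idea. The (age, weight) pairs are placeholders rather than the actual pig data, and the code is a minimal illustration, not a production implementation.

```python
import random

def kmeans(points, k, iterations=100):
    """Minimal k-means sketch: pick k starting centers, then alternate between
    assigning each point to its nearest center and moving each center to the
    mean of the points assigned to it."""
    centers = random.sample(points, k)  # only the number of centers is chosen in advance
    for _ in range(iterations):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k),
                          key=lambda i: sum((a - b) ** 2 for a, b in zip(p, centers[i])))
            groups[nearest].append(p)
        # Move each center to the average of its group (keep it if the group is empty).
        centers = [tuple(sum(dim) / len(g) for dim in zip(*g)) if g else centers[i]
                   for i, g in enumerate(groups)]
    return centers, groups

# Hypothetical (age, weight) pairs -- not the real pig dataset.
pigs = [(1, 20), (2, 25), (1.5, 22), (6, 90), (7, 95), (6.5, 92)]
centers, groups = kmeans(pigs, k=2)
print(centers)
```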
Disclaimer: the analysis below only starts to make sense when you get to the last step, so don’t worry if it isn’t clear until you’ve gone through it entirely.
Let’s look again at our pig example to understand. The going-in number of clusters we’ll look for is 4. Let’s lay out the details here:
- In the gray cells we outline the variables we’re testing — age and weight.
- In the orange cells we create empty space that Excel will later solve for in order to minimize the distance between each point and the assumed 4 clusters.
- In blue cells we list our 15 pigs with their age and weight.
- In the green cells we create our distance calculations between each pig and the clusters. You can see from the formula that this distance is Euclidean — that is, it’s the square root of the sum of the squared differences between the cluster center and the point.
- The numbers you see in green are just placeholders. We haven’t asked Excel to run a calculation on the cluster centers yet, so the values are just the sum of our two variables.
- The reason we square the differences and then take the square root is to get rid of any negative values (a negative times a negative equals a positive). Written out, the formula is as follows:
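The worksheet screenshot isn’t reproduced here, but with our two variables the calculation boils down to the standard two-dimensional Euclidean distance:

distance = √( (pig’s age − center’s age)² + (pig’s weight − center’s weight)² )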
Once you have set up this layout, you need to add a cell that looks for the minimum distance between each point and the cluster centers, which we can do using Excel’s MIN() function. In addition, we want another cell that identifies which cluster each point belongs to once Excel runs its calculation, which we can do using Excel’s MATCH() function. Here’s what these formulas look like:
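The formula screenshot isn’t reproduced here, but if it’s easier to read the logic outside Excel, here is a rough Python equivalent of those two cells. The distance values are made up, standing in for one pig’s row of green cells:

```python
# Distances from one pig to each of the 4 candidate cluster centers
# (made-up numbers standing in for the green cells of the worksheet).
distances = [14.2, 3.8, 22.5, 9.1]

minimum = min(distances)                 # what Excel's MIN() returns
cluster = distances.index(minimum) + 1   # what MATCH() returns: the 1-based position
print(minimum, cluster)                  # -> 3.8 2
```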
Now that you’ve set this up, we’re ready to let Excel minimize the distances. To do so, we need to use an Excel add-on called Solver. You can install it using this guide. Solver works by optimizing a target cell by changing a single cell or a range of cells, given a set of constraints.
In the example below, we’ll set the target cell as the sum of the minimum-distance cells in line 9. To do so, we’ll tell Excel to modify the orange cluster cells. Let’s assume our only constraint is that the values Solver produces must be less than our largest known variable, which is cell P3: the weight of the pig named Kim.
In addition, you have to tell Solver whether the optimization will be linear or non-linear. Since Euclidean distance is non-linear, we need to use the Evolutionary setting you see in the picture below.
Now hit solve!
Now we have our data clusters. Solver calculated the optimal cluster centers by minimizing the distance between all of the data points and these 4 clusters. Let’s check to see how it worked graphically by creating a new scatter plot:
But wait, there’s a problem here. Two of our 4 cluster centers are placed as outliers, far from any data points, which is obviously not correct. What’s happened? Excel could not minimize around all 4 clusters because the original data points are too close together.
Remember: we decide somewhat arbitrarily how many clusters we want to use at the start of a k-means analysis. In this case, it seems like we chose too many. Let’s try the analysis again using only 3 clusters and see if that helps:
We still have an outlying cluster center. We could remove another, but to me three cluster centers seems reasonable… we’re missing something else here. The range of possible cluster-center values is too wide: we’re letting Solver choose values for the cluster centers that fall outside our range of ages. We need to scope down the value constraints for the age variables to values between the minimum and maximum ages, and likewise scope down the weight variables to the minimum and maximum weights. It looks like this:
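The Solver settings screenshot isn’t reproduced here, but to show the same idea outside Excel, here is a rough Python analogue using SciPy’s evolutionary optimizer. The data, the number of clusters, and the bounds are placeholder values rather than the actual worksheet inputs; the point is only to illustrate the objective (the target cell) and the min/max constraints (the bounds).

```python
import numpy as np
from scipy.optimize import differential_evolution

# Hypothetical (age, weight) observations standing in for the pig data.
points = np.array([[1, 20], [2, 25], [1.5, 22], [6, 90], [7, 95], [6.5, 92]])
k = 3

def total_min_distance(flat_centers):
    """The 'target cell': the sum, over all points, of the distance to the nearest center."""
    centers = flat_centers.reshape(k, 2)
    dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
    return dists.min(axis=1).sum()

# Scoping the constraints: each center's age stays between the min and max age,
# and each center's weight stays between the min and max weight.
bounds = [(points[:, 0].min(), points[:, 0].max()),
          (points[:, 1].min(), points[:, 1].max())] * k

result = differential_evolution(total_min_distance, bounds, seed=0)
print(result.x.reshape(k, 2))  # the optimized cluster centers
```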
Let’s check the output in a new graph. This looks much better:
Now our data clusters look more reasonable. Not surprisingly, they are different from the clusters we created visually at the beginning of the article. To our eyes, the dense core of points was best treated as one cluster, but Excel’s Solver has determined a better way to organize the clusters, and has better minimized the distances in doing so.
What does this tell us about data clusters?
It’s easy to incorrectly group observations based on visuals and intuition. Some even argue that data clusters are defined as the result of this statistical approach. That would mean our visual clusters at the beginning of the article were not data clusters at all — just circles on a graph. To get trustworthy data clusters, we need to perform a statistical analysis.
Other Techniques: Mean-Shift, Density-Based Spatial Clustering, Expectation-Maximization, Agglomerative Hierarchical
K-Means is the most popular type of clustering because it is the most intuitive. However, it’s far from the only technique. Here are four other popular ones:
Mean-Shift Clustering
Mean-Shift Clustering works very much like k-means, but instead of fixing a number of cluster centers in advance, mean-shift starts a candidate center at each data point and repeatedly shifts it toward the average of the points around it, so the final centroids settle on the densest regions of the data.
Density-Based Spatial Clustering
In density-based spatial clustering, each data point is analyzed as a potential core point of a cluster. A distance allowance called Epsilon defines that point’s neighbourhood, and if enough other points fall within it, the point anchors a cluster and pulls its neighbours in with it. The process is repeated for every point in the dataset; points that never collect enough neighbours are left out as noise rather than forced into a cluster.
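This isn’t part of the Excel walkthrough, but for reference, here is roughly how those two ingredients (the Epsilon distance allowance and the minimum number of neighbouring points) map onto scikit-learn’s implementation; the data and parameter values are made up.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical (age, weight) observations.
points = np.array([[1, 20], [2, 25], [1.5, 22], [6, 90], [7, 95], [40, 400]])

# eps is the Epsilon distance allowance; min_samples is how many neighbours a
# point needs before it can anchor a cluster. Points that never qualify get
# the label -1 (noise).
labels = DBSCAN(eps=6, min_samples=2).fit_predict(points)
print(labels)
```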
Expectation-Maximization
Expectation-Maximization is similar to k-means clustering, except that it estimates a spread (a standard deviation, or more generally a covariance) for each cluster on top of the average. This allows the clusters to take on more dynamic, elliptical forms instead of following a circular structure.
Agglomerative Hierarchical Clustering
Agglomerative Hierarchical Clustering starts by treating each data point as its own cluster, then repeatedly merges the closest clusters until the whole dataset becomes one “big” cluster. This approach allows the analyst to choose the number of clusters they want by cutting the hierarchy of merges at the right level – a welcome flexibility for analysis.
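None of these techniques appear in the Excel walkthrough above, but if you ever move the exercise into code, scikit-learn ships an implementation of each. Here is a hedged sketch with made-up data, just to show how little the calling code changes from one technique to the next (DBSCAN was sketched in the previous section):

```python
import numpy as np
from sklearn.cluster import MeanShift, AgglomerativeClustering
from sklearn.mixture import GaussianMixture

# Hypothetical (age, weight) observations.
points = np.array([[1, 20], [2, 25], [1.5, 22], [6, 90], [7, 95], [6.5, 92]])

# Mean-shift: candidate centers drift toward the densest regions of the data.
print(MeanShift().fit_predict(points))

# Expectation-maximization via a Gaussian mixture: means *and* covariances,
# so clusters can be elliptical rather than circular.
print(GaussianMixture(n_components=2, random_state=0).fit_predict(points))

# Agglomerative: start with every point as its own cluster, keep merging the
# closest pair, and cut the hierarchy at the number of clusters you want.
print(AgglomerativeClustering(n_clusters=2).fit_predict(points))
```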
Final Note: How Dimensions Impact Data Clusters
In our clustering exercise, we only examined two dimensions: weight and age. However, we could have looked at 5, 10, 20, or more dimensions, which is much harder to conceptualize.
If we introduce more than 3 dimensions, data clusters are no longer a graphical, visual exercise. Instead, they’re abstract by nature. It’s nearly impossible for a human to “visualize” or “imagine” a space with more than the x, y, and z axes, but such spaces exist nevertheless.
A good way to think about 4th and 5th dimensions is to imagine space and color as dimensions on a graph. You have points on the x, y, and z axes. Then imagine that those axes move through space — this would be a 4th dimension. On top of that, imagine that each point is a shade or color — this would be a 5th dimension. Not easy, huh?
Good data clusters are able to provide valuable insights based on a maximum number of variables. The more variables there are at play, the more information we have feeding the analysis, and the less we can rely on our eyes alone. This is why data clusters are only true subsets when they’re based on statistical analysis.