In biology, taxonomy means naming, defining, and classifying organisms. In data, it’s the same, except that the process applies only to the data within a given set, not to all possible data in the world. It’s easy to understand why data taxonomy is useful. If you’ve ever looked at a data table, the last thing you want to do is go line by line to make sense of it. Data taxonomy allows you to skip that step.
So you know, this article:
- defines data taxonomy,
- explains how data taxonomy and taxonomy charts are related to other data classification concepts and terminologies,
- provides free tools to get started with data taxonomy,
- discusses taxonomy within the context of digital marketing,
- shows taxonomy best practices, and finally,
- provides a data taxonomy template.
What is the definition of data taxonomy?
Data taxonomy is the classification of data into hierarchical groups to create structure, standardize terminology, and popularize a dataset within an organization. The closely related data taxonomy chart shows this hierarchy using boxes and lines, and limits the data shown to observation name and available attributes.
Though on the surface they look like data models, data taxonomy charts have flexible design and granularity rules, meaning they can be custom built to an organization’s needs. In fact, data taxonomy charts look very similar to a number of data structure representations, including:
- data ontologies,
- data hierarchies,
- metadata,
- data classifications, and
- data dictionaries.
Let’s look at how and why they’re different, and what relationships exist between them.
Data Taxonomy vs Data Model
Data taxonomy as a concept is not the same as a data taxonomy chart. Taxonomy itself is the process of classifying, which does not require writing anything down. The moment two analysts orally agree on categories, they have constructed a data taxonomy. When they draw this out in a tree diagram, it becomes a data taxonomy chart.
It’s the taxonomy chart that seems to be in conflict with data models. In short, advanced data models, or physical data models, use a special notation called crow’s foot notation, as well as UML-based constructs, to show object contents and relationships unambiguously. Data taxonomy charts, while similar in appearance, do not follow the same formatting and notational rules.
Their goal is to show hierarchy, and not necessarily observations and attributes. That said, some observation and attribute data is useful in many data taxonomy charts. Here’s a sample table to show you observations and attributes:
Product ID [observations] | Weight (g) [attribute #1] | Price ($) [attribute #2] |
---|---|---|
Maple | 50 | 500 |
Big Blue | 250 | 250 |
The Basic | 100 | 100 |
To illustrate, here’s how this table fits into both an example data taxonomy chart and a physical data model:
As you can see, the data taxonomy chart is simple and hierarchical. The Product_ID table we showed above fits under the database at the same class level as Retailer_ID and Vendor_ID. Now let’s look at the data model:
The inclusion of entity titles, the PK designation, attributes, and crow’s foot notation (which indicates the type of relationship between entities) makes data models more robust and more informative about entity contents, but less indicative of hierarchy. If you’re interested in data modeling, check out this visual article with examples.
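As a rough sketch of the taxonomy side of this comparison, the hierarchy described above can be written as a nested Python dictionary. The root name `Database`, the helper `render_tree`, and the choice to list the three products as children of `Product_ID` are assumptions made for illustration; only the entity and product names come from the example.

```python
# A minimal sketch of a data taxonomy chart as a nested dictionary.
# "Database" and render_tree are illustrative names, not part of any tool.
taxonomy = {
    "Database": {
        # Assumption: observations are shown as children of their entity.
        "Product_ID": {"Maple": {}, "Big Blue": {}, "The Basic": {}},
        "Retailer_ID": {},
        "Vendor_ID": {},
    }
}


def render_tree(node, depth=0):
    """Return one indented line per entry, parents before children."""
    lines = []
    for name, children in node.items():
        lines.append("  " * depth + name)
        lines.extend(render_tree(children, depth + 1))
    return lines


tree_lines = render_tree(taxonomy)
print("\n".join(tree_lines))
```

Notice that the structure records only who sits under whom. It says nothing about keys, data types, or how entities relate to each other, which is the information a data model adds.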
Data Taxonomy vs Data Dictionary
Where data taxonomy charts are similar to data models, data taxonomy as a concept is similar to a data dictionary.
A data dictionary describes a table’s columns based on common traits (e.g., name, definition, data type) within another table. Admins use data dictionaries when a data table is simply too large to view directly. Data dictionaries allow readers to understand complex databases without having to investigate each column. You can think of them as a summary of data about data.
For example, check out this sample data table. As you can see, it’s quite simple. You’ve got a customer ID with 3 attributes.
Customer_ID | Customer_Height_CM | Customer_Weight_KG | Customer_Age |
---|---|---|---|
C1 | 180 | 65 | 24 |
C2 | 174 | 72 | 20 |
C3 | 186 | 47 | NULL |
C4 | 182 | 50 | 18 |
C5 | 175 | 55 | 21 |
C6 | 180 | 62 | 23 |
C7 | 190 | 73 | NULL |
C8 | 170 | 59 | 28 |
Now let’s look at a data dictionary view of this table. We’re defining different elements of the Customer_Age column so the analyst can get a quick view of what’s inside without exploring the raw data.
Name | Definition | Data type | Possible values | Required? |
---|---|---|---|---|
Customer_Age | Age of users | Integer | 18, 20, 21, 23, 24, 28, NULL | No |
The main difference between data taxonomy and data dictionaries is format. Whereas a data taxonomy is a concept, a data dictionary is by definition a table. The second significant difference concerns angle. Data dictionaries simply seek to summarize data, not to give it structure. Data taxonomy, as we have clearly defined it, shows hierarchy in a dataset.
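To make the idea of a data dictionary concrete, here is a minimal Python sketch that summarizes each column of the sample customer table. The helper `describe_column` and its field names are hypothetical, and `None` stands in for NULL. Fields such as the definition and whether a column is required cannot be derived from the data alone, so the sketch leaves them out.

```python
# A minimal sketch: derive a simple data dictionary from the sample table.
# describe_column and its output fields are illustrative assumptions.
rows = [
    {"Customer_ID": "C1", "Customer_Height_CM": 180, "Customer_Weight_KG": 65, "Customer_Age": 24},
    {"Customer_ID": "C2", "Customer_Height_CM": 174, "Customer_Weight_KG": 72, "Customer_Age": 20},
    {"Customer_ID": "C3", "Customer_Height_CM": 186, "Customer_Weight_KG": 47, "Customer_Age": None},
    {"Customer_ID": "C4", "Customer_Height_CM": 182, "Customer_Weight_KG": 50, "Customer_Age": 18},
    {"Customer_ID": "C5", "Customer_Height_CM": 175, "Customer_Weight_KG": 55, "Customer_Age": 21},
    {"Customer_ID": "C6", "Customer_Height_CM": 180, "Customer_Weight_KG": 62, "Customer_Age": 23},
    {"Customer_ID": "C7", "Customer_Height_CM": 190, "Customer_Weight_KG": 73, "Customer_Age": None},
    {"Customer_ID": "C8", "Customer_Height_CM": 170, "Customer_Weight_KG": 59, "Customer_Age": 28},
]


def describe_column(rows, name):
    """Summarize one column: data type, distinct values, and whether NULLs occur."""
    values = [row[name] for row in rows]
    present = [v for v in values if v is not None]
    types = sorted({type(v).__name__ for v in present})
    return {
        "name": name,
        "data_type": "/".join(types),
        "distinct_values": sorted(set(present)),
        "has_nulls": len(present) < len(values),
    }


# One dictionary entry per column, in the table's column order.
data_dictionary = [describe_column(rows, col) for col in rows[0]]
for entry in data_dictionary:
    print(entry)
```

A column that contains NULLs, like Customer_Age here, is a reasonable candidate for "Required? No", though that decision ultimately belongs to whoever owns the data.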
Data Taxonomy vs Data Classification
This one is easy. Data classification is a blanket term for all activities addressing the structure, contents, and hierarchy of data within a dataset. Data taxonomy aims to structure and give hierarchy to data, so it is a sub-discipline of data classification.
Data Taxonomy vs Data Ontology
Data ontology is another blanket term, but it’s more high-level than data classification. Data ontology spans fields like computer science, information technology, database management, and data analysis itself, whereas the scope of most of the other terms in this article is limited to data analysis alone.
In a sentence, a data ontology must consist of representation, formal naming, and a definition of data classes. Representation may look similar to a data taxonomy chart or a data model, depending on the degree of complexity in the underlying database. Formal naming refers to the official names given to observations, attributes, and repetitive non-numeric data items. Data classes represent the hierarchy of observation IDs in the dataset. In this way, they are similar to data taxonomies.
On the other hand, data taxonomy is only required to address hierarchy. While having observation IDs (aka PKs) and attributes is a plus, by definition they are not required.
Data Taxonomy vs Hierarchy
This one is easy. Data hierarchy is a concept inherent to data taxonomy, since data taxonomies aim to classify data. Hierarchies are visible in data taxonomy charts, but they do not have their “own” representations.
Data Taxonomy vs Metadata
As with hierarchy, metadata is a concept inherent to data taxonomies, since taxonomies aim to summarize data within their classifications. It should be noted, however, that metadata usually refers to a comprehensive summary of a dataset, as with a data dictionary, whereas the metadata in a data taxonomy is usually limited to the observation ID (aka PK).
Data Taxonomy Tools
Data taxonomy tools do not exist as such. Instead, you can use any standardized taxonomy tool. Most of these are linked to biology or natural sciences at their core, but allow anyone to participate, as long as you respect a small set of submission criteria. One free tool you can use is The W32.
As an alternative, if you just want to build simple, visual data taxonomy charts like the one shown in this article, you can use shapes and lines in PowerPoint, or its SmartArt feature. When I do this, I typically structure my data in Excel beforehand.
Data Taxonomy Template
If you liked the taxonomy shown earlier in this article, you can download it as a template here:
Sample Taxonomy Template
Data Taxonomy in Digital Marketing
The more you work with taxonomy, the more you realize that its scope is thesaurus-sized. As a concept outside of data analysis, taxonomy is the basis on which all hierarchical information is built.
When you scale down to a specific industry, in this case digital marketing, you can see that the goal becomes to organize the critical data involved within that field, not to build a complete disciplinary taxonomy (which would take years).
So, if we apply the principles of taxonomy to digital marketing, we would perform the following steps:
- Identify all business units and data entities on which you would like to build your taxonomy. For example, you might choose vendors, products, and customers. Google Analytics does a good job of breaking down the different possible hierarchies in internet analytics, and of giving you the flexibility to rearrange them.
- Build a framework in Excel and Microsoft PowerPoint.
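If you would rather prepare the framework programmatically before opening Excel, the two steps above can be sketched in Python as follows. The taxonomy contents (`Vendor A`, `Product A`, and so on), the `flatten` helper, and the column names are placeholders assumed for illustration. The script flattens the hierarchy into level/parent/name rows and writes them as CSV text, which Excel can open.

```python
import csv
import io

# Hypothetical digital marketing taxonomy; the leaf names are placeholders.
marketing_taxonomy = {
    "Digital marketing": {
        "Vendors": {"Vendor A": {}, "Vendor B": {}},
        "Products": {"Product A": {}},
        "Customers": {"Customer A": {}},
    }
}


def flatten(node, parent="", level=0):
    """Yield (level, parent, name) rows, one per taxonomy entry."""
    for name, children in node.items():
        yield (level, parent, name)
        yield from flatten(children, parent=name, level=level + 1)


rows = list(flatten(marketing_taxonomy))

# Write to an in-memory buffer; to save a file instead, use
# open("taxonomy.csv", "w", newline="") in place of io.StringIO().
buffer = io.StringIO()
writer = csv.writer(buffer)
writer.writerow(["level", "parent", "name"])
writer.writerows(rows)
csv_text = buffer.getvalue()
print(csv_text)
```

Keeping a parent column on every row makes it easy to rearrange the hierarchy later: moving an entry is a one-cell change in the spreadsheet.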
Data Taxonomy Best Practices
A few best practices or tips to keep in mind as you develop a data taxonomy include the following:
- Always speak and write in terms of how users view the information, not how its creators view it. For example, say “the iPhone X” rather than “the product.”
- Ensure language is consistent across the organization.
- Simplify, simplify, simplify.
- Allow room for innovation — don’t get too attached to your document on the first go.
Conclusion
Taxonomy is the classification of elements into a hierarchical structure. In some ways, all taxonomy is data taxonomy, but in the context of data analysis the term represents a specific classification model. We call this a data taxonomy chart.
While it closely resembles data models, data dictionaries, data ontologies, and metadata, a data taxonomy chart is unique in that its sole purpose is to show hierarchy between PK entities. At the end of the day, expanding on observations and including attributes within the model is always a possibility, but the core is ranking.