If you’ve tried to understand data attributes while learning to program or exploring a big data tool, chances are you ended up with a lot of questions. Data attributes are simple, but their meaning can get lost in the noise.
The purpose of this article is to clearly define data attributes, explain the different types, and show examples in a simple database, as well as explain their role in HTML, Python, jQuery, GIS, and Six Sigma.
Data Attribute Definition & Description
In short, a data attribute is a single-value descriptor for a data point or data object. It exists most often as a column in a data table, but can also refer to special formatting or functionality for objects in programming languages such as Python.
It’s important to recognize that an attribute is simply data that describes other data — you should not imagine it as external to the dataset. In this sense, an attribute is a way we use one part of the data to describe other parts.
For example, imagine you’re looking at rainfall in 2 countries and 10 of their cities. If you decide you would like to focus on the rainfall by individual city, then the countries are attributes of those cities. Cities and countries are both data, but one describes the other.
Types of Data Attributes
Data attributes come in three types. (Remember, numerical values are considered measures because they can be arithmetically manipulated up and down rows. They are not attributes.)
- Date – a point on the calendar, such as a day, month, or year
- Text – often referred to as “string,” means simply any combination of letters or other symbols instead of numbers
- Boolean – TRUE or FALSE data, often stored as YES or NO text, or as the numbers 1 and 0. It is, in simple terms, binary data.
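To make the three types concrete, here is a minimal Python sketch. The record, its field names, and the year are hypothetical and chosen only for illustration. The helper simply sorts each value into date, text, Boolean, or measure.

```python
from datetime import date

# Hypothetical record: field names and values are assumptions for illustration.
reading = {
    "city": "New York City",            # text attribute
    "recorded_on": date(2020, 2, 24),   # date attribute (the year is made up)
    "is_coastal": True,                 # Boolean attribute
    "rainfall_cm": 12,                  # numeric measure, not an attribute
}

def attribute_type(value):
    """Classify a value as one of the three attribute types, or as a measure."""
    # bool is checked before int because bool is a subclass of int in Python.
    if isinstance(value, bool):
        return "boolean"
    if isinstance(value, date):
        return "date"
    if isinstance(value, str):
        return "text"
    if isinstance(value, (int, float)):
        return "measure"
    return "unknown"

kinds = {name: attribute_type(value) for name, value in reading.items()}
print(kinds)
```

Checking for Boolean before number matters here, since Python treats True and False as the integers 1 and 0.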
Example of Data Attribute in Databases
Using our example of rainfall, let’s look at a sample database and identify the attributes.
|Month||Day||Country||City||Rainfall|
|Feb||24||USA||New York City||12|
The database shows rainfall for 10 cities in two countries over 10 months. But note: we don’t know that these months all fall in the same year. Moreover, although the same day values appear under each country, they do not fall in the same months for both. This raises the question: what are we describing in this dataset?
To find out, we need to see what data differentiates each row — what makes each line unique. When we know what never repeats, then we know how the information was recorded. An easy way to visualize this is with time; if each line represented a moment in time, then none of them would repeat.
In this set, each month and city appear only once. We can thus say that they are unique identifiers of each row. The other columns (country, day, rainfall) describe month and city.
Since the other fields either have duplicate entries, such as “country” and “day”, or serve as a value, such as “rainfall”, they won’t help us identify the row. We can say that our dataset is normalized to month and city.
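The uniqueness check described above can be sketched in a few lines of Python. Only the New York City row comes from the sample data; the other rows, the column names, and the is_unique helper are hypothetical stand-ins.

```python
# Hypothetical sample rows; only the first one appears in the article's table.
rows = [
    {"month": "Feb", "day": 24, "country": "USA", "city": "New York City", "rainfall_cm": 12},
    {"month": "Mar", "day": 24, "country": "USA", "city": "Houston", "rainfall_cm": 13},
    {"month": "Oct", "day": 3, "country": "France", "city": "Dijon", "rainfall_cm": 7},
    {"month": "Nov", "day": 3, "country": "France", "city": "Lyon", "rainfall_cm": 9},
]

def is_unique(rows, columns):
    """Return True if the combined values of `columns` never repeat across rows."""
    keys = [tuple(row[column] for column in columns) for row in rows]
    return len(set(keys)) == len(keys)

# Columns whose values never repeat can identify a row; the others describe it.
uniqueness = {column: is_unique(rows, [column]) for column in ("month", "city", "country", "day")}
print(uniqueness)
print(is_unique(rows, ["month", "city"]))
```

In this toy data, month and city never repeat while country and day do, which matches the reasoning in the text.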
Here’s a better way to view the data with this in mind. It will help us understand attributes:
|Month||City||Day||Country||Rainfall|
|Feb||New York City||24||USA||12|
So, where are the attributes? Let’s recall our definition: a data attribute is a single-value descriptor for a data point or data object. Given that our data points are based on month and city, the other fields contain single-value descriptions and are thus attributes.
Each individual entry under day and country is an attribute. Rainfall, however, is not an attribute. To understand why, we need to understand aggregate attributes and measures.
Aggregate Attributes vs Single Attributes
An important distinction is that the term “attribute” is used loosely in this context to mean both each individual cell AND the entirety of the column. For example, you might say that France is an attribute of October in Dijon, or you may say country is an attribute of month and city. The former is a case of single attributes, whereas the latter is a case of aggregate attributes.
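As a rough illustration of the difference, the sketch below pulls out one cell (a single attribute) and then a whole column (an aggregate attribute). The rows are hypothetical; only the October/Dijon/France pairing is taken from the text.

```python
# Hypothetical rows keyed by month and city.
rows = [
    {"month": "Feb", "city": "New York City", "country": "USA"},
    {"month": "Mar", "city": "Houston", "country": "USA"},
    {"month": "Oct", "city": "Dijon", "country": "France"},
    {"month": "Nov", "city": "Lyon", "country": "France"},
]

# Single attribute: the one cell that describes October in Dijon.
single = next(
    row["country"]
    for row in rows
    if row["month"] == "Oct" and row["city"] == "Dijon"
)

# Aggregate attribute: the whole "country" column, describing every month/city pair.
aggregate = [row["country"] for row in rows]

print(single)
print(aggregate)
```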
Attributes vs Measure
Returning to our example of rainfall, we cannot consider rainfall an attribute. This is because it does not draw on a fixed set of descriptors for the unique IDs; it’s not categorical. Instead, it provides a numeric value that can be arithmetically manipulated with other rows. Rather than “attributes,” we refer to columns that provide numeric data as measures.
This may seem counterintuitive at first. If the number provides information about the unique ID in the same way a word does, then they both describe, right? Not exactly. The fact that we can multiply, divide, add and subtract numbers up and down rows is significant. The numbers do not describe each unique ID in a fixed number of ways, but attributes do.
However, if the column contained only 1s and 0s, it would be a categorical dimension, and therefore an attribute, because the numbers would stand for two categories rather than for quantities to calculate with.
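One way to see the split in practice: attributes are what we group by, and measures are what we add up. The following sketch assumes hypothetical rows and column names and totals rainfall per country.

```python
from collections import defaultdict

# Hypothetical rows; "country" is an attribute and "rainfall_cm" is a measure.
rows = [
    {"city": "New York City", "country": "USA", "rainfall_cm": 12},
    {"city": "Houston", "country": "USA", "rainfall_cm": 13},
    {"city": "Dijon", "country": "France", "rainfall_cm": 7},
    {"city": "Lyon", "country": "France", "rainfall_cm": 9},
]

# Group by the attribute and sum the measure down the rows.
totals = defaultdict(int)
for row in rows:
    totals[row["country"]] += row["rainfall_cm"]

print(dict(totals))
```

Adding up country names would make no sense, but using them as group labels does. That is the practical difference between the two kinds of column.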
Rows as Attributes
So far we’ve discussed columns as attributes and measures, but what about rows? Can we somehow use unique IDs as attributes? The answer is yes.
To understand this, we need to come back to the basic premise about attributes: they are nothing more than one piece of data that describes another.
For example, imagine we want to know in which city 13 cm of rainfall came down in the USA on the 24th day of a month. In this case, the attribute is the row value, Houston, whereas the “unique ID” is 13 cm, USA, 24th day. What matters is our search criteria: depending on how you filter the data, a row can become an attribute, and a column the unique ID.
With that said, only one city is relevant to 13 cm in the USA on the 24th day. In a larger database, there could be many relevant cities. This is another reason why we can refer to attributes as single or aggregate.
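The lookup described above can be sketched as a simple filter. The rows and the cities_matching helper are hypothetical; the search values (13 cm, USA, 24th day) come from the example.

```python
# Hypothetical rows for illustration.
rows = [
    {"month": "Feb", "day": 24, "country": "USA", "city": "New York City", "rainfall_cm": 12},
    {"month": "Mar", "day": 24, "country": "USA", "city": "Houston", "rainfall_cm": 13},
    {"month": "Oct", "day": 3, "country": "France", "city": "Dijon", "rainfall_cm": 7},
]

def cities_matching(rows, country, day, rainfall_cm):
    """Treat (country, day, rainfall) as the search key and city as the attribute."""
    return [
        row["city"]
        for row in rows
        if row["country"] == country
        and row["day"] == day
        and row["rainfall_cm"] == rainfall_cm
    ]

matches = cities_matching(rows, country="USA", day=24, rainfall_cm=13)
print(matches)
```

The function returns a list rather than a single value because, in a larger dataset, several cities could match the same criteria.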
When we’re performing data analytics, the goal is to pull insights from a dataset by consolidating and comparing its rows and columns. If an attribute only applies to one unique ID, it is not very useful for analysis. Aggregate attributes, therefore, are necessary in analysis.
Criteria of Data Attributes Usable in Analytics
In order for a data attribute to be valuable for big-data analytics, it must meet the following criteria:
- The attribute must be present in more than one row or column (aggregate), unless it’s a one-off in its field. For example, if one of our cities were in Germany, this would still be a “country” attribute.
- The attribute value must be self-explanatory or have metadata descriptions.
- The attribute must not be normalized to a unique ID if it is to be used as a descriptor for another unique ID. If not, the analysis may produce false associations.
This third point is difficult to conceptualize. To understand, imagine the following database:
|ID1||# presentID1||ID2||# presentID2|
This structure occurs when we want to consolidate two unique IDs in one table in order to compare them. But it presents a problem: since # presentID2 is a value recorded for ID2, not for ID1, we cannot consider it as an attribute for ID1.
In order for # presentID2 to be useful as an attribute for ID1, we would need to remap those values to ensure they are correct in ID1. This is why we normalize databases in the first place.
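Here is a small sketch of that remapping step. The IDs, the counts, and the id1_to_id2 mapping are all hypothetical; the point is only that the side-by-side layout alone does not tell us which ID2 value belongs to which ID1.

```python
# Hypothetical side-by-side table: (ID1, # presentID1, ID2, # presentID2).
wide = [
    ("A", 3, "X", 10),
    ("B", 5, "Y", 20),
]

# Each count is normalized to its own ID, so we split the table in two.
counts_by_id1 = {id1: count1 for id1, count1, _, _ in wide}
counts_by_id2 = {id2: count2 for _, _, id2, count2 in wide}

# Sharing a row does not relate A to X. An explicit mapping (assumed here)
# is needed before an ID2 count can describe an ID1.
id1_to_id2 = {"A": "Y", "B": "X"}
remapped = {id1: counts_by_id2[id2] for id1, id2 in id1_to_id2.items()}

print(remapped)
```

Reading the wide table row by row would have paired A with 10. After remapping, A is paired with 20, which shows how a false association can creep in.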
Data Attributes in Data Models
In the context of data models, data attributes represent the columns of a data table. For example, a logical data model shows the primary key (aka unique ID), as well as the attribute column names. Since it does not list all of the attribute options, this is an example of aggregate data attributes.
Here is an example of a simple business data model:
Since these boxes represent data tables, you can envision how they are structured given the primary key (unique ID) and the attribute titles (other columns).
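To show how a primary key and attribute columns look in an actual table definition, here is a sketch using Python's built-in sqlite3 module. It reuses the rainfall example rather than the business model, and the table and column names are assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Hypothetical schema: month and city together form the primary key (unique ID);
# the remaining columns describe that key.
conn.execute(
    """
    CREATE TABLE rainfall (
        month TEXT NOT NULL,
        city TEXT NOT NULL,
        day INTEGER,
        country TEXT,
        rainfall_cm REAL,
        PRIMARY KEY (month, city)
    )
    """
)
conn.execute(
    "INSERT INTO rainfall VALUES (?, ?, ?, ?, ?)",
    ("Feb", "New York City", 24, "USA", 12),
)

# PRAGMA table_info yields (cid, name, type, notnull, dflt_value, pk) per column;
# pk is non-zero for columns that belong to the primary key.
columns = conn.execute("PRAGMA table_info(rainfall)").fetchall()
key_columns = [name for _, name, _, _, _, pk in columns if pk]
other_columns = [name for _, name, _, _, _, pk in columns if not pk]

# A second row with the same month and city violates the primary key.
try:
    conn.execute(
        "INSERT INTO rainfall VALUES (?, ?, ?, ?, ?)",
        ("Feb", "New York City", 1, "USA", 0),
    )
    duplicate_rejected = False
except sqlite3.IntegrityError:
    duplicate_rejected = True

print(key_columns, other_columns, duplicate_rejected)
```

The non-key columns here include both attributes (day, country) and a measure (rainfall_cm). A logical data model lists them the same way, by column name only.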
Data Attribute in HTML and jQuery
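In HTML, an attribute is a name-value pair written inside an element’s opening tag that adds information or behavior to that element. jQuery, a JavaScript library, can read and change these attributes. One of the most essential is the ID attribute, referred to here as the data-ID attribute.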
What is a data-ID attribute?
In short, when you create an element in your markup, you can give it an ID using id="name". In this context, the ID is an attribute of the element, and it uniquely identifies the element for further use, for example when selecting it with jQuery or CSS. HTML also supports custom attributes whose names begin with data-, such as data-id, for attaching your own values to an element.
For example, imagine any random <object>…</object> element. We can add an ID attribute inside the opening tag with the following notation: <object id="name-of-object">…</object>. Note that an id value cannot contain spaces.
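To see that HTML attributes really are just name-value pairs attached to an element, here is a short sketch using Python's standard html.parser module. The markup, the ID value, and the AttributeCollector class are hypothetical examples.

```python
from html.parser import HTMLParser

class AttributeCollector(HTMLParser):
    """Record each opening tag together with its attributes."""

    def __init__(self):
        super().__init__()
        self.found = []

    def handle_starttag(self, tag, attrs):
        # attrs arrives as a list of (name, value) pairs.
        self.found.append((tag, dict(attrs)))

parser = AttributeCollector()
# Hypothetical markup with an id attribute and a custom data-id attribute.
parser.feed('<div id="rain-chart" data-id="42">Rainfall</div>')
print(parser.found)
```

Each attribute comes back as a plain pair of strings, much like a column name and a cell value in the database examples.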
Data Attribute in Python
In Python, objects you create from your own classes can take on attributes such as name, age, height and other descriptors. For example, when defining a class, a programmer might set a name attribute on each new instance inside the class’s __init__ method with the following syntax: self.name = name.
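A minimal sketch of what that looks like in a full class follows. The Person class and its values are hypothetical.

```python
class Person:
    """Hypothetical class whose instances carry three data attributes."""

    def __init__(self, name, age, height_cm):
        # Each assignment to self.<something> creates an attribute on the instance.
        self.name = name
        self.age = age
        self.height_cm = height_cm

person = Person("Ada", 36, 170)

# Attributes are read with dot notation.
print(person.name)

# vars() returns the instance's attributes as a dictionary of name-value pairs.
print(vars(person))
```

Seen through vars(), the object's attributes look a lot like one row of a data table, with the attribute names acting as column names.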
Data Attribute in GIS
Data objects in GIS can take on data attributes that describe the where, what, and why of the data object, much like a traditional database. However, GIS is a framework and not a programming language itself. The attributes of GIS are thus the output of Python, HTML, and other scripts.
Data Attribute in Six Sigma
In Six Sigma methodologies, data attributes are defined in the same way as they are in traditional databases. They are most often column names, but can also be row values depending on the analyst’s approach to the question.