From a young age, we’re all exposed to data. You probably remember seeing data tables in science class as early as elementary school.
Any one of those data tables was probably called a data “set” at some point. Why? Because it’s an easy, intuitive way to speak about data.
But what is a data set really? Can any table be called a set, or are there defining criteria? What are the different types of data sets? And how do they work across industries?
Unfortunately, there’s no official definition. Instead, I’ve analyzed 8 use cases to determine how the term “set” is used and provide a holistic definition.
The use cases are:
- industry definitions from data leaders such as IBM and Google,
- linguistic definitions from dictionaries such as Oxford Languages and Webster,
- technical forums,
- use by governmental organizations such as Eurostat and data.gov,
- traditional mathematical textbooks,
- research papers,
- healthcare leaders, and
- my own experience as a financial analyst.
Strictly speaking, a data set is a collection of one or more tables, schemas, points, and/or objects that are grouped together either because they’re stored in the same location or because they’re related to the same subject. That said, in most cases the term simply refers to a table of data on a specific topic.
Definition
One more time: a data set is a collection of one or more tables, schemas, points, and/or objects that are grouped together either because they’re stored in the same location or because they’re related to the same subject.
Let’s break this down.
Most of us are familiar with data tables but less familiar with schemas, points, and objects. In a sentence, these are just different formats for representing and storing information. We’ll define them below in the Types of Data Sets section.
What’s important is that a data set can include tables whose contents are totally unrelated… as long as they’re stored in the same place.
To understand why, imagine you’re a database analyst. You manage a host of different tables within your data warehouse. Many of these tables have unrelated information, but they share a similar size. You decide to group them into sets to optimize storage. You’ve just created a data set of unrelated tables!
Nevertheless, unrelated data considered as a set occurs almost exclusively in the context of storage.
In virtually all other cases, data sets are made up of one or more tables that work together to provide information about the underlying subject.
How to Describe a Data Set
We’ve given a formal definition, but this is not usually how I like to describe data sets. Instead, the best way to describe them is as information. Data sets are collections of information related to the same topic, usually in the form of one table, although there is no limit to the number of tables.
A data set is different from a data warehouse, data lake, and data mart because it focuses on a much narrower topic. For example, imagine you want to investigate the airplane industry. A data warehouse would contain information about transactions, flights, and individual companies. A data set, however, would describe only one of those items.
“Dataset” vs “Data Set”
The correct way to write it is with two words: data set. Much like the terms ice cream, living room, and roller coaster, data set is an open compound word. As one word, “dataset” does not appear in any dictionaries, including Webster.
Moreover, the two-word form matches the meaning of the term. It is a set of data, with each word carrying its own meaning and contributing to the meaning of the whole. Unless a leading English dictionary adopts “dataset” as the correct form, “data set” will persist.
In reality, both are accepted in virtually any professional environment, so don’t get hung up on hitting the space bar!
List of 16 Awesome Public Data Sets
- Kaggle. Kaggle has a good variety of data sets on machine learning. It requires registration but is worth it.
- FiveThirtyEight. FiveThirtyEight is a news and sports site with data sets that are available on GitHub.
- BuzzFeed. BuzzFeed is a news and entertainment site that publishes data used in its articles on GitHub.
- NASA. NASA Earth observation data and much more is available on its website.
- Amazon AWS. Amazon’s AWS provides loads of data sets on different topics.
- Google. Google publishes many data sets on its BigQuery tool.
- University of California Irvine. UCI is one of the oldest sources of public data sets on the web, covering topics ranging from cars to breast cancer.
- Quandl. Quandl is a NASDAQ company with loads of financial data from stock prices to global indicators.
- data.world. data.world is a common source for the famous Makeover Monday data visualization event.
- Data.gov. Data.gov is the US government’s open data portal. This one is a must!
- The World Bank. A great source for world development data.
- Reddit. Reddit data sets from contributors.
- Weather Underground. Wunderground allows you to manipulate weather forecast data via its API.
- Socrata. Another great place for various data sets.
- Academic Torrents. Academic torrents allows you to download data from academic papers published all over the world.
- Data Is Plural. A weekly newsletter of insightful data sets.
Types of Data Sets
As explained in the definition section, data sets consist of one or more
- tables,
- schemas,
- points, and/or
- objects.
Each of these is a “type” of data set or component of a larger data set. Let’s give an example of each.
Data Table
A data table consists of columns and rows, where columns represent variables and rows represent records of those variables.
Item | Color | Weight |
---|---|---|
Jeep | Green | 2.5 tons |
Honda | Blue | 2 tons |
BMW | Gray | 2 tons |
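As a rough sketch, the same table can be written in Python as a list of records, with one dictionary per row. The variable names (`cars`, `columns`, `colors`) are illustrative assumptions rather than part of any particular tool.

```python
# Each dictionary is one row (a record); each key is one column (a variable).
cars = [
    {"Item": "Jeep", "Color": "Green", "Weight": "2.5 tons"},
    {"Item": "Honda", "Color": "Blue", "Weight": "2 tons"},
    {"Item": "BMW", "Color": "Gray", "Weight": "2 tons"},
]

# The column names are the keys shared by every record.
columns = list(cars[0].keys())

# Reading one column means collecting the same key from every row.
colors = [row["Color"] for row in cars]

print(columns)
print(colors)
```

Rows and columns are only one way to hold this information, but the layout makes the "variables across, records down" idea easy to see.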
Data Schema
A data schema shows relationships between different data units in a data set. For example, the above table showing color and weight for three cars could be related to another table showing price and purchase date for the same cars. A schema between the two would link them through a shared column, such as Item, that identifies each car in both tables.
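One possible way to sketch such a schema in Python is shown here. It describes structure only and holds no data values. The table names (`car_attributes`, `car_purchases`), the column labels for the second table, and the choice of Item as the linking column are all assumptions made for illustration.

```python
# Hypothetical schema: table names and the "Item" link are assumptions.
schema = {
    "tables": {
        "car_attributes": ["Item", "Color", "Weight"],
        "car_purchases": ["Item", "Price", "Purchase Date"],
    },
    "relationships": [
        # Each car in one table matches the row with the same Item in the other.
        {"from": ("car_attributes", "Item"), "to": ("car_purchases", "Item")},
    ],
}


def shared_columns(schema, left, right):
    """Return the columns that appear in both named tables."""
    left_cols = schema["tables"][left]
    right_cols = schema["tables"][right]
    return [col for col in left_cols if col in right_cols]


print(shared_columns(schema, "car_attributes", "car_purchases"))
```

The shared column is what lets the two tables be read together as one data set about the same cars.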
Data Points
A data point is one atomic unit of data. It can exist alone or within another data unit such as a table. In the car table example, Green and 2 tons are examples of data points.
Data Objects
A data object is a collection of one or more data points that create meaning as a whole. Data objects encompass data tables, arrays, pointers, records, files, sets, and scalar types.
In the hierarchy of data terms, data points are the smallest, data objects are larger, and data sets are larger still.
Common Examples of Data Sets
Common, everyday examples of data sets include:
- Class schedule
- Home working schedule
- Student grades on an exam
- Transactions on a website
- Search terms in Google
- Bank statement
- Sport match results
- Athlete statistics
- Performance reviews
- KPIs
Each of these items represents a small data set in its own right. All of them are usually shown as single data tables, although they can be stored and represented in multiple objects.
Original vs Aggregate Data Sets
I can say from experience that the leading cause of confusion regarding data sets is not knowing the difference between original and aggregate data sets. Most non-data professionals have a hard time intuitively understanding the difference, which can lead to frustration for specialists.
So what is the difference? An original data set is one that contains the most granular level of detail available in a normalized structure. By granular, I mean there is no way to “split” the data further. The way it is captured is the way it is represented. By normalized, I mean each line consists of one point of each variable for the given record — there is no crossover.
Take for example, this original data set:
Item | Color | Weight |
---|---|---|
Jeep | Green | 2.5 tons |
Honda | Blue | 2 tons |
BMW | Gray | 2 tons |
Ford | Blue | 2.5 tons |
Chevrolet | Green | 2.5 tons |
Lincoln | Blue | 2 tons |
It’s original because each line represents the most granular level of detail for each car, which also means it’s normalized.
However, we often see data tables formatted like the following:
Color | Number | Avg. Weight |
---|---|---|
Green | 2 | 2.5 tons |
Blue | 3 | 2.17 tons |
Gray | 1 | 2 tons |
This data is not original — it provides information about the original data set by aggregating number and weight at the “color” level of detail.
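To make the step from original to aggregate concrete, here is a minimal Python sketch that rebuilds the color-level table from the six-car original. It uses only the standard library, stores weights as plain numbers in tons, and the variable names are illustrative assumptions.

```python
from collections import defaultdict

# Original data set: one row per car, weight in tons.
cars = [
    ("Jeep", "Green", 2.5),
    ("Honda", "Blue", 2.0),
    ("BMW", "Gray", 2.0),
    ("Ford", "Blue", 2.5),
    ("Chevrolet", "Green", 2.5),
    ("Lincoln", "Blue", 2.0),
]

# Group the individual weights by color.
weights_by_color = defaultdict(list)
for item, color, weight in cars:
    weights_by_color[color].append(weight)

# Aggregate: number of cars and average weight at the color level.
aggregate = {
    color: {
        "Number": len(weights),
        "Avg. Weight": round(sum(weights) / len(weights), 2),
    }
    for color, weights in weights_by_color.items()
}

for color, stats in aggregate.items():
    print(color, stats["Number"], stats["Avg. Weight"])
```

The conversion only works in one direction. Once the weights are averaged by color, the individual cars cannot be recovered from the aggregate table, which is why the original data set matters.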
Common Confusion
The above example is easy to understand in theory, but when we’re dealing with huge databases that consist of complex dimensions, it can be difficult to identify the original set. Moreover, when we’re not familiar with the data set, or there are many data sets in an organization, non-data professionals can find it frustrating to keep track.
This frustration can spill over when data and non-data professionals work together. Imagine two data analysts named Sam and Joe, as well as a marketing professional named James. James asks for data concerning his marketing campaign. Sam provides an aggregate table with information. Sam leaves the company a few days later, but James is having a hard time understanding the data.
When James asks Joe for help, Joe insists on having the original data set during their meeting. However, James provides only the aggregate table he has. Joe is frustrated because they lose time in the meeting without the original data set, which could have been avoided if James had mentioned it earlier.
There is only one effective response to this challenge. Data professionals need to be sensitive to the perspectives of non-analytical colleagues and non-data professionals need to work at understanding the organization’s different original data sets.
Data Set in Math & Statistics
A data set in math is slightly different from the general definition. A math data set is a collection of numbers that can be described by mean, median, and mode calculations.
How is this different from “general” data sets? Mathematical data sets only have numbers, whereas general sets can have numbers and words, or any other data type for that matter. Strictly speaking, one numeric column in a data table could be considered a mathematical data set.
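As an illustration, the weight column from the six-car table earlier can be treated as a mathematical data set. This sketch uses Python's standard `statistics` module; `multimode` (available from Python 3.8) is used because two values are tied as the most frequent. The variable names are assumptions.

```python
import statistics

# The Weight column of the six-car table, in tons.
weights = [2.5, 2.0, 2.0, 2.5, 2.5, 2.0]

mean = statistics.mean(weights)        # sum of values divided by their count
median = statistics.median(weights)    # middle of the sorted values
modes = statistics.multimode(weights)  # every most-frequent value

print(mean, median, modes)
```

Here the mean and median are both 2.25 tons, and both 2 and 2.5 are modes because each appears three times.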
Conclusion
If you liked this article, feel free to check out more free content at the AnalystAnswers.com homepage!