Information is everywhere, and determining its reliability is a crucial skill. While some sources of information are qualitative in nature, the vast majority today are rooted in data. And where data is involved, the source of information is a digital place of storage. This place of storage is called a data source.
This article outlines what a data source is, its different types, and common examples. It is the culmination of primary and secondary research. In addition to drawing on my experience as an analyst, I examined how the term “data source” is used (1) in technical environments such as database documentation and (2) in finance and data publications. Together, these perspectives provide a 360° assessment of the term.
Short Answer: What is a data source?
In short, a data source is the physical or digital location where the data in question is held in the form of a data table, data object, or other storage format.
Full Definition
The first time you heard the term data source was probably as a youth in a science class. At that time, the “source” was likely a table of averages created from a larger data set. It’s easy to understand why we call it a source — it’s where we see the information coming from. But is this really what we mean by data source?
You’ve probably also heard “data source” in the context of fact-checking. When you read an article citing numbers, you might ask who the source is, such as a blog or government website. In this context, what you’re looking for is authority and not the data itself. But we use the same word: data source!
These aren’t the only two examples. Under the title “Types of Data Sources” below I will explore 8 different levels in which we use the term. To establish a full definition, we need these 8 levels.
We can also call them 8 contexts in which we hear the term. Imagine I say, “is it running?” while talking about a faucet, but you’re looking at a magazine called “Running.” We’re going to get confused. With “data source” it’s often the same idea.
A very thorough definition incorporating the 8 levels would be:
A data source is
(1) the physical or digital location where the data in question is stored as a data table (or other format),
(2) the degree of originality of a data table,
(3) a brand-name data provider,
(4) the data used via a self-service data tool such as Excel, Tableau, or Power BI,
(5) the computer storage type, i.e., File Data Source or Machine Data Source,
(6) a database platform such as Amazon Web Services (AWS) or Microsoft Azure,
(7) a legacy data source with a proper name within an organization, or
(8) a data type such as stock, accounting, or economic indicator.
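One way to picture this eight-part definition is as a lookup from context to the kind of answer a listener expects. The sketch below is only an illustration: the names `Level`, `EXPECTED_ANSWER`, and `expected_answer` are hypothetical, and the answer strings paraphrase the level-by-level descriptions in this article.

```python
from enum import Enum, auto


class Level(Enum):
    """The eight contexts in which the term "data source" is used."""
    DATA_TABLE = auto()
    CONCEPTUAL = auto()
    RESEARCH = auto()
    SELF_SERVICE = auto()
    COMPUTER = auto()
    DATABASE = auto()
    LEGACY = auto()
    DATA_TYPE = auto()


# Hypothetical lookup: what "what is the data source?" asks for at each level.
EXPECTED_ANSWER = {
    Level.DATA_TABLE: "the name of the data table",
    Level.CONCEPTUAL: "original or aggregate",
    Level.RESEARCH: "the brand that provides the data",
    Level.SELF_SERVICE: "the tabular data available in the connection",
    Level.COMPUTER: "machine or file data source",
    Level.DATABASE: "the database software brand and the data it hosts",
    Level.LEGACY: "the name of the legacy system",
    Level.DATA_TYPE: "a data type such as stock or economic",
}


def expected_answer(level: Level) -> str:
    """Return the kind of answer expected in the given context."""
    return EXPECTED_ANSWER[level]


print(expected_answer(Level.CONCEPTUAL))
```

The same question maps to a different answer depending on the key, which is the whole point of separating the levels.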
“Data Source” vs “Datasource”
The correct way to write it is “data source.” A quick look shows that Merriam-Webster, Dictionary.com, and Oxford Dictionary all show zero results for both spellings. Clearly, data source is not a word as such.
However, the words “data” and “source” exist in all dictionaries. This means that the correct spelling of the term is as two words, making “data source” an open compound word.
Types of Data Sources
As discussed in the full definition section, the different types of data sources depend on context. To compensate for the variety and to create digestible information, I’ve outlined 8 levels at which the term “data source” often appears. I use the word “level” because we can think of these as “levels of detail.”
1. Data Table Level
The most basic interaction with data sources happens at the data table level. A data table is nothing more than columns and rows. Each row holds an ID and one entry per column describing that row, while each column holds the values of a single attribute for every ID. In my article on data sets, I explain this with the following example table:
Item | Color | Weight |
---|---|---|
Jeep | Green | 2.5 tons |
Honda | Blue | 2 tons |
BMW | Gray | 2 tons |
Ford | Blue | 2.5 tons |
Chevrolet | Green | 2.5 tons |
Lincoln | Blue | 2 tons |
If someone asks “what’s the data source?” at the data table level, the correct response is the name of the data table.
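As a minimal sketch of this idea, the snippet below loads the example table into an in-memory SQLite database and queries it by name. The table name `cars` and the column names are assumptions made for illustration; the article does not name the table.

```python
import sqlite3

# In-memory database so the example is self-contained.
conn = sqlite3.connect(":memory:")

# "cars" is a hypothetical name for the example table above.
conn.execute(
    "CREATE TABLE cars (item TEXT PRIMARY KEY, color TEXT, weight_tons REAL)"
)
conn.executemany(
    "INSERT INTO cars VALUES (?, ?, ?)",
    [
        ("Jeep", "Green", 2.5),
        ("Honda", "Blue", 2.0),
        ("BMW", "Gray", 2.0),
        ("Ford", "Blue", 2.5),
        ("Chevrolet", "Green", 2.5),
        ("Lincoln", "Blue", 2.0),
    ],
)

# At the data table level, the table name IS the answer to
# "what's the data source?"
data_source = "cars"
rows = conn.execute(
    f"SELECT item FROM {data_source} WHERE color = ? ORDER BY item",
    ("Blue",),
).fetchall()
blue_items = [item for (item,) in rows]
print(blue_items)
```

Every query has to name the table it reads from, which is why the table name is the natural answer at this level.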
2. Conceptual Level
When discussing data sources in a professional setting, a common issue is misunderstanding around original data. Most data we consume and read in headlines is aggregate data — data that’s been averaged, summed, divided, or otherwise mathematically manipulated.
Original data is data that exists just as it was collected. Each row represents the raw form of data as it is collected, like the example shown above.
However, if I created a smaller table from the original car data above, holding the count and average weight for each color as in the table below, it would be an aggregate data source.
Color | Number | Avg. Weight |
---|---|---|
Green | 2 | 2.5 tons |
Blue | 3 | 2.17 tons |
Gray | 1 | 2 tons |
You may hear someone ask “what kind of data source is that?” If they’re speaking at the conceptual level, the correct response is either “original” or “aggregate.”
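To make the distinction concrete, here is a short sketch that derives the aggregate table from the original rows. The variable names are illustrative only; the figures come from the two tables above.

```python
from collections import defaultdict

# Original data: one row per car, exactly as collected.
original = [
    ("Jeep", "Green", 2.5),
    ("Honda", "Blue", 2.0),
    ("BMW", "Gray", 2.0),
    ("Ford", "Blue", 2.5),
    ("Chevrolet", "Green", 2.5),
    ("Lincoln", "Blue", 2.0),
]

# Group the weights by color.
weights_by_color = defaultdict(list)
for _item, color, weight_tons in original:
    weights_by_color[color].append(weight_tons)

# Aggregate data: count and average weight (rounded to 2 decimals) per color.
aggregate = {
    color: (len(weights), round(sum(weights) / len(weights), 2))
    for color, weights in weights_by_color.items()
}

for color, (number, avg_weight) in aggregate.items():
    print(color, number, avg_weight)
```

Note that the aggregate can always be rebuilt from the original, but not the other way around: nothing in the aggregate table says which car weighs what.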
3. Research Level
When we’re looking for data from an external provider such as Google Finance or Data.gov, “data source” refers to the brands themselves. This is the research level because it occurs when we’re looking for external data to use in an internal assessment, i.e., research. In my article on data sets, I outlined the following data sources that can be used in research:
- Kaggle. Kaggle has a good variety of data sets on machine learning. It requires registration but is worth it.
- FiveThirtyEight. FiveThirtyEight is a news and sports site with data sets that are available on GitHub.
- BuzzFeed. BuzzFeed is a news and entertainment site that publishes data used in its articles on GitHub.
- NASA. NASA Earth observation data is available on its website.
- Amazon AWS. Amazon’s AWS provides loads of data sets on different topics.
- Google. Google publishes many data sets on its BigQuery tool.
- University of California Irvine. UCI is one of the oldest sources of public data sets on the web. It covers topics ranging from cars to breast cancer.
- Quandl. Quandl is a NASDAQ company with loads of financial data from stock prices to global indicators.
- data.world. data.world is a common source for the famous Makeover Monday data visualization event.
- Data.gov. Data.gov is the US government’s open data. This one is a must!
- The World Bank. A great source for world development data.
- Reddit. Reddit data sets from contributors.
- Weather Underground. Wunderground allows you to manipulate weather forecast data via its API.
- Socrata. Another great place for various data sets.
- Academic Torrents. Academic Torrents allows you to download data from academic papers published all over the world.
- Data Is Plural. A weekly newsletter of insightful data sets.
You may hear the question “what is the data source?” At the research level, it’s the brand that provides the data.
4. Self-Service Application Level
When we’re working with self-service data applications such as Tableau and Power BI, the data source is tabular data available via our connection. We can connect to different servers, tables, and joins, but that is the extent of it.
At the self-service application level, data source can mean data from any brand, and data that’s original or aggregate, as long as it’s available for connection.
You may hear the question “what is the data source?” At the self-service application level, it’s the tabular data available in the connection.
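The idea that the data source is "whatever tabular data the connection exposes" can be sketched with SQLite standing in for a BI tool's connection. The table names and the helper `available_tables` are hypothetical; Tableau and Power BI have their own connectors, which are not shown here.

```python
import sqlite3

# Stand-in for the connection a self-service tool would open.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE cars (item TEXT, color TEXT, weight_tons REAL)")
conn.execute(
    "CREATE TABLE cars_by_color (color TEXT, number INTEGER, avg_weight_tons REAL)"
)


def available_tables(connection: sqlite3.Connection) -> list[str]:
    """List the tables visible through this connection."""
    cursor = connection.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
    return [name for (name,) in cursor]


tables = available_tables(conn)
print(tables)
```

Both an original table and an aggregate one show up side by side, which mirrors the point that this level is indifferent to brand or originality.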
5. Computer Level
When we’re talking about computers and where connection details are actually stored, the topic is slightly different. The computer level does not concern the tabular data used by analysts, but rather how a computer stores the information needed to reach that data.
In the ODBC terminology used on Windows, data sources come in two kinds:
- Machine Data Sources
- File Data Sources
Machine Data Sources are unique to each physical machine. One desktop can have many machine data sources, which are stored in its Windows Registry. These sources are not transferable between machines. Machine Data Sources are further split into user data sources, visible only to the user who created them, and system data sources, visible to every user of the machine.
File Data Sources keep their connection information in standalone text files. They are not tied to one computer and can be copied across devices.
It’s important to note that both sit a level below what most users see on their desktop. For example, an Excel document may function as a data source for self-service applications (discussed above), but it is NOT a machine or file data source. These are two different scopes.
You might hear someone in IT Ops ask a colleague “what is the data source?” If they’re speaking at the computer level, the correct response is machine or file data source.
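For a feel of what a file data source looks like, the sketch below parses the text of an ODBC-style file DSN, which uses an INI-like layout with an `[ODBC]` section. All values (driver, server, and database names) are made up for illustration, and reading it with Python's `configparser` is simply one convenient approach.

```python
import configparser

# Hypothetical contents of a file DSN; every value here is invented.
FILE_DSN_TEXT = """\
[ODBC]
DRIVER=SQL Server
SERVER=example-host
DATABASE=SalesWarehouse
"""

# interpolation=None so a literal "%" in a value would not be misread.
parser = configparser.ConfigParser(interpolation=None)
parser.optionxform = str  # keep keys in their original upper case
parser.read_string(FILE_DSN_TEXT)

settings = dict(parser["ODBC"])
print(settings)
```

Because the whole definition lives in a plain text file, it can be copied to another machine, which is what makes a file data source transferable where a registry-based machine data source is not.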
6. Database Level
Perhaps the most common place for data sources is databases. A database is defined not only by the data it holds but also by the brand of the tool used to create it. Common examples include Microsoft Azure, Amazon AWS, Dynamics 365, and SAP. Each of these tools works as a data warehouse or as an enterprise resource planning (ERP) tool.
If you hear “what is the database data source?” at the database level, the correct answer is the brand name of the software that hosts the data AND the data itself.
7. Legacy Level
Legacy data sources are databases whose technical structure is built within a company that does not specialize in database creation.
Many digital companies have built internal data warehouses to handle transactional data. Today, databases are most often outsourced (to AWS or Azure for example), but there was a time when in-house solutions were preferable. As you can imagine, once the data infrastructure is set, it’s not altogether easy to modify, so these legacy systems still exist in many places.
You may hear the question “what is the data source?” If the question is at the legacy level, the correct answer is the name of the legacy system.
8. Data Type Level
Data sources can also be thought of as data types, such as accounting, stock, transactional, or economic indicators. Usually the data type comes from an external source, and there are few subcategories to choose from.
For example, NASA Earth Observation Data is concerned with the biosphere, agriculture, and other Earth-related topics.
If someone asks “what data source do we need?” at the data type level, the answer could be stock, economic, Earth, health, or other.
Data Source Name (DSN)
The above levels may feel wishy-washy in some ways. The term “data source” seems to apply to levels with different “intensity.” This is true: the term is heavily dependent on context.
However, one thing remains consistent. Tabular data at any of the above levels can be reached through a Data Source Name (DSN). A DSN is a named set of connection details, stored in one of several formats, that contains the information required to connect to a digital data source. Importantly, it identifies where the targeted data lives. DSNs are widely used wherever applications connect to digital data.
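In practice, an application usually refers to a DSN by name inside a connection string. The sketch below assembles such a string in the common ODBC `KEY=value;` style. The helper `build_connection_string`, the DSN name `SalesWarehouse`, and the user `analyst` are all hypothetical.

```python
def build_connection_string(dsn_name: str, **options: str) -> str:
    """Join a DSN name and optional settings into 'KEY=value;KEY=value' form."""
    parts = [f"DSN={dsn_name}"]
    parts += [f"{key}={value}" for key, value in options.items()]
    return ";".join(parts)


# Hypothetical DSN name and user, for illustration only.
connection_string = build_connection_string("SalesWarehouse", UID="analyst")
print(connection_string)
```

A driver library such as pyodbc accepts a string of this shape when opening a connection; that step is left out here because it requires a DSN actually configured on the machine.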
A Note on Paper Data Sources
We’ve talked a lot about digital data sources, but paper data sources still exist today. For example, much of the data used to analyze historical events is held on paper. There are exhaustive projects within European governments to convert these into digital data, but for now the source is indeed paper.
For example, if you want to know the average length in words of books published in Germany in the 1890s, your source data is paper, not digital. Even when these books are all converted to digital, the ultimate data source will remain paper.
Data Source Usage: Analytics and Operations
Most of our experience with data sources occurs in the context of analytics, but in reality the most common usage is automated operations. IT systems that operate automatically run continuous queries against databases. As people, we don’t see these systems, but they make up the vast majority of data source instances.
When we think about data sources, we should remember that they include not only the ones we see, but also virtually every system we don’t see.
Data Source in Terraform
Terraform is “infrastructure as code” software, and a common question is how its data sources work.
As described in the Terraform documentation, “a data source is accessed via a special kind of resource known as a data resource, declared using a data block”:

    data "aws_ami" "example" {
      most_recent = true
      owners      = ["self"]
      tags = {
        Name   = "app-server"
        Tested = "true"
      }
    }

Unless you’re a developer specialized in this language, you probably won’t need to know this in detail, so we’ll leave it at that.
Conclusion
Data source is a complex term. Its use in 8 different contexts makes it interesting to define. Simply put, a data source is the physical or digital location where the data in question is held in the form of a data table, data object, or other storage format.
But you may also see it used at the:
- Data Table Level
- Conceptual Level
- Research Level
- Self-Service Application Level
- Computer Level
- Database Level
- Legacy Level
- Data Type Level
If you found this article helpful, feel free to check out the AnalystAnswers.com homepage for free content on data, finance, and business analytics.