Data points, also known as data items, are the atomic state of data. Conceptually, you can think of them as one cell in a data table, or one piece of information, about an observation, at a given point in time.
At face value, they seem so simple that many analysts blow past them without a second thought. However, data points can be tricky for two reasons: analysts have limited visibility into how the data was collected, and aggregations can exclude data points in ways that misrepresent the underlying data.
This article (1) defines data points, (2) explores different types, and (3) provides examples. It also addresses the important points of (4) “unknown unknowns” due to data collection and (5) representativity through aggregation.
Electrical Data Points (Not the Scope of this Article)
A quick note on content. In electronics and wire networks, usually in the UK and Australia, “data point” refers to an access point or socket for cable or telephone wires in a home. This is not the scope of this article. If you’re looking for information on that, check out TheLocalElectrician.com’s article.
Data Point Definition
In general, any fact or piece of information is a data point.
In data analysis and statistics, a data point is a piece of information that describes one unit of observation, at one point in time, at the data collection level. It most commonly appears as one cell in a data table.
Oxford Languages defines data point as “an identifiable element in a data set,” but this is not entirely accurate. While a data point is an identifiable element in a data set, so too is each row (aka “record” or “tuple”) an identifiable element, but it is not a data point. Instead, rows are collections of data points.
Moreover, data points should not be confused with values at the data analysis level. Analysis often aggregates collected data to extract insights, and the resulting values are summaries, not true data points.
Unit of Observation
Data points are best understood against the backdrop of units of observation. A unit of observation is the “thing” that your data describes. Imagine you’re collecting data on butterflies. Each butterfly is one unit of observation.
You may collect information such as the continent where the butterfly is found, the color of its wings, its weight, and its speed. Each of these pieces of information is called a dimension, and each entry in a cell is a data point. Each data point describes one unit of observation (in this case, one butterfly).
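As a minimal sketch of this idea, the butterfly table can be modeled in Python as a list of rows, where each row is one unit of observation and each value is one data point. The records and column names below are invented for illustration only.

```python
# Hypothetical butterfly observations: each dict is one unit of observation (one row).
butterflies = [
    {"continent": "North America", "wing_color": "Orange", "weight_g": 0.5, "speed_kmh": 9.0},
    {"continent": "North America", "wing_color": "Blue", "weight_g": 0.3, "speed_kmh": 8.0},
    {"continent": "Europe", "wing_color": "White", "weight_g": 0.2, "speed_kmh": 7.5},
]

# The dimensions are the column names shared by every row.
dimensions = list(butterflies[0].keys())

# A data point is a single cell: one dimension of one unit of observation.
first_wing_color = butterflies[0]["wing_color"]

# Every cell is a data point, so the total is rows x dimensions.
data_point_count = sum(len(row) for row in butterflies)

print(dimensions)
print(first_wing_color)   # Orange
print(data_point_count)   # 12
```

Three butterflies described along four dimensions give twelve data points in total.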
Data points are either words, numbers, or other symbols. These are the types of data points we create in, and query from, data tables. In most software, the common five types are:
- Integer – any whole number, meaning a number with no fractional part
- Date – a calendar date, made up of a year, a month, and a day
- Time – the time of day
- Text – often referred to as “string,” means any sequence of characters, which can include letters, digits, and other symbols
- Boolean – TRUE or FALSE data, often mapped to YES or NO text, or to the numbers 1 and 0. It is, in simple terms, binary data.
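To make the five common types concrete, here is a small Python sketch of one observation that holds one data point of each type. The field names and values are hypothetical, and Python's built-in types stand in for whatever your own software calls them.

```python
from datetime import date, time

# One hypothetical observation, with one data point per common type.
observation = {
    "wing_count": 4,                      # Integer
    "date_observed": date(2023, 6, 14),   # Date
    "time_observed": time(14, 30),        # Time
    "wing_color": "Orange",               # Text (string)
    "is_migratory": True,                 # Boolean
}

# Look up the Python type that backs each data point.
type_names = {name: type(value).__name__ for name, value in observation.items()}

# Booleans are often stored as 1/0 or YES/NO instead of TRUE/FALSE.
as_number = int(observation["is_migratory"])
as_text = "YES" if observation["is_migratory"] else "NO"

print(type_names)
print(as_number, as_text)   # 1 YES
```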
The above are simple, big-picture data point types, but they’re far from comprehensive. In fact, we can dig deeper with the following list:
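- Numeric Data Item Types
  - Integer – any whole number, whether positive, negative, or zero. Examples include -11, 34, 0, 100.
  - Tinyint – an integer stored in a single byte; in SQL Server, only whole numbers from 0 to 255
  - Bigint – an integer stored in 8 bytes, which allows values up to roughly 9.2 quintillion in either direction
  - Float – an approximate number with a floating decimal point; it can hold very large or very small values, which are often written in scientific notation
  - Real – in SQL Server, a floating-point number with less precision than float (4 bytes instead of 8)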
- Date and Time Data Item Types
  - Date – the date, which can be displayed in different formats, including “mm/dd/yyyy” (US), “dd/mm/yyyy” (Europe), “mmmm dd, yyyy”, and “mm-dd-yy” among many more.
  - Time – the time of day, broken down as far as milliseconds
  - Datetime – the date and time value of an event
  - Timestamp – in MySQL, stores the number of seconds that have passed since 1970-01-01 00:00:00 UTC
  - Year – in MySQL, stores years ranging from 1901 to 2155, entered in two-digit or four-digit form
- Character and String Data Item Types
  - Char – fixed length of characters, with a maximum of 8,000
  - Varchar – a maximum of 8,000 characters like char, but each entry can differ in length (variable)
  - Text – similar to varchar, but the maximum is 2GB of data instead of a specific length
- Unicode Character and String Item Types – Unicode is a standard that assigns every character a code point written in the form U+0000, where the zeros stand for hexadecimal digits (for example, U+0041 is “A”)
  - nchar – fixed length with a maximum of 4,000 characters in SQL Server
  - nvarchar – variable length with a maximum of 4,000 characters in SQL Server
  - ntext – variable-length storage, only now the maximum is about 1 billion characters rather than a specific declared length
- Binary Data Item Types – data stored as raw bytes, that is, a combination of 0s and 1s
  - binary – fixed length with a maximum of 8,000 bytes
  - varbinary – variable length with a maximum of 8,000 bytes
- Miscellaneous Data Item Types
  - clob – short for Character Large Object; stores very large blocks of text, up to 2GB in some databases
  - blob – short for Binary Large Object; stores large binary objects such as images or audio
  - xml – a data type that stores XML data. XML stands for Extensible Markup Language and is common in databases
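A few of these definitions are easier to see with numbers. The Python sketch below is an illustration under stated assumptions: the helper `integer_range` is hypothetical, tinyint is treated as an unsigned 1-byte integer, and bigint as a signed 8-byte integer. Your own database may define these types differently.

```python
from datetime import datetime, timezone

def integer_range(num_bytes, signed=True):
    """Return (min, max) for an integer stored in num_bytes bytes."""
    bits = num_bytes * 8
    if signed:
        return -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    return 0, 2 ** bits - 1

# Assumption: tinyint is 1 unsigned byte, bigint is 8 signed bytes.
tinyint_range = integer_range(1, signed=False)
bigint_range = integer_range(8)

# A timestamp counts the seconds elapsed since 1970-01-01 00:00:00 UTC.
moment = datetime(2000, 1, 1, tzinfo=timezone.utc)
seconds_since_epoch = int(moment.timestamp())

# A float can hold very large values, usually written in scientific notation.
large_value = float("6.02e23")

# A Unicode code point is written as U+ followed by hexadecimal digits.
code_point = f"U+{ord('A'):04X}"

print(tinyint_range)         # (0, 255)
print(bigint_range)          # (-9223372036854775808, 9223372036854775807)
print(seconds_since_epoch)   # 946684800
print(code_point)            # U+0041
```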
Data Point vs Data Set
In another article on data sets, I explain that data sets are not only data tables, but also any collection of one or more data objects (including tables) that are grouped together either because they’re stored in the same location or because they’re related to the same subject.
We’ve already talked about data points in data tables and shown that a point represents one cell. The same logic applies to all data objects that constitute a data set.
In an array, record, or set, a point represents 1 cell. In a pointer object written as a dimension, points also represent 1 cell. In a scalar object, the single value of the scalar is a data point.
In files and schemas, data points do not exist. This is due to the nature of these items. A file is code written to ensure the correct structure of another data object, and in some sense it could be considered a non-data object.
Schemas are summaries of other objects, and they ignore points entirely in order to quickly communicate object contents.
Data Point vs Data Attribute
A data attribute is synonymous with a data dimension. It’s the header of a column in a table. In the example of the butterfly data, wing color is an attribute.
Data points, therefore, are one single value entry of an attribute. To learn more about data attributes, check out this article.
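The distinction can be sketched in a few lines of Python. The rows and the `species` column are made up for illustration. The attribute is the whole column, and a data point is one entry in it.

```python
# Hypothetical butterfly rows; "wing_color" is the attribute (column header).
rows = [
    {"species": "Monarch", "wing_color": "Orange"},
    {"species": "Blue Morpho", "wing_color": "Blue"},
    {"species": "Cabbage White", "wing_color": "White"},
]

attribute_name = "wing_color"

# The attribute spans every row: it is the full column of values.
attribute_values = [row[attribute_name] for row in rows]

# A data point is one single value of that attribute, for one row.
data_point = rows[0][attribute_name]

print(attribute_values)   # ['Orange', 'Blue', 'White']
print(data_point)         # Orange
```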
Data Point vs Data Field
A data field is synonymous with a data attribute, although it is used in a slightly different way. “Field” usually refers to the column in a table itself, whereas “attribute” usually refers to the column when we’re talking about a specific row.
For example, you would say “Color of Wings” is a data field, but you would say “the Color of Wings attribute for Monarch butterflies is orange.”
Moreover, “field” has a technical meaning in the context of programming languages that “attribute” does not. To learn more about data fields, check out this article.
Unit of Observation vs Unit of Analysis
The most common cause of confusion around data points is the difference between units of observation and units of analysis.
Units of analysis are the rows we have in a data table after analyzing and aggregating data. As discussed above, units of observation are the rows of the base data set, each of which is a collection of data points.
Using our example of the butterflies, let’s say our unit of analysis is “Continents Where Found” and we want to know how many colors and butterflies are found on those continents. Here’s what it would look like in a unit of observation view and a unit of analysis view:
As you can see, the analytical view counts the distinct number of butterflies and the colors of wings present in each continent. This is an aggregation, and now we are missing the original data points.
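For readers who want to reproduce this kind of aggregation, here is a minimal Python sketch. The four observations are invented for illustration and are not the article's original table. The sketch groups the unit-of-observation rows by continent and counts distinct butterflies and distinct wing colors.

```python
from collections import defaultdict

# Unit of observation view: one row per butterfly (hypothetical data).
observations = [
    {"butterfly_id": 1, "continent": "North America", "wing_color": "Orange"},
    {"butterfly_id": 2, "continent": "North America", "wing_color": "Blue"},
    {"butterfly_id": 3, "continent": "North America", "wing_color": "Orange"},
    {"butterfly_id": 4, "continent": "Europe", "wing_color": "White"},
]

# Collect the distinct butterflies and wing colors seen on each continent.
ids_by_continent = defaultdict(set)
colors_by_continent = defaultdict(set)
for row in observations:
    ids_by_continent[row["continent"]].add(row["butterfly_id"])
    colors_by_continent[row["continent"]].add(row["wing_color"])

# Unit of analysis view: one row per continent, holding only the counts.
analysis_view = [
    {
        "continent": continent,
        "butterflies": len(ids_by_continent[continent]),
        "wing_colors": len(colors_by_continent[continent]),
    }
    for continent in sorted(ids_by_continent)
]

for row in analysis_view:
    print(row)
```

The analysis view reports that North America has three butterflies and two wing colors, but it no longer says which butterfly had which color. That detail lives only in the original data points.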
The removal of original data for analytical purposes is necessary to gain insight from big data, but there is debate around when it should and should not be done. The next short section discusses this and other important risks.
Data Collection Constraints & Portrayal through Aggregation
Portrayal Through Aggregation
As we’ve seen, representing data points can become a challenge at the analytical level because any aggregate we choose to employ will remove some data points. In other words, analysts make choices about how to treat data points, and this impacts our understanding of the data.
And to understand the impact of these choices, you don’t need to go as far as moral or ethical consequences.
The reader of your analysis will be swayed in the direction of what you communicate (unless he/she performs the full analysis independently, which rarely happens in companies and would make data analysts redundant). While you, as the analyst, are aware of the data “lost” after aggregation, the reader will seldom retain it, even if it is disclosed.
This means you must be conscious when choosing your aggregations, and the data points you are willing to “remove” at the analytical level.
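A small Python sketch shows how much the choice of aggregate can matter. The weights below are hypothetical, and mean versus median is just one example of such a choice, not a recommendation.

```python
from statistics import mean, median

# Hypothetical butterfly weights in milligrams; one specimen is unusually heavy.
weights_mg = [200, 300, 300, 400, 2800]

# Two common aggregates of the same data points.
mean_mg = mean(weights_mg)
median_mg = median(weights_mg)

print(mean_mg)     # 800
print(median_mg)   # 300
```

Both numbers are correct summaries of the same five data points, yet a reader shown only the mean would picture much heavier butterflies than a reader shown only the median. Neither aggregate reveals the single heavy specimen on its own.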
Data Collection Constraints
As shown above, data points are sensitive to levels of detail, so you have to be careful to treat them with the proper conceptual hierarchy in mind. That’s harder than it sounds. Levels of detail are easy to identify within a data table, but many times they exist only as a concept in the mind of the data collector.
For example, in the butterfly data, the two dimensions are the continent where the butterflies are found and the color of their wings. Both “North America” and “Orange” are examples of data points, and “North America” will likely have butterflies with many different wing colors (in this case two).
However, these two dimensions were the choice of the data collector. If s/he had added “Country,” the level of detail for each data point would be more granular. In other words, data points are limited by and dependent on data collection.
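The effect of adding a “Country” dimension can be sketched in Python. The observations and the helper `count_by` are hypothetical. The point is that the finer level of detail exists only if the collector recorded it.

```python
from collections import Counter

# Hypothetical observations collected WITH a country dimension.
observations = [
    {"continent": "North America", "country": "Canada", "wing_color": "Orange"},
    {"continent": "North America", "country": "Mexico", "wing_color": "Orange"},
    {"continent": "North America", "country": "Mexico", "wing_color": "Blue"},
]

def count_by(rows, *dimensions):
    """Count rows for each combination of the given dimensions."""
    return Counter(tuple(row[d] for d in dimensions) for row in rows)

# Coarse level of detail: continent only.
by_continent = count_by(observations, "continent")

# Finer level of detail: possible only because "country" was collected.
by_country = count_by(observations, "continent", "country")

# If the collector had skipped "country", the finer view could not be rebuilt.
without_country = [
    {key: value for key, value in row.items() if key != "country"}
    for row in observations
]
country_recoverable = "country" in without_country[0]

print(dict(by_continent))
print(dict(by_country))
print(country_recoverable)   # False
```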
Analysts need to be aware of this shortcoming and be equipped to explain it to anyone who views their visualizations.
If you found this article helpful, feel free to check out more free content on data, finance, and business analysis at the AnalystAnswers.com homepage.