The first thing that comes to mind when I think of data is information. It’s easy to forget that data is simply a digital reflection of real-world objects and ideas. Just as objects and ideas fade in the “real world,” so too does the data that represents them. This concept is called the life of data.
Moreover, just as real-world objects are not productive and fruitful their whole life, neither is data. The useful life of data is defined as the duration of time in which a data point accurately reflects the real-world information it originally recorded.
The value of a data point’s useful life depends entirely on time. The concept of a useful life is only relevant for data that reflects information in the present day. Historical data used as a reflection of the past never loses its useful life, since its intended use is not to inform present-day action, but to provide a historical reference for the present day.
In other words, when dealing with the useful life of data, we’re asking the questions:
- How well does the data reflect the real-world object or idea?
- Is it outdated?
- If I use this data to make a decision, is there a risk I’m ill-informed?
Examples to Explain the Useful Life of Data
Here are a few cases where the useful life of data is treated both correctly and incorrectly. We’ll use financial, humanitarian, and pricing examples to cover various behaviors of real-world information.
Financial Data
Imagine it’s December 2020. You own a watch company and want to see how it is performing. However, your accountants and financial analysts tell you that the 2020 financial data is not yet ready. Instead, they can show you the 2019 financial data.
Correct treatment of Useful Life of Data. Since the data is 1 year old, and we’re interested in understanding data that’s less than a year old (January 1st, 2020 onward), the useful life of the 2019 data has passed. You would be better off waiting for the 2020 numbers to evaluate how well the business is performing.
Incorrect treatment of Useful Life of Data. Because you want to see business performance, some numbers are better than no numbers. You take a look at the 2019 numbers and see that revenue is significantly lower than you expected. You decide to overhaul the business for 2021, paying overtime and increasing costs to drive up revenue.
The reason this is an incorrect treatment is that you’re making a present-day decision based on data past its prime.
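The financial example comes down to a date comparison: does the period the data covers reach into the window you care about? Here is a minimal Python sketch of that check. The function name and the assumption that the 2019 figures cover a period ending December 31st, 2019 are illustrative, not part of the scenario.

```python
from datetime import date


def is_past_useful_life(data_period_end: date, window_start: date) -> bool:
    """Return True when the data stops before the window we want to understand.

    Hypothetical helper: it only compares dates and says nothing about
    how quickly the underlying business actually changes.
    """
    return data_period_end < window_start


# Assumed dates for the watch company example.
fy2019_end = date(2019, 12, 31)   # last day covered by the 2019 financials
window_start = date(2020, 1, 1)   # we care about January 1st, 2020 onward

print(is_past_useful_life(fy2019_end, window_start))  # True
```

A date check like this only flags the risk. Whether stale numbers are still "close enough" remains a judgment call, as the humanitarian example shows.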
Humanitarian Data
Imagine you work for a non-profit organization that aims to eliminate the practice of child labor in East Africa. The data your organization uses is collected from surveys, which go out every 4 years, the last round being completed 3 years ago. You want to create a presentation using the organization’s data to convince donors to participate in raising funds.
Correct treatment of Useful Life of Data. You include data from the last survey but add a disclosure that the data reflects an older period. However, you say that the situation is rather stable. Over the past 4 reports (16 years), very little has changed in East Africa, with a steady decrease of 0.5% in child labor rates. This means the current situation is likely similar to the numbers given in the reports.
Incorrect treatment of Useful Life of Data. You show the data and explain how money contributed by donors will go to specific initiatives tied to key findings in the data. One orphanage in Somalia had started to show a tendency to regularly push children into child labor, and a new arrival of orphans is expected in 2 months (if the data is correct). Money donated will fight this practice.
The reason this treatment of the Useful Life of Data is wrong is that it bases an actionable decision (preventing an upcoming arrival of children) on outdated data. It’s reasonable to assume that orphan arrival timelines for the orphanage in question have changed in the 3 years since the last survey, but you won’t know until more recent survey results come in.
Pricing Data
Imagine it’s December 2020 again at your watch company. You want to enter a new state market, but there’s strong competition there because the people of this state love watches. You go to a market research company that compiles data for you to help decide where to open a boutique: in the north, south, east, or west.
They return after a week and a half with a large report that suggests the north is the best location, in a big city called Northsville. This seems strange to you because there was recently a big factory chemical spill outside of Northsville, and the city was evacuated. In addition, a city in the south, Southsville, just built a pop-up community with great amenities and boutiques close by.
Correct treatment of Useful Life of Data. You question the market researchers about these new developments and why they still seem confident about Northsville. To your dismay, they admit having used only Q1 data for the assessment instead of Q4 data. You thus prefer your observations over the research data.
Incorrect treatment of Useful Life of Data. You assume the market research is correct without looking at source data and its useful life. Because you’re making a decision in the present, data from Q1 on the market is null and void.
How to Calculate Useful Life of Data
Our examples thus far have focused on qualitative cross-checks on quantitative data. We’ve just tried to answer the question: “how well does the data reflect the real-world object or idea in the present day?” But there is a way to semi-formalize this idea mathematically.
The way we do so is similar to the idea of expected value: our confidence in the data depends on its value multiplied by the probability that this value is still relevant in the present day. This requires some amount of intuition, but it helps guide our thinking.
We can write it as: RV × P, where RV stands for real value, a score between 0 and 10 that assumes the data is 100% relevant, and P is the probability (between 0 and 1) that this RV applies to the present day.
For example, you might say that the RV is 8, meaning it carries a high value, but that it only slightly applies to the present day, with P at 0.1. This means that on a scale of 0 to 10, its present-day value is only 8 × 0.1 = 0.8, a very low score.
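The same arithmetic can be sketched in a few lines of Python. The function name and the input checks are assumptions added for illustration. The only part taken from the text is the product of the real value (RV) and the probability (P).

```python
def present_day_value(real_value: float, relevance_probability: float) -> float:
    """Score a data point as real value times probability of present-day relevance.

    Assumes real_value is on a 0-to-10 scale and relevance_probability
    is a probability between 0 and 1 (both are intuition-based estimates).
    """
    if not 0 < real_value <= 10:
        raise ValueError("real_value must be greater than 0 and at most 10")
    if not 0 <= relevance_probability <= 1:
        raise ValueError("relevance_probability must be between 0 and 1")
    return real_value * relevance_probability


# The worked example from the text: RV = 8, P = 0.1.
score = present_day_value(8, 0.1)
print(round(score, 2))  # 0.8
```

Because both inputs are estimates, the score is best used to compare data sets against each other, not read as a precise measurement.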
Useful Life of Data and ROT Data
When we talk about the Useful Life of Data, we need to talk about ROT data. In some ways, they are two sides of the same coin. ROT stands for Redundant, Obsolete, and Trivial data.
Redundant data refers to the storage of duplicate data in two or more places. Redundant data is disadvantageous because it takes up memory space without adding any value. It does not have anything to do with that data losing its useful life, but falls under the same umbrella of value-less data.
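As a rough illustration of spotting redundant data, the Python sketch below looks for identical records held in more than one place. The store names and records are hypothetical, and it only catches exact duplicates. Near-duplicates (different spelling, formatting, or field order) would need additional matching logic.

```python
from collections import defaultdict


def find_redundant_records(stores):
    """Map each record found in two or more stores to the stores holding it.

    `stores` is a dict of store name -> list of hashable records (tuples here).
    """
    locations = defaultdict(set)
    for store_name, records in stores.items():
        for record in records:
            locations[record].add(store_name)
    return {
        record: sorted(names)
        for record, names in locations.items()
        if len(names) > 1
    }


# Hypothetical data: the same customer row saved in two systems.
stores = {
    "crm": [("C001", "Ada"), ("C002", "Grace")],
    "billing": [("C001", "Ada"), ("C003", "Alan")],
}

print(find_redundant_records(stores))
# {('C001', 'Ada'): ['billing', 'crm']}
```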
Obsolete data is a synonym for data beyond its useful life. Obsolete data refers to information about the past that no longer carries any value in the present. Unlike time series data, which concerns the past and is useful for showing trends, obsolete data doesn’t maintain referential value. An example of obsolete data is a data point with incorrect information collected at a past date.
Trivial data refers to data that is not duplicate, but that carries no value for present-day decision-making. Its only similarity to the useful life of data is the dimension of uselessness.
Data Degradation and the Useful Life of Data
Data degradation is similar to data ROT, but instead of affecting the data itself, data degradation affects the storage device, such as a memory card or computer RAM.
In a sentence, data degradation is the accumulation of non-critical system failures over time that ultimately corrupt the data they store.
In fact, one of the 3 principal tasks of a database analyst is to maintain devices and data formats over time to prevent data degradation and, when they cannot prevent it, to show what impact data decay will have on the system.
For example, in 2019 I was working on a project with an Enterprise Resource Planning (ERP) system that tracked user activity on the platform. That user activity is important from a finance perspective because it helps reveal whether the company is overpaying for unused access rights.
However, after a few hours of investigation we realized that data more than 4 months old was unreliable. The reason is that the ERP’s storage devices trim the data over time to save costs. After three months, we only had partial information, which was not enough to perform an audit.
In other words, the useful life of data usually refers to the nature of the data itself: whether its value stays in the past and no longer informs present-day decisions. That said, data degradation could be considered a way in which data loses its useful life, since even data in its prime could be rendered useless by data degradation.
What to do With Data Past Its Prime
Now that we have distinguished between data ROT, data degradation, and the useful life of data, the question arises: what do we do with data that is past its prime?
When data is past its prime, the best thing to do is delete it. For many people, the idea of deleting data feels almost criminal. How could you delete data that could carry valuable information? I often feel this way, but I try to remember that we delete data only when we’re sure it carries no more value.
By holding on to data outside its useful life, we have to pay to store it and risk overloading our devices, which leads to data degradation of data that’s still in its useful life! That’s why, at the end of the day, just do it: delete data outside its useful life.
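Before deleting anything, it can help to separate what has expired from what should stay. The Python sketch below is one way to do that, assuming each record carries a recording date and a flag for data kept purely as historical reference (which, as noted earlier, never loses its useful life). The `Record` fields, the sample records, and the 180-day threshold are all hypothetical.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass(frozen=True)
class Record:
    name: str
    recorded_on: date
    historical_reference: bool = False  # kept regardless of age


def split_by_useful_life(records, as_of, max_age):
    """Return (keep, delete): records still useful vs. records past their prime."""
    keep, delete = [], []
    for record in records:
        expired = (as_of - record.recorded_on) > max_age
        if expired and not record.historical_reference:
            delete.append(record)
        else:
            keep.append(record)
    return keep, delete


# Hypothetical records and an assumed 180-day useful life.
records = [
    Record("q1_market_report", date(2020, 3, 31)),
    Record("q4_market_report", date(2020, 11, 30)),
    Record("revenue_2015_archive", date(2015, 12, 31), historical_reference=True),
]

keep, delete = split_by_useful_life(
    records, as_of=date(2020, 12, 15), max_age=timedelta(days=180)
)
print([r.name for r in delete])  # ['q1_market_report']
```

Returning the deletion candidates as a list, without removing them on the spot, leaves room for a final review before anything is actually erased.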