It may sound strange to have a dictionary for data since a dictionary is data, but these metadata repositories are an essential element to maintaining and communicating the structure of any underlying database.
In short, a data dictionary describe a table’s columns based on common traits (i.e name, definition, data type) within another table. Admins use data dictionaries when a data table is simply too large to view directly. Data dictionaries allow readers to understand complex databases without having to investigate each column. You can think of them as a summary of data about data.
Before we look deeper, make sure you know what these are:
- Database object – structure that stores or references data, most common example is a table
- Metadata – another word for “data about data,” most common example is the description you read on Google before clicking a link
If you’re looking for an intro to data analysis, you can get the free Intro to Data Analysis eBook, which will ensure you build the right practical skills for success in your analytical endeavors.
Structure and key components of data dictionaries
Like databases themselves, data dictionaries are almost always stored as tables (as opposed to other database objects we’ll discuss later).
Understanding them takes some getting used to because the mind intuitively wants to read data dictionaries as the data itself — but they’re not!
For example, if you have a database about an e-commerce website’s users, then you may want to store a column containing each customer’s age. A data dictionary for the age column “turns it sideways.”
Data table in database
Data dictionary for Customer_Age column:
|Name||Definition||Data type||Possible values||Required?|
|Customer_Age||Age of users||Integer||15, 18, 20, 23, NULL||No|
The definition for customer age is, simply, “Age of users.” These entries will all be 0 or greater, so they’re integers. Some examples include 15, 18, 20, and 23. Finally, this is not a mandatory field, which explains why there are “NULL” values in our data table.
It may seem like the data dictionary simply complicates, rather than simplifies, the job. But think of this. Imagine you have 10,000,000 rows and 500 columns of data. You plan to use a coding language to crunch the numbers for your analysis, but first you need to know if there are any “NULL” values.
If you try to manipulate the table directly to find NULLS, you put significant strain on the system, which can slow it down and ultimately slow you down.
Instead, you can simply look in the data dictionary to see if “NULL” or “n/a” are possible values. If not, then you know that you may begin your analysis.
However, if NULL is a possibility, you’ll need to spend some extra time manipulating the data for your analysis.
Helpful video that explains data dictionaries
Data dictionary data types
One of the most common fields (aka column headers) in a data dictionary is “data type” (highlighted in green above). Each coding language (JAVA, SQL, etc) has its own data types, but we almost always use SQL data types, as this is the dominant database language.
(Side note: we’re only discussing BI/User-side databases. Other types involve the collection point of data, and they may use alternative coding languages.)
Data types are divided into 6 different categories below, for a total of ~25 types. However, we really only use 5 on a daily basis. Let’s cover those first, then look at the others. The three most important data types are:
- Integer – any number that doesn’t have a decimal point
- Date – a date of a given year and month
- Time – the time of day
- Text – often referred to as “string,” means simply any combination of letters instead of numbers or other symbols
- Boolean – TRUE or FALSE data, often migrated to YES or NO text, or 1 and 0 numbers. It is, in simple terms, binary data.
These are the five most common types. If you plan to memorize any, memorize these.
But we still need to understand the six categories in data types:
- Numeric Data Types
- Integer – any number that is not a decimal. Examples include -11, 34, 0, 100.
- Tinyint – an integer, but only numbers from 0 to 255
- Bigint – an integer bigger than 1 trillion
- Float – numbers too big to write out, and the scientific method is needed
- Real – any fixed point on a line
- Date and Time Data Types
- Date – the date sorted in different forms, including “mm/dd/yyyy” (US), “dd/mm/yyyy” (Europe), “mmmm dd, yyyy”, and “mm-dd-yy” among many more.
- Time – the time of day, broken down as far as milliseconds
- Date time – the date and time value of an event
- Timestamp – stores number of seconds passes since 1970-01-01 00:00:00’ UTC
- Year – stores years ranging from 1901 to 2155 in two-digit or four-digit ranges
- Character and String Data Types
- Char – fixed length of characters, with a maximum of 8,000
- Varchar – max of 8,000 characters like char, but each entry can differ in length (variable)
- Text – similar to varchar, but the maximum is 2GB instead of a specific length
- Unicode Character and String Types – unicode is a way of structuring data in the form of U+0000, where the 0’s can be any type
- nchar – fixed length with maximum length of 8,000 characters
- nvarchar – variable length with maximum of 8,000 characters
- ntext – variable length storage, only now the maximum is 1GB rather than a specific length
- Binary Data Types – a combination of 0s and 1s
- binary – fixed length with maximum of 8,000 bytes
- varbinary – variable length storage with maximum bytes, topped at 8,000
- Miscellaneous Data Types
- clob – also known as Character Large Object, is a type of sub-character that carries Unicode texts up to 2GB
- blob – carries big binary objects
- xml – a specific data type that stores XML data. XML stands for extensible markups language, and is common in data bases
Abbreviating fields and data dictionaries, of data dictionaries
An important thing to remember about data dictionaries is that they usually come in numbers greater than two. What I mean is that you often need a separate data dictionary to explain abbreviations in the first data dictionary. Let’s take a look at our example above to understand better.
In other words, data dictionaries for abbreviations are data about data that’s about data.
Look at column 2, “Customer_Height_CM.” That title is a bit long to write out in data dictionary every time we would like to reference it. Instead, maybe we just want to simply write CH_CM. This practice is more common, and very useful, when we talk about long proper nouns, like expansive names of businesses or organizations. For example, the authority on business analysis is the International Institute of Business Analysis, but we might want to abbreviate this to IIBA in data tables.
It may seem unreasonable to create a second data dictionary for one column. And you would be right. Abbreviations become useful when we have a huge data dictionary with many, many (like 1000s) of names as long as the IIBA. If you’ve never tried to manage a table of this size, then you’ll have to take my word for it: you will be very happy to have data on data on data! It simplifies a complex hierarchy.
Types of data dictionary
So far we’ve talked about how data dictionaries work and their contents, but these elements can interact indifferent ways with the user. Different types of data dictionaries serve different purposes. You can roughly divide data dictionaries into two main categories: active data dictionaries and passive data dictionaries.
- Active data dictionary. In these active cases, any change a user makes to the database itself will be automatically reflected in the data dictionary. Because this process is automatic, it’s almost exclusive to database management systems, which most professional organizations will have.
- Passive data dictionary. Passive data dictionaries are slightly more complicated, and they can take many forms. In a sentence, they’re data dictionaries that do not automatically update based on changes in the underlying database. Some examples include:
- Document or spreadsheet – a spreadsheet, such as in excel, is probably the most common amateur database technology, and it is just as useful for data dictionaries. We say that excel data dictionaries are passive because by default the technology is not built to automate database-to-data dictionary encoding. However, a well-structured excel document can easily do this automatically.
- Data Catalogs – I love the blasé use of the word “catalog” for these technical documents. Data catalogs are exactly what they sound like: an illustration of metadata. They’re more user friendly than the majority of active data dictionaries, but it’s certainly not easy to build them.
- Data integration/ETL metadata repositories – this is a very manual process that involves linking data from multiple sources and combining them in a central location. The abbreviation “ETL” in the context of data means “extract, transform, load.” A user, i.e. you, have to perform the regrouping process. You must extract data, transform it all to the same format, and load it to a new location.
- Data modeling tools – data modeling is very similar to data integration and ETL technique because it requires manipulating data, but the way it comes together is different. The analyst uses different tools and processes to get to the result.
Data Dictionary in Excel – Example and Template Download
As mentioned above, a data dictionary built in Excel is not normally considered active. However, we can automate the process using a few different excel formulas. Click below to download the excel and follow along:
Data Dictionary Example in Excel
You can see that this is the same table we used above for customer profiling.
Here is the data dictionary to accompany it:
As you can see in the image, the Definition column will always be a manual job, whether you’re working with excel or a more advanced database management system, definitions and descriptions are manual. However, we can automate the “Data type” column as well as the “Required Column.”
I will show pictures of the formulas used, but this article is not designed to explain this process in depth. For more information on how to leverage excel, check out other article on AnalystAnswers.com.
Here is the formula to automate the common “Data type” data dictionary column:
Here is the formula to automate the common “Required?” data dictionary column:
Obviously, these formulas are not a full-proof solution. You cannot easily modify them, and in many cased they are prone to some manual work. For example, you may note the the column “B” highlighted in blue in the above picture must be manually altered every time you paste the formula down.
There are other limitations, such as the amount of data the excel can manage. Altogether, it’s impossible to work with more than 1,048,576 rows and 16,384 columns. That said, Excel usually stops working much sooner than that.
Data dictionary example
As shown above, here is an example of a data dictionary. Don’t forget — the columns in a database become the 1st row in a data dictionary!
Importance of a data dictionary
Data dictionaries are very important for teams that need to share huge amounts of data on a regular basis. This is the case for most organizations today, since most decisions are progressively more data-driven.
The exception to this is in organizations where only one team needs a working knowledge of a database. Otherwise, data dictionaries are a must.
For example, imagine a team of data analysts or business intelligence analysts in an e-commerce company. They collect information like what we have in our example database above. Other teams may start to analyze that data using tools like Tableau or Excel charts.
If any team can easily look at a database’s composition with a data dictionary, then the organization moves faster through decision-making processes.
On the other hand, imagine a team of analysts working in a steel mill, collecting data on particle quality. There are tons of analyses possible with that data, but this analyst team is the only consumer of the data. It’s simply too complex and specific to share with other teams. Their job is to communicate it clearly to decision-makers in the.
You can see how a data dictionary would not be as useful for this team.
How to make a data dictionary (3 easy steps)
Making a data dictionary is not as complicated as it might seem, but the process depends entirely on which tool you use. In Excel, you will need to do much more manual work than if you were building it with an automatic, active database management software.
Nevertheless, the steps needed to make a data dictionary are as follows:
- Make each field (column header) in the data table and list it as a row in the data dictionary
- Decide how you want to define each field in the data dictionary
- For example, you may want to say what the data type is
- Either use a database management software to compile the source data into the data dictionary, or build out logic in a spreadsheet software like Excel (refer to above example).
What is a data dictionary diagram?
A data dictionary diagram does not exist. What most people often confuse with a data dictionary diagram is called a entity relationship diagram. It’s easy to confuse the ideas conceptually, but be careful not to confuse them in practice!
Data dictionary tools
One of the biggest challenges with data dictionaries is find ing the right tool! In short, the best data dictionary tool depends on your needs. An easy way to know your needs is to look at the primary data you use. If you can manage it all in an Excel docuement, then using Excel for the data dictionary will be sufficient.
On the other hand, if you use another database management solution, then you will use the same service for a data dictionary. Almost all providers today offer a bundle including both. Then again, if you are using a solution more powerful than Excel, then you probably already know this.
Data objects in data dictionaries
In this article we’ve talked a lot about data tables. Quite simply, they’re the most common type of data object. That said, several other types exists, including views, clusters, sequences, and indexes. I leave this section for last because these are not the most important elements of data dictionaries to know.
However, you should see them at least once to be familiar with all possible scenarios. Here’s a definition for the 5 most common data object types:
- Tables. A series of rows and columns containing information. The first column always contains the reference data (or unique ID), while the other columns provide information on these IDs.
- Views. Data dictionaries can be used to grant special access to a user. A database manager may want to limit the visibility on secure information for certain users. In other words, s/he may want to change the user’s view. The word also refers to displayed data that the user can easily see but not edit. A query to the database for a view will display data quickly, which is useful for decision making. You can think of them as a window.
- Clusters. A cluster is simply a table built by connecting two other tables around a common column.
- Sequences. A set of data columns or tables that describe a specific real-world event. Clusters can be sequences.
- Index. A copy of key columns that can be easily accessed.