In their simplest form, data models are diagrams that show 3 dimensions: 1. what data an organization collects, 2. in which section of the organization it is collected, and 3. how each section’s data relates to others. Data modeling (modelling) is the process of creating those data models.
But for most people, this description isn’t very helpful, and it makes things seem more complex than they really are. I believe data models are easiest to understand in context, so let’s consider a business.
Businesses consist of three fundamental units: suppliers, products, and customers. Each of these units consists of data collected independently. For example, the sales team records customer name, location, and sale amount, whereas the production teams records product name, price, and size.
Businesses usually want to compile this data in a company database, which allows them to analyze it and better serve their customers.
To do so, they develop data models to understand what data they have, where it is, and how it all relates. Only then are they confident enough to put it in database. Since every business is different, data modeling is the process of creating a data model that meets the specific structure of the business in question.
Data models: data tables, data objects, & databases
To understand data models and how to model them, you need to know about data tables, data objects and databases.
A data table is a table of columns and rows in which the left most column is a unique ID (aka primary key), and the columns to its right are characteristics of that unique ID. Data tables are what most people think of when they hear “data.” For example, look at this data table of vendor data:
|Vendor ID||Avg. sale ($)||Last Purchase||Number of items purchased|
|MetalOne Inc.||100||September 4, 2020||5|
|Dynamic Metal Inc.||50||February 10, 2020||2|
|ForceFive Metals Inc.||75||March 25, 2020||3|
A database consists of many data tables either compiled into one big table or stored individually. The reason we consider multiple individual tables as part of one database is that they relate to each other. If not, they’re just different tables in space.
Don’t forget, you can get the free Intro to Data Analysis eBook, which covers data fundamentals (including models, tables, and objects).
The important thing to see here is that one database can also have multiple data models. As we said, a data model is just a combination of data tables that relate to each other. Once you store them together, they’re a model. And since databases can store a huge number of data tables, it can store multiple models. This sounds complicated, but it’s easy to understand with a picture:
Data object is another name for data tables within a database. The reason we use a separate names is that, at more advanced levels, data models contain objects other than tables (bit that’s outside the scope of this article). You will often hear or read data analysts refer to “data objects.” Nine time out of ten, they’re referring to tables, but you should be aware that there are others as you progress as a data analyst.
Types of Data Models with Real-World Example
So, we know that data models are pictorial representation of the contents and relationships of data tables. But there is not a on-size-fits-all data model, especially in a business.
C-level executives don’t want to see the gritty details behind the model. They just want to see the high-level relationships. At the same time, a database analyst wants as much detail as possible to ensure the relationships are correct.
The need for different views has led to three primary types of data models:
- Conceptual Data Models. High-level.
- Logical Data Models. Mid-level.
- Physical Data Models. Low-level.
Though they differ, each of these models consists of at least one of these elements:
- Entities. Entities represent data tables (or more generally, data objects) that contain data relevant for comparison.
- Primary keys. Primary keys are another word for the unique ID of the data table within an entity.
- Attributes. Attributes show additional information stored under each unique ID in an entity.
- Relationships. Relationships are shown by lines and symbols and explain how entities interact. The most common are “one to many,” “one to one,” and “many to many.” Relationships are also referred to as cardinality. The notation for these relationships is called crows foot notation, and it’s very simple. Here are the most important examples:
Imagine you own a wholesale e-commerce company that sells watches, and its called Batch Watch. Your three business units are vendors, products, and customers. You buy metal and glass from the vendors to build your high-quality watches, then sell them to boutiques and other retail stores. And let me tell you… people love your watches!
Let’s look at conceptual, logical, and physical data models using this example.
Conceptual Data Model
Conceptual models are the most general of the three. You don’t need to be a data analyst to understand them. Conceptual data models show the high-level business units that collect data, but do not show any information about the contents. They sometimes include pictures to more easily communicate their structure.
Using our Batch Watch example, a conceptual model may look as simple as this:
Logical Data Model
Logical models go a step further than conceptual models to show the primary key and attributes within each entity, as well as the relationships between them. Logical data models are the most common type of data model.
To understand logical data models, let’s look at this example of our three entities in Batch Watch to understand their primary keys and attributes. Then we can see how they’re related with crows feet notation.
As you can see, the primary keys (PKs) for each entity are a unique ID of the key component. The attributes under them provide a view on what data is stored within these data tables.
We saw a simplistic view of relationships in the conceptual model, and now we’ll add more detail with the crows feet notation. What will this help us understand? It will show how many primary keys in each entity links into primary keys in the other entities. Remember, the most common of these is “one to many.”
In the case of Batch Watch, each vendor supplies general materials for many of our watches (product), OR only one of our watches. This is because one watch requires a special kind of glass. Each product then sells to many retailers.
Using crows feet notation, it looks like this in a data model:
The logical model thus helps us understand that “one and only one” Vendor ID (along with their attributes) links into “one or many” Product IDs. Then, one and only one Product IDs link into many Retailer IDs.
Note: “many to many” relationships do not exist in data models
When analysts learn about crows feet relationships, they often get stuck on the idea of “many to many” relationships. After all, if one PK in Entity A links to many PKs in Entity B, aren’t there already “many” combinations? The answer is yes, “many to many” relationships exist, but they’re already accounted for in multiple “one to many” relationships.
Don’t let this confuse you (most professionals have a hard time explaining it). Just know that in data modeling, we do not use “many to many” relationships. Instead, whenever more than one PK links into other PKs, we simplify. We use two entity relationships of one to many.
From Logical Model to Database
The logical model is great, but it’s difficult to understand without seeing what it looks like once all of the entity’s data tables are linked together in a database.
Imagine we have only two vendors, three products, and four retail partners. Even with so few players, the database becomes complex, and quickly.
Here’s what it would look like using our example. I leave out the attributes in this picture so it’s easier to understand how this database is compiled:
This complexity is arguably the most important reason for using data models. They simplify the relationships between business units and entities so that it’s digestible, and easy to act on. Without them, it would be difficult to work with databases at all!
Why use a data model?
If they’re so complex, why even bother with databases and data models at all? The most obvious reason is that data is key to extracting insights and improving a company’s competitive edge.
When companies ignore data, they miss out on opportunities to understand their operations, markets, and customers better.
Moreover, data is becoming a compliance necessity. Businesses must be able to show how their company operates through data to be compliant with growing government data regulations.
Physical Data Model
Once you understand logical models, physical data models are easy. Physical data model entities are exactly the same as logical model entities, but they add in the types of data that each PK an attribute uses, as well as the number of characters. Here’s an example:
As you can see, the added value of physical models is the detail they provide on data in its tables. Experienced data modelers are able to quickly understand how the data model translates to the database and make decisions based on this knowledge.
Types of data are seemingly endless. The most common types include text, numeric, and boolean (true/false), but they can be as complex as the following list:
- Numeric Data Types
- Integer – any number that is not a decimal. Examples include -11, 34, 0, 100.
- Tinyint – an integer, but only numbers from 0 to 255
- Bigint – an integer bigger than 1 trillion
- Float – numbers too big to write out, and the scientific method is needed
- Real – any fixed point on a line
- Date and Time Data Types
- Date – the date sorted in different forms, including “mm/dd/yyyy” (US), “dd/mm/yyyy” (Europe), “mmmm dd, yyyy”, and “mm-dd-yy” among many more.
- Time – the time of day, broken down as far as milliseconds
- Date time – the date and time value of an event
- Timestamp – stores number of seconds passes since 1970-01-01 00:00:00’ UTC
- Year – stores years ranging from 1901 to 2155 in two-digit or four-digit ranges
- Character and String Data Types
- Char – fixed length of characters, with a maximum of 8,000
- Varchar – max of 8,000 characters like char, but each entry can differ in length (variable)
- Text – similar to varchar, but the maximum is 2GB instead of a specific length
- Unicode Character and String Types – unicode is a way of structuring data in the form of U+0000, where the 0’s can be any type
- nchar – fixed length with maximum length of 8,000 characters
- nvarchar – variable length with maximum of 8,000 characters
- ntext – variable length storage, only now the maximum is 1GB rather than a specific length
- Binary Data Types – a combination of 0s and 1s
- binary – fixed length with maximum of 8,000 bytes
- varbinary – variable length storage with maximum bytes, topped at 8,000
- Miscellaneous Data Types
- clob – also known as Character Large Object, is a type of sub-character that carries Unicode texts up to 2GB
- blob – carries big binary objects
- xml – a specific data type that stores XML data. XML stands for extensible markups language, and is common in data bases
Data Modeling tools
To build all three types of data models, you will need a tool. The most important one to have in your toolbox is Microsoft Powerpoint. While it is heavily manual, its available in almost every professional setting. Especially as a a consultant, you need to be flexible.
With that said, the best tools for data models are ERP modeling tools and UML modeling language. They’re common among systems administrators and software engineering, where structures similar to data models are an everyday event.
- Entity Relationship (E-R) Model programs. As you can imagine, structural models are not unique to data models. In fact, the idea of entities and relationships is a driving principle in engineering. That’s where E-R models surfaced, as well as the programs to automate them. These programs are user friendly and require minimal coding skills that you can learn as you go. While a tutorial on E-R technology is outside the scope of this article, you can check out a trial account with Lucidchart for free if you want to get your feet wet. It’s common in big companies.
- UML (Unified Modeling Language). UML is a “coding” language for entity relationship models. I put “coding” in quotes because it’s actually called a mockup language, but the principle is the same. You use a program to write code that becomes the model. For many people, coding separately to build a model that represents an underlying database feels like overkill. For this reason, UML is considered an advanced technique.
Data modeling steps
We’ve talked a lot about types of data models, their content, and why we should use them, but what about actually building one? What are the steps needed to build a data model, or “do” data modeling?
To answer this question, let’s take the perspective of an external data consultant rather than an internal analyst (since internal specialists can sometimes be biased).
The most important and first step is understanding the organization and its data collection capabilities and desires. Without data, there isn’t much to model.
Then, the consultant must understand the goals of the organization and set up a data collection plan to be approved by business decision-makers. That’s the bird’s eye view.
More specifically, 12 steps to data modeling are:
- Understand what kinds of data analysis and data insights the company is looking for. This is a crucial step. It consists of speaking with decision-makers to better understand
- Identify key business units. This step consists of identifying the most important business units. These are not necessarily departments, as products are a key unit as well. A good test is to ask yourself: which units do business units usually refer to when they ask “why”?
- Perform a data collection audit. This step consists of identifying which business units need data in order to build the business model. In almost all cases, they will be the units identified in step one, but not always. Business units without data collection, or without sufficient data collection, should be noted in a text document. A good test is to ask yourself: what data dimensions will this unit need to answer the “why” questions from business decision-makers?
- Perform a data collection GAP analysis. This step consists of identifying what technical and non-technical changes must be made in order to execute the data collection requirements identified in step 3.
- Build a draft conceptual data model. This step is the first model you build. It’s the conceptual model mentioned earlier in the article.
- Get feedback for this structure from management. In this step you get a critical view on how well your work responds to the decision-maker requirements identified in step 1.
- Adapt to feedback from management. Make changes to you conceptual model based on feedback from decision-makers.
- Build logical data model. This step consists of building a logical model with the information gathered and feedback. We move on from the conception model even without managerial approval because it’s better to adapt progressively than get stuck on the conceptual model.
- Get feedback and make adjustments. Repeat step six, but with the logical model. While decision makers may not want to see PKs and attributes, the conceptual structure remains, and it’s useful to get additional feedback and approval.
- Create physical data model to share with database management teams and BI. Using either an E-R modeling program or UML, build a physical model to share with relevant teams.
- Implement data collection improvements. This step consists of implementing the technical and non-technical requirements identified in step 4’s GAP analysis. You may need to work with external providers for this.
- Build dashboard. Build a dashboard to show the conceptual, logical, and physical models in a user-friendly framework for everyone in the company. Dashboard creation is often possible through the database management system the company stores its data in.
Techniques and best practices
Data modeling best practices include the following items:
- Where possible, use a single Enterprise Resource Planning (ERP) program to ensure ongoing integrity of data collection and modeling process.
- Always document decision-maker requirements to ensure coherence throughout the implementation process.
- Do not use both UML and E-R model programs. Choose one and stick with it. Since data collection and data modeling are an ongoing effort, you want it to keep it as user friendly as possible.
Advantages and disadvantages
While data modeling is an industry standard, it has its disadvantages. We’ve spoken a lot about the good parts, but here’s an overview of the advantages and limitations of data modeling:
- Easy to access
- Creates structure for an organization
- Flexible to the needs of any organization
- Ensures the integrity of data by splitting
- Compounding complexity. As company entities and the data stored in them become more complex, so does the data model. In fact, it can become so complex that it looses its simplistic appeal.
- Rigidity. Once a data model is put in place, it is incredibly difficult to modify. This article described the steps to set up a data model, but maintaining and modifying one is another story.
- Dependance with growth. Just like any web of logic, any change to one element has an impact on many others. While the purpose of the model is to limit this risk, as the business units grow, so too does the difficulty of maintaining entity independence.
Data modeling is a must-know technique for any good data analyst. It’s a window into the complex database that hosts any company’s data. While it may seem intimidating at first, you’ll quickly adjust to the logic as you spend more time with different materials.
At AnalystAnswers.com, I believe that data analysis is becoming more and more critical in our digital world. Learning it shouldn’t break the bank, and everyone should have access to understanding the data that’s growing under our fingertips every day.
To learn more about data analysis, check out the Understand Data Analysis tab!