Data analysts hold some of the most critical, yet least understood, positions in business today. Some say that they are “responsible for all the data that a company collects and produces.” But this description is so vast that it’s unclear what data analysts actually do. So… what does a data analyst do?
In short, data analysts collect, inspect, clean, and visualize business data, gain insights from it, and provide those insights to managers for good decision making. Data analysts employ technical and statistical techniques to predict future performance and/or better understand current operations. They use Microsoft Excel, SQL, and R or Python in most tasks, and they work with all departments, with particular investment in finance, functional, and marketing teams.
As a list, data analysts:
- Collect data requirements
- Collect the data itself
- Inspect data to understand it
- Clean the data
- Visualize the data and tell a story
What is data analysis?
Data analysis is the process of pulling insights from operational and financial data within a company. It is a discipline in and of itself, and it is only possible after the completion of data collection (writing down, recording digital movements, or recording through sensors) and data integration (the regrouping of data from disparate locations into one database).
When companies operate, they collect numeric and textual information. When they get enough of this information, it’s what we call data. The most obvious example of data in a company is the financial numbers, such as revenue and net profit.
Between revenue and profit, there is a world of data to collect in any company, and not just financials. Good companies collect as much data as possible about their operations. Steel manufacturing companies collect data about the purity of steel and the energy expense needed to forge it, and e-commerce websites collect data about the number of visitors and conversion rates. Data analysis is the process of pulling value from financial and operational data like this.
In general, what do data analysts do? (5 Steps)
In general, there are 5 parts to the data analyst job:
- Collect data requirements
- Collect the data itself
- Inspect data to understand it
- Clean the data
- Visualize the data and tell a story
Some companies exclude the first two steps, since data requirements and data collection take place in the field, a place analysts rarely go. Moreover, some companies consider the cleaning process part of a data wrangler’s role.
However, in most digital companies there are no hard lines between collection, cleaning, and analysis. Data analysts are expected to be able to perform the entire suite of skills, so let’s look at them more closely.
Step 1: Collect data requirements
Data requirements are the “functionalities” that decision makers decide new data must contain. They make these choices based on what they believe the outcome should be.
The challenge is that decision makers often don’t have a complete understanding of how to get what they want. Data analysts, thus, have two tasks in collecting data requirements:
- Interpreting what decision makers say they want to determine what they really need
- Deciding how to collect the data to produce what’s really needed.
Business analysts are very familiar with requirements analysis, but it’s poorly understood among data analysts. It’s a serious skill that data analysts must practice in order to deliver results to their organization. Unfortunately, it’s more of a soft skill, so the only way to get better is practice.
That said, getting good at interpreting means learning how to hear ideas, ask questions, and observe results. If you can focus on doing these three things in your everyday life, you will slowly but surely improve your requirement interpretation skills.
For example, the next time your family, spouse, or significant other asks for something that doesn’t seem clear, try to observe the process you go through after several attempts to understand the real message. This will also work wonders in your personal life!
Decide what to collect
Once you understand the requirements, deciding what data to collect becomes a relatively easy task. You need to either buy the right recording materials or design a way to record it in house.
For example, imagine you work for a steel manufacturing company, and management says you need to collect data on the temperature needed to melt scrap metals. You push back and ask why. Management says they want to know if it’s too hot for a new technology, burner x.
What they don’t know, however, is that burner x can withstand heats higher than the current technology, burner a. The difference is that burner x can only do so for a fraction of the time.
You further realize that the real reason they want burner x is that it would ostensibly be cheaper than burner a in 1 of the 4 business units. This means that the data management really needs is melting temperatures on a seasonal basis in all 4 business units — since each of the four melts different scrap metals. Good data requirement elicitation is critical in this step.
Step 2: Collect data
Data collection is the process of recording data using sensors, digital trackers, or manual entry. Some consider data collection a part of data integration, which falls outside the data analysis umbrella. But the nuance is that data collection is part of the analysis as long as it immediately falls into a single database system.
Data analysts must collect data in order to analyze it. The challenge is that every company has different data collection needs. For example, data collection for the Multiple Indicator Cluster Surveys (MICS), a UNICEF program, requires field agents to manually record data from interviews with people.
On the other hand, Google has systems in place to record visitor data automatically. Likewise, steel manufacturers have sensors that automatically record data in a central database.
The diverse nature of data collection means that data analysts must be flexible, critical problem solvers who also know how to source information within the company in order to collect the data they need correctly. Don’t worry — most companies either outsource this part of the job or have a special controls team or database team to manage the task.
Step 3: Inspect the data (with checklist)
Once you’ve got the data in your database, you need to give it a good review. You should be checking to make sure that the data is normalized, and that all of the target fields are present. Here’s a checklist you should use:
- Is the data normalized? 🖾 Normalizing is the process of recording each observation (surveyed persons, visitor IDs, steel bar VIN numbers) in the far left column of a table, and recording each characteristic in a column to the right. You can see an example here.
- Are the target fields, previously defined in data requirements, all present? 🖾 You should have defined the information necessary to reach decisions in the data requirements section, and this is where you check to see that they all appear.
- Are there any empty, or “NULL” points? 🖾 NULL points cannot be analyzed.
- Are all of the field data recorded in the same format? 🖾 Different formatting can prevent analysis.
- Are all of the data recorded with the correct data “type”? 🖾 Databases recognize data under different types, such as strings, dates, or integers. If one data source saves dates as text rather than as a date type, then you won’t be able to analyze it.
- Are any two data fields merged under one? 🖾 A common problem with data is that a source provides one input for data you need split into two columns. For example, imagine there’s a date field with day, month, and year. You need each of those in a separate column in order to run a proper analysis.
- Do all of the column headers make sense? 🖾 Usually, database managers use data dictionaries to describe unclear fields. Nevertheless, column headers should not look like a randomly-generated password.
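The checklist above can also be run in code. Here is a minimal pandas sketch, with invented column names and values, that covers three of the checks: target-field presence, NULL points, and data types.

```python
import pandas as pd

# Hypothetical orders table illustrating three checklist items; the
# column names and values are invented, not taken from a real sheet.
df = pd.DataFrame({
    "Order ID": [1, 2, 3],
    "Order Date": ["2012-01-05", "2012-02-10", None],  # one NULL point
    "Sales": ["100.5", "200.0", "150.2"],              # numbers stored as text
})

# Are the target fields all present?
target_fields = ["Order ID", "Order Date", "Sales", "Profit"]
missing_fields = [f for f in target_fields if f not in df.columns]

# Are there any empty ("NULL") points?
null_counts = df.isna().sum()

# Are the fields recorded with the correct data type?
sales_is_text = df["Sales"].dtype == object

print(missing_fields)   # ['Profit']
```

Running each check as code means you can repeat the inspection every time new data arrives, rather than eyeballing the sheet.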
Step 4: Clean the data
Once you have inspected the data and identified any problems, you have to fix those problems in order to perform an analysis. This skill is called data wrangling, or data cleaning, and sometimes falls under the scope of someone outside the analyst’s team. However, a data analyst must be autonomous enough to do it themselves.
It’s impossible to outline all scenarios of “dirty” data and how to clean it, but let’s look at how to work through some of the most common, using our list from above. Since most people have access to Excel, we’ll use it for these examples. However, all of this is possible in SQL as well.
Normalizing the data
When we normalize data, we want our observations in the far left column, and all of the field data in columns to the right. Our sample Excel sheet is based on orders, so they should be the reference column. But in our Excel sheet, orders are the title of the first row, not the first column.
To fix this problem, place your cursor in the cell containing “Order,” then highlight all of the data cells. Copy those cells. Then click on Paste>Special>Transpose. This will turn the top row into the left column, and so on through the transposition. Here’s a video to show it:
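The same transpose fix can be sketched outside Excel. A minimal pandas example, with invented data, that flips a sideways table so each order becomes a row:

```python
import pandas as pd

# Invented example: orders run across the top row instead of down the
# left column, so we transpose to normalize.
wide = pd.DataFrame(
    [[1, 2, 3],
     [100.0, 200.0, 150.0]],
    index=["Order ID", "Sales"],
)

tall = wide.T   # rows become columns: one order per row
print(tall.shape)   # (3, 2)
```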
Verifying presence of target fields
To make sure all of your target fields are present, let’s use a #N/A test with the MATCH function. The MATCH function looks at a one-dimensional array of cells (column or row, not both) and returns the relative position of a given value within that array.
Let’s say that our target fields are Order ID, Order Date, Order Quantity, Sales, Ship Mode, Profit, Unit Price, Customer Name, Customer Segment, Product Category, and Product Name.
We need to first create a column to the right with each of these values. Then we’ll use the MATCH formula in the column next to the target fields to identify whether they’re present. Take a look at this video for an example:
As you can see, we’re missing the Product Name field! Don’t worry — it wasn’t included in the sheet; I set it up that way just for the example. You can see how, without running the test, it would be easy to miss.
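The same presence check is a one-liner in Python: a set difference between the target fields and the columns actually in the sheet. The field names follow the list above, with Product Name planted as the missing one.

```python
# Fields actually present in the (hypothetical) sheet
present = ["Order ID", "Order Date", "Order Quantity", "Sales", "Ship Mode",
           "Profit", "Unit Price", "Customer Name", "Customer Segment",
           "Product Category"]

# Target fields defined during the data requirements step
target = present + ["Product Name"]

missing = sorted(set(target) - set(present))
print(missing)   # ['Product Name']
```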
Handling “NULL” values
“NULL” values are essentially empty cells. There are 3 ways we can handle them. First, we could exclude the observation (row) that includes the NULL value. To do this we need a large dataset so the loss of an observation has no material impact.
Second, we could eliminate the field that contains the NULL values. We should only do this if the data field does not add value to our analysis, or if we have so many NULL values that any results would not be reliable.
Third, we accept the NULL values and consider our results reliable in spite of them. This is acceptable when we have relatively few NULL values.
In our spreadsheet, we have a few sales and profit cells empty (rows 12 to 19). Given that there are 1007 observations, I feel most comfortable excluding these 8 rows. This will not have a strong impact on the results (it accounts for less than 0.8%). To do this, simply highlight rows 12 to 19 and delete them.
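Here is the same fix sketched in pandas, on an invented mini-version of the data: drop any row whose Sales or Profit cell is empty.

```python
import pandas as pd

# Invented mini-dataset with NULL values in the Sales and Profit fields
df = pd.DataFrame({
    "Order ID": [1, 2, 3, 4],
    "Sales":    [100.0, None, 150.0, 80.0],
    "Profit":   [20.0, 5.0, None, 10.0],
})

# Exclude observations (rows) that contain NULL values in either field
clean = df.dropna(subset=["Sales", "Profit"])
print(len(clean))   # 2
```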
Changing data types
Let’s say that we want to examine the financial information in our Excel sheet in depth. We have a sales column, a cost column, and a profit column. But take another look at the sales column. The data in those cells is not formatted as numbers, but as “General.”
Most programs today can automatically understand the difference, but you do not want to make a bad habit of using the wrong data type. The risk is that as you work with this data over time, you may run into data recognition issues. Let’s say, for example, you want to compare the past 10 years’ sales from this Excel sheet with those of another sheet…
…only to find that the chart you want to build cannot be generated. If you don’t correct the data type now, you may find yourself stuck in the future.
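A quick pandas sketch of the same correction, with invented values: coerce text-formatted sales and dates to proper types so sums and year comparisons work.

```python
import pandas as pd

# Invented example: sales stored as text, dates stored as text
df = pd.DataFrame({
    "Sales":      ["100.5", "200.0", "149.5"],
    "Order Date": ["2009-03-01", "2010-07-15", "2012-11-30"],
})

# Convert to proper numeric and date types
df["Sales"] = pd.to_numeric(df["Sales"])
df["Order Date"] = pd.to_datetime(df["Order Date"])

print(df["Sales"].sum())               # 450.0
print(df["Order Date"].dt.year.max())  # 2012
```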
Separating merged fields
The most common example of merged fields is names. Let’s say you need customer names in a data table. In fact, you need the first name and the last name of each customer split into two different columns, but from the start this data is merged into one. We need to use the text wrangling formulas LEFT(), MID(), and FIND() to separate them.
Don’t be frustrated if these don’t make sense right away. They take some getting used to. In brief, the LEFT() formula grabs x number of characters in a cell, starting from the left. MID() grabs x characters starting from a specified position. FIND() locates the relative position of a character within the cell.
When we ask what a data analyst does, we have to talk about knowing Excel formulas.
In our example, customer names are indeed merged into one cell. Let’s push our target field check column to the left to create space for two new columns. We’ll put first names on the left and last names on the right. It’s important to keep your formula cell aligned with the source cell. Otherwise it’s confusing.
Here’s a picture below to show you how to build the first cell for Customer_first_name. Once it’s complete, you just need to drag down the formula.
To get the Customer_last_name in another column on the right, we just need the MID() and FIND() formulas:
I recommend putting them side-by-side starting where the Customer name field was. This will help ensure consistency. Also, I highly recommend using underscores to separate words within a field name — it’s easier for programs to understand.
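The same split can be sketched in Python, mirroring the Excel logic: FIND() locates the space, LEFT() takes what comes before it, and MID() takes what comes after it. The sample name comes from the line chart example later in this article.

```python
name = "Jasper Cacioppo"

space = name.find(" ")   # FIND(): position of the separator
first = name[:space]     # LEFT(): characters before the space
last = name[space + 1:]  # MID(): characters after the space

print(first, last)   # Jasper Cacioppo
```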
Trimming extra spaces
Sometimes you have data points that are of the correct data type but have a formatting problem. When it comes to textual data, the most common example is extra spaces. Spaces can cause major problems when it comes to aligning values, filtering, or creating complex relationships, because of the way databases and programs recognize data.
For example, imagine you want to separate a merged field into two, as we did above. We used the space as a reference point to count up to (with LEFT) and to start from (with MID). But if there is an additional space in the mix, then our formula will not work. This is a small example, but it points to what can happen if data analysts don’t fix formatting issues.
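A short Python sketch of the cleanup, on an invented value: collapse runs of spaces and trim the ends, so the single-space assumption behind LEFT()/MID()/FIND() holds.

```python
# Invented dirty value: leading, trailing, and doubled spaces
raw = "  Jasper  Cacioppo "

# split() breaks on any run of whitespace; join() rebuilds with single spaces
cleaned = " ".join(raw.split())
print(cleaned)   # Jasper Cacioppo
```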
Intuitive column headers
This one is self-explanatory. You want to use intuitive names for your columns and fields to make sure people understand them. Normally a data dictionary is in place for this, but not always. Best practice dictates you make it easy!
Step 5: Visualize and tell a story with the data
When it comes to analysis, the going principle is that visualizations are the most efficient means of communication. There are few hard and fast rules on how to present data with visualizations, so data analysts spend a good amount of time choosing the right method for their message.
Data analysts think about three things as they build their visualizations for storytelling:
- Do I understand the data?
- Have I built simple graphs to identify trends and insight opportunities?
- What story do I want to tell with this data?
At the end of the day, data analysis and visualization are more art than science. Understanding and presenting data is a discovery process that requires the analyst have a good toolkit of techniques and loads of creativity.
Let me explain what I mean by toolkit. You’ve probably heard of bar, line, & tree charts, and you’ve no doubt seen a few scatter plots in school. Well, data analysts use them in a handful of well-known cases. The kind of data, as well as the message, determines which ones we use. Here are a few examples:
Bar charts are best for comparing different data fields whose only connection is the comparative variable (a sum or amount in most cases). Stacked bar charts are a special case in which the bars are connected both by the comparative variable and by the whole of which they are parts. In plain English, bar charts help compare two similar fields, and stacked bar charts help show how they fit together as parts of a whole.
Here it is as a stacked bar chart:
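The table a stacked bar chart is built from is just a grouped sum. A pandas sketch with invented categories and figures: one bar per product category, one stack segment per customer segment.

```python
import pandas as pd

# Invented sales figures by product category and customer segment
df = pd.DataFrame({
    "Product Category": ["Furniture", "Furniture", "Technology", "Technology"],
    "Customer Segment": ["Consumer", "Corporate", "Consumer", "Corporate"],
    "Sales": [100.0, 50.0, 200.0, 150.0],
})

# Rows become bars, columns become stack segments
stacked = df.pivot_table(index="Product Category",
                         columns="Customer Segment",
                         values="Sales", aggfunc="sum")
print(stacked.loc["Technology", "Consumer"])   # 200.0
```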
Line graphs are most useful in situations where we compare one or multiple variables over time. They’re one of the most common charts because most observations include a time element in one way or another. Here’s an example:
As you can see, only Jasper Cacioppo purchased on more than one occasion. He purchased in 2009 and in 2012. The others didn’t purchase at all. This is already a huge insight compared to the other customers.
Tree charts are typically for hierarchical information, but we can also use them to show the relative size of a large number of observations. As you can see below, however, if the relative sizes of the observations are not significantly different, there’s not much insight to take.
Scatter plots are the ultimate comparative chart. You can use them to see the relationship between two variables. These are what analysts use to identify correlations. Here’s an example using our spreadsheet:
Strangely, we do not see as direct a correlation between the two as I would have expected. This merits investigation, and I would assume it’s due to varying margins by product.
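The number behind a scatter plot like this is the correlation coefficient. A pandas sketch with invented sales and profit pairs (margins deliberately vary, as suspected above):

```python
import pandas as pd

# Invented sales/profit pairs with varying margins
df = pd.DataFrame({
    "Sales":  [100.0, 200.0, 300.0, 400.0],
    "Profit": [30.0, 20.0, 90.0, 50.0],
})

# Pearson correlation: 1 is a perfect straight line, 0 is no relationship
r = df["Sales"].corr(df["Profit"])
print(round(r, 2))   # 0.54 — positive, but far from a straight line
```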
What does a data analyst do on a daily basis?
On a given day, an analyst spends about 70% of their time on data preparation (steps 1-4) and 30% on data analysis (step 5). The time needed for preparation can shrink if there are data wranglers, data collection specialists, and database analysts on the team. In addition, the time needed for preparation in data mining (predictive analysis) is smaller than in business intelligence (descriptive analysis), since data miners typically source information from the business intelligence team.
In their analyses, data analysts spend most of the time visualizing and comparing data. But they also need to perform statistical tests to be certain of their results. In a nutshell, data analysts spend <5% of their day (depending on the task) performing descriptive statistics, exploratory data analysis, and confirmatory data analysis to mathematically validate their findings.
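That descriptive-statistics pass can be as simple as a few summary measures. A standard-library-only sketch, over invented profit figures:

```python
import statistics

# Invented profit figures for a quick descriptive pass
profits = [20.0, 5.0, 35.0, 10.0, 30.0]

mean = statistics.mean(profits)      # central tendency
median = statistics.median(profits)  # robust to outliers
spread = statistics.stdev(profits)   # sample standard deviation

print(mean, median, round(spread, 1))
```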
What does a data analyst do? Role and responsibilities
A response to the question “what does a data analyst do?” requires a list of roles and responsibilities. A typical data analyst job description might include the following items:
- Act as domain expert
- Drive data-based decision making
- Support financial and business analysis teams
- Support operational teams, especially developers
- Support marketing teams (especially in digital companies)
How to become a data analyst
Becoming a data analyst may not be as hard as you first think. It takes skill and patience, but you don’t have to be a mathematical genius or have a special degree to get there. Here’s a brief list on how to become one. It’s not at all comprehensive, but it is a good start.
- Earn a bachelor’s degree. While you don’t need a specialized degree, you do need a bachelor’s degree or equivalent in order to land a first data analyst job. It is possible to do so without one, but your chances are very slim.
- Learn how to use Microsoft Excel. Microsoft Excel has become a gateway through which all must pass in order to enter any analytics field, and for good reason. While Excel has its shortcomings, it is the ideal tool to intuitively discover how data analysis and relationships work. If you want a job as a data analyst, every employer will require you to have good Excel skills and knowledge.
- Understand the database system. If you don’t already understand database systems (that is, the data journey from data entry to data analysis), then you may need to spend some time on AnalystAnswers.com or take one of our courses to get a better idea.
- Build a portfolio. While portfolios are typically for visual creators, it can be very impressive for a data analyst to keep a portfolio of completed work. This will help communicate your skill level with precision.
- Network. I would be remiss to leave out how important networking is. Even for a technical job like data analysis, having a good network will help you leverage relationships and get your foot in the door. All too often I see technical profiles apply for jobs, when they should be out talking with people in the field. Employers know that an analyst with people skills and the ability to hustle will get things done.