4 Top Data Analyst Skills (That You Can Learn Right Now)

When I started learning data analysis, I had a mentor who taught me not to think about learning skills, but about learning the data treatment cycle. The data treatment cycle is cleaning, exploring, visualizing, & analyzing. When we think about skills as parts of these steps, it’s much easier to understand why we learn them, which speeds up our knowledge acquisition.

Each step in the data treatment cycle requires many skills, but I’ve broken them down into data principles and the technical skills:

Cleaning: 4 Data cleaning principles.
- Golden rule: make your data uniform
- Don’t leave blank or NULL values
- Ensure correct data type recognition
- Remove duplicates
Exploring: 4 Excel tricks & 10 SQL commands.
- Keyboard fluency (Excel)
- Data Filters (Excel)
- Pivot Tables (Excel)
- The SUM formula & calculating % change (Excel)
- SELECT (SQL)
- UPDATE (SQL)
- DELETE (SQL)
- INSERT INTO (SQL)
- CREATE DATABASE (SQL)
- ALTER DATABASE (SQL)
- CREATE TABLE (SQL)
- ALTER TABLE (SQL)
- DROP TABLE (SQL)
- CREATE INDEX (SQL)
Visualizing: 4 Data visualization principles.
- Answer only 1 question that everyone in your audience will care about
- Four charts cover 95% of visualization needs
- Use the rule of thirds
- Always write your point on the visualization

Analyzing: Common regression analysis in R.

Let’s look at specifics for each step in the cycle.

Skill #1: Understand these 4 Data Cleaning Principles

To the outsider, data analysis often seems “cool,” taking on the image of complex code used to discover secrets hiding in big numbers. While I like to envision this for myself 100% while I’m working, it would be silly to say it’s the whole job.

A huge part of data analysis is simply preparing data. In fact, several surveys in the past decade show data analysts spend upwards up 80% of their time cleaning and preparation.

The first of our top 4 skills is to abide by these 4 principles when cleaning data:

The Golden Rule: make it uniform

Don’t leave NULL or blank values
Ensure your software recognizes data types correctly
Remove duplicates

Golden Rule: make it uniform

The golden rule for data cleaning is to ensure your data is uniform. Each column will require a specific format, and you must ensure not one cell deviates from the format.

For example, if you have a column of dates in the form dd/mm/yyyy, you cannot have cells in the format mmmm dd, yyyy. Or, if you have numbers with two decimal points, you must add two zeros after the decimal for whole integers.

I won’t go through every kind of format because at the end of the day, you as the analyst will decide what makes most sense within the given project. But remember the golden rule: make your data uniform.

Remember the Golden Rule: make your data uniform.
AnalystAnswers.com

Don’t leave NULL or blank values

If you’ve ever seen an incomplete data set, then you know about NULL values and blanks. Here’s an example dataset of data on states’ driver safety:

STATE	Number of drivers involved in fatal collisions per billion miles	Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding	Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired
AL	18.8	39	NULL
AK		41	25
AZ		35	28
AR	22.4	NULL	26
CA	12	35	28

Excerpt of FiverThirtyEight’s bad-drivers Dataset

See the blanks and NULLS? The problem with them is they make the row in which they’re located a moot point. You cannot analyze AK (Alaska), for example, because it’s missing critical data. This means you cannot keep the cell as is.

When faced with NULL or blank data points, you have two choices. Either you create dummy data (aka fake data to act as a placeholder) or you delete the data. The rule of thumb here is that creating dummy data is easier, but deleting data is more reliable.

There are of course some exceptions. For example, deleting NULLed rows from a very small data table will make it less reliable because the sample is less representative.

Ensure your software recognizes data types correctly

When you’re entering data, particularly numeric data, your software may recognize it as a data type you don’t want. The most common example is numbers being recognized as text. This is what it looks like in Excel:

The problem with this is that you will not be able to perform calculations if excel believes the cell is not a number. To change it, simply click on the drop down arrow in the Text box and select “Number.”

Each data type (in Excel, General, Number, Currency, Accounting, Short date, Long data, Time, Percent, Fraction, Scientific, Text, and Custom), interacts with the software differently. Each data type has advantages and disadvantages, so ensure your software recognizes data types correctly.

Examples of data types and their advantages and disadvantages:

Data type	Advantage	Disadvantage
General	Works with most functions, auto-adapts to correct data type with context	No specific advantages alone
Number	Fundamental numeric unity, can perform all non-text Excel functions	Hard to read
Currency	Shows the currency next to the number, uses commas for easy reading	Limitations with advanced Excel finance and accounting formulas
Accounting	Shows negatives with brackets, performs all finance functions	The currency symbol hugs the left shoulder of the cell
Date	Transforms absolute computer dates (which look like 344485) into readable dates	Hard to modify/indentify in graphs
Time	Displays time	Cannot be use for calculations
Percent	Displays percents	Difficulties switching back and forth to other numeric data types

Table of advantages and disadvantages for each data type in Excel

Remove duplicates

For me, the most frustrating part of cleaning data is removing duplicates. While Excel has a button to do so for you, there are several cases in which Excel keeps duplicates without telling you.

Let me give you an example so its easy to understand. Imagine you’ve got the following small data table:

Name	Score
John	20
John	18
Jane	29
Sophie	22
Malissa	24

Sample data set for removing duplicates

As you can see, John has two entries here, so we need to get rid of one. In this situation, we want to higher of the two scores, which is luckily first in the list. We’re going to use the “Remove Duplicates” function, which always removes the first of two repeated values.

(Side note: If we wanted a lower value, we would need to Filter and Sort the column of interest first from lowest to highest.)

To remove the duplicate John values, let’s navigate to Data>Remove Duplicates. Before clicking the button, highlight the cells you want to work on. When you click, you will be prompted to “Expand the section.”

You almost ALWAYS want to expand, which meaning you include the entire table in the duplicate removal. If not, you’ll only modify the highlighted column.

Next you will see a filter option to select on which cells you want to base the duplication. Select only the column of interest. If you select others, both columns will need to have duplicate values in order for Excel to delete them… which would defeat the purpose of this exercise.

In addition, be aware that Excel WILL delete hidden cells in the remove duplicate process, but Excel WILL NOT remove duplicates with grouped cells in the remove duplicates process.

Skill #2: Know these 4 Excel Tricks & these 10 SQL Commands

If you didn’t already know, Excel and SQL are two critical programs in the data analysis world. Typically, spreadsheet softwares like Excel or LibreOffice are good for analysis in which the data tables are smaller, and for which it’s helpful to see the data table in front of you. Whereas SQL is useful for tables and databases too big for Excel to handle.

Excel

You should use excel to manipulate and understand data tables quickly. You should split, copy, and paste sections of interest, and filter them to create meaning. The following tricks will enable you to perform 90% of data analyst tasks.

The 4 Excel tricks you should know are:

Keyboard fluency:
- Control + Arrow;
- Control + Shift + Arrow;
- Control + X;
- Control + C;
- Control + V
Data Filters
Pivot Tables

The SUM formula & calculating % change

Keyboard fluency

One of the main reasons Excel is so dominant a software is its ease of use. By learning keyboard shortcuts, you will dramatically increase your ability to perform analytical tasks. For me, one of its most effective uses is the creation of small data tables or logic structures to answer on-the-fly questions your teammates or managers may ask. However, in order to do so, you need to know your shortcuts. That’s why they’re part of skill #2.

Control + Arrow. When you hold down Control (Cmd on Mac) and push an arrow key, the curser will move to the end of the data range. This is incredibly helpful for navigating big tables. Imagine you have 20 lines of text follows by a space, then another 20 lines, and this continues for 1000s of lines. Well, you can quickly move about them with this trick.

Control + Shift + Arrow. By adding shift to Control (Cmd) and the arrow, you highlight all of the cells in the data range. This is especially useful when you want to quickly delete or copy a section within others.

Control + X. This is a simple shortcut to cut.

Control + C. This is a simple shortcut to copy.

Control + V. This is a simple shortcut to paste.

Data filters

Data filters are arguably the most intuitive element on the list, and perhaps the one you’ll use most. They allow you to filter out lines of data based on a value in one column. This is especially useful when you only want to see a subset of a larger data table.

For example, let’s use the following sample data set of driver’s safety data:

STATE	Number of drivers involved in fatal collisions per billion miles	Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding	Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired
AL	18.8	39	30
AK	18.1	41	25
AZ	18.6	35	28
AR	22.4	18	26
CA	12	35	28

Excerpt of FiverThirtyEight’s bad-drivers Dataset

Now, imagine you’re only interested in states where the Number of drivers involved in a fatal collision per billion miles is above 18. You could look through them all to find out, but that would defeat the point of Excel!

Instead, use a filter. Navigate to Data>Filter. Click Filter. You should see drop down arrows on each cell in the top row. If not, highlight the top row and try again.

From there you can click any column with the data on which you want to filter. Then click on the box “Choose one…” to select “Greater than.”

Once you click on the filter you want and the value in the cell to its right, the table will automatically update. When you have what you want, you may simply click outside the box. Here’s a short video to show what I mean:

How to use a filter in Excel

As you can see, filters are an extremely easy way to navigate large datasets in Excel to pull out the information that you want. (Pro tip: the shortcut for this is Control + Shift + L, but you have to highlight the top row.)

Pivot tables

Sometimes we receive data tables that are hard to read. They might be too information rich, or organized in a way that’s less useful for our purposes. We wish we could just take the column headers and move them around to get what we want. Well, that’s what pivot tables do!

But what exactly are pivot tables?

Pivot tables let you grab any column header and make it a row or column in a new table (the Pivot Table), then select the value you want it calculate based on these new rows and columns.

This may seem intimidating at first, but don’t worry. Pivot tables are very flexible, and they provide a sort of playground that you can erase at any time without damaging the data source.

To launch a pivot table, follow these steps:

Start by highlighting the data you would like to manipulate
Navigate to Insert>Tables>PivotTable

A “Create pivot table” popup will appear in which you need to select a table or range as well as whether to put the PivotTable in the same worksheet or another. Since you highlighted the cells, the range is already selected. And I always like to have the Pivot in the same worksheet. It makes comparing easier.
To select the same worksheet, select a cell that will become the top left part of the PivotTable, leaving space below and to the right of that cell.
The popup box on the right shows you a section with your column headers, then 4 other boxes. Those other boxes allow you to drag-and-drop the column headers as either Columns, Rows, Values, or Filters.

Play with it!

Let’s look at an example using the same data from above. I’ve repasted it here for your convenience:

STATE	Number of drivers involved in fatal collisions per billion miles	Percentage Of Drivers Involved In Fatal Collisions Who Were Speeding	Percentage Of Drivers Involved In Fatal Collisions Who Were Alcohol-Impaired
AL	18.8	39	30
AK	18.1	41	25
AZ	18.6	35	28
AR	22.4	18	26
CA	12	35	28

Excerpt of FiverThirtyEight’s bad-drivers Dataset

Let’s say I want to see the above data organized in a way that shows me the absolute number of drivers in a fatal collision BASED ON the percentage of drivers who were alcohol-impaired. To do so, I would simply drag the relevant columns. Here’s a video to show you how:

How to Build a Pivot Table

Admittedly, this data set is structured in a way that the first two columns give raw data and the others provide percentages on those columns. This means building one on the other is less valuable.

The important thing to remember here is how powerful PivotTables are, and that you need to know them as a data analyst.

SUM formula & (easily) calculating % change

Last but not least of the Excel tricks are two simple hacks to working faster in Excel. The SUM formula allows you to find the sum of any number of cells by separating them with a comma, or creating data ranges using a colon. Calculating % change is something you’ve been doing since middle school, but entering that long formula each time is a drag. There’s a simpler way.

To use the SUM formula, simply enter the “=” sign in an empty cell to engage Excel’s autofill, then write “SUM(“. From this point you either enter non-adjacent cells by separating them with a comma, or adjacent cells using a colon. For example, you might write “=SUM(A3,D3,G3,I3:P3)”. This adds the first three non-adjacent cells to the I3:P3 adjacent cells.

To calculate % change easily, simply divide the end number by the starting number and subtract 1. For example, if you want to know what percent change was between 100 and 120, the calculation is (120/100)-1, or .20, or 20% increase. The same works for a decrease. Let’s say you want the percent decrease between 100 and 60. The calculation is (60/100)-1, or .4, or 40%. Use this calculation in Excel to save tons of time!

10 SQL Commands

According to W3Schools and in my experience, these are the 10 most common SQL commands. Obviously they are not comprehensive, but it’s a good place to get started. If you haven’t downloaded SQL yet, this article tells you how to do so very easily. You’ll want to do that before practicing commands.

But the good news is SQL is so intuitive that you don’t need to use it to understand it!

The reason this is part of the top 5 skills for data analysts is that SQL is very simple for the added value it brings. You can see how intuitive it is given these commands. For each of them, I’ve provided a simple explanation and the syntax, or basic structure, below.

SELECT – extracts data from a database

UPDATE – updates data in a database
DELETE – deletes data from a database
INSERT INTO – inserts new data into a database

CREATE DATABASE – creates a new database
ALTER DATABASE – modifies a database
CREATE TABLE – creates a new table

ALTER TABLE – modifies a table
DROP TABLE – deletes a table
CREATE INDEX – creates an index (search key)

SELECT

SELECT *
FROM customers
WHERE first_name

UPDATE

UPDATE table_name
SET column1 = value1, column2 = value2...., columnN = valueN
WHERE [condition]

DELETE

DELETE FROM table_name
WHERE condition;

INSTERT INTO

INSERT INTO table_name (column1, column2, column3, ...)
VALUES (value1, value2, value3, ...);

CREATE DATABASE

CREATE DATABASE databasename;

CREATE TABLE

CREATE TABLE table_name (
    column1 datatype,
    column2 datatype,
    column3 datatype,
   ....
);

ALTER TABLE

ALTER TABLE table_name
ADD column_name datatype;

DROP TABLE

DROP TABLE table_name;

CREATE INDEX

CREATE INDEX index_name
ON table_name (column1, column2, ...);

Seriously, if you want to learn more about SQL, use this guide to download it.

Skill#3: Apply these 4 Data Visualization Principles

Data visualization (dataviz) is the use of graphical elements to represent data — more than graphs, data visualizations exploit elements of design and research such as framing and visual cues.

Datavizes are key tools for communicating complex findings because they tell a story to which people can cling, rather than force the audience to think numerically from the start. Here’s a famous example of a data visualization:

The good news is data visualization is not rocket science — with a few basic principles you can perform like a professional, and you can do so today.

The 4 data visualization principles are:

Answer only 1 question that everyone in your audience will care about

Four charts cover 95% of visualization needs
Use the rule of thirds
Always write your point on the visualization

Let’s look at these in more detail.

Make 1 point that everyone in your audience will care about

The temptation when we start to build a data visualization is to pack in as much information as we can. Personally, this urge comes from the pleasure of slowly discovering a complex topic; it’s like good news — you want to share it!

But don’t.

Imagine yourself on the receiving end of the data visualization. A colleague or peer approaches you with a new topic and says s/he would like to share some charts with you. You may be happy to help, but there’s always a slight feeling that you’ll have to use energy to understand.

That’s because there’s too much information in most visualizations.

Just focus on 1 question that interests everyone in your audience. For example, if you want to show why your company should switch to a new accounting software, just show how many people prefer the new to the old in a dataviz.

This maintains the balance between complexity and usefulness to the whole audience of managers and accountants.

If you want to show the problem of obesity in the United States, just show the US vs. an average of all other countries. This maintains the balance between complexity and usefulness to the whole audience of students and policymakers alike.

In other words, always think about how you can prove your point with as little information as possible, AND in a way that interests everyone in your audience.

Four charts cover 95% of visualization needs

For one reason or another, I’ve noticed that analysts tend to envision charts as subtle form of art. Analysts like to add various decorative elements, build-in 3 dimensions, and add a lot of color. You end up seeing many different charts, each one with variations.

In reality, charts are not art. They’re visualization tools, and we should optimize them for the user. In 95% of cases, you only need line, column (or bar), scatter, and waterfall charts. Anything else is decorative and likely distracting. But when should you use each one?

Line – to show change in 1 variable over time (x=time)

Column – to compare 2 or more variables (y is measurement)
Scatter – to show the relationship between 2 variables
Waterfall – Additions and subtractions from a total

Line chart

As a data analyst, you often find yourself needing to explain change over time. In a company, the most obvious example of this is revenues. I highly discourage you from using anything but line charts when it comes to time.

Just remember: with time divine, go with a line. Here is an example of a line graph of Amazon’s revenues over time:

Column (aka “bar”) chart

When faced with a choice between 2 or more options, column charts come in handy. If you’re more familiar with the term “bar” charts, it’s fine to use that terminology as well. The only real difference between column charts and bar charts is that columns are vertical while bars are horizontal.

Just remember: when choices you must call ’em, just go with a column. Here is an example of a column chart of projects ROIs.

In this case, it’s easy to see that the most profitable project is the “Delight.”

Scatter

When you want to see the relationship between two variables with a significant number of data points, use a scatter plot. Scatter plots can feel a bit overwhelming because they feel like a large amount of information, but you shouldn’t think of them that way. They simply show how two variables relate to one another, if at all. They’re useful for establishing correlations, but do not show causation.

Example Correlation – Speeding vs Alcohol as Cause of Fatal Collisions

As you can see here, there is very little relationship between speeding and alcohol as cause of a fatal collision. We can call this a slightly positive relationship.

When the trend goes from the bottom left to the top right, the correlation is positive. When it goes from the top left to the bottom right, the correlation is negative.

For example, on this graph, when drivers involved in fatal collisions who were speeding increases, there is a slightly higher chance that those involved in fatal collisions who were alcohol impaired increases as well. The relationship is slightly positive.

Waterfall

Waterfall charts are great when you want to show how multiple values influence a whole. The most common use of these in a company is financing and budgeting. For example, you could use a waterfall chart to show how charges and expenses impact the bottom line, or profit.

Use the rule of thirds

The rule of thirds is to display important visual information at the crossroads of lines splitting an image into 3×3. You know what these are — most smartphones today have them built in. Here’s an example:

In other words, the rule of thirds is a simple way to ensure important graphical elements fall on those lines. Examples of important graphical information include:

Images:
- Bodies
- Eyes
- Hips
- Shoulders
Data visualizations:
- Titles
- Charts inflection points
- Important percentages

The reason we use the rule of thirds is to make visualization easier to consume for the viewer. Aren’t these photos nice?

Always write your point on the visualization

The last point I want to make about data visualizations is that it’s important to write the idea you want to communicate on the visualization. While numbers speak for themselves, the goal of the visualization is to help your readers understand your point of view for the data.

Be transparent, and tell them what to believe.

Skill#4: Use R for Regression Analysis

When it comes to statistical analysis, you’ll want to work with programming language R. It’s easy to set up — much easier than SQL in my opinion, and you can perform complicated analyses in a matter of minutes. You can learn how to get started with R in this article.

I won’t explain how to do each of the following steps because you can find that information in our free courses, but imagine how much time you will save when the following is all you need to perform a regression analysis:

Download the xlsx, cars, and tidyverse packages and install their libraries
Upload your data with read.xlsx
Use the summary() function to make sure your data is correct
Use the lm() function to perform the regression
If you’re performing a multivariable regression, use the formation lm(primary variable ~ secondary variable1, secondary variable2, secondary variable3, …)

About the Author

Noah

Noah is the founder & Editor-in-Chief at AnalystAnswers. He is a transatlantic professional and entrepreneur with 5+ years of corporate finance and data analytics experience, as well as 3+ years in consumer financial products and business software. He started AnalystAnswers to provide aspiring professionals with accessible explanations of otherwise dense finance and data concepts. Noah believes everyone can benefit from an analytical mindset in growing digital world. When he's not busy at work, Noah likes to explore new European cities, exercise, and spend time with friends and family.